This post is a personal experience. As it was mentioned before I’m more like a spaghetti ccode developer than professional developer. Keep this in mind while evaluating the story of text-to-speech service selection.
I’ve done a lot of writings in the past. Some people are happy with text format, but some would like to listen to what I have to say. We all know that podcasts have been a big thing lately. Those have been coming from all over - some good some not so good. I also wanted to have text and sound of my 100 Days of DX learning experiment. I was considering to record the texts on my own machine and use those. That sounded like too much work for the effort. This is not my job and I still have 7 kids at home to take care of. I needed something simple.
Lets use “Synthesizer as a Service” in text-to-speech
Then I thought of services (API) which would turn given english text to voice and offer mp3 sound file for it. I did some experimentation with this concept a few years back and I had terrible experiences. So I had my doubts about this, but hey let’s give it a try. Let’s take this as a learning experiment as well.
I asked my friends in Twitter for advice. I asked for text-to-speech API recommendations. Google, AWS, Microsoft and some other smaller players were suggested. You know the normal offering. I went to Google service first. After clicking around for a while I found a site where I can try the service with some text. I must have arrived to API console since I was expected to find out how the API works with given documentation. I tried it for a few times and got error pages. Too fucking hard to get a first positive experience. MrMurphy hitted Google here since I landed on wrong page to start with.
Show me first that the result has value for me
I just wanted to hear what the result sounds! As a first step I don’t want to fiddle with your API! Since I was skeptical about how natural the voice would sound in the recording I wanted to hear the result first. If the result would be good enough for my purposes, I would select the service and even pay for it. Regardless of the service provider. But for sure I will not select a service based on what kind of parameters are given in API call or how well it is documented. That would be the next step while evaluating a service. Advise for service provider is that convince me that the end result has value for me, then I will find out how to use your API and how to add that to my development CI/CD process.
Lets fiddle and try Amazon Polly
I have couple of Alexa’s at home and I have been happy with so next to try is Amazon Polly. I could have chosen any of the solutions provided by big players, but Amazon was recommended by some of my peers in Twitter. I trust peers more than any official often bullshit marketing sites or testimonials. That latter thing is a curious thing. Testimonials in sites are 99% from real customers. In a way they are the same as getting recommendation in Twitter or other social media (or IRC). Testimonials are planned and polished messages. Twitter reply from a peer developer is personal. I value personal messages more than mass media.
Amazon Polly’s Asynchronous Synthesis feature overcomes the challenge of processing a larger text document by changing the way the document is both synthesized and returned. When a synthesis request is made by submitting input text using the StartSpeechSynthesisTask, Amazon Polly queues the requests, and then asynchronously processes them in the background as soon as the system resources are available. Amazon Polly then uploads the resulting speech or speech marks stream directly to your (required) Amazon Simple Storage Service (Amazon S3) bucket, and notifies you about the completed file’s availability through your (optional) SNS topic.
Amazon offers a simple UI to input own text and hear how the result sounds. You can change the voice from female/male and some options among those.
Notice the value proposition in the page “Listen, customize, and download speech. Integrate when you’re ready.” That is simple and genius. This is exactly what I was talking about above. One of the great features of Amazon Polly is that it offers S3 Bucket as the storage for the results. Any longer text must be anyway processed directly to S3. I did synthesize some of the 100DaysDX posts with the service and added links to mp3 files in the frontpage post listing.
Automate
To use Polly and S3, you need to have boto3 SDK (or that is one option). Boto provides an easy to use, object-oriented API, as well as low-level access to AWS services. Amazon uses the term “synthesize” when converting text to speech in their cloud. Adding a small script to automatically synthesize the texts in blog can be created by modifying the given example
import boto3
import time
polly_client = boto3.Session(
aws_access_key_id=’’,
aws_secret_access_key=’’,
region_name='eu-west-2').client('polly’)
response = polly_client.start_speech_synthesis_task(VoiceId='Joanna',
OutputS3BucketName='synth-books-buckets',
OutputS3KeyPrefix='key',
OutputFormat='mp3',
Text = 'This is a sample text to be synthesized.')
taskId = response['SynthesisTask']['TaskId']
print "Task id is {} ".format(taskId)
task_status = polly_client.get_speech_synthesis_task(TaskId = taskId)
print task_status
If you would want to go Pro, you would like to use SSML with Polly.
To go further with the mp3 files, I could upload the files to Soundcloud (with API) so that podcasters would be able to listen those from app directly.
import soundcloud
# create a client object with access token
client = soundcloud.Client(access_token='YOUR_ACCESS_TOKEN')
# upload audio file
track = client.post('/tracks', track={
'title': 'This is my sound',
'asset_data': open('file.mp3', 'rb')
})
# print track link
print track.permalink_url
The above should give you an example what app development is today. It’s chaining APIs together to create wanted application or reach. That is why simple and easy to use APIs with self-service are key elements of success in Business to Developer (B2D) marketing.
Conclusion
When building a product, be sure to offer self-service tools to get look and feel of the result created with yout API/service. Preferably let me use my own material as input. Pseudo content might work occasionally, but to get real personal touch in the testing let me use own material. Learning and testing your product should not cost me anything - offer freemium.
I don’t want to do my first experiements always on code level. In some cases (like the one here) I am primarily concerned about the result and quality (“Listen, customize, and download speech”). If and when I’m satisfied with the result, I expect to find code level examples (“Integrate when you’re ready.”) and profound documentation with getting started to inject your products value to my product at code level. In some cases that is not even needed and I’m happy with the UI solution.
Third and last thing I check is the cost structure. If that is good for me, then I’m done. I have selected my service to use. Otherwise I start again with another service and expect to do all as self-service without discussing your sales people.
Some more to read from 100 Days DX
- #100 - The biggest open resource on Developer eXperience so far
- #99 - Hackers - the top 1% of engineers
- #98 - Invalid Open API spec files ruins the developer experience
- #97 - Lightweight API evaluation framework
- #96 - Embracing open source community driven tools development
- #95 - Exploring the multilevel nature of API Design Guides
While you are here...
Check out Platform of Trust in which I've worked to enable best possible developer experience!
Comments