An In-Depth Look at Text-to-Speech in Linux

Ken Starks
It’s been an interesting two weeks, talking about and looking into why text-to-speech (TTS) is such a mess in Linux. I’ve spoken with seventeen of you; seventeen who know a bit about software programming. “A bit” is a purposeful understatement. Some of you have forgotten more about software programming than I will ever know. That being the case, I have learned a bit about why TTS in Linux is next to worthless. For those who are just joining the conversation, let me catch you up quickly.
Late last year, I was told that the area treated for throat cancer in 2012 was exhibiting pre-cancerous activity. I was told that it could remain “pre-cancerous” for twenty years, or it could again form into the cancer that tried to kill me in 2012. If that happened and it remained unattended, it would kill me in a matter of months. My options ranged from doing nothing and taking my chances, all the way to having my larynx removed to be done with this throat cancer monster once and for all. I picked door number two.
I began researching my options as a soon-to-be voiceless person. In preparing for a life without voice, there were several scenarios that I had failed to consider:
  • What if I was caught in a dire situation and my only option for survival was to call out for help?
  • What if I was witnessing someone in a dire situation and I could not call out to warn them?
  • What if I needed to travel and could not find or ask directions?
  • I cannot use the telephone in an emergency.
  • In that most verbal interaction is done on the fly, I will miss out on a lot.
As a side note of interest, I now carry a small pocket-sized but extremely loud air horn wherever I go. Silly? Not if I ever need to use it.
I’ve already experienced the third example on my list and it worked out okay. I was able to stop people and ask them questions via my sloppy writing on a notepad. But the results of the other examples could range from being merely inconvenient to deadly. I’m hoping for mere inconvenience.
It’s the day-to-day, moment-to-moment things that I will primarily miss out on, be that badinage or booby trap, and it’s already proving to be well on the wrong side of convenient. This is why I took up the banner of speech for the speechless. Since I use Linux full time, my focus turned immediately to computer-centered solutions to this problem. Go figure.
I’ll paraphrase my previous article on this matter: TTS in Linux sucks.
Now before you gather into torch and pitchfork hordes, let me walk that back a few steps. The fact that TTS in Linux sucks is not necessarily a reflection upon the developers. This specific problem can be found within the tools the average open source developer has at his disposal. Or the lack of tools in this case. Let’s begin examining TTS in the Linuxsphere and hear the difference.
eSpeak or no eSpeak…that is the question.
In the beginning of trying to find a new voice, I researched the options in Linux and was excited by the number of TTS solutions Linux offers. But one by one they were ticked off my list for the same reason: They suck.
Let’s talk about the first option I tried:
eSpeak is an application found in many distros’ repositories and on sites such as GitHub. Since it’s easy to install, I eagerly clicked on it and watched as the application popped into existence on my desktop. What a nice GUI, consisting of a simple text field with the few needed configuration commands found under an “edit” field at the top of the screen. Brevity on the face is nice.
But the brevity of the simple GUI lost its charm when I actually tried using the app. This is what I heard when I clicked “play” within eSpeak. Listen for yourself. Really…This is what we have? And again, let me say…this isn’t the developer’s fault. He or she did the best that could be done with the tools available for use.
It does get a little better, because there were better voice options for the developer. My Google+ buddy Neil Munro took on the nasty task of getting better voices working within eSpeak using Mbrola. Even though he succeeded where I failed after days of futzing with it, even he said that the process of getting voices installed from Mbrola into eSpeak was absolutely ridiculous.
First off, as a computer user, you should not have to install a separate application in order to get the first application to work. Dependencies are one thing. An entire other application is far and away another. My unanswered question, as a layman mind you, is why can’t the voices be coded into the application instead of users having to install them separately? However, doing so did improve the voice…not much mind you, but with a striking difference. Here is how it sounded with an Mbrola voice installed.
The difference is striking to say the least. But even with that much improvement, my personal opinion is that it isn’t ready for prime time. That’s said after spending a complete afternoon using the application on my Nexus 7. It does not handle punctuation well at all and it just skips some words that it cannot pronounce properly.
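For readers who would rather skip the GUI, eSpeak can also be driven from the command line, which makes it easy to compare the stock voice with an Mbrola one. The following is only a minimal sketch: the `-v` (voice), `-s` (speed in words per minute) and `-w` (write a WAV file) options are standard eSpeak flags, but the voice name `mb-en1` assumes the matching Mbrola voice data is already installed, and the function names are hypothetical helpers invented for this illustration.

```python
import shutil
import subprocess


def build_espeak_command(text, voice="mb-en1", words_per_minute=140, wav_path=None):
    """Return the argument list for one eSpeak invocation.

    The default voice "mb-en1" is an assumption: it only works if the
    corresponding Mbrola voice has been installed separately.
    """
    command = ["espeak", "-v", voice, "-s", str(words_per_minute)]
    if wav_path is not None:
        # -w writes the speech to a WAV file instead of playing it aloud.
        command += ["-w", wav_path]
    command.append(text)
    return command


def speak(text, **options):
    """Speak the text if the espeak binary is on the PATH; return False if not."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(build_espeak_command(text, **options), check=True)
    return True


if __name__ == "__main__":
    # Show the command that would be run, without requiring espeak to be installed.
    print(build_espeak_command("Hello from Linux.", voice="en"))
```

Swapping the `voice` argument between `en` and an installed Mbrola voice is a quick way to hear the difference described above for yourself.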
It wasn’t until almost a week into this TTS discovery voyage that I found Mary. Not a human named Mary; a text-to-speech engine named Mary. Mary is an open source application with the downside, for many, of being a Java app. That may be a downside for you, but for me, with my options being limited day by disappointing day, Mary just might be the girl of my dreams. Listen to what Mary has to say, with my thanks to the Moody Blues’ “Days of Future Passed.” What an amazing difference. Amazing.
How this application is executed, I’m not sure. Is it to be baked into a website? That’s what I’m gathering. Is there any way this app can become part of installed and usable software in Linux? I would appreciate a professional answering this for me: Is she only available as a web application or is there any hope for the speechless to use her on their personal computers? I am extremely excited about hearing what some of my Java programming buddies have to say.
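For what it’s worth, Mary TTS does not appear to be limited to websites: it ships with a small server that can run on your own machine, and other programs ask it for audio over HTTP. The sketch below assumes a Mary TTS server is already running locally on its usual port, 59125, and that the request parameters shown match your installed version; the function names are hypothetical and the details should be checked against the Mary TTS documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_marytts_url(text, host="localhost", port=59125, locale="en_US"):
    """Build the URL that asks a local Mary TTS server for a WAV rendering."""
    query = urlencode({
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    })
    return f"http://{host}:{port}/process?{query}"


def synthesize_to_file(text, wav_path, **options):
    """Fetch the audio from the server and save it; return the byte count.

    This only works while a Mary TTS server is running at the given address.
    """
    with urlopen(build_marytts_url(text, **options), timeout=30) as response:
        audio = response.read()
    with open(wav_path, "wb") as handle:
        handle.write(audio)
    return len(audio)


if __name__ == "__main__":
    # Print the request URL only, so this runs even without a server.
    print(build_marytts_url("Nights in white satin"))
```

If that assumption holds, the resulting WAV file can be played with any audio player, which would put Mary within reach on an ordinary Linux desktop.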
In my travels I also found a Windows only app called CoolSpeech. I may give this a run in Crossover Office or Wine sometime in the coming week. From just a few minutes playing with it, I’m not particularly impressed when I compare it to Mary TTS or my personal online subscription tool, SpokenText.net.
SpokenText is by far the best I have examined to date. My Reglue presentation at MIT for LibrePlanet was done via a pre-recorded Ogg file from SpokenText as I stood at the podium and “talked” about what we do at Reglue. I was on edge the whole time. In my mind, this seemed like a ridiculous idea. However, the response to my presentation was heartening. I have submitted my white paper for Texas Linux Fest in San Marcos this summer; we’ll see how that’s received. Here is a sample of SpokenText.
I ended up paying $90.00 for an annual subscription to use the site to record my text-to-speech. I find it a bit ironic that the absolute best TTS to be had is also the easiest to use…that is if you pay for it. To me, 90 bucks was well worth the price of admission, especially as I was under the time gun to get my presentation ready for LibrePlanet 2015. Those who might want to listen to my LibrePlanet presentation for Reglue can do so here. Turn your volume down a bit though; there are some annoying clips during the first third of the file.
Android app Speech Assistant
So after two weeks of working to find a suitable Linux TTS application, I am of mixed emotion. I am excited that the technology exists, as evidenced by a couple of the apps mentioned above. There is also a huge jump in synthetic speech generally, as demonstrated by the TV game show “Jeopardy,” where the world was introduced to “Watson.” Unfortunately, as wonderful as that synthetic voice sounds, it is guarded 360 degrees and 24/7 by a multitude of patents and exorbitant cost. IBM could throw us a bone by releasing that technology as FOSS, but I ain’t holding my breath.
Therein lies our problem. Our developers cannot use the majority of these voices because they are available only under extremely expensive licensing. And what normal Linux developer or software programmer can afford that?
All in all, Google may again upend the entire thing by their offering to allow Android apps to run in Chrome on most every device. My personal text-to-speech tool is called simply Speech Assistant, and it’s an Android app. It’s the best I have found so far and it works the way I need a text-to-speech tool to work. I use it on my Nexus 7. Hopefully the global community of tens of thousands of open source developers can find a way to make it work on Linux.
