June 1, 2007

DCampSouth Unconference 2007

Tomorrow I'll be attending the DCampSouth unconference here in Raleigh, North Carolina. You can check out the Wikipedia definition of an unconference, but basically it's a small conference where the content is created by participants in real time, not planned out in detail beforehand. I've never been to one before, and I'm looking forward to it!

From their website: "The goal of DCampSouth is to bring together people of all professions interested in user experience and design, to help foster communication between professions, and help foster community. Oh, and to have a lot of fun in the process."

I'm going to host a small session, giving a quick overview of the open source speech technology I've been uncovering lately. I'm hoping to meet some cool design and UI folks and just generally geek out!

(P.S. If you're attending, you ought to sign up on the DCampSouth social network on CrowdVine!)

May 31, 2007

Overview of Open Source Speech Software

I've been doing tons of Googling and RSS-reading lately. It turns out that there is plenty of speech technology out there, a lot of it open source, but all difficult to find on the web. I'm not sure why that is. I think that voice has just been under the radar for a while.

In any case, here is a list of software and related sites for speech technology. I hope to play with each of these and give my impressions here. If you're familiar with any of them, please leave me a note!

Speech Recognition Engines

CMU Sphinx
* cmusphinx.sourceforge.net/html/cmusphinx.php
* sourceforge.net/projects/cmusphinx
* www.speech.cs.cmu.edu/
* Several versions, C & Java
* sphinx2, sphinx3, sphinx4, and PocketSphinx
* BSD license

Julius
* julius.sourceforge.jp/en/
* C/C++
* custom open source license
* mostly Japanese, but supports English

HTK 3
* htk.eng.cam.ac.uk/
* C/C++
* software & code are free for internal use, but distribution of any kind is prohibited

ISIP ASR
* Institute for Signal and Information Processing: Automatic Speech Recognition
* www.ece.msstate.edu/research/isip/projects/speech/index.html
* C/C++
* Public Domain

Snack Sound Toolkit
* www.speech.kth.se/snack/
* C/C++ plus Python & TCL
* BSD license
* Snack for Ruby: rbsnack.sourceforge.net/

Open Mind Speech
* freespeech.sourceforge.net/
* Previously called "FreeSpeech"
* appears dead, last release 2002

Speech Synthesis

Festival
* www.cstr.ed.ac.uk/projects/festival/
* C/C++
* very popular on Linux, built into several apps like KDE's KSayIt and Gnome's gnome-speech
* BSD-ish license

Flite
* www.speech.cs.cmu.edu/flite/
* "Festival Lite"
* a lighter-weight version of Festival written by CMU
* written entirely in C rather than C++
* BSD license

FreeTTS
* freetts.sourceforge.net/docs/index.php
* Java port of Flite
* BSD license

Voice & Speech Corpora Sources

FestVox
* www.festvox.org
* from CMU, they provide documentation and scripts to create your own voices

VoxForge
* www.voxforge.org/
* GPL-licensed collection of voice recordings and their transcriptions (called "speech corpora")
* can be used in most of the speech recognition engines listed above (Sphinx, Julius, HTK3, and ISIP; possibly Snack)

Cepstral, LLC
* www.cepstral.com/
* high-quality synthetic voices that are Festival compatible
* commercial, but much higher quality than most free voices
* great dynamic samples on their website

May 28, 2007

O'Reilly on Google Speech and Gestures to Speech

A couple of O'Reilly Radar articles came across my desk today: one about an NSF grant to translate American Sign Language to verbal speech, and another about Google's real reason for the GOOG-411 service. Tim O'Reilly finds the truly multi-modal interface of gesture recognition to speech synthesis interesting, and he thinks Google's real reason for the 411 service is to harvest voice data. Both are short and interesting reads. It's on the horizon, folks.

May 20, 2007

Speech Recognition & Synthesis in Science Fiction

Last month everyone and their brother blogged about Michael Schmitz's article on Human Computer Interaction in Science Fiction Movies. However, this blog wasn't up and running then, and I feel I should include it for completeness.

Regarding speech interfaces, especially in Star Trek, Schmitz writes, "In almost all movies the speech interface is conversational and intuitive, the difficulties especially of speech recognition and evaluation are never considered." No kidding! This is a big reason why I'm motivated to work on this project -- Why the hell hasn't it been invented already? It looked so easy on the big screen when I was a kid!

I've been going back and rewatching some of the Star Trek: The Next Generation episodes (a guilty geeky pleasure), and although the interaction there was definitely conversational, it was very rigid and formal, like most Star Trek speech. Terms like "standby," "hold execution," and "resume program" were the norm. These terms are standard sci-fi/Star Trek jargon and do not feel out of place on screen. However, I'm pondering the effect of this formal, terse jargon in a real-life speech interface. Would it make for a more natural discussion with a computer? It certainly makes the computer's job easier.

For example, I might say to my friend, "Hey, dude, what's goin' on?" But to the computer, I would say, "Status?" Terse, efficient... yet cold and rigid. However, I suggest that keeping this separation of common speech versus command speech is useful. It's my opinion that a speech interface should not be someone's "friend" or "companion," but a guide, a tool, an assistant. Using a secondary, more formal mode of speech might reinforce this implicit separation of "friend" versus "computer."
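
To make the idea of a separate "command mode" a bit more concrete, here is a minimal Python sketch. The phrases are the ones mentioned in this post; the action names and the interpret function are made up purely for illustration, and the sketch works on typed text, assuming a real recognizer would supply the words.

```python
# Hypothetical command vocabulary. The phrases come from the post;
# the action names on the right are invented for illustration.
COMMANDS = {
    "status": "report_status",
    "standby": "pause_and_wait",
    "hold execution": "pause_program",
    "resume program": "resume_program",
}


def interpret(utterance):
    """Return the action name for a known command phrase, or None.

    Anything outside the small command vocabulary (that is, ordinary
    conversational speech) is deliberately not understood.
    """
    # Lowercase, drop surrounding punctuation, and collapse extra spaces.
    normalized = " ".join(utterance.lower().strip(" ?!.").split())
    return COMMANDS.get(normalized)


if __name__ == "__main__":
    for phrase in ["Status?", "Resume program.", "Hey, dude, what's goin' on?"]:
        print(repr(phrase), "->", interpret(phrase))
```

The small, fixed vocabulary is what makes the computer's job easier: anything that isn't an exact command simply falls through as "not for me."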

May 14, 2007

Riddle of the Sphinx: What Do I Need to Recognize Speech?

The most well-known open source speech recognizer is Carnegie Mellon's Sphinx project, called CMUSphinx. However, Sphinx is a large and mature project, and as a primarily academic tool, it has many different experimental versions lying around. Just reading the project descriptions confuses me!

What do you need from the Sphinx suite? What is the best set of tools for your specific task? Here's a quick map to help!

1) A Trainer
C'mon get on the SphinxTrain! No matter what recognizer (also called a decoder) you choose from the list below, you must use SphinxTrain to set up your language data, called training your recognizer. SphinxTrain is written in C but also requires a few Perl scripts. I'm not sure why, but there is a directory for Python code in the source tree too. I'll look into that further later.

2) A Recognizer
This is where it gets really confusing. There are really only three choices, with a fourth special-case choice that seems unlikely.

2a) sphinx3 (C/Perl)
Sphinx3 is basically what you're going to want in most applications and experiments. It is the most mature, most used, and (I'm guessing) best supported version of the tool. It's written in C and has Perl script requirements. From my early readings in this area, it looks like when people say "CMUSphinx" or "Sphinx," they really mean "sphinx3."

2b) sphinx4 (Java)
Sphinx4 is sphinx3's little brother, growing up strong and fast, and written in Java. Sphinx4 aims for at least feature parity with sphinx3, and is about as fast in most respects, a tad slower in some, and faster in others. I'm not really sure why you would choose sphinx4 over sphinx3, unless you or your platform are more comfortable with Java than with C. At this point sphinx4 is still seeing active development, whereas sphinx3 seems to have cooled just a bit. This is all anecdotal from the Subversion logs, though, so I may be wrong.

2c) PocketSphinx (embedded)
If you want to use speech recognition on an embedded system like a handheld tool, mobile phone, automobile dashboard appliance, or other small form-factor computer, you'll want PocketSphinx, sphinx3's baby brother.

2d) sphinx2 (old-school)
Sphinx2 is more optimized for speed than accuracy and is not currently seeing much development. Everyone recommends you go to sphinx3 instead. Sphinx2 is still around though for a few experiments where blazingly fast real-time recognition is needed for tiny grammars.

Summary
So, you'll definitely need SphinxTrain, no matter what. After that, you'll most likely choose sphinx3 as the recognizer. I think this would put you with the majority of Sphinx users. However, sphinx4 is up and coming, and is a strong contender to overtake sphinx3 someday. If you're doing embedded work, use PocketSphinx, and in almost all cases you can ignore sphinx2.
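
If it helps, the map above can be written down as a tiny Python function. This is only a sketch of my own rule of thumb, not anything from the Sphinx project; the function name and its three flags are invented for illustration.

```python
def choose_sphinx_tools(embedded=False, prefers_java=False,
                        tiny_grammar_realtime=False):
    """Return the Sphinx pieces suggested by the rule of thumb in this post.

    All parameter names are hypothetical; they just encode the questions
    asked above.
    """
    if embedded:
        # Handhelds, phones, dashboards, other small form-factor devices.
        recognizer = "PocketSphinx"
    elif tiny_grammar_realtime:
        # The rare special case: speed over accuracy on tiny grammars.
        recognizer = "sphinx2"
    elif prefers_java:
        recognizer = "sphinx4"
    else:
        # The default choice for most applications and experiments.
        recognizer = "sphinx3"
    # SphinxTrain is needed no matter which recognizer is picked.
    return ["SphinxTrain", recognizer]


if __name__ == "__main__":
    print(choose_sphinx_tools())
    print(choose_sphinx_tools(embedded=True))
    print(choose_sphinx_tools(prefers_java=True))
```

The order of the checks is a judgment call on my part: embedded constraints win first, and sphinx3 is the fallback when nothing special applies.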

Let me know which recognizer you use and for what projects!

May 10, 2007

Voice Recognition vs Speech Recognition

The first thing to know when exploring human voice technology is that although people commonly say "voice recognition," they actually mean "speech recognition." Speech recognition is interpreting a string of sounds as meaningful words and sentences. Voice recognition, or more appropriately, speaker recognition, is what you do when you pick up the phone, hear two words, and realize it's your mother on the other end.

My impression is that speech technology professionals avoid the term "voice recognition." However, if you're in marketing, the term seems very popular with the general public, so know your audience!