How To Archives

May 14, 2007

Riddle of the Sphinx: What Do I Need to Recognize Speech?

The best-known open source speech recognizer is Carnegie Mellon's Sphinx project, called CMUSphinx. However, Sphinx is a large and mature project, and as a primarily academic tool, it has many different experimental versions lying around. Just reading the project descriptions confuses me!

What do you need from the Sphinx suite? What is the best set of tools for your specific task? Here's a quick map to help!

1) A Trainer
C'mon, get on the SphinxTrain! No matter which recognizer (also called a decoder) you choose from the list below, you must use SphinxTrain to prepare your speech data and build the models the recognizer uses, a step called training your recognizer. SphinxTrain is written in C but also requires a few Perl scripts. There is a directory of Python code in the source tree too; I'm not sure why, and I'll look into that later.

2) A Recognizer
This is where it gets really confusing. There are really only three choices, plus a fourth special-case choice that you are unlikely to need.

2a) sphinx3 (C/Perl)
Sphinx3 is basically what you're going to want in most applications and experiments. It is the most mature, most used, and (I'm guessing) best supported version of the tool. It's written in C and depends on some Perl scripts. From my early reading in this area, it looks like when people say "CMUSphinx" or "Sphinx," they really mean "sphinx3."

2b) sphinx4 (Java)
Sphinx4 is sphinx3's little brother, growing up strong and fast, and written in Java. Sphinx4 aims for at least feature parity with sphinx3; on speed, it appears to be about as fast in most respects, a tad slower in some, and faster in others. I'm not really sure why you would choose sphinx4 over sphinx3, unless you or your platform are more comfortable with Java than C. At this point sphinx4 is still seeing active development, whereas sphinx3 seems to have cooled just a bit. That impression is anecdotal, based only on the Subversion logs, so I may be wrong.

2c) PocketSphinx (embedded)
If you want to use speech recognition on an embedded system like a handheld tool, mobile phone, automobile dashboard appliance, or other small form-factor computer, you'll want PocketSphinx, sphinx3's baby brother.

2d) sphinx2 (old-school)
Sphinx2 is optimized more for speed than accuracy and is not currently seeing much development. Everyone recommends you move to sphinx3 instead. Sphinx2 is still around, though, for the few experiments where blazingly fast real-time recognition is needed for tiny grammars.

Summary
So, you'll definitely need SphinxTrain, no matter what. After that, you'll most likely choose sphinx3 as the recognizer, which I think puts you with the majority of Sphinx users. However, sphinx4 is up and coming, and is a strong contender to overtake sphinx3 someday. If you're doing embedded work, use PocketSphinx, and in almost all cases you can ignore sphinx2.
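
For the programmers in the audience, the map above can be written as a tiny Python function. This is only a sketch of my recommendations: the function and parameter names (pick_recognizer, toolset, embedded, prefer_java, tiny_grammar_realtime) are made up for illustration and are not part of any Sphinx tool, and the order in which the checks are made is my own guess at the priorities.

```python
# A rough sketch of the recommendations in this post.
# All names here are hypothetical; nothing below calls into Sphinx itself.

def pick_recognizer(embedded=False, prefer_java=False, tiny_grammar_realtime=False):
    """Return the name of the recognizer this post would suggest."""
    if embedded:
        # Handhelds, phones, dashboard appliances, and other small devices.
        return "PocketSphinx"
    if tiny_grammar_realtime:
        # The special case: very fast recognition over a tiny grammar.
        return "sphinx2"
    if prefer_java:
        # You or your platform are more comfortable with Java than C.
        return "sphinx4"
    # The default choice for most applications and experiments.
    return "sphinx3"


def toolset(**needs):
    """SphinxTrain is always on the list, plus one recognizer."""
    return ["SphinxTrain", pick_recognizer(**needs)]


if __name__ == "__main__":
    print(toolset())
    print(toolset(embedded=True))
    print(toolset(prefer_java=True))
```

Calling toolset() with no arguments gives the common case (SphinxTrain plus sphinx3); pass a keyword such as embedded=True to describe your situation.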

Let me know which recognizer you use and for what projects!

May 31, 2007

Overview of Open Source Speech Software

I've been doing tons of Googling and RSS-reading lately. It turns out that there is plenty of speech technology out there, a lot of it open source, but all of it is difficult to find on the web. I'm not sure why that is. I think that voice has just been under the radar for a while.

In any case, here is a list of software and related sites for speech technology. I hope to play with each of these and give my impressions here. If you're familiar with any of them, please leave me a note!

Speech Recognition Engines

CMU Sphinx
* cmusphinx.sourceforge.net/html/cmusphinx.php
* sourceforge.net/projects/cmusphinx
* www.speech.cs.cmu.edu/
* Several versions, C & Java
* sphinx2, sphinx3, sphinx4, and PocketSphinx
* BSD license

Julius
* julius.sourceforge.jp/en/
* C/C++
* custom open source license
* mostly Japanese, but supports English

HTK 3
* htk.eng.cam.ac.uk/
* C/C++
* software & code are free for internal use, but distribution of any kind is prohibited

ISIP ASR
* Institute for Signal and Information Processing: Automatic Speech Recognition
* www.ece.msstate.edu/research/isip/projects/speech/index.html
* C/C++
* Public Domain

Snack Sound Toolkit
* www.speech.kth.se/snack/
* C/C++ plus Python & TCL
* BSD license
* Snack for Ruby: rbsnack.sourceforge.net/

Open Mind Speech
* freespeech.sourceforge.net/
* Previously called "FreeSpeech"
* appears dead, last release 2002

Speech Synthesis

Festival
* www.cstr.ed.ac.uk/projects/festival/
* C/C++
* very popular on Linux, built into several apps like KDE's KSayIt and Gnome's gnome-speech
* BSD-ish license

Flite
* www.speech.cs.cmu.edu/flite/
* "Festival Lite"
* a lighter-weight version of Festival written by CMU
* written entirely in C rather than C++
* BSD license

FreeTTS
* freetts.sourceforge.net/docs/index.php
* Java port of Flite
* BSD license

Voice & Speech Corpora Sources

FestVox
* www.festvox.org
* from CMU; provides documentation and scripts to create your own voices

VoxForge
* www.voxforge.org/
* GPL-licensed collection of voice recordings and their transcriptions (called "speech corpora")
* can be used with most of the speech recognition engines listed above (Sphinx, Julius, HTK 3, and ISIP; possibly Snack)

Cepstral, LLC
* www.cepstral.com/
* high-quality synthetic voices that are Festival compatible
* commercial, but much higher quality than most free voices
* great dynamic samples on their website

About How To

This page contains an archive of all entries posted to Hollow Voice in the How To category. They are listed from oldest to newest.
