Project Overview:
SONIC is a toolkit for enabling research and development of
new algorithms for continuous speech recognition. Since March of
2001, SONIC has been used as our test bed for integrating new ideas and
for supporting research activities that include speech recognition as
a core component at the Center for Spoken Language Research.
While not a general HMM modeling toolkit, SONIC is specifically
designed for speech recognition research, with careful attention paid
to the speed and efficiency needed for real-time use in live applications.
SONIC utilizes state-of-the-art statistical acoustic and language
modeling methods. The system acoustic models are decision-tree
state-clustered Hidden Markov Models (HMMs) with associated gamma
probability density functions to model state-durations. The
recognizer implements a two-pass search strategy. The first pass
consists of a time-synchronous, beam-pruned Viterbi token-passing
search through a lexical prefix tree. Cross-word acoustic models
and trigram or four-gram language models are applied in the first pass
of search. During the second pass, the resulting word-lattice is
converted into a word-graph. Lattices can be written in a Finite
State Machine (FSM) format compatible with the AT&T tools, and n-best
lists can be generated using the A* algorithm. Finally, word-posterior
probabilities can be calculated to provide a measure of word confidence.
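The first-pass search described above can be illustrated with a toy example. The following is a minimal sketch of time-synchronous, beam-pruned Viterbi token passing over a plain state graph, not SONIC's implementation: it leaves out the lexical prefix tree, cross-word models, and language model, and the names `prune` and `beam_viterbi`, the data layout, and the beam value are all assumptions made for illustration.

```python
import math

def prune(tokens, beam):
    """Keep only tokens whose score is within `beam` of the best token."""
    best = max(score for score, _ in tokens.values())
    return {s: t for s, t in tokens.items() if t[0] >= best - beam}

def beam_viterbi(obs_loglik, log_init, log_trans, beam):
    """Time-synchronous Viterbi search with beam pruning (toy version).

    obs_loglik: one dict per frame, mapping state -> log-likelihood
    log_init:   dict mapping start state -> log initial probability
    log_trans:  dict mapping state -> {next_state: log transition prob}
    beam:       log-score width; tokens further below the best are dropped
    Returns (best_log_score, best_state_path).
    """
    # One token per active state: state -> (score, path so far).
    tokens = {s: (lp + obs_loglik[0][s], [s]) for s, lp in log_init.items()}
    tokens = prune(tokens, beam)
    for frame in obs_loglik[1:]:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, lt in log_trans[state].items():
                cand = score + lt + frame[nxt]
                # Viterbi recombination: keep the best token per state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        tokens = prune(new_tokens, beam)
    best_state = max(tokens, key=lambda s: tokens[s][0])
    return tokens[best_state]

# Hypothetical two-state, left-to-right model and three frames.
log = math.log
obs = [
    {"a": log(0.9), "b": log(0.1)},
    {"a": log(0.2), "b": log(0.8)},
    {"a": log(0.1), "b": log(0.9)},
]
init = {"a": 0.0}
trans = {"a": {"a": log(0.5), "b": log(0.5)}, "b": {"b": 0.0}}
score, path = beam_viterbi(obs, init, trans, beam=1.0)
print(path)  # ['a', 'b', 'b']
```

A narrower beam discards more tokens per frame and runs faster, at the risk of pruning the path that would eventually have won; in this toy case a beam of 1.0 already drops state "a" after the second frame without changing the result.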
The recognizer also incorporates both model-based and feature-based
speaker adaptation methods. Model-based adaptation methods include
Maximum Likelihood Linear Regression (MLLR) and Structural MAP Linear
Regression (SMAPLR). SONIC also implements feature-based adaptation
methods such as Vocal Tract Length Normalization (VTLN), cepstral mean
and variance normalization, and Constrained MLLR (CMLLR). Finally, advanced
language-modeling strategies such as concept language models can be
used to improve performance for dialog system recognition tasks.
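The difference between model-based and feature-based adaptation can be made concrete with a small sketch. In MLLR the Gaussian means of the acoustic model are moved by an affine transform, while CMLLR can be applied as an affine transform of the feature vectors with the model left untouched. The code below only applies a given transform; estimating the matrix `A` and bias `b` by maximum likelihood (and the Jacobian term that CMLLR adds to the likelihood) is not shown, and all function names are hypothetical rather than part of SONIC's API.

```python
def affine(A, b, v):
    """Return A*v + b for a square matrix A (list of rows) and vectors b, v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(A, b)]

def mllr_adapt_means(means, A, b):
    """Model-based: move every Gaussian mean with one shared transform."""
    return [affine(A, b, mu) for mu in means]

def cmllr_adapt_features(frames, A, b):
    """Feature-based: transform each observation vector, model unchanged."""
    return [affine(A, b, x) for x in frames]

# Hypothetical 2-dimensional transform and model means.
A = [[2.0, 0.0],
     [0.0, 1.0]]
b = [1.0, -1.0]
means = [[1.0, 2.0], [0.0, 0.0]]
print(mllr_adapt_means(means, A, b))  # [[3.0, 1.0], [1.0, -1.0]]
```

In practice a system typically shares one transform across a class of Gaussians, so a few seconds of speaker data can adapt many model parameters at once.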
System Features:
Phonetic Aligner
- Provides word, phone, and HMM state-level boundaries for
acoustic training.
- Decision-tree based trainable letter-to-sound prediction
module
- Multilingual lexicon support
- API for integration into applications (e.g.,
lip-synchronization)
Phonetic Decision Tree Acoustic Trainer
- Estimates parameters of state-clustered continuous
density Hidden Markov Models
- Incorporates phonetic position & context questions
- Distributed / parallel acoustic trainer (multiple
machines, operating systems, CPUs)
Core Recognizer
- Token-passing based recognition using a lexical-prefix
tree search
- Cross-word triphone models & up to 4-gram language
model in first-pass
- HMM state durations modeled using Gamma distributions
- N-best list output, lattice dumping, and second-pass
rescoring functionality
- Word confidence computed from word-posteriors of
word-graph
- Class-based, word-based, and concept-based n-gram
language models
- Dynamically switched statistical language models (dialog
state-conditioned LMs)
- Keyword & regular expression based grammar spotting
with confidence
- Phonetic fast-match constrained ASR search for improved
decoding speed
- Mel Frequency Cepstral Coefficient (MFCC) feature representation
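As an illustration of the word-confidence feature listed above, the following minimal sketch computes posterior probabilities for the edges of a small word graph with the forward-backward algorithm. It is a generic textbook formulation, not SONIC's code. It assumes integer node ids in topological order, that every node lies on some path from the start node to the final node, and that each edge carries a single combined log score; the function names and the toy graph are hypothetical.

```python
import math
from collections import defaultdict

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def edge_posteriors(edges, start, final):
    """edges: list of (from_node, to_node, word, log_score), from_node < to_node.
    Returns one posterior probability per edge, in input order."""
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for i, (u, v, _, _) in enumerate(edges):
        outgoing[u].append(i)
        incoming[v].append(i)
    nodes = sorted({n for u, v, _, _ in edges for n in (u, v)})
    # Forward pass: log of the summed score of all partial paths start -> n.
    alpha = {start: 0.0}
    for n in nodes:
        if n != start and incoming[n]:
            alpha[n] = logsumexp(
                [alpha[edges[i][0]] + edges[i][3] for i in incoming[n]])
    # Backward pass: log of the summed score of all partial paths n -> final.
    beta = {final: 0.0}
    for n in reversed(nodes):
        if n != final and outgoing[n]:
            beta[n] = logsumexp(
                [edges[i][3] + beta[edges[i][1]] for i in outgoing[n]])
    total = alpha[final]
    # Posterior of an edge: share of total path mass that passes through it.
    return [math.exp(alpha[u] + s + beta[v] - total) for u, v, _, s in edges]

# Hypothetical word graph with made-up scores.
edges = [
    (0, 1, "recognize", math.log(3.0)),
    (0, 1, "wreck", 0.0),
    (1, 2, "speech", 0.0),
    (1, 2, "beach", 0.0),
    (0, 2, "silence", math.log(4.0)),
]
for (u, v, word, _), p in zip(edges, edge_posteriors(edges, 0, 2)):
    print(word, round(p, 3))
```

A word-level confidence would usually be obtained by summing the posteriors of edges that carry the same word over overlapping time spans; that merging step is omitted here.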
Speaker Adaptation
- (Confidence Weighted) Maximum Likelihood Linear
Regression (MLLR)
- Lattice-based MLLR (Lattice-MLLR)
- Constrained MLLR (CMLLR)
- Structural Maximum a Posteriori Linear Regression (SMAPLR)
- Vocal Tract Length Normalization (VTLN)
- Cepstral mean and variance normalization
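Of the methods listed above, cepstral mean and variance normalization is the simplest to show in code. The sketch below normalizes each feature dimension to zero mean and unit variance over one utterance. It is a generic per-utterance formulation written for illustration; the function name `cmvn` and the `eps` guard are assumptions, and how SONIC computes its statistics (for example in live mode) is not specified here.

```python
import math

def cmvn(frames, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization.

    frames: non-empty list of equal-length feature vectors (lists of floats).
    Returns new vectors with zero mean and unit variance in each dimension.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    # eps guards against division by zero for constant dimensions.
    return [[(f[d] - means[d]) / max(stds[d], eps) for d in range(dims)]
            for f in frames]

print(cmvn([[1.0, 10.0], [3.0, 30.0]]))  # [[-1.0, -1.0], [1.0, 1.0]]
```

Subtracting the mean removes stationary channel effects from the cepstra, and scaling by the standard deviation reduces mismatch in dynamic range between training and test conditions.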
Live-Mode Recognition and Voice Activity Detection
- API for streaming audio for keyword/grammars and
continuous speech recognition
- Internal, HMM-based voice activity detection method
Language Portability
- Designed to incorporate new phone sets and foreign
vocabularies.
- SONIC has been ported to German, Spanish, French,
Italian, Croatian, Arabic, Russian, Portuguese, Korean, Turkish, and Japanese
Speech Compression Interface
- Libraries and APIs for utilizing the Speex subband
CELP coding system are provided within SONIC.
- SONIC in server mode accepts both raw PCM and compressed
bitstreams.
Application Programming Interface (API)
- API environment for linking and designing speech enabled
applications.
- Batch-mode and simulated live-mode example
- Socket-based Tcl/Tk client / C-code server example
Supported Operating Systems
- Linux, Sun Solaris, Microsoft Windows, and Mac OS X
Project Management:
Software Contributors:
- Bryan Pellom: Research Assistant Professor, CSLR / Department of Computer Science
- Kadri Hacioglu: Research Associate, CSLR / Institute for Cognitive Science
- Wayne Ward: Research Professor, CSLR / Institute for Cognitive Science
- Umit Yapanel: PhD student, Electrical Engineering
- Min Tang: PhD student, Computer Science
- Andreas Hagen: PhD student, Computer Science
- Daniel Cer: PhD student, Computer Science / Institute for Cognitive Science
- Nicholas Romanyshyn: Senior undergraduate, Computer Science
Past Software Contributors:
- Ruhi Sarikaya: PhD, Electrical Engineering, 2000; currently with IBM Watson Speech Group
- Keith Herold: MS, Computer Science / Linguistics; currently with Lumenvox Speech Group
Recent Related Publications:
Core System Description
Bryan Pellom, Kadri Hacioglu, "Recent Improvements in the CU SONIC ASR System for Noisy Speech: The SPINE Task", in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, April 2003
Bryan Pellom, "SONIC: The University of Colorado Continuous Speech Recognizer", University of Colorado, tech report #TR-CSLR-2001-01, Boulder, Colorado, March 2001
Children's Speech Recognition
R. Cole, S. van Vuuren, B. Pellom, K. Hacioglu, J. Ma, J. Movellan, S. Schwartz, D. Wade-Stein, W. Ward, J. Yan, "Perceptive Animated Interfaces: First Steps Toward a New Paradigm for Human Computer Interaction", in Proceedings of the IEEE: Special Issue on Human Computer Interaction, vol. 91, no. 9, pp. 1391-1405, September 2003
Andreas Hagen, Bryan Pellom, Ronald Cole, "Children's Speech Recognition with Application to Interactive Books and Tutors", in IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, St. Thomas, USA, December 2003
Andreas Hagen, Daniel Connors, Bryan L. Pellom, "The Analysis and Design of Architecture Systems for Speech Recognition on Modern Hand-Held Computing Devices", First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign & System Synthesis, pp. 65-70, Newport Beach, California, USA, October 2003
Accented Speech Recognition
Ayako Ikeno, Bryan Pellom, Dan Cer, Ashley Thornton, Jason Brenier, Dan Jurafsky, Wayne Ward, William Byrne, "Issues in Recognition of Spanish-Accented Spontaneous English", in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, April 2003
Distributed Speech Recognition
Kadri Hacioglu, Bryan Pellom, "A Distributed Architecture for Robust Automatic Speech Recognition", in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, April 2003
Multilingual Speech Recognition
Kadri Hacioglu, Bryan Pellom, Tolga Ciloglu, Ozlem Ozturk, Mikko Kurimo, Mathias Creutz, "On Lexicon Creation for Turkish LVCSR", in Eurospeech - Interspeech, pp. 1165-1168, Geneva, Switzerland, September 2003
B. L. Pellom and R. Cole, "The CSLR International Workshop" (sponsored by NSF), Summer 2003. SONIC was ported to French, German, Italian, and Spanish (Mexican & Chilean) for use in interactive books and tutors.
O. Salor, B. L. Pellom, T. Ciloglu, K. Hacioglu, M. Demirekler, "On Developing New Text and Audio Corpora and Speech Recognition Tools for the Turkish Language", in International Conference on Spoken Language Processing (ICSLP), pp. 349-352, Denver, Colorado, September 2002
Call Center Speech Recognition
Min Tang, Bryan Pellom, Kadri Hacioglu, "Call-Type Classification and Unsupervised Training for the Call Center Domain", in IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, St. Thomas, USA, December 2003
Confidence Assessment for Speech Recognition
Kadri Hacioglu, Wayne Ward, "A Concept Graph based Confidence Measure", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, Florida, May 2002
Kadri Hacioglu, Wayne Ward, "A Word Graph Interface for a Flexible Concept Based Speech Understanding Framework", Proc. Eurospeech, Aalborg, Denmark, September 2001
Kadri Hacioglu, Wayne Ward, "Dialog-Context Dependent Language Modeling Using N-grams and Stochastic Context-Free Grammars", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City, May 2001
Robust Speech Recognition
Umit Yapanel, John H.L. Hansen, "A new perspective on Feature Extraction for Robust In-vehicle Speech Recognition", Proceedings of Eurospeech'03, Geneva, September 2003
Umit Yapanel, Xian Xian Zhang, J.H.L. Hansen, "High Performance Digit Recognition In Real Car Environments", International Conference on Spoken Language Processing (ICSLP), vol. 2, pp. 793-796, Denver, CO, September 2002