Boulder Language Technologies

SONIC: Large Vocabulary Continuous Speech Recognition System

Project Overview:

SONIC is a toolkit for research and development of new algorithms for continuous speech recognition. Since March 2001, SONIC has served as the test bed at the Center for Spoken Language Research for integrating new ideas and for supporting research activities that include speech recognition as a core component. While not a general HMM modeling toolkit, SONIC is designed specifically for speech recognition research, with careful attention to the speed and efficiency needed for real-time use in live applications.

SONIC utilizes state-of-the-art statistical acoustic and language modeling methods. The system's acoustic models are decision-tree state-clustered Hidden Markov Models (HMMs) with associated gamma probability density functions to model state durations. The recognizer implements a two-pass search strategy. The first pass consists of a time-synchronous, beam-pruned Viterbi token-passing search through a lexical prefix tree. Cross-word acoustic models and trigram or four-gram language models are applied in the first pass of search. During the second pass, the resulting word lattice is converted into a word graph. Lattices can be dumped in Finite State Machine (FSM) format compatible with AT&T tools, and n-best lists can be generated using the A* algorithm. Finally, word-posterior probabilities can be calculated to provide a measure of word confidence.
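
The first-pass search can be illustrated with a minimal Python sketch of time-synchronous, beam-pruned Viterbi token passing. This is not SONIC's implementation: it runs over a plain state-transition matrix rather than a lexical prefix tree, applies no language model, and the names `beam_viterbi`, `obs_loglik`, `log_trans`, `log_init`, and `beam` are hypothetical.

```python
import math

NEG_INF = float("-inf")

def beam_viterbi(obs_loglik, log_trans, log_init, beam):
    """Time-synchronous Viterbi search with beam pruning (toy sketch).

    obs_loglik[t][s]: log-likelihood of frame t under state s.
    log_trans[i][j]:  log P(next state j | state i); -inf if no arc.
    log_init[s]:      log P(first state is s).
    beam:             tokens scoring more than `beam` below the best
                      token of the current frame are dropped.
    Returns (best_log_score, best_state_path).
    """
    n_states = len(log_init)
    # One token per active state: state -> (score, path so far).
    tokens = {s: (log_init[s] + obs_loglik[0][s], [s])
              for s in range(n_states) if log_init[s] > NEG_INF}
    for frame in obs_loglik[1:]:
        best = max(score for score, _ in tokens.values())
        # Beam pruning: keep only tokens close to the current best.
        active = {s: tok for s, tok in tokens.items() if tok[0] >= best - beam}
        new_tokens = {}
        for s, (score, path) in active.items():
            for nxt in range(n_states):
                if log_trans[s][nxt] == NEG_INF:
                    continue
                cand = score + log_trans[s][nxt] + frame[nxt]
                # Viterbi recombination: keep the best token per state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        tokens = new_tokens
    return max(tokens.values(), key=lambda tok: tok[0])

# Toy left-to-right model with two states and four frames.
log_half = math.log(0.5)
score, path = beam_viterbi(
    obs_loglik=[[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]],
    log_trans=[[log_half, log_half], [NEG_INF, 0.0]],
    log_init=[0.0, NEG_INF],
    beam=3.0,
)
print(path)  # [0, 0, 1, 1]
```

A narrower `beam` discards more tokens per frame, trading search accuracy for speed. In this toy example the pruned token never lies on the best path, so the result matches an unpruned search.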

The recognizer also incorporates both model-based and feature-based speaker adaptation methods. Model-based adaptation methods include Maximum Likelihood Linear Regression (MLLR) and Structural Maximum a Posteriori Linear Regression (SMAPLR). In addition, SONIC includes implementations of feature-based adaptation methods such as Vocal Tract Length Normalization (VTLN), cepstral mean and variance normalization, and Constrained MLLR (CMLLR). Finally, advanced language-modeling strategies such as concept language models can be used to improve performance for dialog system recognition tasks.
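
Of the feature-based methods named above, cepstral mean and variance normalization is the simplest to illustrate. The sketch below assumes statistics are computed per utterance and uses the population variance; the description does not specify either choice, and the names `cmvn` and `eps` are hypothetical.

```python
import math

def cmvn(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of feature vectors (lists of floats), one per frame.
    eps:    floor on the standard deviation to avoid division by zero.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    return [[(f[d] - means[d]) / max(stds[d], eps) for d in range(dim)]
            for f in frames]

# Two frames of two-dimensional features; the second dimension is constant.
print(cmvn([[1.0, 10.0], [3.0, 10.0]]))  # [[-1.0, 0.0], [1.0, 0.0]]
```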

System Features:

Phonetic Aligner

  • Provides word, phone, and HMM state-level boundaries for acoustic training.  
  • Decision-tree based trainable letter-to-sound prediction module
  • Multilingual lexicon support
  • API for integration into applications (e.g., lip-synchronization)

Phonetic Decision Tree Acoustic Trainer

  • Estimates parameters of state-clustered continuous density Hidden Markov Models
  • Incorporates phonetic position & context questions
  • Distributed / parallel acoustic trainer (multiple machines, operating systems, CPUs)

Core Recognizer

  • Token-passing based recognition using a lexical-prefix tree search
  • Cross-word triphone models & up to 4-gram language model in first-pass
  • HMM state durations modeled using Gamma distributions
  • N-best list output, lattice dumping, and second-pass rescoring functionality
  • Word confidence computed from word-posteriors of word-graph
  • Class-based, word-based, and concept-based n-gram language models
  • Dynamically switched statistical language models (dialog state-conditioned LMs)
  • Keyword & regular expression based grammar spotting with confidence
  • Phonetic fast-match constrained ASR search for improved decoding speed
  • Mel Frequency Cepstral Coefficient (MFCC) feature representation
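
The Gamma state-duration model listed above can be sketched as follows. The feature list does not say how SONIC estimates the distribution parameters, so fitting by moment matching is an assumption, as are the names `fit_gamma_moments` and `gamma_log_density`.

```python
import math

def fit_gamma_moments(durations):
    """Fit Gamma shape and scale by matching the sample mean and variance."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    shape = mean * mean / var   # k = mean^2 / variance
    scale = var / mean          # theta = variance / mean
    return shape, scale

def gamma_log_density(d, shape, scale):
    """Log of the Gamma density at duration d > 0 (in frames)."""
    return ((shape - 1.0) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))

# Hypothetical state durations, in frames, observed in aligned training data.
shape, scale = fit_gamma_moments([2.0, 4.0, 6.0])
# A typical duration scores higher than an unusually long one.
print(gamma_log_density(4.0, shape, scale) > gamma_log_density(12.0, shape, scale))  # True
```

During decoding, a log-density of this kind could be added to a path score when a token leaves a state, penalizing implausibly short or long state occupancies.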

Speaker Adaptation

  • (Confidence Weighted) Maximum Likelihood Linear Regression (MLLR)
  • Lattice-based MLLR (Lattice-MLLR)
  • Constrained MLLR (CMLLR)
  • Structural Maximum a Posteriori Linear Regression (SMAPLR)
  • Vocal Tract Length Normalization (VTLN)
  • Cepstral mean and variance normalization
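
MLLR adapts the Gaussian mean vectors of the acoustic model with an affine transform, mu' = A mu + b, shared across a class of Gaussians. The sketch below only applies an already estimated transform to one mean. Estimating A and b from adaptation data is not shown, and the name `apply_mllr_mean` is hypothetical.

```python
def apply_mllr_mean(mean, A, b):
    """Return the adapted mean A @ mean + b, using plain Python lists.

    mean: Gaussian mean vector (list of floats).
    A:    square transform matrix (list of rows).
    b:    bias vector (list of floats).
    """
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]

# Hypothetical 2-dimensional transform: scale the first dimension, shift both.
adapted = apply_mllr_mean([1.0, 2.0], [[2.0, 0.0], [0.0, 1.0]], [0.5, -1.0])
print(adapted)  # [2.5, 1.0]
```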

Live-Mode Recognition and Voice Activity Detection

  • API for streaming audio for keyword/grammars and continuous speech recognition
  • Internal, HMM-based voice activity detection method

Language Portability

  • Designed to incorporate new phone sets and foreign vocabularies.  
  • SONIC has been ported to German, Spanish, French, Italian, Croatian, Arabic, Russian, Portuguese, Korean, Turkish, and Japanese

Speech Compression Interface

  • Libraries and APIs for utilizing the Speex subband CELP coding system are provided within SONIC. 
  • SONIC in server mode accepts both raw PCM and compressed bitstreams.

Application Programming Interface (API)

  • API environment for linking and designing speech enabled applications.
  • Batch-mode and simulated live-mode example
  • Socket-based Tcl/Tk client / C-code server example

Supported Operating Systems

  • Linux, Sun Solaris, Microsoft Windows, and Mac OS X

Project Management:

Bryan Pellom
Research Assistant Professor,
CSLR / Department of Computer Science

Kadri Hacioglu
Research Associate,
CSLR / Institute for Cognitive Science

Software Contributors:

Bryan Pellom
Research Assistant Professor,
CSLR / Department of Computer Science

Kadri Hacioglu
Research Associate,
CSLR / Institute for Cognitive Science

Wayne Ward
Research Professor,
CSLR / Institute for Cognitive Science

Umit Yapanel
PhD student,
Electrical Engineering

Min Tang
PhD student,
Computer Science

Andreas Hagen
PhD student,
Computer Science

Daniel Cer
PhD student,
Computer Science / Institute for Cognitive Science

Nicholas Romanyshyn
Senior Undergraduate,
Computer Science

Past Software Contributors:

Ruhi Sarikaya
PhD, Electrical Engineering, 2000
currently with IBM Watson Speech Group

Keith Herold
MS, Computer Science / Linguistics
currently with Lumenvox Speech Group

Recent Related Publications:

Core System Description
Bryan Pellom, Kadri Hacioglu, "Recent Improvements in the CU SONIC ASR System for Noisy Speech: The SPINE Task", in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, April, 2003

Bryan Pellom, "SONIC: The University of Colorado Continuous Speech Recognizer", University of Colorado, tech report #TR-CSLR-2001-01, Boulder, Colorado, March, 2001

Children's Speech Recognition
R. Cole, S. van Vuuren, B. Pellom, K. Hacioglu, J. Ma, J. Movellan, S. Schwartz, D. Wade-Stein, W. Ward, J. Yan, "Perceptive Animated Interfaces: First Steps Toward a New Paradigm for Human Computer Interaction", in Proceedings of the IEEE: Special Issue on Human Computer Interaction, vol. 91, no. 9, pp. 1391-1405, September, 2003

Andreas Hagen, Bryan Pellom, Ronald Cole, "Children's Speech Recognition with Application to Interactive Books and Tutors", in IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, St. Thomas, USA, December, 2003

Andreas Hagen, Daniel Connors, Bryan L. Pellom, "The Analysis and Design of Architecture Systems for Speech Recognition on Modern Hand-Held Computing Devices", First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign & System Synthesis, pp. 65-70, Newport Beach, California, USA, October, 2003

Accented Speech Recognition
Ayako Ikeno, Bryan Pellom, Dan Cer, Ashley Thornton, Jason Brenier, Dan Jurafsky, Wayne Ward, William Byrne, "Issues in Recognition of Spanish-Accented Spontaneous English", in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, April, 2003

Distributed Speech Recognition
Kadri Hacioglu, Bryan Pellom, "A Distributed Architecture for Robust Automatic Speech Recognition", in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, April, 2003

Multilingual Speech Recognition
Kadri Hacioglu, Bryan Pellom, Tolga Ciloglu, Ozlem Ozturk, Mikko Kurimo, Mathias Creutz, "On Lexicon Creation for Turkish LVCSR", in Eurospeech - Interspeech, pp. 1165-1168, Geneva, Switzerland, September, 2003

B. L. Pellom and R. Cole, "The CSLR International Workshop," (Sponsored by NSF), Summer 2003. SONIC was ported to French, German, Italian, and Spanish (Mexican & Chilean) for use in interactive books and tutors.

O. Salor, B. L. Pellom, T. Ciloglu, K. Hacioglu, M. Demirekler, "On Developing New Text and Audio Corpora and Speech Recognition Tools for the Turkish Language", in International Conference on Spoken Language Processing (ICSLP), pp. 349-352, Denver, Colorado, September, 2002

Call Center Speech Recognition
Min Tang, Bryan Pellom, Kadri Hacioglu, "Call-Type Classification and Unsupervised Training for the Call Center Domain", in IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, St. Thomas, USA, December, 2003

Confidence Assessment for Speech Recognition
Kadri Hacioglu, Wayne Ward, "A Concept Graph based Confidence Measure", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Orlando, Florida, May 2002.

Kadri Hacioglu, Wayne Ward, "A Word Graph Interface for a Flexible Concept Based Speech Understanding Framework," Proc. Eurospeech, Aalborg, Denmark, September 2001.

Kadri Hacioglu, Wayne Ward, "Dialog-Context Dependent Language Modeling Using N-grams and Stochastic Context-Free Grammars", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City, May 2001.

Robust Speech Recognition
Umit Yapanel, John H.L. Hansen, "A new perspective on Feature Extraction for Robust In-vehicle Speech Recognition", Proceedings of Eurospeech'03, Geneva, Sept. 2003

Umit Yapanel, Xian Xian Zhang, J.H.L. Hansen, "High Performance Digit Recognition In Real Car Environments," Inter. Conf. on Spoken Language Processing (ICSLP), vol. 2, pp. 793-796, Denver, CO, Sept. 2002
