EFFICIENT PHONE-BASED RECOGNITION
ENGINES FOR CHINESE AND ENGLISH ISOLATED
COMMAND APPLICATIONS
Xavier MENÉNDEZ-PIDAL, Lei DUAN, Jingwen LU, Beatriz DUKES, Michael EMONTS, Gustavo
HERNÁNDEZ ÁBREGO, Lex OLORENSHAW
Spoken Language Technology Group, SONY NSCA,
San José, CALIFORNIA
{Xavier, Lei, Jingwen, Beatriz, Mike, Gustavo, Lexo} @slt.sel.sony.com
ABSTRACT
In this paper we present a flexible and efficient
approach to building an accurate speech recognition
interface for isolated command applications in three
different languages: Mandarin, Cantonese and
English. The paper analyzes and discusses the
trade-offs necessary to obtain an accurate,
real-time system with low memory requirements.
Areas addressed are the design of the training
database and the Hidden Markov Model (HMM) units
used by the recognizer (monophones versus triphones).
1. INTRODUCTION
This paper discusses a speech recognition interface
for controlling small computer devices, such as car
navigation, telephone, or robot systems, by spoken
commands. In such devices, with limited CPU and
memory resources, the implementation of a flexible,
real-time and accurate speech recognition interface
is an open engineering design issue. The two
restrictions on our recognizer, memory and maximum
computational cost, are introduced next. For the
system analyzed, we had a maximum memory size of
0.5 Megabytes.
Speech recognition is a computationally expensive
process even for small command applications, but it
can be accelerated by pruning techniques such as beam
search strategies in a standard Viterbi search engine.
During Viterbi decoding, no more than 300 Gaussians
per frame can be evaluated to meet the CPU
limitations of our usual hardware. In this paper we
analyze and compare two recognizer designs, based on
monophone and triphone models, to obtain a real-time
system with limited memory requirements. We also
discuss different alternatives to improve system
accuracy and portability through an appropriate
design of the training and testing databases.
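
To make the role of beam pruning concrete, the following is a minimal
sketch of a log-domain Viterbi pass with a beam, not the decoder used in
this work. The function name viterbi_beam, its arguments, and the
table-based emission scores are assumptions made for illustration only;
a real engine would compute Gaussian likelihoods only for the states
that survive pruning, which is where the savings in Gaussians per frame
would come from.

```python
NEG_INF = float("-inf")

def viterbi_beam(log_init, log_trans, log_emit, beam):
    """Log-domain Viterbi decoding with beam pruning (illustrative sketch).

    log_init[j]     : log prior of starting in state j
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[t][j]  : log-likelihood of frame t under state j
    beam            : states scoring below (frame best - beam) are dropped

    Returns (best_path, best_score, active_counts), where active_counts[k]
    is the number of states kept when extending into frame k + 1.
    """
    n_states = len(log_init)
    scores = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptrs = []
    active_counts = []

    for t in range(1, len(log_emit)):
        best = max(scores)
        # Beam pruning: keep only hypotheses close to the current best one.
        active = [i for i in range(n_states)
                  if scores[i] > NEG_INF and scores[i] >= best - beam]
        active_counts.append(len(active))

        new_scores = [NEG_INF] * n_states
        ptr = [-1] * n_states
        for i in active:
            for j in range(n_states):
                if log_trans[i][j] == NEG_INF:
                    continue
                cand = scores[i] + log_trans[i][j]
                if cand > new_scores[j]:
                    new_scores[j] = cand
                    ptr[j] = i
        # Only states reached from a surviving state need an emission score.
        for j in range(n_states):
            if new_scores[j] > NEG_INF:
                new_scores[j] += log_emit[t][j]
        scores = new_scores
        backptrs.append(ptr)

    # Backtrace from the best final state.
    last = max(range(n_states), key=lambda j: scores[j])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, scores[last], active_counts
```

With a wide beam the search keeps every reachable state; narrowing the
beam reduces the number of active states per frame at the risk of
pruning the correct hypothesis. A hard cap on the number of active
states could be added in the same place to respect a fixed per-frame
budget such as the one described above, although that variant is not
shown here.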
2. SYS