The Mod9 ASR Engine supports models to perform speech recognition (converting audio to text), pronunciation generation, and natural language processing (punctuation, capitalization, number formatting, and disfluency removal).
Below we describe the formats and layout required to use models with the Engine.
Models must be stored in a directory accessible to the Engine at run
time. The default directory is
/opt/mod9-asr/models/ but it can be specified when you start the
Engine using the
--models.path option. The models directory
should contain subdirectories named
asr/, g2p/, and nlp/. Models related to automatic speech
recognition are stored under the
asr/ directory, grapheme-to-phoneme conversion models are under
g2p/, and natural
language processing models are stored under
nlp/. The models themselves are further subdirectories under each of these.
For example, the
en-US_phone ASR model is used for
performing speech recognition on US English telephone data. It consists of
files and directories under
/opt/mod9-asr/models/asr/en-US_phone/ (which is more specifically an alias to
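The directory layout described above can be sketched in a few lines of Python. This is only an illustration of the path convention, not part of the Engine: the helper names `model_dir` and `list_models` are hypothetical, and the only facts taken from the text are the default models directory and the `asr/`, `g2p/`, and `nlp/` subdirectories.

```python
from pathlib import Path

# Default models directory named in the text; pass another path to mirror
# what the --models.path option would select.
DEFAULT_MODELS_PATH = Path("/opt/mod9-asr/models")

# The three model-type subdirectories described above.
MODEL_TYPES = ("asr", "g2p", "nlp")


def model_dir(model_type, name, models_path=DEFAULT_MODELS_PATH):
    """Return the directory where a model of this type and name would live."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    return Path(models_path) / model_type / name


def list_models(models_path=DEFAULT_MODELS_PATH):
    """Map each model type to the sorted names of the subdirectories under it."""
    found = {}
    for model_type in MODEL_TYPES:
        type_dir = Path(models_path) / model_type
        if type_dir.is_dir():
            # is_dir() follows symbolic links, so aliased model names are included.
            found[model_type] = sorted(p.name for p in type_dir.iterdir() if p.is_dir())
        else:
            found[model_type] = []
    return found


if __name__ == "__main__":
    print(model_dir("asr", "en-US_phone"))
    print(list_models())
```

Running `list_models()` against a real installation is a quick way to see which model names are present under each type before starting the Engine.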
Automatic Speech Recognition (ASR) models
The Mod9 ASR Engine can use automatic speech recognition models that are compatible
with models produced and used by the Kaldi
toolkit. By default, the Engine looks for files in very specific
locations and sets various configuration variables to sensible default
values for most Kaldi models. You can, however, override the defaults
(typically by creating or modifying
In the sections below, files and directories are listed relative to
the model directory described above (e.g.
Required ASR model files
Only five files are absolutely required to run an ASR model on the Engine. The
locations listed below are the defaults where the Engine will look for
these files. See the section on
conf/model.conf below for overriding
the default locations.
am/final.mdl - The acoustic model. In Kaldi recipes, this will typically be stored at a path named something like
exp/chain/tdnn7q/final.mdl. It consists of the architecture and parameters describing the neural network used for acoustic modeling. Note that only
chain models are supported.
graph/HCLG.fst - The decode graph. In Kaldi recipes, this is usually stored in the graph directory, which is typically named something like
exp/chain/tdnn7q/graph. It consists of an FST that represents the language model, lexicon, and state structure of the model.
graph/words.txt - A mapping between word IDs and word symbols. It will typically be in the same directory as
graph/phones/word_boundary.int - A mapping between phone IDs and where in a word the phone can occur. This can usually be found in
phones/word_boundary.int under the same directory where
conf/mfcc.conf - Feature configuration file. Although the Engine can theoretically operate without this file, the defaults will seldom be correct. We strongly recommend using
conf/mfcc.conf. In Kaldi recipes, it can generally be found in
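Before starting the Engine, it can be handy to confirm that a model directory contains the five required files at their default locations. The sketch below is a hypothetical helper, not an Engine utility; it only checks the default relative paths listed above and does not account for locations overridden in conf/model.conf.

```python
from pathlib import Path

# Default relative locations of the five required files, as listed above.
REQUIRED_ASR_FILES = (
    "am/final.mdl",
    "graph/HCLG.fst",
    "graph/words.txt",
    "graph/phones/word_boundary.int",
    "conf/mfcc.conf",
)


def missing_required_files(model_dir):
    """Return the required files (relative paths) that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [rel for rel in REQUIRED_ASR_FILES if not (model_dir / rel).is_file()]


if __name__ == "__main__":
    # The path below is the example model directory used earlier in this document.
    missing = missing_required_files("/opt/mod9-asr/models/asr/en-US_phone")
    if missing:
        print("Missing required files:", ", ".join(missing))
    else:
        print("All required files are present at their default locations.")
```

An empty list means every required file was found; any entries in the list name the files that still need to be copied from the Kaldi recipe output.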
Optional ASR model files
There are several optional files. If these files are not provided, some functionality in the Engine may be disabled.
am/tree - A description of the low-level phonetic units used in the system. In Kaldi recipes, this would typically be found in the same directory as
If tree is not provided, custom words and custom grammars cannot be used.
graph/phones.txt - A mapping between phone IDs and phone symbols. This can usually be found in the same directory as
If phones.txt isn't provided, custom words and custom grammars cannot be used.
conf/model.conf - A Kaldi-style configuration file. There is no directly equivalent file in typical Kaldi recipes; rather, the same information is usually encoded in scripts. Full documentation on the file can be found in comments in the models that ship with the Engine (e.g. asr/en/conf/model.conf).
Of particular note,
you can override where the Engine looks for files. For example, if the
de_small had a file
contained the line
--am=exp1/expanded.mdl, the Engine would look for
the acoustic model in
than the default
metadata.json - A JSON file with various metadata used by the Engine. Most of the fields in this file are informational only; however, the "language" field is used to match automatic speech recognition models with natural language processing models. See models that ship with the Engine for examples (e.g. asr/en/metadata.json).
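To make the conf/model.conf override concrete, here is a minimal sketch of how a Kaldi-style configuration line such as --am=exp1/expanded.mdl could be read and applied. It is an illustration only, not the Engine's parser. The function names are hypothetical, the model directory in the usage example is a placeholder, and treating override paths as relative to the model directory is an assumption. Only the --am option name and the am/final.mdl default come from the text above.

```python
from pathlib import PurePosixPath

DEFAULT_AM = "am/final.mdl"  # default acoustic model location from the text


def parse_kaldi_config(text):
    """Parse Kaldi-style config text into a dict of option name -> value.

    Lines look like --name=value; anything after '#' is treated as a comment.
    A bare --name is recorded as "true" (an assumption for this sketch).
    """
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not line.startswith("--"):
            raise ValueError(f"expected --name=value, got: {raw_line!r}")
        name, sep, value = line[2:].partition("=")
        options[name] = value if sep else "true"
    return options


def resolve_am_path(model_dir, options):
    """Return the acoustic model path, honoring an --am override if present.

    Assumes the override is relative to the model directory.
    """
    return PurePosixPath(model_dir) / options.get("am", DEFAULT_AM)


if __name__ == "__main__":
    conf_text = "# hypothetical conf/model.conf\n--am=exp1/expanded.mdl\n"
    opts = parse_kaldi_config(conf_text)
    # "path/to/de_small" is a placeholder for the model's directory.
    print(resolve_am_path("path/to/de_small", opts))
    print(resolve_am_path("path/to/de_small", {}))
```

With the override present the first call resolves under exp1/, and without it the second call falls back to the default am/final.mdl location.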
ivectors are features related to speakers, and are very commonly used in research systems. Although they provide a statistically significant performance boost in many benchmarks, we find they tend to be brittle -- when the audio isn't a very good match to the training data, accuracy can suffer. Also, if speakers aren't on their own channel, ivectors can fail spectacularly.
The Engine supports ivectors if the directory
/opt/mod9-asr/models/asr/mod9/en-US_phone-ivector/ivector/) and all required
files exist under that directory. In Kaldi recipes, the files can
typically be found under a directory named something like
exp/chain/extractor. The required files are
Compression and encryption of model files
The Engine transparently supports gzip compression of most Kaldi model
files. This is most useful for
HCLG.fst, which can shrink by a
factor of 2 with compression. Note that the gzip file format is automatically detected,
so the default name for the graph is still
HCLG.fst and not
Some files in models shipped by Mod9 have been encrypted. The Engine will transparently decrypt these files. As with compression, the file names do not need to be modified when they are encrypted; the encryption format is automatically detected. These files may not be used other than in accordance with the licensing terms.
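As an illustration of why a gzip-compressed file can keep its original name, the sketch below detects gzip content from the two-byte magic number that begins every gzip stream, rather than from the file extension. This is a generic technique shown under that assumption; it is not a description of the Engine's internals, and the helper name `open_maybe_gzip` is hypothetical. It does not cover the encrypted files mentioned above.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing it if it is gzip data.

    Detection looks at the file contents, so a compressed graph can still
    be named HCLG.fst.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py path/to/HCLG.fst
    with open_maybe_gzip(sys.argv[1]) as stream:
        header = stream.read(16)
    print(f"read {len(header)} bytes of (decompressed) content")
```

Because the check reads the content, the same call works whether or not the file on disk was compressed.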
©2019-2022 Mod9 Technologies (Version 1.9.3)