Custom Pronunciations | Mod9 ASR Engine


Mod9 ASR Engine: Custom Pronunciations

Several Engine requests take a list of words as an option, including add-words, custom grammars in the recognize request, and add-grammar (currently in beta). For example, you could add the words "xcommand", "mod9", and "janin" to the "en_video" model with the following command:

echo '{
  "command": "add-words",
  "asr-model": "en_video",
  "words": [
    { "word": "xcommand" },
    { "word": "mod9", "soundslike": "mod nine" },
    { "word": "janin", "phones": "JH AE N IH N" }
  ]
}' | jq -c . | nc $HOST $PORT

The "word" field contains the spelling of the word, which is what the Engine outputs. It can be any text, including numerals or Unicode (UTF-8 encoded).
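For clients that build requests programmatically, the same request can be assembled in Python. The following is a minimal sketch rather than an official client: `build_request` and `send_request` are hypothetical helper names, and the sketch assumes that the Engine reads one line of compact JSON per request over TCP (which is what `jq -c . | nc` sends above) and that its reply begins with a newline-terminated line of JSON. Serializing with `ensure_ascii=False` keeps non-ASCII spellings as literal UTF-8 bytes.

```python
import json
import socket


def build_request(command, words, asr_model=None):
    """Serialize a request as one line of compact, UTF-8 encoded JSON."""
    request = {"command": command}
    if asr_model is not None:
        request["asr-model"] = asr_model
    request["words"] = words
    # ensure_ascii=False emits Unicode spellings as UTF-8 instead of \u escapes.
    line = json.dumps(request, ensure_ascii=False, separators=(",", ":"))
    return (line + "\n").encode("utf-8")


def send_request(host, port, payload, timeout=10.0):
    """Send one request and parse the first reply line.

    Assumption: the reply starts with a newline-terminated JSON object.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        with sock.makefile("r", encoding="utf-8") as reply:
            return json.loads(reply.readline())


payload = build_request(
    "add-words",
    [
        {"word": "xcommand"},
        {"word": "mod9", "soundslike": "mod nine"},
        {"word": "janin", "phones": "JH AE N IH N"},
    ],
    asr_model="en_video",
)
```

With a running Engine, `send_request(host, port, payload)` would then transmit the request; the host and port are whatever `$HOST` and `$PORT` refer to in the shell examples.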

Note that no additional information is provided for "xcommand". In this case, the Engine computes pronunciations automatically. Automatically computed pronunciations are usually reasonable, but accuracy can be improved by providing additional information in the options. In particular, if the "word" field contains anything other than the letters "a" through "z", the automatically generated pronunciation can be inaccurate.

For "mod9", the "soundslike" option was provided. This gives the Engine a hint for how to pronounce "mod9", which is particularly useful for words that contain anything other than the letters "a" through "z". The "soundslike" field should contain only letters and spaces. It works best when it is composed of common words, each with a single pronunciation.
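The guidance above can be turned into a simple pre-flight check that runs before words are sent to the Engine. This is a sketch under stated assumptions: `needs_hint` and `valid_soundslike` are hypothetical helpers, "letters" is taken to mean ASCII letters, and the hint check is stricter than the text requires because it also rejects leading, trailing, or doubled spaces.

```python
import re


def needs_hint(word):
    """True if the spelling contains anything other than "a" through "z"."""
    return re.fullmatch(r"[a-z]+", word) is None


def valid_soundslike(text):
    """True if the hint is ASCII letters separated by single spaces (assumed rule)."""
    return re.fullmatch(r"[A-Za-z]+( [A-Za-z]+)*", text) is not None


words = [
    {"word": "xcommand"},
    {"word": "mod9", "soundslike": "mod nine"},
    {"word": "janin", "phones": "JH AE N IH N"},
]

for entry in words:
    has_hint = "soundslike" in entry or "phones" in entry
    if needs_hint(entry["word"]) and not has_hint:
        print(f'{entry["word"]}: consider adding "soundslike" or "phones"')
    if "soundslike" in entry and not valid_soundslike(entry["soundslike"]):
        print(f'{entry["word"]}: "soundslike" should contain only letters and spaces')
```

The three example words pass without warnings. An entry such as `{"word": "mod9"}` with no hint would be flagged.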

You can also explicitly provide the pronunciation in the "phones" field, as demonstrated in the example for "janin". For ASR models trained by Mod9, the "phones" field describes how the word is spoken and must be a sequence of phonemes from CMUdict, which is itself a subset [1] of ARPAbet. Lexical stress is not supported.
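A "phones" string can be checked locally before it is sent. The sketch below assumes the standard 39-phoneme CMUdict inventory written without stress digits. `invalid_phones` is a hypothetical helper, and the Engine's own validation may differ, for example for models not trained by Mod9.

```python
# The 39 CMUdict phonemes, written without lexical stress digits.
CMUDICT_PHONES = frozenset(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH".split()
)


def invalid_phones(phones):
    """Return the symbols in a space-separated string that are not CMUdict phonemes."""
    return [symbol for symbol in phones.split() if symbol not in CMUDICT_PHONES]


print(invalid_phones("JH AE N IH N"))    # []
print(invalid_phones("JH AE1 N IH0 N"))  # ['AE1', 'IH0'] (stress digits are rejected)
print(invalid_phones("HH AX L OW"))      # ['AX'] (ARPAbet, but absent from CMUdict)
```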

Although automatically generated pronunciations and "soundslike" typically produce good results, using "phones" will produce the most accurate transcriptions. The next section describes how to audit automatically generated pronunciations.

Listing Pronunciations

The "pronounce-words" command returns the pronunciations that the Engine would use for a given "words" option. This lets you manually audit the automatically generated pronunciations and then pass the correct ones in the "phones" field of later requests such as "add-words".

For example:

echo '{
  "command": "pronounce-words",
  "words": [
    {"word": "xcommand"},
    {"word": "mod9", "soundslike": "mod nine"},
    {"word": "janin", "phones": "JH AE N IH N"}
  ]
}' | jq -c . | nc $HOST $PORT

This returns:

{ "status": "completed", "words": [
  { "word": "xcommand", "phones": "Z K AH M AE N D" },
  { "word": "xcommand", "phones": "Z K AA M AH N D" },
  { "word": "xcommand", "phones": "EH K S K AH M AE N D" },
  { "word": "mod9", "phones": "M AO D N AY N" },
  { "word": "janin", "phones": "JH AE N IH N" }
]}

Note that the Engine returned multiple automatically generated pronunciations for "xcommand". This is partly because "x" is often pronounced as in "xylophone", and partly because pronunciation generation is inherently ambiguous. Incorrect pronunciations typically do little harm to recognition as long as the correct pronunciation is also present, but for the highest accuracy it is best to select the correct pronunciations and use them in future commands. For example:

echo '{
  "command": "add-words",
  "asr-model": "en_video",
  "words": [
    { "word": "xcommand", "phones": "EH K S K AH M AE N D" },
    { "word": "mod9", "phones": "M AO D N AY N" },
    { "word": "janin", "phones": "JH AE N IH N" }
  ]
}' | jq -c . | nc $HOST $PORT

This command also supports the g2p-options or g2p-cost request options for more advanced functionality.
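The audit step described in this section can be partly scripted. The following sketch groups a "pronounce-words" reply by word so that ambiguous entries stand out, then builds an "add-words" request from manually chosen pronunciations. The reply is the one shown earlier in this section. `group_by_word` and `build_add_words` are hypothetical helper names, the sketch keeps one pronunciation per word for simplicity, and the choice itself is still made by a person.

```python
import json
from collections import defaultdict

# Reply from "pronounce-words", copied from the example in this section.
reply = json.loads("""
{ "status": "completed", "words": [
  { "word": "xcommand", "phones": "Z K AH M AE N D" },
  { "word": "xcommand", "phones": "Z K AA M AH N D" },
  { "word": "xcommand", "phones": "EH K S K AH M AE N D" },
  { "word": "mod9", "phones": "M AO D N AY N" },
  { "word": "janin", "phones": "JH AE N IH N" }
]}
""")


def group_by_word(reply):
    """Map each word to its candidate pronunciations, in reply order."""
    grouped = defaultdict(list)
    for entry in reply["words"]:
        grouped[entry["word"]].append(entry["phones"])
    return dict(grouped)


def build_add_words(grouped, chosen, asr_model):
    """Build an add-words request; `chosen` settles words with several candidates."""
    words = []
    for word, candidates in grouped.items():
        if word in chosen:
            phones = chosen[word]
        elif len(candidates) == 1:
            phones = candidates[0]
        else:
            raise ValueError(f"{word!r} has {len(candidates)} candidates; choose one")
        words.append({"word": word, "phones": phones})
    return {"command": "add-words", "asr-model": asr_model, "words": words}


grouped = group_by_word(reply)
for word, candidates in grouped.items():
    if len(candidates) > 1:
        print(f"{word}: {len(candidates)} candidates to review: {candidates}")

request = build_add_words(grouped, {"xcommand": "EH K S K AH M AE N D"}, "en_video")
print(json.dumps(request, separators=(",", ":")))
```

The printed request carries the same content as the "add-words" example above and can be sent to the Engine in the same way. Raising an error for unresolved words is a deliberate choice in this sketch, so that an ambiguous pronunciation is never picked silently.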


Footnotes

1: CMUdict is a subset of ARPAbet with the following phonemes absent: AX, AXR, DX, EL, EM, EN, H, IX, NX, Q, UX, WH.


©2019-2022 Mod9 Technologies (1.9.5)