The Mod9 ASR Engine consists of a Linux server, along with compatible models and support software, that enables clients to send commands and audio over a network and receive the written transcript of what was said in the audio.
This document describes how to run the Engine from the perspective of an operator who controls the server.
The accompanying TCP reference documentation describes how clients may communicate with the Engine, including: specifying various commands and request options, sending audio data, and interpreting various response formats.
- Quick start
- System requirements
- Obtaining the Engine
- Starting the Engine
- Shutting down the Engine
- Models
- Configuring the Engine
- Advanced Docker operation
- Engine performance
- Release notes
- Support contacts
This section assumes that you have opted to run the Engine in a Docker container.
- Download the Docker image; it's several gigabytes, so this may be slow depending on your Internet connection.
  docker pull mod9/asr
  Note that referencing the image as mod9/asr is shorthand for mod9/asr:latest.
- Start the container as a foreground process in the terminal:
  docker run -it --rm -p 9900:9900 mod9/asr
  To stop this one-off Engine container, simply type CTRL+C:
  - the -it flag allows your keyboard to interrupt the foreground process;
  - the --rm flag will remove (i.e. clean up) the stopped Docker container;
  - by default the Engine listens on the container's port 9900; the -p flag maps this to your host machine's port 9900.
This "quick start" setup is suitable for local experimentation and debugging.
You can now have local clients communicate with this Engine server by setting localhost and 9900 respectively for:
- the HOST and PORT variables in the reference documentation examples for ad-hoc command-line usage with nc;
- the ASR_ENGINE_HOST and ASR_ENGINE_PORT environment variables with the Python SDK or REST API.
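For example, a quick connectivity check from the host can use the get-info command described later in this document; this is a sketch that assumes the quick-start port mapping above and that nc is installed:
# Values for the reference documentation examples; adjust if you changed the -p mapping.
HOST=localhost
PORT=9900
echo '{"command":"get-info"}' | nc $HOST $PORT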
The remainder of this documentation describes further details of the Engine's operation, as well as considerations for an operator when deploying the server in a production environment.
The minimum system requirements for running the ASR Engine with default settings are 8GiB of memory, an x86-compatible chip, and Linux from kernel 4.0 onward. We recommend a modern Intel chip (e.g. Xeon) for best performance.
The memory requirements can grow depending on how the Engine is started and how clients call the services the Engine provides. See section Configuring the Engine, below.
Note also that Mod9 can provide other versions of the Engine that trade off accuracy for speed and/or memory.
The ASR Engine is licensed by Mod9 Technologies. For more information about licensing, contact sales@mod9.com.
The preferred distribution mechanism is through a Docker image containing the Engine and all the required support files. We can also provide the Engine as a tarball containing a Linux executable (compatible with most Linux distributions) plus required support files. The following two subsections describe each of these processes in more detail.
Docker is a system for running software in environments known as containers. Mod9 Technologies provides images that can be used to create containers that are run by the Docker software, which can be installed on most operating systems.
To download the Docker image, run the command:
docker pull mod9/asr
This will download the latest version of the image; for a specific version, specify mod9/asr:1.9.5, for example.
Since the image contains the speech recognition models, which can be quite large, this download can take a few minutes.
Next, you can start a container. This does not start the Engine itself, but rather launches a new docker container of which the Engine is a part. You can think of this as a virtual machine that isolates the Engine from the physical machine.
docker run -it --rm --network=host mod9/asr:latest bash
This command starts the container, and allows you to connect to it using any port.
It opens a bash shell where you can enter commands to run in the container.
You can stop the container by exiting the shell.
Note that this method of starting a container allows you to use the same commands to start the Engine regardless of whether the Engine is running in a container or from the tarball. However, we do not recommend this method for production; see Advanced Docker operation below.
If you elected to receive the Engine as a tarball, you should have
access to the file mod9-asr-engine_linux.tar.gz
on the machine where you want
to run the Engine. Next, select a directory to store the Engine and
support files. It can be anything, but we recommend using /opt/mod9-asr. Copy mod9-asr-engine_linux.tar.gz to that directory.
Next, to extract the Engine and support files, run:
cd /opt/mod9-asr
tar xzf mod9-asr-engine_linux.tar.gz
This should create the executable /opt/mod9-asr/bin/engine and support files, including /opt/mod9-asr/models.
Whether you are running in a Docker container or from the tarball, starting the Engine follows the same pattern. From the command line and in the directory where the executable engine is installed (typically /opt/mod9-asr), run the command:
engine
This will start the Engine with the default English model (en) and listen on port 9900. See sections Models and Port selection below for more information.
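For a tarball installation, the executable is under bin/; as a sketch (assuming the recommended /opt/mod9-asr location and that this bin/ directory has not been added to your PATH), the equivalent invocation is:
cd /opt/mod9-asr
./bin/engine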
The Engine accepts a client shutdown
command, which directs the Engine to stop accepting new recognition requests
and begin a "graceful" shutdown, in which it waits for all active requests to complete before exiting.
# The Engine will wait for all active requests to finish and exit with code 0.
$ echo '{"command":"shutdown"}' | nc $HOST $PORT
{"status":"completed","warning":"Shutdown requested. The Engine will no longer accept new recognition requests."}
The shutdown
command accepts an optional timeout
field, which directs the Engine
to forcefully shut down after the timeout duration even if there are active requests that have not completed.
If all active requests have completed before the timeout duration, the Engine will immediately shut down.
A negative timeout
value will direct the Engine to wait until all active requests complete before shutting down.
# The Engine will wait up to 10.2 seconds for all active requests to finish. Otherwise, it will forcefully and immediately exit.
$ echo '{"command":"shutdown", "timeout":10.2}' | nc $HOST $PORT
{"status":"completed","warning":"Shutdown requested. The Engine will no longer accept new recognition requests."}
By default, the Engine will shut down gracefully with an exit code of 0 if all of the active requests have completed, and it will shut down forcefully with an exit code of 1 if the timeout is reached and other requests are still active. The exit code for a forceful shutdown can be configured when starting the Engine.
# If a shutdown request times out, then the exit code should be 5.
engine --shutdown.timeout-code=5
The Engine loads various models to process and transcribe audio. A model consists of a directory with various required and optional files. For full details on the model directory and files, including how to use your own models with the Engine, see the Model Formats and Layout documentation.
The models are typically named with a two-letter language code (ISO 639-1), often with dialect and other identifying information. Each model supports a single sampling rate. The sampling rate is a property of digital audio that describes how many samples are captured per second.
The models can typically be found under the ./models/
directory, though you can specify a different location at Engine
startup with the --models.path
flag.
Mod9 provides both speech recognition models and natural language processing models (for punctuation, capitalization, number formatting, etc.).
The en-US_phone
model is optimized for North American English
telephony data. It does best in low noise environments and with
accents common to North America. It supports audio sampled at 8kHz.
The en_video
model is optimized for "global" English data. It
performs better than en-US_phone
on high-noise audio and accented
English. This model supports 16kHz audio. The aliases en and english can also be used to refer to this model.
The es-US_phone
model is optimized for Spanish telephony data as
spoken in the United States. It does best in low noise environments
and with accents common to the United States. It supports audio
sampled at 8kHz. The aliases es and spanish can also be used to refer to this model.
The mod9/en-US_phone-smaller
model is similar to en-US_phone
, but takes
up significantly less disk space and memory. It is suitable for
debugging or in environments where less memory is available. It is
slightly less accurate than the en-US_phone
model, and will
generally use more CPU than en-US_phone
at the same settings.
The mod9/en-US_phone-ivector
model is similar to en-US_phone
, but is
trained with the use of speaker-adaptive i-vectors. This can perform
slightly better than en-US_phone
in situations with long audio segments
that are strictly a single speaker; however, it is generally less robust
than en-US_phone
for more diverse environments in which the audio may
have multiple speakers, short duration, long silence, or non-speech sounds.
This model produces optimal results when scored on the well-known research benchmark known as "Switchboard" (i.e. a subset of the NIST 2000 Hub5 evaluation of conversational telephone speech).
Note that the model is "fair" in the sense that it was not trained on any of
the test data; some other commercial ASR vendors have been suspected
of "cheating" by training their systems on this test data.
The en
model is used for English data. It includes punctuation,
capitalization, number formatting, and disfluency removal.
The es
model is used for Spanish data. It includes punctuation and
disfluency removal.
If you do not specify otherwise, the Engine will load the en
speech recognition model and the en
natural language processing
model at startup. If you want to load other models, use the --models.asr
and --models.nlp
command line flags with comma-separated model names.
For example, this will instruct the Engine to load the en-US_phone
and es
speech recognition models and the en
and es
natural
language processing models when starting up.
engine --models.asr=en-US_phone,es --models.nlp=en,es
There is also a grapheme-to-phoneme (G2P) model, which is loaded similarly with --models.g2p and defaults to mod9/en.
You can also start the Engine with no models loaded. This is
generally only useful if you allow clients to load/unload models,
which requires passing the --models.mutable
flag. See section
Modifying loaded models, below.
engine --models.mutable --models.asr= --models.nlp= --models.g2p=
The configurations described above use default settings for a large number of options to the Engine. The following sections describe options to the Engine that you may want to override.
By default, the Engine will listen on port 9900. You can override the port by adding --port=<NUMBER> to the command line. For example, to load the default English models and use port 16000, you would start the Engine as follows:
engine --port=16000
Note that ports under 1024
require root privileges.
As a special case, you can also use port 0
to tell the server to
pick an unused port at random. Note that you must communicate
whichever port the server selects to any clients that want to connect
to the Engine. To determine which port the server chose, you can
examine the log file for e.g. the line:
Ready to accept requests on port 35647
See section Log files below for more information about log files.
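For example, if file-based logging is enabled (see Log files below) with the log directory used in examples later in this document, the chosen port can be extracted from the log:
grep 'Ready to accept requests on port' /var/opt/mod9-asr/log/engine.log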
The Engine can support multiple simultaneous connections from
clients. By default, there is no limit to the number of concurrent
requests the Engine handles. Since there is a small amount of overhead
associated with additional requests, we provide the ability to limit
the total number of simultaneous connections by specifying
--limit.requests=<NUMBER>
on the command line. For example, to limit
the Engine to 40 simultaneous connections, start the Engine with:
engine --limit.requests=40
Clients that attempt to connect after this limit is reached will receive an error message and the command will not run. Note that simple status queries are exempt from the limit.
An Engine that allows for ASR model modification (i.e. --models.mutable
is set)
allows for add-words
requests. Each request has a memory footprint that
can blow up if there are too many spurious add-words
requests. To limit
the total number of add-words
requests, start the Engine with:
engine --limit.requests.mutable=100
We recommend that clients add multiple words in a single request rather than spreading each word across a different request.
This limit also applies for add-grammar requests, but not for load-model or unload-model.
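Following the recommendation above, a single add-words request can carry several words at once and counts only once against this limit. This is a sketch: the second pronunciation below is purely illustrative, and the full request format is described in the TCP reference documentation.
# Requires the Engine to be started with --models.mutable.
echo '{"command":"add-words", "asr-model": "en", "words": [{"word": "Nowitzki", "phones": "N OW V IH T S K IY"}, {"word": "Mavericks", "phones": "M AE V ER IH K S"}]}' | nc $HOST $PORT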
The Engine handles clients' requests concurrently. Each request is handled by one or more threads.
The operating system in the Engine's environment schedules threads to run on the available CPU cores. Although modern operating systems do a very good job handling large workloads even when the number of threads is larger than the number of cores, there is both processing time and memory overhead with each additional thread. As a rule of thumb, expect to use about 25 MB of memory per additional thread, though it will vary depending on the ASR model, settings, and quality of the audio.
Because of this overhead, it is useful to limit the total number of
threads the Engine can run concurrently. By default, the Engine runs
with no such limit, but this can be set by using the
--limit.threads
flag on the command line. For example, to limit
the Engine to 100 threads, start the Engine with:
engine --limit.threads=100
A client that attempts to send a request that would exceed this limit will receive an error message, and the request will not be processed. Note that simple status queries are exempt from the limit.
Batch-mode recognition requests allow a single client to request multiple threads to process audio in parallel. It is possible to limit the number of concurrent threads allowed to any individual request.
engine --limit.threads-per-request=20
By default, the limit on the number of threads per request is equal to the number of CPUs available. Batch-mode requests are subject to both the individual batch request threads limit and the overall Engine thread limit.
The Engine can be run with a hard memory limit.
engine --limit.memory-gib=20
The limit specified is the memory usage at which the Engine will reject subsequent requests. For example, if the memory limit is set to 20GiB (gibibytes) and the Engine's memory usage is 20.5GiB, it will reject any subsequent client requests until the memory usage goes under 20GiB. If the memory usage is 19GiB, the Engine will still accept incoming requests. Note that simple status queries are exempt from the limit.
The Engine can also be configured with a soft memory limit.
engine --determinize.max-mem=50000000
This option may be familiar to Kaldi ASR experts, and affects the per-thread memory usage due to lattice determinization.
The Engine's default is higher than the default for
DeterminizeLatticePhonePrunedOptions.max_mem
in the Kaldi library,
since this tends to provide better representation of transcript alternatives.
Setting this option to a non-positive integer will disable the memory limit; this is not generally advisable since worst-case scenarios are difficult to characterize and can potentially use several gigabytes of memory per thread.
The --determinize.max-mem
limit is considered "soft" because it is enforced
when a thread is actually allocating some greater multiple of that memory limit.
In practice, this was seen to be up to 10x higher than the specified limit.
Lattice determinization can be retried when the soft limit is reached,
with each retry attempt using a successively halved lattice pruning beam.
The number of determinization tries can be set with the --determinize.max-tries
option, which defaults to 2 (i.e. 1 retry).
When the number of determinization tries is exceeded, the Engine will not fail
-- instead its "fallback" strategy is to use a linear 1-best lattice that preserves the transcript,
but has no other alternatives.
Setting this option to 0 will effectively disable lattice determinization, always using the 1-best linear lattice:
engine --determinize.max-tries=0
This can be helpful to operators who are conservative about limiting memory usage, or aggressively maximizing speed.
By default, shutdown
requests from clients are not allowed.
The Engine must be run with shutdown.allowed
in order to accept shutdown
requests.
engine --shutdown.allowed
If shutdown
requests are allowed for the Engine, then the exit code when the Engine
times out during a shutdown request can be configured. If shutdown
requests are not
allowed, then this setting is ignored.
engine --shutdown.allowed --shutdown.timeout-code=1
When the Engine receives a SIGTERM signal from the operating system, it handles it
gracefully by finishing existing requests and then shutting down. The Engine does
not have to be run with --shutdown.allowed
to use this setting. The exit code when
the Engine terminates with a SIGTERM can be configured as follows:
engine --shutdown.sigterm-code=15
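For example, either of the following will deliver a SIGTERM and trigger the same graceful shutdown; this is a sketch assuming a container named engine (as in the Advanced Docker operation section below) or a tarball-installed engine process:
# docker stop sends SIGTERM to the container's main process, then SIGKILL after the grace period.
docker stop --time=60 engine
# For a tarball installation, send SIGTERM to the Engine process directly.
kill -TERM $(pidof engine)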
Some Engine commands modify the Engine's state, and may affect the
results of other client requests. For example, a bias-words
request will modify an ASR model and affect the recognition output for
any other requests that may be using the modified model.
Because of this, the Engine by default will disallow add-words
and bias-words
commands to protect against untrusted clients.
The Engine must be run with models.mutable
set in order for these commands to be allowed.
engine --models.mutable
The Engine can index the words in an ASR model's vocabulary, which speeds up bias-words
commands.
engine --models.mutable --models.indexed
The --models.indexed
flag will instruct the Engine to create an index for each ASR model when it is loaded.
Constructing the index introduces a trade-off between memory and speed.
For example, an indexed single-word bias-words
request can take less than 0.01 second for a relatively large model,
and about 1-2 seconds without the index.
However, the index itself will take up a few hundred megabytes of memory.
The --models.indexed
flag cannot be set if the Engine does not have --models.mutable
set.
The Engine's configurations and status can be checked using a get-info
command.
# The Engine will respond with information about its configuration and current usage.
echo '{"command":"get-info"}' | nc $HOST $PORT
In addition to writing messages to the console, the Engine can also log messages to a file. You can control where the file is written and how much information is reported by setting various command line arguments to the Engine.
By default, log files are not generated; they are only printed to the
console. To enable log file generation, you must specify the directory
where the Engine should write logs by passing --log.directory
to the
Engine. This will cause the Engine to write logs to that directory in
the file engine.log
. If this file already exists, the new logs will be appended to the file.
Note that the Engine must have permission to read and write to the
directory you provide with --log.directory
. If you are running
within the mod9/asr
Docker image that packages the Engine software in a
Linux filesystem, then a directory at /var/opt/mod9-asr/log
has already
been created with suitable permissions to enable file-based logging.
See Advanced Docker operation below for
permission issues if you would like to log to another directory, such as
mounted from a Docker host filesystem.
The log level controls how much information is reported by the logging system. The levels are, in order: debug, info, warning, error, fatal. As the log level goes from debug to fatal, fewer messages will be reported.
The default log level for both messages printed to the console and to the log file is info. You can change the log level by passing --log.level-console and/or --log.level-file when you start the Engine.
In order to control the amount of disk space logs occupy, the Engine
will automatically perform log rotations followed by optional log deletions.
In a log rotation, the current log file (engine.log
) is closed, renamed to engine_<TIMESTAMP>.log
,
and a fresh engine.log
is opened to receive subsequent log messages.
Log rotations occur at midnight, at Engine shutdown, or when the size of the current log file exceeds a limit.
--log.max-file-size
If the size of the current log file exceeds this limit, the Engine will perform a log rotation. Defaults to 10485760 (10MB).
--log.max-dir-size
If the size of the log directory is above this limit during a log rotation, the oldest log files will be deleted until the directory size is below this limit. Defaults to 104857600 (100MB).
The following is an example of starting the Engine with various logging options:
engine --log.directory=/var/opt/mod9-asr/log \
--log.level-console=warning \
--log.max-file-size=104857600
The format of the log file is designed to be easy both for a human to read and a machine to parse. Each line contains a single message from the Engine consisting of a set of tags followed by a single tab followed by a human-readable message. The tags are surrounded by brackets and separated by a space. An example of output to the log file is:
[2021-03-11T11:40...] [main:info] Ready to accept requests on port 9900
[2021-03-11T11:41...] [0000:info] Accepted request 0000 from 127.0.0.1
[2021-03-11T11:41...] [0000:info] Parsed JSON request: {"command":"get-info"}
[2021-03-11T11:41...] [0000:info] Finished processing request 0000 at 127.0.0.1
[2021-03-11T11:41...] [0001:info] Accepted request 0001 from 127.0.0.1
[2021-03-11T11:41...] [0001:info] Parsed JSON request: {"command":"get-models-info"}
[2021-03-11T11:41...] [0001:info] Finished processing request 0001 at 127.0.0.1
Note that these examples have been paraphrased to fit on a line.
The tags are a date/time stamp in ISO 8601 extended format, a request ID represented by an incrementing number or 'main', and the log level the message was logged at. Output to console is similar, but doesn't include the time stamp.
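Since the tags and the message are separated by a single tab, the log is easy to slice with standard command-line tools; for example (assuming the log directory used elsewhere in this document):
# Print only the human-readable messages (the text after the tab).
cut -f2 /var/opt/mod9-asr/log/engine.log
# Show every message logged for request 0001.
grep '\[0001:' /var/opt/mod9-asr/log/engine.log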
The Engine supports Customization requests such as add-words
and
bias-words
that modify a model in the Engine. It can be useful to ensure that such
requests are sent before other client requests are processed. To facilitate this, the
Engine takes the option --initial-requests-file
that must point to a file containing one
JSON object per line (known as JSON Lines format). Each line is
sent as a client request to the Engine sequentially right after any models specified on the command line have
finished loading. See TCP for a description of the JSON requests the Engine
supports.
Note that requests from the initialization file follow the same rules as requests from a
client. For example, --models.mutable
must be specified if a request modifies a model
(e.g. add-words
).
As an example, if you wanted to add the name Nowitzki to the en model and boost the words hoop and basketball, you could put the following requests into a file named startup.jsonl. Note that each JSON request must be on a single line, and that the file format is actually a slight extension of JSONL.
// NOTE: this file format is JSONL (as described at https://jsonlines.org) with minor extensions:
// - Lines that start with // as the first non-whitespace characters are ignored as comments.
// - Lines that have only whitespace are ignored.
{"command": "add-words", "asr-model": "en", "words": [{"word": "Nowitzki", "phones": "N OW V IH T S K IY"}]}
{"command": "bias-words", "asr-model": "en", "words": [{"word": "basketball", "bias": 2.3}, {"word": "hoop", "bias": 3}]}
Then you could start the Engine with:
engine --models.mutable --models.asr=en --initial-requests-file=startup.jsonl
When the Engine starts accepting client recognition requests, the word Nowitzki will be in vocabulary, and the words hoop and basketball will have been boosted.
Another use of the initial requests file is to run speech recognition and write results to
local files. This is enabled using the recognize
command with the request options
audio-uri
and response-uri
in an initial requests file. Note that these
special options to the recognize
command are only available from an initial requests
file and not from a client request, and their values must start with the file://
scheme.
To demonstrate, start by downloading hi.wav and swbd.wav:
curl -sLO mod9.io/hi.wav
curl -sLO mod9.io/swbd.wav
Then write the following to an initial requests file recognize.jsonl:
{"command": "recognize", "audio-uri": "file://swbd.wav", "response-uri": "file://swbd.jsonl"}
{"command": "recognize", "audio-uri": "file:///tmp/hi.wav", "response-uri": "file:///tmp/hi.jsonl"}
{"command": "shutdown"}
Finally, run the Engine:
engine --shutdown.allowed --models.asr=en --initial-requests-file=recognize.jsonl
This will run recognition on swbd.wav and write results to swbd.jsonl, both paths relative to the working directory from which the Engine was run. Then it runs recognition on /tmp/hi.wav and writes results to /tmp/hi.jsonl, where the files are specified with absolute paths. Finally, it will shut down the Engine. Note that --shutdown.allowed was passed to enable the shutdown request to proceed.
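The response files can then be inspected with standard tools; the exact fields of each JSON response are described in the TCP reference documentation.
# Each line of swbd.jsonl is one JSON response message from the Engine.
cat swbd.jsonl
# If jq is installed, pretty-print the responses for easier reading.
jq . swbd.jsonl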
The method of starting the Engine in a Docker container described in section Downloading and Starting the Engine Docker Container is useful for debugging, but not typically how you would use Docker in a production environment. Normally, you would start the Docker container and the Engine at the same time.
Also, using --network=host
is generally more permissive than necessary.
This section provides some general guidance on starting a Docker container from the command line to run the Mod9 Engine. For full details on Docker, see their documentation at https://docs.docker.com.
An example of starting the Engine in a Docker container from the command line is:
docker run -d \
--name=engine \
-p 9900:9900 \
--user=$(id -u) \
-v /tmp:/logs:rw \
mod9/asr \
engine --log.directory=/logs
| Arguments | Description |
|---|---|
| -d | Causes Docker to run in "daemon" mode, where the container starts up but the docker command itself returns immediately. |
| --name=engine | Gives the container the name engine, which makes it easier to refer to it in later commands. |
| -p 9900:9900 | Forwards port 9900 on the host to port 9900 in the container. |
| --user=$(id -u) | Run the process in the Docker container as the current user on the host. |
| -v /tmp:/logs:rw | Allows access to log files on the host in /tmp, which must exist and be writable by the Engine. See Docker Permission Issues below for more on these requirements. |
| mod9/asr | The name of the Docker image. Optionally append a tag, such as :latest. |
All the rest of the arguments are the command that gets run in the container; see section Starting the Engine, above.
Note that, as written, the Docker container will not be removed when the Engine stops, whether because of an error or through normal termination. This allows easier post-mortem debugging, but you should remember to clean up any stale runs using e.g. docker rm -fv engine.
By default, the Mod9 Docker container will run as user mod9 and user id 50000. This can be an issue if you specify --log.directory
as a
remotely mounted volume and the directory uses stringent permissions
since the Engine must be able to write the log files. It is possible
to run the Engine with a different user id (e.g. to match the
ownership of the directory to which you are writing log files) by
passing e.g. --user=$(id -u)
to the docker run
command you use to
start the Docker container. Using --user=$(id -u)
will run the
Engine with the same user id as the user who started the Engine
container. See the Docker documentation at https://docs.docker.com
for more information, including other ways of handling user identity
within Docker.
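As a concrete sketch (the host directory here is just an example), you can either match the container's user to your own, as in the table above, or keep the default user and grant uid 50000 write access to the mounted directory:
# Create a host directory owned by the container's default user (uid 50000),
# then mount it and enable file-based logging.
sudo mkdir -p /srv/mod9-asr-logs
sudo chown 50000 /srv/mod9-asr-logs
docker run -d --name=engine -p 9900:9900 \
    -v /srv/mod9-asr-logs:/logs:rw \
    mod9/asr \
    engine --log.directory=/logs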
It's important to carefully track the amount of memory used by the Engine. If the Engine runs out of memory, it can crash unexpectedly and drop all connections.
To monitor the memory usage of the Engine's process, you can use a get-info
request.
echo '{"command":"get-info"}' | nc $HOST $PORT
The Engine will respond with its memory usage and limit. The memory limit is the minimum of the operator configured limit, the host's available memory, and the container's memory limit if the Engine is run in a Docker container.
...
"memory_gibibytes": {
"available": 125,
"limit": 128,
"used": 3
},
...
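For scripted monitoring, the relevant fields can be extracted from the get-info response; for example, assuming jq is installed:
echo '{"command":"get-info"}' | nc $HOST $PORT | jq 'select(.memory_gibibytes) | .memory_gibibytes'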
To monitor the memory usage of the Engine Docker container, use the docker stats
command.
docker stats --no-stream --format "{{.MemUsage}}" $CONTAINER_NAME
The command outputs the container's memory usage and limit.
3.137GiB / 128GiB
Each recognition thread run by the Engine might use up to 25MB of memory. As each thread finishes, it will deallocate and free its used memory. However, the Engine's memory usage may still appear to have increased even after memory has been freed. This is due to the fact that freed memory may not always be returned to the operating system and memory may become fragmented.
Because of this, memory usage of the Engine will gradually increase with each request until it
reaches a plateau. The memory usage at the plateau depends on the models and the type of audio
requests sent to the Engine.
With the default "en"
ASR model and "en"
NLP model, the Engine's baseline memory usage
(when it has loaded its models, but is not yet processing any requests) starts at 2.4GiB.
It will rise with each recognition request until stabilizing at under 3GiB.
We aim to keep the Engine's baseline memory plateau below 4GiB for a single Mod9-packaged ASR model.
Note that some 3rd-party ASR models from the Kaldi community can sometimes use considerably more memory.
In general, we recommend running the Engine with at least 8GiB of RAM. This allows more headroom for potential worst-case memory usage that can happen for certain requests (particularly: long utterances with alternatives requested).
As a rule of thumb, the Engine can concurrently process 4-5 real-time streams per CPU.
For pre-recorded audio, each recognition thread will use up all of the CPU it's running on.
NOTE: The Engine will not immediately read the entire audio file into memory. It throttles its reads so that the amount of audio read is only ever slightly ahead of the amount it has already processed or is currently processing.
If there are no other processes running on the same machine, the Engine will be most efficient when the total number of recognition threads is equal to the number of CPUs available. For example, if the Engine is running in a container with 16 CPUs, it will have its maximum throughput when the total number of recognition threads across all batch and non-batch recognitions is 16.
If the Engine over-allocates threads, running more threads than available CPUs, the throughput will not improve, since the threads would be contending for the same limited CPU resources. However, over-allocating threads does not cause much performance degradation, since the modern Linux scheduler can context switch very efficiently.
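As a sketch of one way to apply this guidance (the CPU count here is illustrative), the container can be pinned to a fixed number of CPUs and the Engine's thread limit set to match:
# Give the container 16 CPUs and cap recognition threads accordingly.
docker run -d --name=engine -p 9900:9900 --cpus=16 mod9/asr \
    engine --limit.threads=16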
Here is a curated summary of significant changes that are relevant to ASR Engine operators (cf. client release notes):
-
1.9.5 (2024 Mar 22):
- Upgrade various software dependencies, including Intel MKL to version 2024.0.
- Optimize compilation, particularly for --mkl.data-type=bf16, for improved speed.
- Downgrade warning messages about --mkl.data-type=bf16, which is now proven to work well.
-
1.9.4 (2023 May 13):
- Upgrade Intel MKL to version 2023.1.
-
1.9.3 (2022 May 03):
- Log info-level message if the Engine was built without an expiration date.
- Improve public documentation for build instructions and code structure.
- Improve private documentation (i.e. code comments) for source code maintainers.
-
1.9.2 (2022 Apr 25):
- Build instructions and code structure documentation added for source code maintainers.
-
1.9.1 (2022 Mar 28):
- No changes relevant to operator deployment.
-
1.9.0 (2022 Mar 21):
- Update the mod9/en-US_phone-ivector ASR model to use a language model trained exclusively on the Fisher and Switchboard training sets, as per standard academic research practice.
- Fixed a bug that could result in incorrectly counting the number of allocated threads.
-
1.8.1 (2022 Feb 18):
- Improve the stability of the MKL integer optimizations, especially for --mkl.data-type=int16.
-
1.8.0 (2022 Feb 16):
- Setting the --mkl.data-type=int16 option can enable the speed optimization introduced in version 1.7.0.
  - This could theoretically become numerically unstable; in practice, it appears to be well-tuned.
  - Pending further testing, some future version of the Engine may enable this optimization by default.
- Setting --mkl.data-type=int8 enables an experimental optimization that is not currently recommended.
- Setting --mkl.data-type=bf16 enables a stable optimization with lower-precision floating point.
  - On older (w.r.t. 2024) processors, 16-bit Brain floating point will be simulated slowly in software.
  - This leverages the AVX512_E3 instruction set extension, only available on Intel's newest hardware.
  - On next-generation processors (AVX512_E4 with AMX extensions), speed will be even faster!
- Messages logged upon startup should clarify which MKL instruction sets are supported and enabled.
- The --mkl.max-cache-gib option limits the additional memory that is cached for the optimized matrices. The entire cache is cleared when it reaches the maximum size, or when any ASR model is unloaded.
- Enabling --mkl.reproducibility is equivalent to setting the MKL_CBWR=COMPATIBLE environment variable. This only guarantees reproducibility if --mkl.data-type=f32 is set (currently the default).
- The environment variables ASR_ENGINE_MKL_INT16_* are still supported, but will be deprecated at version 2.
-
1.7.0 (2022 Feb 02):
- Enable speed optimization by setting the ASR_ENGINE_MKL_INT16_CACHE_GIB environment variable at runtime.
- Related defaults: ASR_ENGINE_MKL_INT16_RANGE_SCALE=0.1 and ASR_ENGINE_MKL_INT16_MIN_COLUMNS=64.
-
1.6.1 (2022 Jan 06):
- The transcript-alternatives request option is now limited to a maximum of 1000, mitigating potential worst-case memory usage. Clients should be directed to consider phrase-alternatives instead.
- The en-US_phone-benchmark model has been renamed as mod9/en-US_phone-ivector.
-
1.6.0 (2021 Dec 16):
- Add --initial-requests-file option to process local requests after models are loaded.
- Add --models.g2p option to enable automatically generated pronunciations.
- Add --limit.requests.mutable, replacing the now-deprecated --limit.requests.add-words option. It also applies to add-grammar, and the deprecated option name is retained for backwards-compatibility until version 2.
-
1.5.0 (2021 Nov 30):
- Add --limit.throttle-factor option to determine when the server may be overloaded.
- Add --limit.throttle-threads option. When clients request batch-threads and the server may be overloaded, allocate up to this many threads in order to mitigate further overloading.
-
1.4.0 (2021 Nov 04):
- Improve engine binary for better scalability on multi-socket CPU systems.
- Rename the alternative engine-no-tcmalloc binary as engine-glibc-malloc.
-
1.3.0 (2021 Oct 27):
- Fix bug affecting memory limit determination on newer Linux systems using cgroup v2.
-
1.2.0 (2021 Oct 11):
- Memory usage is now carefully checked prior to loading models:
  - Estimated memory requirements can be reported in the model package's metadata.json file. All models packaged by Mod9 (including those from Vosk, TUDA, Zamia, and kaldi-asr.org) will have this information.
  - If a model's requirements are not provided, the Engine will conservatively estimate bounds at runtime.
- Add --limit.memory-headroom-gib option, which can mitigate risk of memory exhaustion:
  - Prevents models from loading, or new requests from processing, if available memory is low.
  - The default is 0 for backwards compatibility, but may be changed in the next major release.
- Add options that limit per-request memory usage and connection timeouts:
  - --limit.read-{line,wav-header}-kib: maximum memory used to parse request options or WAV header.
  - --limit.read-stream-kib: application-level (not OS-level) buffering of data prior to ASR processing.
  - --limit.read-line-timeout: requests fail if options line is not received shortly after connection.
  - --limit.read-stream-timeout: requests fail if no new data is read within this period.
- Minor feature improvements and bugs fixed:
  - Acoustic model compilation and optimization cached at load-time, not on-demand for first request.
  - An alternative engine-no-tcmalloc binary is included in the default package, not separately distributed.
-
1.1.0 (2021 Aug 11):
- The TCMalloc library is now statically linked in the Engine, facilitating non-Docker deployment.
- Add --determinize.max-mem and --determinize.max-tries, advanced options for lattice determinization.
- The default value for --models.path is now "/opt/mod9-asr/models" rather than "./models".
- The Engine now exits immediately if --models.path is not found at startup, rather than when loading models.
-
1.0.1 (2021 Jul 31):
- Fix bug in which maximum number of concurrent requests was limited.
-
1.0.0 (2021 Jul 15):
- This first major version release represents the functionality and interface intended for long-term compatibility.
- Deprecate specification of models as positional arguments to the engine binary:
  - The ASR and NLP models are now decoupled, and can be loaded/unloaded separately.
  - The engine loads comma-delimited lists of names with the --models.asr and --models.nlp options.
-
0.9.1 (2021 May 20):
- Faster NLP models that occupy less memory - 3x improvement.
-
0.9.0 (2021 May 05):
- Move /opt/mod9 to /opt/mod9-asr inside the Docker image.
- Engine binary is now run as engine instead of ./asr-engine; it is in /opt/mod9-asr/bin, which is on $PATH.
- Support for Kaldi models using i-vectors; cf. ivector-silence-weighting client request option.
- Logging lines are now formatted with an 8-character UUID prefix and sequential request number.
If you have questions, please email us at support@mod9.com; we'd love to help you out.
You may also call (HUH) ASK-ARLO at any time to speak with A Real Live Operator.
©2019-2023 Mod9 Technologies (Version 1.9.5)