The Mod9 ASR Engine consists of a Linux server, along with compatible models and support software, that enables clients to send commands and audio over a network and receive the written transcript of what was said in the audio.
This document describes how to run the Engine from the perspective of an operator who controls the server.
The accompanying TCP reference documentation describes how clients may communicate with the Engine, including: specifying various commands and request options, sending audio data, and interpreting various response formats.
- Quick start
- System requirements
- Obtaining the Engine
- Starting the Engine
- Shutting down the Engine
- Models
- Configuring the Engine
- Advanced Docker operation
- Engine performance
- Release notes
- Support contacts
This section assumes that you have opted to run the Engine in a Docker container.
- Download the Docker image; it's several gigabytes, so this may be slow depending on your Internet connection.
  docker pull mod9/asr
  Note that referencing the image as mod9/asr is shorthand for mod9/asr:latest.
- Start the container as a foreground process in the terminal:
  docker run -it --rm -p 9900:9900 mod9/asr
  To stop this one-off Engine container, simply type CTRL+C:
  - the -it flag allows your keyboard to interrupt the foreground process;
  - the --rm flag will remove (i.e. clean up) the stopped Docker container;
  - by default the Engine listens on the container's port 9900; the -p flag maps this to your host machine's port 9900.
This "quick start" setup is suitable for local experimentation and debugging.
You can now have local clients communicate with this Engine server by setting localhost and 9900 respectively for:
- the HOST and PORT variables in the reference documentation examples for ad-hoc command-line usage with nc;
- the ASR_ENGINE_HOST and ASR_ENGINE_PORT environment variables with the Python SDK or REST API.
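For example, a quick connectivity check from the host can use the get-info command described later in this document; this is a sketch that assumes the quick-start port mapping above and that nc is installed:
# Values for the reference documentation examples; adjust if you changed the -p mapping.
HOST=localhost
PORT=9900
echo '{"command":"get-info"}' | nc $HOST $PORT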
The remainder of this documentation describes further details of the Engine's operation, as well as considerations for an operator when deploying the server in a production environment.
The minimum system requirements for running the ASR Engine with default settings are 8GiB of memory, an x86-compatible chip, and Linux from kernel 4.0 onward. We recommend a modern Intel chip (e.g. Xeon) for best performance.
The memory requirements can grow depending on how the Engine is started and how clients call the services the Engine provides. See section Configuring the Engine, below.
Note also that Mod9 can provide other versions of the Engine that trade off accuracy for speed and/or memory.
The ASR Engine is licensed by Mod9 Technologies. For more information about licensing, contact sales@mod9.com.
The preferred distribution mechanism is through a Docker image containing the Engine and all the required support files. We can also provide the Engine as a tarball containing a Linux executable (compatible with most Linux distributions) plus required support files. The following two subsections describe each of these processes in more detail.
Docker is a system for running software in environments known as containers. Mod9 Technologies provides images that can be used to create containers that are run by the Docker software, which can be installed on most operating systems.
To download the Docker image, run the command:
docker pull mod9/asr
This will download the latest version of the image; for a specific version, specify mod9/asr:1.9.5, for example.
Since the image contains the speech recognition models, which can be quite large, this download can take a few minutes.
Next, you can start a container. This does not start the Engine itself, but rather launches a new docker container of which the Engine is a part. You can think of this as a virtual machine that isolates the Engine from the physical machine.
docker run -it --rm --network=host mod9/asr:latest bash
This command starts the container, and allows you to connect to it using any port.
It opens a bash shell where you can enter commands to run in the container.
You can stop the container by exiting the shell.
Note that this method of starting a container allows you to use the same commands to start the Engine regardless of whether the Engine is running in a container or from the tarball. However, we do not recommend this method for production; see Advanced Docker operation below.
If you elected to receive the Engine as a tarball, you should have
access to the file mod9-asr-engine_linux.tar.gz
on the machine where you want
to run the Engine. Next, select a directory to store the Engine and
support files. It can be anything, but we recommend using /opt/mod9-asr. Copy mod9-asr-engine_linux.tar.gz to that directory.
Next, to extract the Engine and support files, run:
cd /opt/mod9-asr
tar xzf mod9-asr-engine_linux.tar.gz
This should create the executable /opt/mod9-asr/bin/engine and support files, including /opt/mod9-asr/models.
Whether you are running in a Docker container or from the tarball, starting the Engine follows the same pattern. From the command line and in the directory where the executable engine is installed (typically /opt/mod9-asr), run the command:
engine
This will start the Engine with the default English model (en) and listen on port 9900. See sections Models and Port selection below for more information.
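For a tarball installation, the executable is under bin/; as a sketch (assuming the recommended /opt/mod9-asr location and that this bin/ directory has not been added to your PATH), the equivalent invocation is:
cd /opt/mod9-asr
./bin/engine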
The Engine accepts a client shutdown
command, which directs the Engine to stop accepting new recognition requests
and begin a "graceful" shutdown, in which it waits for all active requests to complete before exiting.
# The Engine will wait for all active requests to finish and exit with code 0.
$ echo '{"command":"shutdown"}' | nc $HOST $PORT
{"status":"completed","warning":"Shutdown requested. The Engine will no longer accept new recognition requests."}
The shutdown
command accepts an optional timeout
field, which directs the Engine
to forcefully shut down after the timeout duration even if there are active requests that have not completed.
If all active requests have completed before the timeout duration, the Engine will immediately shut down.
A negative timeout
value will direct the Engine to wait until all active requests complete before shutting down.
# The Engine will wait up to 10.2 seconds for all active requests to finish. Otherwise, it will forcefully and immediately exit.
$ echo '{"command":"shutdown", "timeout":10.2}' | nc $HOST $PORT
{"status":"completed","warning":"Shutdown requested. The Engine will no longer accept new recognition requests."}
By default, the Engine will shut down gracefully with an exit code of 0 if all of the active requests have completed, and it will shut down forcefully with an exit code of 1 if the timeout is reached and other requests are still active. The exit code for a forceful shutdown can be configured when starting the Engine.
# If a shutdown request times out, then the exit code should be 5.
engine --shutdown.timeout-code=5
The Engine loads various models to process and transcribe audio. A model consists of a directory with various required and optional files. For full details on the model directory and files, including how to use your own models with the Engine, see the Model Formats and Layout documentation.
The models are typically named with a two-letter language code (ISO 639-1), often with dialect and other identifying information. Each model supports a single sampling rate. The sampling rate is a property of digital audio that describes how many samples are captured per second.
The models can typically be found under the ./models/
directory, though you can specify a different location at Engine
startup with the --models.path
flag.
Mod9 provides both speech recognition models and natural language processing models (for punctuation, capitalization, number formatting, etc.).
The en-US_phone
model is optimized for North American English
telephony data. It does best in low noise environments and with
accents common to North America. It supports audio sampled at 8kHz.
The en_video
model is optimized for "global" English data. It
performs better than en-US_phone
on high-noise audio and accented
English. This model supports 16kHz audio. The aliases en and english can also be used to refer to this model.
The es-US_phone
model is optimized for Spanish telephony data as
spoken in the United States. It does best in low noise environments
and with accents common to the United States. It supports audio
sampled at 8kHz. The aliases es and spanish can also be used to refer to this model.
The mod9/en-US_phone-smaller
model is similar to en-US_phone
, but takes
up significantly less disk space and memory. It is suitable for
debugging or in environments where less memory is available. It is
slightly less accurate than the en-US_phone
model, and will
generally use more CPU than en-US_phone
at the same settings.
The mod9/en-US_phone-ivector
model is similar to en-US_phone
, but is
trained with the use of speaker-adaptive i-vectors. This can perform
slightly better than en-US_phone
in situations with long audio segments
that are strictly a single speaker; however, it is generally less robust
than en-US_phone
for more diverse environments in which the audio may
have multiple speakers, short duration, long silence, or non-speech sounds.
This model produces optimal results when scored on the well-known research benchmark known as "Switchboard" (i.e. a subset of the NIST 2000 Hub5 evaluation of conversational telephone speech).
Note that the model is "fair" in the sense that it was not trained on any of
the test data; some other commercial ASR vendors have been suspected
of "cheating" by training their systems on this test data.
The en
model is used for English data. It includes punctuation,
capitalization, number formatting, and disfluency removal.
The es
model is used for Spanish data. It includes punctuation and
disfluency removal.
If you do not specify otherwise, the Engine will load the en
speech recognition model and the en
natural language processing
model at startup. If you want to load other models, use the --models.asr
and --models.nlp
command line flags with comma-separated model names.
For example, this will instruct the Engine to load the en-US_phone
and es
speech recognition models and the en
and es
natural
language processing models when starting up.
engine --models.asr=en-US_phone,es --models.nlp=en,es
There is also a grapheme-to-phoneme (G2P) model, which is loaded similarly with --models.g2p and defaults to mod9/en.
You can also start the Engine with no models loaded. This is
generally only useful if you allow clients to load/unload models,
which requires passing the --models.mutable
flag. See section
Modifying loaded models, below.
engine --models.mutable --models.asr= --models.nlp= --models.g2p=
The configurations described above use default settings for a large number of options to the Engine. The following sections describe options to the Engine that you may want to override.
By default, the Engine will listen on port 9900. You can override the port by adding --port=<NUMBER> to the command line. For example, to load the default English models and use port 16000, you would start the Engine as follows:
engine --port=16000
Note that ports under 1024
require root privileges.
As a special case, you can also use port 0
to tell the server to
pick an unused port at random. Note that you must communicate
whichever port the server selects to any clients that want to connect
to the Engine. To determine which port the server chose, you can
examine the log file for e.g. the line:
Ready to accept requests on port 35647
See section Log files below for more information about log files.
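For example, if file-based logging is enabled (see Log files below) with the log directory used in examples later in this document, the chosen port can be extracted from the log:
grep 'Ready to accept requests on port' /var/opt/mod9-asr/log/engine.log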
The Engine can support multiple simultaneous connections from
clients. By default, there is no limit to the number of concurrent
requests the Engine handles. Since there is a small amount of overhead
associated with additional requests, we provide the ability to limit
the total number of simultaneous connections by specifying
--limit.requests=<NUMBER>
on the command line. For example, to limit
the Engine to 40 simultaneous connections, start the Engine with:
engine --limit.requests=40
Clients that attempt to connect after this limit is reached will receive an error message and the command will not run. Note that simple status queries are exempt from the limit.
An Engine that allows for ASR model modification (i.e. --models.mutable
is set)
allows for add-words
requests. Each request has a memory footprint that
can blow up if there are too many spurious add-words
requests. To limit
the total number of add-words
requests, start the Engine with:
engine --limit.requests.mutable=100
We recommend that clients add multiple words in a single request rather than spreading each word across a different request.
This limit also applies for add-grammar requests, but not for load-model or unload-model.
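Following the recommendation above, a single add-words request can carry several words at once and counts only once against this limit. This is a sketch: the second pronunciation below is purely illustrative, and the full request format is described in the TCP reference documentation.
# Requires the Engine to be started with --models.mutable.
echo '{"command":"add-words", "asr-model": "en", "words": [{"word": "Nowitzki", "phones": "N OW V IH T S K IY"}, {"word": "Mavericks", "phones": "M AE V ER IH K S"}]}' | nc $HOST $PORT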
The Engine handles clients' requests concurrently. Each request is handled by one or more threads.
The operating system in the Engine's environment schedules threads to run on the available CPU cores. Although modern operating systems do a very good job handling large workloads even when the number of threads is larger than the number of cores, there is both processing time and memory overhead with each additional thread. As a rule of thumb, expect to use about 25 MB of memory per additional thread, though it will vary depending on the ASR model, settings, and quality of the audio.
Because of this overhead, it is useful to limit the total number of
threads the Engine can run concurrently. By default, the Engine runs
with no such limit, but this can be set by using the
--limit.threads
flag on the command line. For example, to limit
the Engine to 100 threads, start the Engine with:
engine --limit.threads=100
A client that attempts to send a request that would exceed this limit will receive an error message, and the request will not be processed. Note that simple status queries are exempt from the limit.
Batch-mode recognition requests allow a single client to request multiple threads to process audio in parallel. It is possible to limit the number of concurrent threads allowed to any individual request.
engine --limit.threads-per-request=20
By default, the limit on the number of threads per request is equal to the number of CPUs available. Batch-mode requests are subject to both the individual batch request threads limit and the overall Engine thread limit.
The Engine can be run with a hard memory limit.
engine --limit.memory-gib=20
The limit specified is the memory usage at which the Engine will reject subsequent requests. For example, if the memory limit is set to 20GiB (gibibytes) and the Engine's memory usage is 20.5GiB, it will reject any subsequent client requests until the memory usage goes under 20GiB. If the memory usage is 19GiB, the Engine will still accept incoming requests. Note that simple status queries are exempt from the limit.
The Engine can also be configured with a soft memory limit.
engine --determinize.max-mem=50000000
This option may be familiar to Kaldi ASR experts, and affects the per-thread memory usage due to lattice determinization.
The Engine's default is higher than the default for
DeterminizeLatticePhonePrunedOptions.max_mem
in the Kaldi library,
since this tends to provide better representation of transcript alternatives.
Setting this option to a non-positive integer will disable the memory limit; this is not generally advisable since worst-case scenarios are difficult to characterize and can potentially use several gigabytes of memory per thread.
The --determinize.max-mem
limit is considered "soft" because it is enforced
when a thread is actually allocating some greater multiple of that memory limit.
In practice, this was seen to be up to 10x higher than the specified limit.
Lattice determinization can be retried when the soft limit is reached,
with each retry attempt using a successively halved lattice pruning beam.
The number of determinization tries can be set with the --determinize.max-tries
option, which defaults to 2 (i.e. 1 retry).
When the number of determinization tries is exceeded, the Engine will not fail
-- instead its "fallback" strategy is to use a linear 1-best lattice that preserves the transcript,
but has no other alternatives.
Setting this option to 0 will effectively disable lattice determinization, always using the 1-best linear lattice:
engine --determinize.max-tries=0
This can be helpful to operators who are conservative about limiting memory usage, or aggressively maximizing speed.
By default, shutdown
requests from clients are not allowed.
The Engine must be run with shutdown.allowed
in order to accept shutdown
requests.
engine --shutdown.allowed
If shutdown
requests are allowed for the Engine, then the exit code when the Engine
times out during a shutdown request can be configured. If shutdown
requests are not
allowed, then this setting is ignored.
engine --shutdown.allowed --shutdown.timeout-code=1
When the Engine receives a SIGTERM signal from the operating system, it handles it
gracefully by finishing existing requests and then shutting down. The Engine does
not have to be run with --shutdown.allowed
to use this setting. The exit code when
the Engine terminates with a SIGTERM can be configured as follows:
engine --shutdown.sigterm-code=15
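For example, either of the following will deliver a SIGTERM and trigger the same graceful shutdown; this is a sketch assuming a container named engine (as in the Advanced Docker operation section below) or a tarball-installed engine process:
# docker stop sends SIGTERM to the container's main process, then SIGKILL after the grace period.
docker stop --time=60 engine
# For a tarball installation, send SIGTERM to the Engine process directly.
kill -TERM $(pidof engine)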
Some Engine commands modify the Engine's state, and may affect the
results of other client requests. For example, a bias-words
request will modify an ASR model and affect the recognition output for
any other requests that may be using the modified model.
Because of this, the Engine by default will disallow add-words
and bias-words
commands to protect against untrusted clients.
The Engine must be run with models.mutable
set in order for these commands to be allowed.
engine --models.mutable
The Engine can index the words in an ASR model's vocabulary, which speeds up bias-words
commands.
engine --models.mutable --models.indexed
The --models.indexed
flag will instruct the Engine to create an index for each ASR model when it is loaded.
Constructing the index introduces a trade-off between memory and speed.
For example, an indexed single-word bias-words
request can take less than 0.01 second for a relatively large model,
and about 1-2 seconds without the index.
However, the index itself will take up a few hundred megabytes of memory.
The --models.indexed
flag cannot be set if the Engine does not have --models.mutable
set.
The Engine's configurations and status can be checked using a get-info
command.
# The Engine will respond with information about its configuration and current usage.
echo '{"command":"get-info"}' | nc $HOST $PORT
In addition to writing messages to the console, the Engine can also log messages to a file. You can control where the file is written and how much information is reported by setting various command line arguments to the Engine.
By default, log files are not generated; they are only printed to the
console. To enable log file generation, you must specify the directory
where the Engine should write logs by passing --log.directory
to the
Engine. This will cause the Engine to write logs to that directory in
the file engine.log
. If this file already exists, the new logs will be appended to the file.
Note that the Engine must have permission to read and write to the
directory you provide with --log.directory
. If you are running
within the mod9/asr
Docker image that packages the Engine software in a
Linux filesystem, then a directory at /var/opt/mod9-asr/log
has already
been created with suitable permissions to enable file-based logging.
See Advanced Docker operation below for
permission issues if you would like to log to another directory, such as
mounted from a Docker host filesystem.
The log level controls how much information is reported by the logging system. The levels are, in order: debug, info, warning, error, fatal. As the log level goes from debug to fatal, fewer messages will be reported.
The default log level for both messages printed to the console and to the log file is info. You can change the log level by passing --log.level-console and/or --log.level-file when you start the Engine.
In order to control the amount of disk space logs occupy, the Engine
will automatically perform log rotations followed by optional log deletions.
In a log rotation, the current log file (engine.log
) is closed, renamed to engine_<TIMESTAMP>.log
,
and a fresh engine.log
is opened to receive subsequent log messages.
Log rotations occur at midnight, at Engine shutdown, or when the size of the current log file exceeds a limit.
--log.max-file-size
If the size of the current log file exceeds this limit, the Engine will perform a log rotation. Defaults to 10485760 (10MB).
--log.max-dir-size
If the size of the log directory is above this limit during a log rotation, the oldest log files will be deleted until the directory size is below this limit. Defaults to 104857600 (100MB).
The following is an example of starting the Engine with various logging options:
engine --log.directory=/var/opt/mod9-asr/log \
--log.level-console=warning \
--log.max-file-size=104857600
The format of the log file is designed to be easy both for a human to read and a machine to parse. Each line contains a single message from the Engine consisting of a set of tags followed by a single tab followed by a human-readable message. The tags are surrounded by brackets and separated by a space. An example of output to the log file is:
[2021-03-11T11:40...] [main:info] Ready to accept requests on port 9900
[2021-03-11T11:41...] [0000:info] Accepted request 0000 from 127.0.0.1
[2021-03-11T11:41...] [0000:info] Parsed JSON request: {"command":"get-info"}
[2021-03-11T11:41...] [0000:info] Finished processing request 0000 at 127.0.0.1
[2021-03-11T11:41...] [0001:info] Accepted request 0001 from 127.0.0.1
[2021-03-11T11:41...] [0001:info] Parsed JSON request: {"command":"get-models-info"}
[2021-03-11T11:41...] [0001:info] Finished processing request 0001 at 127.0.0.1
Note that these examples have been paraphrased to fit on a line.
The tags are a date/time stamp in ISO 8601 extended format, a request ID represented by an incrementing number or 'main', and the log level the message was logged at. Output to console is similar, but doesn't include the time stamp.
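Since the tags and the message are separated by a single tab, the log is easy to slice with standard command-line tools; for example (assuming the log directory used elsewhere in this document):
# Print only the human-readable messages (the text after the tab).
cut -f2 /var/opt/mod9-asr/log/engine.log
# Show every message logged for request 0001.
grep '\[0001:' /var/opt/mod9-asr/log/engine.log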
The Engine supports Customization requests such as add-words
and
bias-words
that modify a model in the Engine. It can be useful to ensure that such
requests are sent before other client requests are processed. To facilitate this, the
Engine takes the option --initial-requests-file
that must point to a file containing one
JSON object per line (known as JSON Lines format). Each line is
sent as a client request to the Engine sequentially right after any models specified on the command line have
finished loading. See TCP for a description of the JSON requests the Engine
supports.
Note that requests from the initialization file follow the same rules as requests from a
client. For example, --models.mutable
must be specified if a request modifies a model
(e.g. add-words
).
As an example, if you wanted to add the name Nowitzki to the en model and boost the words hoop and basketball, you could put the following requests into a file named startup.jsonl. Note that each JSON request must be on a single line, and that the file format is actually a slight extension of JSONL.
// NOTE: this file format is JSONL (as described at https://jsonlines.org) with minor extensions:
// - Lines that start with // as the first non-whitespace characters are ignored as comments.
// - Lines that have only whitespace are ignored.
{"command": "add-words", "asr-model": "en", "words": [{"word": "Nowitzki", "phones": "N OW V IH T S K IY"}]}
{"command": "bias-words", "asr-model": "en", "words": [{"word": "basketball", "bias": 2.3}, {"word": "hoop", "bias": 3}]}
Then you could start the Engine with:
engine --models.mutable --models.asr=en --initial-requests-file=startup.jsonl
When the Engine starts accepting client recognition requests, the word Nowitzki will be in vocabulary, and the words hoop and basketball will have been boosted.
Another use of the initial requests file is to run speech recognition and write results to
local files. This is enabled using the recognize
command with the request options
audio-uri
and response-uri
in an initial requests file. Note that these
special options to the recognize
command are only available from an initial requests
file and not from a client request, and their values must start with the file://
scheme.
To demonstrate, start by downloading hi.wav and swbd.wav:
curl -sLO mod9.io/hi.wav
curl -sLO mod9.io/swbd.wav
Then write the following to an initial requests file recognize.jsonl:
{"command": "recognize", "audio-uri": "file://swbd.wav", "response-uri": "file://swbd.jsonl"}
{"command": "recognize", "audio-uri": "file:///tmp/hi.wav", "response-uri": "file:///tmp/hi.jsonl"}
{"command": "shutdown"}
Finally, run the Engine:
engine --shutdown.allowed --models.asr=en --initial-requests-file=recognize.jsonl
This will run recognition on swbd.wav and write results to swbd.jsonl, both paths relative to the working directory from which the Engine was run. Then it runs recognition on /tmp/hi.wav and writes results to /tmp/hi.jsonl, where the files are specified with absolute paths. Finally, it will shut down the Engine. Note that --shutdown.allowed was passed to enable the shutdown request to proceed.
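The response files can then be inspected with standard tools; the exact fields of each JSON response are described in the TCP reference documentation.
# Each line of swbd.jsonl is one JSON response message from the Engine.
cat swbd.jsonl
# If jq is installed, pretty-print the responses for easier reading.
jq . swbd.jsonl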
The method of starting the Engine in a Docker container described in section Downloading and Starting the Engine Docker Container is useful for debugging, but not typically how you would use Docker in a production environment. Normally, you would start the Docker container and the Engine at the same time.
Also, using --network=host
is generally more permissive than necessary.
This section provides some general guidance on starting a Docker container from the command line to run the Mod9 Engine. For full details on Docker, see their documentation at https://docs.docker.com.
An example of starting the Engine in a Docker container from the command line is:
docker run -d \
--name=engine \
-p 9900:9900 \
--user=$(id -u) \
-v /tmp:/logs:rw \
mod9/asr \
engine --log.directory=/logs
| Arguments | Description |
|---|---|
| -d | Causes Docker to run in "daemon" mode, where the container starts up but the docker command itself returns immediately. |
| --name=engine | Gives the container the name engine, which makes it easier to refer to it in later commands. |
| -p 9900:9900 | Forwards port 9900 on the host to port 9900 in the container. |
| --user=$(id -u) | Run the process in the Docker container as the current user on the host. |
| -v /tmp:/logs:rw | Allows access to log files on the host in /tmp, which must exist and be writable by the Engine. See Docker Permission Issues below for more on these requirements. |
| mod9/asr | The name of the Docker image. Optionally append a tag, such as :latest. |
All the rest of the arguments are the command that gets run in the container; see section Starting the Engine, above.
Note that, as written, the Docker container will not be removed when the Engine stops, whether because of an error or through normal termination. This allows easier post-mortem debugging, but you should remember to clean up any stale runs using e.g. docker rm -fv engine.
By default, the Mod9 Docker container will run as user mod9 and user id 50000. This can be an issue if you specify --log.directory
as a
remotely mounted volume and the directory uses stringent permissions
since the Engine must be able to write the log files. It is possible
to run the Engine with a different user id (e.g. to match the
ownership of the directory to which you are writing log files) by
passing e.g. --user=$(id -u)
to the docker run
command you use to
start the Docker container. Using --user=$(id -u)
will run the
Engine with the same user id as the user who started the Engine
container. See the Docker documentation at https://docs.docker.com
for more information, including other ways of handling user identity
within Docker.
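As a concrete sketch (the host directory here is just an example), you can either match the container's user to your own, as in the table above, or keep the default user and grant uid 50000 write access to the mounted directory:
# Create a host directory owned by the container's default user (uid 50000),
# then mount it and enable file-based logging.
sudo mkdir -p /srv/mod9-asr-logs
sudo chown 50000 /srv/mod9-asr-logs
docker run -d --name=engine -p 9900:9900 \
    -v /srv/mod9-asr-logs:/logs:rw \
    mod9/asr \
    engine --log.directory=/logs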
It's important to carefully track the amount of memory used by the Engine. If the Engine runs out of memory, it can crash unexpectedly and drop all connections.
To monitor the memory usage of the Engine's process, you can use a get-info
request.
echo '{"command":"get-info"}' | nc $HOST $PORT
The Engine will respond with its memory usage and limit. The memory limit is the minimum of the operator configured limit, the host's available memory, and the container's memory limit if the Engine is run in a Docker container.
...
"memory_gibibytes": {
"available": 125,
"limit": 128,
"used": 3
},
...
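For scripted monitoring, the relevant fields can be extracted from the get-info response; for example, assuming jq is installed:
echo '{"command":"get-info"}' | nc $HOST $PORT | jq 'select(.memory_gibibytes) | .memory_gibibytes'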
To monitor the memory usage of the Engine Docker container, use the docker stats
command.
docker stats --no-stream --format "{{.MemUsage}}" $CONTAINER_NAME
The command outputs the container's memory usage and limit.
3.137GiB / 128GiB
Each recognition thread run by the Engine might use up to 25MB of memory. As each thread finishes, it will deallocate and free its used memory. However, the Engine's memory usage may still appear to have increased even after memory has been freed. This is due to the fact that freed memory may not always be returned to the operating system and memory may become fragmented.
Because of this, memory usage of the Engine will gradually increase with each request until it
reaches a plateau. The memory usage at the plateau depends on the models and the type of audio
requests sent to the Engine.
With the default "en"
ASR model and "en"
NLP model, the Engine's baseline memory usage
(when it has loaded its models, but is not yet processing any requests) starts at 2.4GiB.
It will rise with each recognition request until stabilizing at under 3GiB.
We aim to keep the Engine's baseline memory plateau below 4GiB for a single Mod9-packaged ASR model.
Note that some 3rd-party ASR models from the Kaldi community can sometimes use considerably more memory.
In general, we recommend running the Engine with at least 8GiB of RAM. This allows more headroom for potential worst-case memory usage that can happen for certain requests (particularly: long utterances with alternatives requested).
As a rule of thumb, the Engine can concurrently process 4-5 real-time streams per CPU.
For pre-recorded audio, each recognition thread will use up all of the CPU it's running on.
NOTE: The Engine will not immediately read the entire audio file into memory. It throttles its reads so that the amount of audio read is only ever slightly ahead of the amount it has already processed or is currently processing.
If there are no other processes running on the same machine, the Engine will be most efficient when the total number of recognition threads is equal to the number of CPUs available. For example, if the Engine is running in a container with 16 CPUs, it will have its maximum throughput when the total number of recognition threads across all batch and non-batch recognitions is 16.
If the Engine over-allocates threads, running more threads than available CPUs, the throughput will not improve, since the threads would be contending for the same limited CPU resources. However, over-allocating threads does not cause much performance degradation, since the modern Linux scheduler can context switch very efficiently.
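As a sketch of one way to apply this guidance (the CPU count here is illustrative), the container can be pinned to a fixed number of CPUs and the Engine's thread limit set to match:
# Give the container 16 CPUs and cap recognition threads accordingly.
docker run -d --name=engine -p 9900:9900 --cpus=16 mod9/asr \
    engine --limit.threads=16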
Here is a curated summary of significant changes that are relevant to ASR Engine operators (cf. client release notes):
-
1.9.5 (2024 Mar 22):
- Upgrade various software dependencies, including Intel MKL to version 2024.0.
- Optimize compilation, particularly for --mkl.data-type=bf16, for improved speed.
- Downgrade warning messages about --mkl.data-type=bf16, which is now proven to work well.
-
1.9.4 (2023 May 13):
- Upgrade Intel MKL to version 2023.1.
-
1.9.3 (2022 May 03):
- Log info-level message if the Engine was built without an expiration date.
- Improve public documentation for build instructions and code structure.
- Improve private documentation (i.e. code comments) for source code maintainers.
-
1.9.2 (2022 Apr 25):
- Build instructions and code structure documentation added for source code maintainers.
-
1.9.1 (2022 Mar 28):
- No changes relevant to operator deployment.
-
1.9.0 (2022 Mar 21):
- Update the mod9/en-US_phone-ivector ASR model to use a language model trained exclusively on the Fisher and Switchboard training sets, as per standard academic research practice.
- Fixed a bug that could result in incorrectly counting the number of allocated threads.
-
1.8.1 (2022 Feb 18):
- Improve the stability of the MKL integer optimizations, especially for --mkl.data-type=int16.
-
1.8.0 (2022 Feb 16):
- Setting the --mkl.data-type=int16 option can enable the speed optimization introduced in version 1.7.0.
  - This could theoretically become numerically unstable; in practice, it appears to be well-tuned.
  - Pending further testing, some future version of the Engine may enable this optimization by default.
- Setting --mkl.data-type=int8 enables an experimental optimization that is not currently recommended.
- Setting --mkl.data-type=bf16 enables a stable optimization with lower-precision floating point.
  - On older (w.r.t. 2024) processors, 16-bit Brain floating point will be simulated slowly in software.
  - This leverages the AVX512_E3 instruction set extension, only available on Intel's newest hardware.
  - On next-generation processors (AVX512_E4 with AMX extensions), speed will be even faster!
- Messages logged upon startup should clarify which MKL instruction sets are supported and enabled.
- The --mkl.max-cache-gib option limits the additional memory that is cached for the optimized matrices. The entire cache is cleared when it reaches the maximum size, or when any ASR model is unloaded.
- Enabling --mkl.reproducibility is equivalent to setting the MKL_CBWR=COMPATIBLE environment variable. This only guarantees reproducibility if --mkl.data-type=f32 is set (currently the default).
- The environment variables ASR_ENGINE_MKL_INT16_* are still supported, but will be deprecated at version 2.
-
1.7.0 (2022 Feb 02):
- Enable speed optimization by setting the ASR_ENGINE_MKL_INT16_CACHE_GIB environment variable at runtime.
- Related defaults: ASR_ENGINE_MKL_INT16_RANGE_SCALE=0.1 and ASR_ENGINE_MKL_INT16_MIN_COLUMNS=64.
-
1.6.1 (2022 Jan 06):
- The transcript-alternatives request option is now limited to a maximum of 1000, mitigating potential worst-case memory usage. Clients should be directed to consider phrase-alternatives instead.
- The en-US_phone-benchmark model has been renamed as mod9/en-US_phone-ivector.
-
1.6.0 (2021 Dec 16):
- Add --initial-requests-file option to process local requests after models are loaded.
- Add --models.g2p option to enable automatically generated pronunciations.
- Add --limit.requests.mutable, replacing the now-deprecated --limit.requests.add-words option. It also applies to add-grammar, and the deprecated option name is retained for backwards-compatibility until version 2.
-
1.5.0 (2021 Nov 30):
- Add --limit.throttle-factor option to determine when the server may be overloaded.
- Add --limit.throttle-threads option. When clients request batch-threads and the server may be overloaded, allocate up to this many threads in order to mitigate further overloading.
-
1.4.0 (2021 Nov 04):
- Improve engine binary for better scalability on multi-socket CPU systems.
- Rename the alternative engine-no-tcmalloc binary as engine-glibc-malloc.
-
1.3.0 (2021 Oct 27):
- Fix bug affecting memory limit determination on newer Linux systems using cgroup v2.
-
1.2.0 (2021 Oct 11):
- Memory usage is now carefully checked prior to loading models:
  - Estimated memory requirements can be reported in the model package's metadata.json file. All models packaged by Mod9 (including those from Vosk, TUDA, Zamia, and kaldi-asr.org) will have this information.
  - If a model's requirements are not provided, the Engine will conservatively estimate bounds at runtime.
- Add --limit.memory-headroom-gib option, which can mitigate risk of memory exhaustion:
  - Prevents models from loading, or new requests from processing, if available memory is low.
  - The default is 0 for backwards compatibility, but may be changed in the next major release.
- Add options that limit per-request memory usage and connection timeouts:
  - --limit.read-{line,wav-header}-kib: maximum memory used to parse request options or WAV header.
  - --limit.read-stream-kib: application-level (not OS-level) buffering of data prior to ASR processing.
  - --limit.read-line-timeout: requests fail if options line is not received shortly after connection.
  - --limit.read-stream-timeout: requests fail if no new data is read within this period.
- Minor feature improvements and bugs fixed:
  - Acoustic model compilation and optimization cached at load-time, not on-demand for first request.
  - An alternative engine-no-tcmalloc binary is included in the default package, not separately distributed.
-
1.1.0 (2021 Aug 11):
- The TCMalloc library is now statically linked in the Engine, facilitating non-Docker deployment.
- Add --determinize.max-mem and --determinize.max-tries, advanced options for lattice determinization.
- The default value for --models.path is now "/opt/mod9-asr/models" rather than "./models".
- The Engine now exits immediately if --models.path is not found at startup, rather than when loading models.
-
1.0.1 (2021 Jul 31):
- Fix bug in which maximum number of concurrent requests was limited.
-
1.0.0 (2021 Jul 15):
- This first major version release represents the functionality and interface intended for long-term compatibility.
- Deprecate specification of models as positional arguments to the engine binary:
  - The ASR and NLP models are now decoupled, and can be loaded/unloaded separately.
  - The engine loads comma-delimited lists of names with the --models.asr and --models.nlp options.
-
0.9.1 (2021 May 20):
- Faster NLP models that occupy less memory - 3x improvement.
-
0.9.0 (2021 May 05):
- Move /opt/mod9 to /opt/mod9-asr inside the Docker image.
- Engine binary is now run as engine instead of ./asr-engine; it is in /opt/mod9-asr/bin, which is on $PATH.
- Support for Kaldi models using i-vectors; cf. ivector-silence-weighting client request option.
- Logging lines are now formatted with an 8-character UUID prefix and sequential request number.
If you have questions, please email us at support@mod9.com; we'd love to help you out.
You may also call (HUH) ASK-ARLO at any time to speak with A Real Live Operator.
©2019-2023 Mod9 Technologies (Version 1.9.5)