The Mod9 ASR Python SDK is a higher-level interface than the protocol described in the TCP reference documentation. Designed as a compatible drop-in replacement for the Google Cloud STT Python Client Library, Mod9's software enables privacy-protecting on-premise deployment, while also extending functionality of the Google Cloud service.
- Differences from Google Cloud STT
- Supported configuration options
- Installation and setup
- Example usage
Install the Mod9 ASR Python SDK:
pip3 install mod9-asr
To transcribe a sample audio file accessible at https://mod9.io/hello_world.wav:
from mod9.asr.speech import SpeechClient
client = SpeechClient(host='mod9.io', port=9900)
response = client.recognize(config={'language_code': 'en-US'},
                            audio={'uri': 'https://mod9.io/hello_world.wav'})
print(response)
This Python code instantiates the SpeechClient class with arguments that specify how to connect to a server running the ASR Engine.
For convenience, such a server is deployed at mod9.io, listening on port 9900.
NOTE: Sensitive data should not be sent to this evaluation server, because the TCP connection is unencrypted.
For comparison, here's how the above code might look using Google's client library (and supplied credentials):
import json
from urllib.request import urlopen
from google.oauth2 import service_account
# The mod9.asr.speech module is a drop-in replacement for this module.
from google.cloud.speech import SpeechClient
# For demonstration purposes, use Mod9's GCP credentials (subject to a daily billing quota).
client = SpeechClient(credentials=service_account.Credentials.from_service_account_info(
    json.load(urlopen('https://mod9.io/gstt-demo-credentials.json'))))
# Note that Google only supports audio URIs at Google Cloud Storage (gs:// scheme).
response = client.recognize(config={'language_code': 'en-US'},
                            audio={'uri': 'gs://gstt-demo-audio/hello_world.wav'})
print(response)
Mod9's implementation of SpeechClient replicates Google's recognize() method, which is synchronous: it processes an entire request before returning a single response.
This is suitable for transcribing pre-recorded audio files.
The config argument can be either a Python dict or a RecognitionConfig object that contains metadata about the audio input, as well as supported configuration options that affect the output.
The audio argument can be either a Python dict or a RecognitionAudio object that contains either content or uri.
While content represents audio bytes directly, uri specifies a location where audio may be accessed.
The output from recognize() is a RecognizeResponse that may contain SpeechRecognitionResult objects.
Alternatively, as demonstrated in further example usage below, the streaming_recognize() method returns a generator that yields StreamingRecognitionResult objects while a real-time audio stream is being sent and processed.
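Streaming recognition consumes audio incrementally rather than as one complete request. The following is a minimal sketch of only the producing side: a generator that reads a byte stream in fixed-size chunks. The name audio_chunks and the chunk size are hypothetical; how each chunk is wrapped into a streaming request follows Google's client library and is not shown here.

```python
import io
from typing import BinaryIO, Iterator


def audio_chunks(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive chunks of raw bytes until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for a live audio source: 10,000 bytes of silence.
fake_audio = io.BytesIO(b"\x00" * 10_000)
sizes = [len(chunk) for chunk in audio_chunks(fake_audio)]
print(sizes)  # [4096, 4096, 1808]
```

Because the generator is lazy, no chunk is read until the consumer asks for it, which suits a source such as a microphone that produces data over time.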
There are some notable differences:
- Google's RecognitionAudio.uri only allows files to be retrieved from Google Cloud Storage.
  The Mod9 ASR Python SDK accepts audio from more diverse sources:

  URI Scheme | Access files stored ... |
  ---|---|
  gs:// | in Google Cloud Storage, |
  s3:// | as AWS S3 objects, |
  http:// or https:// | via arbitrary HTTP services, |
  file:// | or on a local filesystem. |

- Google's RecognitionAudio.content and SpeechClient.recognize() restrict audio to be less than 60 seconds.
  The Mod9 ASR Python SDK does not limit the duration of audio.
- Google's SpeechClient.streaming_recognize() restricts audio to be less than 5 minutes.
  The Mod9 ASR Python SDK does not limit the duration of streaming audio.
- Google's SpeechClient.long_running_recognize() can asynchronously process longer audio files.
  The Mod9 ASR Python SDK has not replicated this; it's better served with a Google-compatible Mod9 ASR REST API.
- Google Cloud STT supports a large number of languages for a variety of acoustic conditions.
  Mod9 ASR packages over 50 models for about 20 languages and dialects -- or bring your own models.
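As an illustration of the content/uri distinction and the URI schemes listed above, here is a minimal sketch of a helper that builds the audio argument as a Python dict. The helper name make_audio is hypothetical and not part of the SDK; it assumes that anything without one of the listed schemes is a local file path whose bytes should be sent as content.

```python
from pathlib import Path
from urllib.parse import urlsplit

# URI schemes that this document lists as accepted by the Mod9 ASR Python SDK.
URI_SCHEMES = {"gs", "s3", "http", "https", "file"}


def make_audio(source: str) -> dict:
    """Return {'uri': ...} for a listed URI scheme, else {'content': <file bytes>}."""
    if urlsplit(source).scheme in URI_SCHEMES:
        return {"uri": source}
    # Assumption: any other string is a path to a local audio file.
    return {"content": Path(source).read_bytes()}


print(make_audio("https://mod9.io/hello_world.wav"))
# {'uri': 'https://mod9.io/hello_world.wav'}
```

The resulting dict could then be passed as the audio argument of recognize(), in the same way as the literal dict in the earlier example.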
The Mod9 ASR Python SDK provides two modules:
- mod9.asr.speech implements a strict subset of Google's functionality.
- mod9.asr.speech_mod9 extends this with additional functionality that Google does not support.
Option in config | Accepted values in mod9.asr.speech | Extended support in mod9.asr.speech_mod9 |
---|---|---|
asr_model | N/A | Select from loaded models |
audio_channel_count [1] | N/A | Integer |
enable_automatic_punctuation [2] | False, True | |
enable_separate_recognition_per_channel [3] | N/A | True |
enable_word_confidence | False, True | |
enable_word_time_offsets | False, True | |
encoding | "LINEAR16", "MULAW" | "ALAW", "LINEAR24", "LINEAR32", "FLOAT32" |
language_code | (~20 languages/dialects) | |
latency [4] | N/A | 0.01, ..., 3.0 |
max_alternatives [5] | 0, ..., 1000 | |
max_phrase_alternatives [6] | N/A | 1, ..., 10000 |
max_word_alternatives [7] | N/A | 1, ..., 10000 |
model | "video", "phone_call", "default" | |
intervals_json [8] | N/A | "[[Number, Number], …]" |
options_json [9] | N/A | "{…}" |
sample_rate_hertz | 8000, ..., 48000 | |
speed [10] | N/A | 1, ..., 9 |
[1] Mod9 ASR: this is optional for non-raw audio. Internally, the Engine has a restriction on the number of channels.
[2] Mod9 ASR: enabling punctuation also applies capitalization and number formatting.
[3] Mod9 ASR: default is True, and Mod9 does not support a value of False, wherein only the first channel is recognized.
[4] Mod9 ASR: lower values may improve responsiveness, higher values may decrease CPU usage; default is 0.24 seconds.
[5] Google STT: only allows up to 30 transcript-level alternatives (i.e. N-best) to be requested, but often results in fewer.
[6] Mod9 ASR: more useful representation of ambiguity in speech, as short sequences of many-to-many word mappings.
[7] Mod9 ASR: a more compact representation, but restricted as one-to-one word mappings. (cf. IBM Watson STT API)
[8] Mod9 ASR: provide a speech segmentation, useful for ensuring that results are aligned with speaker turns.
[9] Mod9 ASR: arbitrary request options to the Mod9 ASR Engine may be specified to override or extend functionality.
[10] Mod9 ASR: lower values may improve recognition alternatives, higher values may decrease CPU usage; default is 5.
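The numeric ranges and encodings in the table above can be checked before a request is sent. The following is a minimal sketch of such a client-side check; check_config is a hypothetical helper that is not part of the SDK, and it only covers the options whose accepted values the table states explicitly.

```python
# Ranges and encodings transcribed from the table above (both modules combined).
RANGES = {
    "latency": (0.01, 3.0),
    "max_alternatives": (0, 1000),
    "max_phrase_alternatives": (1, 10000),
    "max_word_alternatives": (1, 10000),
    "sample_rate_hertz": (8000, 48000),
    "speed": (1, 9),
}
ENCODINGS = {"LINEAR16", "MULAW", "ALAW", "LINEAR24", "LINEAR32", "FLOAT32"}


def check_config(config: dict) -> list:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for key, (low, high) in RANGES.items():
        if key in config and not low <= config[key] <= high:
            problems.append(f"{key}={config[key]!r} is outside [{low}, {high}]")
    if "encoding" in config and config["encoding"] not in ENCODINGS:
        problems.append(f"encoding={config['encoding']!r} is not listed")
    return problems


print(check_config({"encoding": "LINEAR16", "sample_rate_hertz": 8000, "speed": 5}))
# []
print(check_config({"latency": 5.0}))
# ['latency=5.0 is outside [0.01, 3.0]']
```

Note that this sketch does not distinguish which options require mod9.asr.speech_mod9 rather than mod9.asr.speech; the table remains the reference for that.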
pip3 install mod9-asr
The Python SDK must connect to an ASR Engine server to transcribe audio.
It may be most expedient to use the evaluation server running at mod9.io:
export ASR_ENGINE_HOST=mod9.io
However, because this TCP transport is unencrypted and traverses the public Internet, customers are strongly advised that sensitive data should not be sent to this evaluation server. No data privacy is implied, nor service level promised.
The ASR Engine can also be run locally on bare-metal Linux, or in a Docker container. See installation instructions.
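Before running the examples, it can help to confirm that an Engine is actually reachable. The following is a minimal sketch using only the standard library: it reads ASR_ENGINE_HOST from the environment and attempts a plain TCP connection. The helper names, the fallback to localhost, and the fixed default port are assumptions made for this sketch; 9900 is the port this document cites for mod9.io.

```python
import os
import socket


def engine_address(env=os.environ, port=9900):
    """Return (host, port), reading the host from ASR_ENGINE_HOST.

    Falling back to 'localhost' is an assumption made for this sketch.
    """
    return env.get("ASR_ENGINE_HOST", "localhost"), port


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = engine_address({"ASR_ENGINE_HOST": "mod9.io"})
print(host, port)  # mod9.io 9900
# is_reachable(host, port) would then attempt the actual connection.
```

A successful connection only shows that something is listening on that port; it does not verify that the listener is an ASR Engine.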
This Mod9 ASR Python SDK is designed to emulate the Google Cloud STT Python Client Library, and we encourage developers to compare our respective software and services side-by-side to ensure compatibility.
Google Cloud credentials are required for such comparisons, so we share gstt-demo-credentials.json to facilitate testing. Download and enable these demo credentials by setting an environment variable in your current shell:
curl -O https://mod9.io/gstt-demo-credentials.json
export GOOGLE_APPLICATION_CREDENTIALS=gstt-demo-credentials.json
Sensitive data should not be used with these shared demo credentials, as it could be seen by other users who are testing. A daily quota is set to prevent abuse of these limited-use credentials, so Google's service may at times be unavailable.
The Mod9 ASR Python SDK is a drop-in replacement for the Google Cloud STT Python Client Library.
To demonstrate this compatibility, consider the sample scripts published by Google:
- transcribe.py uses recognize() for basic processing of 16kHz audio files.
- transcribe_auto_punctuation.py demonstrates an extra config option and uses 8kHz audio.
- transcribe_streaming_mic.py uses streaming_recognize() with live audio captured from your microphone.
To download Google's sample scripts with a command-line tool:
curl -LO github.com/googleapis/python-speech/raw/main/samples/snippets/transcribe.py
curl -LO github.com/googleapis/python-speech/raw/main/samples/snippets/transcribe_auto_punctuation.py
curl -LO github.com/googleapis/python-speech/raw/main/samples/microphone/transcribe_streaming_mic.py
Modify the lines that call from google.cloud import speech to now use mod9.asr, for example with a stream editor:
sed s/google.cloud/mod9.asr/ transcribe.py > transcribe_mod9.py
sed s/google.cloud/mod9.asr/ transcribe_auto_punctuation.py > transcribe_auto_punctuation_mod9.py
sed s/google.cloud/mod9.asr/ transcribe_streaming_mic.py > transcribe_streaming_mic_mod9.py
The mod9ified sample scripts are named as *_mod9.py and differ only in the import lines. To verify this:
diff transcribe.py transcribe_mod9.py
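The same substitution and check can be sketched in Python, which may be convenient where sed and diff are unavailable. The function name mod9ify and the sample source text are illustrative only; unlike the sed pattern, this sketch matches the dot in google.cloud literally.

```python
import re


def mod9ify(source: str) -> str:
    """Replace the first 'google.cloud' on each line with 'mod9.asr'."""
    return "".join(
        re.sub(r"google\.cloud", "mod9.asr", line, count=1)
        for line in source.splitlines(keepends=True)
    )


# Illustrative stand-in for the contents of a sample script.
original = (
    "import io\n"
    "from google.cloud import speech\n"
    "client = speech.SpeechClient()\n"
)
modified = mod9ify(original)

# The substitution preserves the line count, so lines can be compared pairwise.
changed = [
    (old, new)
    for old, new in zip(original.splitlines(), modified.splitlines())
    if old != new
]
print(changed)
# [('from google.cloud import speech', 'from mod9.asr import speech')]
```

If every entry in changed is an import line, the modified script differs from the original only in where the speech module comes from.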
The modified scripts do not communicate with Google Cloud; the following example usage can even be demonstrated on a laptop with no Internet connection, e.g. if the Mod9 ASR Engine is deployed on localhost.
Download sample audio files, greetings.wav (2s @ 16kHz) and SW_4824_B.wav (5m @ 8kHz):
curl -L -O mod9.io/greetings.wav -O mod9.io/SW_4824_B.wav
Run the modified sample script:
python3 transcribe_mod9.py greetings.wav
If it can connect to the Mod9 ASR Engine, the script should print Transcript: greetings world.
Google's recognize() method only allows audio duration up to 60 seconds.
To demonstrate that Mod9 ASR extends support for arbitrarily long durations, run another script (which is configured for 8kHz audio and transcript formatting):
python3 transcribe_auto_punctuation_mod9.py SW_4824_B.wav
To compare with Google Cloud STT (optional), run the original unmodified scripts:
python3 transcribe.py greetings.wav
python3 transcribe_auto_punctuation.py SW_4824_B.wav
The first script produces the same result as Mod9 ASR; meanwhile, Google STT will fail to process the longer audio file.
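To predict which files Google's synchronous method would reject, the duration of a WAV file can be computed locally. The following is a minimal sketch using the standard wave module, assuming a PCM WAV file that this module can parse; the helper names are hypothetical, and the demonstration writes a short synthetic file rather than reading the sample audio.

```python
import os
import tempfile
import wave

GOOGLE_SYNC_LIMIT_SECONDS = 60  # limit for recognize() stated in this document


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file as frames divided by frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def exceeds_google_sync_limit(path: str) -> bool:
    """Return True if the audio is not under the 60-second synchronous limit."""
    return wav_duration_seconds(path) >= GOOGLE_SYNC_LIMIT_SECONDS


def write_silence(path: str, seconds: int, rate: int = 16000) -> None:
    """Write mono 16-bit silence of the given length (for demonstration only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "silence.wav")
    write_silence(sample, seconds=2)
    duration = wav_duration_seconds(sample)
    too_long = exceeds_google_sync_limit(sample)

print(duration, too_long)  # 2.0 False
```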
The streaming scripts will require PortAudio and PyAudio for OS-dependent microphone access. To install on a Mac:
brew install portaudio && pip3 install pyaudio
Running this sample script will record audio from your microphone and print results in real-time:
python3 transcribe_streaming_mic_mod9.py
To compare with Google Cloud STT (optional), run the unmodified script:
python3 transcribe_streaming_mic.py
It can be especially helpful to run both of these scripts at the same time, comparing side-by-side in different windows. Note that the unmodified script using Google STT will eventually disconnect after reaching Google's 5-minute streaming limit.
See also the Mod9 ASR REST API, which can run a Google-compatible service that is accessible to HTTP clients.
This is especially recommended for asynchronous batch-processing workloads, with a POST followed by GET.
The TCP reference documentation describes the lower-level protocol that is abstracted by the Python SDK and REST API. The TCP interface can enable more extensive functionality, including user-defined words and domain-specific grammar.
Advanced configuration of the Mod9 ASR Engine is described in the deployment guide.
Contact support@mod9.com for additional assistance.
©2019-2022 Mod9 Technologies (Engine 1.9.5 : Python SDK 1.11.6)