Endpoint Rules | Mod9 ASR Engine

Mod9 ASR Engine: Endpoint Rules

The Engine supports an advanced request option, endpoint-rules, that customizes the endpointing system for a "recognize" request in its default (non-batch) mode.

As an example, we perform a recognition job with a rule that endpoints whenever there are at least 0.5 seconds of consecutive silence and the current utterance is at least 7 seconds long.

curl -sLO mod9.io/SW_4824_B.wav
(echo '{"cmd":"recognize", "endpoint-rules": {"rule1":{"min-utterance-length":7, "min-trailing-silence":0.5}}}'; \
    cat SW_4824_B.wav) | nc $HOST $PORT
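
The same request can also be sent from Python. The sketch below is a minimal translation of the shell pipeline above: it assumes the Engine accepts one line of JSON followed by the raw audio bytes on a plain TCP socket (which is what the echo/cat/nc pipeline sends) and that replies arrive as newline-delimited text. The helper names build_request and recognize are hypothetical and are not part of any Mod9 library.

```python
import json
import socket


def build_request(min_utterance_length, min_trailing_silence):
    """Return the newline-terminated JSON request line that overrides rule1."""
    request = {
        "cmd": "recognize",
        "endpoint-rules": {
            "rule1": {
                "min-utterance-length": min_utterance_length,
                "min-trailing-silence": min_trailing_silence,
            }
        },
    }
    return (json.dumps(request) + "\n").encode("utf-8")


def recognize(host, port, wav_path, request_line):
    """Send the request line and the audio, then yield each reply line."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(request_line)
        with open(wav_path, "rb") as audio:
            sock.sendall(audio.read())
        # Close the sending side to signal that no more audio will follow.
        sock.shutdown(socket.SHUT_WR)
        # Assumption: replies are newline-delimited text.
        with sock.makefile("r", encoding="utf-8") as replies:
            for line in replies:
                yield line.rstrip("\n")
```

With an Engine listening on a given host and port, a call such as `for reply in recognize(host, port, "SW_4824_B.wav", build_request(7, 0.5)): print(reply)` would mirror the shell command.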

Endpointing Background

Each time the Engine reads a chunk of audio, it needs to decide whether the utterance it is currently processing has reached an endpoint. Endpointing in Kaldi is implemented as a disjunction (a chained OR) of several endpointing rules.

// Returns true if we've reached an endpoint.
// Called every time the Engine reads in a chunk of audio.
bool EndpointDetected() {
    if (rule0.Activated()) {
        return true;
    }
    // ... rule1 through rule4 are checked in the same way ...
    if (rule5.Activated()) {
        return true;
    }
    return false;
}

Internally, the Engine has 6 rules, rule0 through rule5. All rules share the same structure: each is the same function evaluated with different parameters. Each rule is a conjunction (a chain of ANDs) of several conditions.

// Returns true if this endpointing rule detects an endpoint.
// contains_nonsilence, trailing_silence, relative_cost, and utterance_length
// describe the current utterance; the other names are this rule's parameters.
bool Rule::Activated() {
    return
        (contains_nonsilence || !must_contain_nonsilence) &&
        trailing_silence >= min_trailing_silence &&
        relative_cost <= max_relative_cost &&
        utterance_length >= min_utterance_length &&
        utterance_length <= max_utterance_length;
}
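
To make the OR-of-ANDs structure concrete, here is a small runnable sketch of the same logic in Python. It is an illustration of the pseudocode above, not the Engine's implementation: the class and function names are hypothetical, and the default parameter values are placeholders rather than the Engine's actual defaults.

```python
import math
from dataclasses import dataclass


@dataclass
class Rule:
    # Field names mirror the pseudocode; the default values are placeholders.
    must_contain_nonsilence: bool = False
    min_trailing_silence: float = 0.0
    max_relative_cost: float = math.inf
    min_utterance_length: float = 0.0
    max_utterance_length: float = math.inf

    def activated(self, contains_nonsilence, trailing_silence,
                  relative_cost, utterance_length):
        """True only if every condition of this rule holds (a conjunction)."""
        return (
            (contains_nonsilence or not self.must_contain_nonsilence)
            and trailing_silence >= self.min_trailing_silence
            and relative_cost <= self.max_relative_cost
            and self.min_utterance_length <= utterance_length <= self.max_utterance_length
        )


def endpoint_detected(rules, **state):
    """True if any rule is activated (a disjunction)."""
    return any(rule.activated(**state) for rule in rules)


# Hypothetical rule: 0.5 s of trailing silence once the utterance reaches 7 s.
rules = [Rule(min_utterance_length=7, min_trailing_silence=0.5)]
state = dict(contains_nonsilence=True, trailing_silence=0.6, relative_cost=0.0)
print(endpoint_detected(rules, utterance_length=5.0, **state))  # False
print(endpoint_detected(rules, utterance_length=8.0, **state))  # True
```

The first call returns False because the utterance is shorter than the rule's minimum length, even though the pause is long enough; the second satisfies every condition.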

The endpoint-rules option customizes endpointing by overwriting these parameters.

When writing endpointing rules, there are a few useful principles to keep in mind.

  • The Engine should endpoint and end a segment during long pauses.
  • Longer segments generally are more accurate because they have more audio and language context.
  • If the system only outputs long segments, there will be high latency in between messages, which might make the engine feel unresponsive, especially when processing live audio in real time.
  • As the utterance gets longer, the internal lattice representing it grows in complexity. It is therefore practical to tolerate shorter and shorter pauses as the utterance lengthens.
  • The latency request option also affects endpointing performance; a shorter latency setting gives better endpointing.
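
One way to apply these principles is a ladder of rules in which the required pause shrinks as the utterance grows. The sketch below builds such an endpoint-rules object. The thresholds are illustrative assumptions, not the Engine's defaults or recommended values, and the helper build_ladder is hypothetical.

```python
import json

# (min-utterance-length, max-utterance-length, min-trailing-silence), in seconds.
# Hypothetical values chosen only to illustrate the pattern.
LADDER = [
    (0, 5, 2.0),
    (5, 10, 1.0),
    (10, 20, 0.5),
    (20, 40, 0.2),
]


def build_ladder(ladder, first_rule=1):
    """Map each ladder step to a rule; higher numbers cover longer utterances."""
    rules = {}
    for offset, (min_len, max_len, silence) in enumerate(ladder):
        rules[f"rule{first_rule + offset}"] = {
            "min-utterance-length": min_len,
            "max-utterance-length": max_len,
            "min-trailing-silence": silence,
        }
    return rules


options = {"cmd": "recognize", "endpoint-rules": build_ladder(LADDER)}
print(json.dumps(options, indent=2))
```

Short utterances here must end in a long pause before they are cut, which preserves context, while long utterances are cut at the first brief pause, which bounds both latency and lattice growth.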

Setting endpoint-rules

Endpoint rules can be passed in as a request option in the initial JSON request when the request command is "recognize" and batch is false.

NOTE: It is the convention that higher numbered rules deal with longer utterance lengths.

Field | Type | Description
endpoint-rules | object | Additional endpointing options that override the defaults and any settings given on the Engine command line. The accepted keys are "rule0" through "rule5". Example: {"endpoint-rules": {"rule2": {"min-trailing-silence": 0.1}}}

The JSON object for each endpoint rule has the following fields:

Field | Type | Description
must-contain-nonsilence | boolean | If true, the utterance must contain at least one non-silent frame before this rule can endpoint.
min-trailing-silence | number | Minimum duration, in seconds, of consecutive silence at the end of the current utterance. The count restarts whenever nonsilence is encountered.
max-relative-cost | number or "inf" | A non-negative cost that is 0 if it is extremely likely we are at a final state, and higher the less likely we are to be at a final state. This is primarily used for small grammars.
min-utterance-length | number | Minimum utterance length, in seconds, for this rule to apply (the rule is not applied before min-utterance-length seconds).
max-utterance-length | number | Maximum utterance length, in seconds, for this rule to apply (the rule is not applied after max-utterance-length seconds).
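
Because the option is nested JSON, it can help to check it on the client before sending it. The sketch below validates rule names, field names, and value types against the two tables above. It is a hypothetical client-side helper, not an Engine API, and the Engine may apply further checks that are not modeled here.

```python
RULE_NAMES = {f"rule{i}" for i in range(6)}  # "rule0" through "rule5"
NUMERIC_FIELDS = {
    "min-trailing-silence",
    "max-relative-cost",
    "min-utterance-length",
    "max-utterance-length",
}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_endpoint_rules(endpoint_rules):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name, rule in endpoint_rules.items():
        if name not in RULE_NAMES:
            problems.append(f"{name}: unknown rule name")
            continue
        for field, value in rule.items():
            if field == "must-contain-nonsilence":
                ok = isinstance(value, bool)
            elif field == "max-relative-cost":
                ok = value == "inf" or _is_number(value)
            elif field in NUMERIC_FIELDS:
                ok = _is_number(value)
            else:
                problems.append(f"{name}.{field}: unknown field")
                continue
            if not ok:
                problems.append(f"{name}.{field}: unexpected type")
    return problems
```

For example, `validate_endpoint_rules({"rule2": {"min-trailing-silence": 0.1}})` returns an empty list, while a key such as "rule6" is reported as an unknown rule name.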

Examples

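Implement a hard cut at 40s

To implement a hard cut at 40s, we overwrite rule5 so that it is activated whenever utterance_length is at least 40 seconds (and no more than the max-utterance-length of 100 seconds set here), regardless of silence or relative cost.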

curl -sLO mod9.io/SW_4824_B.wav
(echo '{' \
      '  "endpoint-rules":' \
      '    {' \
      '      "rule5":' \
      '        {' \
      '          "min-utterance-length":40,' \
      '          "max-utterance-length":100,' \
      '          "max-relative-cost":"inf",' \
      '          "min-trailing-silence":0,' \
      '          "must-contain-nonsilence":false' \
      '        }' \
      '    }' \
      '}'; cat SW_4824_B.wav) | nc $HOST $PORT

Enforce a minimum segment length of 20s

Set the min-utterance-length of every rule (rule0 through rule5) to 20 or more, so that no rule can activate before the utterance is 20 seconds long.
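
As a minimal sketch, the request options for this case can be generated rather than typed out by hand. The snippet below assumes that overriding a single field of a rule leaves its other parameters at their configured values, as the single-field example in the endpoint-rules table suggests.

```python
import json

MIN_SEGMENT_SECONDS = 20

request = {
    "cmd": "recognize",
    "endpoint-rules": {
        f"rule{i}": {"min-utterance-length": MIN_SEGMENT_SECONDS}
        for i in range(6)  # rule0 through rule5
    },
}
request_line = json.dumps(request)
print(request_line)
```

Note that any rule whose max-utterance-length is below 20 can never activate once this override is in place, since its length window becomes empty.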


©2019-2022 Mod9 Technologies (Version 1.9.5)