
ASR Engine: Endpoint Rules

The Engine allows an advanced endpoint-rules request option that can be used to customize the endpointing system for a "recognize" request in default mode.

As an example, we perform a recognition job with a rule to endpoint any time there is 0.5 seconds of consecutive silence when the current utterance is longer than 7 seconds.

curl -sLO mod9.io/SW_4824_B.wav
(echo '{"cmd":"recognize", "endpoint-rules": {"rule1":{"min-utterance-length":7, "min-trailing-silence":0.5}}}'; \
    cat SW_4824_B.wav) | nc $HOST $PORT

Endpointing Background

Each time the Engine reads a chunk of audio, it needs to decide whether the utterance it is currently processing has reached an endpoint. Endpointing in Kaldi is implemented as a disjunction (a chained OR) of several endpointing rules.

// Returns a boolean. True if we've reached an endpoint.
// This is called every time the engine reads in a chunk of audio.
EndpointDetected {
    if (rule0.Activated) {
        return true;
    }
    .
    .
    if (rule5.Activated) {
        return true;
    }
    return false;
}

Internally, the Engine has 6 rules, rule0 through rule5. Every rule is evaluated by the same function and differs only in its parameters. Each rule is a conjunction (a chain of ANDs) of several conditions on those parameters.

// Returns true if this endpointing rule detects an endpoint.
Rule::Activated {
      return
      (contains_nonsilence OR !rule.must_contain_nonsilence) AND
      trailing_silence >= rule.min_trailing_silence AND
      relative_cost <= rule.max_relative_cost AND
      utterance_length >= rule.min_utterance_length AND
      utterance_length <= rule.max_utterance_length;
}

The endpoint-rules option customizes endpointing by overriding these parameters.
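
The two pseudocode functions above can be sketched in Python as follows. This is only an illustration of the OR-of-ANDs logic, not the Engine's implementation: the `EndpointRule` and `UtteranceState` names are hypothetical, and the dataclass defaults are permissive placeholders rather than the Engine's default rule values.

```python
import math
from dataclasses import dataclass


@dataclass
class EndpointRule:
    # Hypothetical container; field names mirror the pseudocode above.
    # The defaults are placeholders, NOT the Engine's defaults.
    must_contain_nonsilence: bool = True
    min_trailing_silence: float = 0.0
    max_relative_cost: float = math.inf
    min_utterance_length: float = 0.0
    max_utterance_length: float = math.inf


@dataclass
class UtteranceState:
    # Hypothetical snapshot of the current utterance after reading a chunk.
    contains_nonsilence: bool
    trailing_silence: float   # seconds
    relative_cost: float
    utterance_length: float   # seconds


def activated(rule, state):
    """A rule fires only if every one of its conditions holds (AND)."""
    return (
        (state.contains_nonsilence or not rule.must_contain_nonsilence)
        and state.trailing_silence >= rule.min_trailing_silence
        and state.relative_cost <= rule.max_relative_cost
        and state.utterance_length >= rule.min_utterance_length
        and state.utterance_length <= rule.max_utterance_length
    )


def endpoint_detected(rules, state):
    """An endpoint is detected if any single rule fires (OR)."""
    return any(activated(rule, state) for rule in rules)


# The rule from the introductory example: 0.5 s of silence after 7 s of audio.
rules = [EndpointRule(min_trailing_silence=0.5, min_utterance_length=7)]
state = UtteranceState(contains_nonsilence=True, trailing_silence=0.6,
                       relative_cost=0.0, utterance_length=8.2)
print(endpoint_detected(rules, state))
```

Because the rules are ORed together, overriding one rule never disables the others; any rule left untouched can still trigger an endpoint.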

When writing endpointing rules, there are a few useful principles to keep in mind.

The Engine should endpoint and end a segment during long pauses.
Longer segments generally are more accurate because they have more audio and language context.
If the system only outputs long segments, there will be long delays between messages, which can make the Engine feel unresponsive, especially when processing live audio in real time.
As the utterance gets longer, the internal lattice representing the current utterance grows in complexity. Thus it is practical to tolerate shorter and shorter pauses when dealing with longer utterances.
The latency request option also affects endpointing performance; a shorter latency gives better endpointing.
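
One way to apply these principles is a ladder of rules in which the required pause shrinks as the utterance grows. The sketch below builds such an endpoint-rules object. The thresholds in `LADDER` and the helper name `build_endpoint_rules` are hypothetical, chosen only to show the shape of the idea; they are not the Engine's defaults or recommended values.

```python
import json

# Hypothetical (min-utterance-length, min-trailing-silence) pairs, in seconds.
# Longer utterances tolerate shorter pauses. Values are illustrative only.
LADDER = [(0, 2.0), (5, 1.0), (10, 0.5), (20, 0.2)]


def build_endpoint_rules(ladder, first_rule=0):
    """Map each ladder step to a "ruleN" object, with N increasing with length."""
    rules = {}
    for offset, (min_length, min_silence) in enumerate(ladder):
        rules[f"rule{first_rule + offset}"] = {
            "min-utterance-length": min_length,
            "min-trailing-silence": min_silence,
        }
    return rules


request = {"cmd": "recognize", "endpoint-rules": build_endpoint_rules(LADDER)}
print(json.dumps(request))
```

Rules that are not listed in the request keep their configured values and can still trigger endpoints, so a ladder like this only takes full effect if the remaining rules are compatible with it.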

Setting endpoint-rules

Endpoint rules can be passed in as a request option in the initial JSON request when the request command is "recognize" and batch is false.

NOTE: By convention, higher-numbered rules handle longer utterances.

Field	Type	Description
`endpoint-rules`	object	Additional endpointing options that override the defaults and the Engine command line. The accepted keys are `"rule0"`...`"rule5"`. Example: `{"endpoint-rules": {"rule2": {"min-trailing-silence": 0.1}}}`

The JSON object for each endpoint rule has the following fields:

Field	Type	Description
`must-contain-nonsilence`	boolean	True if the utterance must have a non-silent frame for us to endpoint.
`min-trailing-silence`	number	Minimum duration in seconds of consecutive silence at the end of the current utterance. Counting restarts whenever non-silence is detected.
`max-relative-cost`	number or `"inf"`	A non-negative cost that is 0 if it is extremely likely we are at a final state, and higher the less likely we are to be at a final state. This is primarily used for small grammars.
`min-utterance-length`	number	Minimum number of seconds of the utterance for this rule to apply (before `min-utterance-length` seconds, do not apply this rule).
`max-utterance-length`	number	Maximum number of seconds of the utterance for this rule to apply (after `max-utterance-length` seconds, do not apply this rule).
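
A misspelled key or a wrong type is easy to introduce when assembling this JSON by hand, so a client might check the object before sending it. The sketch below is a hypothetical client-side check based only on the two tables above; `check_endpoint_rules` is not part of the Engine, and the Engine's own validation and error messages may differ.

```python
RULE_NAMES = {f"rule{i}" for i in range(6)}  # "rule0" ... "rule5"
NUMERIC_FIELDS = {
    "min-trailing-silence",
    "min-utterance-length",
    "max-utterance-length",
}


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_endpoint_rules(endpoint_rules):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name, rule in endpoint_rules.items():
        if name not in RULE_NAMES:
            problems.append(f"unknown rule name: {name}")
            continue
        if not isinstance(rule, dict):
            problems.append(f"{name}: expected an object")
            continue
        for field, value in rule.items():
            if field == "must-contain-nonsilence":
                ok = isinstance(value, bool)
            elif field == "max-relative-cost":
                # Either the string "inf" or a non-negative number.
                ok = value == "inf" or (is_number(value) and value >= 0)
            elif field in NUMERIC_FIELDS:
                ok = is_number(value)
            else:
                problems.append(f"{name}: unknown field: {field}")
                continue
            if not ok:
                problems.append(f"{name}: bad value for {field}: {value!r}")
    return problems
```

For example, `check_endpoint_rules({"rule2": {"min-trailing-silence": 0.1}})` returns an empty list, while a key such as `"rule6"` is reported as unknown.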


Examples

Implement a hard cut at 40s

To implement a hard cut at 40s, we overwrite rule5 (the sixth and last rule) so that it is always activated once utterance_length reaches 40 seconds.

curl -sLO mod9.io/SW_4824_B.wav
(echo '{' \
      '  "endpoint-rules":' \
      '    {' \
      '      "rule5":' \
      '        { ' \
      '          "min-utterance-length":40,' \
      '          "max-utterance-length":100,' \
      '          "max-relative-cost":"inf",' \
      '          "min-trailing-silence":0,' \
      '          "must-contain-nonsilence":false' \
      '        }' \
      '    }' \
      '}'; cat SW_4824_B.wav) | nc $HOST $PORT

Enforce a minimum segment length of 20s

Set the min-utterance-length of every rule (rule0 through rule5) to 20 or more, so that no rule can activate before the utterance is 20 seconds long.
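
As a sketch of what such a request could look like, the Python below builds the JSON for all six rules and sends it the same way the shell examples do: one JSON line followed by the audio bytes. A rule applies once utterance_length >= min-utterance-length, so the value 20 is used here. The helper names `min_segment_request` and `recognize` are hypothetical, and half-closing the socket after the audio is an assumption about how the end of the stream is signaled.

```python
import json
import socket


def min_segment_request(min_seconds, num_rules=6):
    """Build a request that sets min-utterance-length on rule0..rule5."""
    rules = {
        f"rule{i}": {"min-utterance-length": min_seconds}
        for i in range(num_rules)
    }
    return {"cmd": "recognize", "endpoint-rules": rules}


def recognize(wav_path, request, host, port):
    """Send the JSON line, then the audio, and print each reply line."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(json.dumps(request).encode("utf-8") + b"\n")
        with open(wav_path, "rb") as wav:
            sock.sendall(wav.read())
        # Assumption: closing the write side marks the end of the audio.
        sock.shutdown(socket.SHUT_WR)
        with sock.makefile("r", encoding="utf-8") as replies:
            for line in replies:
                print(line, end="")


print(json.dumps(min_segment_request(20)))
```

Calling `recognize("SW_4824_B.wav", min_segment_request(20), host, port)` with the same host and port as `$HOST` and `$PORT` in the shell examples would then submit the job.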