Automatic Speech Recognition - ASR

This sample runs a speech recognition model using the hbm_runtime inference engine to automatically transcribe .wav audio files and output the corresponding text. The sample code is located in /app/pydev_demo/07_speech_sample/01_asr/.

Warning

The current RDK S100 system image does not include the asr.hbm model. Before running this sample, you must download it manually (see the download URL in "Model Description" below) and place it at the default path /opt/hobot/model/s100/basic/asr.hbm, or specify another path with --model-path.

Depending on the system image, the sample code may instead be located in /app/pydev_demo/speech_sample/asr/.

Model Description

  • Introduction:

    ASR (Automatic Speech Recognition) models convert audio signals into text. The input is single-channel speech waveforms (after sample rate conversion and standardization), and the output is character-level token sequences. Combined with a vocabulary (vocab) file, Chinese speech transcription can be achieved. This sample uses a quantized .hbm model.

  • HBM model name: asr.hbm

  • Input format: audio waveform, single channel, sample rate 16kHz, maximum length 30000 (sample points)

  • Output: character token probability distribution (logits); recognized text is obtained by argmax decoding and mapping

  • Model download URLs (automatically downloaded by the program):

    RDK S100: https://archive.d-robotics.cc/downloads/rdk_model_zoo/rdk_s100/asr/asr.hbm
    RDK S600: https://archive.d-robotics.cc/downloads/rdk_model_zoo/rdk_s600/asr/asr.hbm

Features

  • Model loading

    Use hbm_runtime to load the ASR model and automatically parse model input/output shapes and quantization information.

  • Input preprocessing

    Read audio with SoundFile (supports .wav). The audio is:

    • Converted to single channel
    • Resampled to the target sample rate (default 16kHz)
    • Standardized to zero mean and unit variance (z-score)
    • Padded or truncated to a fixed length (for example, 30000)
    • For long audio, processed through a generator to support streaming recognition
  • Inference execution

    Run inference with the .run() method, which returns a logits tensor.

  • Output postprocessing

    Use np.argmax() to obtain token indices from output logits, map them to characters using the vocab dictionary file (JSON format), and output the final recognized text.
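
The preprocessing steps listed above can be sketched with NumPy alone. This is a minimal illustration, not the code in asr.py: the function names are hypothetical, the linear-interpolation resampler is an assumption (the sample may use a different resampling method), and normalizing each chunk separately in iter_chunks is also an assumption.

```python
import numpy as np


def to_mono(audio):
    """Average channels; accepts shape (samples,) or (samples, channels)."""
    audio = np.asarray(audio, dtype=np.float32)
    return audio if audio.ndim == 1 else audio.mean(axis=1)


def resample_linear(audio, orig_rate, new_rate):
    """Resample by linear interpolation (assumed method, not band-limited)."""
    if orig_rate == new_rate or audio.size == 0:
        return audio
    n_out = int(round(audio.size * new_rate / orig_rate))
    old_t = np.arange(audio.size) / orig_rate
    new_t = np.arange(n_out) / new_rate
    return np.interp(new_t, old_t, audio).astype(np.float32)


def zscore(audio, eps=1e-8):
    """Standardize to zero mean and unit variance."""
    return (audio - audio.mean()) / (audio.std() + eps)


def pad_or_truncate(audio, maxlen):
    """Cut to maxlen samples, or append zeros until maxlen is reached."""
    if audio.size >= maxlen:
        return audio[:maxlen]
    return np.pad(audio, (0, maxlen - audio.size))


def preprocess(audio, orig_rate, new_rate=16000, audio_maxlen=30000):
    """Mono -> resample -> z-score -> fixed length, in the order listed above."""
    x = resample_linear(to_mono(audio), orig_rate, new_rate)
    return pad_or_truncate(zscore(x), audio_maxlen).astype(np.float32)


def iter_chunks(audio, orig_rate, new_rate=16000, audio_maxlen=30000):
    """Yield fixed-length, individually normalized chunks of a long recording."""
    x = resample_linear(to_mono(audio), orig_rate, new_rate)
    for start in range(0, x.size, audio_maxlen):
        chunk = x[start:start + audio_maxlen]
        yield pad_or_truncate(zscore(chunk), audio_maxlen).astype(np.float32)
```

With the soundfile package, the input could come from audio, rate = soundfile.read(path), after which preprocess(audio, rate) returns a float32 array of audio_maxlen samples.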

Environment Dependencies

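  • Ensure the dependencies in pydev are installed

    pip install -r ../../requirements.txt

    If pip refuses to install into a system-managed Python environment, add --break-system-packages:

    pip install -r ../../requirements.txt --break-system-packages
  • Install the soundfile package

    pip install soundfile==0.13.1

    or, in a system-managed Python environment:

    pip install soundfile==0.13.1 --break-system-packages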

Directory Structure

01_asr/
├── asr.py # Main inference script

Parameter Description

Parameter        Description                                                 Default Value
--model-path     Model path (.hbm format)                                    /opt/hobot/model/s100/basic/asr.hbm (RDK S100)
                                                                             /opt/hobot/model/s600/basic/asr.hbm (RDK S600)
--audio-file     Input audio file (supports .wav or .flac)                   /app/res/assets/chi_sound.wav
--vocab-file     Vocabulary file, mapping token → id                         /app/res/labels/vocab.json
--priority       Inference priority, 0~255; larger is higher                 0
--bpu-cores      BPU cores to use (for example, --bpu-cores 0 1)             [0]
--audio_maxlen   Fixed length after audio cropping/padding (sample points)   30000
--new_rate       Target sample rate; audio is automatically resampled        16000
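
As an illustration of how these parameters fit together, they could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser in asr.py: build_parser is a hypothetical name, the argument types are inferred from the defaults, and the RDK S100 model path is used as the default.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the parameter table (not the real asr.py)."""
    parser = argparse.ArgumentParser(description="ASR sample (illustrative)")
    parser.add_argument("--model-path", type=str,
                        default="/opt/hobot/model/s100/basic/asr.hbm",
                        help="Model path (.hbm format)")
    parser.add_argument("--audio-file", type=str,
                        default="/app/res/assets/chi_sound.wav",
                        help="Input audio file (.wav or .flac)")
    parser.add_argument("--vocab-file", type=str,
                        default="/app/res/labels/vocab.json",
                        help="Vocabulary file, mapping token to id")
    parser.add_argument("--priority", type=int, default=0,
                        help="Inference priority, 0 to 255")
    # nargs="+" accepts one or more core indices, e.g. --bpu-cores 0 1
    parser.add_argument("--bpu-cores", type=int, nargs="+", default=[0],
                        help="BPU cores to use")
    parser.add_argument("--audio_maxlen", type=int, default=30000,
                        help="Fixed length after cropping/padding (samples)")
    parser.add_argument("--new_rate", type=int, default=16000,
                        help="Target sample rate in Hz")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--bpu-cores", "0", "1"])
print(args.model_path, args.bpu_cores, args.audio_maxlen)
```

Note that argparse exposes --model-path as args.model_path (hyphens become underscores), while --audio_maxlen and --new_rate keep their names unchanged.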

Quick Start

  • Run the model

    • Use default parameters

      python asr.py
    • Run with specified parameters

      On RDK S100:

      python asr.py \
      --model-path /opt/hobot/model/s100/basic/asr.hbm \
      --audio-file /app/res/assets/chi_sound.wav \
      --vocab-file /app/res/labels/vocab.json \
      --priority 0 \
      --bpu-cores 0 \
      --audio_maxlen 30000 \
      --new_rate 16000

      On RDK S600:

      python asr.py \
      --model-path /opt/hobot/model/s600/basic/asr.hbm \
      --audio-file /app/res/assets/chi_sound.wav \
      --vocab-file /app/res/labels/vocab.json \
      --priority 0 \
      --bpu-cores 0 \
      --audio_maxlen 30000 \
      --new_rate 16000
  • View the result

    After successful execution, the result will be printed.

    我是来自阿里云的大规模语言磨型过叫通意千问||
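
The printed text is the result of argmax decoding followed by a vocabulary lookup. A minimal sketch of that step is shown here, assuming logits shaped (time, vocab) or (1, time, vocab) and a vocab JSON that maps token to id; the function names are hypothetical. The source only describes argmax and mapping, so the sketch does not collapse repeated tokens or strip special symbols, and the real script may differ.

```python
import json

import numpy as np


def load_id_to_token(vocab_file):
    """Load a token -> id JSON file and invert it to id -> token."""
    with open(vocab_file, encoding="utf-8") as f:
        token_to_id = json.load(f)
    return {idx: token for token, idx in token_to_id.items()}


def greedy_decode(logits, id_to_token):
    """Pick the most likely token at each time step and join the characters."""
    ids = np.argmax(np.asarray(logits), axis=-1).reshape(-1)
    # Unknown ids map to an empty string instead of raising.
    return "".join(id_to_token.get(int(i), "") for i in ids)


# Toy example with a three-entry vocabulary (made up for illustration).
toy_vocab = {0: "<pad>", 1: "你", 2: "好"}
toy_logits = np.array([[0.1, 0.8, 0.1],
                       [0.2, 0.1, 0.7]])
print(greedy_decode(toy_logits, toy_vocab))  # 你好
```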

Notes

  • If the specified model path does not exist, the program will attempt to download the model automatically.
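
A download-if-missing check of this kind could look like the sketch below. It is an assumption about the general approach, not the code in asr.py: ensure_model and its fetch parameter are hypothetical names, and the RDK S100 URL from "Model Description" is used as the default.

```python
import os
import urllib.request

MODEL_URL = ("https://archive.d-robotics.cc/downloads/"
             "rdk_model_zoo/rdk_s100/asr/asr.hbm")


def ensure_model(model_path, url=MODEL_URL, fetch=urllib.request.urlretrieve):
    """Return model_path, downloading the file first if it does not exist."""
    if os.path.isfile(model_path):
        return model_path
    parent = os.path.dirname(model_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Download to a temporary name so an interrupted transfer
    # never leaves a truncated file at the final path.
    tmp_path = model_path + ".part"
    fetch(url, tmp_path)
    os.replace(tmp_path, model_path)
    return model_path
```

Writing under /opt/hobot/model/ may require elevated permissions, so a user-writable location passed through --model-path can be a simpler choice.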