4.2 Python API
hbm_runtime is a Python binding built on pybind11 for accessing and operating the underlying libhbucp / libdnn C++ libraries, providing high-performance neural network model loading and inference.
This interface encapsulates low-level model runtime details so Python users can conveniently load single or multiple neural network models, query and manage model input/output metadata, and run inference flexibly. It supports multiple input data formats and, when necessary, automatically converts inputs to C-contiguous storage to ensure correct and efficient low-level access.
In addition, the new interface releases the Python GIL on the C++ side during inference, allowing multiple Python threads to call run() concurrently. For multi-model inference, the runtime automatically schedules each model’s inference task in parallel using multiple threads to improve throughput.
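The C-contiguous conversion mentioned above happens inside the runtime, but it can help to see what the property means on the NumPy side. The following is a minimal sketch using only NumPy; `ensure_c_contiguous` is a hypothetical helper for illustration and is not part of hbm_runtime.

```python
import numpy as np


def ensure_c_contiguous(arr: np.ndarray) -> np.ndarray:
    """Hypothetical helper: return arr as-is if C-contiguous, else a C-contiguous copy."""
    if arr.flags["C_CONTIGUOUS"]:
        return arr  # no copy needed
    return np.ascontiguousarray(arr)  # allocates and copies


# A freshly allocated array is C-contiguous.
base = np.zeros((1, 224, 224, 3), dtype=np.uint8)

# Transposing returns a view with permuted strides, which is no longer C-contiguous.
transposed = base.transpose(0, 3, 1, 2)

fixed = ensure_c_contiguous(transposed)
print(base.flags["C_CONTIGUOUS"], transposed.flags["C_CONTIGUOUS"], fixed.flags["C_CONTIGUOUS"])
```

Preparing inputs this way once, outside a hot loop, is one way to avoid paying the copy on every call; whether that matters in practice depends on input size and call frequency.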
Use Cases
- Quickly integrate and call hbm_runtime capabilities in Python environments.
- Applications with high demands on inference efficiency and scheduling flexibility, such as robot vision and intelligent edge computing.
- Scenarios that need to load and manage multiple models simultaneously and configure task scheduling parameters (priority, core binding, device ID, etc.) per inference call as needed.
- Scenarios that need to query compile-time BPU information (for example, compile-time BPU core count) to assist runtime resource configuration and consistency checks.
Key Features
- Multi-model support
  - Supports loading a single model or a group of multiple models; each model can independently expose input/output metadata and run inference.
  - run() supports one-shot inference over multi-model inputs and returns results keyed by model name (even for a single model, the nested structure {model_name: {...}} is returned).
- Flexible input formats
  - Single input (numpy.ndarray).
  - Single-model multi-input dict (Dict[str, np.ndarray]; keys are input tensor names).
  - Multi-model multi-input structure (Dict[str, Dict[str, np.ndarray]]; outer keys are model names, inner keys are input tensor names).
  - All inputs are automatically checked for C-contiguous memory layout and copied when necessary to ensure efficient and correct low-level access (non-contiguous inputs may incur extra copy overhead).
- Scheduling parameter configuration: default parameters + per-call overrides (run-local)
  - Set model-level default scheduling parameters via set_scheduling_params(...) (persisted inside the runtime and reusable across calls).
  - Optionally override scheduling on each run() call; overrides take precedence over defaults for that call only and do not affect other threads or other run() invocations.
- Multi-threaded inference
  - Concurrent run() from multiple Python threads: the GIL is released inside C++ during inference so multiple Python threads can issue inference calls simultaneously.
  - Parallel multi-model inference: when the input is a multi-model structure, the runtime launches a thread per model to run inference in parallel (multi-threaded launch), which can improve throughput on multi-core BPU systems; a single-model case uses one inference thread.
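The "defaults plus per-call overrides" behaviour described in the list above can be pictured as a non-mutating dictionary merge. This is only a conceptual sketch in plain Python: `effective_params` and the model name `model_a` are hypothetical, and the runtime's internal bookkeeping may differ.

```python
from typing import Dict, Optional


def effective_params(
    defaults: Dict[str, dict],
    overrides: Optional[Dict[str, dict]] = None,
) -> Dict[str, dict]:
    """Hypothetical helper: merge per-call overrides over stored defaults.

    Neither argument is modified, mirroring the idea that an override
    applies to a single call and leaves the defaults intact.
    """
    merged = {name: dict(values) for name, values in defaults.items()}
    for name, values in (overrides or {}).items():
        merged.setdefault(name, {}).update(values)
    return merged


# Defaults, as they might be stored after set_scheduling_params(...).
defaults = {"priority": {"model_a": 5}, "bpu_cores": {"model_a": [0]}}

# One call overrides only the core binding.
this_call = effective_params(defaults, {"bpu_cores": {"model_a": [1]}})
print(this_call)   # priority comes from the defaults, bpu_cores from the override
print(defaults)    # unchanged
```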
Installation
The hbm_runtime module is a high-performance inference runtime Python interface implemented in C++. It depends on pybind11 and Horizon’s underlying inference libraries (such as libdnn, libhbucp, etc.). It can be installed via system DEB packages (.deb) and supports Python 3.10 and above.
System Dependencies
| Dependency | Minimum Version | Description |
|---|---|---|
| Python | ≥ 3.10 | Python 3.10 is recommended |
| pip | ≥ 22.0 | Required for installing wheel packages |
| pybind11 | any | Used at build time; not required when installing the package |
| scikit-build-core | ≥ 0.7 | Used when building wheel packages (source builds only) |
| Horizon base libraries | platform-specific | e.g. libdnn.so, libucp.so, usually provided by the BSP |
Building Wheel Packages
There are three ways to build a wheel package, described below.
Build During DEB Installation
The hobot-dnn package install process includes building the hbm_runtime wheel. After the DEB install completes, the hbm-runtime whl package is generated.
# Install from apt source
sudo apt-get install hobot-dnn
# Install from a local deb package (package names vary by build; use your actual filename)
sudo dpkg -i hobot-dnn_4.0.4-20250909195426_arm64.deb
# After installation, find the wheel under /tmp on the board
ls /tmp
# Whl package names vary by version; x.x.x stands for the version
# hbm_runtime-x.x.x-cp310-cp310-manylinux_2_34_aarch64.whl
Build During System Image Compilation
When building the system software image, the hobot-dnn deb is installed; during that install the hbm-runtime whl is built and copied to out/product/deb_packages.
sudo ./pack_image.sh
ls out/product/deb_packages
# Whl package names vary by version; x.x.x stands for the version
# hbm_runtime-x.x.x-cp310-cp310-manylinux_2_34_aarch64.whl
Build on Device
# Enter the hbm_runtime source tree
cd /usr/hobot/lib/hbm_runtime
# Run the build script
./build.sh
# List built wheel packages
ls dist/
# Whl package names vary by version; x.x.x stands for the version
# hbm_runtime-x.x.x-cp310-cp310-manylinux_2_34_aarch64.whl
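The wheel filenames shown above follow the standard wheel naming scheme (name, version, Python tag, ABI tag, platform tag). The sketch below splits such a name into its parts; `parse_wheel_name` is a hypothetical helper, and the version `1.0.0` is a placeholder rather than an actual release number.

```python
def parse_wheel_name(filename: str) -> dict:
    """Hypothetical helper: split a wheel filename into its standard components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # 5 fields normally; 6 when an optional build tag follows the version.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,      # cp310 = CPython 3.10
        "abi": abi_tag,
        "platform": platform_tag,  # manylinux_2_34 = glibc 2.34 or newer
    }


# "1.0.0" is a placeholder version for illustration only.
info = parse_wheel_name("hbm_runtime-1.0.0-cp310-cp310-manylinux_2_34_aarch64.whl")
print(info)
```

A wheel tagged cp310-cp310 is built for CPython 3.10; a build against another interpreter version would carry different tags.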
Installation Methods
Using a Wheel Package
You can use either of the following wheel install methods.
- Install from a local wheel package
  - Locate the .whl file built in the Building Wheel Packages section.
  # Example: install local whl with pip (package names vary by version; x.x.x stands for the version)
  pip install hbm_runtime-x.x.x-cp310-cp310-manylinux_2_34_aarch64.whl
- Install from PyPI
  pip install hbm_runtime
Using a .deb Package
You can use either of the following deb install methods.
- Install from a local DEB package
  # Example: install DEB package (package names vary by build; use your actual filename)
  sudo dpkg -i hobot-dnn_4.0.2-20250714201215_arm64.deb
- Install from apt source
  sudo apt-get install hobot-dnn
- FAQ
  - If files are not updated after a .deb install, check whether other packages block the upgrade (for example, an older hobot-spdev).
  - Use dpkg -L hobot-dnn to verify deployed files.
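Besides inspecting deployed files with dpkg, the installation can also be checked from Python. The sketch below uses only the standard library; `installed_version` is a hypothetical helper, and it assumes the importable module is named hbm_runtime as in the examples in this document.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def installed_version(module_name: str, dist_name: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper.

    Returns None if the module cannot be found, the distribution version if
    package metadata is available, and "unknown" if the module is importable
    but no matching distribution metadata exists.
    """
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        return "unknown"


print("hbm_runtime:", installed_version("hbm_runtime"))
```

A result of None suggests the wheel was never installed into the active interpreter, for example because pip and python point at different environments.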
Uninstallation
- Uninstall pip-installed package:
  pip uninstall hbm_runtime
- Uninstall deb-installed package:
  sudo apt remove hobot-dnn
Quick Start
This section shows how to load models and run inference with hbm_runtime. A few lines of code are enough to run a model and obtain outputs.
Prerequisites
Ensure hbm_runtime is installed correctly (see Installation) and that you have an HBM model file (.hbm).
Examples
Single-Threaded Inference
Single-Threaded, Single-Model, Single-Input Inference
For models with a single input tensor.
import numpy as np
from hbm_runtime import HB_HBMRuntime
# Load model
model = HB_HBMRuntime("/opt/hobot/model/s600/basic/lanenet256x512.hbm")
# Get model name and input name
model_name = model.model_names[0]
input_name = model.input_names[model_name][0] # Assume single input
# Get shape for this input
input_shape = model.input_shapes[model_name][input_name]
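# Build numpy input (float32 is assumed here; check model.input_dtypes for the actual type)
input_tensor = np.ones(input_shape, dtype=np.float32)
# Run inference
outputs = model.run(input_tensor)
# Results are nested as {model_name: {output_name: ndarray}}
output_array = outputs[model_name]
print("Output:", output_array)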
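The bare-array call above is one of three accepted input shapes. To make the relationship between them concrete, the sketch below lifts each shape into the fully nested {model_name: {input_name: ndarray}} form. It illustrates the data layout only; `normalize_inputs` is hypothetical and is not how the runtime is implemented.

```python
from typing import Dict, List

import numpy as np

Nested = Dict[str, Dict[str, np.ndarray]]


def normalize_inputs(inputs, model_name: str, input_names: List[str]) -> Nested:
    """Hypothetical helper: lift any of the three input shapes to the nested form."""
    # Shape 1: a bare ndarray, valid only for a single-input model.
    if isinstance(inputs, np.ndarray):
        if len(input_names) != 1:
            raise ValueError("a bare ndarray requires a single-input model")
        return {model_name: {input_names[0]: inputs}}
    if not isinstance(inputs, dict) or not inputs:
        raise TypeError("inputs must be an ndarray or a non-empty dict")
    first_value = next(iter(inputs.values()))
    # Shape 2: {input_name: ndarray} for one model.
    if isinstance(first_value, np.ndarray):
        return {model_name: dict(inputs)}
    # Shape 3: already {model_name: {input_name: ndarray}}.
    return {name: dict(tensors) for name, tensors in inputs.items()}


x = np.ones((1, 3), dtype=np.float32)
a = normalize_inputs(x, "m", ["in0"])
b = normalize_inputs({"in0": x}, "m", ["in0"])
c = normalize_inputs({"m": {"in0": x}}, "m", ["in0"])
print(a.keys(), b.keys(), c.keys())
```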
Single-Threaded, Single-Model, Multi-Input Inference
For models with multiple input tensors.
import numpy as np
from hbm_runtime import HB_HBMRuntime
hb_dtype_map = {
    "U8": np.uint8,
    "S8": np.int8,
    "F32": np.float32,
    "F16": np.float16,
    "U16": np.uint16,
    "S16": np.int16,
    "S32": np.int32,
    "U32": np.uint32,
    "BOOL8": np.bool_,
}
# Load model
model = HB_HBMRuntime("/opt/hobot/model/s600/basic/yolov5x_672x672_nv12.hbm")
# Get model name (assume one model loaded)
model_name = model.model_names[0]
# Prepare input names and shapes
input_names = model.input_names[model_name]
input_shapes = model.input_shapes[model_name]
input_dtypes = model.input_dtypes[model_name]
# Build input dict
input_tensors = {}
for name in input_names:
    shape = input_shapes[name]
    np_dtype = hb_dtype_map.get(input_dtypes[name].name, np.float32)  # fall back to float32
    input_tensors[name] = np.ones(shape, dtype=np_dtype)
# Optional: set default inference priority and BPU cores
priority = {model_name: 5}
bpu_cores = {model_name: [0]}
model.set_scheduling_params(
    priority=priority,
    bpu_cores=bpu_cores,
)
# Run inference; priority and bpu_cores can also be passed to run() to override the defaults for this call
results = model.run(input_tensors)
# Print outputs
for output_name, output_data in results[model_name].items():
    print(f"Output: {output_name}, shape={output_data.shape}")
Single-Threaded, Multi-Model, Multi-Input Inference
For multiple models each with multiple inputs. “Multi-model” can mean several HBM files or several models inside one HBM file.
"""Multi-model inference quick start."""
import numpy as np
from hbm_runtime import HB_HBMRuntime
MODEL_PATHS = [
    "/opt/hobot/model/s600/basic/yolov5x_672x672_nv12.hbm",
    "/opt/hobot/model/s600/basic/resnet18_224x224_nv12.hbm",
]
DTYPE_MAP = {
    "U8": np.uint8, "S8": np.int8,
    "F16": np.float16, "F32": np.float32,
}
# Load models
rt = HB_HBMRuntime(MODEL_PATHS)
# Build inputs from model metadata
# (random values in [0, 1) become all zeros when cast to an integer dtype)
inputs = {
    m: {
        inp: np.random.rand(*rt.input_shapes[m][inp]).astype(
            DTYPE_MAP.get(rt.input_dtypes[m][inp].name, np.float32)
        )
        for inp in rt.input_names[m]
    }
    for m in rt.model_names
}
# Optional: default scheduling params
rt.set_scheduling_params(
    priority={m: 5 for m in rt.model_names},
    bpu_cores={m: [0] for m in rt.model_names},
)
# Run inference (multi-model, parallel internally)
outputs = rt.run(inputs)
# Print results
for m, outs in outputs.items():
    print(f"[{m}]")
    for name, arr in outs.items():
        print(f"  {name}: {arr.shape}, {arr.dtype}")
Multi-Threaded Inference
Multi-Threaded, Single-Model, Single-Input Inference
For models with a single input tensor.
import threading
import numpy as np
from hbm_runtime import HB_HBMRuntime
# Load model
model = HB_HBMRuntime("/opt/hobot/model/s600/basic/asr.hbm")
model_name = model.model_names[0]
input_name = model.input_names[model_name][0]
input_shape = model.input_shapes[model_name][input_name]
# Shared input (read-only)
input_tensor = np.ones(input_shape, dtype=np.float32)
def worker(core_id: int):
    outputs = model.run(
        input_tensor,
        model_name=model_name,
        priority={model_name: 5},
        bpu_cores={model_name: [core_id]},
        custom_id={model_name: core_id},  # optional
    )
    # Print minimal info
    outs = outputs[model_name]
    first_name, first_arr = next(iter(outs.items()))
    print(f"[T{core_id}] {first_name}: shape={first_arr.shape}, dtype={first_arr.dtype}")

# Launch 4 threads, one per BPU core (assumes cores 0~3 exist on the device)
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Multi-Threaded, Single-Model, Multi-Input Inference
For models with multiple input tensors.
import threading
import numpy as np
from hbm_runtime import HB_HBMRuntime
hb_dtype_map = {
    "U8": np.uint8, "S8": np.int8,
    "F16": np.float16, "F32": np.float32,
    "U16": np.uint16, "S16": np.int16,
    "U32": np.uint32, "S32": np.int32,
    "BOOL8": np.bool_,
}
# Load single model
model = HB_HBMRuntime("/opt/hobot/model/s600/basic/yolov5x_672x672_nv12.hbm")
model_name = model.model_names[0]
# Build input tensors (shared, read-only)
input_tensors = {
    name: np.ones(
        model.input_shapes[model_name][name],
        dtype=hb_dtype_map.get(model.input_dtypes[model_name][name].name, np.float32),
    )
    for name in model.input_names[model_name]
}

def worker(core_id: int):
    results = model.run(
        input_tensors,
        model_name=model_name,
        priority={model_name: 5},
        bpu_cores={model_name: [core_id]},
        custom_id={model_name: core_id},  # optional, for tracing
    )
    out_name, out_arr = next(iter(results[model_name].items()))
    print(f"[T{core_id}] {out_name}: {out_arr.shape}, {out_arr.dtype}")

# Launch 4 threads, bind to BPU cores 0~3
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
Multi-Threaded, Multi-Model, Multi-Input Inference
"""4-thread demo: each thread runs inference on a dedicated BPU core."""
import threading
import numpy as np
from hbm_runtime import HB_HBMRuntime
MODEL_PATHS = [
    "/opt/hobot/model/s600/basic/yolov5x_672x672_nv12.hbm",
    "/opt/hobot/model/s600/basic/resnet18_224x224_nv12.hbm",
]
DTYPE_MAP = {
    "U8": np.uint8, "S8": np.int8,
    "F16": np.float16, "F32": np.float32,
}
rt = HB_HBMRuntime(MODEL_PATHS)
# Build one shared input package (read-only in each thread)
inputs = {
    m: {
        inp: np.random.rand(*rt.input_shapes[m][inp]).astype(
            DTYPE_MAP.get(rt.input_dtypes[m][inp].name, np.float32)
        )
        for inp in rt.input_names[m]
    }
    for m in rt.model_names
}

def worker(core_id: int):
    # Per-run scheduling override: bind this run to a specific BPU core
    outputs = rt.run(
        inputs,
        priority={m: 5 for m in rt.model_names},
        bpu_cores={m: [core_id] for m in rt.model_names},
        custom_id={m: core_id for m in rt.model_names},  # optional, for tracing
    )
    # Print one line per model to keep it simple
    for m, outs in outputs.items():
        first_out = next(iter(outs.values()))
        print(f"[T{core_id}][{m}] first_out: {first_out.shape}, {first_out.dtype}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
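The threading examples above print results inside each worker. When the caller needs the outputs back, a thread pool from the standard library is one alternative. The sketch below is generic: `run_on_cores` is a hypothetical helper, and a stand-in function replaces the real inference call so the snippet runs without a device.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def run_on_cores(run_fn: Callable[[int], T], core_ids: Iterable[int]) -> Dict[int, T]:
    """Hypothetical helper: call run_fn(core_id) once per core in parallel threads."""
    cores = list(core_ids)
    if not cores:
        return {}
    with ThreadPoolExecutor(max_workers=len(cores)) as pool:
        futures = {core: pool.submit(run_fn, core) for core in cores}
        # result() re-raises any exception that occurred in the worker thread.
        return {core: future.result() for core, future in futures.items()}


# Stand-in for an inference call. With the runtime loaded as in the example
# above, run_fn could instead be something like:
#   lambda c: rt.run(inputs, bpu_cores={m: [c] for m in rt.model_names})
def fake_inference(core_id: int) -> int:
    return core_id * 2


results = run_on_cores(fake_inference, range(4))
print(results)
```

Compared with bare threading.Thread, this returns each call's result to the caller and surfaces worker exceptions instead of losing them in the thread.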
FAQ
| Question | Answer |
|---|---|
| How do I get model names? | Use model.model_names to list loaded model names. |
| How do I confirm input dimensions and types? | Use model.input_shapes and model.input_dtypes. |
| How do I assign BPU cores? | Use the bpu_cores parameter with values such as [0, 1, 2, 3]; actual support depends on hardware. |
For advanced usage (multi-input models, reading quantization parameters, etc.), see the API Reference.
Module, Class, and Function Reference (API Reference)
The Python module hbm_runtime is a Horizon HBM model inference interface wrapped with pybind11 and built on libdnn and libhbucp. It provides unified model loading, input/output metadata queries, and inference execution, and supports multi-model loading, multi-input inference, per-model selection, BPU core binding, and inference task priority.
Enumerations
hbDNNDataType
Tensor data type enumeration:
- S4: 4-bit signed
- U4: 4-bit unsigned
- S8: 8-bit signed
- U8: 8-bit unsigned
- F16: 16-bit float
- S16: 16-bit signed
- U16: 16-bit unsigned
- F32: 32-bit float
- S32: 32-bit signed
- U32: 32-bit unsigned
- F64: 64-bit float
- S64: 64-bit signed
- U64: 64-bit unsigned
- BOOL8: 8-bit bool type
- MAX: maximum value (reserved)
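The Quick Start examples map a handful of these type names to NumPy dtypes and silently fall back to float32. A stricter variant, sketched below, covers every enumeration member that has a direct NumPy equivalent and raises for the rest. `HB_TO_NUMPY` and `to_numpy_dtype` are hypothetical names; how the runtime exposes S4 and U4 tensors to Python is not stated in this document, so they are deliberately left unmapped.

```python
import numpy as np

# Enumeration member name -> NumPy dtype. S4/U4 are omitted because NumPy has
# no native 4-bit integer type; MAX is a reserved sentinel, not a data type.
HB_TO_NUMPY = {
    "S8": np.int8, "U8": np.uint8,
    "S16": np.int16, "U16": np.uint16,
    "S32": np.int32, "U32": np.uint32,
    "S64": np.int64, "U64": np.uint64,
    "F16": np.float16, "F32": np.float32, "F64": np.float64,
    "BOOL8": np.bool_,
}


def to_numpy_dtype(hb_name: str) -> np.dtype:
    """Hypothetical helper: resolve a type name or raise instead of guessing."""
    try:
        return np.dtype(HB_TO_NUMPY[hb_name])
    except KeyError:
        raise ValueError(f"no NumPy dtype mapped for {hb_name!r}") from None


print(to_numpy_dtype("F16"), to_numpy_dtype("S16").itemsize)
```

Raising on unmapped names trades convenience for safety: a wrong fallback dtype would otherwise produce an input buffer of the wrong size without any visible error at the point of construction.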