
Supported Operator Lists and Restrictions

Limitations and Notes

This section lists the Caffe and ONNX operators supported by the D-Robotics processors. Operators not listed here are currently unsupported due to BPU hardware limitations.

Terminology:

  • BPU Acceleration: Operators that the D-Robotics Processor can accelerate under certain constraints; if not met, they will be computed on the CPU.
  • CPU Computation: Operators already optimized on D-Robotics's ARM CPU, supporting ONNX opsets 10 and 11.
  • CPU Computation※ (also shown as "(*)" in the tables below): CPU operators that have not yet been integrated.

Additional Considerations:

  • For all BPU operators on RDK X3, there is a general restriction: input_batch ≤ 128.

  • For all BPU operators on RDK Ultra, the following restrictions apply:

    1. Input and output dimensions must be 4D; support for non-four-dimensional ops is indicated explicitly.
    2. Shape: H, W, C ∈ [1, 65536], N ≤ 4096; and N x C x H x W ≤ 1GB.
    3. Supports Caffe 1.0 base operators and common extended operators, as well as ONNX opsets 10 and 11. Ops that do not meet the BPU acceleration constraints fall back to the ARM CPU.
  • Operators such as Cast, Constant, Dropout, Reshape, Squeeze, Unsqueeze, and Shape cannot run directly on the BPU, but the algorithm toolchain can still support them in some cases through graph optimizations (e.g., constant folding).

  • Operators marked as PyTorch are not part of the official ONNX opset 11; D-Robotics's algorithm toolchain provides a script that exports them from PyTorch as custom ONNX ops (see the export sketch after this list).

  • The tensorflow-onnx conversion tool (https://github.com/onnx/tensorflow-onnx) supports converting TensorFlow 1.x operators to stable ONNX opsets 6-11; TensorFlow 2.x support is still experimental.

  • Quantization Details: An operator that meets the BPU constraints may still run on the CPU if it is a passively quantized OP. The algorithm toolchain designs the quantization logic based on each OP's computation characteristics and the BPU's low-level logic. For more information on active, passive, and manual quantization, see the "Quantization Logic in Algorithm Toolchain" chapter.
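As noted above, the toolchain consumes ONNX models at opset 10 or 11, and PyTorch models are usually exported to ONNX first. The snippet below is a minimal export sketch, not the toolchain's own script: it assumes PyTorch and torchvision are installed, and the model, input shape, and file name are placeholders.

```python
# Minimal sketch: export a PyTorch model to ONNX opset 11 before conversion.
# torchvision's resnet18 and the tensor shapes here are placeholders only.
import torch
import torchvision

model = torchvision.models.resnet18().eval()
dummy_input = torch.randn(1, 3, 224, 224)  # NCHW, batch size 1

torch.onnx.export(
    model,
    dummy_input,
    "resnet18_opset11.onnx",
    opset_version=11,              # the operator tables assume opset 10 or 11
    input_names=["data"],
    output_names=["output"],
)
```

Operators marked as PyTorch (such as GridSample) are not covered by this plain export and require the custom export script mentioned above.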

RDK X3 Supported Caffe Operators

Caffe Operator Name | CPU Computation / BPU Acceleration | X3 BPU Constraints | CPU Constraints
ConvolutionBPU AccelerationKernel Size: HxW = [1, 7]x[1, 7] for BPU
Input/output channel limit (per group): <= 2048 (relaxed to <= 4096 for ordinary convolutions that are not dilated, grouped, or depthwise)
No stride limit
Dilation: only powers of 2 allowed, divisible by stride
h_dilated <= w_dilated
Total kernel size: HxWxC <= 32768
axis not supported (default: 1)
4D Conv only
auto_pad not supported
Type constraints: float, int32, int8
Pads constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart == Hend and Wstart == Wend
DeconvolutionBPU AccelerationKernel Size: HxW = [2, 14]x[2, 14]
Channel limit: C <= 2048
Padding: HxW = [0, (Kernel_H-1)/2]x[0, (Kernel_W-1)/2]
Stride: Stride ∈ {2, 4}, stride_h ≤ stride_w
Dilation: (1, 1)
No axis support
output_shape and output_padding unsupported
auto_pad: NOTSET only
No axis support
MaxUnpoolCPU Computing---from_type constraints: X - float only, I - Tensor(int64)
to_type constraints: float only
PoolingBPU AccelerationFour types: MaxPooling, AveragePooling, GlobalMaxPooling, GlobalAveragePooling
Constraints:
MaxPooling: Kernel Size = [1, 64]x[1, 64], Stride = [1, 185], Padding >= 0
AveragePooling: HxW = [1, 7]x[1, 7], Stride ∈ [1, 185]
GlobalAveragePooling: HxW <= 8192 for NCHW input
GlobalMaxPooling: HxW = [1, 1024]x[1, 1024] for NCHW input
None
SPPCPU ComputingNot supportedpyramid_height: 2^n pooling, n < 7
pooling kernel size <= 255
pool option: 1
InnerProductBPU AccelerationConverted to Conv
Constraints:
For 4D NCHW input, if both H and W are <= 7, the constraints are the same as Conv
H = W = 1: C limit <= 16384; otherwise, C limit <= 2048
Low-precision int8 output after BPU node: H x W/8 x C/4 ≤ 1024
High-precision int32 output: H x W/8 x C/4 < 2048
No axis support
None
LRNCPU ComputingNot supportedlocal_size supported
alpha, beta supported
norm_region: ACROSS_CHANNELS, WITHIN_CHANNEL (optional)
k supported
MVNCPU ComputingNot supported normalize_variance: {0, 1} (optional)
across_channels: {0, 1} (optional)
Float32 only
BatchNormBPU AccelerationUnlimitedNone
ELUCPU ComputingNot supportedNone
BNLLCPU ComputingNot supportedNone
PReLUBPU AccelerationUnlimitedNone
ReLU/LeakyReLuBPU AccelerationUnlimitedNone
SigmoidBPU AccelerationFor 1CHW tensor: min(8W4C-aligned shape, 32C-aligned shape) ≤ 8192
8W4C: pad W to multiples of 8, C to multiples of 4
32C: pad C to multiples of 32
Use the smaller aligned shape
None
TanHBPU AccelerationUnlimitedNone
EltwiseBPU AccelerationOperation supports Add and Mul, no Sub
Add: M ≤ 2048 channels
Supported cases:
1. NCHW vs NCHW
2. NCHW vs NC11 (inputs must be op outputs)
Mul: Both inputs must be 4D, C ≤ 2048
Supported shapes:
1. (1xCxHxW vs 1xCxHxW)
2. (1xCxHxW vs 1xCx1x1)
3. (1xCxHxW vs 1x1x1x1)
None
BiasBPU AccelerationRefer to Eltwise (Add) constraintsNone
ScaleBPU AccelerationRefer to Eltwise (Mul) constraintsNone
AbsValCPU ComputingNot supportedNone
ExpBPU AccelerationUnlimitedNone
LogCPU ComputingNot supportedNone
PowerBPUUnlimitedNone
ThresholdCPUNot supportedNone
ReductionCPUNot supportedOperation supports SUM, ASUM, SUMSQ, MEAN.
Axis supports.
Only supports Float32 calculations.
SoftmaxCPUNot supportedNone
ArgMaxBPUOnly supports axis=1 and c<=64.
Does not support top_k != 1
None
ConcatBPUInput/Output Channel: C<=2048None
SplitBPUUnlimitedNone
SliceBPUUnlimitedNone
ReshapeCPUNot supported (can be fused in some scenarios)Shape supports up to [1,4] shape_dim configurations.
Axis supports [-4,3]. No support for N dimensions. Default value is 0, follows Caffe rules.
FlattenCPUNot supported (can be fused in some scenarios)Axis range [-4,3], default is 1, -4 and 0 have the same meaning. Only supports End_axis == -1.
CropCPUNot supportedNone
DropoutBPUUnlimitedNone
LSTMBPUOnly supports batch=1--
NormalizeCPUNot supportedType constraint: only supports float type.
PassThroughBPUSupports mode=DCR and mode=CRD.
Only supports rearrangement in H and W directions with blocksize=2.
Type constraint: only supports float type.
CReLUCPUNot supportedType constraint: only supports float type.
RReLUCPUNot supportedNone
PermuteCPUNot supported- Supports nhwc2nchw, perm: [0, 3, 1, 2].
- Supports nchw2nhwc, perm: [0, 2, 3, 1].
- Supports specified perm dimension conversions, data types: float, int8, int32.
MatMulBPUFor the case where the two inputs are a featuremap and a weight (i.e., a featuremap multiplied by a constant), the following shapes can be optimized to run on the BPU:
- K vs KxN, K vs 1xKxN, K vs 1x1xKxN
- MxK vs K, MxK vs KxN, MxK vs 1x1xKxN
- 1xMxK vs K, 1xMxK vs 1xKxN
- 1x1xMxK vs K, 1x1xMxK vs 1xKxN, 1x1xMxK vs 1x1xKxN
- BxMxK vs KxN (B>=1)
- 1xBxMxK vs KxN (B>=1)
- AxBxMxK vs KxN (A>1, B>1)
For the reverse scenario (operands in the opposite order):
- 1xBxMxK vs 1x1xKxN (B>1)
For the case where both inputs are featuremaps (two featuremaps multiplied together):
- 1xBxMxK vs 1x1xKxN (B>=1)
Type constraint: only supports float type.
UpsampleBPUInput featuremap must be 4D NCHW, resize only on H and W dimensions, factors must be 2^N.
Supports different factors for H and W, but H_factor <= W_factor required.
None
ROIPoolingCPUNot supportedNone
PSROIPoolingCPUNot supportedNone

RDK X3 Supported ONNX Operators

ONNX Operator Name | CPU Computation / BPU Acceleration | X3 BPU Constraints | CPU Constraints
AbsCPU Calculation--Type constraint: only supports float type.
AcosCPU Calculation--Type constraint: only supports float type.
AcoshCPU Calculation--Type constraint: only supports float type.
AddBPU AccelerationM <= 2048, supported cases:
1. NCHW and NCHW shapes for both inputs.
2. NCHW and NC11 shapes (both inputs need to be outputs of other ops).
3. Integrated into the previous conv in ResNet's shortcut structure for acceleration.
- Supports same shape inputs calculation.
- Supports scalar input 1 or input 2 calculation.
- Supports broadcast calculation with a max dimension of 5.
AndCPU Calculation--- Supports same shape inputs calculation.
- Supports scalar input 1 or input 2 calculation.
- Supports broadcast calculation with a max dimension of 5.
ArgMaxBPU Acceleration1. Four-dimensional input (NCHW).
2. Only supports argmax along the C dimension (axis=1).
3.C <= 64
Type constraint: only supports float type.
ArgMinCPU Calculation--Type constraint: only supports float type.
AsinCPU Calculation--Type constraint: only supports float type.
AsinhCPU Calculation--Type constraint: only supports float type.
AtanCPU Calculation--Type constraint: only supports float type.
AtanhCPU Calculation--Type constraint: only supports float type.
AveragePoolBPU AccelerationKernel HxW: [1, 7]x[1, 7], Stride ∈ [1, 185]auto_pad attribute not supported.
Only supports four-dimensional tensors.
BatchNormalizationBPU AccelerationOptimized to fuse with previous convType constraint: only supports float type.
Supports channel-first data layout (dim=1).
BitShiftCPU Calculation(*)----
CastCPU Calculation--from_type supports double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8.
to_type supports double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8.
CeilCPU Calculation--Type constraint: only supports float type.
ClipBPU AccelerationUnlimitedType constraint: only supports float type.
When two inputs are provided, the second input is treated as min.
CompressCPU Calculation(*)----
ConcatBPU AccelerationInput/Output Channel: C<=2048--
ConcatFromSequenceCPU Calculation(*)----
ConstantBPU AccelerationOptimized through constant foldingNo support for sparse_tensor attribute.
Type constraint: only supports float type.
ConstantOfShapeBPU AccelerationOptimized through constant foldingSupported types: float, int32, int8.
ConvBPU AccelerationKernel HxW: [1, 7]x[1, 7].
Input/output Channel (for one group): <= 2048 (relaxed to <= 4096 for ordinary convolutions that are not dilated, grouped, or depthwise).
Stride: Unrestricted (except stride=1 for Conv followed by Add in ResNet shortcut-connecting).
Dilation: Only powers of 2 allowed, divisible by stride.
h_dilated ≤ w_dilated.
Total kernel size limit: HxWxC ≤ 32768
Only supports 4D Convolution.
auto_pad attribute not supported.
Type constraint: float, int32, int8.
Pads constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart==Hend and Wstart==Wend.
ConvIntegerCPU Calculation(*)----
ConvTransposeBPU AccelerationKernel HxW: [2, 14]x[2, 14].
Input/output Channel: C <= 2048.
Padding HxW: [0,(Kernel_H-1)/2]x[0,(Kernel_W-1)/2].
Stride ∈ {2, 4}.
stride_h ≤ stride_w.
Dilation: (1, 1) only
auto_pad attribute not supported.
Type constraint: float, int32, int8.
CosBPU AccelerationLimited to CxHxW <= 8192 for 1CHW tensorType constraint: only supports float type.
CoshCPU Calculation--Type constraint: only supports float type.
CumSumCPU Calculation----
DepthToSpaceBPU accelerationSupports DCR and CRD modes. Only supports rearrangement along H and W dimensions with blocksize=2.
- from_type: only float types allowed.
- 4D Tensor computation only.
- to_type: only float types allowed.
- 4D Tensor computation only.
DequantizeLinearCPU computation------
DetCPU computation※------
DivBPU acceleration1. Supports featuremap inputs only (no constant inputs).
2. Input shape constraints refer to Mul operator.
- Same input shape supported.
- Supports scalar input1 or input2.
- Broadcast calculation up to 5 dimensions.
DropoutBPU accelerationNot computed in inference, removed by optimization.--
EinsumCPU computation※------
EluCPU computation--Type constraint: only float types.
EqualCPU computation--- Same input shape supported.
- Supports scalar input1 or input2.
- Broadcast calculation up to 5 dimensions.
ErfCPU computation--Type constraint: supports float and double types.
ExpBPU acceleration--Type constraint: only float types.
ExpandCPU computation----
EyeLikeCPU computation----
FlattenCPU computation----
FloorCPU computation--Type constraint: only float types.
GRUCPU computation--- direction attribute supports forward only.
- Type constraint: only float types.
- Input count must be 3, 4, or 6.
- Output count is 2.
GatherCPU computation--from_type:
- input: types supported: float, int64, int32, int8, uint64, uint32, uint8.
- indices: type supported: int32, int64.
- to_type: types supported: float, int64, int32, int8, uint64, uint32, uint8.
GatherElementsCPU computation----
GatherNDCPU computation--from_type:
- input: types supported: float, int32, int8.
- indices: tensor(int64).
- to_type: types supported: float, int32, int8.
GemmBPU accelerationConverted to Conv implementation.
- If both H and W are <= 7, the constraints are the same as Conv.
- C <= 16384 if H/W = 1; otherwise, C <= 2048.
- Low-precision int8 output if followed by BPU-supported node: H x W/8 x C/4 <= 1024.
- High-precision int32 output if followed by non-BPU-supported node: H x W/8 x C/4 < 2048.
- Type constraint: only float types.
GlobalAveragePoolBPU accelerationInput HxW must be <= 8192 for NCHW shape.--
GlobalLpPoolCPU computation--Type constraint: supports float and double types.
- 4D Tensor computation only.
GlobalMaxPoolBPU accelerationInput HxW range: [1, 1024]x[1, 1024] for NCHW shape.Type constraint: only float types.
- 4D Tensor only.
GreaterCPU computation--- Same input shape supported.
- Supports scalar input1 or input2.
- Broadcast calculation up to 5 dimensions.
HardSigmoidCPU computation--Type constraint: only float types.
HardmaxCPU computation※----
IdentityCPU computation----
IfCPU computation※----
InstanceNormalizationCPU Calculation
IsInfCPU CalculationOnly supports float type.
IsNaNCPU CalculationOnly supports float type.
LRNCPU CalculationOnly supports 4D Tensors and float type.
LSTMBPU AcceleratedSupports batch_size=1 only.No attribute settings supported. Only supports inputs of 3, 4, or 8, and outputs of 2. Float type only.
LeakyReluBPU AcceleratedN/AN/A
LessCPU CalculationSupports same input shapes, scalar input1 or input2, and broadcast up to 5 dimensions.
LessOrEqualCPU CalculationSame constraints as 'Less'.
LogCPU CalculationOnly supports float type.
LogSoftmaxCPU CalculationOnly supports float type.
LoopCPU Calculation
LpNormalizationCPU Calculationp-norm only supports 1 or 2, double or float type.
LpPoolCPU Calculationauto_pad not supported, double or float type, and 4D computation
MatMulIntegerCPU Calculation
MatMulBPU AcceleratedFor scenarios where the two inputs are a featuremap and a weight (i.e., a featuremap multiplied by a constant), the following shapes can be optimized to run on the BPU:
- K vs KxN, K vs 1xKxN, K vs 1x1xKxN
- MxK vs K, MxK vs KxN, MxK vs 1x1xKxN
- 1xMxK vs K, 1xMxK vs 1xKxN
- 1x1xMxK vs K, 1x1xMxK vs 1xKxN, 1x1xMxK vs 1x1xKxN
- BxMxK vs KxN (where B >= 1)
- 1xBxMxK vs KxN (where B >= 1)
- AxBxMxK vs KxN (where A > 1 and B > 1)
For situations where both inputs are featuremaps (i.e., two featuremaps multiplied together), the following can be optimized to run on the BPU:
- 1xBxMxK vs 1x1xKxN (where B >= 1)
Type constraint: only supports float type.
MaxCPU CalculationSupports multiple inputs; same input shapes, scalar inputs, and broadcast up to 5 dimensions.
MaxPoolBPU AcceleratedKernel size: [1, 64]x[1, 64], stride: [1, 185], padding >= 0, no dilation.Dilation only supports 1x1; data stored row-major; auto_pad and storage_order attributes not supported; 4D Tensors only.
MaxRoiPoolCPU Calculation
MeanCPU Calculation
MinCPU CalculationSame as 'Max'
ModCPU Calculation
MulBPU Accelerated4D inputs with C <= 2048, specific shape rules apply
1. (1xCxHxW vs 1xCxHxW)
2. (1xCxHxW vs 1xCx1x1)
3. (1xCxHxW vs 1x1x1x1)
Supports same input shapes, scalar inputs, and broadcast up to 5 dimensions; input values must not be 0.
MultinomialCPU Calculation
NegCPU Calculation
NonZeroCPU CalculationSupports float, int32, or int8 types, 1D or 4D computations
NotCPU Calculation
OneHotCPU----
OrCPU--Supports same input shape calculation.
Supports scalar inputs.
Broadcasting up to 5 dimensions.
PReluBPU- Type constraints: Only supports float type.
- from_type: X and slope.
- to_type: Y.
- X's shape is data_shape, slope's is slope_shape.
- data_shape == slope_shape.
- slope_shape.ProdSize() == 1.
- For 4D NCHW X: slope's N and C must equal X's, and slope's HxW may be 1x1, Hx1, or 1xW.
- Special case: 4D X and 3D slope, with data_shape[1] == slope_shape[0], slope_shape[1] == 1, slope_shape[2] == 1.
PadBPUSupports mode=Constant.
Only supports padding on H, W dimensions.
Pad-10:
- Type constraint: float only.
- 4D NCHW tensors.
- pads constraint: len(pads) == 8, pads[i] >= 0, pads[0] = pads[1] = pads[4] = pads[5] = 0.
Pad-11:
- from_type: data (float), pads (int64 tensor), optional constant_value (float).
- 4D tensor, 2D or 3D padding only.
- to_type: float only.
PowBPUSupports exponent as a single value.- Type constraints: double, float, int64, int32.
- Supports same shape, scalar inputs, and broadcasting up to 5 dimensions.
- X and Y must be of the same type.
QLinearConvCPU※----
QLinearMatMulCPU※----
QuantizeLinearCPU----
RNNCPU--- Type constraint: float only.
- direction attribute: forward only.
- Input constraints: X, W, R required, B, sequence_lens, initial_h unsupported.
- Output constraint: Y_h output, shape [num_directions, batch_size, hidden_size].
RandomNormalCPU※----
RandomNormalLikeCPU※----
RandomUniformCPU----
RandomUniformLikeCPU----
RangeCPUType constraints: float, int64, int32, int16.--
ReciprocalBPU----
ReduceL1CPU----
ReduceL2CPU----
ReduceLogSumCPU--Only supports float, double data types.
ReduceLogSumExpCPU--Type constraints: float, double.
ReduceMaxCPU--Axes support: 0, 1, or equal to input dimensions.
ReduceMeanBPUInput featuremap must be 4D, axes=[2, 3].Axes support: 0, 1, or equal to input dimensions.
ReduceMinCPU----
ReduceProdCPU----
ReduceSumCPU--Axes support: 0, 1, or equal to input dimensions.
ReduceSumSquareCPU--Axes support: 0, 1, or equal to input dimensions.
ReluBPU----
ReshapeCPU----
ResizeBPU1. Input must be NCHW 4D and only resize in H and W dimensions. ROI input supported in ONNX opset=11 (manual modification required for PyTorch models to add ROI input, which only accepts constant inputs and works with tf_crop_and_resize mode).
2. Mode supports nearest and linear.
3. Supports scaling up and down.
4. For nearest mode, scaling factors should be powers of 2 (e.g., 2, 4, 8, 16, 32) and H_factor must be less than or equal to W_factor.
5. coordinate_transformation_mode supports half_pixel, pytorch_half_pixel, asymmetric, align_corners, and tf_crop_and_resize. When using tf_crop_and_resize, ensure ROI input coordinates are integers.
resize-10:
- Used when the op has 2 inputs (opset 10).
- Input is a 4D Tensor.
resize-11:
- Used when the op has more than 2 inputs (opset 11).
- Input is a 4D Tensor.
- coordinate_transformation_mode supports half_pixel, asymmetric, align_corners, and pytorch_half_pixel for nearest and linear modes, and half_pixel only for cubic mode.
- extrapolation_value not supported.
ReverseSequenceCPU----
RoiAlignCPU----
RoundCPU----
ScanCPU※----
Scatter (deprecated)CPU※----
ScatterElementsCPU--from_type: float, int32, int8
indices: int32 only
updates: float, int32, int8
to_type: float, int32, int8
ScatterNDCPU--from_type: float, int32, int8
updates: float, int32, int8
to_type: float, int32, int8
SeluCPU--Only supports float types.
SequenceAtCPU※----
SequenceConstructCPU※----
SequenceEmptyCPU※----
SequenceEraseCPU※----
SequenceInsertCPU※----
SequenceLengthCPU※----
ShapeBPUOptimized to numerical storage via constant folding.--
ShrinkCPU※----
SigmoidBPULimited to 1CHW tensors where CxHxW <= 8192.
8W4C: pad W to multiples of 8 and C to multiples of 4.
32C: pad C to multiples of 32.
Choose the smallest aligned shape between the two and ensure <= 8192.
Only supports float types.
SignCPU--None
SinBPULimited to 1CHW tensors where CxHxW <= 8192.Only supports float types.
SinhCPU--Only supports float types.
SizeBPUOptimized to numerical storage via constant folding.--
SliceBPUUnlimitedNone
SoftmaxBPURuns on CPU by default. Can be set to BPU for 4D inputs with axis=1 and as model output, using run_on_bpu.Only supports float types.
SoftplusBPU accelerationSupports CxHxW <= 8192 for a tensor of input dimension 1CHW.Only supports float type.
SoftsignCPU computation--Only supports float type.
SpaceToDepthBPU accelerationSupports DCR and CRD modes.
Restrictions: H and W permutation, blocksize=2 only.
Only supports float type.
SplitBPU accelerationRestrictions: NCHW input, divisible lengths, axis=1,2,3.Only supports float type.
SplitToSequenceCPU computation(*)----
SqrtBPU accelerationSupports CxHxW <= 8192 for a tensor of input dimension 1CHW.Only supports float type.
SqueezeCPU computationRemoved by constant folding optimization if in constant substructure.--
StringNormalizerCPU computation(*)----
SubCPU computation--Supports same shape, scalar inputs, broadcast up to 5 dimensions.
SumBPU accelerationSame restrictions as Add.Only supports float type.
TanCPU computation--Only supports float type.
TanhBPU accelerationSupports CxHxW <= 8192 for a tensor of input dimension 1CHW.Only supports float type.
TfIdfVectorizerCPU computation(*)----
ThresholdedReluCPU computation--Only supports float type.
TileCPU computation--Supports float, int64, int32, uint64, uint32 types.
TopKCPU computation--Only supports float type, opset-10.
TransposeCPU computationSupports nhwc2nchw, perm=[0, 3, 1, 2], nchw2nhwc, perm=[0, 2, 3, 1].Supports float, int8, int32 types.
UniqueCPU computation(*)----
UnsqueezeCPU computationRemoved by constant folding optimization if in constant substructure.--
Upsample (replaced by Resize)BPU acceleration--Upsample-10: used when the op has 2 inputs; 4D Tensor.
Upsample-11: used when the op has more than 2 inputs; 4D Tensor.
WhereCPU computation--Supports float and int64 types.
Shape constraints detailed in the description.
XorCPU computation(*)----
FunctionCPU computation(*)----
CeluCPU computation(*)----
DynamicQuantizeLinearCPU computation(*)----
GreaterOrEqualCPU computation--Supports same shape, scalar inputs, broadcast up to 5 dimensions.
MeanVarianceNormalizationCPU computation(*)----
GridSample (PyTorch)CPU computation(*)----
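The ONNX tables above and below assume models exported at opset 10 or 11. Before conversion, it can help to confirm which opset an exported model declares; the snippet below is a minimal sketch using the standard onnx Python package, with the file name as a placeholder.

```python
# Minimal sketch: check which opset an exported ONNX model declares.
import onnx

model = onnx.load("model.onnx")  # placeholder path
opsets = {imp.domain or "ai.onnx": imp.version for imp in model.opset_import}
print("declared opsets:", opsets)

default_opset = opsets.get("ai.onnx")
if default_opset not in (10, 11):
    print(f"Warning: default opset is {default_opset}; "
          "the operator tables in this document assume opset 10 or 11.")
```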

RDK Ultra Supported Caffe Operators

Caffe Operator Name | CPU Computation / BPU Acceleration | RDK Ultra BPU Constraints | CPU Constraints
ConvolutionBPU Accelerated- Kernel width and height: <= 32
- Input/output channels (for one group): <= 8192 (or <= 65536 if last in quantized graph)
- Stride: Unrestricted, stride for Conv followed by Add (ResNet shortcut-connection) should be {1, 2}
- Dilation: <= 16
- Stride must be 1 when dilation != 1
- Axis default: 1
- 4D Convolution only
- auto_pad attribute not supported
- Type constraints: float, int32, int8
- Pads attribute constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart==Hend and Wstart==Wend.
DeconvolutionBPU Accelerated- kernel >= stride
- Input/output featuremaps <= 2048
- pad <= kernel / stride
- out_pad < 2
- stride: 14 >= stride >= 1, but stride_h and stride_w cannot both be 1
- Axis configuration not supported
- Shape constraint: 4D Tensor computation only
- Type constraint: float only
- Attribute constraints: dilations, group, output_padding, pads, strides attributes
- Pads attribute constraint: [hstart, wstart, hend, wend] must satisfy (hstart==hend and wstart==wend).
MaxUnpoolCPU Computation---- from_type constraints: X - float, I - Tensor(int64)
- to_type constraints: float only
PoolingBPU Accelerated- Four types: MaxPooling, AveragePooling, GlobalMaxPooling, GlobalAveragePooling
- Constraints: MaxPooling - int16 input/output, kernel <= 256, stride <= 256, padding <= 256
- AveragePooling - same as MaxPooling
- GlobalAveragePooling - unlimited
- GlobalMaxPooling - H, W ∈ [1, 256]
None
SPPCPU ComputationNot supported- Supports pyramid_height with 2^n pooling, n < 7
- pooling kernel <= 255
- pool option, configurable values: {0, 1}
InnerProductBPU AcceleratedConverted to Conv with Conv constraints
- Axis configuration not supported
None
LRNCPU ComputationNot supported- local_size supported
- alpha, beta, norm_region supported (configurable values: ACROSS_CHANNELS, WITHIN_CHANNEL)
- k supported
MVNCPU ComputationNot supported- normalize_variance: configurable values 1
- across_channels: configurable values 1
- Float32 computation only
BatchNormBPU AcceleratedUnlimitedNone
ELUBPU Accelerated- int16 input/output support
- Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536]
None
BNLLCPU ComputationNot supportedNone
PReLUCPU Computation- type constraint: float only
- from_type: X and slope
- to_type: Y
- Shape constraints: X = data_shape, slope = slope_shape
- data_shape == slope_shape
- slope_shape.ProdSize() == 1
- 4D NCHW layout for X and slope, N, C dimensions must be equal
- HxW or 1x1 for slope_shape
- Hx1 or 1xH for slope_shape
- 1xW or Wx1 for slope_shape
- Special case: 4D X and 3D slope with data_shape[1] = slope_shape[0] and slope_shape[1] = 1, slope_shape[2] = 1
None
ReLU/LeakyReLUBPU Accelerated- int16 input/output support
- Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536]
None
SigmoidBPU Accelerated- int16 input/output support
- Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536]
None
TanHBPU Accelerated- int16 input/output support
- Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536]
None
EltwiseBPU AcceleratedSupports Add, Sub, Mul operations
- int16 input/output support
- Feature map and constant inputs, at most one constant
- Broadcasting except first dimension
- 2D, 3D, 4D, and 5D dimensions supported, with general limitations (see notes)
- Different input dimensions supported, 5D inputs must meet: merge adjacent dimensions to 4D (e.g., NHWD1 and N1WDC), broadcast dimensions cannot be adjacent (e.g., NHWD1 and N11DC due to broadcast on H, W, and C)
None
BiasBPU AcceleratedRefer to Eltwise (Add) constraintsNone
ScaleBPU AcceleratedRefer to Eltwise (Mul) constraintsNone
AbsValBPU Accelerated- int16 input/output support
- Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536]
None
ExpBPU Accelerated- int16 input/output support
- Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536]
None
LogBPU Accelerated- int16 input/output support
- Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536]
None
PowerBPU Op1. Supports int16 input and output.
2. Input and output support up to 10 dimensions, with max dimension ∈ [1, 4096], others ∈ [1, 65536].
3. Second input only supports scalar.
-
ThresholdCPU ComputationNot supported-
ReductionCPU ComputationNot supported. Operation supports SUM, ASUM, SUMSQ, MEAN, Max, LogSum, Min, Prod; the axis attribute is supported; only supports Float32 computation.-
SoftmaxBPU Op1. Supports int16 input and output.
2. Defaults to CPU execution. Can run on BPU for 4D inputs with axis=1,2,3 if specified by run_on_bpu.
-
ArgMaxBPU Op1. Only supports axis=1, c<=64.
2. Does not support top_k ≠ 1.
3. Supports int16 input and output.
-
ConcatBPU Op1. Supports int16 input and output.
2. Does not support N-dimensional concat.
-
SplitBPU Op1. Supports int16 input and output.
2. Length of the original input must be a multiple of each split tensor length.
3. Supports any dimension except N.
4. Split count should be divisible.
5. Supports non-four-dimensional input and output.
-
SliceBPU Op1. Supports int16 input and output.
2. Unlimited, supports non-four-dimensional input and output.
-
ReshapeBPU Op1. Supports int16 input and output.
2. Supports up to 10-dimensional input and output.
Shape supports [1,4] shape_dim configurations; Axis supports [-4,3], does not support N dimensions, default 0 follows Caffe rules; num_axes supports [-1,3], default -1 means all axes from axis start.
FlattenCPU ComputationNot supported (can be fused in some scenarios)Axis range [-4,3], default is 1, with -4 and 0 having the same meaning. Only supports End_axis == -1.
CropCPU ComputationNot supported-
DropoutBPU OpUnlimited-
LSTMBPU OpOnly supports batch=1-
NormalizeCPU ComputationNot supportedType constraint: only supports float types.
PassThroughBPU OpSupports mode=DCR and mode=CRD. Only supports reordering along H and W directions with blocksize=2, e.g., NxCxHxW -> Nx(4C)x(H/2)x(W/2).Type constraint: only supports float types.
CReLUCPU ComputationNot supportedType constraint: only supports float types.
RReLUCPU ComputationNot supportedNone
PermuteBPU Op1. Supports arbitrary input dimensions.
2. Supports conversion of any other dimension except batch dimension (first dimension).
- Supports nhwc2nchw, perm: [0, 3, 1, 2].
- Supports nchw2nhwc, perm: [0, 2, 3, 1].
- Supports permutation of specified dimensions, data types supported: float, int8, int32.
MatMulBPU OpC = MatMul(A, B), with dimension constraints for A and B:
- Both A and B can have non-four-dimensional inputs but must meet these conditions:
- Dimensions of A and B must be the same.
- The lowest two dimensions M, K ∈ [1, 8192], higher dimensions ∈ [1, 4096].
Note: HDMK vs HDKN, MK/KN refers to the lowest two dimensions.
- Broadcasting is supported under these conditions:
- All other dimensions than the lowest two of A and B are either 1 or do not require broadcasting.
- Supported example: HDMK vs H1KN
- Unsupported example: H1MK vs 1DKN
- A cannot have both broadcasting and non-broadcasting values in dimensions beyond its lowest two.
- Supported example: 11MK vs HDKN
- Unsupported example: H1MK vs HDKN
- If B has both broadcasting and non-broadcasting values in higher dimensions, non-broadcasting values must be contiguous.
- Supported example: BHDMK vs B11KN
- Unsupported example: BHDMK vs B1DKN
- Broadcasting rules:
- If A and B have unequal values in a given dimension, the 1 is the broadcasting value and the other is not (e.g., in HDMK vs H1KN, 1 is the broadcasting value and D is not).
- If A and B have equal values in a given dimension, both are considered non-broadcasting values.
Type constraint: only supports float types.
UpsampleBPU OpRequires four-dimensional NCHW input, resize only supported on H and W dimensions; factor cannot be less than 2.-
ROIPoolingCPU ComputationNot supported-
PSROIPoolingCPU ComputationNot supported-

RDK Ultra Supported ONNX Operators

ONNX Operator Name | CPU Computation / BPU Acceleration | RDK Ultra BPU Constraints | CPU Constraints
AbsBPU Accelerated1. Supports int16 input/output.
2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536].
Type constraint: only supports float types.
AcosCPU Computation--Type constraint: only supports float types.
AcoshCPU Computation--Type constraint: only supports float types.
AddBPU Accelerated1. Supports int16 input/output.
2. Input can be featuremaps or constants, with at most one constant input.
3. Supports broadcast except for the first dimension, including NHWC and N1WC broadcasting.
4. Dimensions supported: 2D, 3D, 4D, and 5D, with general restrictions (see notes).
5. In ResNet's shortcut connection, Add is fused into the preceding conv for acceleration.
- Supports computation with same input shape.
- Supports scalar inputs as either input 1 or 2.
- Supports broadcast up to 5D.
AndCPU Computation--- Supports same input shape calculation.
- Supports scalar inputs as either input 1 or 2.
- Supports broadcast up to 5D.
ArgMaxBPU Accelerated1. 4D input format NCHW.
2. Only supports argmax along the C axis (axis=1).
3. C <= 64.
4. Supports int16 input/output.
Type constraint: only supports float types.
ArgMinBPU AcceleratedSimilar to ArgMax constraintsType constraint: only supports float types.
AsinCPU Computation--Type constraint: only supports float types.
AsinhCPU Computation--Type constraint: only supports float types.
AtanBPU Accelerated1. Supports int16 input/output.
2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536].
Type constraint: only supports float types.
AtanhCPU Computation--Type constraint: only supports float types.
AveragePoolBPU AcceleratedKernel <= 256.
Stride <= 256.
Padding <= 256.
No support for auto_pad attribute.
Only supports 4D Tensors.
BatchNormalizationBPU AcceleratedNo limitations.Type constraint: only supports float types.
Supports channel-first data layout (dimension 1 is channel).
BitShiftCPU Computation※----
CastCPU Computation--from_type supports: double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8.
to_type supports: double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8.
CeilBPU Accelerated1. Supports int16 input/output.
2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536].
Type constraint: only supports float types.
ClipBPU Accelerated1. Supports int16 input/output.
2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536].
Opset 6: min, max as attributes, dtype only supports float.
Opset 11: min, max as inputs, second input is min when there are two; dtype supports float, double.
CompressCPU Computation※----
ConcatBPU Accelerated1. Supports int16 input/output.
2. Does not support N-dimensional concatenation.
--
ConcatFromSequenceCPU Computation※----
ConstantBPU AcceleratedOptimized via constant foldingNo support for sparse_tensor attribute.
ConstantOfShapeBPU AcceleratedOptimized via constant foldingSupported types: float, int32, int8.
ConvBPU AcceleratedSupports 4D (conv2d) and 5D (conv3d) inputs.
4D conv2d: Kernel size range: N,C ∈ [1, 8192]; H,W ∈ [1, 31].
CHW ≤ 65535.
Channel limits: 1 group, C ≤ 8192 (or 65536 if last operator in quantized graph).
Stride: H,W ∈ [1, 256] (except for shortcut-connected conv, stride=1,2); dilation: H,W ∈ [1, 16], with H and W factors dividing input Tensor dimensions.
Padding: H,W ∈ [0, 256].
5D conv3d: NCDHW limits: N ∈ [1, 128]; H,W,D,C ∈ [1, 65536].
Kernel size: N,C ∈ [1, 65536]; H,W ∈ [1, 31], D ∈ [1, 8191].
Padding: DHW: H,W ∈ [0, 256], D ∈ [0, kernel_d/2].
Stride: H, W must be 1 or 2.
Group and dilation not supported.
Size limit: D x C ≤ 4096; D x H x alignCeil(W, 256) x C < 1 GB.
Weight limit: D x C ≤ 8192.
Only supports 4D convolutions.
No support for auto_pad attribute.
Supported types: float, int32, int8.
Pads constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart==Hend and Wstart==Wend.
ConvIntegerCPU Computation※----
ConvTransposeBPU AcceleratedInput/output featuremap limits: N ∈ [1, 128], H,W ∈ [1, 65536], C ∈ [1, 2048].
Size limit: 1GB.
Weight size limits: N,C ∈ [1, 2048], H,W ∈ [1, 14], HW ≠ 1.
Size: [1, 65535].
Padding: For odd strides, H,W ∈ [0, kernel / stride); even strides, H,W ∈ [0, kernel / stride].
Out_pad: H,W ∈ [0, 1].
Stride: H,W ∈ [1, 14]; stride_h and stride_w cannot both be 1.
Shape Constraint: Only supports 4D Tensors for computation.
Type Constraint: Only supports float types.
Attribute Constraints:
- Supports only dilations, group, output_padding, pads, and strides attributes.
- The pads attribute constraint is that [hstart, wstart, hend, wend] must satisfy (hstart==hend and wstart==wend).
CosBPU Acceleration1. This operator supports int16 input and output.
2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536].
Type Constraint: Only supports float types.
CoshCPU Computation--
CumSumCPU Computation--Axis: Type Constraint is only for int32 types.
DepthToSpaceBPU AccelerationSupports modes DCR and CRD.
Only rearrangement of H and W directions is supported, and blocksize=2 rearrangement only.
Example: NxCxHxW -> Nx(C/4)x(2H)x(2W), where the number of channels must be a multiple of 4.
From_Type Constraints:
- Type Constraint: Only supports float types.
- Limited to 4D Tensor computation.
To_Type Constraints:
- Type Constraint: Only supports float types.
- Limited to 4D Tensor computation.
DequantizeLinearCPU Computation--
DetCPU Computation※--
DivBPU Acceleration1. Only supports featuremap inputs (not constant inputs);
2. Input shape constraints refer to the Mul operator.
- Supports same-input-shape computation.
- Supports computation when input 1 is a scalar or input 2 is a scalar.
- Supports broadcast computation with a maximum dimension of 5.
DropoutBPU AccelerationDoes not participate in inference computations and will be removed during optimization.
EinsumCPU Computation※--
EluBPU Acceleration1. This operator supports int16 input and output.
2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536].
Type Constraint: Only supports float types.
EqualBPU Acceleration1. Supports int16 input.
2. Input and output dimensions support 2-5 dimensions.
3. Supports broadcast across all dimensions, broadcast for fin0 or fin1 input allowed, but not mutual broadcasting. 5D broadcast has the following restrictions:
- Must merge adjacent dimensions to reduce to 4D (including dimension N), e.g., NHWDC and NH1D1 can merge the NH dimension.
- Broadcasted dimensions cannot merge with adjacent ones, e.g., NHWDC and N1W1C are unsupported due to inability to merge adjacent dimensions.
4. Runs on CPU by default; can be specified to run on BPU with run_on_bpu.
ErfCPU Computation--Type Constraint: Supports float and double data types.
ExpBPU Acceleration1. Supports int16 input and output.
2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536].
Type Constraint: Only supports float types.
ExpandBPU Acceleration1. Supports int16 input and output.
2. Input and output support dimensions up to 10.
3. Only one dimension may differ between input and output.
EyeLikeCPU Computation--
FlattenBPU AccelerationConstraints similar to Reshape.
FloorBPU Acceleration1. Supports int16 input and output.
2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536].
Type Constraint: Only supports float types.
GRUCPU Computation--Direction Attribute: Only supports forward type.
Type Constraint: Only supports float types.
GatherBPU Acceleration1. All ranks of input/output/indices must be less than or equal to 4.
2. Indices support:
- When indices are feature (other op outputs), type constraint is only for int32.
- When indices are weight (model constants), type constraint supports int32 and int64.
From_Type Constraints:
- input: Type constraint supports float, int64, int32, int8, uint64, uint32, uint8.
- indices: Type constraint supports int32, int64.
To_Type Constraints:
- Type constraint supports float, int64, int32, int8, uint64, uint32, uint8.
GatherElementsBPU Acceleration1. Supports int16 input and output.
2. Input/indices/output dimensions support up to 10 dimensions.
3. Indices type constraint supports int16/int32/int64.
GatherNDCPU Computation--From_Type Constraints:
- input: Type constraint supports float, int32, int8.
- indices: tensor(int64).
To_Type Constraints: Type constraint supports float, int32, int8.
GemmBPU AccelerationGemm will be converted to Conv implementation, with boundary constraints referring to Conv.Type Constraint: Only supports float types.
GlobalAveragePoolBPU AccelerationNo limitations.- Type Constraint: Only supports float types.
- Limited to 4D Tensors.
GlobalLpPoolCPU Computation--- Type Constraint: Supports float and double types.
- Limited to 4D Tensor computation.
GlobalMaxPoolBPU AccelerationH, W ∈ [1, 256].- Type Constraint: Only supports float types.
- Limited to 4D Tensors.
GreaterBPU Acceleration1. Supports int16 input.
2. Input and output dimensions support 2-5 dimensions.
3. Same as Equal operator constraints.
4. Runs on CPU by default; can be specified to run on BPU with run_on_bpu.
HardSigmoidBPU Acceleration1. Supports int16 input and output.
2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536].
Type Constraint: Only supports float types.
HardmaxCPU Computation※--
IdentityCPU Computation--
IfCPU Computation※----
InstanceNormalizationCPU Computation--- Type constraint only supports float types.
- Supports data layout with the first dimension as channels.
IsInfCPU Computation※----
IsNaNCPU Computation※----
LRNCPU Computation--- Type constraint only supports float types.
- Only supports four-dimensional Tensors.
LSTMBPU AccelerationSupports batch_size=1 only. If using multiple batches, ensure LSTM's batch is 1 during ONNX export and configure the parameter input_batch=1 in the YAML.- Type constraint only supports float types.
- Attribute constraint: direction attribute only supports forward.
- Input constraints:
- Supports X, W, R inputs;
- Supports X, W, R, B inputs (sequence_lens is empty or default);
- Supports X, W, R, B, sequence_lens, initial_h, initial_c, P inputs (sequence_lens is empty or default).
LeakyReluBPU Acceleration1. Supports int16 input and output.
2. Input and output dimensions support 1-10 dimensions, with the highest dimension ∈ [1, 4096], others ∈ [1, 65536].
Type constraint: only supports float types.
LessBPU Acceleration1. Supports int16 input.
2. Input/output dimensions support 2-5 dimensions.
3. Runs on CPU by default; can be specified to run on BPU using run_on_bpu.
- Supports same shape inputs calculation.
- Supports scalar input1 or scalar input2 calculation.
- Supports broadcast calculation with a max dimension of 5.
LessOrEqualBPU AccelerationIn opset 11, a standalone LessOrEqual is not supported; it is decomposed into Greater + Not, with the same limitations as Greater.- Supports same shape inputs calculation.
- Supports scalar input1 or scalar input2 calculation.
- Supports broadcast calculation with a max dimension of 5.
LogBPU Acceleration1. Supports int16 input and output.
2. Input and output dimensions support 1-10 dimensions, with the highest dimension ∈ [1, 4096], others ∈ [1, 65536].
Type constraint: only supports float types.
LogSoftmaxCPU Computation--Type constraint: only supports float types.
LoopCPU Computation※----
LpNormalizationCPU Computation--- p-norm only supports 1 or 2.
- Type constraint supports double and float types.
LpPoolCPU Computation--- auto_pad attribute not supported.
- Type constraint supports double and float types.
- Limited to 4-dimensional computation.
MatMulIntegerCPU Computation※----
MatMulBPU AccelerationC = MatMul(A, B), with input A and B dimension restrictions:
- Non-quadruple dimensional inputs allowed but must meet these constraints:
- A and B must have identical dimensions.
- The lowest two dimensions M, K ∈ [1, 8192], higher dimensions ∈ [1, 4096].
Note: HDMK vs HDKN, MK/KN refers to the lowest two dimensions.
- Broadcast is supported under these conditions:
- For A and B, all dimensions except the lowest two must be either 1 or non-broadcastable values.
- Examples: HDMK vs H1KN
- Counterexample: H1MK vs 1DKN
- A's higher dimensions cannot contain both broadcastable and non-broadcastable values.
- Examples: 11MK vs HDKN
- Counterexample: H1MK vs HDKN
- If B's higher dimensions contain both broadcastable and non-broadcastable values, non-broadcastable ones must be consecutive high dimensions.
- Examples: BHDMK vs B11KN
- Counterexample: BHDMK vs B1DKN
- Type constraint: only supports float types.
MaxBPU Acceleration1. Supports int16 input and output.
2. Input/output dimensions support 2-5 dimensions.
3. Supports broadcast across all dimensions, broadcast for fin0 or fin1 individually, not mutual broadcast. Restrictions for 5D broadcast:
- Can merge adjacent dimensions to 4D (including dimension N), e.g., NHWDC and NH1D1 can merge NH.
- Broadcast dimensions cannot merge with adjacent ones, e.g., NHWDC and N1W1C unsupported due to no adjacent dimension merge.
- Other details in the documentation.
- Supports 1-∞ inputs.
- Supports same shape inputs calculation.
- Supports scalar input1 or scalar input2 calculation.
- Supports broadcast calculation with a max dimension of 5.
MaxPoolBPU AccelerationSupports int16 input and output.
Kernel size ≤ 256.
Stride ≤ 256.
Padding ≤ 256.
MaxPool does not support dilation.
1. Dilation only supports 1x1.
2. Data row-major storage only.
3. auto_pad attribute not supported.
4. storage_order attribute not supported.
5. Limited to four-dimensional Tensor computation.
MaxRoiPoolCPU Computation--No specific constraints.
MeanCPU Computation※----
MinBPU Acceleration1. Supports int16 input and output.
2. Input/output dimensions support 2-5 dimensions.
3. Other constraints are the same as the Max operator.
4. Runs on CPU by default; can be moved to BPU using run_on_bpu.
- Same constraints as the Max operator.
ModCPU Computation※----
MulBPU Acceleration1. Supports int16 input and output.
2. Input types support feature maps and constants, with at most one constant input.
3. Supports broadcast except the first dimension, mutual broadcast between inputs, like NH1C and N1WC.
4. Dimensions up to 5D, with general restrictions (see notes). Supports different input dimensions, with specific restrictions for 5D input.
(1) Merge adjacent dimensions to 4D, e.g., NHWD1 and N1WDC can merge W and D.
(2) Cannot merge broadcast dimensions with adjacent ones, e.g., NHWD1 and N11DC unsupported due to H, W, and C being broadcast dimensions.
- Supports same shape inputs calculation.
- Supports scalar input1 or scalar input2 calculation.
- Supports broadcast calculation with a max dimension of 5.
MultinomialCPU Computation※----
NegCPU computation
NotCPU computation
OneHotCPU computation
OrCPU computation- Supports same-input-shape computation.
- Supports when Input 1 is a scalar or Input 2 is a scalar.
- Supports broadcast calculation with a maximum dimension of 5.
PReluCPU computation- Type constraint: only supports float types.
- from_type: X and slope.
- to_type: Y.
- X's shape is data_shape; slope's shape is slope_shape. Supported cases:
- data_shape == slope_shape.
- slope_shape.ProdSize() == 1.
- For 4D NCHW X: slope's N and C must equal X's, and slope's HxW may be 1x1, Hx1, or 1xW.
- Special case: 4D X and 3D slope with data_shape[1] == slope_shape[0], slope_shape[1] == 1, slope_shape[2] == 1.
PadBPU acceleration1. Supports int16 input and output.
2. Supports mode: Constant.
3. Supports padding in all dimensions.

Pad-10:
- Type constraint: float only.
- 4D NCHW tensors only.
- Constraint on pads attribute:
- len(pads) == 8
- pads[i] >= 0
- pads[0] == pads[1] == pads[4] == pads[5] == 0.
Pad-11:
- from_type: data - float only.
- pads: tensor(int64)
- constant_value (optional) - float only.
- to_type: float only.
- 4D Tensor only.
- Supports 2D or 3D padding only.
PowBPU acceleration1. Supports int16 input and output.
2. Input/output support 1-10 dimensions, max dim ∈ [1, 4096], others ∈ [1, 65536].
3. Second input must be a scalar.
- Type constraints: double, float, int64, int32.
- Supports same-input-shape calculation.
- Supports scalar inputs for either Input 1 or Input 2.
- Supports broadcast calculation with a maximum dimension of 5.
- Requires X and Y to have the same type.
QLinearConvCPU computation※
QLinearMatMulCPU computation※
QuantizeLinearCPU computation
RNNCPU computation- Type constraint: float only.
- Attribute constraint: direction attribute supports forward only.
- Input constraint: X, W, R inputs only, no optional inputs like B, sequence_lens, initial_h allowed.
- Output constraint: Only Y_h output supported, shape [num_directions, batch_size, hidden_size].
RandomNormalCPU computation※
RandomNormalLikeCPU computation※
RandomUniformCPU computation
RandomUniformLikeCPU computation
RangeCPU computationType constraints: float, int64, int32, int16.
ReciprocalBPU acceleration1. Supports int16 input and output.
2. Input/output support 1-10 dimensions, max dim ∈ [1, 4096], others ∈ [1, 65536].
ReduceL1CPU computation
ReduceL2CPU computation
ReduceLogSumCPU computation
ReduceLogSumExpCPU computationType constraints: float, double.
ReduceMaxBPU acceleration1. Supports int16 input and output.
2. Input supports 2-5 dimensions, requires axes attribute with 1 axis, no reduction across more than 1 dimension.
3. Reduced dimension size ∈ [1, 8192].
4. keepdims == 1 only.
Axes supported: 0, 1, or equal to input data dimensions.
ReduceMeanBPU acceleration1. Supports int16 input and output.
2. Input supports 2-5 dimensions, requires axes attribute with 1 axis, no reduction across more than 1 dimension.
3. Special case: Supports HW reduction when reduce_dim = 2.
4. keepdims == 1 only.
Axes supported: 0, 1, or equal to input data dimensions.
ReduceMinCPU computation
ReduceProdCPU computation
ReduceSumBPU acceleration1. Supports int16 input and output.
2. Input supports 2-5 dimensions, requires axes attribute with 1 axis, no reduction across more than 1 dimension.
Axes supported: 0, 1, or equal to input data dimensions.
ReduceSumSquareCPU computation
ReluBPU accelerationUnlimitedOnly supports float type.
ReshapeBPU acceleration1. Supports int16 inputs and outputs.
2. Supports 1-10 dimensional inputs and outputs.
None.
ResizeBPU acceleration1. NCHW input featuremaps, resize only on H and W dimensions. onnx opset=11 supports ROI input (PyTorch models need manual modification to add ROI input, which only accepts constant inputs).
2. Mode supports nearest and linear.
3. Supports scaling up or down.
4. For nearest mode, scale factors must be powers of 2 (e.g., 2, 4, 8, 16, 32) with H_factor <= W_factor.
5. onnx opset=11 supports half_pixel, pytorch_half_pixel, asymmetric, align_corners, and tf_crop_and_resize. ROI input is only effective in tf_crop_and_resize mode, requiring integer boundary coordinates after conversion.
6. extrapolation_value not supported.
ReverseSequenceCPU computation----
RoiAlignCPU computation----
RoundCPU computation----
ScanCPU computation*----
Scatter (deprecated)CPU computation*----
ScatterElementsCPU computation--from_type: supports float, int32, int8.
indices: only supports int32 type.
updates: supports float, int32, int8.
to_type: supports float, int32, int8.
ScatterNDCPU computation--from_type: supports float, int32, int8.
updates: supports float, int32, int8.
to_type: supports float, int32, int8.
SeluCPU computation--Only supports float type.
SequenceAtCPU computation*----
SequenceConstructCPU computation*----
SequenceEmptyCPU computation*----
SequenceEraseCPU computation*----
SequenceInsertCPU computation*----
SequenceLengthCPU computation*----
ShapeBPU accelerationOptimized through constant folding into numerical storage.--
ShrinkCPU computation*----
SigmoidBPU acceleration1. Supports int16 inputs and outputs.
2. Supports 1-10 dimensional inputs, max dimension [1, 4096], others [1, 65536].
Only supports float type.
SignCPU computation--Only supports float type.
SinBPU acceleration1. Supports int16 inputs and outputs.
2. Supports 1-10 dimensional inputs, max dimension [1, 4096], others [1, 65536].
Only supports float type.
SinhCPU computation--Only supports float type.
SizeBPU accelerationOptimized through constant folding into numerical storage.--
SliceBPU acceleration1. Supports int16 inputs and outputs.
2. Unlimited, supports non-four-dimensional inputs and outputs.
No constraints.
SoftmaxBPU acceleration- Supports int16 inputs and outputs.
- Runs on CPU by default, with differences between onnx::softmax and pytorch::softmax:
1. For onnx::softmax, can run on BPU if input is 4D and axis=3. Specify run_on_bpu.
2. For pytorch::softmax, can run on BPU for 4D inputs and axis=1, 2, 3. Specify run_on_bpu.
Only supports float type.
SoftplusBPU acceleration1. Supports int16 inputs and outputs.
2. Supports 1-10 dimensional inputs, max dimension [1, 4096], others [1, 65536].
Only supports float type.
SoftsignCPU computation----
SpaceToDepthBPU acceleratedSupports DCR and CRD modes. Only reordering along H and W dimensions is allowed, with blocksize=2.float only
SplitBPU accelerated1. Supports int16 inputs and outputs.
2. Input length must be a multiple of each split tensor's length.
3. Supports arbitrary dimensions except N.
4. Split count must be divisible.
5. Non-four-dimensional inputs and outputs supported.
float only
SplitToSequenceCPU computation(*)----
SqrtBPU accelerated1. Supports int16 inputs and outputs.
2. Input/output supports 1-10 dimensions, with max dimension in [1, 4096] and others in [1, 65536].
float only
SqueezeBPU acceleratedConverted to Reshape op. BPU constraints apply.--
StringNormalizerCPU computation(*)----
SubBPU accelerated1. Supports int16 inputs and outputs.
2. Feature map and constant inputs supported, up to one constant.
3. Broadcasting except first dimension, supports input broadcasting between NH1C and N1WC.
4. 2D-5D dimensions supported, with general restrictions (see notes). Supports different input dimensions; for 5D inputs, see restrictions below.
(1) Merge adjacent dimensions to 4D, e.g., NHWD1 and N1WDC.
(2) Cannot merge broadcasted dimensions with adjacent ones, e.g., NHWD1 and N11DC not supported due to H, W, and C being broadcasted dimensions.
Same shape input support
Scalar input support
Broadcasting up to 5 dimensions.
SumBPU acceleratedConstraints same as Addfloat only
TanCPU computation--float only
TanhBPU accelerated1. Supports int16 inputs and outputs.
2. Input/output supports 1-10 dimensions, with max dimension in [1, 4096] and others in [1, 65536].
float only
TfIdfVectorizerCPU computation(*)----
ThresholdedReluCPU computation--float only
TileBPU accelerated1. Supports int16 inputs and outputs.
2. Only one dimension may have differing values between input and output.
float, int64, etc.
TopKBPU accelerated1. Supports int16 inputs and outputs.
2. Input/indices/output dimensions: 1-10.
3. Indices type: int16/int32/int64.
4. Sorted parameter supports true only.
float only
TransposeBPU accelerated1. Supports int16 inputs and outputs.
2. Arbitrary input dimensions.
nhwc2nchw, perm: [0, 3, 1, 2]
nchw2nhwc, perm: [0, 2, 3, 1]
Custom perm dimensions for float, int8, int32.
UniqueCPU computation(*)----
UnsqueezeBPU acceleratedConverted to Reshape op. BPU constraints apply.--
Upsample (replaced by Resize)BPU accelerated--Upsample-10:
Used when the op has 2 inputs (opset 10); input is a 4D Tensor.
Upsample-11:
Used when the op has more than 2 inputs (opset 11); input is a 4D Tensor.
coordinate_transformation_mode: half_pixel, asymmetric, align_corners, pytorch_half_pixel for nearest and linear modes; half_pixel only for cubic.
extrapolation_value not supported.
WhereCPU computation--float, int64
XorCPU computation(*)----
FunctionCPU computation(*)----
CeluCPU computation(*)----
DynamicQuantizeLinearCPU computation(*)----
GreaterOrEqualBPU acceleratedOpset 11 has no standalone GreaterOrEqual; it is decomposed into Less + Not on the BPU, with the same restrictions as Less.Same shape, scalar inputs, and broadcast up to 5 dimensions.
MeanVarianceNormalizationCPU computation(*)----
GridSample (PyTorch)BPU accelerated1. Input dimensions: 4D, N ∈ [1, 4096], C ∈ [1, 65536], H, W ∈ [1, 1024], H x W ≤ 720 x 1024.
2. Mode: bilinear, nearest.
3. Padding_mode: zeros, border.
4. GridSample is an opset-16 ONNX operator (not available natively in opset 11); export it via horizon_nn.torch.export_onnx. See the sketch below.
--
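The GridSample row above refers to an export example. The sketch below is illustrative only: the function name horizon_nn.torch.export_onnx comes from this list, but its exact signature is an assumption (modeled on torch.onnx.export) and should be checked against the toolchain manual; the model and shapes are placeholders.

```python
# Hypothetical sketch: exporting a model that uses F.grid_sample through the
# toolchain's export helper. Only the name horizon_nn.torch.export_onnx comes
# from this document; its signature is assumed to mirror torch.onnx.export.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WarpNet(nn.Module):
    def forward(self, feat, grid):
        # bilinear/nearest modes and zeros/border padding are the ones
        # listed as supported in the GridSample row above
        return F.grid_sample(feat, grid, mode="bilinear",
                             padding_mode="zeros", align_corners=False)


model = WarpNet().eval()
feat = torch.randn(1, 16, 128, 128)        # N, C, H, W within the listed ranges
grid = torch.rand(1, 128, 128, 2) * 2 - 1  # sampling grid normalized to [-1, 1]

# Assumed call; verify the real signature in the D-Robotics toolchain docs.
from horizon_nn.torch import export_onnx
export_onnx(model, (feat, grid), "gridsample_warp.onnx")
```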