Supported Operator Lists and Restrictions
Limitations and Notes
This section covers the Caffe and ONNX operators supported by the D-Robotics processor. Operators not listed here are currently unsupported because of hardware limitations of the BPU.
Terminology:
- BPU Acceleration: operators that the D-Robotics processor can accelerate subject to certain constraints; if a constraint is not met, the operator is computed on the CPU instead.
- CPU Computation: operators already optimized for the D-Robotics ARM CPU, supporting ONNX opsets 10 and 11.
- CPU Computation※: CPU operators provided on a temporary basis that have not yet been integrated.
Additional Considerations:
- For all BPU operators on RDK X3 there is a general restriction: input_batch ≤ 128.
- On the RDK Ultra BPU the following restrictions apply:
  - Input and output dimensions must be 4D; operators that also support non-four-dimensional shapes are noted explicitly.
  - Shape: H, W, C ∈ [1, 65536], N ≤ 4096, and N x C x H x W ≤ 1GB.
  - Caffe 1.0 base operators and common extended operators are supported, as well as ONNX opsets 10 and 11. Operators that do not meet the BPU acceleration constraints fall back to the ARM CPU.
- Operators such as Cast, Constant, Dropout, Reshape, Squeeze, Unsqueeze, and Shape cannot run directly on the BPU, but the algorithm toolchain can optimize them in some cases (e.g., by constant folding) so that they are effectively supported.
- Operators marked as PyTorch are not part of the official ONNX opset 11; D-Robotics's algorithm toolchain provides a script that exports them from PyTorch as custom ONNX operators.
- The tensorflow-onnx conversion tool (https://github.com/onnx/tensorflow-onnx) supports converting TensorFlow 1.x operators to stable ONNX opsets 6-11; TensorFlow 2.x support is still experimental.
- Quantization details: an operator that meets the constraints may still run on the CPU because it is a passively quantized OP. The algorithm toolchain designs the quantization logic based on the OP's computation characteristics and the BPU's low-level logic. For more information on active, passive, and manual quantization, see the "Quantization Logic in Algorithm Toolchain" chapter.
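Before consulting the tables below, it can help to list which opset and operator types a given ONNX model actually uses. The following is a minimal sketch using the public `onnx` Python package; the file name `model.onnx` is a placeholder, not something defined by this document.

```python
# Minimal sketch: list the opset version(s) and the operator types used by an
# ONNX model, so they can be checked against the tables in this section.
# Assumes the public `onnx` package; "model.onnx" is a placeholder path.
from collections import Counter

import onnx

model = onnx.load("model.onnx")

# Opset(s) the model was exported with; the toolchain supports opsets 10 and 11.
for opset in model.opset_import:
    print(f"domain={opset.domain or 'ai.onnx'} opset={opset.version}")

# Count every op_type appearing in the graph.
op_counts = Counter(node.op_type for node in model.graph.node)
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")
```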
RDK X3 List of supported Caffe operators
Caffe Operator Name | CPU Computing/BPU Acceleration | X3 BPU Constraints | CPU Constraints |
---|---|---|---|
Convolution | BPU Acceleration | Kernel size: HxW = [1, 7]x[1, 7]. Input/output channels (per group): <= 2048; for ordinary convolutions (non-dilated, non-group, non-depthwise) this can be relaxed to <= 4096. No stride limit. Dilation: only powers of 2 are allowed, and the dilation must be divisible by the stride; h_dilated <= w_dilated. Total kernel size: HxWxC <= 32768. axis not supported (default: 1) | 4D Conv only. auto_pad not supported. Type constraints: float, int32, int8. Pads constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart == Hend and Wstart == Wend |
Deconvolution | BPU Acceleration | Kernel size: HxW = [2, 14]x[2, 14]. Channel limit: C <= 2048. Padding: HxW = [0, (Kernel_H-1)/2]x[0, (Kernel_W-1)/2]. Stride: stride ∈ {2, 4}, stride_h ≤ stride_w. Dilation: (1, 1). No axis support | output_shape and output_padding unsupported. auto_pad: NOTSET only. No axis support |
MaxUnpool | CPU Computing | --- | from_type constraints: X - float only, I - Tensor(int64) to_type constraints: float only |
Pooling | BPU Acceleration | Four types: MaxPooling, AveragePooling, GlobalMaxPooling, GlobalAveragePooling. Constraints: MaxPooling: Kernel Size = [1, 64]x[1, 64], Stride = [1, 185], Padding >= 0. AveragePooling: Kernel HxW = [1, 7]x[1, 7], Stride ∈ [1, 185]. GlobalAveragePooling: HxW <= 8192 for NCHW input. GlobalMaxPooling: HxW = [1, 1024]x[1, 1024] for NCHW input | None |
SPP | CPU Computing | Not supported | pyramid_height: 2^n pooling, n < 7 pooling kernel size <= 255 pool option: 1 |
InnerProduct | BPU Acceleration | Converted to Conv. Constraints: input must be 4D NCHW; if both H and W are <= 7, the Gemm limits are the same as Conv. If H = W = 1: C <= 16384; otherwise C <= 2048. Low-precision int8 output (followed by a BPU-supported node): H x W/8 x C/4 ≤ 1024. High-precision int32 output: H x W/8 x C/4 < 2048. No axis support | None |
LRN | CPU Computing | Not supported | local_size supported alpha, beta supported norm_region: ACROSS_CHANNELS, WITHIN_CHANNEL (optional) k supported |
MVN | CPU Computing | Not supported | normalize_variance: {0, 1} (optional) across_channels: {0, 1} (optional) Float32 only |
BatchNorm | BPU Acceleration | Unlimited | None |
ELU | CPU Computing | Not supported | None |
BNLL | CPU Computing | Not supported | None |
PReLU | BPU Acceleration | Unlimited | None |
ReLU/LeakyReLu | BPU Acceleration | Unlimited | None |
Sigmoid | BPU Acceleration | Input must be a 1CHW tensor. Compute two aligned sizes: 8W4C (pad W to a multiple of 8 and C to a multiple of 4) and 32C (pad C to a multiple of 32); the smaller of the two aligned sizes must be ≤ 8192 | None |
TanH | BPU Acceleration | Unlimited | None |
Eltwise | BPU Acceleration | Supports Add and Mul; Sub is not supported. Add: M ≤ 2048 (channels). Supported cases: 1. NCHW vs NCHW 2. NCHW vs NC11 (both inputs must be outputs of other ops). Mul: both inputs must be 4D, C ≤ 2048. Supported shapes: 1. (1xCxHxW vs 1xCxHxW) 2. (1xCxHxW vs 1xCx1x1) 3. (1xCxHxW vs 1x1x1x1) | None |
Bias | BPU Acceleration | Refer to Eltwise (Add) constraints | None |
Scale | BPU Acceleration | Refer to Eltwise (Mul) constraints | None |
AbsVal | CPU Computing | Not supported | None |
Exp | BPU Acceleration | Unlimited | None |
Log | CPU Computing | Not supported | None |
Power | BPU | Unlimited | None |
Threshold | CPU | Not supported | None |
Reduction | CPU | Not supported | Operation supports SUM, ASUM, SUMSQ, MEAN. Axis is supported. Only supports Float32 calculations. |
Softmax | CPU | Not supported | None |
ArgMax | BPU | Only supports axis=1 and c<=64. Does not support top_k != 1 | None |
Concat | BPU | Input/Output Channel: C<=2048 | None |
Split | BPU | Unlimited | None |
Slice | BPU | Unlimited | None |
Reshape | CPU | Not supported (can be fused in some scenarios) | Shape supports up to [1,4] shape_dim configurations. Axis supports [-4,3]. No support for N dimensions. Default value is 0, follows Caffe rules. |
Flatten | CPU | Not supported (can be fused in some scenarios) | Axis range [-4,3], default is 1, -4 and 0 have the same meaning. Only supports End_axis == -1. |
Crop | CPU | Not supported | None |
Dropout | BPU | Unlimited | None |
LSTM | BPU | Only supports batch=1 | -- |
Normalize | CPU | Not supported | Type constraint: only supports float type. |
PassThrough | BPU | Supports mode=DCR and mode=CRD. Only supports rearrangement in H and W directions with blocksize=2. | Type constraint: only supports float type. |
CReLU | CPU | Not supported | Type constraint: only supports float type. |
RReLU | CPU | Not supported | None |
Permute | CPU | Not supported | - Supports nhwc2nchw, perm: [0, 3, 1, 2]. - Supports nchw2nhwc, perm: [0, 2, 3, 1]. - Supports specified perm dimension conversions, data types: float, int8, int32. |
MatMul | BPU | Optimized for specific scenarios: - K vs KxN, K vs 1xKxN, K vs 1x1xKxN - MxK vs K, MxK vs KxN, MxK vs 1x1xKxN - 1xMxK vs K, 1xMxK vs 1xKxN - 1x1xMxK vs K, 1x1xMxK vs 1xKxN, 1x1xMxK vs 1x1xKxN - BxMxK vs KxN (B>=1) - 1xBxMxK vs KxN (B>=1) - AxBxMxK vs KxN (A>1, B>1) For the opposite scenario: - 1xBxMxK vs 1x1xKxN (B>1) Optimized for two featuremaps: - 1xBxMxK vs 1x1xKxN (B>=1) | Type constraint: only supports float type. |
Upsample | BPU | Input featuremap must be 4D NCHW, resize only on H and W dimensions, factors must be 2^N. Supports different factors for H and W, but H_factor <= W_factor required. | None |
ROIPooling | CPU | Not supported | None |
PSROIPooling | CPU | Not supported | None |
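The Sigmoid alignment rule in the table above can be checked with a few lines of arithmetic. The sketch below shows one reading of that rule for a 1xCxHxW input; the helper names and the example shape are illustrative, not part of the toolchain.

```python
# Minimal sketch of the X3 Sigmoid constraint: for a 1xCxHxW input, the smaller
# of the 8W4C-aligned size and the 32C-aligned size must be <= 8192.
# One reading of the rule; helper names and the example shape are illustrative.
def align_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple

def sigmoid_fits_bpu(c: int, h: int, w: int, limit: int = 8192) -> bool:
    size_8w4c = align_up(w, 8) * align_up(c, 4) * h   # pad W to 8, C to 4
    size_32c = align_up(c, 32) * h * w                # pad C to 32
    return min(size_8w4c, size_32c) <= limit

# Example: a 1x32x16x16 tensor -> both aligned sizes are exactly 8192, so it fits.
print(sigmoid_fits_bpu(c=32, h=16, w=16))
```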
RDK X3 List of supported ONNX operators
ONNX Operator Name | CPU Computing/BPU Acceleration | X3 BPU Constraints | CPU Constraints |
---|---|---|---|
Abs | CPU Calculation | -- | Type constraint: only supports float type. |
Acos | CPU Calculation | -- | Type constraint: only supports float type. |
Acosh | CPU Calculation | -- | Type constraint: only supports float type. |
Add | BPU Acceleration | M <= 2048, supported cases: 1. NCHW and NCHW shapes for both inputs. 2. NCHW and NC11 shapes (both inputs need to be outputs of other ops). 3. Integrated into the previous conv in ResNet's shortcut structure for acceleration. | - Supports same shape inputs calculation. - Supports scalar input 1 or input 2 calculation. - Supports broadcast calculation with a max dimension of 5. |
And | CPU Calculation | -- | - Supports same shape inputs calculation. - Supports scalar input 1 or input 2 calculation. - Supports broadcast calculation with a max dimension of 5. |
ArgMax | BPU Acceleration | 1. Four-dimensional input (NCHW). 2. Only supports argmax along the C dimension (axis=1). 3. C <= 64 | Type constraint: only supports float type. |
ArgMin | CPU Calculation | -- | Type constraint: only supports float type. |
Asin | CPU Calculation | -- | Type constraint: only supports float type. |
Asinh | CPU Calculation | -- | Type constraint: only supports float type. |
Atan | CPU Calculation | -- | Type constraint: only supports float type. |
Atanh | CPU Calculation | -- | Type constraint: only supports float type. |
AveragePool | BPU Acceleration | Kernel HxW: [1, 7]x[1, 7], Stride ∈ [1, 185] | auto_pad attribute not supported. Only supports four-dimensional tensors. |
BatchNormalization | BPU Acceleration | Optimized to fuse with previous conv | Type constraint: only supports float type. Supports channel-first data layout (dim=1). |
BitShift | CPU Calculation(*) | -- | -- |
Cast | CPU Calculation | -- | from_type supports double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8. to_type supports double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8. |
Ceil | CPU Calculation | -- | Type constraint: only supports float type. |
Clip | BPU Acceleration | Unlimited | Type constraint: only supports float type. When two inputs are provided, the second input is min. |
Compress | CPU Calculation(*) | -- | -- |
Concat | BPU Acceleration | Input/Output Channel: C<=2048 | -- |
ConcatFromSequence | CPU Calculation(*) | -- | -- |
Constant | BPU Acceleration | Optimized through constant folding | No support for sparse_tensor attribute. Type constraint: only supports float type. |
ConstantOfShape | BPU Acceleration | Optimized through constant folding | Supported types: float, int32, int8. |
Conv | BPU Acceleration | Kernel HxW: [1, 7]x[1, 7]. Input/output channels (per group): <= 2048; for ordinary convolutions (non-dilated, non-group, non-depthwise) this can be relaxed to <= 4096. Stride: unrestricted, except that stride must be 1 for a Conv followed by an Add in a ResNet shortcut connection. Dilation: only powers of 2 are allowed, and the dilation must be divisible by the stride; h_dilated ≤ w_dilated. Total kernel size limit: HxWxC ≤ 32768 | Only supports 4D Convolution. auto_pad attribute not supported. Type constraint: float, int32, int8. Pads constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart==Hend and Wstart==Wend. |
ConvInteger | CPU Calculation(*) | -- | -- |
ConvTranspose | BPU Acceleration | Kernel HxW: [2, 14]x[2, 14]. Input/output channels: C <= 2048. Padding HxW: [0, (Kernel_H-1)/2]x[0, (Kernel_W-1)/2]. Stride: stride ∈ {2, 4}, stride_h ≤ stride_w. Dilation: (1, 1) only | auto_pad attribute not supported. Type constraint: float, int32, int8. |
Cos | BPU Acceleration | Limited to CxHxW <= 8192 for 1CHW tensor | Type constraint: only supports float type. |
Cosh | CPU Calculation | -- | Type constraint: only supports float type. |
CumSum | CPU Calculation | -- | -- |
DepthToSpace | BPU acceleration | Supports DCR and CRD modes. Only supports rearrangement along the H and W dimensions with blocksize=2. | from_type: only float types, 4D Tensor computation only. to_type: only float types, 4D Tensor computation only. |
DequantizeLinear | CPU computation | -- | -- | -- |
Det | CPU computation※ | -- | -- | -- |
Div | BPU acceleration | 1. Supports featuremap inputs only (no constant inputs). 2. Input shape constraints refer to Mul operator. | - Same input shape supported. - Supports scalar input1 or input2. - Broadcast calculation up to 5 dimensions. | |
Dropout | BPU acceleration | Not computed in inference, removed by optimization. | -- | |
Einsum | CPU computation※ | -- | -- | -- |
Elu | CPU computation | -- | Type constraint: only float types. | |
Equal | CPU computation | -- | - Same input shape supported. - Supports scalar input1 or input2. - Broadcast calculation up to 5 dimensions. | |
Erf | CPU computation | -- | Type constraint: supports float and double types. | |
Exp | BPU acceleration | -- | Type constraint: only float types. | |
Expand | CPU computation | -- | -- | |
EyeLike | CPU computation | -- | -- | |
Flatten | CPU computation | -- | -- | |
Floor | CPU computation | -- | Type constraint: only float types. | |
GRU | CPU computation | -- | - direction attribute supports forward only. - Type constraint: only float types. - Input count must be 3, 4, or 6. - Output count is 2. | |
Gather | CPU computation | -- | from_type: - input: types supported: float, int64, int32, int8, uint64, uint32, uint8. - indices: type supported: int32, int64. - to_type: types supported: float, int64, int32, int8, uint64, uint32, uint8. | |
GatherElements | CPU computation | -- | -- | |
GatherND | CPU computation | -- | from_type: - input: types supported: float, int32, int8. - indices: tensor(int64). - to_type: types supported: float, int32, int8. | |
Gemm | BPU acceleration | Converted to Conv implementation. If both H and W are <= 7, the Gemm limits are the same as Conv. C <= 16384 if H = W = 1; otherwise C <= 2048. Low-precision int8 output if followed by a BPU-supported node: H x W/8 x C/4 <= 1024. High-precision int32 output if followed by a non-BPU-supported node: H x W/8 x C/4 < 2048. | Type constraint: only float types. |
GlobalAveragePool | BPU acceleration | Input HxW must be <= 8192 for NCHW shape. | -- | |
GlobalLpPool | CPU computation | -- | Type constraint: supports float and double types. - 4D Tensor computation only. | |
GlobalMaxPool | BPU acceleration | Input HxW range: [1, 1024]x[1, 1024] for NCHW shape. | Type constraint: only float types. - 4D Tensor only. | |
Greater | CPU computation | -- | - Same input shape supported. - Supports scalar input1 or input2. - Broadcast calculation up to 5 dimensions. | |
HardSigmoid | CPU computation | -- | Type constraint: only float types. | |
Hardmax | CPU computation※ | -- | -- | |
Identity | CPU computation | -- | -- | |
If | CPU computation※ | -- | -- | |
InstanceNormalization | CPU Calculation | |||
IsInf | CPU Calculation | Only supports float type. | ||
IsNaN | CPU Calculation | Only supports float type. | ||
LRN | CPU Calculation | Only supports 4D Tensors and float type. | ||
LSTM | BPU Accelerated | Supports batch_size=1 only. | No attribute settings supported. Only supports inputs of 3, 4, or 8, and outputs of 2. Float type only. | |
LeakyRelu | BPU Accelerated | N/A | N/A | |
Less | CPU Calculation | Supports same input shape, scalar input1 or input2, and broadcast | Supports up to 5-dimensional broadcast with same input shapes, and scalar inputs. | |
LessOrEqual | CPU Calculation | Same as 'Less' | Same as 'Less'. | |
Log | CPU Calculation | Only supports float type. | ||
LogSoftmax | CPU Calculation | Only supports float type. | ||
Loop | CPU Calculation | |||
LpNormalization | CPU Calculation | p-norm only supports 1 or 2, double or float type. | ||
LpPool | CPU Calculation | auto_pad not supported, double or float type, and 4D computation | ||
MatMulInteger | CPU Calculation | |||
MatMul | BPU Accelerated | For scenarios where the two inputs are featuremap and weight, which involve element-wise multiplication between a featuremap and a constant, the following can be optimized for execution on a BPU: - K vs KxN, K vs 1xKxN, K vs 1x1xKxN - MxK vs K, MxK vs KxN, MxK vs 1x1xKxN - 1xMxK vs K, 1xMxK vs 1xKxN - 1x1xMxK vs K, 1x1xMxK vs 1xKxN, 1x1xMxK vs 1x1xKxN - BxMxK vs KxN (where B >= 1) - 1xBxMxK vs KxN (where B >= 1) - AxBxMxK vs KxN (where A > 1 and B > 1) For situations where both inputs are featuremaps (i.e., element-wise multiplication of featuremaps), the following can be optimized for the BPU: - 1xBxMxK vs 1x1xKxN (where B >= 1) | Only supports float type. Optimizations apply to specific input shapes: see details below. | |
Max | CPU Calculation | Supports multiple inputs, same shape, scalar inputs, and broadcast | Up to 5-dimensional broadcast, supports scalar inputs. | |
MaxPool | BPU Accelerated | Kernel size [1-64]x[1-64], stride [1-185], padding >= 0, no dilation | dilation only supports 1x1, data row-major storage, no auto_pad or storage_order support, 4D Tensors only. | |
MaxRoiPool | CPU Calculation | |||
Mean | CPU Calculation | |||
Min | CPU Calculation | Same as 'Max' | ||
Mod | CPU Calculation | |||
Mul | BPU Accelerated | 4D inputs with C <= 2048; supported shapes: 1. (1xCxHxW vs 1xCxHxW) 2. (1xCxHxW vs 1xCx1x1) 3. (1xCxHxW vs 1x1x1x1) | Supports same shape inputs, scalar input1 or input2, and broadcast up to 5 dimensions. Input values must not be 0. |
Multinomial | CPU Calculation | |||
Neg | CPU Calculation | |||
NonZero | CPU Calculation | Supports float, int32, or int8 types, 1D or 4D computations | ||
Not | CPU Calculation | |||
OneHot | CPU | -- | -- | |
Or | CPU | -- | Supports same input shape calculation. Supports scalar inputs. Broadcasting up to 5 dimensions. | |
PRelu | BPU | - Type constraints: Only supports float type. - from_type: X and slope. - to_type: Y. | - X's shape is data_shape, slope's is slope_shape. - data_shape == slope_shape. - slope_shape.ProdSize() == 1. - NCHW layout for 4D tensors with equal N and C dimensions. - HxW with 1x1 (slope_shape). - HxW with Hx1 (slope_shape). - HxW with 1xW (slope_shape). - Special case: 4D X and 3D slope, with data_shape[1] == slope_shape[0], slope_shape[1] == 1, slope_shape[2] == 1. | |
Pad | BPU | Supports mode=Constant. Only supports padding on H, W dimensions. | Pad-10: - Type constraint: float only. - 4D NCHW tensors. - pads constraint: len(pads) == 8, pads[i] >= 0, pads[0] = pads[1] = pads[4] = pads[5] = 0. Pad-11: - from_type: data (float), pads (int64 tensor), optional constant_value (float). - 4D tensor, 2D or 3D padding only. - to_type: float only. | |
Pow | BPU | Supports exponent as a single value. | - Type constraints: double, float, int64, int32. - Supports same shape, scalar inputs, and broadcasting up to 5 dimensions. - X and Y must be of the same type. | |
QLinearConv | CPU※ | -- | -- | |
QLinearMatMul | CPU※ | -- | -- | |
QuantizeLinear | CPU | -- | -- | |
RNN | CPU | -- | - Type constraint: float only. - direction attribute: forward only. - Input constraints: X, W, R required, B, sequence_lens, initial_h unsupported. - Output constraint: Y_h output, shape [num_directions, batch_size, hidden_size]. | |
RandomNormal | CPU※ | -- | -- | |
RandomNormalLike | CPU※ | -- | -- | |
RandomUniform | CPU | -- | -- | |
RandomUniformLike | CPU | -- | -- | |
Range | CPU | Type constraints: float, int64, int32, int16. | -- | |
Reciprocal | BPU | -- | -- | |
ReduceL1 | CPU | -- | -- | |
ReduceL2 | CPU | -- | -- | |
ReduceLogSum | CPU | -- | Only supports float, double data types. | |
ReduceLogSumExp | CPU | -- | Type constraints: float, double. | |
ReduceMax | CPU | -- | Axes support: 0, 1, or equal to input dimensions. | |
ReduceMean | BPU | Input featuremap must be 4D, axes=[2, 3]. | Axes support: 0, 1, or equal to input dimensions. | |
ReduceMin | CPU | -- | -- | |
ReduceProd | CPU | -- | -- | |
ReduceSum | CPU | -- | Axes support: 0, 1, or equal to input dimensions. | |
ReduceSumSquare | CPU | -- | Axes support: 0, 1, or equal to input dimensions. | |
Relu | BPU | -- | -- | |
Reshape | CPU | -- | -- | |
Resize | BPU | 1. Input must be NCHW 4D and only resize in H and W dimensions. ROI input supported in ONNX opset=11 (manual modification required for PyTorch models to add ROI input, which only accepts constant inputs and works with tf_crop_and_resize mode). 2. Mode supports nearest and linear. 3. Supports scaling up and down. 4. For nearest mode, scaling factors should be powers of 2 (e.g., 2, 4, 8, 16, 32) and H_factor must be less than or equal to W_factor. 5. coordinate_transformation_mode supports half_pixel, pytorch_half_pixel, asymmetric, align_corners, and tf_crop_and_resize. When using tf_crop_and_resize, ensure ROI input coordinates are integers. resize-10 - Use opset10 when input is 2. - Input is a 4D Tensor. resize-11 - Use opset11 when input is greater than 2. - Input is a 4D Tensor. - coordinate_transformation_mode supports half_pixel, asymmetric, align_corners, and pytorch_half_pixel for nearest and linear modes, and half_pixel only for cubic mode. - extrapolation_value not supported. | ||
ReverseSequence | CPU | -- | -- | |
RoiAlign | CPU | -- | -- | |
Round | CPU | -- | -- | |
Scan | CPU※ | -- | -- | |
Scatter (deprecated) | CPU※ | -- | -- | |
ScatterElements | CPU | -- | from_type: float, int32, int8 indices: int32 only updates: float, int32, int8 to_type: float, int32, int8 | |
ScatterND | CPU | -- | from_type: float, int32, int8 updates: float, int32, int8 to_type: float, int32, int8 | |
Selu | CPU | -- | Only supports float types. | |
SequenceAt | CPU※ | -- | -- | |
SequenceConstruct | CPU※ | -- | -- | |
SequenceEmpty | CPU※ | -- | -- | |
SequenceErase | CPU※ | -- | -- | |
SequenceInsert | CPU※ | -- | -- | |
SequenceLength | CPU※ | -- | -- | |
Shape | BPU | Optimized to numerical storage via constant folding. | -- | |
Shrink | CPU※ | -- | -- | |
Sigmoid | BPU | Limited to 1CHW tensors where CxHxW <= 8192. 8W4C: pad W to multiples of 8 and C to multiples of 4. 32C: pad C to multiples of 32. Choose the smallest aligned shape between the two and ensure <= 8192. | Only supports float types. | |
Sign | CPU | -- | None | |
Sin | BPU | Limited to 1CHW tensors where CxHxW <= 8192. | Only supports float types. | |
Sinh | CPU | -- | Only supports float types. | |
Size | BPU | Optimized to numerical storage via constant folding. | -- | |
Slice | BPU | Unlimited | None | |
Softmax | BPU | Runs on CPU by default. Can be set to BPU for 4D inputs with axis=1 and as model output, using run_on_bpu. | Only supports float types. | |
Softplus | BPU acceleration | Supports CxHxW <= 8192 for a tensor of input dimension 1CHW. | Only supports float type. | |
Softsign | CPU computation | -- | Only supports float type. | |
SpaceToDepth | BPU acceleration | Supports DCR and CRD modes. Restrictions: H and W permutation, blocksize=2 only. | Only supports float type. | |
Split | BPU acceleration | Restrictions: NCHW input, divisible lengths, axis=1,2,3. | Only supports float type. | |
SplitToSequence | CPU computation(*) | -- | -- | |
Sqrt | BPU acceleration | Supports CxHxW <= 8192 for a tensor of input dimension 1CHW. | Only supports float type. | |
Squeeze | CPU computation | Removed by constant folding optimization if in constant substructure. | -- | |
StringNormalizer | CPU computation(*) | -- | -- | |
Sub | CPU computation | -- | Supports same shape, scalar inputs, broadcast up to 5 dimensions. | |
Sum | BPU acceleration | Same restrictions as Add. | Only supports float type. | |
Tan | CPU computation | -- | Only supports float type. | |
Tanh | BPU acceleration | Supports CxHxW <= 8192 for a tensor of input dimension 1CHW. | Only supports float type. | |
TfIdfVectorizer | CPU computation(*) | -- | -- | |
ThresholdedRelu | CPU computation | -- | Only supports float type. | |
Tile | CPU computation | -- | Supports float, int64, int32, uint64, uint32 types. | |
TopK | CPU computation | -- | Only supports float type, opset-10. | |
Transpose | CPU computation | Supports nhwc2nchw, perm=[0, 3, 1, 2], nchw2nhwc, perm=[0, 2, 3, 1]. | Supports float, int8, int32 types. | |
Unique | CPU computation(*) | -- | -- | |
Unsqueeze | CPU computation | Removed by constant folding optimization if in constant substructure. | -- | |
Upsample (replace resize) | BPU acceleration | -- | Upsample-10 for input=2, 4D Tensor. Upsample-11 for input>2, 4D Tensor. | |
Where | CPU computation | -- | Supports float and int64 types. Shape constraints detailed in the description. | |
Xor | CPU computation(*) | -- | -- | |
Function | CPU computation(*) | -- | -- | |
Celu | CPU computation(*) | -- | -- | |
DynamicQuantizeLinear | CPU computation(*) | -- | -- | |
GreaterOrEqual | CPU computation | -- | Supports same shape, scalar inputs, broadcast up to 5 dimensions. | |
MeanVarianceNormalization | CPU computation(*) | -- | -- | |
GridSample (PyTorch) | CPU computation(*) | -- | -- |
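The Conv row in the table above bundles several numeric limits. The sketch below shows one way to screen the Conv nodes of an ONNX graph against the kernel-size, dilation, and total-kernel-size limits; the helper name `check_conv_nodes` and the `model.onnx` path are illustrative, and the script deliberately covers only part of the conditions listed in the table (it assumes 2D Conv and weights stored as initializers).

```python
# Minimal sketch: flag Conv nodes that clearly violate the X3 BPU limits listed
# above (kernel HxW in [1,7]x[1,7], dilation a power of two, total kernel size
# HxWxC <= 32768). Other conditions from the table are not covered here.
import onnx
from onnx import helper, numpy_helper

def check_conv_nodes(model: onnx.ModelProto) -> None:
    weights = {init.name: numpy_helper.to_array(init) for init in model.graph.initializer}
    for node in model.graph.node:
        if node.op_type != "Conv":
            continue
        attrs = {a.name: helper.get_attribute_value(a) for a in node.attribute}
        kh, kw = attrs.get("kernel_shape", [1, 1])      # assumes 2D Conv
        dilations = attrs.get("dilations", [1, 1])
        problems = []
        if not (1 <= kh <= 7 and 1 <= kw <= 7):
            problems.append(f"kernel {kh}x{kw} outside [1,7]x[1,7]")
        if any(d & (d - 1) for d in dilations):          # power-of-two test
            problems.append(f"dilation {dilations} not a power of 2")
        w = weights.get(node.input[1])                   # weight shape: (M, C/group, kH, kW)
        if w is not None and kh * kw * w.shape[1] > 32768:
            problems.append("kernel HxWxC > 32768")
        if problems:
            print(f"{node.name or node.output[0]}: may fall back to CPU ({'; '.join(problems)})")

check_conv_nodes(onnx.load("model.onnx"))
```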
RDK Ultra Supported Caffe Operators List
Caffe Operator Name | CPU Computation/BPU Acceleration | RDK Ultra BPU Constraints | CPU Constraints |
---|---|---|---|
Convolution | BPU Accelerated | - Kernel width and height: <= 32 - Input/output channels (for one group): <= 8192 (or <= 65536 if it is the last operator in the quantized graph) - Stride: unrestricted, except that the stride of a Conv followed by Add (ResNet shortcut connection) must be in {1, 2} - Dilation: <= 16; when dilation != 1, only stride = 1 is supported - Axis default: 1 | - 4D Convolution only - auto_pad attribute not supported - Type constraints: float, int32, int8 - Pads attribute constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart==Hend and Wstart==Wend. |
Deconvolution | BPU Accelerated | - kernel >= stride - Input/output featuremaps <= 2048 - pad <= kernel / stride - out_pad < 2 - Stride: 1 <= stride <= 14, but stride_h and stride_w cannot both be 1 - Axis configuration not supported | - Shape constraint: 4D Tensor computation only - Type constraint: float only - Attribute constraints: only the dilations, group, output_padding, pads, and strides attributes are supported - Pads attribute constraint: [hstart, wstart, hend, wend] must satisfy (hstart==hend and wstart==wend). |
MaxUnpool | CPU Computation | --- | - from_type constraints: X - float, I - Tensor(int64) - to_type constraints: float only |
Pooling | BPU Accelerated | - Four types: MaxPooling, AveragePooling, GlobalMaxPooling, GlobalAveragePooling - Constraints: MaxPooling - int16 input/output, kernel <= 256, stride <= 256, padding <= 256 - AveragePooling - same as MaxPooling - GlobalAveragePooling - unlimited - GlobalMaxPooling - H, W ∈ [1, 256] | None |
SPP | CPU Computation | Not supported | - Supports pyramid_height with 2^n pooling, n < 7 - pooling kernel <= 255 - pool option, configurable values: {0, 1} |
InnerProduct | BPU Accelerated | Converted to Conv with Conv constraints - Axis configuration not supported | None |
LRN | CPU Computation | Not supported | - local_size supported - alpha, beta, norm_region supported (configurable values: ACROSS_CHANNELS, WITHIN_CHANNEL) - k supported |
MVN | CPU Computation | Not supported | - normalize_variance: configurable values 1 - across_channels: configurable values 1 - Float32 computation only |
BatchNorm | BPU Accelerated | Unlimited | None |
ELU | BPU Accelerated | - int16 input/output support - Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536] | None |
BNLL | CPU Computation | Not supported | None |
PReLU | CPU Computation | - type constraint: float only - from_type: X and slope - to_type: Y - Shape constraints: X = data_shape, slope = slope_shape - data_shape == slope_shape - slope_shape.ProdSize() == 1 - 4D NCHW layout for X and slope, N, C dimensions must be equal - HxW or 1x1 for slope_shape - Hx1 or 1xH for slope_shape - 1xW or Wx1 for slope_shape - Special case: 4D X and 3D slope with data_shape[1] = slope_shape[0] and slope_shape[1] = 1, slope_shape[2] = 1 | None |
ReLU/LeakyReLU | BPU Accelerated | - int16 input/output support - Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536] | None |
Sigmoid | BPU Accelerated | - int16 input/output support - Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536] | None |
TanH | BPU Accelerated | - int16 input/output support - Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536] | None |
Eltwise | BPU Accelerated | Supports Add, Sub, Mul operations - int16 input/output support - Feature map and constant inputs, at most one constant - Broadcasting except first dimension - 2D, 3D, 4D, and 5D dimensions supported, with general limitations (see notes) - Different input dimensions supported, 5D inputs must meet: merge adjacent dimensions to 4D (e.g., NHWD1 and N1WDC), broadcast dimensions cannot be adjacent (e.g., NHWD1 and N11DC due to broadcast on H, W, and C) | None |
Bias | BPU Accelerated | Refer to Eltwise (Add) constraints | None |
Scale | BPU Accelerated | Refer to Eltwise (Mul) constraints | None |
AbsVal | BPU Accelerated | - int16 input/output support - Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536] | None |
Exp | BPU Accelerated | - int16 input/output support - Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536] | None |
Log | BPU Accelerated | - int16 input/output support - Input/output dimensions up to 10D, max dimension [1, 4096], others [1, 65536] | None |
Power | BPU Op | 1. Supports int16 input and output. 2. Input and output support up to 10 dimensions, with max dimension ∈ [1, 4096], others ∈ [1, 65536]. 3. Second input only supports scalar. | - |
Threshold | CPU Computation | Not supported | - |
Reduction | CPU Computation | Not supported. Operation supports SUM, ASUM, SUMSQ, MEAN, Max, LogSum, Min, Prod; Axis supports; Only supports Float32 computation. | - |
Softmax | BPU Op | 1. Supports int16 input and output. 2. Defaults to CPU execution. Can run on BPU for 4D inputs with axis=1,2,3 if specified by run_on_bpu. | - |
ArgMax | BPU Op | 1. Only supports axis=1, c<=64. 2. Does not support top_k ≠ 1. 3. Supports int16 input and output. | - |
Concat | BPU Op | 1. Supports int16 input and output. 2. Does not support N-dimensional concat. | - |
Split | BPU Op | 1. Supports int16 input and output. 2. Length of the original input must be a multiple of each split tensor length. 3. Supports any dimension except N. 4. Split count should be divisible. 5. Supports non-four-dimensional input and output. | - |
Slice | BPU Op | 1. Supports int16 input and output. 2. Unlimited, supports non-four-dimensional input and output. | - |
Reshape | BPU Op | 1. Supports int16 input and output. 2. Supports up to 10-dimensional input and output. | Shape supports [1,4] shape_dim configurations; Axis supports [-4,3], does not support N dimensions, default 0 follows Caffe rules; num_axes supports [-1,3], default -1 means all axes from axis start. |
Flatten | CPU Computation | Not supported (can be fused in some scenarios) | Axis range [-4,3], default is 1, with -4 and 0 having the same meaning. Only supports End_axis == -1. |
Crop | CPU Computation | Not supported | - |
Dropout | BPU Op | Unlimited | - |
LSTM | BPU Op | Only supports batch=1 | - |
Normalize | CPU Computation | Not supported | Type constraint: only supports float types. |
PassThrough | BPU Op | Supports mode=DCR and mode=CRD. Only supports reordering along H and W directions with blocksize=2, e.g., NxCxHxW -> Nx(4C)x(H/2)x(W/2). | Type constraint: only supports float types. |
CReLU | CPU Computation | Not supported | Type constraint: only supports float types. |
RReLU | CPU Computation | Not supported | None |
Permute | BPU Op | 1. Supports arbitrary input dimensions. 2. Supports conversion of any other dimension except batch dimension (first dimension). | - Supports nhwc2nchw, perm: [0, 3, 1, 2]. - Supports nchw2nhwc, perm: [0, 2, 3, 1]. - Supports permutation of specified dimensions, data types supported: float, int8, int32. |
MatMul | BPU Op | C = MatMul(A, B), with dimension constraints for A and B: - Both A and B can have non-four-dimensional inputs but must meet these conditions: - Dimensions of A and B must be the same. - The lowest two dimensions M, K ∈ [1, 8192], higher dimensions ∈ [1, 4096]. Note: HDMK vs HDKN, MK/KN refers to the lowest two dimensions. - Broadcasting is supported under these conditions: - All other dimensions than the lowest two of A and B are either 1 or do not require broadcasting. - Supported example: HDMK vs H1KN - Unsupported example: H1MK vs 1DKN - A cannot have both broadcasting and non-broadcasting values in dimensions beyond its lowest two. - Supported example: 11MK vs HDKN - Unsupported example: H1MK vs HDKN - If B has both broadcasting and non-broadcasting values in higher dimensions, non-broadcasting values must be contiguous. - Supported example: BHDMK vs B11KN - Unsupported example: BHDMK vs B1DKN - Broadcasting rules: - If A and B have unequal values in a given dimension, the 1 is considered the broadcasting value, and the non-1 is not. - If A and B have equal values in a given dimension, both are considered non-broadcasting values (e.g., HDMK vs H1KN, 1 is the broadcasting value, H is not). | Type constraint: only supports float types. |
Upsample | BPU Op | Requires four-dimensional NCHW input, resize only supported on H and W dimensions; factor cannot be less than 2. | - |
ROIPooling | CPU Computation | Not supported | - |
PSROIPooling | CPU Computation | Not supported | - |
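The PassThrough rearrangement noted in the table above (NxCxHxW -> Nx(4C)x(H/2)x(W/2) with blocksize=2) is a space-to-depth layout change. The numpy sketch below reproduces the standard ONNX SpaceToDepth reshape/transpose recipe; whether this particular ordering corresponds to the DCR or the CRD mode on this hardware is an assumption, not something the table states.

```python
# Minimal numpy sketch of the blocksize=2 rearrangement used by PassThrough /
# SpaceToDepth: NxCxHxW -> Nx(4C)x(H/2)x(W/2). Follows the reshape + transpose
# recipe from the ONNX SpaceToDepth definition; the DCR/CRD mapping is assumed.
import numpy as np

def space_to_depth(x: np.ndarray, block: int = 2) -> np.ndarray:
    n, c, h, w = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(n, c, h // block, block, w // block, block)
    x = x.transpose(0, 3, 5, 1, 2, 4)    # move the block offsets into channels
    return x.reshape(n, c * block * block, h // block, w // block)

x = np.arange(1 * 2 * 4 * 4, dtype=np.float32).reshape(1, 2, 4, 4)
print(space_to_depth(x).shape)           # (1, 8, 2, 2)
```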
RDK Ultra-supported ONNX Operators List
ONNX Operator Name | CPU Computation/BPU Acceleration | RDK Ultra BPU Constraints | CPU Constraints |
---|---|---|---|
Abs | BPU Accelerated | 1. Supports int16 input/output. 2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536]. | Type constraint: only supports float types. |
Acos | CPU Computation | -- | Type constraint: only supports float types. |
Acosh | CPU Computation | -- | Type constraint: only supports float types. |
Add | BPU Accelerated | 1. Supports int16 input/output. 2. Input can be featuremaps or constants, with at most one constant input. 3. Supports broadcast except for the first dimension, including NHWC and N1WC broadcasting. 4. Dimensions supported: 2D, 3D, 4D, and 5D, with general restrictions (see notes). 5. In ResNet's shortcut connection, Add is fused into the preceding conv for acceleration. | - Supports computation with same input shape. - Supports scalar inputs as either input 1 or 2. - Supports broadcast up to 5D. |
And | CPU Computation | -- | - Supports same input shape calculation. - Supports scalar inputs as either input 1 or 2. - Supports broadcast up to 5D. |
ArgMax | BPU Accelerated | 1. 4D input format NCHW. 2. Only supports argmax along the C axis (axis=1). 3. C <= 64. 4. Supports int16 input/output. | Type constraint: only supports float types. |
ArgMin | BPU Accelerated | Similar to ArgMax constraints | Type constraint: only supports float types. |
Asin | CPU Computation | -- | Type constraint: only supports float types. |
Asinh | CPU Computation | -- | Type constraint: only supports float types. |
Atan | BPU Accelerated | 1. Supports int16 input/output. 2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536]. | Type constraint: only supports float types. |
Atanh | CPU Computation | -- | Type constraint: only supports float types. |
AveragePool | BPU Accelerated | Kernel <= 256. Stride <= 256. Padding <= 256. | No support for auto_pad attribute. Only supports 4D Tensors. |
BatchNormalization | BPU Accelerated | No limitations. | Type constraint: only supports float types. Supports channel-first data layout (dimension 1 is channel). |
BitShift | CPU Computation※ | -- | -- |
Cast | CPU Computation | -- | from_type supports: double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8. to_type supports: double, float, bool, int64, uint32, int32, uint16, int16, uint8, int8. |
Ceil | BPU Accelerated | 1. Supports int16 input/output. 2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536]. | Type constraint: only supports float types. |
Clip | BPU Accelerated | 1. Supports int16 input/output. 2. Input/output dimensions up to 10D, with max dimensions in [1, 4096] and others in [1, 65536]. Opset 6: min, max as attributes, dtype only supports float. Opset 11: min, max as inputs, second input is min when there are two; dtype supports float, double. | |
Compress | CPU Computation※ | -- | -- |
Concat | BPU Accelerated | 1. Supports int16 input/output. 2. Does not support N-dimensional concatenation. | -- |
ConcatFromSequence | CPU Computation※ | -- | -- |
Constant | BPU Accelerated | Optimized via constant folding | No support for sparse_tensor attribute. |
ConstantOfShape | BPU Accelerated | Optimized via constant folding | Supported types: float, int32, int8. |
Conv | BPU Accelerated | Supports 4D (conv2d) and 5D (conv3d) inputs. 4D conv2d: Kernel size range: N,C ∈ [1, 8192]; H,W ∈ [1, 31]; CHW ≤ 65535. Channel limits (one group): C ≤ 8192 (or ≤ 65536 if it is the last operator in the quantized graph). Stride: H,W ∈ [1, 256] (except for shortcut-connected conv, where stride must be 1 or 2). Dilation: H,W ∈ [1, 16], with the H and W factors dividing the input tensor dimensions. Padding: H,W ∈ [0, 256]. 5D conv3d: NCDHW limits: N ∈ [1, 128]; H,W,D,C ∈ [1, 65536]. Kernel size: N,C ∈ [1, 65536]; H,W ∈ [1, 31], D ∈ [1, 8191]. Padding: H,W ∈ [0, 256], D ∈ [0, kernel_d/2]. Stride: H, W must be 1 or 2. Group and dilation not supported. Size limits: ≤ 1GB; D x C ≤ 4096; D x H x alignCeil(W, 256) x D x C < 1GB. Weight limit: D x C ≤ 8192. | Only supports 4D convolutions. No support for auto_pad attribute. Supported types: float, int32, int8. Pads constraint: [Hstart, Wstart, Hend, Wend] (4 elements) with Hstart==Hend and Wstart==Wend. |
ConvInteger | CPU Computation※ | -- | -- |
ConvTranspose | BPU Accelerated | Input/output featuremap limits: N ∈ [1, 128], H,W ∈ [1, 65536], C ∈ [1, 2048]. Size limit: 1GB. Weight size limits: N,C ∈ [1, 2048], H,W ∈ [1, 14], HW ≠ 1; size ∈ [1, 65535]. Padding: for odd strides, H,W ∈ [0, kernel / stride); for even strides, H,W ∈ [0, kernel / stride]. out_pad: H,W ∈ [0, 1]. Stride: H,W ∈ [1, 14], but stride_h and stride_w cannot both be 1. | Shape constraint: only supports 4D Tensors for computation. Type constraint: only supports float types. Attribute constraints: - Supports only dilations, group, output_padding, pads, and strides attributes. - The pads attribute constraint is that [hstart, wstart, hend, wend] must satisfy (hstart==hend and wstart==wend). |
Cos | BPU Acceleration | 1. This operator supports int16 input and output. 2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536]. Type Constraint: Only supports float types. | |
Cosh | CPU Computation | -- | |
CumSum | CPU Computation | -- | Axis: Type Constraint is only for int32 types. |
DepthToSpace | BPU Acceleration | Supports modes DCR and CRD. Only rearrangement of H and W directions is supported, and blocksize=2 rearrangement only. Example: NxCxHxW -> Nx(C/4)x(2H)x(2W), where the number of channels must be a multiple of 4. | From_Type Constraints: - Type Constraint: Only supports float types. - Limited to 4D Tensor computation. To_Type Constraints: - Type Constraint: Only supports float types. - Limited to 4D Tensor computation. |
DequantizeLinear | CPU Computation | -- | |
Det | CPU Computation※ | -- | |
Div | BPU Acceleration | 1. Only supports featuremap inputs (not constant inputs); 2. Input shape constraints refer to the Mul operator. - Supports same-input-shape computation. - Supports computation when input 1 is a scalar or input 2 is a scalar. - Supports broadcast computation with a maximum dimension of 5. | |
Dropout | BPU Acceleration | Does not participate in inference computations and will be removed during optimization. | |
Einsum | CPU Computation※ | -- | |
Elu | BPU Acceleration | 1. This operator supports int16 input and output. 2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536]. Type Constraint: Only supports float types. | |
Equal | BPU Acceleration | 1. Supports int16 input. 2. Input and output dimensions support 2-5 dimensions. 3. Supports broadcast across all dimensions, broadcast for fin0 or fin1 input allowed, but not mutual broadcasting. 5D broadcast has the following restrictions: - Must merge adjacent dimensions to reduce to 4D (including dimension N), e.g., NHWDC and NH1D1 can merge the NH dimension. - Broadcasted dimensions cannot merge with adjacent ones, e.g., NHWDC and N1W1C are unsupported due to inability to merge adjacent dimensions. 4. Runs on CPU by default; can be specified to run on BPU with run_on_bpu. | |
Erf | CPU Computation | -- | Type Constraint: Supports float and double data types. |
Exp | BPU Acceleration | 1. Supports int16 input and output. 2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536]. Type Constraint: Only supports float types. | |
Expand | BPU Acceleration | 1. Supports int16 input and output. 2. Input and output support up to 10 dimensions. 3. Only one dimension may differ between input and output. |
EyeLike | CPU Computation | -- | |
Flatten | BPU Acceleration | Constraints similar to Reshape. | |
Floor | BPU Acceleration | 1. Supports int16 input and output. 2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536]. Type Constraint: Only supports float types. | |
GRU | CPU Computation | -- | Direction Attribute: Only supports forward type. Type Constraint: Only supports float types. |
Gather | BPU Acceleration | 1. All ranks of input/output/indices must be less than or equal to 4. 2. Indices support: - When indices are feature (other op outputs), type constraint is only for int32. - When indices are weight (model constants), type constraint supports int32 and int64. From_Type Constraints: - input: Type constraint supports float, int64, int32, int8, uint64, uint32, uint8. - indices: Type constraint supports int32, int64. To_Type Constraints: - Type constraint supports float, int64, int32, int8, uint64, uint32, uint8. | |
GatherElements | BPU Acceleration | 1. Supports int16 input and output. 2. Input/indices/output dimensions support up to 10 dimensions. 3. Indices type constraint supports int16/int32/int64. | |
GatherND | CPU Computation | -- | From_Type Constraints: - input: Type constraint supports float, int32, int8. - indices: tensor(int64). To_Type Constraints: Type constraint supports float, int32, int8. |
Gemm | BPU Acceleration | Gemm will be converted to Conv implementation, with boundary constraints referring to Conv. | Type Constraint: Only supports float types. |
GlobalAveragePool | BPU Acceleration | No limitations. | - Type Constraint: Only supports float types. - Limited to 4D Tensors. |
GlobalLpPool | CPU Computation | -- | - Type Constraint: Supports float and double types. - Limited to 4D Tensor computation. |
GlobalMaxPool | BPU Acceleration | H, W ∈ [1, 256]. | - Type Constraint: Only supports float types. - Limited to 4D Tensors. |
Greater | BPU Acceleration | 1. Supports int16 input. 2. Input and output dimensions support 2-5 dimensions. 3. Same as Equal operator constraints. 4. Runs on CPU by default; can be specified to run on BPU with run_on_bpu. | |
HardSigmoid | BPU Acceleration | 1. Supports int16 input and output. 2. Input and output support dimensions up to 10, with the highest dimension ∈ [1, 4096], and other dimensions ∈ [1, 65536]. Type Constraint: Only supports float types. | |
Hardmax | CPU Computation※ | -- | |
Identity | CPU Computation | -- | |
If | CPU Computation※ | -- | -- |
InstanceNormalization | CPU Computation | -- | - Type constraint only supports float types. - Supports data layout with the first dimension as channels. |
IsInf | CPU Computation※ | -- | -- |
IsNaN | CPU Computation※ | -- | -- |
LRN | CPU Computation | -- | - Type constraint only supports float types. - Only supports four-dimensional Tensors. |
LSTM | BPU Acceleration | Supports batch_size=1 only. If using multiple batches, ensure LSTM's batch is 1 during ONNX export and configure the parameter input_batch=1 in the YAML. | - Type constraint only supports float types. - Attribute constraint: direction attribute only supports forward. - Input constraints: - Supports X, W, R inputs; - Supports X, W, R, B inputs (sequence_lens is empty or default); - Supports X, W, R, B, sequence_lens, initial_h, initial_c, P inputs (sequence_lens is empty or default). |
LeakyRelu | BPU Acceleration | 1. Supports int16 input and output. 2. Input and output dimensions support 1-10 dimensions, with the highest dimension ∈ [1, 4096], others ∈ [1, 65536]. | Type constraint: only supports float types. |
Less | BPU Acceleration | 1. Supports int16 input. 2. Input/output dimensions support 2-5 dimensions. 3. Runs on CPU by default; can be specified to run on BPU using run_on_bpu. | - Supports same shape inputs calculation. - Supports scalar input1 or scalar input2 calculation. - Supports broadcast calculation with a max dimension of 5. |
LessOrEqual | BPU Acceleration | In opset11, single LessOrEqual not supported; Greater + Not operator is used instead, with the same limitations as Greater. | - Supports same shape inputs calculation. - Supports scalar input1 or scalar input2 calculation. - Supports broadcast calculation with a max dimension of 5. |
Log | BPU Acceleration | 1. Supports int16 input and output. 2. Input and output dimensions support 1-10 dimensions, with the highest dimension ∈ [1, 4096], others ∈ [1, 65536]. | Type constraint: only supports float types. |
LogSoftmax | CPU Computation | -- | Type constraint: only supports float types. |
Loop | CPU Computation※ | -- | -- |
LpNormalization | CPU Computation | -- | - p-norm only supports 1 or 2. - Type constraint supports double and float types. |
LpPool | CPU Computation | -- | - auto_pad attribute not supported. - Type constraint supports double and float types. - Limited to 4-dimensional computation. |
MatMulInteger | CPU Computation※ | -- | -- |
MatMul | BPU Acceleration | C = MatMul(A, B), with input A and B dimension restrictions: - Non-quadruple dimensional inputs allowed but must meet these constraints: - A and B must have identical dimensions. - The lowest two dimensions M, K ∈ [1, 8192], higher dimensions ∈ [1, 4096]. Note: HDMK vs HDKN, MK/KN refers to the lowest two dimensions. - Broadcast is supported under these conditions: - For A and B, all dimensions except the lowest two must be either 1 or non-broadcastable values. - Examples: HDMK vs H1KN - Counterexample: H1MK vs 1DKN - A's higher dimensions cannot contain both broadcastable and non-broadcastable values. - Examples: 11MK vs HDKN - Counterexample: H1MK vs HDKN - If B's higher dimensions contain both broadcastable and non-broadcastable values, non-broadcastable ones must be consecutive high dimensions. - Examples: BHDMK vs B11KN - Counterexample: BHDMK vs B1DKN - Type constraint: only supports float types. | |
Max | BPU Acceleration | 1. Supports int16 input and output. 2. Input/output dimensions support 2-5 dimensions. 3. Supports broadcast across all dimensions, broadcast for fin0 or fin1 individually, not mutual broadcast. Restrictions for 5D broadcast: - Can merge adjacent dimensions to 4D (including dimension N), e.g., NHWDC and NH1D1 can merge NH. - Broadcast dimensions cannot merge with adjacent ones, e.g., NHWDC and N1W1C unsupported due to no adjacent dimension merge. - Other details in the documentation. | - Supports 1-∞ inputs. - Supports same shape inputs calculation. - Supports scalar input1 or scalar input2 calculation. - Supports broadcast calculation with a max dimension of 5. |
MaxPool | BPU Acceleration | Supports int16 input and output. Kernel size ≤ 256. Stride ≤ 256. Padding ≤ 256. MaxPool does not support dilation. | 1. Dilation only supports 1x1. 2. Data row-major storage only. 3. auto_pad attribute not supported. 4. storage_order attribute not supported. 5. Limited to four-dimensional Tensor computation. |
MaxRoiPool | CPU Computation | -- | No specific constraints. |
Mean | CPU Computation※ | -- | -- |
Min | BPU Acceleration | 1. Supports int16 input and output. 2. Input/output dimensions support 2-5 dimensions. 3. Similar to Max, but with different broadcast and dimension merge rules. 4. Runs on CPU by default; can be moved to BPU using run_on_bpu. | - Similar to Max, but with different input constraints. |
Mod | CPU Computation※ | -- | -- |
Mul | BPU Acceleration | 1. Supports int16 input and output. 2. Input types support feature maps and constants, with at most one constant input. 3. Supports broadcast except the first dimension, mutual broadcast between inputs, like NH1C and N1WC. 4. Dimensions up to 5D, with general restrictions (see notes). Supports different input dimensions, with specific restrictions for 5D input. (1) Merge adjacent dimensions to 4D, e.g., NHWD1 and N1WDC can merge W and D. (2) Cannot merge broadcast dimensions with adjacent ones, e.g., NHWD1 and N11DC unsupported due to H, W, and C being broadcast dimensions. | - Supports same shape inputs calculation. - Supports scalar input1 or scalar input2 calculation. - Supports broadcast calculation with a max dimension of 5. |
Multinomial | CPU Computation※ | -- | -- |
Neg | CPU computation | ||
Not | CPU computation | ||
OneHot | CPU computation | ||
Or | CPU computation | - Supports same-input-shape computation. - Supports when Input 1 is a scalar or Input 2 is a scalar. - Supports broadcast calculation with a maximum dimension of 5. | |
PRelu | CPU computation | - Type constraint: only supports float types. - from_type: X and slope. - to_type: Y. - Constraints for X's shape (data_shape): - data_shape == slope_shape. - slope_shape.ProdSize() == 1. - N, C dimensions must be equal in 4D NCHW layout. - HxW with 1x1 (slope_shape), Hx1 (slope_shape), or 1xW (slope_shape). - Special case: 4D X and 3D slope with data_shape[1] == slope_shape[0] == 1 and slope_shape[2] == 1. | |
Pad | BPU acceleration | 1. Supports int16 input and output. 2. Supports mode: Constant. 3. Supports padding in all dimensions. | Pad-10: - Type constraint: float only. - 4D NCHW tensors only. - Constraint on pads attribute: - len(pads) == 8 - pads[i] >= 0 - pads[0] == pads[1] == pads[4] == pads[5] == 0. Pad-11: - from_type: data - float only. - pads: tensor(int64) - constant_value (optional) - float only. - to_type: float only. - 4D Tensor only. - Supports 2D or 3D padding only. |
Pow | BPU acceleration | 1. Supports int16 input and output. 2. Input/output support 1-10 dimensions, max dim ∈ [1, 4096], others ∈ [1, 65536]. 3. Second input must be a scalar. | - Type constraints: double, float, int64, int32. - Supports same-input-shape calculation. - Supports scalar inputs for either Input 1 or Input 2. - Supports broadcast calculation with a maximum dimension of 5. - Requires X and Y to have the same type. |
QLinearConv | CPU computation※ | ||
QLinearMatMul | CPU computation※ | ||
QuantizeLinear | CPU computation | ||
RNN | CPU computation | - Type constraint: float only. - Attribute constraint: direction attribute supports forward only. - Input constraint: X, W, R inputs only, no optional inputs like B, sequence_lens, initial_h allowed. - Output constraint: Only Y_h output supported, shape [num_directions, batch_size, hidden_size]. | |
RandomNormal | CPU computation※ | ||
RandomNormalLike | CPU computation※ | ||
RandomUniform | CPU computation | ||
RandomUniformLike | CPU computation | ||
Range | CPU computation | Type constraints: float, int64, int32, int16. | |
Reciprocal | BPU acceleration | 1. Supports int16 input and output. 2. Input/output support 1-10 dimensions, max dim ∈ [1, 4096], others ∈ [1, 65536]. | |
ReduceL1 | CPU computation | ||
ReduceL2 | CPU computation | ||
ReduceLogSum | CPU computation | ||
ReduceLogSumExp | CPU computation | Type constraints: float, double. | |
ReduceMax | BPU acceleration | 1. Supports int16 input and output. 2. Input supports 2-5 dimensions, requires axes attribute with 1 axis, no reduction across more than 1 dimension. 3. Reduced dimension size ∈ [1, 8192]. 4. keepdims == 1 only. | Axes supported: 0, 1, or equal to input data dimensions. |
ReduceMean | BPU acceleration | 1. Supports int16 input and output. 2. Input supports 2-5 dimensions, requires axes attribute with 1 axis, no reduction across more than 1 dimension. 3. Special case: Supports HW reduction when reduce_dim = 2. 4. keepdims == 1 only. | Axes supported: 0, 1, or equal to input data dimensions. |
ReduceMin | CPU computation | ||
ReduceProd | CPU computation | ||
ReduceSum | BPU acceleration | 1. Supports int16 input and output. 2. Input supports 2-5 dimensions, requires axes attribute with 1 axis, no reduction across more than 1 dimension. | Axes supported: 0, 1, or equal to input data dimensions. |
ReduceSumSquare | CPU computation | ||
Relu | BPU acceleration | Unlimited | Only supports float type. |
Reshape | BPU acceleration | 1. Supports int16 inputs and outputs. 2. Supports 1-10 dimensional inputs and outputs. | None. |
Resize | BPU acceleration | 1. NCHW input featuremaps, resize only on H and W dimensions. onnx opset=11 supports ROI input (PyTorch models need manual modification to add ROI input, which only accepts constant inputs). 2. Mode supports nearest and linear. 3. Supports scaling up or down. 4. For nearest mode, scale factors must be powers of 2 (e.g., 2, 4, 8, 16, 32) with H_factor <= W_factor. 5. onnx opset=11 supports half_pixel, pytorch_half_pixel, asymmetric, align_corners, and tf_crop_and_resize. ROI input is only effective in tf_crop_and_resize mode, requiring integer boundary coordinates after conversion. 6. extrapolation_value not supported. | |
ReverseSequence | CPU computation | -- | -- |
RoiAlign | CPU computation | -- | -- |
Round | CPU computation | -- | -- |
Scan | CPU computation* | -- | -- |
Scatter (deprecated) | CPU computation* | -- | -- |
ScatterElements | CPU computation | -- | from_type: supports float, int32, int8. indices: only supports int32 type. updates: supports float, int32, int8. to_type: supports float, int32, int8. |
ScatterND | CPU computation | -- | from_type: supports float, int32, int8. updates: supports float, int32, int8. to_type: supports float, int32, int8. |
Selu | CPU computation | -- | Only supports float type. |
SequenceAt | CPU computation* | -- | -- |
SequenceConstruct | CPU computation* | -- | -- |
SequenceEmpty | CPU computation* | -- | -- |
SequenceErase | CPU computation* | -- | -- |
SequenceInsert | CPU computation* | -- | -- |
SequenceLength | CPU computation* | -- | -- |
Shape | BPU acceleration | Optimized through constant folding into numerical storage. | -- |
Shrink | CPU computation* | -- | -- |
Sigmoid | BPU acceleration | 1. Supports int16 inputs and outputs. 2. Supports 1-10 dimensional inputs, max dimension [1, 4096], others [1, 65536]. | Only supports float type. |
Sign | CPU computation | Only supports float type. | -- |
Sin | BPU acceleration | 1. Supports int16 inputs and outputs. 2. Supports 1-10 dimensional inputs, max dimension [1, 4096], others [1, 65536]. | Only supports float type. |
Sinh | CPU computation | Only supports float type. | -- |
Size | BPU acceleration | Optimized through constant folding into numerical storage. | -- |
Slice | BPU acceleration | 1. Supports int16 inputs and outputs. 2. Unlimited, supports non-four-dimensional inputs and outputs. | No constraints. |
Softmax | BPU acceleration | - Supports int16 inputs and outputs. - Runs on CPU by default, with differences between onnx::softmax and pytorch::softmax: 1. For onnx::softmax, can run on BPU if input is 4D and axis=3. Specify run_on_bpu. 2. For pytorch::softmax, can run on BPU for 4D inputs and axis=1, 2, 3. Specify run_on_bpu. | Only supports float type. |
Softplus | BPU acceleration | 1. Supports int16 inputs and outputs. 2. Supports 1-10 dimensional inputs, max dimension [1, 4096], others [1, 65536]. | Only supports float type. |
Softsign | CPU computation | -- | -- |
SpaceToDepth | BPU accelerated | Supports DCR and CRD modes. Only reordering along H and W dimensions is allowed, with blocksize=2. | float only |
Split | BPU accelerated | 1. Supports int16 inputs and outputs. 2. Input length must be a multiple of each split tensor's length. 3. Supports arbitrary dimensions except N. 4. Split count must be divisible. 5. Non-four-dimensional inputs and outputs supported. | float only |
SplitToSequence | CPU computation(*) | -- | -- |
Sqrt | BPU accelerated | 1. Supports int16 inputs and outputs. 2. Input/output supports 1-10 dimensions, with max dimension in [1, 4096] and others in [1, 65536]. | float only |
Squeeze | BPU accelerated | Converted to Reshape op. BPU constraints apply. | -- |
StringNormalizer | CPU computation(*) | -- | -- |
Sub | BPU accelerated | 1. Supports int16 inputs and outputs. 2. Feature map and constant inputs supported, up to one constant. 3. Broadcasting except first dimension, supports input broadcasting between NH1C and N1WC. 4. 2D-5D dimensions supported, with general restrictions (see notes). Supports different input dimensions; for 5D inputs, see restrictions below. (1) Merge adjacent dimensions to 4D, e.g., NHWD1 and N1WDC. (2) Cannot merge broadcasted dimensions with adjacent ones, e.g., NHWD1 and N11DC not supported due to H, W, and C being broadcasted dimensions. | Same shape input support Scalar input support Broadcasting up to 5 dimensions. |
Sum | BPU accelerated | Constraints same as Add | float only |
Tan | CPU computation | -- | float only |
Tanh | BPU accelerated | 1. Supports int16 inputs and outputs. 2. Input/output supports 1-10 dimensions, with max dimension in [1, 4096] and others in [1, 65536]. | float only |
TfIdfVectorizer | CPU computation(*) | -- | -- |
ThresholdedRelu | CPU computation | -- | float only |
Tile | BPU accelerated | 1. Supports int16 inputs and outputs. 2. Only one dimension may have differing values between input and output. | float, int64, etc. |
TopK | BPU accelerated | 1. Supports int16 inputs and outputs. 2. Input/indices/output dimensions: 1-10. 3. Indices type: int16/int32/int64. 4. Sorted parameter supports true only. | float only |
Transpose | BPU accelerated | 1. Supports int16 inputs and outputs. 2. Arbitrary input dimensions. | nhwc2nchw, perm: [0, 3, 1, 2] nchw2nhwc, perm: [0, 2, 3, 1] Custom perm dimensions for float, int8, int32. |
Unique | CPU computation(*) | -- | -- |
Unsqueeze | BPU accelerated | Converted to Reshape op. BPU constraints apply. | -- |
Upsample (resize replacement) | BPU accelerated | -- | Upsample-10 Input: 4D Tensor, opset10 when = 2 Upsample-11 Input: 4D Tensor, opset11 when > 2 Coordinate transformation modes: nearest, linear (half_pixel, asymmetric, align_corners, pytorch_half_pixel), cubic (half_pixel only) Extrapolation_value unsupported. |
Where | CPU computation | -- | float, int64 |
Xor | CPU computation(*) | -- | -- |
Function | CPU computation(*) | -- | -- |
Celu | CPU computation(*) | -- | -- |
DynamicQuantizeLinear | CPU computation(*) | -- | -- |
GreaterOrEqual | BPU accelerated | Opset11 doesn't support standalone GreaterOrEqual; Less + Not on BPU for split conditions, with similar restrictions to Less. | Same shape, scalar, broadcast up to 5D. |
MeanVarianceNormalization | CPU computation(*) | -- | -- |
GridSample (PyTorch) | BPU accelerated | 1. Input dimensions: 4D, N ∈ [1, 4096], C ∈ [1, 65536], H, W ∈ [1, 1024], H x W ≤ 720 x 1024. 2. Mode: bilinear, nearest. 3. Padding_mode: zeros, border. 4. Opset16 ONNX operator, exported via horizon_nn.torch.export_onnx (not opset11 native). See the example below. | -- |
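The GridSample row above refers to exporting through horizon_nn.torch.export_onnx. The sketch below is only a hedged illustration: it assumes export_onnx mirrors the torch.onnx.export(model, args, path) calling convention, and the model, tensor shapes, and file name are made up for the example; consult the toolchain documentation for the actual API.

```python
# Hedged sketch of exporting a model containing F.grid_sample through
# horizon_nn.torch.export_onnx, as referenced in the GridSample row above.
# Assumption: export_onnx mirrors the torch.onnx.export(model, args, path)
# calling convention; check the toolchain documentation for the exact API.
import torch
import torch.nn.functional as F
from horizon_nn.torch import export_onnx  # provided by the D-Robotics toolchain

class WarpModel(torch.nn.Module):
    def forward(self, feature, grid):
        # mode/padding_mode chosen from the supported values in the table above
        return F.grid_sample(feature, grid, mode="bilinear",
                             padding_mode="zeros", align_corners=False)

feature = torch.randn(1, 16, 64, 64)      # N, C, H, W within the listed ranges
grid = torch.rand(1, 64, 64, 2) * 2 - 1   # sampling grid in [-1, 1]

export_onnx(WarpModel(), (feature, grid), "grid_sample_model.onnx")
```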