
Vision Language Models

Feature Overview

This section shows how to run on-device Vision Language Models (VLMs) on the RDK platform. Building on the work of InternVL and SmolVLM, we have quantized these models and deployed them on RDK hardware. The example combines the KV cache management of llama.cpp with the computational throughput of the RDK platform's BPU module to run VLM inference entirely on the device.

Code repository: https://github.com/D-Robotics/hobot_llamacpp.git

Supported Platforms

| Platform | Runtime Environment | Example Feature |
| --- | --- | --- |
| RDK X5, RDK X5 Module | Ubuntu 22.04 (Humble) | On-device Vision Language Model demo |
| RDK S100, RDK S100P | Ubuntu 22.04 (Humble) | On-device Vision Language Model demo |

Supported Models

| Model Type | Parameters | Platform | Image Encoder Model | Text Decoder Model |
| --- | --- | --- | --- | --- |
| InternVL2_5 | 1B | X5 | vit_model_int16_v2.bin | Qwen2.5-0.5B-Instruct-Q4_0.gguf |
| InternVL2_5 | 1B | S100 | vit_model_int16.hbm | Qwen2.5-0.5B-Instruct-Q4_0.gguf |
| InternVL3 | 1B | X5 | vit_model_int16_VL3_1B_Instruct_X5.bin | qwen2_5_q8_0_InternVL3_1B_Instruct.gguf |
| InternVL3 | 1B | S100 | vit_model_int16_VL3_1B_Instruct.hbm | qwen2_5_q8_0_InternVL3_1B_Instruct.gguf |
| InternVL3 | 2B | X5 | vit_model_int16_VL3_2B_Instruct.bin | qwen2_5_1.5b_q8_0_InternVL3_2B_Instruct.gguf |
| InternVL3 | 2B | S100 | vit_model_int16_VL3_2B_Instruct.hbm | qwen2_5_1.5b_q8_0_InternVL3_2B_Instruct.gguf |
| SmolVLM2 | 256M | X5 | SigLip_int16_SmolVLM2_256M_Instruct_MLP_C1_UP_X5.bin | SmolVLM2-256M-Video-Instruct-Q8_0.gguf |
| SmolVLM2 | 256M | S100 | SigLip_int16_SmolVLM2_256M_Instruct_S100.hbm | SmolVLM2-256M-Video-Instruct-Q8_0.gguf |
| SmolVLM2 | 500M | X5 | SigLip_int16_SmolVLM2_500M_Instruct_MLP_C1_UP_X5.bin | SmolVLM2-500M-Video-Instruct-Q8_0.gguf |
| SmolVLM2 | 500M | S100 | SigLip_int16_SmolVLM2_500M_Instruct_S100.hbm | SmolVLM2-500M-Video-Instruct-Q8_0.gguf |

Algorithm Performance Metrics

| Model | Parameters | Quantization | Platform | Input Size | Image Encoder Time (ms) | Prefill Eval Time (ms/token) | Eval Time (ms/token) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2_5 | 0.5B | Q4_0 | X5 | 1x3x448x448 | 2456.00 | 7.7 | 51.6 |
| InternVL3 | 0.5B | Q8_0 | S100 | 1x3x448x448 | 100.00 | 9.19 | 41.65 |
| SmolVLM2 | 256M | Q8_0 | X5 | 1x3x512x512 | 1053 | 9.3 | 27.8 |
| SmolVLM2 | 500M | Q8_0 | X5 | 1x3x512x512 | 1053 | 27.3 | 65.7 |

Prerequisites

RDK Platform

  1. The RDK has been flashed with an Ubuntu 22.04 system image.
  2. TogetheROS.Bot has been successfully installed on the RDK.
  3. Install the required package:
sudo apt update
sudo apt install tros-humble-hobot-llamacpp
Note

If the sudo apt update command fails or returns errors, refer to the FAQ section Common Issues, specifically question Q10: How to resolve apt update command failures or errors? for solutions.

  4. System Configuration

Use the srpi-config tool to set the ION memory size to 1.6 GB and to configure the CPU to run at its maximum frequency after reboot.
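The CPU frequency setting can be verified after reboot through the standard Linux cpufreq sysfs interface. A minimal sketch (generic sysfs paths, not RDK-specific documentation):

```shell
# Read the current CPU frequency governor, if the cpufreq sysfs is exposed.
gov_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$gov_file" ]; then
  governor=$(cat "$gov_file")   # "performance" means the CPU runs at max frequency
else
  governor="unavailable"        # cpufreq sysfs not present (e.g. inside a container)
fi
echo "current CPU governor: $governor"
```

With the CPU pinned to its maximum frequency, the governor typically reads `performance`.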

Usage Instructions

Two interaction modes are currently provided: one accepts images and text typed directly into the terminal; the other subscribes to image and text messages and publishes the results as text messages.

InternVL

Before running the program, download the model files to your working directory using the following commands:

wget https://hf-mirror.com/D-Robotics/InternVL2_5-1B-GGUF-BPU/resolve/main/Qwen2.5-0.5B-Instruct-Q4_0.gguf
wget https://hf-mirror.com/D-Robotics/InternVL2_5-1B-GGUF-BPU/resolve/main/rdkx5/vit_model_int16_v2.bin
source /opt/tros/humble/setup.bash
cp -r /opt/tros/${TROS_DISTRO}/lib/hobot_llamacpp/config/ .
ros2 run hobot_llamacpp hobot_llamacpp --ros-args -p feed_type:=0 -p image:=config/image2.jpg -p image_type:=0 -p user_prompt:="Describe this image." -p model_file_name:=vit_model_int16_v2.bin -p llm_model_name:=Qwen2.5-0.5B-Instruct-Q4_0.gguf

After launching the program, you can use local images and custom prompts to generate outputs.
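If you re-run the demo, the downloads above can be skipped when the files are already present. A minimal sketch in plain POSIX shell, using the same URLs as above (the `fetch` helper and `DRY_RUN` switch are our own, not part of hobot_llamacpp; `wget -nc` achieves much the same):

```shell
# Download a model file only if it is not already in the working directory.
# DRY_RUN defaults to "echo", which prints the wget command instead of
# running it; set DRY_RUN= (empty) to actually download.
DRY_RUN=${DRY_RUN-echo}
fetch() {
  f=$(basename "$1")
  if [ -f "$f" ]; then
    echo "already have $f"
  else
    $DRY_RUN wget "$1"
  fi
}
fetch https://hf-mirror.com/D-Robotics/InternVL2_5-1B-GGUF-BPU/resolve/main/Qwen2.5-0.5B-Instruct-Q4_0.gguf
fetch https://hf-mirror.com/D-Robotics/InternVL2_5-1B-GGUF-BPU/resolve/main/rdkx5/vit_model_int16_v2.bin
```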

[Figure: InternVL inference result]

SmolVLM

Before running the program, download the model files to your working directory using the following commands:

wget https://hf-mirror.com/D-Robotics/SmolVLM2-256M-Video-Instruct-GGUF-BPU/resolve/main/rdkx5/SigLip_int16_SmolVLM2_256M_Instruct_MLP_C1_UP_X5.bin
wget https://hf-mirror.com/D-Robotics/SmolVLM2-256M-Video-Instruct-GGUF-BPU/resolve/main/SmolVLM2-256M-Video-Instruct-Q8_0.gguf
source /opt/tros/humble/setup.bash
cp -r /opt/tros/${TROS_DISTRO}/lib/hobot_llamacpp/config/ .
ros2 run hobot_llamacpp hobot_llamacpp --ros-args -p feed_type:=0 -p model_type:=1 -p image:=config/image2.jpg -p image_type:=0 -p user_prompt:="Describe the image." -p model_file_name:=SigLip_int16_SmolVLM2_256M_Instruct_MLP_C1_UP_X5.bin -p llm_model_name:=SmolVLM2-256M-Video-Instruct-Q8_0.gguf

After launching the program, you can use local images and custom prompts to generate outputs.
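To try several images without retyping the full command, the invocation can be wrapped in a loop. A sketch (the `run` dry-run switch and the image list are ours; the ros2 parameters are the ones used above):

```shell
# run=echo prints each command (dry run); set run= on the board to execute.
run=echo
for img in config/image2.jpg; do   # add more image paths here
  $run ros2 run hobot_llamacpp hobot_llamacpp --ros-args \
    -p feed_type:=0 -p model_type:=1 -p image:="$img" -p image_type:=0 \
    -p user_prompt:="Describe the image." \
    -p model_file_name:=SigLip_int16_SmolVLM2_256M_Instruct_MLP_C1_UP_X5.bin \
    -p llm_model_name:=SmolVLM2-256M-Video-Instruct-Q8_0.gguf
done
```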

[Figure: SmolVLM inference result]

Notes

On the X5 platform, set the ION memory size to 1.6 GB. On the S100 platform, set it to more than 1.6 GB; otherwise the models will fail to load.