5.2.10 Vision Language Model

Introduction

This section describes how to run an on-device Vision Language Model (VLM) on the RDK platform. Building on the strong results of InternVL and SmolVLM, we have quantized and deployed these models on the RDK platform. The demo combines the KV cache management of llama.cpp with the computational advantages of the RDK platform's BPU module to run VLM inference entirely on the device.

Code repository: https://github.com/D-Robotics/hobot_llamacpp.git
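
If you want to browse the source or example configs directly, the repository can be cloned as follows (optional; the steps below use the prebuilt TogetheROS.Bot package):

# Optional: fetch the hobot_llamacpp source for reference
git clone https://github.com/D-Robotics/hobot_llamacpp.git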

Supported Platforms

| Platform | OS / Method | Demo Functionality |
| --- | --- | --- |
| RDK X5 | Ubuntu 22.04 (Humble) | On-device Vision Language Model |

Supported Models

| Model Name | Parameters | Platform | Image Encoder | Language Encoder and Decoder |
| --- | --- | --- | --- | --- |
| InternVL2_5 | 1B | X5 | vit_model_int16_v2.bin | Qwen2.5-0.5B-Instruct-Q4_0.gguf |
| InternVL2_5 | 1B | S100 | vit_model_int16.hbm | Qwen2.5-0.5B-Instruct-Q4_0.gguf |
| InternVL3 | 1B | X5 | vit_model_int16_VL3_1B_Instruct_X5.bin | qwen2_5_q8_0_InternVL3_1B_Instruct.gguf |
| InternVL3 | 1B | S100 | vit_model_int16_VL3_1B_Instruct.hbm | qwen2_5_q8_0_InternVL3_1B_Instruct.gguf |
| InternVL3 | 2B | X5 | vit_model_int16_VL3_2B_Instruct.bin | qwen2_5_1.5b_q8_0_InternVL3_2B_Instruct.gguf |
| InternVL3 | 2B | S100 | vit_model_int16_VL3_2B_Instruct.hbm | qwen2_5_1.5b_q8_0_InternVL3_2B_Instruct.gguf |
| SmolVLM2 | 256M | X5 | SigLip_int16_SmolVLM2_256M_Instruct_MLP_C1_UP_X5.bin | SmolVLM2-256M-Video-Instruct-Q8_0.gguf |
| SmolVLM2 | 500M | X5 | SigLip_int16_SmolVLM2_500M_Instruct_MLP_C1_UP_X5.bin | SmolVLM2-500M-Video-Instruct-Q8_0.gguf |

Preparation

RDK Platform

  1. RDK should be flashed with the Ubuntu 22.04 system image.
  2. TogetheROS.Bot must be successfully installed on the RDK.
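
Both prerequisites can be checked quickly from a terminal; the TogetheROS.Bot path below is the default install location used later in this section:

# Should report 22.04
grep VERSION_ID /etc/os-release
# The setup script should exist if TogetheROS.Bot is installed
ls /opt/tros/humble/setup.bash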

Usage

RDK Platform

Use the srpi-config command to set the ION memory size to 1.6GB. For details, refer to the Performance Options section in the RDK User Manual.

After rebooting, set the CPU maximum frequency to 1.5GHz and the CPU frequency governor to performance with the following commands:

sudo bash -c 'echo 1 > /sys/devices/system/cpu/cpufreq/boost'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu5/cpufreq/scaling_governor'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu6/cpufreq/scaling_governor'
sudo bash -c 'echo performance >/sys/devices/system/cpu/cpu7/cpufreq/scaling_governor'
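
To confirm the settings took effect, read back the standard Linux cpufreq sysfs entries:

# Prints 1 when boost is enabled
cat /sys/devices/system/cpu/cpufreq/boost
# Every CPU should report "performance"
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Current frequency of CPU0 in kHz
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq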

Currently, two demo modes are provided: feeding a local image and a text prompt directly from the terminal, or subscribing to image and text messages and publishing the inference result as text.
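
The subsections below use the first mode (local image fed from the terminal). For the second mode, the command below is only a sketch: it assumes feed_type:=1 selects the subscription mode and that other nodes publish the image and text topics; check the repository README for the exact parameter values and topic names.

# Sketch only: feed_type:=1 and the topic wiring are assumptions, not verified here
ros2 run hobot_llamacpp hobot_llamacpp --ros-args -p feed_type:=1 -p model_file_name:=vit_model_int16_v2.bin -p llm_model_name:=Qwen2.5-0.5B-Instruct-Q4_0.gguf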

InternVL

Before running the program, download the model files to the working directory with the following commands:

# Download model files
wget https://huggingface.co/D-Robotics/InternVL2_5-1B-GGUF-BPU/resolve/main/Qwen2.5-0.5B-Instruct-Q4_0.gguf
wget https://huggingface.co/D-Robotics/InternVL2_5-1B-GGUF-BPU/resolve/main/rdkx5/vit_model_int16_v2.bin
source /opt/tros/humble/setup.bash
cp -r /opt/tros/${TROS_DISTRO}/lib/hobot_llamacpp/config/ .

ros2 run hobot_llamacpp hobot_llamacpp --ros-args -p feed_type:=0 -p image:=config/image2.jpg -p image_type:=0 -p user_prompt:="Describe the image." -p model_file_name:=vit_model_int16_v2.bin -p llm_model_name:=Qwen2.5-0.5B-Instruct-Q4_0.gguf

After starting the program, you can use a local image and a custom prompt to generate output.

[Figure: vlm_result]
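
To ask a different question about your own image, change the image and user_prompt parameters; the path and prompt below are placeholders:

ros2 run hobot_llamacpp hobot_llamacpp --ros-args -p feed_type:=0 -p image:=/path/to/your_image.jpg -p image_type:=0 -p user_prompt:="What objects are in the image?" -p model_file_name:=vit_model_int16_v2.bin -p llm_model_name:=Qwen2.5-0.5B-Instruct-Q4_0.gguf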

SmolVLM

Before running the program, download the model files to the working directory with the following commands:

# Download model files
wget https://huggingface.co/D-Robotics/SmolVLM2-256M-Video-Instruct-GGUF-BPU/resolve/main/rdkx5/SigLip_int16_SmolVLM2_256M_Instruct_MLP_C1_UP_X5.bin
wget https://huggingface.co/D-Robotics/SmolVLM2-256M-Video-Instruct-GGUF-BPU/resolve/main/SmolVLM2-256M-Video-Instruct-Q8_0.gguf
source /opt/tros/humble/setup.bash
cp -r /opt/tros/${TROS_DISTRO}/lib/hobot_llamacpp/config/ .
ros2 run hobot_llamacpp hobot_llamacpp --ros-args -p feed_type:=0 -p model_type:=1 -p image:=config/image2.jpg -p image_type:=0 -p user_prompt:="Describe the image." -p model_file_name:=SigLip_int16_SmolVLM2_256M_Instruct_MLP_C1_UP_X5.bin -p llm_model_name:=SmolVLM2-256M-Video-Instruct-Q8_0.gguf

After starting the program, you can use a local image and a custom prompt to generate output.

[Figure: smolvlm_result]
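
The SmolVLM2 500M variant from the model table can be launched the same way once its two model files are in the working directory; the download location for the 500M files is not shown here, only the file names from the table are used:

ros2 run hobot_llamacpp hobot_llamacpp --ros-args -p feed_type:=0 -p model_type:=1 -p image:=config/image2.jpg -p image_type:=0 -p user_prompt:="Describe the image." -p model_file_name:=SigLip_int16_SmolVLM2_500M_Instruct_MLP_C1_UP_X5.bin -p llm_model_name:=SmolVLM2-500M-Video-Instruct-Q8_0.gguf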

Notes

Ensure the development board has more than 1.6GB of ION memory; otherwise, the model may fail to load.
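
There is no dedicated command in this demo to query the ION carveout; as an indirect check (an assumption about how the carveout is reserved, not an official method), the total memory reported by the kernel should shrink after the ION size is increased with srpi-config:

# Total memory visible to Linux decreases as the ION carveout grows
free -h
grep MemTotal /proc/meminfo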