
Image-to-Text Generation

In this tutorial, we demonstrate deploying image-to-text models with LLMBoost.

Step 0: Before you start

Enter the following to set up the environment variables and start an LLMBoost container on the host node where you want to run the image-to-text inference service.

export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>

💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.

  • Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
  • Set LICENSE_FILE to the absolute path of your license file. Please contact us at contact@mangoboost.io if you don't have an LLMBoost license.
  • Set HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
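For example, the environment might be set up as follows (the paths and token below are hypothetical placeholders; substitute your own values):

export MODEL_PATH=/data/llmboost/models
export LICENSE_FILE=/data/llmboost/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
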
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash

Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.

An LLMBoost container should start in an interactive command shell session. The commands in the rest of this tutorial should be entered into the command shell prompt within the container.

Step 1: Starting an image-to-text inference service

You can start an inference service with any supported image-to-text model, following the same steps as here. All of the models are launched from the same entry point:

llmboost serve --model_name MODEL_NAME --query_type image --max_model_len MAX_MODEL_LEN

Where MODEL_NAME and MAX_MODEL_LEN are taken as a pair (MODEL_NAME:MAX_MODEL_LEN) from this list of supported image-to-text models:
llava-hf/llava-1.5-7b-hf:1024
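
For example, to launch the supported LLaVA model listed above with its maximum model length:

llmboost serve --model_name llava-hf/llava-1.5-7b-hf --query_type image --max_model_len 1024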

After starting, the service will, by default, wait for inference requests on port 8011.

It will take a few minutes for the service to be ready. Wait until the service status message reports that it is ready before proceeding with the rest of the steps.
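
If you prefer to script the wait rather than watch the logs, the sketch below polls the default port 8011 until it accepts TCP connections. This only confirms that the port is open; still wait for the ready status message before sending requests.

# Poll until something is listening on the default port 8011.
until (exec 3<>/dev/tcp/127.0.0.1/8011) 2>/dev/null; do
  echo "waiting for the LLMBoost service on port 8011..."
  sleep 10
done
echo "port 8011 is accepting connections"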

Step 2: Test the server instance

You can submit queries with any of your preferred methods from here. In this tutorial we demonstrate using curl.

When the inference service is ready, you can connect to it on port 8011 of the host node. The service can be accessed from any node that can reach the host node over the network.

Compared to a text-to-text inference query, there are some extra fields you need to provide for the image input. Replace MODEL_NAME below with the model name you launched in Step 1 (e.g., llava-hf/llava-1.5-7b-hf).

curl -X POST http://127.0.0.1:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MODEL_NAME",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "what is in this image?"},
{
"type": "image_url",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
]
}
],
"max_tokens": 1024
}'

Here is a snippet of the expected result:

{"id":"7SEMNwXk7tZdEJH7","choices":[{"index":0,"message":{"role":"assistant","content":" This image features a private road in the middle of a grass-covered field. The path, made of wood and dirt, has a grassy field on one side, and a lush green forest with trees surrounding the other side. The trail leads into the middle of the field, eventually connecting it to the forest, creating a scenic and peaceful environment for the viewer.\n"}}],"created":1750803089,"model":"llava-hf/llava-1.5-7b-hf","service_tier":null,"system_fingerprint":"CT0UTe5mnHXcuTn1","object":"chat.completion","usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}