Text-to-Text Generation

In this tutorial, we demonstrate deploying text-to-text models with LLMBoost.

Step 0: Before you start

Enter the following to set up the environment variables and start an LLMBoost container on the host node where you want to run the text-to-text inference service.

export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>

💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.

  • Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
  • Set LICENSE_FILE to the absolute path of your license file. Please contact us at contact@mangoboost.io if you don't have an LLMBoost license.
  • Set HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
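
For example, if your models live under /data/models (the values below are placeholders; substitute your own paths and token):

export MODEL_PATH=/data/models
export LICENSE_FILE=/data/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx

Once these are set, launch the LLMBoost container: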
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash

Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.

An LLMBoost container should start in an interactive command shell session. The commands in the rest of this tutorial should be entered into the command shell prompt within the container.

Step 1: Starting a text-to-text inference service

Starting an inference service with any supported text-to-text model is easy. All models are launched from the same entry point:

llmboost serve --model_name MODEL_NAME

Here MODEL_NAME is one of the following supported models (a concrete example follows the list):
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
google/gemma-1.1-7b-it
google/gemma-3-27b-it
GoToCompany/Llama-Sahabat-AI-v2-70B-IT
meta-llama/Llama-2-70b-chat-hf
meta-llama/Llama-2-7b
meta-llama/Llama-3.1-405B-Instruct
meta-llama/Llama-3.1-70B-Instruct
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.2-1B-Instruct
meta-llama/Llama-3.2-3B-Instruct
meta-llama/Llama-3.3-70B-Instruct
meta-llama/LlamaGuard-7b
microsoft/Phi-3.5-mini-instruct
Qwen/Qwen2.5-32B-Instruct
Qwen/Qwen2.5-72B-Instruct
Qwen/Qwen2.5-7B-Instruct
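
For example, to serve the Llama 3.1 8B Instruct model used in the sample output below:

llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct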

After starting, the service will, by default, wait for inference requests on port 8011.

It will take a few minutes for the service to be ready. Wait until the service status message reports that it is ready before proceeding with the rest of the steps.

Step 2: Test the server instance

You can submit queries using any method you prefer. In this tutorial we demonstrate using curl.

When the inference service is ready, you can reach it on port 8011 of the host node. The service is accessible from any node that can reach the host node over the network.

curl localhost:8011/v1/chat/completions \
-X POST \
-d '{
"model": MODEL_NAME,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is deep learning?"
}
],
"stream": true,
"max_tokens": 1000
}' \
-H 'Content-Type: application/json'

Note: Replace MODEL_NAME with the model you launched in Step 1 (for example, "meta-llama/Llama-3.1-8B-Instruct").

Here is a snippet of the expected result:

data: {"id":"npx904jEue72bjw3","object":"chat.completion.chunk","created":1750455264,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"delta":{"content":"<|start_header_id|>assistant<|end_header_id|>\n\nDeep learning is a subfield of machine learning that is a subset of artificial intelligence (AI). It is a type of machine learning that involves the use of neural networks with multiple layers to analyze and learn complex patterns in data. Deep learning models are particularly effective at learning and representing data with complex structures, such as images, speech, and natural language.
.....
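
If you prefer a scripted client, the endpoint follows the OpenAI-style chat-completions format, so the same streaming request can be issued from Python. This is a minimal sketch, assuming the service is reachable on localhost:8011, the requests package is installed, and the stream uses the conventional "data: ..." server-sent-events framing shown in the output above; adjust the model name to whichever model you launched in Step 1.

import json
import requests

# Minimal streaming client for the chat-completions endpoint shown above.
# Assumes the LLMBoost service is running locally on its default port 8011.
url = "http://localhost:8011/v1/chat/completions"
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    "stream": True,
    "max_tokens": 1000,
}

with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # Each server-sent-events payload line starts with "data: ".
        if not line or not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        # Stop on the conventional terminal sentinel, if the server sends one
        # (assumption; otherwise the loop simply ends when the stream closes).
        if body.strip() == "[DONE]":
            break
        chunk = json.loads(body)
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)
print()

The script prints each streamed content delta as it arrives, matching the chunk structure ("choices"[0]."delta"."content") in the sample output above.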