Text-to-Text Generation
In this tutorial, we demonstrate deploying text-to-text models with LLMBoost.
Step 0: Before you start
Enter the following to set up the environment variables and start an LLMBoost container on the host node where you want to run the text-to-text inference service.
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.
- Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
- Set LICENSE_FILE to your license file location. Please contact us through contact@mangoboost.io if you don't have an LLMBoost license.
- Set HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
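Before launching the container, you can optionally confirm that the paths you set actually exist. The snippet below is a small local sketch; path_ok is our own helper name, not an LLMBoost tool.

```shell
# path_ok is a local helper (not part of LLMBoost): report whether a path exists.
path_ok() {
  if [ -e "$1" ]; then echo "ok: $1"; else echo "missing: $1"; fi
}

path_ok "$MODEL_PATH"
path_ok "$LICENSE_FILE"
```

If either line prints "missing", fix the corresponding variable before starting the container.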
On NVIDIA:
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash
On AMD:
docker run -it --rm \
--network host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/dri:/dev/dri \
--device=/dev/kfd:/dev/kfd \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-rocm \
bash
Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
An LLMBoost container should start in an interactive command shell session. The commands in the rest of this tutorial should be entered into the command shell prompt within the container.
Step 1: Starting a text-to-text inference service
Starting an inference service with any supported text-to-text model is easy. All of the models are launched from the same entry point:
llmboost serve --model_name MODEL_NAME
where MODEL_NAME is one of the following supported models:
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
google/gemma-1.1-7b-it
google/gemma-3-27b-it
GoToCompany/Llama-Sahabat-AI-v2-70B-IT
meta-llama/Llama-2-70b-chat-hf
meta-llama/Llama-2-7b
meta-llama/Llama-3.1-405B-Instruct
meta-llama/Llama-3.1-70B-Instruct
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.2-1B-Instruct
meta-llama/Llama-3.2-3B-Instruct
meta-llama/Llama-3.3-70B-Instruct
meta-llama/LlamaGuard-7b
microsoft/Phi-3.5-mini-instruct
Qwen/Qwen2.5-32B-Instruct
Qwen/Qwen2.5-72B-Instruct
Qwen/Qwen2.5-7B-Instruct
After starting, the service will, by default, wait for inference requests on port 8011.
It will take a few minutes for the service to be ready. Wait until the service status message reports that it is ready before proceeding with the rest of the steps.
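If you want to script the wait rather than watch the logs, a retry loop like the one below can help. This is only a sketch: wait_ready is our own helper name, and the /v1/models endpoint is an assumption based on typical OpenAI-compatible servers, not something the LLMBoost documentation above specifies; substitute whichever endpoint your build exposes.

```shell
# wait_ready is a local helper (not an LLMBoost command): retry a command
# up to N times, sleeping 2s between failed attempts.
wait_ready() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" > /dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "timed out"
  return 1
}

# Example (run on the host): wait up to ~5 minutes for the assumed
# /v1/models endpoint to answer.
# wait_ready 150 curl -sf localhost:8011/v1/models
```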
Step 2: Test the server instance
You can submit queries using any method you prefer. In this tutorial we will demonstrate using curl.
When the inference service is ready, you can reach it on port 8011 of the host node. The service can be accessed from any node that can reach the host node over the network.
curl localhost:8011/v1/chat/completions \
-X POST \
-d '{
"model": "MODEL_NAME",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is deep learning?"
}
],
"stream": true,
"max_tokens": 1000
}' \
-H 'Content-Type: application/json'
Here is a snippet of the expected result:
data: {"id":"npx904jEue72bjw3","object":"chat.completion.chunk","created":1750455264,"model":"meta-llama/Llama-3.1-8B-Instruct","choices":[{"delta":{"content":"<|start_header_id|>assistant<|end_header_id|>\n\nDeep learning is a subfield of machine learning that is a subset of artificial intelligence (AI). It is a type of machine learning that involves the use of neural networks with multiple layers to analyze and learn complex patterns in data. Deep learning models are particularly effective at learning and representing data with complex structures, such as images, speech, and natural language.
.....
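Because the streamed response arrives as server-sent "data:" chunks like the one above, you may want to stitch the generated text back together. The helper below is a sketch (extract_text is our own name, not an LLMBoost utility): it strips the "data: " prefix, skips the [DONE] sentinel, and concatenates each chunk's delta content, using python3 for the JSON parsing.

```shell
# extract_text is a local helper (not part of LLMBoost): reassemble the
# generated text from a streamed chat-completions response.
extract_text() {
  sed -n 's/^data: //p' | python3 -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line or line == "[DONE]":
        continue  # skip blanks and the end-of-stream sentinel
    chunk = json.loads(line)
    delta = chunk["choices"][0].get("delta", {})
    sys.stdout.write(delta.get("content", ""))
'
}

# Usage: pipe the streaming curl command from above into it, e.g.
# curl localhost:8011/v1/chat/completions ... | extract_text
```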