Using LLMBoost Interactively
In this introductory tutorial, we will launch the LLMBoost container into an interactive command-line session and use a large language model for chat. We will demonstrate the flexibility of the llmboost command in shielding the user from the details of running different models and inference engines on different hardware platforms. Later tutorials will showcase more powerful and flexible ways to use LLMBoost through its Python programming API and its server deployment options.
Step 0: Before you start
Enter the following to set up the environment variables and start the LLMBoost container:
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.
- Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
- Set LICENSE_FILE to the absolute path of your license file. Please contact us through contact@mangoboost.io if you don't have an LLMBoost license.
- Set HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
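For example, the three variables might be filled in as follows (the paths and token shown here are placeholders, not real values; substitute your own):
export MODEL_PATH=/data/models
export LICENSE_FILE=/data/licenses/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx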
On NVIDIA:
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash
On AMD:
docker run -it --rm \
--network host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/dri:/dev/dri \
--device=/dev/kfd:/dev/kfd \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-rocm \
bash
Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
An LLMBoost container should start in an interactive command shell session. The commands in the rest of this tutorial should be entered into the command shell prompt within the container.
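Before loading a model, you can optionally sanity-check the environment from inside the container. The commands below are only a suggested check: ls confirms the mounts from the docker run commands above, while nvidia-smi (NVIDIA) or rocm-smi (AMD) confirms that the GPUs are visible.
# Confirm the license file and model directory were mounted as expected
ls -l /workspace/llmboost_license.skm /workspace/models
# Confirm the GPUs are visible inside the container
nvidia-smi   # on NVIDIA systems
rocm-smi     # on AMD systems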
Step 1. Running a simple chatbot
From the LLMBoost container's command shell prompt, you can invoke the llmboost command to perform a number of inference tasks. For example, type the following command to start an interactive chat session with the specified large language model.
llmboost chat --model_name meta-llama/Llama-3.2-1B-Instruct
It can take a few minutes to load a large language model. When presented with a chat prompt, type a few questions to interact with the language model. For example, try asking:
- What is an LLM?
- Summarize the previous answer in 1 sentence.
- Translate the summary into French.
Initializing LLMBoost...
Preparing model with 8192 context length...
INFO 05-30 23:41:19 [__init__.py:239] Automatically detected platform rocm.
Deploying LLMBoost (this may take a few minutes) .......................................\
[INFO] Model: meta-llama/Llama-3.2-1B-Instruct
[INFO] You can set the system message by typing 'system: <message>'
[INFO] Type 'exit' to quit
[INFO] Welcome to Mango LLMBoost chatbot!
[INFO] What do you want to discuss?
>>> What is an LLM?
LLM stands for Large Language Model. It's a type of artificial intelligence (AI) model
...
When finished, type exit or press Ctrl+D to end the chat session.
Step 2. Try out different language models
You can relaunch the chat session with a different model by changing the --model_name argument.
LLMBoost hides the details of running different models on your hardware setup.
For example, try this smaller model:
llmboost chat --model_name Qwen/Qwen2.5-0.5B-Instruct
And try this larger model:
llmboost chat --model_name meta-llama/Llama-3.1-8B-Instruct
💡 You can find other models on Hugging Face to try. Your GPU needs enough memory to hold the model you choose.
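If you are unsure whether a model will fit, a rough check of free GPU memory can help before launching a chat session. The queries below are a sketch; as a rule of thumb, a model in 16-bit precision needs roughly two bytes of GPU memory per parameter, plus headroom for the KV cache and runtime overhead.
# Report GPU memory usage on NVIDIA systems
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
# Report VRAM usage on AMD systems
rocm-smi --showmeminfo vram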
Can you see a difference in the models' answer quality and responsiveness? Larger models are more expensive to run, but LLMBoost automatically configures the software and hardware stack for optimal execution of your chosen model on your chosen GPUs. When you first begin, you can omit most optional arguments and simply let LLMBoost manage the execution configuration.
In the next tutorial, we will demonstrate using LLMBoost through its Python programming API.