Using LLMBoost Interactively
In this introductory tutorial, we will launch the LLMBoost container into an interactive command-line session and use a large language model for chat. We will demonstrate the flexibility of the llmboost command in shielding the user from the details of running different models and inference engines on different hardware platforms. Later tutorials will showcase more powerful and flexible ways to use LLMBoost through its Python programming API and its server deployment options.
Step 0: Before you start
Enter the following to set up the environment variables and start the LLMBoost container:
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.
- Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
- Set LICENSE_FILE to the absolute path of your license file. Please contact us through contact@mangoboost.io if you don't have an LLMBoost license.
- Set HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
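For example, the three variables might be filled in as follows (the paths and token shown here are placeholders, not real values; substitute your own):
export MODEL_PATH=/data/models
export LICENSE_FILE=/data/licenses/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx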
On NVIDIA:
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash
On AMD:
docker run -it --rm \
--network host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/dri:/dev/dri \
--device=/dev/kfd:/dev/kfd \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-rocm \
bash
Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
An LLMBoost container should start in an interactive command shell session. The commands in the rest of this tutorial should be entered into the command shell prompt within the container.
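Before loading a model, you can optionally sanity-check the environment from inside the container. The commands below are only a suggested check: ls confirms the mounts from the docker run commands above, while nvidia-smi (NVIDIA) or rocm-smi (AMD) confirms that the GPUs are visible.
# Confirm the license file and model directory were mounted as expected
ls -l /workspace/llmboost_license.skm /workspace/models
# Confirm the GPUs are visible inside the container
nvidia-smi   # on NVIDIA systems
rocm-smi     # on AMD systems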
Step 1. Running a simple chatbot
From the LLMBoost container's command shell prompt, you can invoke the llmboost command to perform a number of inference tasks. For example, type the following command to start an interactive chat session with the specified large language model.
llmboost chat --model_name meta-llama/Llama-3.2-1B-Instruct
It can take a few minutes to load a large language model. When presented with a chat prompt, type a few questions to interact with the language model. For example, try asking:
- What is an LLM?
- Summarize the previous answer in 1 sentence.
- Translate the summary into French.
Initializing LLMBoost...
Preparing model with 8192 context length...
INFO 05-30 23:41:19 [__init__.py:239] Automatically detected platform rocm.
Deploying LLMBoost (this may take a few minutes) .......................................\
[INFO] Model: meta-llama/Llama-3.2-1B-Instruct
[INFO] You can set the system message by typing 'system: <message>'
[INFO] Type 'exit' to quit
[INFO] Welcome to Mango LLMBoost chatbot!
[INFO] What do you want to discuss?
>>> What is an LLM?
LLM stands for Large Language Model. It's a type of artificial intelligence (AI) model
...
When finished, type exit or press Ctrl+D to end the chat session.
Step 2. Try out different language models
You can relaunch the chat session with a different model by changing the --model_name argument.
LLMBoost hides the details of running different models on your hardware setup.
For example, try this smaller model:
llmboost chat --model_name Qwen/Qwen2.5-0.5B-Instruct
And try this larger model:
llmboost chat --model_name meta-llama/Llama-3.1-8B-Instruct
💡 You can find other models on Hugging Face to try. Your GPU needs enough memory to hold the model you choose.
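If you are unsure whether a model will fit, a rough check of free GPU memory can help before launching a chat session. The queries below are a sketch; as a rule of thumb, a model in 16-bit precision needs roughly two bytes of GPU memory per parameter, plus headroom for the KV cache and runtime overhead.
# Report GPU memory usage on NVIDIA systems
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
# Report VRAM usage on AMD systems
rocm-smi --showmeminfo vram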
Can you see a difference in the models' answer quality and responsiveness? Larger models are more expensive to run, but LLMBoost automatically configures the software and hardware stack for optimal execution of your chosen model on your chosen GPUs. When you first begin, you can omit most optional arguments and simply let LLMBoost manage the execution configuration.
In the next tutorial, we will demonstrate using LLMBoost through its Python programming API.