Inference Quick Start
This guide walks you through setting up and running a quick LLM inference benchmark using LLMBoost on either AMD or NVIDIA GPUs.
Before running any inference workloads, you'll need to set up the environment where LLMBoost will operate. This involves pulling the correct Docker image for your GPU backend (AMD or NVIDIA) and preparing environment variables such as model paths and access tokens. LLMBoost is packaged in a containerized format to ensure consistency, portability, and optimized performance across different systems.
Step 1: Pull the LLMBoost Docker Image
To get access to the LLMBoost Docker image, please contact the MangoBoost team.
On NVIDIA:

docker pull mangollm/<llmboost-docker-image-name>:prod-cuda

On AMD:

docker pull mangollm/<llmboost-docker-image-name>:prod-rocm
Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
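As a quick optional check, you can confirm the image was downloaded by listing your local Docker images:

docker images | grep mangollm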
Step 2: Set up environment
Set the following environment variables on the host node.
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.
- Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
- Set LICENSE_FILE to the absolute path of your license file. Please contact us through contact@mangoboost.io if you don't have an LLMBoost license.
- Set HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
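For illustration only, a filled-in setup might look like the following; the paths and token below are hypothetical placeholders, so substitute your own values:

export MODEL_PATH=/data/models                          # hypothetical host directory containing your models
export LICENSE_FILE=/opt/llmboost/llmboost_license.skm  # hypothetical path to your LLMBoost license file
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx       # placeholder; use a real token from huggingface.co/settings/tokens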
Step 3: Run a benchmark in the LLMBoost container
Use the following command to launch the LLMBoost container and run an inference benchmark on an NVIDIA or AMD GPU. The benchmark evaluates the throughput and latency of the Meta-Llama/Llama-3.2-1B-Instruct model.
On NVIDIA:
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash -c "python3 apps/benchmark.py \
--model_name Meta-Llama/Llama-3.2-1B-Instruct \
--input_len 128 --output_len 128"
On AMD:

docker run -it --rm \
--network host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/dri:/dev/dri \
--device=/dev/kfd:/dev/kfd \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-rocm \
bash -c "python3 apps/benchmark.py \
--model_name Meta-Llama/Llama-3.2-1B-Instruct \
--input_len 128 --output_len 128"
Expected Output
Once the benchmark completes, you should see performance metrics like this:
Initializing LLMBoost...
Preparing model with 8192 context length...
Applying auto parallelization...
INFO 05-30 22:49:02 [__init__.py:239] Automatically detected platform rocm.
config.json: 100%|█████████████████████████████████████████████| 877/877 [00:00<00:00, 12.0MB/s]
Deploying LLMBoost (this may take a few minutes) .............................................|
I0530 22:51:52.703913 140310129037312 benchmark.py:170] Starting LLMBoost with 1000 inputs
benchmark.py: 100%|████████████████████████████████████████████| 1000/1000 [00:07<00:00, 131.04req/s]
I0530 22:52:00.335985 140310129037312 benchmark.py:256] LLMBoost Finished
I0530 22:52:00.336340 140310129037312 benchmark.py:287] Total time: 1.0858112429850735 seconds
I0530 22:52:00.336368 140310129037312 benchmark.py:288] Throughput: 235768.42 tokens/s
I0530 22:52:00.336387 140310129037312 benchmark.py:289] 920.97 reqs/s
I0530 22:52:00.336404 140310129037312 benchmark.py:290] Prompt : 117884.21 tokens/s
I0530 22:52:00.336420 140310129037312 benchmark.py:291] Generation: 117884.21 tokens/s
I0530 22:52:00.336437 140310129037312 benchmark.py:292] TTFT : 1.0453 seconds
I0530 22:52:00.336452 140310129037312 benchmark.py:293] TBT : 0.0000 seconds
Metrics Explained:
- Throughput: Total number of tokens processed per second
- Requests/sec: Number of prompt requests completed per second
- TTFT (Time-To-First-Token): Latency until the first token is generated
- TBT (Time-Between-Tokens): Latency between token generations
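As a rough sanity check (these commands are not part of LLMBoost; they simply redo the arithmetic), the throughput figures above follow directly from the run parameters: 1000 requests, each with 128 input and 128 output tokens, completed in about 1.086 seconds:

python3 -c "print(1000 * (128 + 128) / 1.0858112429850735)"   # ~235768.42 tokens/s, matching the Throughput line
python3 -c "print(1000 / 1.0858112429850735)"                 # ~920.97 reqs/s, matching the requests-per-second line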
Next Steps
- Compare the performance of different models by modifying the --model_name flag in the docker run command (see the example below).
- Explore the available deployment and integration options in the How-To Guides.