Deploying an Inference Service
One of the most powerful uses of the LLMBoost container is to deploy a containerized inference service compatible with the Kubernetes framework. In this tutorial, we will start inference services manually using the LLMBoost container's command-line interface. In practice, the commands can be packaged into a shell script so that the services start automatically when the container is launched.
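For example, the container launch from Step 0 and the serve command from Step 1 below can be combined into a single startup script. The following is an illustrative sketch only (it uses the NVIDIA image tag, placeholder paths, and assumes the image allows running llmboost directly as the container command); adapt the flags to match the full Step 0 command for your environment:
#!/usr/bin/env bash
# start_llmboost.sh -- illustrative startup script; paths and image name are placeholders.
# Launches the LLMBoost container in the background (-d) and starts the inference
# service directly instead of opening an interactive bash shell.
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
docker run -d --rm \
--network host \
--gpus all \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct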
Step 0: Before you start
Enter the following to set up the environment variables and start the LLMBoost container:
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.
- Set the model directory MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
- Set the license file path LICENSE_FILE to your license file location. Please contact us through contact@mangoboost.io if you don't have an LLMBoost license.
- Set the HuggingFace token HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
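For instance, the variables might look like the following. These values are placeholders only; substitute your own paths and token:
export MODEL_PATH=/data/models
export LICENSE_FILE=/home/user/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx
# Optional sanity check: confirm the paths exist before launching the container.
ls "$MODEL_PATH" "$LICENSE_FILE"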
Next, launch the LLMBoost container with the command that matches your GPU vendor.
On NVIDIA:
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash
On AMD:
docker run -it --rm \
--network host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/dri:/dev/dri \
--device=/dev/kfd:/dev/kfd \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-rocm \
bash
Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
An LLMBoost container should start in an interactive command shell session. The commands in the rest of this tutorial should be entered into the command shell prompt within the container.
Step 1: Starting an inference service
Start an LLMBoost container on the host node where you want to run the inference server. Enter the following command to start a large language model inference service to respond to inference requests over the network.
llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct
After starting, the service will, by default, listen for inference requests on port 8011. You can choose a different port by specifying it as an argument (e.g., --port 8012).
As before, you can specify other large language models to use in the inference service.
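For example, the following illustrative command would serve the Phi-3 model used later in this tutorial and listen on a non-default port:
llmboost serve --model_name microsoft/Phi-3-mini-4k-instruct --port 8012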
It will take a few minutes for the service to be ready. Wait until the service status message reports that it is ready before proceeding with the rest of the steps.
Step 2: Using the inference service from a client
When the inference service is ready, you can reach it by connecting to port 8011 on the host node.
We will later cover how to access LLMBoost inference servers using standard clients and APIs over the network.
To test the service easily in this tutorial, we will use a simple client built into llmboost to connect to the inference service.
Start another LLMBoost container in interactive mode on the same host node following the instructions in Step 0.
💡 Instead of starting a new container, you can use the commands below to attach to the LLMBoost container you already started. Use docker ps to find your container's DOCKER_ID.
docker ps
docker exec -it <DOCKER_ID> bash
From the second LLMBoost container shell prompt, type the following to connect to the service listening on port 8011:
llmboost client --port 8011
You can type questions into the client like before (e.g., "What is an LLM?").
Type exit or ctrl-D to exit the client when done.
Instead of the command-line client, you can also access the inference service using the LLMBoost Python client API. Start a Python interpreter session. Cut and paste the following into the session.
from llmboost.entrypoints.client import send_prompt

# Send a single prompt to the inference service listening on port 8011.
response = send_prompt(
    host="localhost",
    port=8011,
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    role="user",
    user_input="What is the most famous landmark of Seattle?"
)
print(response)
To end this tutorial, type ctrl-C in the first container window to terminate the service. You can also use llmboost shutdown --port XXXX to terminate the service associated with the specified port, or use llmboost shutdown --all to shut down all LLMBoost inference service instances on the host node.
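For example, to stop only the service started in Step 1, which listens on the default port:
llmboost shutdown --port 8011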
Advanced users can visit Using OpenAI API to see how to integrate LLMBoost-powered inference with any OpenAI-compatible client or tool.
Step 3: Multi-model deployment
On a server with multiple GPUs, LLMBoost can concurrently deploy multiple inference services based on different models. To use this feature, you'll need to prepare a configuration file that specifies the deployment details for each model. An example configuration file (in YAML format) is shown below.
common:
  kv_cache_dtype: auto
  host: 127.0.0.1
  tp: 1
models:
  - model_path: meta-llama/Llama-3.1-8B-Instruct
    port: 8011
    dp: 1
  - model_path: microsoft/Phi-3-mini-4k-instruct
    port: 8012
    dp: 1
You can try out the above configuration on a server with at least 2 GPUs. (Please see the Deep Dive page on Using Multiple GPUs Effectively for an explanation of the tp and dp parameters.) Cut and paste the above into a file (name it config.yaml, for example). Initiate a multi-model deployment by typing:
llmboost deploy --config config.yaml
You can check the status of the deployments by running llmboost status. Look for output similar to the following:
+------+------------------------+---------+
| Port | Name | Status |
+------+------------------------+---------+
| 8011 | Llama-3.1-8B-Instruct | running |
| 8012 | Phi-3-mini-4k-instruct | running |
+------+------------------------+---------+
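With both services running, you can, for example, connect the built-in command-line client to the Phi-3 deployment:
llmboost client --port 8012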
You can use the same LLMBoost clients discussed above to connect to port 8011 or 8012, depending on which model you wish to handle your request. Use llmboost shutdown --all to terminate the services when done.
Please continue to the next tutorial to see how to start and manage inference services across a multi-node cluster.