Deploying an Inference Service
One of the most powerful uses of the LLMBoost container is to deploy a containerized inference service compatible with the Kubernetes framework. In this tutorial, we will start inference services manually using the LLMBoost container's command-line interface. In practice, the commands can be packaged into a shell script so that the services start automatically when the container is launched.
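For example, the container launch from Step 0 and the serve command from Step 1 below can be combined into a single startup script. The following is an illustrative sketch only (it uses the NVIDIA image tag, placeholder paths, and assumes the image allows running llmboost directly as the container command); adapt the flags to match the full Step 0 command for your environment:
#!/usr/bin/env bash
# start_llmboost.sh -- illustrative startup script; paths and image name are placeholders.
# Launches the LLMBoost container in the background (-d) and starts the inference
# service directly instead of opening an interactive bash shell.
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
docker run -d --rm \
--network host \
--gpus all \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct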
Step 0: Before you start
Enter the following to set up the environment variables and start the LLMBoost container:
export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.
- Set the model directory MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
- Set the license file path LICENSE_FILE to your license file location. Please contact us through contact@mangoboost.io if you don't have an LLMBoost license.
- Set the HuggingFace token HUGGING_FACE_HUB_TOKEN to a Hugging Face token obtained from huggingface.co/settings/tokens.
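For instance, the variables might look like the following. These values are placeholders only; substitute your own paths and token:
export MODEL_PATH=/data/models
export LICENSE_FILE=/home/user/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx
# Optional sanity check: confirm the paths exist before launching the container.
ls "$MODEL_PATH" "$LICENSE_FILE"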
Next, launch the LLMBoost container with the command that matches your GPU vendor.
On NVIDIA:
docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash
On AMD:
docker run -it --rm \
--network host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/dri:/dev/dri \
--device=/dev/kfd:/dev/kfd \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-rocm \
bash
Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
An LLMBoost container should start in an interactive command shell session. The commands in the rest of this tutorial should be entered into the command shell prompt within the container.
Step 1: Starting an inference service
Start an LLMBoost container on the host node where you want to run the inference server. Enter the following command to start a large language model inference service to respond to inference requests over the network.
llmboost serve --model_name meta-llama/Llama-3.1-8B-Instruct
After starting, the service will, by default, listen for inference requests on port 8011. You can choose a different port by specifying it as an argument (e.g., --port 8012).
As before, you can specify other large language models to use in the inference service.
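For example, the following illustrative command would serve the Phi-3 model used later in this tutorial and listen on a non-default port:
llmboost serve --model_name microsoft/Phi-3-mini-4k-instruct --port 8012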
It will take a few minutes for the service to be ready. Wait until the service status message reports that it is ready before proceeding with the rest of the steps.
Step 2: Using the inference service from a client
When the inference service is ready, you can reach it by connecting to port 8011 on the host node.
We will later cover how to access LLMBoost inference servers using standard clients and APIs over the network.
To test the service easily in this tutorial, we will use a simple client built into llmboost to connect to the inference service.
Start another LLMBoost container in interactive mode on the same host node following the instructions in Step 0.
💡 Instead of starting a new container, you can use the commands below to attach to the LLMBoost container you already started. Use docker ps to find your container's DOCKER_ID.
docker ps
docker exec -it <DOCKER_ID> bash
From the second LLMBoost container shell prompt, type the following to connect to the service listening on port 8011:
llmboost client --port 8011
You can type questions into the client like before (e.g., "What is an LLM?").
Type exit or ctrl-D to exit the client when done.
Instead of the command-line client, you can also access the inference service using the LLMBoost Python client API. Start a Python interpreter session. Cut and paste the following into the session.
from llmboost.entrypoints.client import send_prompt

# Send a single prompt to the inference service listening on port 8011.
response = send_prompt(
    host="localhost",
    port=8011,
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    role="user",
    user_input="What is the most famous landmark of Seattle?"
)
print(response)
To end this tutorial, type ctrl-C in the first container window to terminate the service. You can also use llmboost shutdown --port XXXX to terminate the service associated with the specified port, or use llmboost shutdown --all to shut down all LLMBoost inference service instances on the host node.
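For example, to stop only the service started in Step 1, which listens on the default port:
llmboost shutdown --port 8011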
Advanced users can visit Using OpenAI API to see how to integrate LLMBoost-powered inference with any OpenAI-compatible client or tool.
Step 3: Multi-model deployment
On a server with multiple GPUs, LLMBoost can concurrently deploy multiple inference services based on different models. To use this feature, you'll need to prepare a configuration file that specifies the deployment details for each model. An example configuration file (in YAML format) is shown below.
common:
  kv_cache_dtype: auto
  host: 127.0.0.1
  tp: 1
models:
  - model_path: meta-llama/Llama-3.1-8B-Instruct
    port: 8011
    dp: 1
  - model_path: microsoft/Phi-3-mini-4k-instruct
    port: 8012
    dp: 1
You can try out the above configuration on a server with at least 2 GPUs. (Please see the Deep Dive page on Using Multiple GPUs Effectively for an explanation of the tp and dp parameters.) Cut and paste the above into a file (name it config.yaml, for example). Initiate a multi-model deployment by typing:
llmboost deploy --config config.yaml
You can check the status of the deployments by running llmboost status. Look for output similar to the following:
+------+------------------------+---------+
| Port | Name | Status |
+------+------------------------+---------+
| 8011 | Llama-3.1-8B-Instruct | running |
| 8012 | Phi-3-mini-4k-instruct | running |
+------+------------------------+---------+
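With both services running, you can, for example, connect the built-in command-line client to the Phi-3 deployment:
llmboost client --port 8012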
You can use the same LLMBoost clients discussed above to connect to port 8011 or 8012, depending on which model you wish to handle your request. Use llmboost shutdown --all to terminate the services when done.
Please continue to the next tutorial to see how to start and manage inference services across a multi-node cluster.