Deploying Scalable Inference on a Cluster

In this tutorial, we demonstrate how to use LLMBoost to deploy scalable LLM inference services on a multi-node cluster. The user only needs to provide LLMBoost with a JSON file describing the participating nodes. Under the hood, LLMBoost orchestrates a scalable Kubernetes deployment.

note

Kubernetes will be automatically configured by LLMBoost; the user does not need to install Kubernetes manually.

Step 0: Before you start

0.1: Identify the manager and worker nodes

You need to identify a "manager" node to control the deployment. You also need to identify "worker" nodes that will serve the inference. (For this tutorial, you need at least two worker nodes. You can use a standalone manager node or use one of the worker nodes as the manager. The latter is not recommended for production deployments.)

0.2: Set up SSH key between manager and worker nodes

You need to set up an SSH key so you can log into the worker nodes from the manager node without typing a password. The SSH key itself must also not be password-protected.
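For example, you can generate an unprotected key on the manager node and install it on each worker with standard OpenSSH tools (the usernames and hostnames below are illustrative):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@node-0
ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@node-1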

0.3: Set up manager node environment

On the manager node, enter the following to set up the shell environment:

export MODEL_PATH=<absolute_path_to_model>
export LLMBOOST_IMAGE=<llmboost-docker-image-name>:<tag>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>
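For example (the values below are illustrative; substitute your own):

export MODEL_PATH=/data/models/Llama-3.2-1B-Instruct
export LLMBOOST_IMAGE=llmboost/llmboost:latest
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx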

0.4: Retrieve the /workspace/cluster directory from the LLMBoost Docker image

LLMBoost collaterals for a multi-node deployment are packaged in the LLMBoost Docker image. You need to copy the directory /workspace/cluster out of the container to use it on the manager node. Enter the following to create a temporary container from the LLMBoost image and copy out the directory /workspace/cluster. The last command removes the temporary container.

CONTAINER_ID=$(docker create $LLMBOOST_IMAGE)
docker cp $CONTAINER_ID:/workspace/cluster .
docker rm $CONTAINER_ID

You will be working from the copied cluster directory on the manager node for Steps 1 through 3.

Step 1: Describing the multi-node cluster

Edit the file config.json in the copied cluster directory to specify the deployment's model settings, node topology, and deployment roles. The following are the required fields.

  • vars.hf_token: your HuggingFace token for downloading models
  • vars.model_name: the model to load (e.g., "meta-llama/Llama-3.2-1B-Instruct")
  • node_config.nodes.<name>.private_ip: the node's internal IP address for communication between cluster nodes
  • node_config.nodes.<name>.public_ip: the node's public IP address or hostname for external SSH access
  • node_config.common_ssh_option: global SSH options (e.g., User, IdentityFile) used by all nodes
  • node_config.manager_node: the hostname of the manager node
  • node_config.worker_nodes: the hostnames of the worker nodes

The following example describes a two-node cluster comprising node-0 and node-1. Both nodes are workers, with node-0 also designated as the manager node. The file also specifies the language model to use.

{
  "vars": {
    "hf_token": "<PUT-YOUR-HF-TOKEN-HERE>",
    "model_name": "meta-llama/Llama-3.2-1B-Instruct"
  },
  "node_config": {
    "common_ssh_option": {
      "User": "<Username on the nodes>",
      "IdentityFile": "~/.ssh/id_rsa"
    },
    "nodes": {
      "node-0": {
        "private_ip": "<Node 0 private IP>",
        "public_ip": "<Node 0 public IP>"
      },
      "node-1": {
        "private_ip": "<Node 1 private IP>",
        "public_ip": "<Node 1 public IP>"
      }
    },
    "manager_node": "node-0",
    "worker_nodes": [
      "node-0",
      "node-1"
    ]
  }
}

Step 2: Prepare and launch inference services

Navigate to the cluster directory on the manager node. Enter the following to launch the deployment:

./llmboost_cluster

To enable verbose logging, enter the following instead:

ARGS=--verbose ./llmboost_cluster

Please make a note of the login token and the login credentials returned by llmboost_cluster. You will need them to access the management UI in Step 3 and the monitoring UI in Step 5, respectively.

The llmboost_cluster command sets up Kubernetes, installs dependencies, and initializes the manager and worker nodes.
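To confirm that the cluster formed, you could list the Kubernetes nodes from the manager node (this assumes llmboost_cluster leaves a working kubectl behind on the manager, which is an assumption rather than something this tutorial states):

kubectl get nodes

Both node-0 and node-1 should report a Ready status once initialization completes.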

The inference services on worker nodes will listen for requests on port 30080. To stop the multi-node inference service, enter the following from the cluster directory on the manager node:

./llmboost_stop

Step 3a: Accessing the cluster management UI

A management UI is served from port 30081 of the manager node. To access the management UI, point a browser to the manager node's IP address followed by the port number 30081. For example, if the manager node's IP is 10.128.0.1, point the browser to address 10.128.0.1:30081. You can use localhost:30081 if the browser is running on the manager node.
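If the browser runs on a machine that cannot reach port 30081 directly, one option is a standard SSH tunnel (the username and address below are illustrative):

ssh -L 30081:localhost:30081 <username>@<manager-public-ip>

Then point the local browser to localhost:30081.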

When accessed for the first time, the management UI will ask for a login token.

UI login screen

After submitting the login token from Step 2, you can access a management UI similar to the screenshot below.

Management UI screen

Step 3b: Monitoring the LLMBoost services

Give the inference services a few minutes to get set up after launching them. Afterward, use curl to check if the inference services are ready to accept requests. For two worker nodes with private IP addresses 10.4.16.1 and 10.4.16.2, check their status from the manager node with the following commands:

curl 10.4.16.1:30080/status
curl 10.4.16.2:30080/status

The curl command should return output similar to what is shown below, indicating that the endpoints are running.

{"status":"running","server_name":"meta-llama/Llama-3.2-1B-Instruct"}

💡 If you cannot install curl on the manager node, you can launch an LLMBoost container on the manager node and use the curl command from inside the container.
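A minimal sketch of such a one-off check, assuming the image ships with curl (as the tip implies) and permits overriding the entrypoint:

docker run --rm --network host --entrypoint curl $LLMBOOST_IMAGE 10.4.16.1:30080/status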

Step 4: Accessing the services as a client

The rest of this tutorial will cover accessing the multi-node inference service from a client's perspective. Follow the instructions from earlier tutorials to start an LLMBoost container on any node that can connect to the manager and worker nodes. You will be entering client commands from inside the container in the rest of the steps.

4.1: Connecting to an endpoint with llmboost client

Inference services on worker nodes will listen for requests on port 30080. Enter the following in an LLMBoost container to start a client that connects to the inference service on worker node 10.4.16.1:

llmboost client --host 10.4.16.1 --port 30080

💡 If the client node is outside the cluster, you will need to access the worker's public IP address.

4.2: Connecting to multiple endpoints with Python API

A better way to use the multi-node inference service is to spread request traffic across nodes for concurrent processing. You can try this with the simple Python client script provided below.

Sample Python Client (client.py)

import argparse
import threading
from queue import Empty, Queue

from openai import OpenAI

# Prompts to distribute across the endpoints
prompts = [
    "how does multithreading work in Python?",
    "Write me a Fibonacci generator in Python",
    "Which pet should I get, dog or cat?",
    "How do I fine-tune an LLM model?",
]

# Thread worker: drain prompts from the shared queue and send each to one endpoint
def run_thread(host, queue: Queue):
    client = OpenAI(
        base_url=f"http://{host}/v1",
        api_key="-",
    )
    while True:
        try:
            # get_nowait avoids the race between an empty() check and a blocking get()
            prompt = queue.get_nowait()
        except Empty:
            break
        chat_completion = client.chat.completions.create(
            model="meta-llama/Llama-3.2-1B-Instruct",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            stream=False,
            max_tokens=1000,
        )
        print(
            "-------------------------------------------------------------------\n"
            f"Question: {prompt}\nAnswer: {chat_completion.choices[0].message.content}"
        )

# Argument parsing
parser = argparse.ArgumentParser()
parser.add_argument("--hosts", nargs="+", help="list of servers <server_ip>:<port>")
args = parser.parse_args()

threads = []
queue = Queue()

# Populate the shared request queue
for prompt in prompts:
    queue.put(prompt)

# Launch one worker thread per host
for host in args.hosts:
    t = threading.Thread(target=run_thread, args=(host, queue))
    threads.append(t)
    t.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

To run the client:

python client.py --hosts 10.4.16.1:30080 10.4.16.2:30080

4.3: Connecting through the built-in load balancer

llmboost_cluster launches a load balancer that can adaptively distribute inference requests across the worker nodes, routing new requests to the node with the least live connections. Simply connect the client to port 8081 of the manager node to use the built-in load balancer.
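For example, reusing the llmboost client command from Step 4.1 (the manager IP below is illustrative):

llmboost client --host 10.128.0.1 --port 8081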

Step 5: Grafana monitoring UI

5.1: Accessing the monitoring UI

The monitoring UI is served from port 30082 of the manager node. For example, if the manager node's IP is 10.128.0.1, point the browser to 10.128.0.1:30082. Upon successful login with the login credentials printed at the end of the deployment, the following screen will show:

Monitoring UI screen

5.2: Adding a new dashboard

To add the visualization dashboard for various metrics, including GPU core/memory utilization, temperature, and LLMBoost request statistics, follow these steps:

(Screenshots: Steps 1 through 5 of importing the dashboard in the Grafana UI.)

Currently, LLMBoost exposes the following metrics for each of the nodes:

  • num_active_requests: the number of ongoing chat and/or completion requests in the node
  • num_total_requests_total: the total number of requests that have been served by the LLMBoost engine in the node so far
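As a sketch, you could inspect these counters directly with curl, assuming the engine exposes them on a Prometheus-style /metrics endpoint (the path is an assumption; this tutorial does not confirm it):

curl 10.4.16.1:30080/metrics | grep num_active_requests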

Other metrics, such as GPU-related metrics, are available as well. Because this information is platform-specific, please refer to your GPU vendor's documentation for the GPU-related metrics.