
Using LLMBoost Python API

In the previous tutorial, you used the llmboost command to work with LLMBoost interactively. You can do everything you did there, and more, from a Python program (for example, to process a batch of requests or to integrate LLM inference into an application).

This tutorial will showcase a simple Python script that sets up an inference engine, submits a prompt, and retrieves the response.

As with the llmboost interactive command, the goal of the LLMBoost Python API is to shield the user from the details of running different models and inference engines on different hardware platforms. Please refer to the LLMBoost SDK Deep Dive for a more in-depth presentation of the Python API.

Step 0: Before you start

Enter the following to set up the environment variables and start the LLMBoost container:

export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>

💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.

  • Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
  • Set LICENSE_FILE to the absolute path of your license file. Please contact us at contact@mangoboost.io if you don't have an LLMBoost license.
  • Set HUGGING_FACE_HUB_TOKEN to a Hugging Face access token obtained from huggingface.co/settings/tokens.

Then launch the LLMBoost container:

docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash

Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.

The LLMBoost container starts an interactive shell session. Enter the commands in the rest of this tutorial at the shell prompt inside the container.

Step 1: Using LLM inference in a Python program

You can find the Python script for this example at /workspace/release/tutorial/llmboost_sdk_tutorial1.py. Run it by typing the following command at the shell prompt inside the LLMBoost container.

python3 /workspace/release/tutorial/llmboost_sdk_tutorial1.py

To examine the script, open /workspace/release/tutorial/llmboost_sdk_tutorial1.py in an editor.

This script first imports the llmboost library. When main() starts, the first line creates an LLMBoost object called llm for the specified language model. On the next line, llm.start() starts the inference engine.

from llmboost import LLMBoost

def main():
    llm = LLMBoost(model_name="meta-llama/Llama-3.1-8B-Instruct")
    llm.start()

In the next set of lines, llm.apply_format() automatically converts the system and user prompts into a format compatible with the chosen language model and inference engine. llm.issues_inputs() then submits the formatted prompt to the inference engine.

    # Prepare formatted inputs
    formatted_input = llm.apply_format([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the cutest cat?"}
    ])
    llm.issues_inputs([formatted_input])

Next, the script sets up a polling loop that queries the inference engine repeatedly using llm.get_output(). llm.get_output() returns once for each complete response to a user prompt. For each output returned, an inner loop prints the tokens it contains. The polling loop exits after seeing a final token carrying the key-value pair "finished": True.

    output_received = False
    while not output_received:
        outputs = llm.get_output()
        for out in outputs:
            print(out["val"], end="")
            if out["finished"]:
                print("\n")
                output_received = True

Finally, the main function shuts down the inference engine before exiting.

    llm.stop()


if __name__ == "__main__":
    main()
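
Because llm.issues_inputs() accepts a list, the same calls can also process a batch of prompts. The following is a minimal sketch under the assumption that each output dictionary looks exactly as shown above (a "val" string and a "finished" flag); the tutorial does not show how outputs are matched back to their originating prompts, so the sketch simply counts completed responses.

from llmboost import LLMBoost

# Placeholder prompts for illustration only.
questions = [
    "What is the cutest cat?",
    "Name three benefits of quantized models.",
]

llm = LLMBoost(model_name="meta-llama/Llama-3.1-8B-Instruct")
llm.start()

# Format every prompt and submit the whole batch at once.
batch = [
    llm.apply_format([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": q},
    ])
    for q in questions
]
llm.issues_inputs(batch)

# Poll until one finished response has arrived per submitted prompt.
finished = 0
while finished < len(batch):
    for out in llm.get_output():
        print(out["val"])
        if out["finished"]:
            finished += 1

llm.stop()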

Step 2: Asynchronous I/O

In the first script, llm.get_output() returns only after the complete response for a prompt is available. If the printed output is consumed in real time, the user perceives a poor response time on lengthy responses. The second Python script, /workspace/release/tutorial/llmboost_sdk_tutorial2.py, is very similar to the first one, except that it begins printing tokens as soon as they become available. Give it a try by typing:

python3 /workspace/release/tutorial/llmboost_sdk_tutorial2.py

To examine how this is done, open /workspace/release/tutorial/llmboost_sdk_tutorial2.py in an editor.

The top of the script imports the asyncio library in addition to the llmboost library. In main(), the LLMBoost object is created with the streaming and enable_async_output options enabled. The lines that format and submit the prompt are unchanged from the first example.

import asyncio
from llmboost import LLMBoost


async def main():
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,
        enable_async_output=True
    )
    llm.start()

    # Format input
    formatted_input = llm.apply_format([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the benefits of using quantized models."}
    ])

    # Issue the input
    llm.issues_inputs([formatted_input])

Like the first example, the script sets up a polling loop to query the inference engine repeatedly, but this time using llm.aget_output(), which returns new tokens as they become available. By printing tokens as early as possible, the user can begin reading the response sooner and perceives the system as more responsive.

    # Stream output
    while True:
        output = await llm.aget_output()
        if isinstance(output, list):
            output = output[0]

        print(output["val"], end="", flush=True)
        if output.get("finished", False):
            print()
            break

    llm.stop()

Note that in this example, main() is an async coroutine, so it must be invoked with asyncio.run(). The print calls also pass flush=True so that tokens appear immediately rather than being buffered.

if __name__ == "__main__":
    asyncio.run(main())
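
To see the responsiveness benefit concretely, you can time how long the first token takes to arrive. The sketch below reuses the streaming loop from this step unchanged and only wraps it with timing; the timing itself uses the standard time.perf_counter() and is not part of the LLMBoost API.

import asyncio
import time

from llmboost import LLMBoost


async def main():
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,
        enable_async_output=True
    )
    llm.start()

    formatted_input = llm.apply_format([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the benefits of using quantized models."}
    ])

    start = time.perf_counter()
    llm.issues_inputs([formatted_input])

    first_token_at = None
    while True:
        output = await llm.aget_output()
        if isinstance(output, list):
            output = output[0]

        # Record when the very first streamed token arrives.
        if first_token_at is None:
            first_token_at = time.perf_counter() - start

        print(output["val"], end="", flush=True)
        if output.get("finished", False):
            print()
            break

    total = time.perf_counter() - start
    print(f"Time to first token: {first_token_at:.2f} s, total: {total:.2f} s")
    llm.stop()


if __name__ == "__main__":
    asyncio.run(main())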

Step 3: Additional examples to try

If you plan to use LLMBoost through Python, please refer to the LLMBoost SDK Deep Dive for further discussion and examples. We also invite you to try the following two Python scripts. You can invoke them with different language models to measure and compare the models' performance and accuracy.

benchmark.py measures the raw performance of LLMBoost on fixed-length input and fixed-length output dummy data. To run it, enter:

python3 /workspace/apps/benchmark.py --model_name meta-llama/Llama-3.1-8B-Instruct --input_len 128 --output_len 128 --num_prompts 1000

accuracy.py measures the performance and inference accuracy of LLMBoost on a popular open-source benchmarking dataset. To run it, enter:

python3 /workspace/apps/accuracy.py --model_name meta-llama/Llama-3.1-8B-Instruct --num_prompts 1000
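
If you want to compare several models as suggested above, one option is a small Python driver that runs both scripts in a loop. The sketch below only shells out to the exact commands shown in this step via subprocess; the model list is a placeholder, and it makes no attempt to parse the scripts' output.

import subprocess

# Placeholder list; substitute the models you actually have access to.
models = [
    "meta-llama/Llama-3.1-8B-Instruct",
]

for model in models:
    print(f"=== {model} ===")
    # Raw throughput on fixed-length dummy data.
    subprocess.run([
        "python3", "/workspace/apps/benchmark.py",
        "--model_name", model,
        "--input_len", "128",
        "--output_len", "128",
        "--num_prompts", "1000",
    ], check=True)
    # Performance and inference accuracy on an open-source dataset.
    subprocess.run([
        "python3", "/workspace/apps/accuracy.py",
        "--model_name", model,
        "--num_prompts", "1000",
    ], check=True)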