Pretrain Python API
A unified API, LLMBoostPretrain, is provided to make your LLM pretraining simple, customizable, and efficient. LLMBoostPretrain supports customizable components and configuration for pretraining large language models with all major kinds of parallelism. Its parameters allow users to define the dataset pipeline, model architecture, optimizer settings, precision, logging, and distributed training behavior.
Step 0: Knowing the LLMBoostPretrain API
LLMBoostPretrain is the main class of LLMBoost's pretraining. It receives parameters describing the model architecture (number of layers, hidden layer size, etc.) and the full set of training strategies and setups (parallelism, FP8 precision, NCCL setup, etc.). The following list details the LLMBoostPretrain input parameters.
Modular Providers:
- train_valid_test_datasets_provider (func): Function to generate training, validation, and test datasets. [Required]
- model_provider (func): Function that builds and returns a complete model instance, including its configuration, transformer layer specification (e.g., TE or legacy), and optional preprocessing/postprocessing logic. [Required]
- model_type (ModelType): Enum indicating whether the model is an encoder, a decoder, or both. [Required]
- forward_step (func): Function that defines the forward pass and loss computation. [Required]
Model Architecture:
- num_layers (int): Number of transformer layers. [Required]
- hidden_size (int): Hidden dimension size. [Required]
- ffn_hidden_size (int): Feedforward network dimension size. [Required]
- num_attention_heads (int): Number of attention heads per layer. [Required]
- max_position_embeddings (int): Maximum supported sequence length. [Required]
- num_query_groups (int): Number of query groups (used in GQA). [Required]
- seq_length (int): Input sequence length during training. [Required]
- untie_embeddings_and_output_weights (bool): Whether to decouple the input embedding and output weights. Defaults to True.
- no_position_embedding (bool): If True, disables position embeddings. Defaults to True.
- disable_bias_linear (bool): If True, disables biases in linear layers. Defaults to True.
- swiglu (bool): Use the SwiGLU activation function in FFNs. Defaults to True.
- position_embedding_type (str): Type of positional embedding, e.g., "rope". Defaults to "rope".
- disable_te_fused_rope (bool): Disables TransformerEngine's fused RoPE. Defaults to False.
- init_method_std (float): Standard deviation for model weight initialization. Defaults to 0.02.
- attention_dropout (float): Dropout rate applied in attention layers. Defaults to 0.0.
- hidden_dropout (float): Dropout rate applied after linear layers. Defaults to 0.0.
- normalization (str): Normalization method to use (e.g., "RMSNorm"). Defaults to "RMSNorm".
- group_query_attention (bool): Enables Grouped Query Attention if True. Defaults to True.
- use_flash_attn (bool): Enables FlashAttention for efficient attention computation. Defaults to True.
- rotary_base (Optional[int]): Base frequency used in RoPE.
Mixture of Experts (MoE):
- num_experts (Optional[int]): Number of MoE experts.
- moe_router_topk (Optional[int]): Number of top experts per token.
- moe_router_load_balancing_type (Optional[str]): Load-balancing method ("aux_loss", etc.).
- moe_aux_loss_coeff (Optional[float]): Coefficient for the MoE auxiliary loss.
- moe_grouped_gemm (Optional[bool]): Whether to use grouped GEMM for MoE.
- moe_token_dispatcher_type (Optional[str]): Token dispatching strategy ("alltoall", etc.).
Training Hyperparameters:
- micro_batch_size (int): Batch size per GPU. [Required]
- global_batch_size (int): Total batch size across all GPUs. [Required]
- train_iters (int): Total number of training iterations. [Required]
- lr (float): Peak learning rate. Defaults to 1e-4.
- min_lr (float): Minimum learning rate after decay. Defaults to 1e-5.
- lr_decay_iters (int): Number of iterations over which the learning rate decays. Defaults to 320000.
- lr_decay_style (str): Learning-rate decay strategy ("cosine", "linear", etc.). Defaults to "cosine".
- weight_decay (float): Weight decay coefficient for the optimizer. Defaults to 1.0e-1.
- lr_warmup_iters (Optional[int]): Number of warmup iterations for the learning rate.
- clip_grad (float): Maximum allowed gradient norm. Defaults to 1.0.
- optimizer (str): Optimizer type ("adam", etc.). Defaults to "adam".
Parallelism Strategy:
- tp (int): Tensor parallel size. Defaults to 8.
- pp (int): Pipeline parallel size. Defaults to 1.
- cp (int): Context parallel size. Defaults to 1.
- ep (int): Expert parallel size. Defaults to 1.
- sp (int): Enables sequence parallelism. Defaults to 1.
- no_async_tensor_model_parallel_allreduce (bool): Disables async all-reduce across tensor parallel groups. Defaults to True.
- no_masked_softmax_fusion (bool): Disables fused masked softmax. Defaults to True.
- no_gradient_accumulation_fusion (bool): Disables gradient accumulation fusion. Defaults to True.
Data and Tokenization:
- tokenizer_model (str): Path to the tokenizer model file. [Required]
- tensorboard_dir (str): Path to store TensorBoard logs. [Required]
- tokenizer_type (str): Tokenizer implementation type (e.g., "HuggingFaceTokenizer"). Defaults to "HuggingFaceTokenizer".
- dataloader_type (str): Data loading strategy ("cyclic", etc.). Defaults to "cyclic".
- eval_interval (int): Interval (in iterations) at which to run evaluation. Defaults to 320000.
- mock_data (bool): If True, uses mock data for testing. Defaults to True.
Precision Settings:
- bf16 (bool): Use bfloat16 precision. Defaults to True.
- te_fp8 (bool): Use FP8 precision with TransformerEngine. Defaults to False.
Logging & Checkpointing:
- log_interval (int): Iteration interval for logging stats. Defaults to 1.
- save_interval (int): Iteration interval for saving checkpoints. Defaults to 5000.
- log_throughput (bool): Whether to log training throughput. Defaults to True.
- no_save_optim (bool): If True, skips saving optimizer state. Defaults to True.
- eval_iters (int): Number of evaluation iterations. Defaults to -1.
- no_load_optim (Optional[bool]): Skips loading optimizer state from a checkpoint.
- no_load_rng (Optional[bool]): Skips loading RNG state from a checkpoint.
Distributed Training:
- distributed_backend (str): Backend used for distributed training ("nccl", "gloo", etc.). Defaults to "nccl".
- distributed_timeout_minutes (int): Timeout duration (in minutes) for the distributed backend. Defaults to 120.
- use_distributed_optimizer (bool): Use the distributed optimizer. Defaults to True.
- overlap_param_gather (bool): Overlap parameter gathering with the backward pass. Defaults to True.
- overlap_grad_reduce (bool): Overlap gradient reduction with computation. Defaults to True.
- use_contiguous_parameters_in_local_ddp (bool): Use a contiguous parameter layout in local DDP. Defaults to False.
Other Settings:
- use_mcore_models (bool): Enables mCore-optimized model definitions. Defaults to True.
- gemm_tuning (bool): Enables autotuning of GEMM kernels. Defaults to True.
- args_defaults (Dict): Dictionary of default args for overriding command-line inputs. Defaults to {}.
LLMBoostPretrain also provides a simple function to start pretraining once you have instantiated the engine: call LLMBoostPretrain.run() to begin training.
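As a quick illustration of how the providers and keyword arguments fit together, the sketch below configures a small MoE-style run using the same GPT providers that appear in Step 2. This is a minimal sketch only: every keyword corresponds to a parameter documented above, but the concrete values (layer sizes, 8 experts with top-2 routing, ep=4, batch sizes, the Hugging Face tokenizer path) are illustrative assumptions rather than a validated configuration.

# Minimal sketch: MoE-style pretraining setup (values are illustrative assumptions)
from llmboost.llmboost_pretrain import LLMBoostPretrain
from llmboost.pretrain.dataset_provider.gpt_dataset_provider import gpt_dataset_provider
from llmboost.pretrain.model_provider.gpt_model_provider import gpt_model_provider
from llmboost.pretrain.forward_step.gpt_forward_step import gpt_forward_step
from megatron.core.enums import ModelType

if __name__ == "__main__":
    trainer = LLMBoostPretrain(
        gpt_dataset_provider,          # train/valid/test dataset provider
        gpt_model_provider,            # model provider
        ModelType.encoder_or_decoder,  # model type
        gpt_forward_step,              # forward pass and loss computation
        # model architecture (required)
        num_layers=16,
        hidden_size=2048,
        ffn_hidden_size=8192,
        num_attention_heads=16,
        num_query_groups=4,
        seq_length=2048,
        max_position_embeddings=8192,
        # Mixture of Experts settings (optional)
        num_experts=8,
        moe_router_topk=2,
        moe_router_load_balancing_type="aux_loss",
        moe_aux_loss_coeff=1e-2,
        # parallelism: no tensor parallelism, experts sharded across 4 GPUs
        tp=1,
        ep=4,
        # training hyperparameters (required)
        micro_batch_size=1,
        global_batch_size=8,
        train_iters=10,
        # data and logging (required); tokenizer path is an assumed example
        tokenizer_model="meta-llama/Llama-2-7b-chat-hf",
        tensorboard_dir="logs/moe_sketch",
    )
    trainer.run()  # start pretraining

As with the example in Step 2, such a script would be launched with torchrun (e.g. torchrun --nproc_per_node=4 <script>.py) so that the expert-parallel group spans the available GPUs.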
Step 1: Launch the Docker
Please start the LLMBoost container by running the command for your platform.
On NVIDIA:
docker run --rm -it \
    --gpus all \
    --network host \
    --ipc host \
    --uts host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --group-add video \
    --device /dev/dri:/dev/dri \
    mangollm/mb-llmboost-training:cuda-prod
On AMD:
docker run --rm -it \
    --network host \
    --ipc host \
    --uts host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --group-add video \
    --device /dev/dri:/dev/dri \
    --device /dev/kfd:/dev/kfd \
    mangollm/mb-llmboost-training:rocm-prod
Note: You might need permission to access this Docker image. Please contact our support at contact@mangoboost.io.
Once you are inside the container, run all of the following commands from within it.
Since the model tokenizer is downloaded from Hugging Face by default, please authenticate to Hugging Face via the Hugging Face CLI by running the following commands:
# EXAMPLE COMMANDS
export HUGGING_FACE_HUB_TOKEN=<your-hf-token>
huggingface-cli login
Also, to use the full functionality of our pretraining software, please provide your LLMBoost license inside the container. Please contact us at contact@mangoboost.io if you don't have an LLMBoost license. You can place your license inside the container by running:
echo "<your-llmboost-license>" > /workspace/llmboost_license.skm
Step 2: Using LLM Pretraining in a Python program
The following Python script gives an example of pretraining a Llama-like model. You can also find this example script at /workspace/apps/examples/pretrain/llm_pretrain_example.py.
# /workspace/apps/examples/pretrain/llm_pretrain_example.py
from llmboost.llmboost_pretrain import LLMBoostPretrain
from llmboost.pretrain.dataset_provider.gpt_dataset_provider import (
gpt_dataset_provider,
)
from llmboost.pretrain.model_provider.gpt_model_provider import gpt_model_provider
from llmboost.pretrain.forward_step.gpt_forward_step import gpt_forward_step
from megatron.core.enums import ModelType
if __name__ == "__main__":
    # example of llama2-7B pretraining
    llmboost_trainer = LLMBoostPretrain(
        gpt_dataset_provider,
        gpt_model_provider,
        ModelType.encoder_or_decoder,
        gpt_forward_step,
        args_defaults={"tokenizer_type": "GPT2BPETokenizer"},
        num_layers=32,
        hidden_size=1024,
        ffn_hidden_size=14336,
        num_attention_heads=8,
        seq_length=4096,
        max_position_embeddings=128000,
        num_query_groups=8,
        # training hyperparams
        micro_batch_size=1,
        global_batch_size=4,
        train_iters=50,
        # training strategy
        tp=1,
        # data arguments
        tokenizer_model="meta-llama/Llama-2-7b-chat-hf",
        tensorboard_dir="logs/log_0.txt",
        te_fp8=0,
    )
    llmboost_trainer.run()
Then you can run the training with the following command, which starts training the Llama-like model on 4 GPUs. You should see the model train for 50 iterations and the training loss decrease over the course of training.
torchrun --nproc_per_node=4 /workspace/apps/examples/pretrain/llm_pretrain_example.py
To run multi-node pretraining, launch torchrun on every node and specify the master node address and port along with each node's rank, e.g. torchrun --nnodes=<number of nodes> --nproc_per_node=<GPUs per node> --node_rank=<node rank> --master_addr=<master node IP> --master_port=<port> <script to run>. Please refer to PyTorch's torchrun documentation for more details.