Megatron training configuration files#
2025-11-12
19 min read time
Megatron training enables scaling transformer models from millions to trillions of parameters by efficiently utilizing hundreds or thousands of GPUs across multiple nodes.
The Megatron tests check:

- Container orchestration: Docker setup with ROCm/RDMA
- Multi-node communication: NCCL/RCCL initialization
- Model convergence: Loss decreases and no NaN/Inf values
- Performance targets: Throughput and memory usage within expected ranges
- Result verification: Expected tokens/sec and TFLOPS metrics
Change the parameters as needed in the Megatron training configuration files: mi3xx_singlenode_megatron_llama.json and mi3xx_distributed_megatron_llama.json for single-node and distributed (multi-node) configurations, respectively.
Additionally, you can configure the mi35x_singlenode_megatron_llama.json file to run Megatron on a single MI35x node.
Note
Parameters with the `<changeme>` value must have that value modified to your specifications. `{user-id}` is resolved to the current username at runtime. You can also manually change this value to your username.
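For example, to run on a single node, the placeholders near the top of the config could be filled in as follows. These values are purely illustrative (the username alice and the iteration count are assumptions, not recommendations):

{
    "nnodes": "1",
    "master_address": "localhost",
    "training_iterations": "30",
    "hf_token_file": "/home/alice/.hf_token",
    "data_cache_dir": "/home/alice/cache",
    "log_dir": "/home/alice/LOG_DIR"
}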
Single node configuration#
This is the mi3xx_singlenode_megatron_llama.json configuration file:
mi3xx_singlenode_megatron_llama.json
{
"config":
{
"container_image": "rocm/megatron-lm:v25.5_py310",
"container_name": "megatron_llama3.1_310",
"_example_nnodes": "4",
"nnodes": "<changeme>-no of nodes to run singlenode training",
"master_address": "<changeme>",
"_example_training_iterations": "30",
"training_iterations": "<changeme>",
"hf_token_file": "/home/{user-id}/.hf_token",
"shm_size": "128G",
"_comments_data_cache_dir": "This path should be accessible from all nodes like a common FS like NFS for distributed training",
"data_cache_dir": "/home/{user-id}/cache",
"mock_data": "True",
"log_dir": "/home/{user-id}/LOG_DIR",
"dataset_source":
{
},
"container_config":
{
"device_list": [ "/dev/dri", "/dev/kfd" ],
"volume_dict":
{
"/home/{user-id}": "/home/{user-id}"
}
}
},
"model_params":
{
"single_node":
{
"llama3_1_8b":
{
"mi300x":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-8B",
"model_size": "8",
"batch_size": "128",
"micro_batch_size": "2",
"precision": "TE_FP8",
"sequence_length": "8192",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"throughput_per_gpu": "380.0",
"tokens_per_gpu": "6500.0",
"elapsed_time_per_iteration": "12000.0"
}
},
"mi325":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-8B",
"model_size": "8",
"batch_size": "128",
"micro_batch_size": "2",
"precision": "TE_FP8",
"sequence_length": "8192",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"throughput_per_gpu": "380.0",
"tokens_per_gpu": "6500.0",
"elapsed_time_per_iteration": "12000.0"
}
}
},
"llama3_1_70b":
{
"mi300x":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-70B",
"model_size": "70",
"batch_size": "128",
"micro_batch_size": "1",
"precision": "TE_FP8",
"sequence_length": "8192",
"tensor_parallelism": "8",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"throughput_per_gpu": "500.0",
"tokens_per_gpu": "1000.0"
}
},
"mi325":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-70B",
"model_size": "70",
"batch_size": "128",
"micro_batch_size": "1",
"precision": "TE_FP8",
"sequence_length": "8192",
"tensor_parallelism": "8",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"throughput_per_gpu": "520.0",
"tokens_per_gpu": "1100.0"
}
}
}
}
}
}
Parameters#
Use the parameters in these tables to configure the training file.
config#
| Configuration parameters | Default values | Description |
|---|---|---|
| `container_image` | `rocm/megatron-lm:v25.5_py310` | Docker image used to run Megatron-LM |
| `container_name` | `megatron_llama3.1_310` | Name assigned to the container instance |
| `_example_nnodes` | `4` | Example of number of cluster nodes participating in the job |
| `nnodes` | `<changeme>` | Number of nodes to use for the training run |
| `master_address` | `<changeme>` | IP of the master/coordinator node |
| `_example_training_iterations` | `30` | Example of number of training iterations/steps to run in this test |
| `training_iterations` | `<changeme>` | Number of training iterations/steps to run in this test |
| `hf_token_file` | `/home/{user-id}/.hf_token` | Path to a Hugging Face token file used to download tokenizers, models, and datasets that require authorization |
| `shm_size` | `128G` | Docker shared memory size mounted into the container |
| `_comments_data_cache_dir` | "This path should be accessible from all nodes like a common FS like NFS for distributed training" | Comment explaining `data_cache_dir` |
| `data_cache_dir` | `/home/{user-id}/cache` | Dataset/cache directory |
| `mock_data` | `True` | `True`/`False`: use synthetic data (`True`) to avoid real dataset downloads in CI/smoke tests |
| `log_dir` | `/home/{user-id}/LOG_DIR` | Path where training logs should be written on the host |
dataset_source/container_config#
| Configuration parameters | Default values | Description |
|---|---|---|
| `device_list` | `/dev/dri`, `/dev/kfd` | Kernel devices exposed inside the container |
| `volume_dict` | N/A | Host-to-container mounts: map host paths (home directory, RDMA libraries, log directories) into the container |
| `/home/{user-id}` | `/home/{user-id}` | The user's home directory, mounted at the same path inside the container |
model_params/single_node/llama3_1_8b/mi300x#
| Configuration parameters | Default values | Description |
|---|---|---|
| `model_params` | N/A | Model parameters |
| `single_node` | N/A | The structure (single node) |
| `llama3_1_8b` | N/A | The model being used |
| `mi300x` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-8B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `8` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `2` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"throughput_per_gpu": "380.0",
"tokens_per_gpu": "6500.0",
"elapsed_time_per_iteration": "12000.0"
}
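If an 8B run exceeds GPU memory on your system (for example, after increasing micro_batch_size or sequence_length), activation recomputation and FSDP can be switched on, as the MI35x configurations later on this page do. The following is only an illustrative variant of the entry above; the result_dict targets would then need to be re-baselined for the new settings:

{
    "micro_batch_size": "2",
    "recompute": "1",
    "fsdp": "1"
}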
model_params/single_node/llama3_1_8b/mi325#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi325` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-8B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `8` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `2` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"throughput_per_gpu": "380.0",
"tokens_per_gpu": "6500.0",
"elapsed_time_per_iteration": "12000.0"
}
model_params/single_node/llama3_1_70b/mi300x#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi300x` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-70B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `70` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `1` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `8` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"throughput_per_gpu": "500.0",
"tokens_per_gpu": "1000.0",
}
model_params/single_node/llama3_1_70b/mi325#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi325` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-70B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `70` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `1` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `8` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"throughput_per_gpu": "520.0",
"tokens_per_gpu": "1100.0",
}
Single node MI35x configuration#
Use the mi35x_singlenode_megatron_llama.json configuration file to run Megatron on a single MI35x node.
mi35x_singlenode_megatron_llama.json
{
"config":
{
"container_image": "rocm/megatron-lm:v25.9_gfx950",
"container_name": "megatron_llama3.1_310",
"_example_nnodes": "4",
"nnodes": "<changeme>-no of nodes to run singlenode training",
"master_address": "localhost",
"_example_training_iterations": "30",
"training_iterations": "<changeme>",
"hf_token_file": "/home/{user-id}/.hf_token",
"shm_size": "128G",
"_comments_data_cache_dir": "This path should be accessible from all nodes like a common FS like NFS for distributed training",
"data_cache_dir": "/home/{user-id}/cache",
"mock_data": "True",
"log_dir": "/home/{user-id}/LOG_DIR",
"dataset_source":
{
},
"container_config":
{
"device_list": [ "/dev/dri", "/dev/kfd" ],
"volume_dict":
{
"/home/{user-id}": "/home/{user-id}"
}
}
},
"model_params":
{
"single_node":
{
"llama3_1_8b":
{
"mi350":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-8B",
"model_size": "8",
"batch_size": "128",
"micro_batch_size": "4",
"precision": "TE_FP8",
"sequence_length": "8192",
"fsdp": "0",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "0",
"result_dict":
{
"tokens_per_gpu": "18000.0"
}
},
"mi355":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-8B",
"model_size": "8",
"batch_size": "128",
"micro_batch_size": "4",
"precision": "TE_FP8",
"sequence_length": "8192",
"fsdp": "1",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "1",
"result_dict":
{
"tokens_per_gpu": "20000.0"
}
}
},
"llama3_1_70b":
{
"mi350":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-70B",
"model_size": "70",
"batch_size": "24",
"micro_batch_size": "3",
"precision": "TE_FP16",
"sequence_length": "8192",
"fsdp": "1",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "1",
"result_dict":
{
"tokens_per_gpu": "2000.0"
}
},
"mi355":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-70B",
"model_size": "70",
"batch_size": "24",
"micro_batch_size": "3",
"precision": "TE_FP16",
"sequence_length": "8192",
"fsdp": "1",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "1",
"result_dict":
{
"tokens_per_gpu": "2100.0"
}
}
}
}
}
}
Parameters#
Use the parameters in these tables to configure the training file.
config#
| Configuration parameters | Default values | Description |
|---|---|---|
| `container_image` | `rocm/megatron-lm:v25.9_gfx950` | Docker image used to run Megatron-LM |
| `container_name` | `megatron_llama3.1_310` | Name assigned to the container instance |
| `_example_nnodes` | `4` | Example of number of cluster nodes participating in the job |
| `nnodes` | `<changeme>` | Number of nodes to use for the training run |
| `master_address` | `localhost` | IP of the master/coordinator node |
| `_example_training_iterations` | `30` | Example of number of training iterations/steps to run in this test |
| `training_iterations` | `<changeme>` | Number of training iterations/steps to run in this test |
| `hf_token_file` | `/home/{user-id}/.hf_token` | Path to a Hugging Face token file used to download tokenizers, models, and datasets that require authorization |
| `shm_size` | `128G` | Docker shared memory size mounted into the container |
| `_comments_data_cache_dir` | "This path should be accessible from all nodes like a common FS like NFS for distributed training" | Comment explaining `data_cache_dir` |
| `data_cache_dir` | `/home/{user-id}/cache` | Dataset/cache directory |
| `mock_data` | `True` | `True`/`False`: use synthetic data (`True`) to avoid real dataset downloads in CI/smoke tests |
| `log_dir` | `/home/{user-id}/LOG_DIR` | Path where training logs should be written on the host |
dataset_source/container_config#
| Configuration parameters | Default values | Description |
|---|---|---|
| `device_list` | `/dev/dri`, `/dev/kfd` | Kernel devices exposed inside the container |
| `volume_dict` | N/A | Host-to-container mounts: map host paths (home directory, RDMA libraries, log directories) into the container |
| `/home/{user-id}` | `/home/{user-id}` | The user's home directory, mounted at the same path inside the container |
model_params/single_node/llama3_1_8b/mi350#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi350` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-8B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `8` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `4` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"tokens_per_gpu": "18000.0"
}
model_params/single_node/llama3_1_8b/mi355#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi355` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-8B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `8` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `4` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `1` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `1` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"tokens_per_gpu": "20000.0"
}
model_params/single_node/llama3_1_70b/mi350#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi350` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-70B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `70` | The abbreviated model size |
| `batch_size` | `24` | Global batch size |
| `micro_batch_size` | `3` | Per-device micro-batch size |
| `precision` | `TE_FP16` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `1` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `1` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"tokens_per_gpu": "2000.0"
}
model_params/single_node/llama3_1_70b/mi355#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi355` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-70B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `70` | The abbreviated model size |
| `batch_size` | `24` | Global batch size |
| `micro_batch_size` | `3` | Per-device micro-batch size |
| `precision` | `TE_FP16` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `1` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `1` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"tokens_per_gpu": "2100.0"
}
Distributed node configuration#
This is the multi-node mi3xx_distributed_megatron_llama.json configuration file:
mi3xx_distributed_megatron_llama.json
{
"config":
{
"_comments__": "Config file created for 4 nodes, change expected results based on number of nodes",
"container_image": "rocm/megatron-lm:v25.5_py310",
"container_name": "megatron_llama3.1_310",
"distributed_training": "True",
"_example_nnodes": "4",
"nnodes": "<changeme>",
"_example_master_address": "X.X.X.X",
"master_address": "<changeme>",
"_example_training_iterations": "30",
"training_iterations": "<changeme>",
"_example_nic_type": "ainic|thor2|cx7",
"nic_type": "<changeme>",
"_example_nccl_ib_hca_list": "bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7",
"nccl_ib_hca_list": "<changeme>",
"_example_nccl_socket_ifname": "ens51f1np1",
"nccl_socket_ifname": "<changeme>",
"_example_gloo_socket_ifname": "ens51f1np1",
"gloo_socket_ifname": "<changeme>",
"_example_nccl_ib_gid_index": "3",
"nccl_ib_gid_index": "<changeme>",
"nccl_debug": "ERROR",
"hf_token_file": "/home/{user-id}/.hf_token",
"shm_size": "128G",
"_comments_data_cache_dir": "This path should be accessible from all nodes like a common FS like NFS for distributed training",
"data_cache_dir": "/home/{user-id}/cache",
"mock_data": "True",
"log_dir": "/home/{user-id}/LOG_DIR",
"dataset_source":
{
},
"container_config":
{
"device_list": [ "/dev/dri", "/dev/kfd", "/dev/infiniband/rdma_cm" ],
"volume_dict":
{
"/home/{user-id}": "/home/{user-id}",
"/dev/infiniband": "/dev/infiniband",
"/usr/local/lib/libbnxt_re-rdmav34.so": "/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so.host",
"/lib/libibverbs.d": "/lib/libibverbs.d",
"/tmp/TRAINING_LOGS": "/workspace/Megatron-LM/output"
}
}
},
"model_params":
{
"multi_node":
{
"llama3_1_8b":
{
"mi300x":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-8B",
"model_size": "8",
"batch_size": "128",
"micro_batch_size": "2",
"precision": "TE_FP8",
"sequence_length": "8192",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"_example_throughput_per_gpu": "610.0",
"_example_tokens_per_gpu": "12000.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}
},
"mi325":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-8B",
"model_size": "8",
"batch_size": "128",
"micro_batch_size": "2",
"precision": "TE_FP8",
"sequence_length": "8192",
"tensor_parallelism": "1",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"_example_throughput_per_gpu": "620.0",
"_example_tokens_per_gpu": "14000.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}
}
},
"llama3_1_70b":
{
"mi300x":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-70B",
"model_size": "70",
"batch_size": "256",
"micro_batch_size": "4",
"precision": "TE_FP16",
"sequence_length": "8192",
"tensor_parallelism": "8",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"_example_throughput_per_gpu": "530.0",
"_example_tokens_per_gpu": "1100.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}
},
"mi325":
{
"tokenizer_model": "NousResearch/Meta-Llama-3-70B",
"model_size": "70",
"batch_size": "256",
"micro_batch_size": "4",
"precision": "TE_FP16",
"sequence_length": "8192",
"tensor_parallelism": "8",
"pipeline_parallelism": "1",
"recompute": "0",
"fsdp": "0",
"result_dict":
{
"_example_throughput_per_gpu": "550.0",
"_example_tokens_per_gpu": "1200.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}
}
}
}
}
}
Parameters#
Use the parameters in these tables to configure the training file.
config#
| Configuration parameters | Default values | Description |
|---|---|---|
| `_comments__` | "Config file created for 4 nodes, change expected results based on number of nodes" | A generic comment about the configuration file |
| `container_image` | `rocm/megatron-lm:v25.5_py310` | Docker image used to run Megatron-LM |
| `container_name` | `megatron_llama3.1_310` | Name assigned to the container instance |
| `distributed_training` | `True` | `True`/`False`: whether to run training across multiple nodes |
| `_example_nnodes` | `4` | Example of number of cluster nodes participating in the job |
| `nnodes` | `<changeme>` | Number of cluster nodes participating in the distributed job |
| `_example_master_address` | `X.X.X.X` | Example IP of the master/coordinator node |
| `master_address` | `<changeme>` | IP of the master/coordinator node |
| `_example_training_iterations` | `30` | Example of number of training iterations/steps to run in this test |
| `training_iterations` | `<changeme>` | Number of training iterations/steps to run in this test |
| `_example_nic_type` | `ainic\|thor2\|cx7` | Example NIC hardware types |
| `nic_type` | `<changeme>` | NIC hardware type |
| `_example_nccl_ib_hca_list` | `bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7` | Example of a comma-separated list of InfiniBand HCA device names to use for NCCL communication (multi-rail support) |
| `nccl_ib_hca_list` | `<changeme>` | Comma-separated list of InfiniBand HCA device names to use for NCCL communication (multi-rail support) |
| `_example_nccl_socket_ifname` | `ens51f1np1` | Example of the network interface name used by NCCL / control channels |
| `nccl_socket_ifname` | `<changeme>` | Network interface name used by NCCL / control channels |
| `_example_gloo_socket_ifname` | `ens51f1np1` | Example of the network interface name used by Gloo control channels |
| `gloo_socket_ifname` | `<changeme>` | Network interface name used by Gloo control channels |
| `_example_nccl_ib_gid_index` | `3` | Example of the GID index used for IB addressing (selects which GID entry on the HCA to use) |
| `nccl_ib_gid_index` | `<changeme>` | GID index used for IB addressing (selects which GID entry on the HCA to use) |
| `nccl_debug` | `ERROR` | NCCL log level |
| `hf_token_file` | `/home/{user-id}/.hf_token` | Path to a Hugging Face token file used to download tokenizers, models, and datasets that require authorization |
| `shm_size` | `128G` | Docker shared memory size |
| `_comments_data_cache_dir` | "This path should be accessible from all nodes like a common FS like NFS for distributed training" | A comment explaining `data_cache_dir` |
| `data_cache_dir` | `/home/{user-id}/cache` | Dataset/cache directory (should be shared across nodes for distributed training unless using per-node copies) |
| `mock_data` | `True` | `True`/`False`: use synthetic data (`True`) to avoid real dataset downloads in CI/smoke tests |
| `log_dir` | `/home/{user-id}/LOG_DIR` | Path where training logs should be written on the host |
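For example, on a cluster with Broadcom Thor2 NICs, the networking placeholders could be filled in with the _example_* values that ship in the file. These are illustrative only; the HCA list and interface names depend on your hosts, so verify them against your cluster's network inventory before use:

{
    "nic_type": "thor2",
    "nccl_ib_hca_list": "bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7",
    "nccl_socket_ifname": "ens51f1np1",
    "gloo_socket_ifname": "ens51f1np1",
    "nccl_ib_gid_index": "3"
}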
dataset_source/container_config#
| Configuration parameters | Default values | Description |
|---|---|---|
| `device_list` | `/dev/dri`, `/dev/kfd`, `/dev/infiniband/rdma_cm` | Kernel devices exposed inside the container |
| `volume_dict` | N/A | Host-to-container mounts: map host paths (home directory, RDMA libraries, log directories) into the container |
| `/home/{user-id}` | `/home/{user-id}` | Mount the user's home directory into the container at the same path |
| `/dev/infiniband` | `/dev/infiniband` | Expose InfiniBand device nodes into the container |
| `/usr/local/lib/libbnxt_re-rdmav34.so` | `/usr/lib/x86_64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so.host` | Mount the host's Broadcom NIC driver library into the container |
| `/lib/libibverbs.d` | `/lib/libibverbs.d` | Mount the InfiniBand verbs library configuration directory |
| `/tmp/TRAINING_LOGS` | `/workspace/Megatron-LM/output` | Map the host log directory to Megatron's expected output location inside the container |
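Because data_cache_dir should be reachable from every node, a shared filesystem path can be used and, if it lies outside the home directory that is already mounted, added to volume_dict as well. A hypothetical addition (the /nfs/megatron_cache path is only an example, not part of the shipped file):

{
    "data_cache_dir": "/nfs/megatron_cache",
    "container_config":
    {
        "volume_dict":
        {
            "/nfs/megatron_cache": "/nfs/megatron_cache"
        }
    }
}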
model_params/multi_node/llama3_1_8b/mi300x#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi300x` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-8B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `8` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `2` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"_example_throughput_per_gpu": "610.0",
"_example_tokens_per_gpu": "12000.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}
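The _comments__ field notes that this file was written for a 4-node run, so these targets scale with cluster size. For a 4-node MI300X cluster you could start from the example values shipped in the file; for other node counts, derive the targets from a known-good baseline run. An illustrative fill-in:

{
    "throughput_per_gpu": "610.0",
    "tokens_per_gpu": "12000.0"
}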
model_params/multi_node/llama3_1_8b/mi325#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi325` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-8B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `8` | The abbreviated model size |
| `batch_size` | `128` | Global batch size |
| `micro_batch_size` | `2` | Per-device micro-batch size |
| `precision` | `TE_FP8` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `1` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"_example_throughput_per_gpu": "610.0",
"_example_tokens_per_gpu": "14000.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}
model_params/multi_node/llama3_1_70b/mi300x#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi300x` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-70B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `70` | The abbreviated model size |
| `batch_size` | `256` | Global batch size |
| `micro_batch_size` | `4` | Per-device micro-batch size |
| `precision` | `TE_FP16` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `8` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"_example_throughput_per_gpu": "530.0",
"_example_tokens_per_gpu": "1100.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}
model_params/multi_node/llama3_1_70b/mi325#
| Configuration parameters | Default values | Description |
|---|---|---|
| `mi325` | N/A | The GPU being used |
| `tokenizer_model` | `NousResearch/Meta-Llama-3-70B` | HF model identifier or local path used to initialize the tokenizer |
| `model_size` | `70` | The abbreviated model size |
| `batch_size` | `256` | Global batch size |
| `micro_batch_size` | `4` | Per-device micro-batch size |
| `precision` | `TE_FP16` | Numeric precision mode used |
| `sequence_length` | `8192` | Maximum sequence length / context size |
| `tensor_parallelism` | `8` | Degree of tensor-model parallelism |
| `pipeline_parallelism` | `1` | Pipeline parallel stage count |
| `recompute` | `0` | Enable activation recomputation/checkpointing to reduce memory at the cost of extra compute |
| `fsdp` | `0` | Whether FSDP-style fully sharded data parallelism is enabled |
This section also contains the result_dict parameter, which describes the expected/target metrics the tests use to verify correctness and performance:
result_dict
"result_dict":
{
"_example_throughput_per_gpu": "550.0",
"_example_tokens_per_gpu": "1200.0",
"throughput_per_gpu": "<changeme>",
"tokens_per_gpu": "<changeme>"
}