GEMM tuning for model inferencing with vLLM

Note
Tuning must be done for a specific tensor-parallel-size. Changing the number of GPUs changes the GEMM shapes, so repeat the tuning for each tensor-parallel setting you target.

Collect GEMM shape details

Collect the GEMM shape details used during inference for this model, and make sure the tensor-parallel size (tp) is specified to match the targeted setup (the model split across N GPUs).

VLLM_TUNE_GEMM=1 VLLM_UNTUNE_FILE=untuned_gemm.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 [--distributed_executor_backend mp --tensor-parallel-size N]
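
Once the run completes, you can sanity-check the collected shape file before tuning. The commands below are only an illustrative sketch; the exact column layout of untuned_gemm.csv depends on the vLLM/gradlib version you are using.

# Inspect the first few collected GEMM shapes and count how many entries were captured
head -n 5 untuned_gemm.csv
wc -l untuned_gemm.csv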

Conduct GEMM tuning

The indtype and outdtype options must be specified correctly and must match the dtype used for model inferencing.

For quantized int4 models, use f16 or bf16. Refer to the ROCm vLLM GitHub (Sep 23, 2024, included in the v0.6.1.post1+rocm release) as a starting point.

The generated untuned_gemm.csv contains dtype information that is used by default; it is overridden only when indtype and outdtype are specified explicitly.

Important
Do not use the tp option, as it is designed for other use cases.

Note
If gradlib fails with an out-of-memory error, set CACHE_INVALIDATE_BUFFERS to a smaller value (such as 11, 7, 3, or even 1).

python <vllm_path>/gradlib/gradlib/gemm_tuner.py --input_file untuned_gemm.csv --tuned_file tuned_gemm_tpN.csv [--indtype f16 --outdtype f16]
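
If you hit the out-of-memory condition mentioned in the note above, one way to retry is to lower CACHE_INVALIDATE_BUFFERS on the same tuning command. The line below is a minimal sketch reusing the file names from this section; the value 7 is just an example.

# Retry tuning with fewer cache-invalidation buffers if the default run exhausts GPU memory
CACHE_INVALIDATE_BUFFERS=7 python <vllm_path>/gradlib/gradlib/gemm_tuner.py --input_file untuned_gemm.csv --tuned_file tuned_gemm_tpN.csv [--indtype f16 --outdtype f16]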

Run vLLM inference with tuned GEMM

Run the benchmark again, pointing VLLM_TUNE_FILE at the tuned file:

VLLM_TUNE_FILE=tuned_gemm_tpN.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 [--distributed_executor_backend mp --tensor-parallel-size N]
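
To gauge the benefit of tuning, you can compare against a baseline run of the same benchmark without VLLM_TUNE_FILE set. The command below is a sketch that reuses the same model, dataset, and tensor-parallel settings as above.

# Baseline run without the tuned GEMM file, for throughput comparison
python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 [--distributed_executor_backend mp --tensor-parallel-size N]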