GEMM tuning for model inferencing with vLLM#
Note
Tuning must be done for a specific tensor-parallel size.
Collect GEMM shape details#
Collect the GEMM shape details used during this model's inference, and make sure the tensor-parallel size (tp) matches the targeted setup (the model split over N GPUs).
VLLM_TUNE_GEMM=1 VLLM_UNTUNE_FILE=untuned_gemm.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 [--distributed_executor_backend mp --tensor-parallel-size N]
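For example, for a setup that splits the model over 8 GPUs, the bracketed options are filled in with the matching tensor-parallel size (the paths remain placeholders for your environment):
VLLM_TUNE_GEMM=1 VLLM_UNTUNE_FILE=untuned_gemm.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 --distributed_executor_backend mp --tensor-parallel-size 8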
Conduct GEMM tuning#
The indtype/outdtype must be correctly specified and aligned with the dtype used for model inferencing.
For quantized INT4 models, use f16 or bf16. Refer to the ROCm vLLM GitHub repository (Sep 23, 2024, included in the v0.6.1.post1+rocm release) as a starting point.
The generated untuned_gemm.csv contains dtype information that is used by default and is only overridden by indtype and outdtype when they are specified.
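As a quick sanity check, you can inspect which dtypes were recorded before tuning. This sketch assumes the dtype is the last column of untuned_gemm.csv, which may differ across vLLM/gradlib versions:
awk -F, '{print $NF}' untuned_gemm.csv | sort | uniq -c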
Important
Do not use the tp option, as it is designed for other use cases.
Note
If gradlib fails with an out-of-memory (OOM) error, set CACHE_INVALIDATE_BUFFERS
to a smaller number (such as 11, 7, 3, or even 1).
python <vllm_path>/gradlib/gradlib/gemm_tuner.py --input_file untuned_gemm.csv --tuned_file tuned_gemm_tpN.csv [--indtype f16 --outdtype f16]
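For example, a tuning run for a tp=8, f16 setup, with CACHE_INVALIDATE_BUFFERS lowered to work around OOM, might look like this (the value 7 is only an illustration):
CACHE_INVALIDATE_BUFFERS=7 python <vllm_path>/gradlib/gradlib/gemm_tuner.py --input_file untuned_gemm.csv --tuned_file tuned_gemm_tp8.csv --indtype f16 --outdtype f16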
Run vLLM inference with tuned GEMM#
Enter the following command.
VLLM_TUNE_FILE=tuned_gemm_tpN.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 [--distributed_executor_backend mp --tensor-parallel-size N]
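For example, for the same 8-GPU setup used earlier, the run might look like the following; the tensor-parallel size must match the value used when collecting shapes and tuning:
VLLM_TUNE_FILE=tuned_gemm_tp8.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 --distributed_executor_backend mp --tensor-parallel-size 8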