GEMM tuning for model inferencing with vLLM#
Note
Tuning must be done for a specific tensor-parallel size.
Collect GEMM shape details#
Collect the GEMM shape details used during model inference, and ensure tp is specified to match the targeted setup (the model split over N GPUs).
VLLM_TUNE_GEMM=1 VLLM_UNTUNE_FILE=untuned_gemm.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 [--distributed_executor_backend mp --tensor-parallel-size N]
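For example, for a model split across 8 GPUs (an illustrative 8-GPU setup; substitute your own tp value), the optional arguments can be filled in as follows:
VLLM_TUNE_GEMM=1 VLLM_UNTUNE_FILE=untuned_gemm.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 --distributed_executor_backend mp --tensor-parallel-size 8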
Conduct GEMM tuning#
The indtype and outdtype must be specified correctly and aligned with the dtype used for model inference.
For quantized int4 models, use f16 or bf16. Refer to the ROCm vLLM GitHub (Sep 23, 2024, included in the v0.6.1.post1+rocm release) as a starting point.
The generated untuned_gemm.csv contains dtype information that is used by default; it is overridden only when indtype and outdtype are specified.
Important
Do not use the tp option, as it is designed for other use cases.
Note
If gradlib fails with OOM, set CACHE_INVALIDATE_BUFFERS to a prime number (such as 11, 7, 3, or even 1).
python <vllm_path>/gradlib/gradlib/gemm_tuner.py --input_file untuned_gemm.csv --tuned_file tuned_gemm_tpN.csv [--indtype f16 --outdtype f16]
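As a concrete sketch, assuming the model runs in bf16 and the shapes were collected with tensor-parallel-size 8 as in the earlier example, the tuning command might look like:
python <vllm_path>/gradlib/gradlib/gemm_tuner.py --input_file untuned_gemm.csv --tuned_file tuned_gemm_tp8.csv --indtype bf16 --outdtype bf16
If gradlib runs out of memory, prefix the same command with the workaround from the note above, for example CACHE_INVALIDATE_BUFFERS=7.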
Run vLLM inference with tuned GEMM#
Enter the following command.
VLLM_TUNE_FILE=tuned_gemm_tpN.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 [--distributed_executor_backend mp --tensor-parallel-size N]
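For example, continuing the illustrative 8-GPU setup, the tuned file produced in the previous step is passed through VLLM_TUNE_FILE:
VLLM_TUNE_FILE=tuned_gemm_tp8.csv python <vllm_path>/benchmarks/benchmark_throughput.py --model <model_path> --trust-remote-code --dataset <dataset_path>/ShareGPT_V3_unfiltered_cleaned_split.json --num_prompts 1000 --distributed_executor_backend mp --tensor-parallel-size 8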