Clients#
There are 4 client executables that can be used with hipBLASLt.
hipblaslt-test
hipblaslt-bench
example_hipblaslt_preference
example_hipblaslt_groupedgemm
These clients can be built by following the instructions in the Build and Install hipBLASLt github page . After building the hipBLASLt clients, they can be found in the directory hipBLASLt/build/release/clients/staging
.
The next section will cover a brief explanation and the usage of each hipBLASLt clients.
hipblaslt-test#
hipblaslt-test is the main regression gtest for hipBLASLt. All test items should pass.
Run full test items:
./hipblaslt-test
Run partial test items with filter:
./hipblaslt-test --gtest_filter=<test pattern>
Demo “quick” test:
./hipblaslt-test --gtest_filter=*quick*
hipblaslt-bench#
hipblaslt-bench is used to measure performance and to verify the correctness of hipBLASLt functions.
It has a command line interface.
For example, run fp32 GEMM with validation:
./hipblaslt-bench --precision f32_r -v 1
transA,transB,M,N,K,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,d_type,compute_type,activation_type,bias_vector,hipblaslt-Gflops,us
N,N,128,128,128,1,128,16384,0,128,16384,128,16384,128,16384,f32_r,f32_r,none,0, 415.278, 10.
For more information:
./hipblaslt-bench --help
--sizem |-m <value> Specific matrix size: the number of rows or columns in matrix. (Default value is: 128)
--sizen |-n <value> Specific matrix the number of rows or columns in matrix (Default value is: 128)
--sizek |-k <value> Specific matrix size: the number of columns in A and rows in B. (Default value is: 128)
--lda <value> Leading dimension of matrix A.
--ldb <value> Leading dimension of matrix B.
--ldc <value> Leading dimension of matrix C.
--ldd <value> Leading dimension of matrix D.
--any_stride Do not modify input strides based on leading dimensions
--stride_a <value> Specific stride of strided_batched matrix A, second dimension * leading dimension.
--stride_b <value> Specific stride of strided_batched matrix B, second dimension * leading dimension.
--stride_c <value> Specific stride of strided_batched matrix C, second dimension * leading dimension.
--stride_d <value> Specific stride of strided_batched matrix D, second dimension * leading dimension.
--alpha <value> specifies the scalar alpha (Default value is: 1)
--beta <value> specifies the scalar beta (Default value is: 0)
--function |-f <value> BLASLt function to test. Options: matmul (Default value is: matmul)
--precision |-r <value> Precision of matrix A,B,C,D Options: f32_r,f16_r,bf16_r (Default value is: f16_r)
--compute_type <value> Precision of computation. Options: s,f32_r (Default value is: f32_r)
--scale_type <value> Precision of scalar. Options: f16_r,bf16_r
--initialization <value> Intialize matrix data.Options: rand_int, trig_float, hpl(floating) (Default value is: hpl)
--transA <value> N = no transpose, T = transpose, C = conjugate transpose (Default value is: N)
--transB <value> N = no transpose, T = transpose, C = conjugate transpose (Default value is: N)
--batch_count <value> Number of matrices. Only applicable to batched and strided_batched routines (Default value is: 1)
--HMM Parameter requesting the use of HipManagedMemory
--verify |-v <value> Validate GPU results with CPU? 0 = No, 1 = Yes (default: No) (Default value is: )
--iters |-i <value> Iterations to run inside timing loop (Default value is: 10)
--cold_iters |-j <value> Cold Iterations to run before entering the timing loop (Default value is: 2)
--algo <value> Reserved. (Default value is: 0)
--solution_index <value> Reserved. (Default value is: 0)
--activation_type <value> Options: None, gelu, relu (Default value is: none)
--activation_arg1 <value> Reserved. (Default value is: 0)
--activation_arg2 <value> Reserved. (Default value is: inf)
--bias_type <value> Precision of bias vector.Options: f16_r,bf16_r,f32_r,default(same with D type)
--bias_vector Apply bias vector
--scaleD_vector Apply scaleD vector
--device <value> Set default device to be used for subsequent program runs (Default value is: 0)
--c_noalias_d C and D are stored in separate memory
--workspace <value> Set fixed workspace memory size instead of using hipblaslt managed memory (Default value is: 0)
--log_function_name Function name precedes other itmes.
--function_filter <value> Simple strstr filter on function name only without wildcards
--help |-h produces this help message
--version <value> Prints the version number
example_hipblaslt_preference#
example_hipblaslt_preference is a basic sample hipBLASLt app. Beginner can get start from its sample source code.
For more information:
./example_hipblaslt_preference --help
Usage: ./example_hipblaslt_preference <options>
options:
-h, --help Show this help message
-v, --verbose Verbose output
-V, --validate Verify results
-m m GEMM_STRIDED argument m
-n n GEMM_STRIDED argument n
-k k GEMM_STRIDED argument k
--lda lda GEMM_STRIDED argument lda
--ldb ldb GEMM_STRIDED argument ldb
--ldc ldc GEMM_STRIDED argument ldc
--ldd ldd GEMM_STRIDED argument ldd
--trans_a trans_a GEMM_STRIDED argument trans_a
--trans_b trans_b GEMM_STRIDED argument trans_b
--datatype datatype GEMM_STRIDED argument in out datatype:fp32
--stride_a stride_a GEMM_STRIDED argument stride_a
--stride_b stride_b GEMM_STRIDED argument stride_b
--stride_c stride_c GEMM_STRIDED argument stride_c
--stride_d stride_d GEMM_STRIDED argument stride_d
--alpha alpha GEMM_STRIDED argument alpha
--beta beta GEMM_STRIDED argument beta
--act act GEMM_STRIDED set activation type: relu or gelu
--bias bias GEMM_STRIDED enable bias: 0 or 1 (default is 0)
--scaleD scaleD GEMM_STRIDED enable scaleD: 0 or 1 (default is 0)
--header header Print header for output (default is enabled)
--timing timing Bechmark GPU kernel performance:0 or 1 (default is 1)
For example, to measure performance of fp32 gemm:
./example_hipblaslt_preference --datatype fp32 --trans_a N --trans_b N -m 4096 -n 4096 -k 4096 --alpha 1 --beta 1
On a mi210 machine the above command outputs a performance of 13509 Gflops below:
transAB, M, N, K, lda, ldb, ldc, stride_a, stride_b, stride_c, batch_count, alpha, beta, bias, activationType, ms, tflops
NN, 4096, 4096, 4096, 4096, 4096, 4096, 16777216, 16777216, 16777216, 1, 1, 1, 0, none, 10.173825, 13.509074
The user can copy and change the above command. For example, to change the datatype to IEEE-16 bit and the size to 2048:
./example_hipblaslt_preference --datatype fp16 --trans_a N --trans_b N -m 2048 -n 2048 -k 2048 --alpha 1 --beta 1
Note that example_hipblaslt_preference also has the flag -V
for correctness checks.
example_hipblaslt_groupedgemm#
example_hipblaslt_groupedgemm is a sample app for hipblaslt grouped gemm.
For more information:
./example_hipblaslt_groupedgemm --help
Usage: ./example_hipblaslt_groupedgemm <options>
options:
-h, --help Show this help message
-v, --verbose Verbose output
-V, --validate Verify results
--bench_count Number of benchmark runs (default is 1)
--sync_count Number of sync runs (default is 1)
--request_solutions Number of solutions to run (default is 1)
--num_streams Run gemms by multi streams (default is 1)
--grouped_gemm Run gemms by grouped gemm kernel (default is 0)
--datatype datatype GEMM_STRIDED argument in out: fp32, fp16, bf16 (default is fp32)
--trans_a trans_a GEMM_STRIDED argument trans_a: N or T (default is N)
--trans_b trans_b GEMM_STRIDED argument trans_b: N or T (default is N)
-m m GEMM_STRIDED argument m
-n n GEMM_STRIDED argument n
-k k GEMM_STRIDED argument k
--batch_count batch_count GEMM_STRIDED argument batch_count
--lda lda GEMM_STRIDED argument lda
--ldb ldb GEMM_STRIDED argument ldb
--ldc ldc GEMM_STRIDED argument ldc
--ldd ldd GEMM_STRIDED argument ldd
--stride_a stride_a GEMM_STRIDED argument stride_a
--stride_b stride_b GEMM_STRIDED argument stride_b
--stride_c stride_c GEMM_STRIDED argument stride_c
--stride_d stride_d GEMM_STRIDED argument stride_d
--alpha alpha GEMM_STRIDED argument alpha (default is 1)
--beta beta GEMM_STRIDED argument beta (default is 0)
--act act GEMM_STRIDED set activation type: relu, gelu, none (default is none)
--bias bias GEMM_STRIDED set bias: 0 or 1 (default is 0)
--scaleD scaleD GEMM_STRIDED enable scaleD: 0 or 1 (default is 0)
--cpu_time cpu_time Bechmark timing using cpu time: 0 or 1 (default is 0)
For example, to solve 3 gemms by groupgemm sample app:
./example_hipblaslt_groupedgemm --datatype fp16 --trans_a T --trans_b N -m 1024 -n 1024 -k 1024 -m 512 -n 512 -k 512 -m 2048 -n 1024 -k 512 --sync_count 10 --grouped_gemm 1 --request_solutions 10