Installation#

The quickest way to install is using prebuilt packages that are released with ROCm. Alternatively, there are instructions to build from source.

Available ROCm packages are:

hiptensor (library + header files for development).
hiptensor-dev (library development package).
hiptensor-samples (sample executables).
hiptensor-tests (test executables).
hiptensor-clients (samples and test executables).

Prerequisites#

A ROCm 6.0 enabled platform. More information at ROCm Github.

Installing pre-built packages#

To install hipTensor on Ubuntu or Debian, use:

sudo apt-get update
sudo apt-get install hiptensor hiptensor-dev hiptensor-samples hiptensor-tests

To install hipTensor on CentOS, use:

sudo yum update
sudo yum install hiptensor hiptensor-dev hiptensor-samples hiptensor-tests

To install hipTensor on SLES, use:

sudo dnf upgrade
sudo dnf install hiptensor hiptensor-dev hiptensor-samples hiptensor-tests

Once installed, hipTensor can be used just like any other library with a C++ API.

Building and installing hipTensor#

For most users building from source is not necessary, as hipTensor can be used after installing the pre-built packages as described above. If still desired, here are the instructions to build hipTensor from source:

System requirements#

As a general rule, 8GB of system memory is required for a full hipTensor build. This value can be lower if hipTensor is built without tests. This value may also increase in the future as more functions are added.

GPU support#

AMD CDNA class GPU featuring matrix core support: gfx908, gfx90a, gfx940, gfx941, gfx942 labeled as gfx9.

Note

Double precision FP64 datatype support requires gfx90a, gfx940, gfx941 or gfx942.

Dependencies#

hipTensor is designed to have minimal external dependencies such that it is light-weight and portable.

Minimum ROCm version support is 6.0.
Minimum cmake version support is 3.14.
Minimum ROCm-cmake version support is 0.8.0.
Minimum Composable Kernel version support is composable_kernel 1.1.0 for ROCm 6.0.2 (or ROCm package composablekernel-dev).
Minimum HIP runtime version support is 4.3.0 (or ROCm package ROCm hip-runtime-amd).
Minimum LLVM dev package version support is 7.0 (available as ROCm package rocm-llvm-dev).

Note

It is best to use available ROCm packages from the same release where applicable.

Download hipTensor#

The hipTensor source code is available on hipTensor Github. hipTensor has a minimum ROCm support version 6.0. To check the ROCm Version on your system, use:

apt show rocm-libs -a

For Centos use

yum info rocm-libs

The ROCm version has major, minor, and patch fields, possibly followed by a build specific identifier. For example, a ROCm version 4.0.0.40000-23 corresponds to major = 4, minor = 0, patch = 0, and build identifier 40000-23. There are GitHub branches at the hipTensor site with names rocm-major.minor.x where major and minor are the same as in the ROCm version. To download hipTensor on ROCm version 4.0.0.40000-23, use:

git clone -b release/rocm-rel-x.y https://github.com/ROCmSoftwarePlatform/hipTensor.git
cd hipTensor

Replace x.y in the above command with the version of ROCm installed on your machine. For example, if you have ROCm 5.0 installed, then replace release/rocm-rel-x.y with release/rocm-rel-5.0.

Build documentation#

To build documentation locally, run:

cd docs

sudo apt-get update
sudo apt-get install doxygen
sudo apt-get install texlive-latex-base texlive-latex-extra

pip3 install -r sphinx/requirements.txt

python3 -m sphinx -T -E -b latex -d _build/doctrees -D language=en . _build/latex

cd _build/latex

pdflatex hiptensor.tex

Running the above commands generates hiptensor.pdf. Alternatively, the latest docs build can be found at hipTensor docs.

Build configuration#

You can choose to build any of the following:

library only
library and samples
library and tests
library, samples and tests

You only need the hipTensor library for calling and linking to hipTensor API from your code. The clients contain the tests and sample codes.

Below are the project options available to build hipTensor library with or without clients.

Option	Description	Default Value
AMDGPU_TARGETS	Build code for specific GPU target(s)	`gfx908:xnack-`; `gfx90a:xnack-`; `gfx90a:xnack+`; `gfx940`; `gfx941`; `gfx942`
HIPTENSOR_BUILD_TESTS	Build Tests	ON
HIPTENSOR_BUILD_SAMPLES	Build Samples	ON
HIPTENSOR_BUILD_COMPRESSED_DBG	Enable compressed debug symbols	ON
HIPTENSOR_DATA_LAYOUT_COL_MAJOR	Set hiptensor default data layout to column major	ON

Here are some example project configurations:

Configuration	Command
Basic	`CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B<build_dir> .`
Targeting gfx908	`CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B<build_dir> . -DAMDGPU_TARGETS=gfx908:xnack-`
Debug build	`CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B<build_dir> . -DCMAKE_BUILD_TYPE=Debug`

Build library#

By default, the project is configured in Release mode.

To build the library alone, run:

CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B <build_dir> . -DHIPTENSOR_BUILD_TESTS=OFF -DHIPTENSOR_BUILD_SAMPLES=OFF

After configuration, build using:

cmake --build <build_dir> -- -j<nproc>

Note

We recommend using a minimum of 16 threads to build hipTensor with any tests (-j16).

Build library and samples#

To build library and samples, run:

CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B <build_dir> . -DHIPTENSOR_BUILD_TESTS=OFF -DHIPTENSOR_BUILD_SAMPLES=ON

After configuration, build using:

cmake --build <build_dir> -- -j<nproc>

The samples folder in <build_dir> contains executables in the table below.

Executable Name	Description
`simple_bilinear_contraction_bf16_bf16_bf16_bf16_compute_bf16`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using half-precision brain float inputs, output and compute types
`simple_bilinear_contraction_f16_f16_f16_f16_compute_f16`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using half-precision floating point inputs, output and compute types
`simple_bilinear_contraction_f32_f32_f32_f32_compute_bf16`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using single-precision floating point input and output, half-precision brain float compute types
`simple_bilinear_contraction_f32_f32_f32_f32_compute_f16`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using single-precision floating point input and output, half-precision floating point compute types
`simple_bilinear_contraction_f32_f32_f32_f32_compute_f32`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using single-precision floating point input, output and compute types
`simple_bilinear_contraction_cf32_cf32_cf32_cf32_compute_cf32`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using complex single-precision floating point input, output and compute types
`simple_bilinear_contraction_f64_f64_f64_f64_compute_f32`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using double-precision floating point input, output and single precision floating point compute types
`simple_bilinear_contraction_f64_f64_f64_f64_compute_f64`	A simple bilinear contraction [D = alpha * (A x B) + beta * C] using double-precision floating point input, output and compute types
`simple_scale_contraction_bf16_bf16_bf16_compute_bf16`	A simple scale contraction [D = alpha * (A x B) ] using half-precision brain float inputs, output and compute types
`simple_scale_contraction_f16_f16_f16_compute_f16`	A simple scale contraction [D = alpha * (A x B) ] using half-precision floating point inputs, output and compute types
`simple_scale_contraction_f32_f32_f32_compute_bf16`	A simple scale contraction [D = alpha * (A x B) ] using single-precision floating point input and output, half-precision brain float compute types
`simple_scale_contraction_f32_f32_f32_compute_f16`	A simple scale contraction [D = alpha * (A x B) ] using single-precision floating point input and output, half-precision floating point compute types
`simple_scale_contraction_f32_f32_f32_compute_f32`	A simple scale contraction [D = alpha * (A x B) ] using single-precision floating point input, output and compute types
`simple_scale_contraction_cf32_cf32_cf32_compute_cf32`	A simple scale contraction [D = alpha * (A x B) ] using complex single-precision floating point input, output and compute types
`simple_scale_contraction_f64_f64_f64_compute_f32`	A simple scale contraction [D = alpha * (A x B) ] using double-precision floating point input, output and single precision floating point compute types
`simple_scale_contraction_f64_f64_f64_compute_f64`	A simple scale contraction [D = alpha * (A x B) ] using double-precision floating point input, output and compute types
`simple_permutation`	A simple permutation using single-precision floating point input and output types

Build library and tests#

To build library and tests, run:

CC=/opt/rocm/bin/amdclang CXX=/opt/rocm/bin/amdclang++ cmake -B <build_dir> . -DHIPTENSOR_BUILD_TESTS=ON -DHIPTENSOR_BUILD_SAMPLES=OFF

After configuration, build using:

cmake --build <build_dir> -- -j<nproc>

The tests in <build_dir> contain executables as given in the table below.

Executable name	Description
`logger_test`	Unit test to validate hipTensor Logger APIs
`yaml_test`	Unit test to validate the YAML functionality used to bundle and run test suites
`bilinear_contraction_test_m1n1k1`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with half, single and mixed precision datatypes of rank 2
`bilinear_contraction_test_m2n2k2`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with half, single and mixed precision datatypes of rank 4
`bilinear_contraction_test_m3n3k3`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with half, single and mixed precision datatypes of rank 6
`bilinear_contraction_test_m4n4k4`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with half, single and mixed precision datatypes of rank 8
`bilinear_contraction_test_m5n5k5`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with half, single and mixed precision datatypes of rank 10
`bilinear_contraction_test_m6n6k6`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with half, single and mixed precision datatypes of rank 12
`complex_bilinear_contraction_test_m1n2k1`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with complex single and double precision datatypes of rank 2
`complex_bilinear_contraction_test_m2n2k2`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with complex single and double precision datatypes of rank 4
`complex_bilinear_contraction_test_m3n3k3`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with complex single and double precision datatypes of rank 6
`complex_bilinear_contraction_test_m4n4k4`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with complex single and double precision datatypes of rank 8
`complex_bilinear_contraction_test_m5n5k5`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with complex single and double precision datatypes of rank 10
`complex_bilinear_contraction_test_m6n6k6`	Bilinear contraction test [D = alpha * (A x B) + beta * C] with complex single and double precision datatypes of rank 12
`scale_contraction_test_m1n1k1`	Scale contraction test [D = alpha * (A x B)] with half, single and mixed precision datatypes of rank 2
`scale_contraction_test_m2n2k2`	Scale contraction test [D = alpha * (A x B)] with half, single and mixed precision datatypes of rank 4
`scale_contraction_test_m3n3k3`	Scale contraction test [D = alpha * (A x B)] with half, single and mixed precision datatypes of rank 6
`scale_contraction_test_m4n4k4`	Scale contraction test [D = alpha * (A x B)] with half, single and mixed precision datatypes of rank 8
`scale_contraction_test_m5n5k5`	Scale contraction test [D = alpha * (A x B)] with half, single and mixed precision datatypes of rank 10
`scale_contraction_test_m6n6k6`	Scale contraction test [D = alpha * (A x B)] with half, single and mixed precision datatypes of rank 12
`complex_scale_contraction_test_m1n1k1`	Scale contraction test [D = alpha * (A x B)] with complex single and double precision datatypes of rank 2
`complex_scale_contraction_test_m2n2k2`	Scale contraction test [D = alpha * (A x B)] with complex single and double precision datatypes of rank 4
`complex_scale_contraction_test_m3n3k3`	Scale contraction test [D = alpha * (A x B)] with complex single and double precision datatypes of rank 6
`complex_scale_contraction_test_m4n4k4`	Scale contraction test [D = alpha * (A x B)] with complex single and double precision datatypes of rank 8
`complex_scale_contraction_test_m5n5k5`	Scale contraction test [D = alpha * (A x B)] with complex single and double precision datatypes of rank 10
`complex_scale_contraction_test_m6n6k6`	Scale contraction test [D = alpha * (A x B)] with complex single and double precision datatypes of rank 12
`rank2_permutation_test`	Permutation test with half and single precision datatypes of rank 2
`rank3_permutation_test`	Permutation test with half and single precision datatypes of rank 3
`rank4_permutation_test`	Permutation test with half and single precision datatypes of rank 4
`rank5_permutation_test`	Permutation test with half and single precision datatypes of rank 5
`rank6_permutation_test`	Permutation test with half and single precision datatypes of rank 6

Make targets list#

When building hipTensor during the make step, we can specify make targets instead of defaulting make all. The following table highlights relationships between high level grouped targets and individual targets.

Group Target	Individual Targets
hiptensor_samples	simple_bilinear_contraction_bf16_bf16_bf16_bf16_compute_bf16
	simple_bilinear_contraction_f16_f16_f16_f16_compute_f16
	simple_bilinear_contraction_f32_f32_f32_f32_compute_bf16
	simple_bilinear_contraction_f32_f32_f32_f32_compute_f16
	simple_bilinear_contraction_f32_f32_f32_f32_compute_f32
	simple_bilinear_contraction_cf32_cf32_cf32_cf32_compute_cf32
	simple_bilinear_contraction_f64_f64_f64_f64_compute_f32
	simple_bilinear_contraction_f64_f64_f64_f64_compute_f64
	simple_scale_contraction_bf16_bf16_bf16_compute_bf16
	simple_scale_contraction_f16_f16_f16_compute_f16
	simple_scale_contraction_f32_f32_f32_compute_bf16
	simple_scale_contraction_f32_f32_f32_compute_f16
	simple_scale_contraction_f32_f32_f32_compute_f32
	simple_scale_contraction_cf32_cf32_cf32_compute_cf32
	simple_scale_contraction_f64_f64_f64_compute_f32
	simple_scale_contraction_f64_f64_f64_compute_f64
	simple_permutation
hiptensor_tests	logger_test
	yaml_test
	bilinear_contraction_test_m1n1k1
	bilinear_contraction_test_m2n2k2
	bilinear_contraction_test_m3n3k3
	bilinear_contraction_test_m4n4k4
	bilinear_contraction_test_m5n5k5
	bilinear_contraction_test_m6n6k6
	complex_bilinear_contraction_test_m1n1k1
	complex_bilinear_contraction_test_m2n2k2
	complex_bilinear_contraction_test_m3n3k3
	complex_bilinear_contraction_test_m4n4k4
	complex_bilinear_contraction_test_m5n5k5
	complex_bilinear_contraction_test_m6n6k6
	scale_contraction_test_m1n1k1
	scale_contraction_test_m2n2k2
	scale_contraction_test_m3n3k3
	scale_contraction_test_m4n4k4
	scale_contraction_test_m5n5k5
	scale_contraction_test_m6n6k6
	complex_scale_contraction_test_m1n1k1
	complex_scale_contraction_test_m2n2k2
	complex_scale_contraction_test_m3n3k3
	complex_scale_contraction_test_m4n4k4
	complex_scale_contraction_test_m5n5k5
	complex_scale_contraction_test_m6n6k6
	rank2_permutation_test
	rank3_permutation_test
	rank4_permutation_test
	rank5_permutation_test
	rank6_permutation_test

Build performance#

Depending on the resources available to the build machine and the build configuration selected, hipTensor build times can be on the order of an hour or more. Here are some things you can do to reduce build times:

Target a specific GPU (e.g., -D AMDGPU_TARGETS=gfx908:xnack-)
Use lots of threads (e.g., -j32)
If they aren’t needed, specify either HIPTENSOR_BUILD_TESTS or HIPTENSOR_BUILD_SAMPLES as OFF to disable client builds.
During the make command, build a specific target, e.g: logger_test.

Test run lengths#

Depending on the resources available to the machine running the selected tests, hipTensor test runtimes can be on the order of an hour or more. Here are some things you can do to reduce run-times:

CTest will invoke the entire test suite. You may invoke tests individually by name.
Use GoogleTest filters, targeting specific test cases:

<test_exe> --gtest_filter=*name_filter*

Manually adjust the test cases coverage. Using your favorite text editor, you can modify test YAML configs to affect the test parameter coverage.
Alternatively, use your own testing YAML config with a reduced parameter set.
For tests with large tensor ranks, avoid using larger lengths to reduce computational load.

Test verbosity and file redirection#

Tests support logging arguments that can be used to control verbosity and output redirection.

<test_exe> -y "testing_params.yaml" -o "output.csv" --omit 1

Compact	Verbose	Description
-y <input_file>.yaml		override read testing parameters from input file
-o <output_file>.csv		redirect gtest output to file
	–omit <code>	code = 1: Omit gtest SKIPPED tests
		code = 2: Omit gtest FAILED tests
		code = 4: Omit gtest PASSED tests
		code = 8: Omit all gtest output
		code = <N>: OR combination of 1, 2, 4

Installation

Contents

Installation#

Prerequisites#

Installing pre-built packages#

Building and installing hipTensor#

System requirements#

GPU support#

Dependencies#

Download hipTensor#

Build documentation#

Build configuration#

Build library#

Build library and samples#

Build library and tests#

Make targets list#

Build performance#

Test run lengths#

Test verbosity and file redirection#