Basics#

This document covers the basic information about rocALUTION APIs and their usage.

Operators and vectors#

The main objects in rocALUTION are linear operators and vectors. All objects can be moved to an accelerator at run-time. The linear operators are defined as local or global metrices (i.e. on a single-node or distributed/multi-node) and local stencils (i.e. matrix-free linear operations). The only template parameter of the operators and vectors is the data type (ValueType). The operator data type could be float, double, complex float, or complex double, while the vector data type can be int, float, double, complex float or complex double (int is used mainly for the permutation vectors). In the current version, cross ValueType object operations are not supported. The following figure gives an overview of supported operators and vectors. For more details, refer to the Design and philosophy.

operator and vector classes

Fig. 1 Operator and vector classes.#

Each object contains a local copy of the hardware descriptor created by the rocalution::init_rocalution() function. This allows you to modify it according to your needs and to obtain two or more objects with different hardware specifications (e.g. different amounts of OpenMP threads, HIP block sizes, etc.).

Local operators and vectors#

The local operators and vectors correspond to the local metrices and stencils, and local vectors. The term “local” implies the fact that they stay on a single system. A system can contain several CPUs via UMA or NUMA memory system, as well as an accelerator.

template<typename ValueType>
class LocalMatrix : public rocalution::Operator<ValueType>

LocalMatrix class.

A LocalMatrix is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

A number of matrix formats are supported. These are CSR, BCSR, MCSR, COO, DIA, ELL, HYB, and DENSE.

Note

For CSR type matrices, the column indices must be sorted in increasing order. For COO matrices, the row indices must be sorted in increasing order. The function Check can be used to check whether a matrix contains valid data. For CSR and COO matrices, the function Sort can be used to sort the row or column indices respectively.

Template Parameters:

ValueType – - can be int, float, double, std::complex<float> and std::complex<double>

template<typename ValueType>
class LocalStencil : public rocalution::Operator<ValueType>

LocalStencil class.

A LocalStencil is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

Template Parameters:

ValueType – - can be int, float, double, std::complex<float> and std::complex<double>

template<typename ValueType>
class LocalVector : public rocalution::Vector<ValueType>

LocalVector class.

A LocalVector is called local, because it will always stay on a single system. The system can contain several CPUs via UMA or NUMA memory system or it can contain an accelerator.

Template Parameters:

ValueType – - can be int, float, double, std::complex<float> and std::complex<double>

Global operators and vectors#

Global operators and vectors correspond to the global matrix and global vectors. The term “global” implies the fact that they stay on a single or multiple nodes in a network. For this type of computation, the communication is based on MPI.

template<typename ValueType>
class GlobalMatrix : public rocalution::Operator<ValueType>

GlobalMatrix class.

A GlobalMatrix is called global, because it can stay on a single or on multiple nodes in a network. For this type of communication, MPI is used.

A number of matrix formats are supported. These are CSR, BCSR, MCSR, COO, DIA, ELL, HYB, and DENSE.

Note

For CSR type matrices, the column indices must be sorted in increasing order. For COO matrices, the row indices must be sorted in increasing order. The function Check can be used to check whether a matrix contains valid data. For CSR and COO matrices, the function Sort can be used to sort the row or column indices respectively.

Template Parameters:

ValueType – - can be int, float, double, std::complex<float> and std::complex<double>

template<typename ValueType>
class GlobalVector : public rocalution::Vector<ValueType>

GlobalVector class.

A GlobalVector is called global, because it can stay on a single or on multiple nodes in a network. For this type of communication, MPI is used.

Template Parameters:

ValueType – - can be int, float, double, std::complex<float> and std::complex<double>

Backend descriptor and user control#

Naturally, not all routines and algorithms can be performed efficiently on many-core systems (i.e. on accelerators). To provide full functionality, the library has internal mechanisms to check if a particular routine is implemented on the accelerator. If not, the object is moved to the host and the routine is computed there. This ensures that the application runs (maybe not in the most efficient way) with any accelerator regardless of the availability of the required functionality for it.

Initialization of rocALUTION#

The body of a rocALUTION code should simply contain the header file and the namespace of the library. The program must contain an initialization call to init_rocalution that checks and allocates the hardware and a finalizing call to stop_rocalution that releases the allocated hardware.

int rocalution::init_rocalution(int rank = -1, int dev_per_node = 1)

Initialize rocALUTION platform.

init_rocalution defines a backend descriptor with information about the hardware and its specifications. All objects created after that contain a copy of this descriptor. If the specifications of the global descriptor are changed (e.g. set different number of threads) and new objects are created, only the new objects will use the new configurations.

For control, the library provides the following functions

Example
#include <rocalution/rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    init_rocalution();

    // ...

    stop_rocalution();

    return 0;
}

Parameters:
  • rank[in] specifies MPI rank when multi-node environment

  • dev_per_node[in] number of accelerator devices per node, when in multi-GPU environment

int rocalution::stop_rocalution(void)

Shutdown rocALUTION platform.

stop_rocalution shuts down the rocALUTION platform.

Thread-core mapping#

The number of threads used by rocALUTION can be modified by the function set_omp_threads_rocalution or by the global OpenMP environment variable (for Unix-like OS this is OMP_NUM_THREADS). During the initialization phase, the library provides affinity thread-core mapping:

  • If the number of cores (including SMT cores) is greater than or equal to twice the number of threads, then all the threads can occupy every second core ID (e.g. 0,2,4,…). This is to avoid having two threads working on the same physical core, when SMT is enabled.

  • If the number of threads is less than or equal to the number of cores (including SMT), and the previous clause is false, then the threads can occupy every core ID (e.g. 0,1,2,3,…).

  • If none of the above criteria is matched, then the default thread-core mapping is used (typically set by the operating system).

Note

The thread-core mapping is available for Unix-like operating systems only.

Note

The user can disable the thread affinity with set_omp_affinity_rocalution, before initializing the library.

OpenMP threshold size#

When working on a small problem, OpenMP host backend might be slightly slower than using no OpenMP. This is mainly attributed to the small amount of work, which every thread should perform and the large overhead of forking/joining threads. This can be avoided by the OpenMP threshold size parameter in rocALUTION. The default threshold is set to 10.000, which means that all metrices under (and equal to) this size use only one thread (irrespective of the number of OpenMP threads set in the system). To modify the threshold, use set_omp_threshold_rocalution.

Accelerator selection#

To select the accelerator device id to be used for the computation, use set_device_rocalution.

Disable the accelerator#

To disable the accelerator without having to re-compile the library, use disable_accelerator_rocalution.

Backend information#

To print the detailed information about the current backend / accelerator in use as well as the available accelerators, use info_rocalution.

MPI and multi-accelerators#

When initializing the library with MPI, you need to pass the rank of the MPI process as well as the number of accelerators available on each node. Basically, this way you can specify the mapping of MPI process and accelerators - the allocated accelerator is rank % num_dev_per_node. Thus, you can run two MPI processes on systems with two accelerators by specifying the number of devices to 2, as illustrated in the example code below.

#include <rocalution.hpp>
#include <mpi.h>

using namespace rocalution;

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    int num_processes;
    int rank;

    MPI_Comm_size(comm, &num_processes);
    MPI_Comm_rank(comm, &rank);

    int nacc_per_node = 2;

    init_rocalution(rank, nacc_per_node);

    // ... do some work

    stop_rocalution();

    return 0;
}

Automatic object tracking#

rocALUTION supports automatic object tracking. After the initialization of the library, all objects created by the user application can be tracked. Once stop_rocalution is called, all memory from tracked objects gets deallocated. This avoids memory leaks when the objects are allocated but not freed. The user can enable or disable the tracking by editing src/utils/def.hpp. By default, automatic object tracking is disabled.

Verbose output#

rocALUTION provides different levels of output messages. The VERBOSE_LEVEL can be modified in src/utils/def.hpp before the compilation of the library. By setting a higher level, you can obtain more detailed information about the internal calls and data transfers to and from the accelerators. By default, the VERBOSE_LEVEL is set to 2.

Verbose output and MPI#

To prevent all MPI processes from printing information to stdout, the default configuration allows only RANK 0 to output information. You can change the RANK or allow all processes to print by setting LOG_MPI_RANK to 1 in src/utils/def.hpp. If file logging is enabled, all ranks write into the corresponding log files.

Debug output#

Debug output prints almost every detail in the program, including object constructor/destructor, address of the object, memory allocation, data transfers, all function calls for metrices, vectors, solvers, and preconditioners. The flag DEBUG_MODE can be set in src/utils/def.hpp. When enabled, additional assert()s are checked during the computation. This might significantly reduce the performance of some operations.

File logging#

To enable rocALUTION trace file logging, set the environment variable ROCALUTION_LAYER to 1. rocALUTION then logs each rocALUTION function call including object constructor/destructor, address of the object, memory allocation, data transfers, all function calls for matrices, vectors, solvers, and preconditioners. The log file is placed in the working directory. The log file naming convention is rocalution-rank-<rank>-<time_since_epoch_in_msec>.log. By default, the environment variable ROCALUTION_LAYER is unset and logging is disabled.

Note

Performance might degrade when logging is enabled.

Versions#

For checking the rocALUTION version in an application, use pre-defined macros:

#define __ROCALUTION_VER_MAJOR  // version major
#define __ROCALUTION_VER_MINOR  // version minor
#define __ROCALUTION_VER_PATCH  // version patch
#define __ROCALUTION_VER_TWEAK  // commit id (sha-1)

#define __ROCALUTION_VER_PRE    // version pre-release (alpha or beta)

#define __ROCALUTION_VER        // version

The final __ROCALUTION_VER holds the version number as 10000 * major + 100 * minor + patch, as defined in src/base/version.hpp.in.