Training and serving a FinalNet CTR model with Triton Inference Server#

Authors: Lin Sun, Dipto Deb
Knowledge level: Beginner

This tutorial demonstrates how to serve a trained model for high-performance inference on AMD Instinct™ GPUs using the ROCm port of Triton Inference Server, a production-grade model server. It shows how to deploy a model with AMD’s ROCm™ software stack and the MIGraphX execution provider and then measure how fast the model runs.

To make the example complete and reproducible, this tutorial uses a real click-through rate (CTR) prediction model called FinalNet and walks you through the process end-to-end: the model is trained first, then deployed and benchmarked. Model training is not the focus here; it is included only to make the tutorial self-contained. If you already have a trained model, you can skip straight to deployment; see Already have a trained model?.

This tutorial assumes no prior experience with Triton Inference Server, ONNX, or recommendation models. If some of the terms are new to you, don’t worry: the Key concepts section explains them in plain language.

Companion blog post: This tutorial accompanies the AMD ROCm blog Serving CTR Recommendation Models with Triton Inference Server using the ONNX Runtime Backend, which introduces ONNX Runtime and Python backend support in the ROCm build of Triton Inference Server. Refer to it for additional background and a broader set of supported workloads.

What you will learn#

Core skills: Deploying and benchmarking a model for inference on AMD GPUs.

How to obtain a ROCm-enabled Triton Inference Server container image.
How to deploy an ONNX model on Triton Inference Server using the ONNX Runtime backend accelerated by the MIGraphX execution provider.
How to measure the server’s throughput and latency on your own hardware with perf_analyzer, and how to interpret the results.

Optional skills: Produce and deploy a model.

How to train the FinalNet CTR model on the public Criteo_x4 dataset inside a PyTorch ROCm container.
How to export a trained PyTorch model to the portable ONNX format.

Workflow at a glance#

Step 1: Get the server image. Pull the prebuilt image of the ROCm fork of Triton Inference Server from Docker Hub.
Step 2: Train (optional). Train FinalNet on the Criteo_x4 dataset.
Step 3: Export (optional). Convert the trained checkpoint to the ONNX format.
Step 4: Deploy. Serve the ONNX model on Triton Inference Server with the MIGraphX execution provider.
Step 5 and Step 6: Measure. Run a client-side benchmark against the running server.

Already have a trained model?#

This is an end-to-end demo, with the deployment and benchmark steps being core and the training and export steps being optional. After Step 1, if you already have:

A trained FinalNet checkpoint → skip Step 2 and start at Step 3 to export it to ONNX.
An ONNX model (FinalNet or your own) → skip Step 2 and Step 3 entirely. Run the cells in Environment setup to set WORKSPACE and create the directory layout, copy your model to ${WORKSPACE}/model_repository/FinalNet_onnx/1/model.onnx, then jump straight to Step 4 to deploy (if your model is not FinalNet, also edit the input/output names and shapes in config.pbtxt to match your model).

All commands are designed to be run on the host machine. Steps that need the training environment are routed into a container by docker exec.

Key concepts#

This section introduces the main technical concepts involved in this tutorial.

Click-through rate (CTR) prediction#

CTR prediction estimates the probability that a user will click on a particular item—an ad, a product, or a recommendation. It is one of the most economically important machine-learning problems, powering recommender systems and online advertising. The input is typically a mix of numerical features (for example, counts) and categorical features (for example, a device type or a user ID), and the output is a single probability between 0 and 1.

FinalNet#

FinalNet is a neural-network architecture for CTR prediction introduced at SIGIR 2023. It currently sits at the top of the BARS Criteo_x4 leaderboard, a community benchmark that ranks CTR models on a standardized dataset. It is used here as a realistic, production-relevant architecture to train and serve CTR models.

The Criteo_x4 dataset#

Criteo_x4 is a large advertising dataset widely used to benchmark CTR models. Each row describes one ad impression with 13 numerical and 26 categorical features, plus a label indicating whether the ad was clicked. It is publicly available and requires no access token to download.

Triton Inference Server#

Triton Inference Server is an open-source model-serving system. You hand it a trained model and it exposes HTTP/gRPC endpoints that clients call to run inference, while it handles batching, concurrency, and multiple model formats for you. This tutorial uses the ROCm fork of Triton Inference Server, which adds support for AMD GPUs.

ONNX and the ONNX Runtime backend#

ONNX (Open Neural Network Exchange) is a portable, framework-independent file format for trained models. Exporting our PyTorch model to ONNX allows Triton Inference Server to load and run it through its ONNX Runtime backend—a generic engine that executes ONNX graphs—without requiring the original training code.

ROCm and the MIGraphX execution provider#

ROCm is AMD’s open software platform for GPU computing. MIGraphX is AMD’s graph-optimization and inference engine. When ONNX Runtime runs with the MIGraphX execution provider, it compiles the ONNX graph into optimized kernels for AMD GPUs, which enables fast inference on AMD Instinct hardware.

AMD Instinct GPUs#

AMD Instinct GPUs (such as the MI300X, MI325X, and MI355X) are AMD’s data-center accelerators designed for AI training and inference.

`perf_analyzer`#

perf_analyzer is the standard benchmarking client that ships with Triton Inference Server. It sends a controlled stream of synthetic requests to a running server and reports throughput and latency, which lets you measure serving performance on your own hardware.

Prerequisites#

Hardware#

An AMD Instinct GPU: This tutorial was tested on an AMD Instinct MI300X and MI325X. Ensure you are using an AMD Instinct GPU or compatible hardware with ROCm support, and that your system meets the official requirements.

Software#

The table below lists the exact versions this tutorial was tested against. Other versions may work, but they have not been validated.

Component	Version (tested)	Reference
ROCm	7.2.0	Quick-start install guide
Docker	24.0.7	Docker install guide
`rocm/primus:v26.2`	v26.2	Training image (pulled in Step 2)
`rocm/tritoninferenceserver:tritoninferenceserver-25.12.amd1_rocm7.2_ubuntu24.04_py3.12`	25.12 (rocm7.2)	Triton Inference Server image (pulled in Step 1)
`nvcr.io/nvidia/tritonserver:25.12-py3-sdk`	25.12-py3-sdk	Benchmark client used in Step 5
Python	3.12.11	For the results-plotting cells

You do not need to clone or build any of the components manually—images of the components are pulled automatically by the cell commands below, and the Criteo_x4 dataset is publicly available (no login or API token required).

Tip: Confirm Docker is installed and that it can be run without sudo by running docker run hello-world in a terminal before you start.

Environment setup#

Before running the rest of the tutorial, make sure your machine is ready and that you can run this notebook on it.

1. Confirm that your GPU is visible#

ROCm ships rocm-smi, a command-line tool that lists the AMD GPUs the system can see. Run the cell below to confirm that your AMD Instinct GPU is visible. If the command is not found or no GPU appears, your ROCm installation or drivers need to be fixed before you continue.

%%bash
# List the AMD GPUs visible to the host. You should see your Instinct GPU(s) here.
rocm-smi --showproductname
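
If you prefer to check from a Python cell, the sketch below looks for the two host tools this tutorial relies on. It is a minimal illustration, not part of the original notebook: the helper name `find_host_tools` is hypothetical, and finding a tool on `PATH` only shows that it is installed, not that a GPU is visible or that Docker can run without sudo.

```python
import shutil


def find_host_tools(names=("rocm-smi", "docker")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_host_tools()
for name, path in tools.items():
    # A missing tool is reported rather than raised, so the cell never fails.
    status = path if path else "NOT FOUND on PATH"
    print(f"{name:10s} {status}")
```

If either tool is reported as not found, fix the installation before continuing with the steps below.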

2. Run this notebook in JupyterLab#

This notebook is available at the AI Developer Hub GitHub repository.

If you are already reading this inside JupyterLab, skip to the next step. Otherwise, here is how to launch this notebook on your host machine.

Install JupyterLab into a Python environment on the host:

python3 -m pip install jupyterlab

From the directory that contains this notebook, start the server bound to localhost:

python3 -m jupyter lab --no-browser --ip=127.0.0.1 --port=8888 --notebook-dir=.

On startup it prints a URL containing a one-time token, in the format of http://127.0.0.1:8888/lab?token=<long-token>.

If the GPU is on a remote machine you accessed via SSH, open a second terminal on your local machine and forward the port so you can reach the server in your local browser:

ssh -L 8888:localhost:8888 <user>@<gpu-host>

Paste the printed http://127.0.0.1:8888/lab?token=... URL into your browser, then open this notebook inside JupyterLab. To stop the server when you are done, press Ctrl+C twice in the terminal where it is running.

3. Create a workspace directory#

All cloned repositories, model artifacts, and the MIGraphX compilation cache live under a single host directory, referred to as WORKSPACE. The cell below sets WORKSPACE as an environment variable so that every later cell can reuse it, and creates the directory layout that Triton Inference Server expects. It creates these directories under WORKSPACE:

model_repository/FinalNet_onnx/1/: Where the exported ONNX model will be placed. Triton Inference Server requires a numbered version sub-directory (1) inside each model folder.
migraphx_cache/: A persistent cache for compiled GPU kernels, so the server does not need to recompile the model on every restart.

Feel free to change the path for WORKSPACE to any location that has a few GBs of free space.

import os
from pathlib import Path

# Change this path if you prefer a different location.
WORKSPACE = Path.home() / "triton_finalnet_workspace"

# Export as an environment variable so the %%bash cells below can use ${WORKSPACE}.
os.environ["WORKSPACE"] = str(WORKSPACE)

(WORKSPACE / "model_repository" / "FinalNet_onnx" / "1").mkdir(parents=True, exist_ok=True)
(WORKSPACE / "migraphx_cache").mkdir(parents=True, exist_ok=True)

print(f"WORKSPACE = {WORKSPACE}")

Step 1: Get the Triton Inference Server Docker image#

The ROCm fork of Triton Inference Server adds AMD GPU support through the MIGraphX execution provider and the HIP runtime. This tutorial uses a prebuilt Docker image published by AMD, so no manual build is required.

The image used in this tutorial corresponds to the rocm7.2_r25.12 release. Run the cell below to pull it. The download may take several minutes on first run, depending on your network connection.

%%bash
docker pull rocm/tritoninferenceserver:tritoninferenceserver-25.12.amd1_rocm7.2_ubuntu24.04_py3.12
echo "Triton Inference Server image pull complete."

Confirm the image is now present locally by running this command, which should list the image:

%%bash
docker images | grep tritoninferenceserver

Step 2: Train FinalNet on AMD GPUs#

This training step is optional. If you already have a trained FinalNet checkpoint, skip to Step 3 to export it; if you already have an ONNX model, skip to Step 4 to deploy it. Otherwise, continue here to train a model. See Already have a trained model? above for details.

The FinalNet model is trained on the Criteo_x4 dataset. The training uses the linsun12/FuxiCTR fork (a CTR training framework with fixes for recent PyTorch versions) and the model architecture and hyperparameters come from the BARS benchmark repository.

All training commands run inside a rocm/primus:v26.2 container (a ready-to-use PyTorch-on-ROCm image). The container runs in detached mode with the WORKSPACE directory mounted into it at /workspace, and the subsequent training and export steps are driven through docker exec.

Start the training container#

First, pull the training image and launch a long-lived container named train_CTR:

%%bash
docker pull rocm/primus:v26.2

docker run \
    --name train_CTR \
    --device=/dev/kfd \
    --device=/dev/dri \
    -d --net=host \
    --ipc=host \
    -v ${WORKSPACE}:/workspace \
    rocm/primus:v26.2 \
    sleep infinity

echo "Container train_CTR started."

Clone the training repositories#

Inside the container, clone the BARS benchmark configs, the FuxiCTR training framework, and the Datasets reference repo, and then install FuxiCTR and its dependencies:

%%bash
# Clone BARS benchmark configs, the FuxiCTR training fork, and the Datasets reference repo
docker exec train_CTR bash -c "
cd /workspace
git clone https://github.com/reczoo/BARS.git
git clone https://github.com/linsun12/FuxiCTR.git
cd FuxiCTR && pip install -r requirements.txt -q && pip install -e . -q
cd /workspace
git clone https://github.com/reczoo/Datasets.git
echo 'Repository setup complete.'
"

Download the Criteo_x4 dataset#

Download and unzip the public Criteo_x4 dataset (about 4 GB compressed) into the location FuxiCTR expects. No login or token is required:

%%bash
# Download and extract the Criteo_x4 dataset (~4 GB compressed)
docker exec train_CTR bash -c "
mkdir -p /workspace/FuxiCTR/model_zoo/data/Criteo/Criteo_x4
cd /workspace/FuxiCTR/model_zoo/data/Criteo/Criteo_x4
wget -q -O Criteo_x4.zip \
    'https://huggingface.co/datasets/reczoo/Criteo_x4/resolve/main/Criteo_x4.zip?download=true'
apt-get update -q
apt-get install -y -q unzip
unzip -q Criteo_x4.zip
echo 'Dataset extracted:'
ls -lh *.csv
"

What the dataset looks like#

Criteo_x4 is a tabular dataset: every row is one ad impression, and the columns are the input features and the label to be predicted. Each row has:

label: 1 if the ad was clicked, 0 otherwise (this is the target the model predicts).
13 numerical features, conventionally named I1–I13: Integer counts (some may be empty when a value is missing).
26 categorical features, conventionally named C1–C26: Anonymized 32-bit hashed identifiers (8-character hex strings) to avoid exposing raw user data.

A single row looks like the following, shown here as the label followed by a few of the numerical and categorical columns. An actual row contains all 13 + 26 feature columns, separated by commas or tabs:

label, I1, I2, I3, ..., I13,  C1,        C2,        ..., C26
1,     5,  110, 0,  ..., 2,    68fd1e64,  80e26c9b,  ..., 3a171bb3
0,     ,   1,   14, ..., 1,    05db9164,  8947f767,  ..., 49d68486
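
To make the row structure concrete, here is a minimal parsing sketch. It is an illustration only and not how FuxiCTR preprocesses the data: the function name `parse_row` is hypothetical, the example row is synthetic (it is not copied from the dataset), and the sketch assumes comma-separated fields with empty fields marking missing values.

```python
NUM_NUMERICAL = 13
NUM_CATEGORICAL = 26


def parse_row(line, sep=","):
    """Split one raw row into (label, numerical, categorical).

    Missing numerical fields become None. Categorical fields are kept as
    strings; an empty string marks a missing value.
    """
    fields = [f.strip() for f in line.split(sep)]
    expected = 1 + NUM_NUMERICAL + NUM_CATEGORICAL
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    label = int(fields[0])
    numerical = [int(f) if f else None for f in fields[1:1 + NUM_NUMERICAL]]
    categorical = fields[1 + NUM_NUMERICAL:]
    return label, numerical, categorical


# Synthetic row for illustration: label 1, I1 = 5, I2 missing, the rest 0,
# and the same 8-character hex identifier repeated for C1..C26.
example = ",".join(["1", "5", ""] + ["0"] * 11 + ["68fd1e64"] * 26)
label, numerical, categorical = parse_row(example)

print(label, numerical[:3], categorical[0])
# An 8-character hex string fits in an unsigned 32-bit integer.
print(int(categorical[0], 16) < 2**32)  # True
```

In practice the training framework maps each categorical string to an index in a learned vocabulary, so the raw hex value itself is never fed to the model.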

Using your own data#

FuxiCTR is configuration-driven, so you can train on your own CTR data without changing the model code by:

Exporting your data (train/validation/test) to CSV files, each with one column per feature plus a binary label column.
Describing those columns in a dataset configuration (a dataset_config.yaml entry), declaring each column’s name and type (numeric for continuous values and categorical for IDs/strings), along with the paths to your CSV files.
In Launch training below, pointing run_expid.py to the configuration for your own data instead of the configuration for Criteo_x4.

See the FuxiCTR documentation and the sample dataset_config files under model_zoo/ for the exact configuration schema. The deployment, ONNX export, and benchmarking steps that follow remain unchanged; only the input feature shapes in config.pbtxt (Step 4) need to match your feature counts.

Launch training#

Start training FinalNet using the BARS benchmark configuration. The full logs are streamed to run.log inside the container, so you can inspect the progress at any time:

%%bash
# Train FinalNet on Criteo_x4 using the BARS benchmark configuration.
# Full logs are written to run.log inside the container.
docker exec train_CTR bash -c "
cd /workspace/FuxiCTR/model_zoo/FinalNet
python run_expid.py \
    --config /workspace/BARS/ranking/ctr/FinalNet/FinalNet_criteo_x4_001/FinalNet_criteo_x4_tuner_config_05 \
    --expid FinalNet_criteo_x4_001_041_449ccb21 \
    --gpu 0 2>&1 | tee run.log
"

When training finishes, FuxiCTR writes the trained model checkpoint into a dataset-specific subdirectory of checkpoints/ inside the container. For this configuration, the checkpoint path is model_zoo/FinalNet/checkpoints/criteo_x4_001_a5e05ce7/FinalNet_criteo_x4_001_041_449ccb21.model, where the directory name criteo_x4_001_a5e05ce7 is the dataset ID declared in the BARS configuration. Step 3 uses this checkpoint as input to the ONNX export script. You can compare your run against the public BARS leaderboard entry for the FinalNet_criteo_x4_001 configuration.

Note: Training a CTR model on the full Criteo_x4 dataset is a long-running job, and the time it takes depends on your hardware. You can let the training run in the background and continue to explore this tutorial.
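
Because WORKSPACE is mounted into the container at /workspace, the checkpoint is also visible from the host. The sketch below builds the expected host-side path and reports whether the file exists yet. The helper name `checkpoint_path` is hypothetical, and the fallback workspace location simply mirrors the default used in Environment setup.

```python
import os
from pathlib import Path


def checkpoint_path(workspace, dataset_id, expid):
    """Host-side path of the FuxiCTR checkpoint for a dataset ID and experiment ID."""
    return (Path(workspace) / "FuxiCTR" / "model_zoo" / "FinalNet"
            / "checkpoints" / dataset_id / f"{expid}.model")


workspace = os.environ.get("WORKSPACE", str(Path.home() / "triton_finalnet_workspace"))
ckpt = checkpoint_path(
    workspace,
    dataset_id="criteo_x4_001_a5e05ce7",
    expid="FinalNet_criteo_x4_001_041_449ccb21",
)
print(ckpt)
print("checkpoint found" if ckpt.is_file() else "checkpoint not found yet")
```

Re-running this cell is a cheap way to see whether training has produced a checkpoint before you move on to Step 3.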

Step 3: Export the trained model to ONNX#

To serve the model with Triton Inference Server, it is converted from a PyTorch checkpoint into the portable ONNX format. The conversion script export_finalnet_to_onnx.py reads the FuxiCTR checkpoint and configuration, reconstructs the model in PyTorch, and traces it to an ONNX graph (which is a portable representation of the model’s computation steps).

The script writes the result directly into the WORKSPACE as model_repository/FinalNet_onnx/1/model.onnx, so Triton Inference Server can load it without any further restructuring.

The script is downloaded into the FuxiCTR FinalNet directory (model_zoo/FinalNet/) and run from there. Run the export inside the same training container:

%%bash
# Download the export script into the FuxiCTR FinalNet directory and run it from
# there, so its `from src import FinalNet` import resolves against model_zoo/FinalNet/src.
docker exec train_CTR bash -c "
wget -q -O /workspace/FuxiCTR/model_zoo/FinalNet/export_finalnet_to_onnx.py \
    https://raw.githubusercontent.com/ROCm/triton-inference-server-server/rocm7.2_r25.12/docs/perf_benchmark/FinalNet/export_finalnet_to_onnx.py

cd /workspace/FuxiCTR/model_zoo/FinalNet
python export_finalnet_to_onnx.py \
    --checkpoint /workspace/FuxiCTR/model_zoo/FinalNet/checkpoints/criteo_x4_001_a5e05ce7/FinalNet_criteo_x4_001_041_449ccb21.model \
    --config-dir /workspace/BARS/ranking/ctr/FinalNet/FinalNet_criteo_x4_001/FinalNet_criteo_x4_tuner_config_05 \
    --expid FinalNet_criteo_x4_001_041_449ccb21 \
    --output /workspace/model_repository/FinalNet_onnx/1/model.onnx
"

Step 4: Deploy FinalNet with the MIGraphX execution provider#

Note: This is the core of the tutorial. If you have skipped any step above, make sure you have run the cells in Environment setup and your ONNX model is in place at ${WORKSPACE}/model_repository/FinalNet_onnx/1/model.onnx before continuing. If you brought your own non-FinalNet model, edit the input/output names and shapes in the config.pbtxt file below to match your model.

Your ONNX model in the repository is now ready to be served. Triton Inference Server loads the model through its ONNX Runtime backend. The execution_accelerators setting in the model’s config.pbtxt file routes execution through the MIGraphX execution provider, which compiles the graph into optimized AMD GPU kernels using MIGraphX.

Model configuration#

Triton Inference Server reads a config.pbtxt file that describes the model’s inputs, outputs, batching behavior, and accelerator. The canonical file lives at docs/perf_benchmark/FinalNet/amd/config.pbtxt in the ROCm fork. With instance_group pinned to a single instance (the download cell below applies that change), it looks like this:

name: "FinalNet_onnx"
backend: "onnxruntime"
max_batch_size: 8192

input [
  { name: "numerical_features"  data_type: TYPE_FP32  dims: [ 13 ] },
  { name: "categorical_features" data_type: TYPE_INT64 dims: [ 26 ] }
]
output [
  { name: "output" data_type: TYPE_FP32 dims: [ 1 ] }
]

parameters {
  key: "execution_accelerators"
  value: {
    string_value: "{\"gpu_execution_accelerator\": [{\"name\": \"migraphx\"}]}"
  }
}

instance_group [ { count: 1  kind: KIND_GPU } ]

dynamic_batching {
  preferred_batch_size: [ 64, 128, 256, 512, 1024, 2048, 4096, 8192 ]
  max_queue_delay_microseconds: 100
}

A few things to note:

backend: "onnxruntime" selects the ONNX Runtime backend.
The execution_accelerators parameter enables the migraphx GPU accelerator.
dynamic_batching allows Triton Inference Server to group incoming requests into larger batches for better GPU utilization.
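
The `string_value` of `execution_accelerators` is itself a JSON document, written with escaped quotes because it sits inside a protobuf text string. As a small illustration (not part of the original notebook), the snippet below decodes the unescaped form and lists the accelerator names it requests:

```python
import json

# The string_value from config.pbtxt once the protobuf-text escapes (\") are resolved.
string_value = '{"gpu_execution_accelerator": [{"name": "migraphx"}]}'

accelerators = json.loads(string_value)
names = [entry["name"] for entry in accelerators["gpu_execution_accelerator"]]
print(names)  # ['migraphx']
```

If you edit this parameter by hand, running the unescaped string through a JSON parser like this is a quick way to catch a missing quote or bracket before starting the server.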

The next cell downloads this canonical file into your model repository at ${WORKSPACE}/model_repository/FinalNet_onnx/config.pbtxt.

%%bash
# Download the AMD config.pbtxt into the model repository
wget -q -O ${WORKSPACE}/model_repository/FinalNet_onnx/config.pbtxt \
    https://raw.githubusercontent.com/ROCm/triton-inference-server-server/rocm7.2_r25.12/docs/perf_benchmark/FinalNet/amd/config.pbtxt

# This single-GPU tutorial serves a single model instance. The upstream perf-benchmark
# config requests 32 instances on one GPU, which oversubscribes the device and makes
# perf_analyzer stall at higher concurrency; pin instance_group to a single instance.
sed -i -E 's/count:[[:space:]]*[0-9]+/count: 1/' ${WORKSPACE}/model_repository/FinalNet_onnx/config.pbtxt

echo "Model repository layout:"
find ${WORKSPACE}/model_repository -type f | sort
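
The same layout check can be done from a Python cell. The sketch below reports which of the two files the server needs are still missing. The function name `missing_repository_files` is hypothetical; the paths follow the layout described in Environment setup.

```python
import os
from pathlib import Path


def missing_repository_files(repo_root, model_name="FinalNet_onnx", version="1"):
    """Return the required model-repository files that do not exist yet."""
    model_dir = Path(repo_root) / model_name
    required = [
        model_dir / "config.pbtxt",          # model configuration
        model_dir / version / "model.onnx",  # exported model in its version folder
    ]
    return [path for path in required if not path.is_file()]


repo = Path(os.environ.get("WORKSPACE", ".")) / "model_repository"
missing = missing_repository_files(repo)
if missing:
    print("Missing files:")
    for path in missing:
        print(f"  {path}")
else:
    print("Model repository is complete.")
```

An empty result means both config.pbtxt and model.onnx are in place and the repository is ready to be mounted into the server container.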

Start the Triton Inference Server#

The next cell launches the Triton Inference Server container, mounts the model repository and the MIGraphX cache, and polls the server’s readiness endpoint until it’s ready. The first start is much slower than later ones because the MIGraphX execution provider has to compile the ONNX graph into GPU kernels for every model instance and store them in the mounted cache. On a cold start, this can take several minutes; subsequent restarts reuse the cache and should take only seconds.

Note: If you re-run this cell, remove the existing container first with docker rm -f tritonserver_container.

%%bash
# Start the Triton Inference Server container.
# The MIGraphX cache is mounted so compiled kernels persist across restarts.
# With --net=host the server's ports (8000 HTTP, 8001 gRPC, 8002 metrics) are
# reachable on the host directly, so no -p port mappings are needed.
TRITON_IMAGE="rocm/tritoninferenceserver:tritoninferenceserver-25.12.amd1_rocm7.2_ubuntu24.04_py3.12"

docker run \
    --name tritonserver_container \
    --device=/dev/kfd \
    --device=/dev/dri \
    --ipc=host \
    -d \
    --net=host \
    -e ORT_MIGRAPHX_MODEL_CACHE_PATH=/migraphx_cache \
    -e ORT_MIGRAPHX_CACHE_PATH=/migraphx_cache \
    -v ${WORKSPACE}/model_repository:/models \
    -v ${WORKSPACE}/migraphx_cache:/migraphx_cache \
    "${TRITON_IMAGE}" \
    tritonserver --model-repository=/models --exit-on-error=false

# Poll the readiness endpoint instead of using a fixed sleep: a cold start
# compiles MIGraphX kernels for all model instances and can take several minutes.
echo "Waiting for the server to become ready (first start compiles MIGraphX kernels; this can take several minutes)..."
ready=0
for i in $(seq 1 20); do
    code=$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v2/health/ready 2>/dev/null || true)
    if [ "$code" = "200" ]; then
        # Attempt i runs after (i - 1) sleeps of 20 s each.
        echo "Server ready after ~$(( (i - 1) * 20 ))s (HTTP 200)."
        ready=1
        break
    fi
    echo "  attempt ${i}/20: not ready yet (status: ${code:-no response}); waiting 20s..."
    sleep 20
done
if [ "$ready" -ne 1 ]; then
    echo "Server not ready after 20 attempts; check the logs below."
fi
docker logs tritonserver_container 2>&1 | tail -8
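
The polling pattern used in the cell above can also be written in Python, which is convenient if you later script the deployment. This is a minimal sketch rather than part of the original notebook: `http_ready` and `wait_until_ready` are hypothetical helper names, and the probe is passed in as a callable so the loop can be demonstrated without a running server.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout_s=2.0):
    """Return True if a GET on `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_ready(probe, attempts=20, interval_s=20.0, sleep=time.sleep):
    """Call `probe()` up to `attempts` times.

    Returns the seconds spent sleeping before success, or None on timeout.
    """
    for i in range(attempts):
        if probe():
            return i * interval_s
        if i < attempts - 1:
            sleep(interval_s)
    return None


# Demonstration with a fake probe that succeeds on its third call,
# and a no-op sleep so the example finishes immediately.
calls = iter([False, False, True])
waited = wait_until_ready(lambda: next(calls), sleep=lambda seconds: None)
print(waited)  # 40.0
```

Against the real server you would call `wait_until_ready(lambda: http_ready("http://localhost:8000/v2/health/ready"))` and treat a `None` result as a failed start.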

Verify the server is ready#

Triton Inference Server exposes a readiness endpoint on its HTTP port (8000). An HTTP 200 response from /v2/health/ready means the model is loaded and the server is ready to accept inference requests:

%%bash
# A 200 response means Triton Inference Server is up and the model is ready.
curl -s -o /dev/null -w "HTTP %{http_code}\n" localhost:8000/v2/health/ready
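
Once the server reports ready, a client can send inference requests to the model’s HTTP endpoint using the KServe v2 JSON protocol that Triton Inference Server implements. The sketch below builds a request body that matches the input names, types, and shapes in config.pbtxt. It is an illustration under stated assumptions: the helper names `build_infer_request` and `infer` are hypothetical, and the all-zero feature values are placeholders. Real categorical values must be the integer indices produced by the same feature encoding used in training, and whether index 0 is valid for your exported model is an assumption.

```python
import json
import urllib.request

NUM_NUMERICAL, NUM_CATEGORICAL = 13, 26


def build_infer_request(numerical, categorical):
    """Build a KServe v2 JSON body for a batch of rows (data flattened row-major)."""
    if len(numerical) != len(categorical):
        raise ValueError("both inputs must have the same batch size")
    if any(len(row) != NUM_NUMERICAL for row in numerical):
        raise ValueError(f"each numerical row needs {NUM_NUMERICAL} values")
    if any(len(row) != NUM_CATEGORICAL for row in categorical):
        raise ValueError(f"each categorical row needs {NUM_CATEGORICAL} values")
    batch = len(numerical)
    return {
        "inputs": [
            {"name": "numerical_features", "shape": [batch, NUM_NUMERICAL],
             "datatype": "FP32",
             "data": [float(v) for row in numerical for v in row]},
            {"name": "categorical_features", "shape": [batch, NUM_CATEGORICAL],
             "datatype": "INT64",
             "data": [int(v) for row in categorical for v in row]},
        ],
        "outputs": [{"name": "output"}],
    }


def infer(body, url="http://localhost:8000/v2/models/FinalNet_onnx/infer"):
    """POST the request body to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.loads(resp.read())


# Build (but do not send) a one-row request with placeholder values.
body = build_infer_request([[0.0] * NUM_NUMERICAL], [[0] * NUM_CATEGORICAL])
print([(t["name"], t["shape"]) for t in body["inputs"]])
```

With the server running, `infer(body)` should return a JSON object whose `outputs` list holds the predicted click probability for each row.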

Step 5: Measure serving performance with `perf_analyzer`#

With the server running, the next step is to measure how well it performs under different loads. This answers a practical question: With this model on this GPU, how many requests can the server handle, and how quickly? This is answered by measuring two quantities while gradually increasing the load:

Throughput (inferences per second): How many inferences the server completes per unit time. From the hardware perspective, it tells you how fully the GPU is utilized: as you send more concurrent requests, throughput climbs until the GPU saturates and the curve flattens. From the service perspective, it represents the capacity of the server—for example, how many predictions per second a recommendation service can deliver, and therefore how many GPUs you would need to handle your traffic.
Latency (milliseconds per request, usually reported as p50/p99 percentiles): How long an individual request waits for its answer. From the hardware perspective, latency rises as requests queue behind one another once the GPU is busy. From the service perspective, it is the delay your end user (or upstream service) actually experiences. CTR predictions typically sit on the critical path of serving a page, so a latency budget (e.g., “p99 must stay under X ms”) is often a hard requirement.
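
To make the percentile notation concrete, the sketch below computes p50 and p99 from a list of per-request latencies using the nearest-rank definition. This is one common definition chosen for illustration; perf_analyzer may compute its percentiles differently, and the latency values here are synthetic.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies of 1, 2, ..., 100 ms (not measurements).
latencies_ms = [float(v) for v in range(1, 101)]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 50.0 99.0
```

The p50 value describes a typical request, while p99 is dominated by the slowest 1% of requests, which is why latency budgets are usually stated in terms of p99.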

There is a trade-off between the two quantities: pushing for throughput usually increases latency. The sweep below lets you find the operating point that maximizes throughput while still meeting your latency budget. This is measured on your own hardware using perf_analyzer, which sends a controlled stream of synthetic requests at increasing concurrency levels and reports throughput and latency for each level.

perf_analyzer is run from the official Triton Inference Server SDK container image (nvcr.io/nvidia/tritonserver:25.12-py3-sdk), which ships a perf_analyzer binary compatible with the server. The flags used here are:

Flag	Value	Meaning
`-m FinalNet_onnx`	`FinalNet_onnx`	Target model name on the server
`--input-data=random`	random	Fill input tensors with random data
`-b 8192`	8192	Batch size per request
`--concurrency-range 1:32:2`	1 to 32, step 2	Sweep across client concurrency levels
`--measurement-interval 10000`	10000 ms	Measure each concurrency level over a 10 s window for stable numbers
`-f /results/perf_results.csv`	`/results/perf_results.csv`	Write the per-concurrency results to a CSV file

Note on the sweep range: At batch size 8192, the GPU already saturates well before high concurrency, so the throughput curve flattens early. The sweep is capped at concurrency 32, and a 10-second measurement window per level is used. Pushing the range much higher (e.g., 1:72:2) will only add queued requests that stall the run without revealing new information. If you want to explore higher concurrency, raise the upper bound gradually and keep the longer measurement interval.
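
As a quick check of what the sweep covers, the sketch below enumerates the concurrency levels implied by 1:32:2. It assumes perf_analyzer starts at the first value and adds the step while the level does not exceed the end value; the helper name `sweep_levels` is hypothetical.

```python
def sweep_levels(start, end, step):
    """Concurrency levels visited by a start:end:step sweep (end inclusive)."""
    return list(range(start, end + 1, step))


levels = sweep_levels(1, 32, 2)
print(levels)
print(f"{len(levels)} levels, at least {len(levels) * 10} s of measurement windows")
```

Under that assumption the sweep visits the odd levels from 1 to 31, so the highest concurrency actually measured is 31. The total run time is longer than the sum of the windows if perf_analyzer needs more than one window per level to obtain stable numbers.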

${WORKSPACE} is mounted into the container as /results so that the CSV file is saved on the host for plotting in the next step.

%%bash
# Note: no -it flag — a notebook cell has no interactive TTY, so -it would fail here.
docker run --rm --net=host \
    -v ${WORKSPACE}:/results \
    nvcr.io/nvidia/tritonserver:25.12-py3-sdk \
    perf_analyzer \
        -m FinalNet_onnx \
        --input-data=random \
        -b 8192 \
        --concurrency-range 1:32:2 \
        --measurement-interval 10000 \
        -f /results/perf_results.csv

How to read the output#

perf_analyzer prints a statistics block plus a summary table per concurrency level. For each level you will see:

Inferences/Second: Throughput, i.e., how many inferences the server completes per second. Multiply by the batch size to convert to sample-level throughput.
Latency percentiles (p50, p90, p99): How long requests took. The p99 value is the tail latency, i.e., 99% of requests complete within that time.

In general, throughput rises with concurrency until the GPU saturates, while latency grows as requests begin to queue. The right operating point depends on your latency target. All results are written to perf_results.csv for analysis in the next step.

Step 6: Plot your own results#

perf_analyzer wrote a perf_results.csv file to your WORKSPACE in the previous step. The cells below load your CSV file and plot throughput and tail latency against client concurrency, so you can visualize how the server behaves on your hardware.

This notebook ships with no pre-recorded numbers; the plots are generated entirely from the CSV file you just produced.

try:
    import matplotlib
except ImportError:
    import subprocess, sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "matplotlib", "-q"])
print("matplotlib ready.")

The next cell loads the CSV file and plots the curves with matplotlib. If the CSV file is not found, the cell prints a hint instead of failing—run Step 5 to generate it.

import csv
import os
from pathlib import Path
import matplotlib.pyplot as plt

AMD_RED, AMD_ORANGE = "#ED1C24", "#F58220"
csv_path = Path(os.environ["WORKSPACE"]) / "perf_results.csv"

if not csv_path.exists():
    print(f"No results file at {csv_path}.")
    print("Run the perf_analyzer cell in Step 5 to generate it, then re-run this cell.")
else:
    concurrency, throughput, p99_ms = [], [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            concurrency.append(int(float(row["Concurrency"])))
            throughput.append(float(row["Inferences/Second"]))
            p99_ms.append(float(row["p99 latency"]) / 1000.0)  # microseconds -> ms

    order = sorted(range(len(concurrency)), key=lambda i: concurrency[i])
    concurrency = [concurrency[i] for i in order]
    throughput = [throughput[i] for i in order]
    p99_ms = [p99_ms[i] for i in order]
    print(f"Loaded {len(concurrency)} concurrency levels from {csv_path.name}.")

Finally, render the throughput and latency curves from the data loaded above:

if csv_path.exists():
    fig, axes = plt.subplots(1, 2, figsize=(13, 4))

    axes[0].plot(concurrency, throughput, marker="o", color=AMD_RED)
    axes[0].set(xlabel="Client concurrency", ylabel="Throughput (inferences/sec)",
                title="FinalNet throughput on your hardware")

    axes[1].plot(concurrency, p99_ms, marker="s", color=AMD_ORANGE)
    axes[1].set(xlabel="Client concurrency", ylabel="p99 latency (ms)",
                title="FinalNet tail latency on your hardware")

    for ax in axes:
        ax.grid(True, linestyle="--", alpha=0.45)

    fig.tight_layout()
    plt.show()
else:
    print("Run Step 5 and the load cell above first.")
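
Beyond reading the plots, you can pick an operating point programmatically: the concurrency level with the highest throughput whose p99 latency still fits your budget. The sketch below shows one way to do that. The function name `best_operating_point` and the 25 ms budget are hypothetical, and the demo numbers are invented for illustration; they are not measurements.

```python
def best_operating_point(concurrency, throughput, p99_ms, p99_budget_ms):
    """Return (concurrency, throughput, p99_ms) for the highest-throughput level
    whose p99 latency is within the budget, or None if no level qualifies."""
    candidates = [
        (t, c, p)
        for c, t, p in zip(concurrency, throughput, p99_ms)
        if p <= p99_budget_ms
    ]
    if not candidates:
        return None
    t, c, p = max(candidates, key=lambda item: item[0])
    return c, t, p


# Invented numbers for illustration only.
demo_concurrency = [1, 3, 5, 7]
demo_throughput = [100.0, 250.0, 300.0, 305.0]
demo_p99_ms = [12.0, 15.0, 22.0, 40.0]

print(best_operating_point(demo_concurrency, demo_throughput, demo_p99_ms,
                           p99_budget_ms=25.0))  # (5, 300.0, 22.0)
```

To apply it to your own run, pass the `concurrency`, `throughput`, and `p99_ms` lists built by the load cell above together with your own latency budget.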

Clean up#

When you are finished, stop and remove the containers created by this tutorial to free up GPU memory and disk space:

%%bash
# Stop and remove the tutorial containers (ignores errors if they are already gone).
docker rm -f train_CTR tritonserver_container 2>/dev/null || true
echo "Cleanup complete."

Summary#

In this tutorial, you took a CTR prediction model all the way from training to deployment and benchmarking as a GPU-accelerated inference service on AMD Instinct hardware. You have:

Trained FinalNet with FuxiCTR on the public Criteo_x4 dataset inside a ROCm PyTorch container.
Exported the trained checkpoint to the portable ONNX format.
Deployed the model on Triton Inference Server using the ONNX Runtime backend accelerated by the MIGraphX execution provider.
Measured throughput and latency on your own hardware with perf_analyzer, and plotted the results.

From here, you could try serving a different model, experimenting with the dynamic_batching settings in config.pbtxt, or integrating the server into an application through its HTTP/gRPC API.