Building AI pipelines for voice assistants using ROCm, LlamaIndex, and RAG#
This notebook demonstrates how to build a voice-assistant pipeline on AMD GPUs using LlamaIndex and Retrieval-Augmented Generation (RAG). The pipeline takes an input audio recording, transcribes it to text with Whisper, sends the transcribed text to a RAG query engine, generates a text response, and then converts that response to speech and saves it as an audio file.
Prerequisites#
This tutorial was developed and tested using the following setup.
Operating system#
Ubuntu 22.04: Ensure your system is running Ubuntu version 22.04.
Hardware#
AMD GPUs: This tutorial was tested on an AMD Instinct™ MI300X and an AMD Radeon™ W7900. Ensure you are using an AMD GPU with ROCm support and that your system meets the official requirements.
Software#
This tutorial was tested on both AMD Radeon and AMD Instinct GPUs using the following setup:
ROCm 6.2.0
Python 3.10
PyTorch 2.3.0
Objectives#
After completing this tutorial, you should understand the following concepts:
Building a multi-model pipeline (speech-to-text, RAG, and text-to-speech)
Using LlamaIndex with ROCm on AMD GPUs
Prepare the inference environment#
To set up the inference environment, follow these steps:
Create a conda environment:
conda create -n rocm python=3.10
Activate the environment:
conda activate rocm
Install the PyTorch for ROCm software:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
Install Ollama (if not previously installed). This step requires curl:
sudo apt install curl -y
curl -fsSL https://ollama.com/install.sh | sh
Launch the Ollama server if it isn’t already running:
ollama serve &
Pull llama3 with Ollama:
ollama pull llama3
Install the example dependencies:
pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama llama-index-embeddings-huggingface openai-whisper transformers ChatTTS
Include an audio file (for example, summarize_question.wav) and place it in the current working directory. An optional check to confirm the file is readable follows this list.
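Optionally, once the environment is set up, you can run a quick Python check to confirm the audio file is present and readable. This is a minimal sketch that assumes the summarize_question.wav file name used in this tutorial and relies on torchaudio, which was installed with PyTorch above.
# Optional sanity check: confirm the sample audio file exists and inspect its metadata.
# Assumes the file name used throughout this tutorial (summarize_question.wav).
import os
import torchaudio
assert os.path.exists("summarize_question.wav"), "Place the audio file in the working directory first."
print(torchaudio.info("summarize_question.wav"))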
Install and launch Jupyter#
Inside the activated conda environment, install Jupyter using the following command:
pip install jupyter
Then start the Jupyter server:
jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
Note: Ensure port 8888 is not already in use on your system before running the above command. If it is, specify a different port by replacing --port=8888 with another port number, for example, --port=8890.
Import the packages#
Import the following packages:
os: Operating-system-dependent functionality.
numpy: Numerical array handling.
re: Regular expressions for text cleanup.
whisper: A speech-recognition library.
torch: A PyTorch library for tensor computations and deep learning.
llama_index.core: Core functionality for LlamaIndex.
llama_index.embeddings.huggingface: Support for Hugging Face embedding models.
llama_index.llms.ollama: Functionality for the Ollama language model.
ChatTTS: A text-to-speech conversion library.
torchaudio: An audio-processing library.
IPython.display: For displaying audio in Jupyter notebooks.
# System Imports
import os
import numpy as np
import re
# Imports for Speech to Text
import whisper
import torch
# Imports for RAG Model
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
# Imports for Text to Speech
import ChatTTS
import torchaudio
from IPython.display import Audio
Set up the environment#
Optionally, set the environment variables to enable experimental features in PyTorch ROCm.
Verify the PyTorch version and GPU availability.
Select the computation device:
Use the GPU if available and print its properties.
Fall back to the CPU otherwise.
# Set the environment variable for experimental features (optional)
os.environ['TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL'] = '1'
os.environ['HIP_VISIBLE_DEVICES'] = "0"
print(f"Torch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
# Check GPU availability and properties
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("Using CPU (no GPU detected)")
Transcribe speech to text#
The following section performs speech-to-text transcription using the Whisper model.
First, download the sample audio file for this tutorial.
!curl -L https://raw.githubusercontent.com/ROCm/gpuaidev/main/docs/notebooks/assets/summarize_question.wav -o summarize_question.wav
Load the audio file and play it to hear the recorded question.
AUDIO_FILE = "summarize_question.wav"
Audio(AUDIO_FILE, rate=24_000, autoplay=True)
Now transcribe the speech content into text.
# Speech-to-Text with Whisper
try:
    model = whisper.load_model("base")
    result = model.transcribe(AUDIO_FILE)
    input_text = result["text"]
    print(f"Transcribed text: {input_text}")
except Exception as e:
    print(f"Error in speech-to-text: {e}")
    exit(1)
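If you want higher transcription accuracy or explicit control over where the model runs, you can load Whisper directly on the selected device and choose a larger checkpoint. This is an optional variant rather than part of the pipeline above; the "small" checkpoint is only an example, and fp16 decoding is only useful on the GPU.
# Optional variant: load Whisper on the selected device and use a larger checkpoint.
# "small" is an example; larger models improve accuracy but need more GPU memory.
model = whisper.load_model("small", device=device)
result = model.transcribe(AUDIO_FILE, fp16=(device.type == "cuda"))
print(result["text"])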
Add the RAG model#
To use a RAG model, provide the context that you’d like the LLM to use for the queries. This example is configured to use the documents in the data folder. If you don’t have any documents yet, you can add your own or download the sample below.
DATA_DIR = "./data"
# Check if the data directory exists, and create it if it doesn't
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)
    print(f"Data directory '{DATA_DIR}' created. Please add a file of your choosing or use the cell below to download sample text.")
    exit(1)
else:
    # Check if the data directory is empty
    if not os.listdir(DATA_DIR):
        print(f"Data directory '{DATA_DIR}' is empty. Please add a file of your choosing or use the cell below to download sample text.")
        exit(1)
If the data directory is empty, run the following cell:
## OPTIONAL - Run this cell if your data directory is empty
!mkdir -p data && curl -L https://www.gutenberg.org/cache/epub/11/pg11.txt -o data/pg11.txt
Verify the data file now exists in the data directory.
# View the files in your data directory
print("Files in data directory:", os.listdir("data"))
documents = SimpleDirectoryReader(DATA_DIR).load_data()
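As a quick, optional check, print how many documents were loaded before building the index.
# Optional: confirm the documents loaded correctly before indexing.
print(f"Loaded {len(documents)} document(s)")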
For the embedding model, use “bge-base” from HuggingFaceEmbedding. Confirm that the Ollama server is running because it supplies Llama 3 for the LLM.
Next, create a VectorStoreIndex from the loaded documents and initialize a query engine with the index. Then issue your query using the text output from the Whisper model. Print the response so you can compare it against the audio output in the next step.
# Initialize embedding and LLM models
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
try:
    Settings.llm = Ollama(model="llama3", request_timeout=360.0)
except Exception as e:
    print(f"Error connecting to Ollama server: {e}")
    exit(1)
# Build and query the vector index
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True, response_mode="compact", similarity_top_k=3)
response = query_engine.query(input_text)
# Function to convert StreamingResponse to string
def streaming_response_to_string(streaming_response):
    text = ""
    for chunk in streaming_response.response_gen:
        if isinstance(chunk, dict) and "text" in chunk:
            text += chunk["text"]
        else:
            text += str(chunk)
    return text
# Convert response to string
response_text = streaming_response_to_string(response)
print(f"Generated response: {response_text}")
Perform text-to-speech conversion#
The following example performs text-to-speech conversion using the ChatTTS library and saves the output audio to a file.
This example uses the following constants:
OUTPUT_AUDIO_FILE (str): The name of the output audio file.
SAMPLE_RATE (int): The sample rate of the output audio file.
It provides the following functionality:
Initializes a ChatTTS.Chat object.
Loads the chat model without compilation for faster loading. (Set compile=True for better performance.)
Converts the response text from the previous step to speech.
Saves the generated audio to the specified output file using torchaudio.
OUTPUT_AUDIO_FILE = "voice_pipeline_response.wav"
SAMPLE_RATE = 24000
# Text cleanup function for TTS (an alternative to the regex used in the next cell)
def sanitize_input(text):
    sanitized_text = text.replace('-', '')  # Remove hyphens
    sanitized_text = sanitized_text.replace('(', '').replace(')', '')  # Remove parentheses
    return sanitized_text.strip()
# Text-to-Speech processing
try:
    sanitized_response = re.sub(r"[^a-zA-Z0-9.,?! ]", "", response_text)  # Remove special characters
    print(f"Sanitized response for TTS: {sanitized_response}")
    sanitized_response = [sanitized_response]

    chat = ChatTTS.Chat()
    chat.load(compile=False)  # Set to True for better performance

    params_infer_code = ChatTTS.Chat.InferCodeParams(
        spk_emb=chat.sample_random_speaker(),
    )

    wavs = chat.infer(
        sanitized_response,
        params_infer_code=params_infer_code,
    )

    try:
        # wavs[0] may be 1-D (needs a channel dimension) or already 2-D depending on the ChatTTS version
        torchaudio.save(OUTPUT_AUDIO_FILE, torch.from_numpy(wavs[0]).unsqueeze(0), SAMPLE_RATE)
    except Exception:
        torchaudio.save(OUTPUT_AUDIO_FILE, torch.from_numpy(wavs[0]), SAMPLE_RATE)
except Exception as e:
    print(f"Error in text-to-speech: {e}")
    exit(1)
finally:
    if 'chat' in locals():
        chat.unload()
Run the following cell to play the generated speech.
Audio(wavs[0], rate=24_000, autoplay=True)
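Optionally, you can load the saved file back with torchaudio to confirm it was written with the expected shape and sample rate.
# Optional: reload the saved audio file to verify the output.
waveform, sample_rate = torchaudio.load(OUTPUT_AUDIO_FILE)
print(f"Saved audio shape: {tuple(waveform.shape)}, sample rate: {sample_rate}")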