Running models from Hugging Face#
2024-09-10
Hugging Face hosts the world’s largest AI model repository for developers to obtain transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in developing and deploying AI solutions.
This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.
Using Hugging Face Transformers#
First, install the Hugging Face Transformers library, which lets you easily import any of the transformer models into your Python application.
pip install transformers
Here is an example of running GPT2:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core models should also function correctly.
Here are some mainstream models to get you started:
Using Hugging Face with Optimum-AMD#
Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.
For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the Optimum-AMD page on Hugging Face for guidance on using Flash Attention 2, GPTQ quantization and the ONNX Runtime integration.
Hugging Face libraries natively support AMD Instinct accelerators. For other ROCm-capable hardware, support is currently not validated, but most features are expected to work without issues.
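Before loading a model, it can help to confirm that the installed PyTorch build actually sees the accelerator. The following is a minimal sketch, not part of the Hugging Face or Optimum-AMD APIs; the helper name describe_torch_backend is hypothetical. It relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the torch.cuda interface, which is why device strings such as "cuda" appear throughout this page.

```python
def describe_torch_backend():
    """Return a short description of the PyTorch build and the visible GPU, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(torch.version, "hip", None)

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no GPU is visible"

    # On ROCm builds, AMD GPUs are addressed through the torch.cuda API.
    name = torch.cuda.get_device_name(0)
    backend = f"ROCm/HIP {hip_version}" if hip_version else "CUDA"
    return f"{backend} build, device 0: {name}"


print(describe_torch_backend())
```

If the report shows no visible GPU, the models on this page still load, but they run on the CPU unless a device is set explicitly.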
Installation#
Install Optimum-AMD using pip.
pip install --upgrade --upgrade-strategy eager "optimum[amd]"
Or, install from source.
git clone https://github.com/huggingface/optimum-amd.git
cd optimum-amd
pip install -e .
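After either installation route, a quick check of what ended up in the environment can save debugging time later. The sketch below uses only the Python standard library. The distribution names in the list, in particular optimum-amd, are assumptions about how the packages are registered with pip, and installed_versions is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names are assumptions; adjust them to match your environment.
report = installed_versions(["torch", "transformers", "optimum", "optimum-amd"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```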
Flash Attention#
Build and run the Hugging Face team’s example Dockerfile to use Flash Attention with ROCm.
docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
volume=$PWD
docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd transformers_pytorch_amd_gpu_flash:latest
Use Flash Attention 2 with Transformers by adding the use_flash_attention_2 parameter to from_pretrained():
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        torch_dtype=torch.float16,
        use_flash_attention_2=True,
    )
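Recent Transformers releases also accept an attn_implementation argument to from_pretrained(), with values such as "flash_attention_2", "sdpa", and "eager". The sketch below shows one possible way to choose a value at run time. It assumes that the Flash Attention build for ROCm is importable as flash_attn, and pick_attn_implementation is a hypothetical helper, not a library function.

```python
import importlib.util


def pick_attn_implementation():
    """Prefer Flash Attention 2 when its package is importable, else fall back to SDPA."""
    # Assumption: the Flash Attention package is installed under the module name "flash_attn".
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Keyword arguments that could be passed on to from_pretrained().
model_kwargs = {"attn_implementation": pick_attn_implementation()}
print(model_kwargs)
```

The dictionary could then be unpacked into the call, for example AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", torch_dtype=torch.float16, **model_kwargs). A model architecture without SDPA support may need "eager" as the fallback instead.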
GPTQ#
GPTQ support relies on the AutoGPTQ library, for which prebuilt ROCm wheels are hosted.
First, install Optimum-AMD.
Then, install AutoGPTQ using pip. Refer to AutoGPTQ Installation for in-depth guidance.
pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
Or, to install from source for AMD accelerators supporting ROCm, specify the ROCM_VERSION environment variable.
ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .
Load GPTQ-quantized models in Transformers using the backend AutoGPTQ library:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-Chat-GPTQ",
        torch_dtype=torch.float16,
    )
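To get a feel for why quantized checkpoints are attractive, the arithmetic below gives a rough estimate of weight storage at different precisions. It is a back-of-the-envelope sketch: the 7 billion parameter count is nominal, the 4-bit width is an assumption about the checkpoint, and the estimate ignores quantization scales, zero points, activations, and the KV cache.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_weight / 8 / 2**30


# Nominal 7B-parameter model; both bit widths are illustrative assumptions.
fp16 = weight_memory_gib(7e9, 16)
int4 = weight_memory_gib(7e9, 4)
print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

Under these assumptions the quantized weights take about a quarter of the space of the float16 weights, which is often the difference between fitting on one device or not.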
ONNX#
Hugging Face Optimum also supports the ONNX Runtime integration. For ONNX models, usage is straightforward.
Specify the provider argument in the ORTModel.from_pretrained() method:
from optimum.onnxruntime import ORTModelForSequenceClassification
..
ort_model = ORTModelForSequenceClassification.from_pretrained(
    ..
    provider="ROCMExecutionProvider"
)
Try running a BERT text classification ONNX model with ROCm:
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.pipelines import pipeline
from transformers import AutoTokenizer
import onnxruntime as ort
session_options = ort.SessionOptions()
session_options.log_severity_level = 0
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    provider="ROCMExecutionProvider",
    session_options=session_options
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
result = pipe("Both the music and visual were astounding, not to mention the actors performance.")
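If the ROCm execution provider is missing from the installed ONNX Runtime build, loading the model with provider="ROCMExecutionProvider" is likely to fail. The following sketch checks for it first using onnxruntime.get_available_providers(). The helper name rocm_provider_available and the CPU fallback are illustrative choices, not part of Optimum.

```python
def rocm_provider_available():
    """Return True if the installed ONNX Runtime build lists the ROCm execution provider."""
    try:
        import onnxruntime as ort
    except ImportError:
        # ONNX Runtime is not installed at all.
        return False
    return "ROCMExecutionProvider" in ort.get_available_providers()


# Fall back to the CPU provider when ROCm is not available (an illustrative choice).
provider = "ROCMExecutionProvider" if rocm_provider_available() else "CPUExecutionProvider"
print(provider)
```

The resulting provider string can be passed as the provider argument of from_pretrained().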