Running models from Hugging Face#

Applies to Linux and Windows

2024-05-30

7 min read time

Hugging Face hosts the world’s largest AI model repository for developers to obtain transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in developing and deploying AI solutions.

This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.

Using Hugging Face Transformers#

First, install the Hugging Face Transformers library, which lets you easily import any of the transformer models into your Python application.

pip install transformers

Here is an example of running GPT2:

from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

model = GPT2Model.from_pretrained('gpt2')

text = "Replace me with any text you'd like."

encoded_input = tokenizer(text, return_tensors='pt')

output = model(**encoded_input)

Mainstream transformer models are regularly tested on supported hardware platforms. Models derived from those core models should also function correctly.

Here are some mainstream models to get you started:

Using Hugging Face with Optimum-AMD#

Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.

For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the Optimum-AMD page on Hugging Face for guidance on using Flash Attention 2, GPTQ quantization and the ONNX Runtime integration.

Hugging Face libraries natively support AMD Instinct accelerators. For other ROCm-capable hardware, support is currently not validated, but most features are expected to work without issues.

Installation#

Install Optimum-AMD using pip.

pip install --upgrade --upgrade-strategy eager optimum[amd]

Or, install from source.

git clone https://github.com/huggingface/optimum-amd.git
cd optimum-amd
pip install -e .

Flash Attention#

  1. Use the Hugging Face team’s example Dockerfile to use Flash Attention with ROCm.

    docker build -f Dockerfile -t transformers_pytorch_amd_gpu_flash .
    volume=$PWD
    docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $volume:/workspace --name transformer_amd
    transformers_pytorch_amd_gpu_flash:latest
    
  2. Use Flash Attention 2 with Transformers by adding the use_flash_attention_2 parameter to from_pretrained():

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
    
    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
    
    with torch.device("cuda"):
      model = AutoModelForCausalLM.from_pretrained(
      "tiiuae/falcon-7b",
      torch_dtype=torch.float16,
      use_flash_attention_2=True,
      )
    

GPTQ#

To enable GPTQ, hosted wheels are available for ROCm.

  1. First, install Optimum-AMD.

  2. Install AutoGPTQ using pip. Refer to AutoGPTQ Installation for in-depth guidance.

    pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
    

    Or, to install from source for AMD accelerators supporting ROCm, specify the ROCM_VERSION environment variable.

    ROCM_VERSION=6.1 pip install -vvv --no-build-isolation -e .
    
  3. Load GPTQ-quantized models in Transformers using the backend AutoGPTQ library:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM
    
    tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")
    
    with torch.device("cuda"):
      model = AutoModelForCausalLM.from_pretrained(
      "TheBloke/Llama-2-7B-Chat-GPTQ",
      torch_dtype=torch.float16,
      )
    

ONNX#

Hugging Face Optimum also supports the ONNX Runtime integration. For ONNX models, usage is straightforward.

  1. Specify the provider argument in the ORTModel.from_pretrained() method:

    from optimum.onnxruntime import ORTModelForSequenceClassification
    ..
    ort_model = ORTModelForSequenceClassification.from_pretrained(
    ..
    provider="ROCMExecutionProvider"
    )
    
  2. Try running a BERT text classification ONNX model with ROCm:

    from optimum.onnxruntime import ORTModelForSequenceClassification
    from optimum.pipelines import pipeline
    from transformers import AutoTokenizer
    import onnxruntime as ort
    
    session_options = ort.SessionOptions()
    
    session_options.log_severity_level = 0
    
    ort_model = ORTModelForSequenceClassification.from_pretrained(
       "distilbert-base-uncased-finetuned-sst-2-english",
       export=True,
       provider="ROCMExecutionProvider",
       session_options=session_options
       )
    
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
    
    pipe = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0")
    
    result = pipe("Both the music and visual were astounding, not to mention the actors performance.")