TokenizeVocabulary

2025-07-04

Applies to Linux

Constructor

TokenizeVocabulary(vocabulary)

A vocabulary object used to tokenize input text.

TokenizeVocabulary.tokenize(text[, ...])

Parameters

text (cudf string series): The strings to be tokenized.

delimiter (str): Delimiter to identify tokens. Default is whitespace.

default_id (int): Value to use for tokens not found in the vocabulary. Default is -1.
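
The following is a minimal pure-Python sketch of the lookup behavior described by these parameters. It is not the cuDF implementation and does not run on the GPU. The function name tokenize_with_vocabulary is hypothetical, and the sketch assumes that a token's ID is its position in the vocabulary, which this page does not state.

```python
def tokenize_with_vocabulary(texts, vocabulary, delimiter=None, default_id=-1):
    """Map each token in each string to an integer ID.

    Hypothetical helper for illustration only. Assumes a token's ID is
    its position in `vocabulary`. A delimiter of None splits on runs of
    whitespace, mirroring the documented default.
    """
    # Build a token -> ID lookup from the vocabulary order.
    token_ids = {token: i for i, token in enumerate(vocabulary)}
    # Split each string and look up every token, falling back to
    # default_id for tokens that are not in the vocabulary.
    return [
        [token_ids.get(token, default_id) for token in text.split(delimiter)]
        for text in texts
    ]


vocabulary = ["brown", "fox", "quick", "the"]
texts = ["the quick brown fox", "the lazy fox"]

print(tokenize_with_vocabulary(texts, vocabulary))
# [[3, 2, 0, 1], [3, -1, 1]]
```

In the second string, "lazy" is not in the vocabulary, so it receives the default_id value of -1.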