hipdf.core.subword_tokenizer.SubwordTokenizer
- class hipdf.core.subword_tokenizer.SubwordTokenizer(hash_file: str, do_lower_case: bool = True)
Bases: object
Run the CUDA BERT subword tokenizer on a cuDF strings column. Encodes words to token ids using the vocabulary from a pretrained tokenizer. This function requires about 21x the number of character bytes in the input strings column as working memory.
Parameters
- hash_file : str
Path to the hash file containing the vocabulary of words with token ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function (see the sketch after this list).
- do_lower_case : bool, default True
If True, the original text is lowercased before encoding.
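As an illustration of how the hash file can be produced, here is a minimal sketch using the hash_vocab helper named above. The file names are hypothetical, and the helper is assumed to be available as documented for cuDF.

```python
# Sketch: build the perfect-hash vocabulary file consumed by
# SubwordTokenizer. Assumes the cuDF helper named in the parameter
# description above; both file names are hypothetical.
from cudf.utils.hash_vocab_utils import hash_vocab

# 'vocab.txt' is a raw BERT vocabulary (one token per line);
# 'vocab_hash.txt' is the hashed file passed to SubwordTokenizer.
hash_vocab("vocab.txt", "vocab_hash.txt")
```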
Returns
SubwordTokenizer
Methods
- __init__(hash_file[, do_lower_case])
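For context, below is a minimal end-to-end usage sketch. It assumes hipdf mirrors cuDF's SubwordTokenizer call signature and exposes a Series constructor; the hash file path and input strings are hypothetical.

```python
# Usage sketch, assuming hipdf mirrors cuDF's SubwordTokenizer API.
import hipdf
from hipdf.core.subword_tokenizer import SubwordTokenizer

# 'vocab_hash.txt' is the hypothetical hash file built earlier.
tokenizer = SubwordTokenizer("vocab_hash.txt", do_lower_case=True)

strings = hipdf.Series(["This is a test.", "A second sentence."])

# In cuDF, the tokenizer object is called directly on the strings column.
output = tokenizer(
    strings,
    max_length=16,
    max_num_rows=len(strings),
    padding="max_length",
    return_tensors="cp",  # CuPy arrays; "pt" returns PyTorch tensors
    truncation=True,
)

print(output["input_ids"].shape)       # (2, 16) token ids per row
print(output["attention_mask"].shape)  # (2, 16) 1 for real tokens, 0 for padding
```

In cuDF, the returned encoding also includes a metadata array that maps each output row back to its input string, which matters when an input string is split across multiple output rows.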