hipdf.core.subword_tokenizer.SubwordTokenizer

class hipdf.core.subword_tokenizer.SubwordTokenizer(hash_file: str, do_lower_case: bool = True)

Bases: object

Runs the CUDA BERT subword tokenizer on a cuDF strings column, encoding words to token ids using the vocabulary from a pretrained tokenizer. The tokenizer requires approximately 21x the number of character bytes in the input strings column as working memory. A usage sketch appears at the end of this section.

Parameters

hash_file : str

Path to the hash file containing the vocabulary of words with their token ids. This file can be created from a raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function, as shown in the sketch below.

do_lower_case : bool, default True

If set to True, the original text is lowercased before encoding.

Returns

SubwordTokenizer
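
As noted under hash_file, the hash file is generated once from a raw vocabulary file with cudf.utils.hash_vocab_utils.hash_vocab. A minimal sketch; the vocabulary filename is a placeholder:

>>> from cudf.utils.hash_vocab_utils import hash_vocab
>>> # Hash a raw BERT vocabulary file into the format expected by
>>> # SubwordTokenizer ('bert-base-uncased-vocab.txt' is a placeholder).
>>> hash_vocab('bert-base-uncased-vocab.txt', 'voc_hash.txt')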

__init__(hash_file: str, do_lower_case: bool = True)

Methods

__init__(hash_file[, do_lower_case])

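A minimal end-to-end usage sketch. Only __init__ is documented in this section; the call interface used below (max_length, max_num_rows, padding, return_tensors, truncation, and the returned input_ids/attention_mask/metadata keys) is assumed to match cuDF's SubwordTokenizer.__call__ and should be treated as illustrative:

>>> import cudf
>>> from hipdf.core.subword_tokenizer import SubwordTokenizer
>>> # 'voc_hash.txt' is the hash file produced by hash_vocab above.
>>> tokenizer = SubwordTokenizer('voc_hash.txt', do_lower_case=True)
>>> str_series = cudf.Series(['This is the', 'best book'])
>>> output = tokenizer(str_series,
...                    max_length=8,
...                    max_num_rows=len(str_series),
...                    padding='max_length',
...                    return_tensors='cp',  # CuPy arrays; 'pt' for PyTorch tensors
...                    truncation=True)
>>> output['input_ids']       # padded token ids, one row per input string
>>> output['attention_mask']  # 1 for real tokens, 0 for padding
>>> output['metadata']        # maps tensor rows back to input string indices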