hipdf.core.tokenize_vocabulary.TokenizeVocabulary.tokenize

Applies to Linux

TokenizeVocabulary.tokenize(text, delimiter: str = '', default_id: int = -1) → Series

Parameters

text : cudf string series
    The strings to be tokenized.
delimiter : str
    Delimiter used to identify tokens. The default, an empty string, splits on whitespace.
default_id : int
    Value to use for tokens not found in the vocabulary. Default is -1.

Returns

Series
    The tokenized strings.
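
Example

The following is a minimal plain-Python sketch of the lookup behaviour described by the parameters above. It is not the library's implementation and does not call hipdf. The helper name `tokenize_sketch`, the use of Python lists in place of Series, and the choice of a token's position in the vocabulary as its id are all assumptions made for illustration. Edge cases such as empty tokens are not modelled.

```python
def tokenize_sketch(texts, vocabulary, delimiter="", default_id=-1):
    """Illustrative stand-in for the documented parameters (hypothetical helper)."""
    # Assumption: a token's id is its position in the vocabulary.
    token_ids = {token: i for i, token in enumerate(vocabulary)}
    result = []
    for text in texts:
        # An empty delimiter means "split on whitespace", per the parameter docs.
        tokens = text.split(delimiter) if delimiter else text.split()
        # Tokens missing from the vocabulary are mapped to default_id.
        result.append([token_ids.get(tok, default_id) for tok in tokens])
    return result


vocab = ["hello", "world"]
print(tokenize_sketch(["hello world", "hello there"], vocab))
# [[0, 1], [0, -1]]
```

In the sketch, "there" is not in the vocabulary, so it receives the default_id of -1. Passing a different default_id changes only the value assigned to such out-of-vocabulary tokens.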