hipdf.core.column.string.StringMethods.ngrams_tokenize

Applies to Linux

StringMethods.ngrams_tokenize(n: int = 2, delimiter: str = ' ', separator: str = '_') → SeriesOrIndex

Tokenize each string on the delimiter, then generate n-grams from that string's tokens. N-grams are built within a single string only and never span two input strings; each n-gram becomes its own row in the result.

Parameters

n (int, default 2): The degree of the n-gram (number of consecutive tokens).
delimiter (str, default is whitespace): The character used to locate the split points of each string.
separator (str, default '_'): The separator to use between tokens within an n-gram.

Returns

Series or Index of object dtype, with one n-gram per row.

Examples

>>> import cudf
>>> ser = cudf.Series(['this is the', 'best book'])
>>> ser.str.ngrams_tokenize(n=2, separator='_')
0      this_is
1       is_the
2    best_book
dtype: object
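
To make the row layout of the result easier to follow, here is a minimal CPU-only sketch of the same idea in plain Python. The function name ngrams_tokenize_reference is hypothetical and is not part of hipDF or cuDF. Two behaviors in it are assumptions rather than documented guarantees: empty tokens produced by repeated delimiters are dropped, and a string with fewer than n tokens contributes no rows.

```python
from typing import Iterable, List


def ngrams_tokenize_reference(
    strings: Iterable[str],
    n: int = 2,
    delimiter: str = " ",
    separator: str = "_",
) -> List[str]:
    """Hypothetical pure-Python illustration; not the library implementation."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    result: List[str] = []
    for text in strings:
        # Assumption: empty tokens from consecutive delimiters are ignored.
        tokens = [tok for tok in text.split(delimiter) if tok]
        # Slide a window of n tokens over this string only, so n-grams
        # never cross from one input string into the next.
        # Assumption: fewer than n tokens yields no n-grams for this string.
        for start in range(len(tokens) - n + 1):
            result.append(separator.join(tokens[start:start + n]))
    return result


bigrams = ngrams_tokenize_reference(["this is the", "best book"], n=2, separator="_")
print(bigrams)  # ['this_is', 'is_the', 'best_book']
```

The flat list mirrors the three rows in the example above: two bigrams from the first string and one from the second.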