hipdf.core.column.string.StringMethods.ngrams_tokenize

Applies to Linux

StringMethods.ngrams_tokenize(n: int = 2, delimiter: str = ' ', separator: str = '_') → SeriesOrIndex

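Generate n-grams from the tokens of each string. Each string is first split into tokens at the delimiter, and every run of n consecutive tokens within that string is then joined with the separator to form one n-gram. N-grams never span two input strings, and the n-grams from all strings are returned together in a single flattened result.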

Parameters

n : int, Default 2.

The degree of the n-gram (number of consecutive tokens).

delimiter : str, Default is white-space.

The character used to locate the split points of each string.

separator : str, Default is ‘_’.

The separator to use between tokens within an n-gram.

Returns

Series or Index of object.

Examples

>>> import cudf
>>> ser = cudf.Series(['this is the', 'best book'])
>>> ser.str.ngrams_tokenize(n=2, separator='_')
0      this_is
1       is_the
2    best_book
dtype: object
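
The behavior shown in the example above can be approximated in plain Python, which may help when reasoning about results without a GPU. This is a minimal sketch, not the library's implementation: the helper name `ngrams_tokenize_reference` is hypothetical, and two details are assumptions rather than documented behavior, namely that empty tokens from repeated delimiters are dropped and that strings with fewer than n tokens contribute no n-grams.

```python
from typing import List


def ngrams_tokenize_reference(
    strings: List[str], n: int = 2, delimiter: str = " ", separator: str = "_"
) -> List[str]:
    """Pure-Python sketch of per-string n-gram tokenization (hypothetical helper)."""
    result = []
    for s in strings:
        # Split on the delimiter; dropping empty tokens is an assumption.
        tokens = [t for t in s.split(delimiter) if t]
        # Slide a window of n tokens over this string only, so n-grams
        # never cross string boundaries. Fewer than n tokens yields nothing.
        for i in range(len(tokens) - n + 1):
            result.append(separator.join(tokens[i:i + n]))
    return result


print(ngrams_tokenize_reference(["this is the", "best book"], n=2, separator="_"))
# ['this_is', 'is_the', 'best_book']
```

As in the example above, two input strings produce three output entries, because the output is flattened and no longer aligns with the input index.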