hipdf.core.column.string.StringMethods.minhash



Applies to Linux

StringMethods.minhash(seed: np.uint32, a: ColumnLike, b: ColumnLike, width: int) → SeriesOrIndex

Compute the minhash of a strings column.

This uses the MurmurHash3_x86_32 algorithm for the hash function.

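The calculation uses the formula (hv * a + b) % mersenne_prime, where hv is the hash of a substring of width characters, a and b are the provided values, and mersenne_prime is 2^61 - 1. For each pair of a and b values, the minimum result over all substrings of a row is kept, so each output row is a list with one value per pair.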
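
The formula can be illustrated for a single string with a minimal pure-Python sketch. It is not the library's implementation: `minhash_row`, `standin_hash`, and the `hash_fn` parameter are hypothetical names, the CRC32 stand-in is used only because the Python standard library has no MurmurHash3_x86_32 (so the values will not match `minhash` output), hashing a string shorter than width as a whole is an assumption, and any narrowing of the result to the output dtype is not modeled.

```python
import zlib
from typing import Callable, List, Sequence

MERSENNE_PRIME = (1 << 61) - 1  # 2^61 - 1


def standin_hash(data: bytes) -> int:
    """Stand-in 32-bit hash (CRC32). NOT MurmurHash3_x86_32."""
    return zlib.crc32(data) & 0xFFFFFFFF


def minhash_row(
    text: str,
    a: Sequence[int],
    b: Sequence[int],
    width: int,
    hash_fn: Callable[[bytes], int],
) -> List[int]:
    """Return one value per (a, b) pair for a single string."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    if width < 1:
        raise ValueError("width must be at least 1")

    # Hash every substring of `width` characters. Assumption: a string
    # shorter than `width` is hashed once as a whole.
    count = max(len(text) - width + 1, 1)
    hashes = [hash_fn(text[i:i + width].encode("utf-8")) for i in range(count)]

    # Apply (hv * a + b) % mersenne_prime to every substring hash and keep
    # the minimum for each (a, b) pair.
    return [
        min((hv * ai + bi) % MERSENNE_PRIME for hv in hashes)
        for ai, bi in zip(a, b)
    ]


signature = minhash_row("this is my", [1, 2, 3], [4, 5, 6], 5, standin_hash)
print(signature)  # three values, one per (a, b) pair
```

Passing the hash as a parameter keeps the sketch independent of any particular hash; a MurmurHash3_x86_32 implementation seeded with seed could be substituted for `standin_hash`.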

Parameters

seed : uint32

The seed used for the hash algorithm.

a : ColumnLike

Multiplier values (a in the formula) for the minhash calculation. Must be of type uint32.

b : ColumnLike

Additive values (b in the formula) for the minhash calculation. Must be of type uint32.

width : int

The width, in characters, of each substring to hash.

Examples

>>> import cudf
>>> import numpy as np
>>> s = cudf.Series(['this is my', 'favorite book'])
>>> a = cudf.Series([1, 2, 3], dtype=np.uint32)
>>> b = cudf.Series([4, 5, 6], dtype=np.uint32)
>>> s.str.minhash(0, a=a, b=b, width=5)
0    [1305480171, 462824409, 74608232]
1       [32665388, 65330773, 97996158]
dtype: list
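
Minhash signatures are typically compared position by position: the fraction of matching positions estimates the Jaccard similarity of the two inputs' substring sets. The following is a minimal sketch of that comparison in plain Python, applied to the two output rows shown above. `estimate_jaccard` is a hypothetical helper, not part of the library.

```python
from typing import Sequence


def estimate_jaccard(sig_x: Sequence[int], sig_y: Sequence[int]) -> float:
    """Estimate Jaccard similarity as the fraction of equal signature positions."""
    if len(sig_x) != len(sig_y):
        raise ValueError("signatures must have the same length")
    if len(sig_x) == 0:
        raise ValueError("signatures must not be empty")
    matches = sum(x == y for x, y in zip(sig_x, sig_y))
    return matches / len(sig_x)


# The two rows from the example output above.
row0 = [1305480171, 462824409, 74608232]
row1 = [32665388, 65330773, 97996158]
print(estimate_jaccard(row0, row1))  # 0.0, no positions match
```

With only three values per signature the estimate is very coarse; longer a and b columns give a finer-grained estimate.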