hipdf.core.column.string.StringMethods.minhash
Applies to Linux
StringMethods.minhash(seed: np.uint32, a: ColumnLike, b: ColumnLike, width: int) -> SeriesOrIndex
Compute the minhash of a strings column.
This uses the MurmurHash3_x86_32 algorithm for the hash function.
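The calculation uses the formula (hv * a + b) % mersenne_prime, where hv is the hash of a substring of width characters, a and b are the provided values, and mersenne_prime is 2^61-1. One value is produced for each (a, b) pair, so each row of the result is a list with as many entries as a and b.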
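The formula above can be sketched in pure Python. This is only an illustration of the documented calculation, not the library's implementation: the names murmur3_x86_32 and minhash_sketch are hypothetical helpers, the choice to hash UTF-8 bytes of each substring and to hash the whole string when it is shorter than width are assumptions, and no attempt is made to reproduce the library's exact output values (for example, any truncation to the output dtype).

```python
MERSENNE_PRIME = (1 << 61) - 1
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Plain-Python MurmurHash3_x86_32 (hypothetical helper for this sketch)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_len = n - n % 4

    # Body: process the input four bytes at a time (little-endian).
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def minhash_sketch(text: str, seed: int, a: list, b: list, width: int) -> list:
    """Return one minimum per (a, b) pair over all width-character substrings."""
    # Assumption: a string shorter than width is hashed as a single substring.
    count = max(1, len(text) - width + 1)
    hashes = [
        murmur3_x86_32(text[i:i + width].encode("utf-8"), seed)
        for i in range(count)
    ]
    return [
        min((hv * ai + bi) % MERSENNE_PRIME for hv in hashes)
        for ai, bi in zip(a, b)
    ]


signature = minhash_sketch("favorite book", seed=0, a=[1, 2, 3], b=[4, 5, 6], width=5)
print(signature)
```

Each entry of the returned list corresponds to one (a, b) pair, which mirrors the list-per-row shape of the method's result.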
Parameters
- seed : uint32
The seed used for the hash algorithm.
- a : ColumnLike
Values for minhash calculation. Must be of type uint32.
- b : ColumnLike
Values for minhash calculation. Must be of type uint32.
- width : int
The width of the substring to hash.
Examples
>>> import cudf
>>> import numpy as np
>>> s = cudf.Series(['this is my', 'favorite book'])
>>> a = cudf.Series([1, 2, 3], dtype=np.uint32)
>>> b = cudf.Series([4, 5, 6], dtype=np.uint32)
>>> s.str.minhash(0, a=a, b=b, width=5)
0    [1305480171, 462824409, 74608232]
1       [32665388, 65330773, 97996158]
dtype: list