hipdf.MultiIndex#
- class hipdf.MultiIndex(levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, **kwargs)#
Bases: Frame, BaseIndex, NotIterable
A multi-level or hierarchical index.
Provides N-Dimensional indexing into Series and DataFrame objects.
Parameters#
- levels : sequence of arrays
The unique labels for each level.
- codes : sequence of arrays
Integers for each level designating which label at each location.
- sortorder : optional int
Not yet supported.
- names : optional sequence of objects
Names for each of the index levels.
- copy : bool, default False
Copy the levels and codes.
- verify_integrity : bool, default True
Check that the levels/codes are consistent and valid. Not yet supported.
Attributes#
names nlevels dtypes levels codes
Methods#
from_arrays from_tuples from_product from_frame set_levels set_codes to_frame to_flat_index sortlevel droplevel swaplevel reorder_levels remove_unused_levels get_level_values get_loc drop
Returns#
MultiIndex
Examples#
>>> import cudf
>>> cudf.MultiIndex(
... levels=[[1, 2], ['blue', 'red']], codes=[[0, 0, 1, 1], [1, 0, 1, 0]])
MultiIndex([(1, 'red'),
            (1, 'blue'),
            (2, 'red'),
            (2, 'blue')],
           )
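The relationship between `levels` and `codes` can be sketched in plain Python, without a GPU. This is only an illustration of the data layout described above, not the library's implementation; the helper name `decode` is hypothetical.

```python
# Plain-Python sketch of how `levels` and `codes` combine into index entries.
# `decode` is a hypothetical helper, not part of the library.
levels = [[1, 2], ["blue", "red"]]
codes = [[0, 0, 1, 1], [1, 0, 1, 0]]


def decode(levels, codes):
    """Return one tuple per row by looking each code up in its level."""
    n_rows = len(codes[0])
    return [
        tuple(level[code[row]] for level, code in zip(levels, codes))
        for row in range(n_rows)
    ]


rows = decode(levels, codes)
print(rows)  # [(1, 'red'), (1, 'blue'), (2, 'red'), (2, 'blue')]
```

Each code is a position into the corresponding level, so repeated labels are stored once in `levels` and referenced many times in `codes`.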
- __init__(levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, **kwargs)#
Methods
__init__([levels, codes, sortorder, names, ...])
abs() : Return a Series/DataFrame with absolute numeric value of each element.
all([axis, skipna, level]) : Return whether all elements are True in DataFrame.
any([axis, skipna, level]) : Return whether any element is True in DataFrame.
append(other) : Append a collection of MultiIndex objects together
argsort([by, axis, kind, order, ascending, ...]) : Return the integer indices that would sort the Series values.
astype(dtype[, copy]) : Create an Index with values cast to dtypes.
copy([names, dtype, levels, codes, deep, name]) : Returns copy of MultiIndex object.
deserialize(header, frames) : Generate an object from a serialized representation.
device_deserialize(header, frames) : Perform device-side deserialization tasks.
device_serialize() : Serialize data and metadata associated with device memory.
difference(other[, sort]) : Return a new Index with elements from the index that are not in other.
dot(other[, reflect]) : Get dot product of frame and other, (binary operator dot).
drop_duplicates([keep, nulls_are_equal]) : Drop duplicate rows in index.
droplevel([level]) : Removes the specified levels from the MultiIndex.
dropna([how]) : Drop null rows from Index.
duplicated([keep]) : Indicate duplicate index values.
equals(other) : Test whether two objects contain the same elements.
factorize([sort, na_sentinel, use_na_sentinel])
fillna(value) : Fill null values with the specified value.
find_label_range(loc) : Translate a label-based slice to an index-based slice
from_arrow(data) : Convert from PyArrow Table to Frame
from_frame(df[, names]) : Make a MultiIndex from a DataFrame.
from_pandas(multiindex[, nan_as_null]) : Convert from a Pandas MultiIndex
from_product(arrays[, names]) : Make a MultiIndex from the cartesian product of multiple iterables.
from_tuples(tuples[, names]) : Convert list of tuples to MultiIndex.
get_level_values(level) : Return the values at the requested level
get_loc(key[, method, tolerance]) : Get location for a label or a tuple of labels.
get_slice_bound(label, side[, kind]) : Calculate slice bound that corresponds to given label.
head([n]) : Return the first n rows.
host_deserialize(header, frames) : Perform device-side deserialization tasks.
host_serialize() : Serialize data and metadata associated with host memory.
intersection(other[, sort]) : Form the intersection of two Index objects.
is_boolean() : Check if the Index only consists of booleans.
is_categorical() : Check if the Index holds categorical data.
is_floating() : Check if the Index is a floating type.
is_integer() : Check if the Index only consists of integers.
is_interval() : Check if the Index holds Interval objects.
is_numeric() : Check if the Index only consists of numeric data.
is_object() : Check if the Index is of the object dtype.
isin(values[, level]) : Return a boolean array where the index values are in values.
isna() : Identify missing values.
isnull() : Identify missing values.
join(other[, how, level, return_indexers, sort]) : Compute join_index and indexers to conform data structures to the new index.
kurt([axis, skipna, level, numeric_only]) : Return Fisher's unbiased kurtosis of a sample.
kurtosis([axis, skipna, level, numeric_only]) : Return Fisher's unbiased kurtosis of a sample.
mask(cond[, other, inplace]) : Replace values where the condition is True.
max([axis, skipna, level, numeric_only]) : Return the maximum of the values in the DataFrame.
mean([axis, skipna, level, numeric_only]) : Return the mean of the values for the requested axis.
median([axis, skipna, level, numeric_only]) : Return the median of the values for the requested axis.
memory_usage([deep]) : Return the memory usage of an object.
min([axis, skipna, level, numeric_only]) : Return the minimum of the values in the DataFrame.
nans_to_nulls() : Convert nans (if any) to nulls
notna() : Identify non-missing values.
notnull() : Identify non-missing values.
nunique([dropna]) : Returns a per column mapping with counts of unique values for each column.
pipe(func, *args, **kwargs) : Apply func(self, *args, **kwargs).
prod([axis, skipna, dtype, level, ...]) : Return product of the values in the DataFrame.
product([axis, skipna, dtype, level, ...]) : Return product of the values in the DataFrame.
rename(names[, inplace]) : Alter MultiIndex level names
repeat(repeats[, axis]) : Repeat elements of an Index.
rolling(window[, min_periods, center, axis, ...]) : Rolling window calculations.
searchsorted(values[, side, ascending, ...]) : Find indices where elements should be inserted to maintain order
serialize() : Generate an equivalent serializable representation of an object.
set_names(names[, level, inplace]) : Set Index or MultiIndex name.
shift([periods, freq]) : Not yet implemented
skew([axis, skipna, level, numeric_only]) : Return unbiased Fisher-Pearson skew of a sample.
sort_values([return_indexer, ascending, ...]) : Return a sorted copy of the index, and optionally return the indices that sorted the index itself.
std([axis, skipna, level, ddof, numeric_only]) : Return sample standard deviation of the DataFrame.
sum([axis, skipna, dtype, level, ...]) : Return sum of the values in the DataFrame.
swaplevel([i, j]) : Swap level i with level j.
tail([n]) : Returns the last n rows as a new DataFrame or Series
take(indices) : Return a new index containing the rows specified by indices
to_arrow() : Convert to arrow Table
to_cupy([dtype, copy, na_value]) : Convert the Frame to a CuPy array.
to_dlpack() : Converts a cuDF object into a DLPack tensor.
to_frame([index, name, allow_duplicates]) : Create a DataFrame with the levels of the MultiIndex as columns.
to_hdf(path_or_buf, key, *args, **kwargs) : Write the contained data to an HDF5 file using HDFStore.
to_json([path_or_buf]) : Convert the cuDF object to a JSON string.
to_list()
to_numpy() : Convert the Frame to a NumPy array.
to_pandas([nullable]) : Convert to a Pandas Index.
to_series([index, name]) : Create a Series with both index and values equal to the index keys.
to_string() : Convert to string
tolist()
union(other[, sort]) : Form the union of two Index objects.
unique() : Return unique values in the index.
var([axis, skipna, level, ddof, numeric_only]) : Return unbiased variance of the DataFrame.
where(cond[, other, inplace]) : Replace values where the condition is False.
Attributes
codes : Returns the codes of the underlying MultiIndex.
hasnans : Return True if there are any NaNs or nulls.
is_monotonic : Return boolean if values in the object are monotonic_increasing.
is_monotonic_decreasing : Return if the index is monotonic decreasing (only equal or decreasing) values.
is_monotonic_increasing : Return if the index is monotonic increasing (only equal or increasing) values.
is_unique : Return if the index has unique values.
levels : Returns list of levels in the MultiIndex
name : Returns the name of the Index.
names : Returns a tuple containing the name of the Index.
ndim : Dimension of the data.
nlevels : Integer number of levels in this MultiIndex.
shape : Get a tuple representing the dimensionality of the data.
size : Return the number of elements in the underlying data.
Not yet implemented.
values : Return a CuPy representation of the MultiIndex.
values_host : Return a numpy representation of the MultiIndex.
- property names#
Returns a tuple containing the name of the Index.
- to_series(index=None, name=None)#
Create a Series with both index and values equal to the index keys. Useful with map for returning an indexer based on an index.
Parameters#
- index : Index, optional
Index of resulting Series. If None, defaults to original index.
- name : str, optional
Name of resulting Series. If None, defaults to name of original index.
Returns#
- Series
The dtype will be based on the type of the Index values.
- astype(dtype, copy: bool = True)#
Create an Index with values cast to dtypes.
The class of a new Index is determined by dtype. When conversion is impossible, a ValueError exception is raised.
Parameters#
- dtype : numpy.dtype
Use a numpy.dtype to cast the entire Index object to.
- copy : bool, default True
By default, astype always returns a newly allocated object. If copy is set to False and internal requirements on dtype are satisfied, the original data is used to create a new Index or the original Index is returned.
Returns#
- Index
Index with values cast to specified dtype.
Examples#
>>> import cudf
>>> index = cudf.Index([1, 2, 3])
>>> index
Int64Index([1, 2, 3], dtype='int64')
>>> index.astype('float64')
Float64Index([1.0, 2.0, 3.0], dtype='float64')
- rename(names, inplace=False)#
Alter MultiIndex level names
Parameters#
- names : list of label
Names to set, length must be the same as number of levels
- inplace : bool, default False
If True, modifies objects directly, otherwise returns a new MultiIndex instance
Returns#
None or MultiIndex
Examples#
Renaming each level of a MultiIndex to the specified names:
>>> midx = cudf.MultiIndex.from_product(
...     [('A', 'B'), (2020, 2021)], names=['c1', 'c2'])
>>> midx.rename(['lv1', 'lv2'])
MultiIndex([('A', 2020),
            ('A', 2021),
            ('B', 2020),
            ('B', 2021)],
           names=['lv1', 'lv2'])
>>> midx.rename(['lv1', 'lv2'], inplace=True)
>>> midx
MultiIndex([('A', 2020),
            ('A', 2021),
            ('B', 2020),
            ('B', 2021)],
           names=['lv1', 'lv2'])
The names argument must be a list, and must have the same length as MultiIndex.levels:
>>> midx.rename(['lv0'])
Traceback (most recent call last):
ValueError: Length of names must match number of levels in MultiIndex.
- set_names(names, level=None, inplace=False)#
Set Index or MultiIndex name. Able to set new names partially and by level.
Parameters#
- names : label or list of label
Name(s) to set.
- level : int, label or list of int or label, optional
If the index is a MultiIndex, level(s) to set (None for all levels). Otherwise level must be None.
- inplace : bool, default False
Modifies the object directly, instead of creating a new Index or MultiIndex.
Returns#
- Index
The same type as the caller or None if inplace is True.
See Also#
cudf.Index.rename : Able to set new names without level.
Examples#
>>> import cudf
>>> idx = cudf.Index([1, 2, 3, 4])
>>> idx
Int64Index([1, 2, 3, 4], dtype='int64')
>>> idx.set_names('quarter')
Int64Index([1, 2, 3, 4], dtype='int64', name='quarter')
>>> idx = cudf.MultiIndex.from_product([['python', 'cobra'],
...                                     [2018, 2019]])
>>> idx
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           )
>>> idx.names
FrozenList([None, None])
>>> idx.set_names(['kind', 'year'], inplace=True)
>>> idx.names
FrozenList(['kind', 'year'])
>>> idx.set_names('species', level=0, inplace=True)
>>> idx.names
FrozenList(['species', 'year'])
- property name#
Returns the name of the Index.
- copy(names=None, dtype=None, levels=None, codes=None, deep=False, name=None)#
Returns copy of MultiIndex object.
Returns a copy of MultiIndex. The levels and codes value can be set to the provided parameters. When they are provided, the returned MultiIndex is always newly constructed.
Parameters#
- names : sequence of objects, optional (default None)
Names for each of the index levels.
- dtype : object, optional (default None)
MultiIndex dtype, only supports None or object type
Deprecated since version 23.02: The dtype parameter is deprecated and will be removed in a future version of cudf. Use the astype method instead.
- levels : sequence of arrays, optional (default None)
The unique labels for each level. Original values used if None.
Deprecated since version 23.02: The levels parameter is deprecated and will be removed in a future version of cudf.
- codes : sequence of arrays, optional (default None)
Integers for each level designating which label at each location. Original values used if None.
Deprecated since version 23.02: The codes parameter is deprecated and will be removed in a future version of cudf.
- deep : bool (default False)
If True, ._data, ._levels, ._codes will be copied. Ignored if levels or codes are specified.
- name : object, optional (default None)
To keep consistent with Index.copy, should not be used.
Returns#
Copy of MultiIndex Instance
Examples#
>>> df = cudf.DataFrame({'Close': [3400.00, 226.58, 3401.80, 228.91]})
>>> idx1 = cudf.MultiIndex(
...     levels=[['2020-08-27', '2020-08-28'], ['AMZN', 'MSFT']],
...     codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
...     names=['Date', 'Symbol'])
>>> idx2 = idx1.copy(
...     levels=[['day1', 'day2'], ['com1', 'com2']],
...     codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
...     names=['col1', 'col2'])
>>> df.index = idx1
>>> df
                     Close
Date       Symbol
2020-08-27 AMZN    3400.00
           MSFT     226.58
2020-08-28 AMZN    3401.80
           MSFT     228.91
>>> df.index = idx2
>>> df
             Close
col1 col2
day1 com1  3400.00
     com2   226.58
day2 com1  3401.80
     com2   228.91
- property codes#
Returns the codes of the underlying MultiIndex.
Examples#
>>> import cudf
>>> df = cudf.DataFrame({'a':[1, 2, 3], 'b':[10, 11, 12]})
>>> midx = cudf.MultiIndex.from_frame(df)
>>> midx
MultiIndex([(1, 10),
            (2, 11),
            (3, 12)],
           names=['a', 'b'])
>>> midx.codes
FrozenList([[0, 1, 2], [0, 1, 2]])
- get_slice_bound(label, side, kind=None)#
Calculate slice bound that corresponds to given label. Returns leftmost (one-past-the-rightmost if side=='right') position of given label.
Parameters#
label : object
side : {‘left’, ‘right’}
kind : {‘ix’, ‘loc’, ‘getitem’}
Returns#
- int
Index of label.
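The "leftmost" versus "one-past-the-rightmost" distinction is the same one the standard library's `bisect` module makes. The sketch below assumes a sorted list of labels and a hypothetical helper `slice_bound`; it shows the idea only and is not how the library computes the bound.

```python
from bisect import bisect_left, bisect_right


def slice_bound(sorted_labels, label, side):
    """Return the slice bound for `label` in an already sorted list.

    Hypothetical helper: 'left' gives the leftmost position of the label,
    'right' gives one past its rightmost position.
    """
    if side == "left":
        return bisect_left(sorted_labels, label)
    if side == "right":
        return bisect_right(sorted_labels, label)
    raise ValueError("side must be 'left' or 'right'")


labels = ["a", "b", "b", "c"]
left = slice_bound(labels, "b", "left")
right = slice_bound(labels, "b", "right")
print(left, right, labels[left:right])  # 1 3 ['b', 'b']
```

Using both bounds as `labels[left:right]` selects exactly the run of matching labels.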
- property nlevels#
Integer number of levels in this MultiIndex.
- property levels#
Returns list of levels in the MultiIndex
Returns#
List of Series objects
Examples#
>>> import cudf
>>> df = cudf.DataFrame({'a':[1, 2, 3], 'b':[10, 11, 12]})
>>> cudf.MultiIndex.from_frame(df)
MultiIndex([(1, 10),
            (2, 11),
            (3, 12)],
           names=['a', 'b'])
>>> midx = cudf.MultiIndex.from_frame(df)
>>> midx
MultiIndex([(1, 10),
            (2, 11),
            (3, 12)],
           names=['a', 'b'])
>>> midx.levels
[Int64Index([1, 2, 3], dtype='int64', name='a'), Int64Index([10, 11, 12], dtype='int64', name='b')]
- property ndim#
Dimension of the data. For MultiIndex ndim is always 2.
- isin(values, level=None)#
Return a boolean array where the index values are in values.
Compute boolean array of whether each index value is found in the passed set of values. The length of the returned boolean array matches the length of the index.
Parameters#
- values : set, list-like, Index or MultiIndex
Sought values.
- level : str or int, optional
Name or position of the index level to use (if the index is a MultiIndex).
Returns#
- is_contained : cupy array
CuPy array of boolean values.
Notes#
When level is None, values can only be a MultiIndex, or a set or list-like of tuples. When level is provided, values can be an Index or MultiIndex, or a set or list-like of tuples.
Examples#
>>> import cudf
>>> import pandas as pd
>>> midx = cudf.from_pandas(pd.MultiIndex.from_arrays([[1,2,3],
...                                  ['red', 'blue', 'green']],
...                                  names=('number', 'color')))
>>> midx
MultiIndex([(1, 'red'),
            (2, 'blue'),
            (3, 'green')],
           names=['number', 'color'])
Check whether the strings in the ‘color’ level of the MultiIndex are in a list of colors.
>>> midx.isin(['red', 'orange', 'yellow'], level='color')
array([ True, False, False])
To check across the levels of a MultiIndex, pass a list of tuples:
>>> midx.isin([(1, 'red'), (3, 'red')])
array([ True, False, False])
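The two membership modes above (one level versus whole tuples) can be sketched in plain Python. The function `isin_sketch` and its `names` argument are hypothetical and return a list of booleans rather than a CuPy array.

```python
# Plain-Python sketch of the two `isin` modes; not the library's code.
index = [(1, "red"), (2, "blue"), (3, "green")]
names = ["number", "color"]


def isin_sketch(index, values, level=None, names=()):
    """Return one boolean per index entry."""
    wanted = set(values)
    if level is None:
        # Compare whole tuples across all levels.
        return [entry in wanted for entry in index]
    # Resolve a level name to its position, then compare that level only.
    pos = names.index(level) if isinstance(level, str) else level
    return [entry[pos] in wanted for entry in index]


by_level = isin_sketch(index, ["red", "orange", "yellow"], level="color", names=names)
by_tuple = isin_sketch(index, [(1, "red"), (3, "red")])
print(by_level)  # [True, False, False]
print(by_tuple)  # [True, False, False]
```

Note that `(3, 'red')` does not match `(3, 'green')`: without `level`, every level of the entry must agree.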
- where(cond, other=None, inplace=False)#
Replace values where the condition is False.
Parameters#
- cond : bool Series/DataFrame, array-like
Where cond is True, keep the original value. Where False, replace with corresponding value from other. Callables are not supported.
- other : scalar, list of scalars, Series/DataFrame
Entries where cond is False are replaced with corresponding value from other. Callables are not supported. Default is None.
DataFrame expects only a scalar, an array-like of scalars, or a DataFrame with the same dimensions as self.
Series expects only a scalar or a Series-like of the same length.
- inplace : bool, default False
Whether to perform the operation in place on the data.
Returns#
Same type as caller
Examples#
>>> import cudf
>>> df = cudf.DataFrame({"A":[1, 4, 5], "B":[3, 5, 8]})
>>> df.where(df % 2 == 0, [-1, -1])
   A  B
0 -1 -1
1  4 -1
2 -1  8
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.where(ser > 2, 10)
0     4
1     3
2    10
3    10
4    10
dtype: int64
>>> ser.where(ser > 2)
0       4
1       3
2    <NA>
3    <NA>
4    <NA>
dtype: int64
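The keep-or-replace rule can be written out in a few lines of plain Python. This sketch handles only a scalar `other` and uses `None` to stand in for `<NA>`; `where_sketch` is a hypothetical name.

```python
def where_sketch(values, cond, other=None):
    """Keep each value where cond is True, otherwise substitute `other`.

    Hypothetical helper; `None` stands in for a null (<NA>) entry.
    """
    return [value if keep else other for value, keep in zip(values, cond)]


ser = [4, 3, 2, 1, 0]
cond = [value > 2 for value in ser]

replaced = where_sketch(ser, cond, 10)
nulled = where_sketch(ser, cond)
print(replaced)  # [4, 3, 10, 10, 10]
print(nulled)    # [4, 3, None, None, None]
```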
- property size#
Return the number of elements in the underlying data.
Returns#
size : Size of the DataFrame / Index / Series / MultiIndex
Examples#
Size of an empty dataframe is 0.
>>> import cudf
>>> df = cudf.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df.size
0
>>> df = cudf.DataFrame(index=[1, 2, 3])
>>> df
Empty DataFrame
Columns: []
Index: [1, 2, 3]
>>> df.size
0
DataFrame with values
>>> df = cudf.DataFrame({'a': [10, 11, 12],
...         'b': ['hello', 'rapids', 'ai']})
>>> df
    a       b
0  10   hello
1  11  rapids
2  12      ai
>>> df.size
6
>>> df.index
RangeIndex(start=0, stop=3)
>>> df.index.size
3
Size of an Index
>>> index = cudf.Index([])
>>> index
Float64Index([], dtype='float64')
>>> index.size
0
>>> index = cudf.Index([1, 2, 3, 10])
>>> index
Int64Index([1, 2, 3, 10], dtype='int64')
>>> index.size
4
Size of a MultiIndex
>>> midx = cudf.MultiIndex(
...     levels=[["a", "b", "c", None], ["1", None, "5"]],
...     codes=[[0, 0, 1, 2, 3], [0, 2, 1, 1, 0]],
...     names=["x", "y"],
... )
>>> midx
MultiIndex([( 'a',  '1'),
            ( 'a',  '5'),
            ( 'b', <NA>),
            ( 'c', <NA>),
            (<NA>,  '1')],
           names=['x', 'y'])
>>> midx.size
5
- take(indices)#
Return a new index containing the rows specified by indices
Parameters#
- indices : array-like
Array of ints indicating which positions to take.
- axis : int
The axis over which to select values, always 0.
allow_fill : Unsupported
fill_value : Unsupported
Returns#
- out : Index
New object with desired subset of rows.
Examples#
>>> idx = cudf.Index(['a', 'b', 'c', 'd', 'e'])
>>> idx.take([2, 0, 4, 3])
StringIndex(['c' 'a' 'e' 'd'], dtype='object')
- __getitem__(index)#
- to_frame(index=True, name=<no_default>, allow_duplicates=False)#
Create a DataFrame with the levels of the MultiIndex as columns.
Column ordering is determined by the DataFrame constructor with data as a dict.
Parameters#
- index : bool, default True
Set the index of the returned DataFrame as the original MultiIndex.
- name : list / sequence of str, optional
The passed names should substitute index level names.
- allow_duplicates : bool, optional, default False
Allow duplicate column labels to be created. Note that this parameter is non-functional because duplicate column labels aren’t supported in cudf.
Returns#
DataFrame
Examples#
>>> import cudf
>>> mi = cudf.MultiIndex.from_tuples([('a', 'c'), ('b', 'd')])
>>> mi
MultiIndex([('a', 'c'),
            ('b', 'd')],
           )
>>> df = mi.to_frame()
>>> df
     0  1
a c  a  c
b d  b  d
>>> df = mi.to_frame(index=False)
>>> df
   0  1
0  a  c
1  b  d
>>> df = mi.to_frame(name=['x', 'y'])
>>> df
     x  y
a c  a  c
b d  b  d
- get_level_values(level)#
Return the values at the requested level
Parameters#
level : int or label
Returns#
An Index containing the values at the requested level.
- classmethod from_tuples(tuples, names=None)#
Convert list of tuples to MultiIndex.
Parameters#
- tuples : list / sequence of tuple-likes
Each tuple is the index of one row/column.
- names : list / sequence of str, optional
Names for the levels in the index.
Returns#
MultiIndex
See Also#
MultiIndex.from_product : Make a MultiIndex from cartesian product of iterables.
MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
Examples#
>>> tuples = [(1, 'red'), (1, 'blue'),
...           (2, 'red'), (2, 'blue')]
>>> cudf.MultiIndex.from_tuples(tuples, names=('number', 'color'))
MultiIndex([(1, 'red'),
            (1, 'blue'),
            (2, 'red'),
            (2, 'blue')],
           names=['number', 'color'])
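Going from tuples to the `levels`/`codes` representation is a per-column factorization. The sketch below assumes each level holds the sorted unique labels of its column (an assumption for illustration; the library's internal ordering may differ), and `factorize_tuples` is a hypothetical helper.

```python
def factorize_tuples(tuples):
    """Split row tuples into (levels, codes).

    Hypothetical helper. Assumes each level is the sorted set of unique
    labels in that column; each code is a position into that level.
    """
    columns = list(zip(*tuples))  # one sequence of labels per level
    levels = [sorted(set(column)) for column in columns]
    codes = [
        [level.index(label) for label in column]
        for level, column in zip(levels, columns)
    ]
    return levels, codes


tuples = [(1, "red"), (1, "blue"), (2, "red"), (2, "blue")]
levels, codes = factorize_tuples(tuples)
print(levels)  # [[1, 2], ['blue', 'red']]
print(codes)   # [[0, 0, 1, 1], [1, 0, 1, 0]]
```

Under that assumption the result matches the `levels` and `codes` arguments used in the class-level example for the same four entries.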
- to_numpy()#
Convert the Frame to a NumPy array.
Parameters#
- dtype : str or numpy.dtype, optional
The dtype to pass to numpy.asarray().
- copy : bool, default True
Whether to ensure that the returned value is not a view on another array. This parameter must be True since cuDF must copy device memory to host to provide a numpy array.
- na_value : Any, default None
The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.
Returns#
numpy.ndarray
- property values_host#
Return a numpy representation of the MultiIndex.
Only the values in the MultiIndex will be returned.
Returns#
- out : numpy.ndarray
The values of the MultiIndex.
Examples#
>>> import cudf
>>> midx = cudf.MultiIndex(
...     levels=[[1, 3, 4, 5], [1, 2, 5]],
...     codes=[[0, 0, 1, 2, 3], [0, 2, 1, 1, 0]],
...     names=["x", "y"],
... )
>>> midx.values_host
array([(1, 1), (1, 5), (3, 2), (4, 2), (5, 1)], dtype=object)
>>> type(midx.values_host)
<class 'numpy.ndarray'>
- property values#
Return a CuPy representation of the MultiIndex.
Only the values in the MultiIndex will be returned.
Returns#
- out: cupy.ndarray
The values of the MultiIndex.
Examples#
>>> import cudf
>>> midx = cudf.MultiIndex(
...     levels=[[1, 3, 4, 5], [1, 2, 5]],
...     codes=[[0, 0, 1, 2, 3], [0, 2, 1, 1, 0]],
...     names=["x", "y"],
... )
>>> midx.values
array([[1, 1],
       [1, 5],
       [3, 2],
       [4, 2],
       [5, 1]])
>>> type(midx.values)
<class 'cupy...ndarray'>
- classmethod from_frame(df, names=None)#
Make a MultiIndex from a DataFrame.
Parameters#
- df : DataFrame
DataFrame to be converted to MultiIndex.
- names : list-like, optional
If no names are provided, use the column names, or a tuple of column names if the columns are a MultiIndex. If a sequence, overwrite names with the given sequence.
Returns#
- MultiIndex
The MultiIndex representation of the given DataFrame.
See Also#
MultiIndex.from_tuples : Convert list of tuples to MultiIndex.
MultiIndex.from_product : Make a MultiIndex from cartesian product of iterables.
Examples#
>>> import cudf
>>> df = cudf.DataFrame([['HI', 'Temp'], ['HI', 'Precip'],
...                      ['NJ', 'Temp'], ['NJ', 'Precip']],
...                     columns=['a', 'b'])
>>> df
    a       b
0  HI    Temp
1  HI  Precip
2  NJ    Temp
3  NJ  Precip
>>> cudf.MultiIndex.from_frame(df)
MultiIndex([('HI', 'Temp'),
            ('HI', 'Precip'),
            ('NJ', 'Temp'),
            ('NJ', 'Precip')],
           names=['a', 'b'])
Using explicit names, instead of the column names
>>> cudf.MultiIndex.from_frame(df, names=['state', 'observation'])
MultiIndex([('HI', 'Temp'),
            ('HI', 'Precip'),
            ('NJ', 'Temp'),
            ('NJ', 'Precip')],
           names=['state', 'observation'])
- classmethod from_product(arrays, names=None)#
Make a MultiIndex from the cartesian product of multiple iterables.
Parameters#
- arrays : list / sequence of iterables
Each iterable has unique labels for each level of the index.
- names : list / sequence of str, optional
Names for the levels in the index. If not explicitly provided, names will be inferred from the elements of arrays if an element has a name attribute.
Returns#
MultiIndex
See Also#
MultiIndex.from_tuples : Convert list of tuples to MultiIndex.
MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
Examples#
>>> numbers = [0, 1, 2]
>>> colors = ['green', 'purple']
>>> cudf.MultiIndex.from_product([numbers, colors],
...                              names=['number', 'color'])
MultiIndex([(0, 'green'),
            (0, 'purple'),
            (1, 'green'),
            (1, 'purple'),
            (2, 'green'),
            (2, 'purple')],
           names=['number', 'color'])
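The entries produced above are the cartesian product of the input iterables, with the last iterable varying fastest. The standard library's `itertools.product` yields the same ordering, as this plain-Python sketch shows (it builds a list of tuples, not a MultiIndex):

```python
from itertools import product

numbers = [0, 1, 2]
colors = ["green", "purple"]

# Every combination of one label per level; the rightmost level cycles fastest.
entries = list(product(numbers, colors))
print(entries)
# [(0, 'green'), (0, 'purple'), (1, 'green'), (1, 'purple'), (2, 'green'), (2, 'purple')]
```

The result has `len(numbers) * len(colors)` entries, which is why this constructor suits dense grids of labels.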
- swaplevel(i=-2, j=-1)#
Swap level i with level j. Calling this method does not change the ordering of the values.
Parameters#
- i : int or str, default -2
First level of index to be swapped.
- j : int or str, default -1
Second level of index to be swapped.
Returns#
- MultiIndex
A new MultiIndex.
Examples#
>>> import cudf
>>> mi = cudf.MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
...                      codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> mi
MultiIndex([('a', 'bb'),
            ('a', 'aa'),
            ('b', 'bb'),
            ('b', 'aa')],
           )
>>> mi.swaplevel(0, 1)
MultiIndex([('bb', 'a'),
            ('aa', 'a'),
            ('bb', 'b'),
            ('aa', 'b')],
           )
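Swapping levels reorders the fields inside each entry while leaving the row order untouched. A plain-Python sketch over tuples, with a hypothetical helper `swaplevel_sketch` that accepts integer positions only:

```python
def swaplevel_sketch(entries, i=-2, j=-1):
    """Swap positions i and j inside every entry; row order is unchanged.

    Hypothetical helper; only integer level positions are handled here.
    """
    swapped = []
    for entry in entries:
        row = list(entry)
        row[i], row[j] = row[j], row[i]
        swapped.append(tuple(row))
    return swapped


entries = [("a", "bb"), ("a", "aa"), ("b", "bb"), ("b", "aa")]
result = swaplevel_sketch(entries, 0, 1)
print(result)  # [('bb', 'a'), ('aa', 'a'), ('bb', 'b'), ('aa', 'b')]
```

With the defaults `i=-2, j=-1` the two innermost levels are swapped, which for a two-level index gives the same result as `(0, 1)`.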
- droplevel(level=-1)#
Removes the specified levels from the MultiIndex.
Parameters#
- level : level name or index, list-like
Integer, name or list of such, specifying one or more levels to drop from the MultiIndex
Returns#
A MultiIndex or Index object, depending on the number of remaining levels.
Examples#
>>> import cudf
>>> idx = cudf.MultiIndex.from_frame(
...     cudf.DataFrame(
...         {
...             "first": ["a", "a", "a", "b", "b", "b"],
...             "second": [1, 1, 2, 2, 3, 3],
...             "third": [0, 1, 2, 0, 1, 2],
...         }
...     )
... )
Dropping level by index:
>>> idx.droplevel(0)
MultiIndex([(1, 0),
            (1, 1),
            (2, 2),
            (2, 0),
            (3, 1),
            (3, 2)],
           names=['second', 'third'])
Dropping level by name:
>>> idx.droplevel("first")
MultiIndex([(1, 0),
            (1, 1),
            (2, 2),
            (2, 0),
            (3, 1),
            (3, 2)],
           names=['second', 'third'])
Dropping multiple levels:
>>> idx.droplevel(["first", "second"])
Int64Index([0, 1, 2, 0, 1, 2], dtype='int64', name='third')
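The return-type rule (a multi-level result while two or more levels remain, a flat one when a single level is left) can be sketched in plain Python. `droplevel_sketch` is a hypothetical helper that returns `(values, names)` pairs instead of index objects and assumes level names are strings.

```python
def droplevel_sketch(entries, names, level=-1):
    """Drop one or more levels, given by position or by (string) name.

    Hypothetical helper. Returns (list of tuples, list of names) when several
    levels remain, or (flat list, single name) when only one level is left.
    """
    requested = level if isinstance(level, (list, tuple)) else [level]
    drop = {
        names.index(item) if isinstance(item, str) else item % len(names)
        for item in requested
    }
    keep = [pos for pos in range(len(names)) if pos not in drop]
    if not keep:
        raise ValueError("cannot drop every level")
    if len(keep) == 1:
        return [entry[keep[0]] for entry in entries], names[keep[0]]
    return (
        [tuple(entry[pos] for pos in keep) for entry in entries],
        [names[pos] for pos in keep],
    )


names = ["first", "second", "third"]
entries = list(zip(["a", "a", "a", "b", "b", "b"],
                   [1, 1, 2, 2, 3, 3],
                   [0, 1, 2, 0, 1, 2]))

two_levels = droplevel_sketch(entries, names, 0)
one_level = droplevel_sketch(entries, names, ["first", "second"])
print(two_levels)
print(one_level)
```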
- to_pandas(nullable=False, **kwargs)#
Convert to a Pandas Index.
Parameters#
- nullable : bool, default False
If nullable is True, the resulting index will have a corresponding nullable Pandas dtype. If there is no corresponding nullable Pandas dtype present, the resulting dtype will be a regular pandas dtype. If nullable is False, the resulting index will either convert null values to np.nan or None depending on the dtype.
Examples#
>>> import cudf
>>> idx = cudf.Index([-3, 10, 15, 20])
>>> idx
Int64Index([-3, 10, 15, 20], dtype='int64')
>>> idx.to_pandas()
Int64Index([-3, 10, 15, 20], dtype='int64')
>>> type(idx.to_pandas())
<class 'pandas.core.indexes.numeric.Int64Index'>
>>> type(idx)
<class 'cudf.core.index.Int64Index'>
- classmethod from_pandas(multiindex, nan_as_null=<no_default>)#
Convert from a Pandas MultiIndex
Raises#
TypeError for invalid input type.
Examples#
>>> import cudf
>>> import pandas as pd
>>> pmi = pd.MultiIndex(levels=[['a', 'b'], ['c', 'd']],
...                     codes=[[0, 1], [1, 1]])
>>> cudf.from_pandas(pmi)
MultiIndex([('a', 'd'),
            ('b', 'd')],
           )
- property is_unique#
Return if the index has unique values.
- property dtype#
- property is_monotonic_increasing#
Return if the index is monotonic increasing (only equal or increasing) values.
- property is_monotonic_decreasing#
Return if the index is monotonic decreasing (only equal or decreasing) values.
- fillna(value)#
Fill null values with the specified value.
Parameters#
- value : scalar
Scalar value to use to fill nulls. This value cannot be list-like.
Returns#
filled : MultiIndex
Examples#
>>> import cudf
>>> index = cudf.MultiIndex(
...         levels=[["a", "b", "c", None], ["1", None, "5"]],
...         codes=[[0, 0, 1, 2, 3], [0, 2, 1, 1, 0]],
...         names=["x", "y"],
...       )
>>> index
MultiIndex([( 'a',  '1'),
            ( 'a',  '5'),
            ( 'b', <NA>),
            ( 'c', <NA>),
            (<NA>,  '1')],
           names=['x', 'y'])
>>> index.fillna('hello')
MultiIndex([(    'a',     '1'),
            (    'a',     '5'),
            (    'b', 'hello'),
            (    'c', 'hello'),
            ('hello',     '1')],
           names=['x', 'y'])
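As the example shows, the single scalar is applied to nulls in every level. A plain-Python sketch of that behavior, using `None` for a null and a hypothetical helper `fillna_sketch`:

```python
def fillna_sketch(entries, value):
    """Replace every null (None here) in every level with one scalar value."""
    return [
        tuple(value if label is None else label for label in entry)
        for entry in entries
    ]


entries = [("a", "1"), ("a", "5"), ("b", None), ("c", None), (None, "1")]
filled = fillna_sketch(entries, "hello")
print(filled)
# [('a', '1'), ('a', '5'), ('b', 'hello'), ('c', 'hello'), ('hello', '1')]
```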
- memory_usage(deep=False)#
Return the memory usage of an object.
Parameters#
- deep : bool
The deep parameter is ignored and is only included for pandas compatibility.
Returns#
The total bytes used.
- difference(other, sort=None)#
Return a new Index with elements from the index that are not in other.
This is the set difference of two Index objects.
Parameters#
other : Index or array-like
sort : False or None, default None
Whether to sort the resulting index. By default, cudf attempts to sort the values and catches any TypeError raised by incomparable elements.
None : Attempt to sort the result, but catch any TypeErrors from comparing incomparable elements.
False : Do not sort the result.
Returns#
difference : Index
Examples#
>>> import cudf
>>> idx1 = cudf.Index([2, 1, 3, 4])
>>> idx1
Int64Index([2, 1, 3, 4], dtype='int64')
>>> idx2 = cudf.Index([3, 4, 5, 6])
>>> idx2
Int64Index([3, 4, 5, 6], dtype='int64')
>>> idx1.difference(idx2)
Int64Index([1, 2], dtype='int64')
>>> idx1.difference(idx2, sort=False)
Int64Index([2, 1], dtype='int64')
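The effect of `sort=None` versus `sort=False` can be reproduced with plain lists. In this sketch `difference_sketch` is a hypothetical helper; it keeps first-seen order, then attempts a sort only when `sort is None` and falls back to the unsorted result if the elements are not comparable.

```python
def difference_sketch(left, right, sort=None):
    """Elements of `left` not in `right`, de-duplicated in first-seen order.

    Hypothetical helper. With sort=None a sort is attempted and a TypeError
    from incomparable elements is swallowed; sort=False skips sorting.
    """
    exclude = set(right)
    result = list(dict.fromkeys(v for v in left if v not in exclude))
    if sort is None:
        try:
            result = sorted(result)
        except TypeError:
            pass  # leave the result in first-seen order
    return result


idx1 = [2, 1, 3, 4]
idx2 = [3, 4, 5, 6]
sorted_diff = difference_sketch(idx1, idx2)
unsorted_diff = difference_sketch(idx1, idx2, sort=False)
print(sorted_diff)    # [1, 2]
print(unsorted_diff)  # [2, 1]
```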
- append(other)#
Append a collection of MultiIndex objects together
Parameters#
other : MultiIndex or list/tuple of MultiIndex objects
Returns#
appended : Index
Examples#
>>> import cudf
>>> idx1 = cudf.MultiIndex(
...     levels=[[1, 2], ['blue', 'red']],
...     codes=[[0, 0, 1, 1], [1, 0, 1, 0]]
... )
>>> idx2 = cudf.MultiIndex(
...     levels=[[3, 4], ['blue', 'red']],
...     codes=[[0, 0, 1, 1], [1, 0, 1, 0]]
... )
>>> idx1
MultiIndex([(1, 'red'),
            (1, 'blue'),
            (2, 'red'),
            (2, 'blue')],
           )
>>> idx2
MultiIndex([(3, 'red'),
            (3, 'blue'),
            (4, 'red'),
            (4, 'blue')],
           )
>>> idx1.append(idx2)
MultiIndex([(1, 'red'),
            (1, 'blue'),
            (2, 'red'),
            (2, 'blue'),
            (3, 'red'),
            (3, 'blue'),
            (4, 'red'),
            (4, 'blue')],
           )
- get_loc(key, method=None, tolerance=None)#
Get location for a label or a tuple of labels.
The location is returned as an integer/slice or boolean mask.
Parameters#
key : label or tuple of labels (one for each level)
method : None
Returns#
- loc : int, slice object or boolean mask
If the index is unique, the search result is unique and a single int is returned.
If the index is monotonic, the result is returned as a slice object.
Otherwise, cudf makes a best effort to convert the search result into a slice object, and returns a boolean mask if that fails. Note that this can deviate from Pandas behavior in some situations.
Examples#
>>> import cudf
>>> mi = cudf.MultiIndex.from_tuples(
...     [('a', 'd'), ('b', 'e'), ('b', 'f')])
>>> mi.get_loc('b')
slice(1, 3, None)
>>> mi.get_loc(('b', 'e'))
1
>>> non_monotonic_non_unique_idx = cudf.MultiIndex.from_tuples(
...     [('c', 'd'), ('b', 'e'), ('a', 'f'), ('b', 'e')])
>>> non_monotonic_non_unique_idx.get_loc('b') # differ from pandas
slice(1, 4, 2)
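The three possible result shapes (int, slice, boolean mask) can be illustrated with a plain-Python sketch. `get_loc_sketch` is a hypothetical helper that mimics the described outcomes by inspecting the matching positions; it is not the library's algorithm and makes no claim about how monotonicity is actually checked.

```python
def get_loc_sketch(entries, key):
    """Locate `key` (a label or a tuple prefix) among tuple entries.

    Hypothetical helper: returns an int for a single hit, a slice when the
    hits are evenly spaced, and otherwise a list of booleans (the mask).
    """
    prefix = key if isinstance(key, tuple) else (key,)
    hits = [i for i, entry in enumerate(entries) if entry[:len(prefix)] == prefix]
    if not hits:
        raise KeyError(key)
    if len(hits) == 1:
        return hits[0]
    step = hits[1] - hits[0]
    if all(b - a == step for a, b in zip(hits, hits[1:])):
        return slice(hits[0], hits[-1] + 1, None if step == 1 else step)
    hit_set = set(hits)
    return [i in hit_set for i in range(len(entries))]


mi = [("a", "d"), ("b", "e"), ("b", "f")]
print(get_loc_sketch(mi, "b"))         # slice(1, 3, None)
print(get_loc_sketch(mi, ("b", "e")))  # 1

scattered = [("c", "d"), ("b", "e"), ("a", "f"), ("b", "e")]
print(get_loc_sketch(scattered, "b"))  # slice(1, 4, 2)

irregular = [("b", "x"), ("b", "y"), ("a", "z"), ("b", "w")]
print(get_loc_sketch(irregular, "b"))  # [True, True, False, True]
```

Callers that accept any of the three shapes can pass the result straight to positional indexing, since ints, slices, and boolean masks are all valid selectors.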
- union(other, sort=None)#
Form the union of two Index objects.
Parameters#
other : Index or array-like
sort : bool or None, default None
Whether to sort the resulting Index.
None : Sort the result, except when self and other are equal, or when self or other has length 0.
False : Do not sort the result.
Returns#
union : Index
Examples#
Union of an Index
>>> import cudf
>>> import pandas as pd
>>> idx1 = cudf.Index([1, 2, 3, 4])
>>> idx2 = cudf.Index([3, 4, 5, 6])
>>> idx1.union(idx2)
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
MultiIndex case
>>> idx1 = cudf.MultiIndex.from_pandas(
...    pd.MultiIndex.from_arrays(
...         [[1, 1, 2, 2], ["Red", "Blue", "Red", "Blue"]]
...    )
... )
>>> idx1
MultiIndex([(1, 'Red'),
            (1, 'Blue'),
            (2, 'Red'),
            (2, 'Blue')],
           )
>>> idx2 = cudf.MultiIndex.from_pandas(
...    pd.MultiIndex.from_arrays(
...         [[3, 3, 2, 2], ["Red", "Green", "Red", "Green"]]
...    )
... )
>>> idx2
MultiIndex([(3, 'Red'),
            (3, 'Green'),
            (2, 'Red'),
            (2, 'Green')],
           )
>>> idx1.union(idx2)
MultiIndex([(1, 'Blue'),
            (1, 'Red'),
            (2, 'Blue'),
            (2, 'Green'),
            (2, 'Red'),
            (3, 'Green'),
            (3, 'Red')],
           )
>>> idx1.union(idx2, sort=False)
MultiIndex([(1, 'Red'),
            (1, 'Blue'),
            (2, 'Red'),
            (2, 'Blue'),
            (3, 'Red'),
            (3, 'Green'),
            (2, 'Green')],
           )
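The two orderings in the MultiIndex example follow from a simple rule: concatenate, drop duplicates in first-seen order, then optionally sort. A plain-Python sketch over tuples, with a hypothetical helper `union_sketch` that applies the `sort=None` exceptions listed above:

```python
def union_sketch(left, right, sort=None):
    """Union of two sequences of hashable entries.

    Hypothetical helper. Duplicates are dropped in first-seen order; with
    sort=None the result is sorted unless the inputs are equal or one of
    them is empty.
    """
    result = list(dict.fromkeys([*left, *right]))
    if sort is None and left != right and left and right:
        result = sorted(result)
    return result


idx1 = [(1, "Red"), (1, "Blue"), (2, "Red"), (2, "Blue")]
idx2 = [(3, "Red"), (3, "Green"), (2, "Red"), (2, "Green")]

sorted_union = union_sketch(idx1, idx2)
ordered_union = union_sketch(idx1, idx2, sort=False)
print(sorted_union)
print(ordered_union)
```

Tuples sort level by level, so the sorted result is ordered by the first level and then by the second, as in the example output.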
- abs()#
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns#
- DataFrame/Series
Absolute value of each element.
Examples#
Absolute numeric values in a Series
>>> s = cudf.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64
- all(axis=0, skipna=True, level=None, **kwargs)#
Return whether all elements are True in DataFrame.
Parameters#
- axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
- 0 or ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 or ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
- skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
Returns#
Series
Notes#
Parameters currently not supported are bool_only, level.
Examples#
>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 0, 10, 10]})
>>> df.all()
a     True
b    False
dtype: bool
- any(axis=0, skipna=True, level=None, **kwargs)#
Return whether any element is True in DataFrame.
Parameters#
- axis{0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
- 0 or ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 or ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
- skipna: bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
Returns#
Series
Notes#
Parameters currently not supported are bool_only, level.
Examples#
>>> import cudf >>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 0, 10, 10]}) >>> df.any() a True b True dtype: bool
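The skipna rules stated for all and any can be checked with a small plain-Python model. This is only a sketch of the documented semantics: `all_sketch` and `any_sketch` are hypothetical helpers, and `None` stands in for an NA value.

```python
def _prepare(values, skipna):
    # skipna=True drops NA entries; skipna=False treats each NA as True.
    if skipna:
        return [v for v in values if v is not None]
    return [True if v is None else v for v in values]


def all_sketch(values, skipna=True):
    """True when every remaining value is truthy (True for an empty column)."""
    return all(bool(v) for v in _prepare(values, skipna))


def any_sketch(values, skipna=True):
    """True when at least one remaining value is truthy (False if empty)."""
    return any(bool(v) for v in _prepare(values, skipna))


print(all_sketch([7, 0, 10, 10]))              # column 'b' from the example
print(all_sketch([None, None]))                # all-NA column behaves as empty
print(any_sketch([None, None]))                # all-NA column behaves as empty
print(any_sketch([None, None], skipna=False))  # NA counted as True
```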
- argsort(by=None, axis=0, kind='quicksort', order=None, ascending=True, na_position='last')#
Return the integer indices that would sort the Series values.
Parameters#
- bystr or list of str, default None
Name or list of names to sort by. If None, sort by all columns.
- axis{0 or “index”}
Has no effect but is accepted for compatibility with numpy.
- kind{‘mergesort’, ‘quicksort’, ‘heapsort’, ‘stable’}, default ‘quicksort’
Choice of sorting algorithm. See numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms. Only quicksort is supported in cuDF.
- orderNone
Has no effect but is accepted for compatibility with numpy.
- ascendingbool or list of bool, default True
If True, sort values in ascending order, otherwise descending.
- na_position{‘first’ or ‘last’}, default ‘last’
Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
Returns#
cupy.ndarray: The indices sorted based on input.
Examples#
Series
>>> import cudf >>> s = cudf.Series([3, 1, 2]) >>> s 0 3 1 1 2 2 dtype: int64 >>> s.argsort() 0 1 1 2 2 0 dtype: int32 >>> s[s.argsort()] 1 1 2 2 0 3 dtype: int64
DataFrame
>>> import cudf >>> df = cudf.DataFrame({'foo': [3, 1, 2]}) >>> df.argsort() array([1, 2, 0], dtype=int32)
Index
>>> import cudf >>> idx = cudf.Index([3, 1, 2]) >>> idx.argsort() array([1, 2, 0], dtype=int32)
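What argsort returns, and how ascending and na_position affect it, can be modelled on a plain list. This sketch is an illustration under assumptions, not the GPU implementation: `argsort_sketch` is a hypothetical helper and `None` stands in for a null.

```python
def argsort_sketch(values, ascending=True, na_position="last"):
    """Return the positions that would sort `values`, with nulls grouped."""
    valid = [i for i, v in enumerate(values) if v is not None]
    nulls = [i for i, v in enumerate(values) if v is None]
    # Sort the positions of non-null entries by the values they point at.
    valid.sort(key=lambda i: values[i], reverse=not ascending)
    return nulls + valid if na_position == "first" else valid + nulls


data = [3, 1, 2]
order = argsort_sketch(data)
print(order)                     # positions that sort the data
print([data[i] for i in order])  # data rearranged by those positions
```

Indexing the original values with the returned positions yields the sorted values, which is the same relationship the `s[s.argsort()]` example relies on.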
- dot(other, reflect=False)#
Get dot product of frame and other, (binary operator dot).
Among flexible wrappers (add, sub, mul, div, mod, pow, dot) to arithmetic operators: +, -, *, /, //, %, **, @.
Parameters#
- otherSequence, Series, or DataFrame
Any multiple element data structure, or list-like object.
- reflectbool, default False
If True, swap the order of the operands. See https://docs.python.org/3/reference/datamodel.html#object.__ror__ for more information on when this is necessary.
Returns#
- scalar, Series, or DataFrame
The result of the operation.
Examples#
>>> import cudf >>> df = cudf.DataFrame([[1, 2, 3, 4], ... [5, 6, 7, 8]]) >>> df @ df.T 0 1 0 30 70 1 70 174 >>> s = cudf.Series([1, 1, 1, 1]) >>> df @ s 0 10 1 26 dtype: int64 >>> [1, 2, 3, 4] @ s 10
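The arithmetic behind the example values can be verified with nested lists. This is a plain-Python sketch of a matrix product with an optional operand swap; `matmul` and `dot_sketch` are hypothetical names used only for illustration.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
        for row in left
    ]


def dot_sketch(left, right, reflect=False):
    """Compute left @ right, or right @ left when reflect is True."""
    if reflect:
        left, right = right, left
    return matmul(left, right)


rows = [[1, 2, 3, 4], [5, 6, 7, 8]]
transposed = [list(col) for col in zip(*rows)]
ones = [[1], [1], [1], [1]]  # column vector

print(dot_sketch(rows, transposed))  # same values as df @ df.T
print(dot_sketch(rows, ones))        # same values as df @ s
```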
- drop_duplicates(keep='first', nulls_are_equal=True)#
Drop duplicate rows in index.
- keep{“first”, “last”, False}, default “first”
‘first’ : Drop duplicates except for the first occurrence.
‘last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
- nulls_are_equal: bool, default True
Null elements are considered equal to other null elements.
- dropna(how='any')#
Drop null rows from Index.
- how{“any”, “all”}, default “any”
Specifies how to decide whether to drop a row. “any” (default) drops rows containing at least one null value. “all” drops only rows containing all null values.
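The difference between the two how settings is easiest to see on rows with partial nulls. A minimal plain-Python sketch, assuming tuples model index rows and `None` models a null; `dropna_sketch` is a hypothetical helper.

```python
def dropna_sketch(rows, how="any"):
    """Drop rows containing nulls according to the `how` rule."""
    if how == "any":
        should_drop = any   # one null is enough to drop the row
    elif how == "all":
        should_drop = all   # every entry must be null to drop the row
    else:
        raise ValueError("how must be 'any' or 'all'")
    return [row for row in rows if not should_drop(v is None for v in row)]


rows = [(1, "a"), (None, "b"), (None, None)]
print(dropna_sketch(rows, how="any"))
print(dropna_sketch(rows, how="all"))
```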
- duplicated(keep='first')#
Indicate duplicate index values.
Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.
Parameters#
- keep{‘first’, ‘last’, False}, default ‘first’
The value or values in a set of duplicates to mark as missing.
- 'first' : Mark duplicates as True except for the first occurrence.
- 'last' : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
Returns#
cupy.ndarray[bool]
See Also#
Series.duplicated : Equivalent method on cudf.Series. DataFrame.duplicated : Equivalent method on cudf.DataFrame. Index.drop_duplicates : Remove duplicate values from Index.
Examples#
By default, for each set of duplicated values, the first occurrence is set to False and all others to True:
>>> import cudf >>> idx = cudf.Index(['lama', 'cow', 'lama', 'beetle', 'lama']) >>> idx.duplicated() array([False, False, True, False, True])
which is equivalent to
>>> idx.duplicated(keep='first') array([False, False, True, False, True])
By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True:
>>> idx.duplicated(keep='last') array([ True, False, True, False, False])
By setting keep to False, all duplicates are True:
>>> idx.duplicated(keep=False) array([ True, False, True, False, True])
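The three keep settings can be reproduced on a plain list. The sketch below models the documented semantics only; `duplicated_sketch` is a hypothetical helper and returns a Python list rather than a cupy array.

```python
from collections import Counter


def duplicated_sketch(values, keep="first"):
    """Return a boolean mask marking duplicate entries per the `keep` rule."""
    if keep is False:
        # Every member of a repeated group is marked.
        counts = Counter(values)
        return [counts[v] > 1 for v in values]
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last' or False")
    # Walk forwards for 'first' and backwards for 'last'; the first label
    # met in that direction is the one left unmarked.
    positions = range(len(values))
    if keep == "last":
        positions = reversed(positions)
    seen = set()
    mask = [False] * len(values)
    for i in positions:
        if values[i] in seen:
            mask[i] = True
        else:
            seen.add(values[i])
    return mask


labels = ["lama", "cow", "lama", "beetle", "lama"]
print(duplicated_sketch(labels))
print(duplicated_sketch(labels, keep="last"))
print(duplicated_sketch(labels, keep=False))
```

Under the same model, drop_duplicates with a given keep setting retains exactly the entries whose mask value is False.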
- property empty#
- equals(other)#
Test whether two objects contain the same elements.
This function allows two objects to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type.
Parameters#
- otherIndex, Series, DataFrame
The other object to be compared with.
Returns#
- bool
True if all elements are the same in both objects, False otherwise.
Examples#
>>> import cudf
Comparing Series with equals:
>>> s = cudf.Series([1, 2, 3]) >>> other = cudf.Series([1, 2, 3]) >>> s.equals(other) True >>> different = cudf.Series([1.5, 2, 3]) >>> s.equals(different) False
Comparing DataFrames with equals:
>>> df = cudf.DataFrame({1: [10], 2: [20]}) >>> df 1 2 0 10 20 >>> exactly_equal = cudf.DataFrame({1: [10], 2: [20]}) >>> exactly_equal 1 2 0 10 20 >>> df.equals(exactly_equal) True
For two DataFrames to compare equal, the types of column values must be equal, but the types of column labels need not:
>>> different_column_type = cudf.DataFrame({1.0: [10], 2.0: [20]}) >>> different_column_type 1.0 2.0 0 10 20 >>> df.equals(different_column_type) True
- factorize(sort=False, na_sentinel=None, use_na_sentinel=None)#
- find_label_range(loc: slice) → slice#
Translate a label-based slice to an index-based slice
Parameters#
- loc
slice to search for.
Notes#
As with all label-based searches, the slice is right-closed.
Returns#
New slice translated into integer indices of the index (right-open).
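The translation from a right-closed label slice to a right-open integer slice can be sketched with binary search. This assumes the labels are sorted and ignores any slice step; `find_label_range_sketch` is a hypothetical helper, not the library's code.

```python
from bisect import bisect_left, bisect_right


def find_label_range_sketch(labels, loc):
    """Map a label slice (stop included) to a positional slice (stop excluded)."""
    start = 0 if loc.start is None else bisect_left(labels, loc.start)
    # bisect_right lands one past the last match, so the stop label is kept.
    stop = len(labels) if loc.stop is None else bisect_right(labels, loc.stop)
    return slice(start, stop)


labels = ["a", "b", "c", "d"]
positions = find_label_range_sketch(labels, slice("b", "c"))
print(positions)
print(labels[positions])
```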
- classmethod from_arrow(data)#
Convert from PyArrow Table to Frame
Parameters#
data : PyArrow Table
Raises#
TypeError for invalid input type.
Examples#
>>> import cudf >>> import pyarrow as pa >>> data = pa.table({"a":[1, 2, 3], "b":[4, 5, 6]}) >>> cudf.core.frame.Frame.from_arrow(data) a b 0 1 4 1 2 5 2 3 6
- property has_duplicates#
- property hasnans#
Return True if there are any NaNs or nulls.
Returns#
- outbool
If Series has at least one NaN or null value, return True, if not return False.
Examples#
>>> import cudf >>> import numpy as np >>> index = cudf.Index([1, 2, np.nan, 3, 4], nan_as_null=False) >>> index Float64Index([1.0, 2.0, nan, 3.0, 4.0], dtype='float64') >>> index.hasnans True
hasnans returns True for the presence of any NA values:
>>> index = cudf.Index([1, 2, None, 3, 4]) >>> index Int64Index([1, 2, <NA>, 3, 4], dtype='int64') >>> index.hasnans True
- head(n=5)#
Return the first n rows. This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
Parameters#
- nint, default 5
Number of rows to select.
Returns#
- DataFrame or Series
The first n rows of the caller object.
Examples#
Series
>>> ser = cudf.Series(['alligator', 'bee', 'falcon', ... 'lion', 'monkey', 'parrot', 'shark', 'whale', 'zebra']) >>> ser 0 alligator 1 bee 2 falcon 3 lion 4 monkey 5 parrot 6 shark 7 whale 8 zebra dtype: object
Viewing the first 5 lines
>>> ser.head() 0 alligator 1 bee 2 falcon 3 lion 4 monkey dtype: object
Viewing the first n lines (three in this case)
>>> ser.head(3) 0 alligator 1 bee 2 falcon dtype: object
For negative values of n
>>> ser.head(-3) 0 alligator 1 bee 2 falcon 3 lion 4 monkey 5 parrot dtype: object
DataFrame
>>> df = cudf.DataFrame() >>> df['key'] = [0, 1, 2, 3, 4] >>> df['val'] = [float(i + 10) for i in range(5)] # insert column >>> df.head(2) key val 0 0 10.0 1 1 11.0
- intersection(other, sort=False)#
Form the intersection of two Index objects.
This returns a new Index with elements common to the index and other.
Parameters#
other : Index or array-like
sort : False or None, default False
Whether to sort the resulting index.
False : do not sort the result.
None : sort the result, except when self and other are equal or when the values cannot be compared.
Returns#
intersection : Index
Examples#
>>> import cudf >>> import pandas as pd >>> idx1 = cudf.Index([1, 2, 3, 4]) >>> idx2 = cudf.Index([3, 4, 5, 6]) >>> idx1.intersection(idx2) Int64Index([3, 4], dtype='int64')
MultiIndex case
>>> idx1 = cudf.MultiIndex.from_pandas( ... pd.MultiIndex.from_arrays( ... [[1, 1, 3, 4], ["Red", "Blue", "Red", "Blue"]] ... ) ... ) >>> idx2 = cudf.MultiIndex.from_pandas( ... pd.MultiIndex.from_arrays( ... [[1, 1, 2, 2], ["Red", "Blue", "Red", "Blue"]] ... ) ... ) >>> idx1 MultiIndex([(1, 'Red'), (1, 'Blue'), (3, 'Red'), (4, 'Blue')], ) >>> idx2 MultiIndex([(1, 'Red'), (1, 'Blue'), (2, 'Red'), (2, 'Blue')], ) >>> idx1.intersection(idx2) MultiIndex([(1, 'Red'), (1, 'Blue')], ) >>> idx1.intersection(idx2, sort=False) MultiIndex([(1, 'Red'), (1, 'Blue')], )
- is_boolean()#
Check if the Index only consists of booleans.
Deprecated since version 23.04: Use cudf.api.types.is_bool_dtype instead.
Returns#
- bool
Whether or not the Index only consists of booleans.
See Also#
is_integer : Check if the Index only consists of integers. is_floating : Check if the Index is a floating type. is_numeric : Check if the Index only consists of numeric data. is_object : Check if the Index is of the object dtype. is_categorical : Check if the Index holds categorical data. is_interval : Check if the Index holds Interval objects.
Examples#
>>> import cudf >>> idx = cudf.Index([True, False, True]) >>> idx.is_boolean() True >>> idx = cudf.Index(["True", "False", "True"]) >>> idx.is_boolean() False >>> idx = cudf.Index([1, 2, 3]) >>> idx.is_boolean() False
- is_categorical()#
Check if the Index holds categorical data.
Deprecated since version 23.04: Use cudf.api.types.is_categorical_dtype instead.
Returns#
- bool
True if the Index is categorical.
See Also#
CategoricalIndex : Index for categorical data. is_boolean : Check if the Index only consists of booleans. is_integer : Check if the Index only consists of integers. is_floating : Check if the Index is a floating type. is_numeric : Check if the Index only consists of numeric data. is_object : Check if the Index is of the object dtype. is_interval : Check if the Index holds Interval objects.
Examples#
>>> import cudf >>> idx = cudf.Index(["Watermelon", "Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.is_categorical() True >>> idx = cudf.Index([1, 3, 5, 7]) >>> idx.is_categorical() False >>> s = cudf.Series(["Peter", "Victor", "Elisabeth", "Mar"]) >>> s 0 Peter 1 Victor 2 Elisabeth 3 Mar dtype: object >>> s.index.is_categorical() False
- is_floating()#
Check if the Index is a floating type.
The Index may consist of only floats, NaNs, or a mix of floats, integers, or NaNs.
Deprecated since version 23.04: Use cudf.api.types.is_float_dtype instead.
Returns#
- bool
Whether or not the Index consists only of floats, NaNs, or a mix of floats, integers, or NaNs.
See Also#
is_boolean : Check if the Index only consists of booleans. is_integer : Check if the Index only consists of integers. is_numeric : Check if the Index only consists of numeric data. is_object : Check if the Index is of the object dtype. is_categorical : Check if the Index holds categorical data. is_interval : Check if the Index holds Interval objects.
Examples#
>>> import cudf >>> import numpy as np >>> idx = cudf.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_floating() True >>> idx = cudf.Index([1.0, 2.0, np.nan, 4.0]) >>> idx.is_floating() True >>> idx = cudf.Index([1, 2, 3, 4, np.nan], nan_as_null=False) >>> idx.is_floating() True >>> idx = cudf.Index([1, 2, 3, 4]) >>> idx.is_floating() False
- is_integer()#
Check if the Index only consists of integers.
Deprecated since version 23.04: Use cudf.api.types.is_integer_dtype instead.
Returns#
- bool
Whether or not the Index only consists of integers.
See Also#
is_boolean : Check if the Index only consists of booleans. is_floating : Check if the Index is a floating type. is_numeric : Check if the Index only consists of numeric data. is_object : Check if the Index is of the object dtype. is_categorical : Check if the Index holds categorical data. is_interval : Check if the Index holds Interval objects.
Examples#
>>> import cudf >>> idx = cudf.Index([1, 2, 3, 4]) >>> idx.is_integer() True >>> idx = cudf.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_integer() False >>> idx = cudf.Index(["Apple", "Mango", "Watermelon"]) >>> idx.is_integer() False
- is_interval()#
Check if the Index holds Interval objects.
Deprecated since version 23.04: Use cudf.api.types.is_interval_dtype instead.
Returns#
- bool
Whether or not the Index holds Interval objects.
See Also#
IntervalIndex : Index for Interval objects. is_boolean : Check if the Index only consists of booleans. is_integer : Check if the Index only consists of integers. is_floating : Check if the Index is a floating type. is_numeric : Check if the Index only consists of numeric data. is_object : Check if the Index is of the object dtype. is_categorical : Check if the Index holds categorical data.
Examples#
>>> import cudf >>> import pandas as pd >>> idx = cudf.from_pandas( ... pd.Index([pd.Interval(left=0, right=5), ... pd.Interval(left=5, right=10)]) ... ) >>> idx.is_interval() True >>> idx = cudf.Index([1, 3, 5, 7]) >>> idx.is_interval() False
- property is_monotonic#
Return boolean if values in the object are monotonic_increasing.
This property is an alias for is_monotonic_increasing.
Returns#
bool
- is_numeric()#
Check if the Index only consists of numeric data.
Deprecated since version 23.04: Use cudf.api.types.is_any_real_numeric_dtype instead.
Returns#
- bool
Whether or not the Index only consists of numeric data.
See Also#
is_boolean : Check if the Index only consists of booleans. is_integer : Check if the Index only consists of integers. is_floating : Check if the Index is a floating type. is_object : Check if the Index is of the object dtype. is_categorical : Check if the Index holds categorical data. is_interval : Check if the Index holds Interval objects.
Examples#
>>> import cudf >>> import numpy as np >>> idx = cudf.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_numeric() True >>> idx = cudf.Index([1, 2, 3, 4.0]) >>> idx.is_numeric() True >>> idx = cudf.Index([1, 2, 3, 4]) >>> idx.is_numeric() True >>> idx = cudf.Index([1, 2, 3, 4.0, np.nan]) >>> idx.is_numeric() True >>> idx = cudf.Index(["Apple", "cold"]) >>> idx.is_numeric() False
- is_object()#
Check if the Index is of the object dtype.
Deprecated since version 23.04: Use cudf.api.types.is_object_dtype instead.
Returns#
- bool
Whether or not the Index is of the object dtype.
See Also#
is_boolean : Check if the Index only consists of booleans. is_integer : Check if the Index only consists of integers. is_floating : Check if the Index is a floating type. is_numeric : Check if the Index only consists of numeric data. is_categorical : Check if the Index holds categorical data. is_interval : Check if the Index holds Interval objects.
Examples#
>>> import cudf >>> idx = cudf.Index(["Apple", "Mango", "Watermelon"]) >>> idx.is_object() True >>> idx = cudf.Index(["Watermelon", "Orange", "Apple", ... "Watermelon"]).astype("category") >>> idx.is_object() False >>> idx = cudf.Index([1.0, 2.0, 3.0, 4.0]) >>> idx.is_object() False
- isna()#
Identify missing values.
Return a boolean same-sized object indicating if the values are <NA>. <NA> values get mapped to True values. Everything else gets mapped to False values. <NA> values include:
- Values where null mask is set.
- NaN in float dtype.
- NaT in datetime64 and timedelta64 types.
Characters such as empty strings '' or inf in case of float are not considered <NA> values.
Returns#
- DataFrame/Series/Index
Mask of bool values for each element in the object that indicates whether an element is an NA value.
Examples#
Show which entries in a DataFrame are NA.
>>> import cudf >>> import numpy as np >>> import pandas as pd >>> df = cudf.DataFrame({'age': [5, 6, np.NaN], ... 'born': [pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... 'name': ['Alfred', 'Batman', ''], ... 'toy': [None, 'Batmobile', 'Joker']}) >>> df age born name toy 0 5 <NA> Alfred <NA> 1 6 1939-05-27 00:00:00.000000 Batman Batmobile 2 <NA> 1940-04-25 00:00:00.000000 Joker >>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = cudf.Series([5, 6, np.NaN, np.inf, -np.inf]) >>> ser 0 5.0 1 6.0 2 <NA> 3 Inf 4 -Inf dtype: float64 >>> ser.isna() 0 False 1 False 2 True 3 False 4 False dtype: bool
Show which entries in an Index are NA.
>>> idx = cudf.Index([1, 2, None, np.NaN, 0.32, np.inf]) >>> idx Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64') >>> idx.isna() array([False, False, True, True, False, False])
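The classification rule above (nulls and NaN count as missing, empty strings and infinities do not) can be sketched in plain Python. `is_missing` is a hypothetical helper; `None` models a null, and NaT is left out because the standard library has no equivalent.

```python
import math


def is_missing(value):
    """True for None (null) and float NaN; False for everything else."""
    return value is None or (isinstance(value, float) and math.isnan(value))


values = [1, 2, None, float("nan"), 0.32, float("inf")]
print([is_missing(v) for v in values])
print(is_missing(""))  # an empty string is not treated as missing
```

The first mask matches the Index example; negating each entry gives the corresponding notna result.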
- isnull()#
Identify missing values.
Return a boolean same-sized object indicating if the values are <NA>. <NA> values get mapped to True values. Everything else gets mapped to False values. <NA> values include:
- Values where null mask is set.
- NaN in float dtype.
- NaT in datetime64 and timedelta64 types.
Characters such as empty strings '' or inf in case of float are not considered <NA> values.
Returns#
- DataFrame/Series/Index
Mask of bool values for each element in the object that indicates whether an element is an NA value.
Examples#
Show which entries in a DataFrame are NA.
>>> import cudf >>> import numpy as np >>> import pandas as pd >>> df = cudf.DataFrame({'age': [5, 6, np.NaN], ... 'born': [pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... 'name': ['Alfred', 'Batman', ''], ... 'toy': [None, 'Batmobile', 'Joker']}) >>> df age born name toy 0 5 <NA> Alfred <NA> 1 6 1939-05-27 00:00:00.000000 Batman Batmobile 2 <NA> 1940-04-25 00:00:00.000000 Joker >>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = cudf.Series([5, 6, np.NaN, np.inf, -np.inf]) >>> ser 0 5.0 1 6.0 2 <NA> 3 Inf 4 -Inf dtype: float64 >>> ser.isna() 0 False 1 False 2 True 3 False 4 False dtype: bool
Show which entries in an Index are NA.
>>> idx = cudf.Index([1, 2, None, np.NaN, 0.32, np.inf]) >>> idx Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64') >>> idx.isna() array([False, False, True, True, False, False])
- join(other, how='left', level=None, return_indexers=False, sort=False)#
Compute join_index and indexers to conform data structures to the new index.
Parameters#
other : Index
how : {‘left’, ‘right’, ‘inner’, ‘outer’}
return_indexers : bool, default False
sort : bool, default False
Sort the join keys lexicographically in the result Index. If False, the order of the join keys depends on the join type (how keyword).
Returns: index
Examples#
>>> import cudf >>> lhs = cudf.DataFrame({ ... "a": [2, 3, 1], ... "b": [3, 4, 2], ... }).set_index(['a', 'b']).index >>> lhs MultiIndex([(2, 3), (3, 4), (1, 2)], names=['a', 'b']) >>> rhs = cudf.DataFrame({"a": [1, 4, 3]}).set_index('a').index >>> rhs Int64Index([1, 4, 3], dtype='int64', name='a') >>> lhs.join(rhs, how='inner') MultiIndex([(3, 4), (1, 2)], names=['a', 'b'])
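The inner join in the example matches the right index against the level of the left MultiIndex that shares its name. The following plain-Python sketch reproduces that result under assumptions: `inner_join_sketch` is a hypothetical helper, and keeping the left-hand row order when sort is False is inferred from the example output rather than guaranteed.

```python
def inner_join_sketch(lhs_rows, lhs_names, rhs_values, rhs_name, sort=False):
    """Keep left rows whose value in the shared level also appears on the right."""
    level = lhs_names.index(rhs_name)  # position of the shared level
    wanted = set(rhs_values)
    joined = [row for row in lhs_rows if row[level] in wanted]
    # sort=True orders the join keys lexicographically.
    return sorted(joined) if sort else joined


lhs = [(2, 3), (3, 4), (1, 2)]
rhs = [1, 4, 3]
print(inner_join_sketch(lhs, ["a", "b"], rhs, "a"))
print(inner_join_sketch(lhs, ["a", "b"], rhs, "a", sort=True))
```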
- kurt(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#
Return Fisher’s unbiased kurtosis of a sample.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values when computing the result.
Returns#
Series or scalar
Notes#
Parameters currently not supported are level and numeric_only
Examples#
Series
>>> import cudf >>> series = cudf.Series([1, 2, 3, 4]) >>> series.kurtosis() -1.1999999999999904
DataFrame
>>> import cudf >>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]}) >>> df.kurt() a -1.2 b -1.2 dtype: float64
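The example values can be checked against the standard bias-corrected sample excess kurtosis formula. The sketch assumes this is the estimator meant by "Fisher's unbiased kurtosis"; `fisher_kurtosis` is a hypothetical helper that takes a plain list with no nulls.

```python
def fisher_kurtosis(values):
    """Bias-corrected sample excess kurtosis (0.0 for a normal distribution)."""
    n = len(values)
    if n < 4:
        raise ValueError("at least four observations are required")
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    variance = sum_sq / (n - 1)  # unbiased variance, normalized by N-1
    scaled = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum_fourth / variance**2
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scaled - correction


print(fisher_kurtosis([1, 2, 3, 4]))
print(fisher_kurtosis([7, 8, 9, 10]))
```

Both columns of the DataFrame example are evenly spaced runs of four values, which is why they share the same kurtosis of about -1.2.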
- kurtosis(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#
Return Fisher’s unbiased kurtosis of a sample.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values when computing the result.
Returns#
Series or scalar
Notes#
Parameters currently not supported are level and numeric_only
Examples#
Series
>>> import cudf >>> series = cudf.Series([1, 2, 3, 4]) >>> series.kurtosis() -1.1999999999999904
DataFrame
>>> import cudf >>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]}) >>> df.kurt() a -1.2 b -1.2 dtype: float64
- mask(cond, other=None, inplace=False)#
Replace values where the condition is True.
Parameters#
- condbool Series/DataFrame, array-like
Where cond is False, keep the original value. Where True, replace with corresponding value from other. Callables are not supported.
- other: scalar, list of scalars, Series/DataFrame
Entries where cond is True are replaced with corresponding value from other. Callables are not supported. Default is None.
DataFrame expects only Scalar or array like with scalars or dataframe with same dimension as self.
Series expects only scalar or series like with same length
- inplacebool, default False
Whether to perform the operation in place on the data.
Returns#
Same type as caller
Examples#
>>> import cudf >>> df = cudf.DataFrame({"A":[1, 4, 5], "B":[3, 5, 8]}) >>> df.mask(df % 2 == 0, [-1, -1]) A B 0 1 3 1 -1 5 2 5 -1
>>> ser = cudf.Series([4, 3, 2, 1, 0]) >>> ser.mask(ser > 2, 10) 0 10 1 10 2 2 3 1 4 0 dtype: int64 >>> ser.mask(ser > 2) 0 <NA> 1 <NA> 2 2 3 1 4 0 dtype: int64
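The replacement rule is element-wise: positions where cond is True take other, and all others keep their value. A minimal plain-Python sketch for the scalar case, with `mask_sketch` as a hypothetical helper and `None` modelling a null:

```python
def mask_sketch(values, cond, other=None):
    """Replace entries where `cond` is True with the scalar `other`."""
    return [other if flag else value for value, flag in zip(values, cond)]


ser = [4, 3, 2, 1, 0]
cond = [v > 2 for v in ser]
print(mask_sketch(ser, cond, 10))
print(mask_sketch(ser, cond))  # default other=None yields nulls
```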
- max(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#
Return the maximum of the values in the DataFrame.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values when computing the result.
- level: int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_only: bool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
Returns#
Series
Notes#
Parameters currently not supported are level, numeric_only.
Examples#
>>> import cudf >>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]}) >>> df.max() a 4 b 10 dtype: int64
- mean(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#
Return the mean of the values for the requested axis.
Parameters#
- axis{0 or ‘index’, 1 or ‘columns’}
Axis for the function to be applied on.
- skipnabool, default True
Exclude NA/null values when computing the result.
- levelint or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_onlybool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- **kwargs
Additional keyword arguments to be passed to the function.
Returns#
mean : Series or DataFrame (if level specified)
Examples#
>>> import cudf >>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]}) >>> df.mean() a 2.5 b 8.5 dtype: float64
- median(axis=None, skipna=True, level=None, numeric_only=None, **kwargs)#
Return the median of the values for the requested axis.
Parameters#
- skipnabool, default True
Exclude NA/null values when computing the result.
Returns#
scalar
Notes#
Parameters currently not supported are level and numeric_only.
Examples#
>>> import cudf >>> ser = cudf.Series([10, 25, 3, 25, 24, 6]) >>> ser 0 10 1 25 2 3 3 25 4 24 5 6 dtype: int64 >>> ser.median() 17.0
- min(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#
Return the minimum of the values in the DataFrame.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values when computing the result.
- level: int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- numeric_only: bool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
Returns#
Series
Notes#
Parameters currently not supported are level, numeric_only.
Examples#
>>> import cudf >>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]}) >>> df.min() a 1 b 7 dtype: int64
- nans_to_nulls()#
Convert nans (if any) to nulls
Returns#
DataFrame or Series
Examples#
Series
>>> import cudf, numpy as np >>> series = cudf.Series([1, 2, np.nan, None, 10], nan_as_null=False) >>> series 0 1.0 1 2.0 2 NaN 3 <NA> 4 10.0 dtype: float64 >>> series.nans_to_nulls() 0 1.0 1 2.0 2 <NA> 3 <NA> 4 10.0 dtype: float64
DataFrame
>>> df = cudf.DataFrame() >>> df['a'] = cudf.Series([1, None, np.nan], nan_as_null=False) >>> df['b'] = cudf.Series([None, 3.14, np.nan], nan_as_null=False) >>> df a b 0 1.0 <NA> 1 <NA> 3.14 2 NaN NaN >>> df.nans_to_nulls() a b 0 1.0 <NA> 1 <NA> 3.14 2 <NA> <NA>
- notna()#
Identify non-missing values.
Return a boolean same-sized object indicating if the values are not <NA>. Non-missing values get mapped to True. <NA> values get mapped to False values. <NA> values include:
- Values where null mask is set.
- NaN in float dtype.
- NaT in datetime64 and timedelta64 types.
Characters such as empty strings '' or inf in case of float are not considered <NA> values.
Returns#
- DataFrame/Series/Index
Mask of bool values for each element in the object that indicates whether an element is not an NA value.
Examples#
Show which entries in a DataFrame are NA.
>>> import cudf >>> import numpy as np >>> import pandas as pd >>> df = cudf.DataFrame({'age': [5, 6, np.NaN], ... 'born': [pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... 'name': ['Alfred', 'Batman', ''], ... 'toy': [None, 'Batmobile', 'Joker']}) >>> df age born name toy 0 5 <NA> Alfred <NA> 1 6 1939-05-27 00:00:00.000000 Batman Batmobile 2 <NA> 1940-04-25 00:00:00.000000 Joker >>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are NA.
>>> ser = cudf.Series([5, 6, np.NaN, np.inf, -np.inf]) >>> ser 0 5.0 1 6.0 2 <NA> 3 Inf 4 -Inf dtype: float64 >>> ser.notna() 0 True 1 True 2 False 3 True 4 True dtype: bool
Show which entries in an Index are NA.
>>> idx = cudf.Index([1, 2, None, np.NaN, 0.32, np.inf]) >>> idx Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64') >>> idx.notna() array([ True, True, False, False, True, True])
- notnull()#
Identify non-missing values.
Return a boolean same-sized object indicating if the values are not <NA>. Non-missing values get mapped to True. <NA> values get mapped to False values. <NA> values include:
- Values where null mask is set.
- NaN in float dtype.
- NaT in datetime64 and timedelta64 types.
Characters such as empty strings '' or inf in case of float are not considered <NA> values.
Returns#
- DataFrame/Series/Index
Mask of bool values for each element in the object that indicates whether an element is not an NA value.
Examples#
Show which entries in a DataFrame are NA.
>>> import cudf >>> import numpy as np >>> import pandas as pd >>> df = cudf.DataFrame({'age': [5, 6, np.NaN], ... 'born': [pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... 'name': ['Alfred', 'Batman', ''], ... 'toy': [None, 'Batmobile', 'Joker']}) >>> df age born name toy 0 5 <NA> Alfred <NA> 1 6 1939-05-27 00:00:00.000000 Batman Batmobile 2 <NA> 1940-04-25 00:00:00.000000 Joker >>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are NA.
>>> ser = cudf.Series([5, 6, np.NaN, np.inf, -np.inf]) >>> ser 0 5.0 1 6.0 2 <NA> 3 Inf 4 -Inf dtype: float64 >>> ser.notna() 0 True 1 True 2 False 3 True 4 True dtype: bool
Show which entries in an Index are NA.
>>> idx = cudf.Index([1, 2, None, np.NaN, 0.32, np.inf]) >>> idx Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64') >>> idx.notna() array([ True, True, False, False, True, True])
- nunique(dropna: bool = True)#
Returns a per column mapping with counts of unique values for each column.
Parameters#
- dropnabool, default True
Don’t include NaN in the counts.
Returns#
- dict
Name and unique value counts of each column in frame.
- pipe(func, *args, **kwargs)#
Apply func(self, *args, **kwargs).
Parameters#
- funcfunction
Function to apply to the Series/DataFrame/Index. args and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame/Index.
- argsiterable, optional
Positional arguments passed into func.
- kwargsmapping, optional
A dictionary of keyword arguments passed into func.
Returns#
object : the return type of func.
Examples#
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing
>>> func(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe(func, arg2=b, arg3=c) ... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:
>>> (df.pipe(h) ... .pipe(g, arg1=a) ... .pipe((func, 'arg2'), arg1=a, arg3=c) ... )
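The dispatch between a plain callable and a (callable, data_keyword) tuple can be sketched as a free function. This is an illustration of the documented behaviour, not the library's source; `pipe_sketch`, `double` and `scale` are hypothetical names.

```python
def pipe_sketch(obj, func, *args, **kwargs):
    """Call func with obj as the first argument, or under a named keyword."""
    if isinstance(func, tuple):
        func, data_keyword = func
        if data_keyword in kwargs:
            raise ValueError(f"{data_keyword} is both the pipe target and a keyword argument")
        kwargs[data_keyword] = obj  # route the data to the named parameter
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)


def double(data):
    return [2 * x for x in data]


def scale(factor, data):
    # The data is the second parameter, so it has to be passed by keyword.
    return [factor * x for x in data]


print(pipe_sketch([1, 2], double))
print(pipe_sketch([1, 2], (scale, "data"), factor=3))
```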
- prod(axis=<no_default>, skipna=True, dtype=None, level=None, numeric_only=None, min_count=0, **kwargs)#
Return product of the values in the DataFrame.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values when computing the result.
- dtype: data type
Data type to cast the result to.
- min_count: int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
The default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
Returns#
Series
Notes#
Parameters currently not supported are level, numeric_only.
Examples#
>>> import cudf >>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]}) >>> df.product() a 24 b 5040 dtype: int64
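The interaction between skipna and min_count can be modelled on a plain list. This sketch follows the parameter descriptions above and is not the library's implementation; `prod_sketch` is a hypothetical helper and `None` models both an NA input and an NA result.

```python
from math import prod


def prod_sketch(values, skipna=True, min_count=0):
    """Product of the non-null values, or None when the result would be NA."""
    valid = [v for v in values if v is not None]
    if not skipna and len(valid) != len(values):
        return None  # a null propagates when it is not skipped
    if len(valid) < min_count:
        return None  # too few valid values to report a result
    return prod(valid)  # the product of no values is 1


print(prod_sketch([1, 2, 3, 4]))
print(prod_sketch([None, None]))
print(prod_sketch([None, None], min_count=1))
```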
- product(axis=<no_default>, skipna=True, dtype=None, level=None, numeric_only=None, min_count=0, **kwargs)#
Return product of the values in the DataFrame.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values when computing the result.
- dtype: data type
Data type to cast the result to.
- min_count: int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
The default is 0, which means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
Returns#
Series
Notes#
Parameters currently not supported are level and numeric_only.
Examples#
>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.product()
a      24
b    5040
dtype: int64
- rolling(window, min_periods=None, center=False, axis=0, win_type=None)#
Rolling window calculations.
Parameters#
- window: int, offset or a BaseIndexer subclass
Size of the window, i.e., the number of observations used to calculate the statistic. For datetime indexes, an offset can be provided instead of an int. The offset must be convertible to a timedelta. As opposed to a fixed window size, each window will be sized to accommodate observations within the time period specified by the offset. If a BaseIndexer subclass is passed, calculates the window boundaries based on the defined get_window_bounds method.
- min_periods: int, optional
The minimum number of observations in the window that are required to be non-null, so that the result is non-null. If not provided or None, min_periods is equal to the window size.
- center: bool, optional
If True, the result is set at the center of the window. If False (default), the result is set at the right edge of the window.
Returns#
Rolling object.
Examples#
>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4])
Rolling sum with window size 2.
>>> print(a.rolling(2).sum())
0
1    3
2    5
3
4
dtype: int64
Rolling sum with window size 2 and min_periods 1.
>>> print(a.rolling(2, min_periods=1).sum())
0    1
1    3
2    5
3    3
4    4
dtype: int64
Rolling count with window size 3.
>>> print(a.rolling(3).count())
0    1
1    2
2    3
3    2
4    2
dtype: int64
Rolling count with window size 3, but with the result set at the center of the window.
>>> print(a.rolling(3, center=True).count())
0    2
1    3
2    2
3    2
4    1
dtype: int64
Rolling max with variable window size specified by an offset; only valid for a datetime index.
>>> import numpy as np
>>> import pandas as pd
>>> a = cudf.Series(
...     [1, 9, 5, 4, np.nan, 1],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ]
... )
>>> print(a.rolling('2s').max())
2019-01-01T09:00:00.000    1
2019-01-01T09:00:01.000    9
2019-01-01T09:00:02.000    9
2019-01-01T09:00:04.000    4
2019-01-01T09:00:07.000
2019-01-01T09:00:08.000    1
dtype: int64
Apply a custom function on the window with the apply method.
>>> import math
>>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
>>> def some_func(A):
...     b = 0
...     for a in A:
...         b = b + math.sqrt(a)
...     return b
...
>>> print(b.rolling(3, min_periods=1).apply(some_func))
0     4.0
1     9.0
2    15.0
3    18.0
4    21.0
5    24.0
dtype: float64
This also works for a rolling window set by an offset.
>>> c = cudf.Series(
...     [16, 25, 36, 49, 64, 81],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ],
...     dtype=np.float64
... )
>>> print(c.rolling('2s').apply(some_func))
2019-01-01T09:00:00.000     4.0
2019-01-01T09:00:01.000     9.0
2019-01-01T09:00:02.000    11.0
2019-01-01T09:00:04.000     7.0
2019-01-01T09:00:07.000     8.0
2019-01-01T09:00:08.000    17.0
dtype: float64
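To make the window and min_periods rules concrete, here is a small pure-Python model of a fixed-size, right-aligned rolling sum. It is a sketch of the documented semantics only (the function rolling_sum is a hypothetical name, and None stands in for a null); it reproduces the first two rolling sum examples above.

```python
def rolling_sum(values, window, min_periods=None):
    """Fixed-size, right-aligned rolling sum; None marks a null."""
    if min_periods is None:
        min_periods = window  # documented default: equal to the window size
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        # Only non-null observations count towards min_periods.
        valid = [v for v in values[start:i + 1] if v is not None]
        out.append(sum(valid) if len(valid) >= min_periods else None)
    return out


a = [1, 2, 3, None, 4]
full = rolling_sum(a, 2)
relaxed = rolling_sum(a, 2, min_periods=1)
print(full)
print(relaxed)
```

Lowering min_periods trades strictness for coverage: windows that are partly null or that run off the start of the data still produce a value.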
- searchsorted(values, side='left', ascending=True, na_position='last')#
Find indices where elements should be inserted to maintain order
Parameters#
- values: Frame (shape must be consistent with self)
Values to be hypothetically inserted into self.
- side: str {‘left’, ‘right’}, optional, default ‘left’
If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index.
- ascending: bool, optional, default True
Sorted Frame is in ascending order (otherwise descending).
- na_position: str {‘last’, ‘first’}, optional, default ‘last’
Position of null values in sorted order.
Returns#
1-D cupy array of insertion points
Examples#
>>> import cudf
>>> s = cudf.Series([1, 2, 3])
>>> s.searchsorted(4)
3
>>> s.searchsorted([0, 4])
array([0, 3], dtype=int32)
>>> s.searchsorted([1, 3], side='left')
array([0, 2], dtype=int32)
>>> s.searchsorted([1, 3], side='right')
array([1, 3], dtype=int32)
If the values are not monotonically sorted, wrong locations may be returned:
>>> s = cudf.Series([2, 1, 3])
>>> s.searchsorted(1)
0   # wrong result, correct would be 1
>>> df = cudf.DataFrame({'a': [1, 3, 5, 7], 'b': [10, 12, 14, 16]})
>>> df
   a   b
0  1  10
1  3  12
2  5  14
3  7  16
>>> values_df = cudf.DataFrame({'a': [0, 2, 5, 6],
...                             'b': [10, 11, 13, 15]})
>>> values_df
   a   b
0  0  10
1  2  11
2  5  13
3  6  15
>>> df.searchsorted(values_df, ascending=False)
array([4, 4, 4, 0], dtype=int32)
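The difference between side='left' and side='right' matches the standard library's bisect_left and bisect_right for a single ascending column. The sketch below uses plain lists and a hypothetical searchsorted helper to show that correspondence; it does not model descending order, nulls, or multi-column frames.

```python
from bisect import bisect_left, bisect_right


def searchsorted(sorted_values, values, side="left"):
    """Insertion points into an ascending list; a 1-D model only."""
    find = bisect_left if side == "left" else bisect_right
    return [find(sorted_values, v) for v in values]


s = [1, 2, 3]
left = searchsorted(s, [1, 3], side="left")    # before any equal elements
right = searchsorted(s, [1, 3], side="right")  # after any equal elements
outside = searchsorted(s, [0, 4])              # below and above the range
print(left, right, outside)
```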
- property shape#
Get a tuple representing the dimensionality of the data.
- shift(periods=1, freq=None)#
Not yet implemented
- skew(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#
Return unbiased Fisher-Pearson skew of a sample.
Parameters#
- skipna: bool, default True
Exclude NA/null values when computing the result.
Returns#
Series
Notes#
Parameters currently not supported are axis, level and numeric_only.
Examples#
Series
>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4, 5, 6, 6])
>>> series
0    1
1    2
2    3
3    4
4    5
5    6
6    6
dtype: int64
DataFrame
>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 8, 10, 10]})
>>> df.skew()
a    0.00000
b   -0.37037
dtype: float64
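For reference, the adjusted Fisher-Pearson coefficient can be computed by hand. The sketch below assumes the usual textbook definition, G1 = sqrt(n(n-1)) / (n-2) * m3 / m2^1.5 with biased central moments m2 and m3; the function name sample_skew is hypothetical. It reproduces the values of the DataFrame example above.

```python
import math


def sample_skew(values):
    """Adjusted Fisher-Pearson skewness (textbook formula, needs n >= 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # biased second moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # biased third moment
    if m2 == 0:
        return 0.0  # constant input: this sketch simply reports zero skew
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


skew_a = sample_skew([3, 2, 3, 4])
skew_b = sample_skew([7, 8, 10, 10])
print(round(skew_a, 5), round(skew_b, 5))
```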
- sort_values(return_indexer=False, ascending=True, na_position='last', key=None)#
Return a sorted copy of the index, and optionally return the indices that sorted the index itself.
Parameters#
- return_indexer: bool, default False
Should the indices that would sort the index be returned.
- ascending: bool, default True
Should the index values be sorted in an ascending order.
- na_position: {‘first’ or ‘last’}, default ‘last’
Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
- key: None, optional
This parameter is NON-FUNCTIONAL.
Returns#
- sorted_index: Index
Sorted copy of the index.
- indexer: cupy.ndarray, optional
The indices that the index itself was sorted by.
See Also#
cudf.Series.sort_values : Sort values of a Series.
cudf.DataFrame.sort_values : Sort values in a DataFrame.
Examples#
>>> import cudf
>>> idx = cudf.Index([10, 100, 1, 1000])
>>> idx
Int64Index([10, 100, 1, 1000], dtype='int64')
Sort values in ascending order (default behavior).
>>> idx.sort_values()
Int64Index([1, 10, 100, 1000], dtype='int64')
Sort values in descending order, and also get the indices idx was sorted by.
>>> idx.sort_values(ascending=False, return_indexer=True)
(Int64Index([1000, 100, 10, 1], dtype='int64'), array([3, 1, 0, 2], dtype=int32))
Sorting values in a MultiIndex:
>>> midx = cudf.MultiIndex(
...     levels=[[1, 3, 4, -10], [1, 11, 5]],
...     codes=[[0, 0, 1, 2, 3], [0, 2, 1, 1, 0]],
...     names=["x", "y"],
... )
>>> midx
MultiIndex([(  1,  1),
            (  1,  5),
            (  3, 11),
            (  4, 11),
            (-10,  1)],
           names=['x', 'y'])
>>> midx.sort_values()
MultiIndex([(-10,  1),
            (  1,  1),
            (  1,  5),
            (  3, 11),
            (  4, 11)],
           names=['x', 'y'])
>>> midx.sort_values(ascending=False)
MultiIndex([(  4, 11),
            (  3, 11),
            (  1,  5),
            (  1,  1),
            (-10,  1)],
           names=['x', 'y'])
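The relationship between the sorted index and the indexer returned with return_indexer=True can be shown with plain lists. This is a sketch with a hypothetical helper name, not the library's implementation: the indexer lists, for each output position, the position in the original index that the value came from.

```python
def sort_with_indexer(values, ascending=True):
    """Return (sorted_values, indexer) for a list of comparable values."""
    indexer = sorted(
        range(len(values)),
        key=values.__getitem__,
        reverse=not ascending,
    )
    # Taking the original values at the indexer positions yields the sorted result.
    return [values[i] for i in indexer], indexer


idx = [10, 100, 1, 1000]
sorted_desc, indexer = sort_with_indexer(idx, ascending=False)
print(sorted_desc, indexer)
```

The same indexer can be applied to any other array aligned with the index, which is the main reason to request it.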
- std(axis=<no_default>, skipna=True, level=None, ddof=1, numeric_only=None, **kwargs)#
Return sample standard deviation of the DataFrame.
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- ddof: int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
Returns#
Series
Notes#
Parameters currently not supported are level and numeric_only.
Examples#
>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.std()
a    1.290994
b    1.290994
dtype: float64
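The effect of ddof on the divisor N - ddof can be checked by hand. The functions below are a plain-Python sketch (hypothetical names) that reproduces the default result above and shows the population variant obtained with ddof=0.

```python
import math


def variance(values, ddof=1):
    """Sum of squared deviations divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


def std(values, ddof=1):
    return math.sqrt(variance(values, ddof))


sample_std = std([1, 2, 3, 4])              # divisor N - 1 = 3
population_std = std([1, 2, 3, 4], ddof=0)  # divisor N = 4
print(round(sample_std, 6), round(population_std, 6))
```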
- property str#
Not yet implemented.
- sum(axis=<no_default>, skipna=True, dtype=None, level=None, numeric_only=None, min_count=0, **kwargs)#
Return sum of the values in the DataFrame.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values when computing the result.
- dtype: data type
Data type to cast the result to.
- min_count: int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
The default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
Returns#
Series
Notes#
Parameters currently not supported are level, numeric_only.
Examples#
>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.sum()
a    10
b    34
dtype: int64
- tail(n=5)#
Returns the last n rows as a new DataFrame or Series.
Examples#
DataFrame
>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> df.tail(2)
   key   val
3    3  13.0
4    4  14.0
Series
>>> import cudf
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.tail(2)
3    1
4    0
- to_arrow()#
Convert to arrow Table
Examples#
>>> import cudf
>>> df = cudf.DataFrame(
...     {"a":[1, 2, 3], "b":[4, 5, 6]}, index=[1, 2, 3])
>>> df.to_arrow()
pyarrow.Table
a: int64
b: int64
index: int64
----
a: [[1,2,3]]
b: [[4,5,6]]
index: [[1,2,3]]
- to_cupy(dtype: Dtype | None = None, copy: bool = False, na_value=None) → cupy.ndarray#
Convert the Frame to a CuPy array.
Parameters#
- dtype: str or numpy.dtype, optional
The dtype to pass to numpy.asarray().
- copy: bool, default False
Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_cupy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.
- na_value: Any, default None
The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.
Returns#
cupy.ndarray
- to_dlpack()#
Converts a cuDF object into a DLPack tensor.
DLPack is an open-source memory tensor structure: dmlc/dlpack.
This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.
Parameters#
cudf_obj : DataFrame, Series, Index, or Column
Returns#
- pycapsule_obj: PyCapsule
Output DLPack tensor pointer which is encapsulated in a PyCapsule object.
- to_hdf(path_or_buf, key, *args, **kwargs)#
Write the contained data to an HDF5 file using HDFStore.
Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.
To add another DataFrame or Series to an existing HDF file, use append mode and a different key.
For more information see the user guide.
Parameters#
- path_or_buf: str or pandas.HDFStore
File path or HDFStore object.
- key: str
Identifier for the group in the store.
- mode: {‘a’, ‘w’, ‘r+’}, default ‘a’
Mode to open file:
‘w’: write, a new file is created (an existing file with the same name would be deleted).
‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.
‘r+’: similar to ‘a’, but the file must already exist.
- format: {‘fixed’, ‘table’}, default ‘fixed’
Possible values:
‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable.
‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.
- append: bool, default False
For Table formats, append the input data to the existing.
- data_columns: list of columns or True, optional
List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.
- complevel: {0-9}, optional
Specifies a compression level for data. A value of 0 disables compression.
- complib: {‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’
Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.
- fletcher32: bool, default False
If applying compression use the fletcher32 checksum.
- dropna: bool, default False
If true, ALL nan rows will not be written to store.
- errors: str, default ‘strict’
Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.
See Also#
cudf.read_hdf : Read from HDF file.
cudf.DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
cudf.DataFrame.to_feather : Write out feather-format for DataFrames.
- to_json(path_or_buf=None, *args, **kwargs)#
Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.
Parameters#
- path_or_buf: string or file handle, optional
File path or object. If not specified, the result is returned as a string.
- engine: {‘auto’, ‘cudf’, ‘pandas’}, default ‘auto’
Parser engine to use. If ‘auto’ is passed, the pandas engine will be selected.
- orient: string
Indication of expected JSON string format.
- Series
default is ‘index’
allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}
- DataFrame
default is ‘columns’
allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}
- The format of the JSON string
‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
‘records’ : list like [{column -> value}, … , {column -> value}]
‘index’ : dict like {index -> {column -> value}}
‘columns’ : dict like {column -> {index -> value}}
‘values’ : just the values array
‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records'.
- date_format: {None, ‘epoch’, ‘iso’}
Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.
- double_precision: int, default 10
The number of decimal places to use when encoding floating point values.
- force_ascii: bool, default True
Force encoded string to be ASCII.
- date_unit: string, default ‘ms’ (milliseconds)
The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.
- default_handler: callable, default None
Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serializable object.
- lines: bool, default False
If ‘orient’ is ‘records’, write out line-delimited JSON format. A ValueError is raised for any other ‘orient’, since the others are not list-like.
- compression: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}
A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.
- index: bool, default True
Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
See Also#
cudf.read_json
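The orient layouts listed above are easier to compare side by side. The following sketch builds each shape by hand for a two-row, two-column table using only the standard library; to_orient and the sample data are hypothetical, the ‘table’ orient is left out, and the whitespace of the serialized string may differ from what to_json emits.

```python
import json

columns = ["a", "b"]
index = [0, 1]
data = [[1, 3], [2, 4]]  # one inner list per row


def to_orient(orient):
    """Build the Python structure that corresponds to each orient."""
    if orient == "split":
        return {"index": index, "columns": columns, "data": data}
    if orient == "records":
        return [dict(zip(columns, row)) for row in data]
    if orient == "index":
        return {str(i): dict(zip(columns, row)) for i, row in zip(index, data)}
    if orient == "columns":
        return {
            col: {str(i): row[j] for i, row in zip(index, data)}
            for j, col in enumerate(columns)
        }
    if orient == "values":
        return data
    raise ValueError(f"unsupported orient in this sketch: {orient}")


records_json = json.dumps(to_orient("records"))
columns_json = json.dumps(to_orient("columns"))
print(records_json)
print(columns_json)
```

JSON object keys are always strings, which is why the index labels appear as "0" and "1" in the ‘index’ and ‘columns’ layouts.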
- to_list()#
- to_string()#
Convert to string
cuDF uses Pandas internals for efficient string formatting. Set formatting options using pandas string formatting options and cuDF objects will print identically to Pandas objects.
cuDF supports null/None as a value in any column type, which is transparently supported during this output process.
Examples#
>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2]
>>> df['val'] = [float(i + 10) for i in range(3)]
>>> df.to_string()
' key val\n0 0 10.0\n1 1 11.0\n2 2 12.0'
- tolist()#
- var(axis=<no_default>, skipna=True, level=None, ddof=1, numeric_only=None, **kwargs)#
Return unbiased variance of the DataFrame.
Normalized by N-1 by default. This can be changed using the ddof argument.
Parameters#
- axis: {index (0), columns(1)}
Axis for the function to be applied on.
- skipna: bool, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- ddof: int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
Returns#
scalar
Notes#
Parameters currently not supported are level and numeric_only.
Examples#
>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.var()
a    1.666667
b    1.666667
dtype: float64
- repeat(repeats, axis=None)#
Repeat elements of an Index.
Returns a new Index where each element of the current Index is repeated consecutively a given number of times.
Parameters#
- repeats: int, or array of ints
The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty object.
Returns#
- Index
A newly created object of same type as caller with repeated elements.
Examples#
>>> index = cudf.Index([10, 22, 33, 55])
>>> index
Int64Index([10, 22, 33, 55], dtype='int64')
>>> index.repeat(5)
Int64Index([10, 10, 10, 10, 10, 22, 22, 22, 22, 22, 33,
            33, 33, 33, 33, 55, 55, 55, 55, 55],
           dtype='int64')
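The two accepted forms of repeats (a single int, or one count per element) can be modelled on a plain list. The helper below is a hypothetical sketch of that rule, not the library's implementation.

```python
def repeat(values, repeats):
    """Repeat each element consecutively; repeats is an int or one count per element."""
    if isinstance(repeats, int):
        repeats = [repeats] * len(values)
    if len(repeats) != len(values):
        raise ValueError("repeats must be an int or match the number of elements")
    out = []
    for value, count in zip(values, repeats):
        out.extend([value] * count)
    return out


uniform = repeat([10, 22], 2)
per_element = repeat([10, 22, 33], [1, 0, 2])
nothing = repeat([10, 22], 0)
print(uniform, per_element, nothing)
```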