hipdf.DataFrame#

Applies to Linux

class hipdf.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None, nan_as_null=<no_default>)#

Bases: IndexedFrame, GetAttrGetItemMixin

A GPU Dataframe object.

Parameters#

data : array-like, Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects.
index : Index or array-like
    Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.
columns : Index or array-like
    Column labels to use for the resulting frame. Defaults to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs. Currently not implemented.
nan_as_null : bool, default True
    If None or True, converts np.nan values to null values. If False, leaves np.nan values as is.

Examples#

Build dataframe with __setitem__:

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> df
   key   val
0    0  10.0
1    1  11.0
2    2  12.0
3    3  13.0
4    4  14.0

Build DataFrame via dict of columns:

>>> import numpy as np
>>> from datetime import datetime, timedelta
>>> t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
>>> n = 5
>>> df = cudf.DataFrame({
...     'id': np.arange(n),
...     'datetimes': np.array(
...     [(t0+ timedelta(seconds=x)) for x in range(n)])
... })
>>> df
    id            datetimes
0    0  2018-10-07 12:00:00
1    1  2018-10-07 12:00:01
2    2  2018-10-07 12:00:02
3    3  2018-10-07 12:00:03
4    4  2018-10-07 12:00:04

Build DataFrame via list of rows as tuples:

>>> df = cudf.DataFrame([
...     (5, "cats", "jump", np.nan),
...     (2, "dogs", "dig", 7.5),
...     (3, "cows", "moo", -2.1, "occasionally"),
... ])
>>> df
   0     1     2     3             4
0  5  cats  jump  <NA>          <NA>
1  2  dogs   dig   7.5          <NA>
2  3  cows   moo  -2.1  occasionally

Convert from a Pandas DataFrame:

>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})
>>> pdf
   a    b
0  0  0.1
1  1  0.2
2  2  NaN
3  3  0.3
>>> df = cudf.from_pandas(pdf)
>>> df
   a     b
0  0   0.1
1  1   0.2
2  2  <NA>
3  3   0.3
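
The NaN-to-null conversion visible in the last example (and controlled by nan_as_null in the constructor) can be pictured with a small plain-Python model. This is only a sketch of the documented behaviour: the helper name `nans_to_nulls_model` is hypothetical, Python's None stands in for a null value, and nothing here reflects how the library stores or converts data internally.

```python
import math


def nans_to_nulls_model(values, nan_as_null=True):
    """Hypothetical model: replace float NaN with None when nan_as_null is set."""
    if not nan_as_null:
        # Leave NaN values exactly as they are
        return list(values)
    return [
        None if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


column = [0.1, 0.2, float("nan"), 0.3]
converted = nans_to_nulls_model(column)                # NaN becomes a null
kept = nans_to_nulls_model(column, nan_as_null=False)  # NaN is preserved
```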

__init__(data=None, index=None, columns=None, dtype=None, copy=None, nan_as_null=<no_default>)#

Methods

`__init__`([data, index, columns, dtype, ...])
`abs`()	Return a Series/DataFrame with absolute numeric value of each element.
`add`(other[, axis, level, fill_value])	Get Addition of DataFrame or Series and other, element-wise (binary operator add).
`add_prefix`(prefix[, axis])	Prefix labels with string prefix.
`add_suffix`(suffix[, axis])	Suffix labels with string suffix.
`agg`(aggs[, axis])	Aggregate using one or more operations over the specified axis.
`all`([axis, bool_only, skipna])	Return whether all elements are True in DataFrame.
`any`([axis, bool_only, skipna])	Return whether any element is True in DataFrame.
`apply`(func[, axis, raw, result_type, args, ...])	Apply a function along an axis of the DataFrame.
`apply_chunks`(func, incols, outcols[, ...])	Transform user-specified chunks using the user-provided function.
`apply_rows`(func, incols, outcols, kwargs[, ...])	Apply a row-wise user defined function.
`applymap`(func[, na_action])	Apply a function to a Dataframe elementwise.
`argsort`([by, axis, kind, order, ascending, ...])	Return the integer indices that would sort the Series values.
`assign`(**kwargs)	Assign columns to DataFrame from keyword arguments.
`astype`(dtype[, copy, errors])	Cast the object to the given dtype.
`backfill`([value, axis, inplace, limit])	Synonym for `Series.fillna()` with `method='bfill'`.
`bfill`([value, axis, inplace, limit, limit_area])	Synonym for `Series.fillna()` with `method='bfill'`.
`clip`([lower, upper, axis, inplace])	Trim values at input threshold(s).
`convert_dtypes`([infer_objects, ...])	Convert columns to the best possible nullable dtypes.
`copy`([deep])	Make a copy of this object's indices and data.
`corr`([method, min_periods, numeric_only])	Compute the correlation matrix of a DataFrame.
`count`([axis, numeric_only])	Count `non-NA` cells for each column or row.
`cov`([min_periods, ddof, numeric_only])	Compute the covariance matrix of a DataFrame.
`cummax`([axis])	Return cumulative max of the IndexedFrame.
`cummin`([axis])	Return cumulative min of the IndexedFrame.
`cumprod`([axis])	Return cumulative product of the IndexedFrame.
`cumsum`([axis])	Return cumulative sum of the IndexedFrame.
`describe`([percentiles, include, exclude])	Generate descriptive statistics.
`deserialize`(header, frames)	Generate an object from a serialized representation.
`device_deserialize`(header, frames)	Perform device-side deserialization tasks.
`device_serialize`()	Serialize data and metadata associated with device memory.
`diff`([periods, axis])	First discrete difference of element.
`div`(other[, axis, level, fill_value])	Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).
`divide`(other[, axis, level, fill_value])	Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).
`dot`(other[, reflect])	Get dot product of frame and other, (binary operator dot).
`drop`([labels, axis, index, columns, level, ...])	Drop specified labels from rows or columns.
`drop_duplicates`([subset, keep, inplace, ...])	Return DataFrame with duplicate rows removed.
`dropna`([axis, how, thresh, subset, inplace, ...])	Drop rows (or columns) containing nulls from a Column.
`duplicated`([subset, keep])	Return boolean Series denoting duplicate rows.
`eq`(other[, axis, level, fill_value])	Get Equal to of DataFrame or Series and other, element-wise (binary operator eq).
`equals`(other)	Test whether two objects contain the same elements.
`eval`(expr[, inplace])	Evaluate a string describing operations on DataFrame columns.
`ewm`([com, span, halflife, alpha, ...])	Provide exponential weighted (EW) functions.
`explode`(column[, ignore_index])	Transform each element of a list-like to a row, replicating index values.
`ffill`([value, axis, inplace, limit, limit_area])	Synonym for `Series.fillna()` with `method='ffill'`.
`fillna`([value, method, axis, inplace, limit])	Fill null values with `value` or specified `method`.
`first`(offset)	Select initial periods of time series data based on a date offset.
`floordiv`(other[, axis, level, fill_value])	Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).
`from_arrow`(table)	Convert from PyArrow Table to DataFrame.
`from_dict`(data[, orient, dtype, columns])	Construct DataFrame from dict of array-like or dicts.
`from_pandas`(dataframe[, nan_as_null])	Convert from a Pandas DataFrame.
`from_pylibcudf`(table, metadata)	Create a DataFrame from a pylibcudf.Table.
`from_records`(data[, index, exclude, ...])	Convert structured or record ndarray to DataFrame.
`ge`(other[, axis, level, fill_value])	Get Greater than or equal to of DataFrame or Series and other, element-wise (binary operator ge).
`groupby`([by, axis, level, as_index, sort, ...])	Group using a mapper or by a Series of columns.
`gt`(other[, axis, level, fill_value])	Get Greater than of DataFrame or Series and other, element-wise (binary operator gt).
`hash_values`([method, seed])	Compute the hash of values in this column.
`head`([n])	Return the first n rows.
`host_deserialize`(header, frames)	Perform device-side deserialization tasks.
`host_serialize`()	Serialize data and metadata associated with host memory.
`info`([verbose, buf, max_cols, memory_usage, ...])	Print a concise summary of a DataFrame.
`insert`(loc, column, value[, ...])	Add a column to DataFrame at the index specified by loc.
`interleave_columns`()	Interleave Series columns of a table into a single column.
`interpolate`([method, axis, limit, inplace, ...])	Interpolate data values between some points.
`isin`(values)	Whether each element in the DataFrame is contained in values.
`isna`()	Identify missing values.
`isnull`()	Identify missing values.
`items`()	Iterate over (column name, Series) pairs.
`iterrows`()	Iteration is unsupported.
`itertuples`([index, name])	Iteration is unsupported.
`join`(other[, on, how, lsuffix, rsuffix, ...])	Join columns with other DataFrame on index or on a key column.
`keys`()	Get the columns.
`kurt`([axis, skipna, numeric_only])	Return Fisher's unbiased kurtosis of a sample.
`kurtosis`([axis, skipna, numeric_only])	Return Fisher's unbiased kurtosis of a sample.
`last`(offset)	Select final periods of time series data based on a date offset.
`le`(other[, axis, level, fill_value])	Get Less than or equal to of DataFrame or Series and other, element-wise (binary operator le).
`lt`(other[, axis, level, fill_value])	Get Less than of DataFrame or Series and other, element-wise (binary operator lt).
`map`(func[, na_action])	Apply a function to a Dataframe elementwise.
`mask`(cond[, other, inplace, axis, level])	Replace values where the condition is True.
`max`([axis, skipna, numeric_only])	Return the maximum of the values in the DataFrame.
`mean`([axis, skipna, numeric_only])	Return the mean of the values for the requested axis.
`median`([axis, skipna, numeric_only])	Return the median of the values for the requested axis.
`melt`([id_vars, value_vars, var_name, ...])	Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.
`memory_usage`([index, deep])	Return the memory usage of an object.
`merge`(right[, how, on, left_on, right_on, ...])	Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.
`min`([axis, skipna, numeric_only])	Return the minimum of the values in the DataFrame.
`mod`(other[, axis, level, fill_value])	Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).
`mode`([axis, numeric_only, dropna])	Get the mode(s) of each element along the selected axis.
`mul`(other[, axis, level, fill_value])	Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).
`multiply`(other[, axis, level, fill_value])	Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).
`nans_to_nulls`()	Convert nans (if any) to nulls
`ne`(other[, axis, level, fill_value])	Get Not equal to of DataFrame or Series and other, element-wise (binary operator ne).
`nlargest`(n, columns[, keep])	Return the first n rows ordered by columns in descending order.
`notna`()	Identify non-missing values.
`notnull`()	Identify non-missing values.
`nsmallest`(n, columns[, keep])	Return the first n rows ordered by columns in ascending order.
`nunique`([axis, dropna])	Count number of distinct elements in specified axis.
`pad`([value, axis, inplace, limit])	Synonym for `Series.fillna()` with `method='ffill'`.
`partition_by_hash`(columns, nparts[, keep_index])	Partition the dataframe by the hashed value of data in columns.
`pct_change`([periods, fill_method, limit, freq])	Calculates the percent change between sequential elements in the DataFrame.
`pipe`(func, *args, **kwargs)	Apply `func(self, *args, **kwargs)`.
`pivot`(*, columns[, index, values])	Return reshaped DataFrame organized by the given index and column values.
`pivot_table`([values, index, columns, ...])	Create a spreadsheet-style pivot table as a DataFrame.
`pop`(item)	Return a column and drop it from the DataFrame.
`pow`(other[, axis, level, fill_value])	Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).
`prod`([axis, skipna, dtype, numeric_only, ...])	Return product of the values in the DataFrame.
`product`([axis, skipna, dtype, numeric_only, ...])	Return product of the values in the DataFrame.
`quantile`([q, axis, numeric_only, ...])	Return values at the given quantile.
`query`(expr[, local_dict])	Query with a boolean expression using Numba to compile a GPU kernel.
`radd`(other[, axis, level, fill_value])	Get Addition of DataFrame or Series and other, element-wise (binary operator radd).
`rank`([axis, method, numeric_only, ...])	Compute numerical data ranks (1 through n) along axis.
`rdiv`(other[, axis, level, fill_value])	Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).
`reindex`([labels, index, columns, axis, ...])	Conform DataFrame to new index.
`rename`([mapper, index, columns, axis, copy, ...])	Alter column and index labels.
`repeat`(repeats[, axis])	Repeats elements consecutively.
`replace`([to_replace, value, inplace, limit, ...])	Replace values given in `to_replace` with `value`.
`resample`(rule[, axis, closed, label, ...])	Convert the frequency of ("resample") the given time series data.
`reset_index`([level, drop, inplace, ...])	Reset the index of the DataFrame, or a level of it.
`rfloordiv`(other[, axis, level, fill_value])	Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).
`rmod`(other[, axis, level, fill_value])	Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).
`rmul`(other[, axis, level, fill_value])	Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).
`rolling`(window[, min_periods, center, ...])	Rolling window calculations.
`round`([decimals, how])	Round to a variable number of decimal places.
`rpow`(other[, axis, level, fill_value])	Get Exponential of DataFrame or Series and other, element-wise (binary operator rpow).
`rsub`(other[, axis, level, fill_value])	Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).
`rtruediv`(other[, axis, level, fill_value])	Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).
`sample`([n, frac, replace, weights, ...])	Return a random sample of items from an axis of object.
`scale`()	Scale values to [0, 1] in float64
`scatter_by_map`(map_index[, map_size, ...])	Scatter to a list of dataframes.
`searchsorted`(values[, side, sorter, ...])	Find indices where elements should be inserted to maintain order
`select_dtypes`([include, exclude])	Return a subset of the DataFrame's columns based on the column dtypes.
`serialize`()	Generate an equivalent serializable representation of an object.
`set_index`(keys[, drop, append, inplace, ...])	Return a new DataFrame with a new index
`shift`([periods, freq, axis, fill_value, suffix])	Shift values by periods positions.
`skew`([axis, skipna, numeric_only])	Return unbiased Fisher-Pearson skew of a sample.
`sort_index`([axis, level, ascending, ...])	Sort object by labels (along an axis).
`sort_values`(by[, axis, ascending, inplace, ...])	Sort by the values along either axis.
`squeeze`([axis])	Squeeze 1 dimensional axis objects into scalars.
`stack`([level, dropna, future_stack])	Stack the prescribed level(s) from columns to index
`std`([axis, skipna, ddof, numeric_only])	Return sample standard deviation of the DataFrame.
`sub`(other[, axis, level, fill_value])	Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).
`subtract`(other[, axis, level, fill_value])	Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).
`sum`([axis, skipna, dtype, numeric_only, ...])	Return sum of the values in the DataFrame.
`swaplevel`([i, j, axis])	Swap level i with level j.
`tail`([n])	Returns the last n rows as a new DataFrame or Series
`take`(indices[, axis])	Return a new frame containing the rows specified by indices.
`tile`(count)	Repeats the rows count times to form a new Frame.
`to_arrow`([preserve_index])	Convert to a PyArrow Table.
`to_csv`([path_or_buf, sep, na_rep, columns, ...])	Write a dataframe to csv file format.
`to_cupy`([dtype, copy, na_value])	Convert the Frame to a CuPy array.
`to_dict`([orient, into, index])	Convert the DataFrame to a dictionary.
`to_dlpack`()	Converts a cuDF object into a DLPack tensor.
`to_feather`(path, *args, **kwargs)	Write a DataFrame to the feather format.
`to_hdf`(path_or_buf, key, *args, **kwargs)	Write the contained data to an HDF5 file using HDFStore.
`to_json`([path_or_buf])	Convert the cuDF object to a JSON string.
`to_numpy`([dtype, copy, na_value])	Convert the Frame to a NumPy array.
`to_orc`(fname[, compression, statistics, ...])	Write a DataFrame to the ORC format.
`to_pandas`(*[, nullable, arrow_type])	Convert to a Pandas DataFrame.
`to_parquet`(path[, engine, compression, ...])	Write a DataFrame to the parquet format.
`to_pylibcudf`([copy])	Convert this DataFrame to a pylibcudf.Table.
`to_records`([index, column_dtypes, index_dtypes])	Convert to a numpy recarray
`to_string`()	Convert to string
`to_struct`([name])	Return a struct Series composed of the columns of the DataFrame.
`transpose`()	Transpose index and columns.
`truediv`(other[, axis, level, fill_value])	Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).
`truncate`([before, after, axis, copy])	Truncate a Series or DataFrame before and after some index value.
`unstack`([level, fill_value, sort])	Pivot one or more levels of the (necessarily hierarchical) index labels.
`update`(other[, join, overwrite, ...])	Modify a DataFrame in place using non-NA values from another DataFrame.
`value_counts`([subset, normalize, sort, ...])	Return a Series containing counts of unique rows in the DataFrame.
`var`([axis, skipna, ddof, numeric_only])	Return unbiased variance of the DataFrame.
`where`(cond[, other, inplace, axis, level])	Replace values where the condition is False.

Attributes

`T`	Transpose index and columns.
`at`	Alias for `DataFrame.loc`; provided for compatibility with Pandas.
`axes`	Return a list representing the axes of the DataFrame.
`columns`	Returns a tuple of columns
`dtypes`	Return the dtypes in this object.
`empty`	Indicator whether DataFrame or Series is empty.
`iat`	Alias for `DataFrame.iloc`; provided for compatibility with Pandas.
`iloc`	Select values by position.
`index`	Get the labels for the rows.
`loc`	Select rows and columns by label or boolean mask.
`ndim`	Dimension of the data.
`shape`	Returns a tuple representing the dimensionality of the DataFrame.
`size`	Return the number of elements in the underlying data.
`values`	Return a CuPy representation of the DataFrame.
`values_host`	Return a NumPy representation of the data.

property shape#: Returns a tuple representing the dimensionality of the DataFrame.

property dtypes#

Return the dtypes in this object.

Returns#

pandas.Series: The data type of each column.

Examples#

>>> import cudf
>>> import pandas as pd
>>> df = cudf.DataFrame({'float': [1.0],
...                    'int': [1],
...                    'datetime': [pd.Timestamp('20180310')],
...                    'string': ['foo']})
>>> df
   float  int   datetime string
0    1.0    1 2018-03-10    foo
>>> df.dtypes
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object

property ndim: int#: Dimension of the data. DataFrame ndim is always 2.

__getitem__(arg)#

If arg is a str or int, return the column as a Series. If arg is a slice, return a new DataFrame with all columns sliced to the specified range. If arg is an array containing column names, return a new DataFrame with the corresponding columns. If arg is a boolean array, return the rows marked True.

Examples#

>>> df = cudf.DataFrame({
...     'a': list(range(10)),
...     'b': list(range(10)),
...     'c': list(range(10)),
... })

Get first 4 rows of all columns.

>>> df[:4]
   a  b  c
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3

Get last 5 rows of all columns.

>>> df[-5:]
   a  b  c
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9

Get columns a and c.

>>> df[['a', 'c']]
   a  c
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9

Return the rows specified in the boolean mask.

>>> df[[True, False, True, False, True,
...     False, True, False, True, False]]
   a  b  c
0  0  0  0
2  2  2  2
4  4  4  4
6  6  6  6
8  8  8  8

memory_usage(index=True, deep=False) → Series#

Return the memory usage of an object.

Parameters#

index : bool, default True
    Specifies whether to include the memory usage of the index.
deep : bool, default False
    The deep parameter is ignored and is only included for pandas compatibility.

Returns#

Series or scalar: For a DataFrame, a Series whose index is the original column names and whose values are the memory usage of each column in bytes. For a Series, the total memory usage.

Examples#

DataFrame

>>> import cudf
>>> import numpy as np
>>> dtypes = ['int64', 'float64', 'object', 'bool']
>>> data = dict([(t, np.ones(shape=5000).astype(t))
...              for t in dtypes])
>>> df = cudf.DataFrame(data)
>>> df.head()
   int64  float64  object  bool
0      1      1.0     1.0  True
1      1      1.0     1.0  True
2      1      1.0     1.0  True
3      1      1.0     1.0  True
4      1      1.0     1.0  True
>>> df.memory_usage(index=False)
int64      40000
float64    40000
object     40000
bool        5000
dtype: int64

Use a Categorical for efficient storage of an object-dtype column with many repeated values.

>>> df['object'].astype('category').memory_usage(deep=True)
5008

Series

>>> s = cudf.Series(range(3), index=['a', 'b', 'c'])
>>> s.memory_usage()
43

Not including the index gives the size of the rest of the data, which is necessarily smaller:

>>> s.memory_usage(index=False)
24

assign(**kwargs: Callable[[Self], Any] | Any)#

Assign columns to DataFrame from keyword arguments.

Parameters#

**kwargs: dict mapping string column names to values: The value for each key can either be a literal column (or something that can be converted to a column), or a callable of one argument that will be given the dataframe as an argument and should return the new column (without modifying the input argument). Columns are added in-order, so callables can refer to column names constructed in the assignment.

Examples#

>>> import cudf
>>> df = cudf.DataFrame()
>>> df = df.assign(a=[0, 1, 2], b=[3, 4, 5])
>>> df
   a  b
0  0  3
1  1  4
2  2  5
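
The in-order evaluation of callables described under Parameters can be modeled in a few lines of plain Python. This is a sketch of the documented semantics only: `assign_model` is a hypothetical helper operating on a dict of lists, not the library's implementation.

```python
def assign_model(columns, **kwargs):
    """Hypothetical model of assign over a dict of lists."""
    out = dict(columns)  # shallow copy, so the input mapping is not modified
    for name, value in kwargs.items():  # keyword order is preserved
        # A callable receives the frame built so far, including earlier kwargs
        out[name] = value(out) if callable(value) else value
    return out


base = {"a": [0, 1, 2]}
result = assign_model(
    base,
    b=lambda d: [x * 2 for x in d["a"]],  # depends on the existing column "a"
    c=lambda d: [x + 1 for x in d["b"]],  # depends on "b", assigned just above
)
```

Because "b" is added before "c" is evaluated, the second callable can read it; the original mapping `base` still holds only "a".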

astype(dtype, copy: bool = False, errors: Literal['raise', 'ignore'] = 'raise')#

Cast the object to the given dtype.

Parameters#

dtype : data type, or dict of column name -> data type
    Use a numpy.dtype or Python type to cast entire DataFrame object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
copy : bool, default False
    Return a deep copy when copy=True. Note that copy=False is the default, so changes to the values may then propagate to other cudf objects.
errors : {‘raise’, ‘ignore’}, default ‘raise’
    Control raising of exceptions on invalid data for the provided dtype.

    raise : allow exceptions to be raised
    ignore : suppress exceptions. On error return original object.

Returns#

DataFrame/Series

Examples#

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})
>>> df
    a  b
0  10  1
1  20  2
2  30  3
>>> df.dtypes
a    int64
b    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes
a    int32
b    int32
dtype: object

Cast a to float32 using a dictionary:

>>> df.astype({'a': 'float32'}).dtypes
a    float32
b      int64
dtype: object
>>> df.astype({'a': 'float32'})
      a  b
0  10.0  1
1  20.0  2
2  30.0  3

Series

>>> import cudf
>>> series = cudf.Series([1, 2], dtype='int32')
>>> series
0    1
1    2
dtype: int32
>>> series.astype('int64')
0    1
1    2
dtype: int64

Convert to categorical type:

>>> series.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> cat_dtype = cudf.CategoricalDtype(categories=[2, 1], ordered=True)
>>> series.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False (enabled by default) and changing data on a new Series will propagate changes:

>>> s1 = cudf.Series([1, 2])
>>> s1
0    1
1    2
dtype: int64
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1
0    10
1     2
dtype: int64

classmethod from_dict(data: dict, orient: str = 'columns', dtype: Dtype | None = None, columns: list | None = None) → DataFrame#

Construct DataFrame from dict of array-like or dicts. Creates DataFrame object from dictionary by columns or by index allowing dtype specification.

Parameters#

data : dict
    Of the form {field : array-like} or {field : dict}.
orient : {‘columns’, ‘index’, ‘tight’}, default ‘columns’
    The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].
dtype : dtype, default None
    Data type to force, otherwise infer.
columns : list, default None
    Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.

Returns#

DataFrame

Examples#

By default the keys of the dict become the DataFrame columns:

>>> import cudf
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> cudf.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Specify orient='index' to create the DataFrame using dictionary keys as rows:

>>> data = {'row_1': [3, 2, 1, 0], 'row_2': [10, 11, 12, 13]}
>>> cudf.DataFrame.from_dict(data, orient='index')
        0   1   2   3
row_1   3   2   1   0
row_2  10  11  12  13

When using the ‘index’ orientation, the column names can be specified manually:

>>> cudf.DataFrame.from_dict(data, orient='index',
...                          columns=['A', 'B', 'C', 'D'])
        A   B   C   D
row_1   3   2   1   0
row_2  10  11  12  13

Specify orient='tight' to create the DataFrame using a ‘tight’ format:

>>> data = {'index': [('a', 'b'), ('a', 'c')],
...         'columns': [('x', 1), ('y', 2)],
...         'data': [[1, 3], [2, 4]],
...         'index_names': ['n1', 'n2'],
...         'column_names': ['z1', 'z2']}
>>> cudf.DataFrame.from_dict(data, orient='tight')
z1     x  y
z2     1  2
n1 n2
a  b   1  3
   c   2  4

to_dict(orient: str = 'dict', into: type[dict] = <class 'dict'>, index: bool = True) → dict | list[dict]#

Convert the DataFrame to a dictionary.

The type of the key-value pairs can be customized with the parameters (see below).

Parameters#

orient : str {‘dict’, ‘list’, ‘series’, ‘split’, ‘tight’, ‘records’, ‘index’}
    Determines the type of the values of the dictionary.

    ‘dict’ (default) : dict like {column -> {index -> value}}
    ‘list’ : dict like {column -> [values]}
    ‘series’ : dict like {column -> Series(values)}
    ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
    ‘tight’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values], ‘index_names’ -> [index.names], ‘column_names’ -> [column.names]}
    ‘records’ : list like [{column -> value}, … , {column -> value}]
    ‘index’ : dict like {index -> {column -> value}}

    Abbreviations are allowed. s indicates series and sp indicates split.
into : class, default dict
    The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
index : bool, default True
    Whether to include the index item (and index_names item if orient is ‘tight’) in the returned dictionary. Can only be False when orient is ‘split’ or ‘tight’. Note that when orient is ‘records’, this parameter has no effect (the index item is never included).

Returns#

dict, list or collections.abc.Mapping: Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'col1': [1, 2],
...                      'col2': [0.5, 0.75]},
...                     index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}

You can specify the return orientation.

>>> df.to_dict('series')
{'col1': row1    1
         row2    2
Name: col1, dtype: int64,
'col2': row1    0.50
        row2    0.75
Name: col2, dtype: float64}

>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]]}

>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]

>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}

>>> df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}

You can also specify the mapping type.

>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])

If you want a defaultdict, you need to initialize it:

>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
 defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]

scatter_by_map(map_index, map_size=None, keep_index=True, debug: bool = False)#

Scatter to a list of dataframes.

Uses map_index to determine the destination of each row of the original DataFrame.

Parameters#

map_index : Series, str or list-like
    Scatter assignment for each row.
map_size : int
    Length of the output list. Must be >= the number of unique values in map_index.
keep_index : bool
    Conserve original index values for each row.

Returns#

A list of cudf.DataFrame objects.

Raises#

ValueError: If map_index has invalid entries (not all in [0, num_partitions), where num_partitions is the length of the output list).
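
The scatter described above can be pictured with a plain-Python model in which each row is sent to the output list position named by its entry in map_index. This is a minimal sketch, not the library's implementation: `scatter_by_map_model` is a hypothetical helper, rows are plain tuples, and the default used when map_size is omitted (the number of distinct destinations) is an assumption.

```python
def scatter_by_map_model(rows, map_index, map_size=None):
    """Hypothetical model: split rows into map_size lists using map_index."""
    if map_size is None:
        # Assumption: default to the number of distinct destinations
        map_size = len(set(map_index))
    if any(not 0 <= dest < map_size for dest in map_index):
        raise ValueError("map_index has entries outside [0, map_size)")
    parts = [[] for _ in range(map_size)]
    for row, dest in zip(rows, map_index):
        parts[dest].append(row)  # row order within a partition is preserved
    return parts


rows = [("r0", 10), ("r1", 11), ("r2", 12), ("r3", 13)]
parts = scatter_by_map_model(rows, [1, 0, 1, 2], map_size=4)
```

With map_size=4 and destinations 0 to 2 in use, the last output partition is simply empty.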

update(other, join='left', overwrite=True, filter_func=None, errors='ignore')#

Modify a DataFrame in place using non-NA values from another DataFrame.

Aligns on indices. There is no return value.

Parameters#

other : DataFrame, or object coercible into a DataFrame
    Should have at least one matching index/column label with the original DataFrame. If a Series is passed, its name attribute must be set, and that will be used as the column name to align with the original DataFrame.
join : {‘left’}, default ‘left’
    Only left join is implemented, keeping the index and columns of the original object.
overwrite : {True, False}, default True
    How to handle non-NA values for overlapping keys:

    True: overwrite original DataFrame’s values with values from other.
    False: only update values that are NA in the original DataFrame.
filter_func : None
    filter_func is not supported yet. (In the pandas API this is a callable that returns True for values that should be updated.)
errors : {‘raise’, ‘ignore’}, default ‘ignore’
    If ‘raise’, will raise a ValueError if the DataFrame and other both contain non-NA data in the same place.

Returns#

None : method directly changes calling object

Raises#

ValueError

When errors = ‘raise’ and there’s overlapping non-NA data.
When errors is neither ‘ignore’ nor ‘raise’.

NotImplementedError

If join != ‘left’
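
The interaction of overwrite and errors can be modeled for a single column in plain Python. This is a sketch of the documented rules only: `update_model` is a hypothetical helper, each column is a dict mapping index label to value, and None stands in for NA.

```python
def update_model(target, other, overwrite=True, errors="ignore"):
    """Hypothetical model of update for one column; modifies target in place."""
    if errors not in ("ignore", "raise"):
        raise ValueError("errors must be 'ignore' or 'raise'")
    if errors == "raise" and any(
        target[k] is not None and other.get(k) is not None for k in target
    ):
        raise ValueError("Data overlaps.")
    # Left join: only labels already present in target are considered
    for label, old in target.items():
        new = other.get(label)
        # NA values in other never replace anything
        if new is not None and (overwrite or old is None):
            target[label] = new


col = {0: 1, 1: None, 2: 3}
returned = update_model(col, {0: 10, 1: 20, 2: None, 3: 40})

col_keep = {0: 1, 1: None, 2: 3}
update_model(col_keep, {0: 10, 1: 20}, overwrite=False)  # fill NA only
```

In the first call, label 3 is ignored because it is absent from the original column, and label 2 keeps its value because the replacement is NA.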

items()#: Iterate over (column name, Series) pairs.

equals(other) → bool#

Test whether two objects contain the same elements.

This function allows two objects to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type.

Parameters#

other : Index, Series, DataFrame
    The other object to be compared with.

Returns#

bool: True if all elements are the same in both objects, False otherwise.

Examples#

>>> import cudf

Comparing Series with equals:

>>> s = cudf.Series([1, 2, 3])
>>> other = cudf.Series([1, 2, 3])
>>> s.equals(other)
True
>>> different = cudf.Series([1.5, 2, 3])
>>> s.equals(different)
False

Comparing DataFrames with equals:

>>> df = cudf.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20
>>> exactly_equal = cudf.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

For two DataFrames to compare equal, the types of column values must be equal, but the types of column labels need not:

>>> different_column_type = cudf.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True
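
The rule that NaNs in the same location compare equal differs from ordinary float comparison, where NaN is never equal to itself. A minimal plain-Python sketch of that rule follows; values_equal and columns_equal are hypothetical helpers, and the sketch ignores dtype checks, which equals also performs on column values.

```python
import math

def values_equal(a, b):
    """Treat two NaNs as equal; otherwise fall back to ==."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b

def columns_equal(left, right):
    """Same length and element-wise equal, with NaN matching NaN."""
    return len(left) == len(right) and all(
        values_equal(a, b) for a, b in zip(left, right)
    )

nan_is_self_equal = float("nan") == float("nan")
same = columns_equal([1.0, float("nan")], [1.0, float("nan")])
different = columns_equal([1.0, 2.0], [1.5, 2.0])
print(nan_is_self_equal, same, different)  # False True False
```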

property iat#: Alias for DataFrame.iloc; provided for compatibility with Pandas.

property at#: Alias for DataFrame.loc; provided for compatibility with Pandas.

property columns#: Returns a tuple of columns

reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=<NA>, limit=None, tolerance=None)#

Conform DataFrame to new index. Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

Parameters#

labelsIndex, Series-convertible, optional, default None: New labels / index to conform the axis specified by axis to.
indexIndex, Series-convertible, optional, default None: The index labels specifying the index to conform to.
columnsarray-like, optional, default None: The column labels specifying the columns to conform to.
axis: Axis to target. Can be either the axis name (index, columns) or number (0, 1).
method: Not supported.
copyboolean, default True: Return a new object, even if the passed indexes are the same.
level: Not supported.
fill_value: Value to use for missing values. Defaults to NA, but can be any “compatible” value.
limit: Not supported.
tolerance: Not supported.

Returns#

DataFrame with changed index.

Examples#

DataFrame.reindex supports two calling conventions:

(index=index_labels, columns=column_labels, ...)
(labels, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Create a dataframe with some fictional data.

>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = cudf.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                      index=index)
>>> df
        http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
            http_status response_time
Safari                404          0.07
Iceweasel            <NA>          <NA>
Comodo Dragon        <NA>          <NA>
IE10                  404          0.08
Chrome                200          0.02

We can fill in the missing values by passing a value to the keyword fill_value.

>>> df.reindex(new_index, fill_value=0)
            http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02

We can also reindex the columns.

>>> df.reindex(columns=['http_status', 'user_agent'])
        http_status user_agent
Firefox            200       <NA>
Chrome             200       <NA>
Safari             404       <NA>
IE10               404       <NA>
Konqueror          301       <NA>

Or we can use “axis-style” keyword arguments:

>>> df.reindex(['http_status', 'user_agent'], axis="columns")
        http_status user_agent
Firefox            200       <NA>
Chrome             200       <NA>
Safari             404       <NA>
IE10               404       <NA>
Konqueror          301       <NA>

set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)#

Return a new DataFrame with a new index

Parameters#

keysIndex, Series-convertible, label-like, or list: Index : the new index. Series-convertible : values for the new index. Label-like : Label of column to be used as index. List : List of items from above.
dropboolean, default True: Whether to drop corresponding column for str index argument
appendboolean, default False: Whether to append columns to the existing index, resulting in a MultiIndex.
inplaceboolean, default False: Modify the DataFrame in place (do not create a new object).
verify_integrityboolean, default False: Check for duplicates in the new index.

Examples#

>>> df = cudf.DataFrame({
...     "a": [1, 2, 3, 4, 5],
...     "b": ["a", "b", "c", "d","e"],
...     "c": [1.0, 2.0, 3.0, 4.0, 5.0]
... })
>>> df
   a  b    c
0  1  a  1.0
1  2  b  2.0
2  3  c  3.0
3  4  d  4.0
4  5  e  5.0

Set the index to become the ‘b’ column:

>>> df.set_index('b')
   a    c
b
a  1  1.0
b  2  2.0
c  3  3.0
d  4  4.0
e  5  5.0

Create a MultiIndex using columns ‘a’ and ‘b’:

>>> df.set_index(["a", "b"])
       c
a b
1 a  1.0
2 b  2.0
3 c  3.0
4 d  4.0
5 e  5.0

Set new Index instance as index:

>>> df.set_index(cudf.RangeIndex(10, 15))
    a  b    c
10  1  a  1.0
11  2  b  2.0
12  3  c  3.0
13  4  d  4.0
14  5  e  5.0

Setting append=True will combine current index with column a:

>>> df.set_index("a", append=True)
     b    c
  a
0 1  a  1.0
1 2  b  2.0
2 3  c  3.0
3 4  d  4.0
4 5  e  5.0

set_index supports inplace parameter too:

>>> df.set_index("a", inplace=True)
>>> df
   b    c
a
1  a  1.0
2  b  2.0
3  c  3.0
4  d  4.0
5  e  5.0

fillna(value=None, method=None, axis=None, inplace=False, limit=None)#

Fill null values with value or specified method.

Parameters#

valuescalar, Series-like or dict: Value to use to fill nulls. If Series-like, null values are filled with values in corresponding indices. A dict can be used to provide different values to fill nulls in different columns. Cannot be used with method.
method{‘ffill’, ‘bfill’}, default None: Method to use for filling null values in the dataframe or series. ffill propagates the last non-null values forward to the next non-null value. bfill propagates backward with the next non-null value. Cannot be used with value.

Deprecated since version 24.04: method is deprecated.

Returns#

resultDataFrame, Series, or Index: Copy with nulls filled.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, None], 'b': [3, None, 5]})
>>> df
      a     b
0     1     3
1     2  <NA>
2  <NA>     5
>>> df.fillna(4)
   a  b
0  1  3
1  2  4
2  4  5
>>> df.fillna({'a': 3, 'b': 4})
   a  b
0  1  3
1  2  4
2  3  5

fillna on a Series object:

>>> ser = cudf.Series(['a', 'b', None, 'c'])
>>> ser
0       a
1       b
2    <NA>
3       c
dtype: object
>>> ser.fillna('z')
0    a
1    b
2    z
3    c
dtype: object

fillna also supports in-place operation:

>>> ser.fillna('z', inplace=True)
>>> ser
0    a
1    b
2    z
3    c
dtype: object
>>> df.fillna({'a': 3, 'b': 4}, inplace=True)
>>> df
   a  b
0  1  3
1  2  4
2  3  5

fillna specified with a fill method:

>>> ser = cudf.Series([1, None, None, 2, 3, None, None])
>>> ser.fillna(method='ffill')
0    1
1    1
2    1
3    2
4    3
5    3
6    3
dtype: int64
>>> ser.fillna(method='bfill')
0       1
1       2
2       2
3       2
4       3
5    <NA>
6    <NA>
dtype: int64

where(cond, other=None, inplace=False, axis=None, level=None)#

Replace values where the condition is False.

Parameters#

condbool Series/DataFrame, array-like

Where cond is True, keep the original value. Where False, replace with corresponding value from other. Callables are not supported.

other: scalar, list of scalars, Series/DataFrame

Entries where cond is False are replaced with corresponding value from other. Callables are not supported. Default is None.

DataFrame expects only a scalar, an array-like of scalars, or a DataFrame with the same dimensions as self.

Series expects only a scalar or a Series-like of the same length.

inplacebool, default False

Whether to perform the operation in place on the data.

Returns#

Same type as caller

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"A":[1, 4, 5], "B":[3, 5, 8]})
>>> df.where(df % 2 == 0, [-1, -1])
   A  B
0 -1 -1
1  4 -1
2 -1  8

>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.where(ser > 2, 10)
0     4
1     3
2    10
3    10
4    10
dtype: int64
>>> ser.where(ser > 2)
0       4
1       3
2    <NA>
3    <NA>
4    <NA>
dtype: int64

reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='', allow_duplicates: bool = False, names: Hashable | Sequence[Hashable] | None = None)#

Reset the index of the DataFrame, or a level of it.

Parameters#

levelint, str, tuple, or list, default None: Only remove the given levels from the index. Removes all levels by default.
dropbool, default False: Do not try to insert index into dataframe columns. This resets the index to the default integer index.
inplacebool, default False: Modify the DataFrame in place (do not create a new object).
allow_duplicatesbool, default False: Allow duplicate column labels to be created. Currently not supported.

Returns#

DataFrame or None: DataFrame with the new index or None if inplace=True.

Examples#

>>> df = cudf.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class max_speed
falcon    bird     389.0
parrot    bird      24.0
lion    mammal      80.5
monkey  mammal      <NA>
>>> df.reset_index()
    index   class max_speed
0  falcon    bird     389.0
1  parrot    bird      24.0
2    lion  mammal      80.5
3  monkey  mammal      <NA>
>>> df.reset_index(drop=True)
    class max_speed
0    bird     389.0
1    bird      24.0
2  mammal      80.5
3  mammal      <NA>

You can also use reset_index with MultiIndex.

>>> index = cudf.MultiIndex.from_tuples([('bird', 'falcon'),
...                                     ('bird', 'parrot'),
...                                     ('mammal', 'lion'),
...                                     ('mammal', 'monkey')],
...                                     names=['class', 'name'])
>>> df = cudf.DataFrame([(389.0, 'fly'),
...                      ( 24.0, 'fly'),
...                      ( 80.5, 'run'),
...                      (np.nan, 'jump')],
...                      index=index,
...                      columns=('speed', 'type'))
>>> df
               speed  type
class  name
bird   falcon  389.0   fly
       parrot   24.0   fly
mammal lion     80.5   run
       monkey   <NA>  jump
>>> df.reset_index(level='class')
         class  speed  type
name
falcon    bird  389.0   fly
parrot    bird   24.0   fly
lion    mammal   80.5   run
monkey  mammal   <NA>  jump

insert(loc, column, value, allow_duplicates: bool = False, nan_as_null=<no_default>)#

Add a column to DataFrame at the index specified by loc.

Parameters#

locint: Location to insert by index; cannot be greater than num columns + 1.
columnnumber or string: Column or label of column to be inserted.
valueSeries or array-like
nan_as_nullbool, Default None: If None/True, converts np.nan values to null values. If False, leaves np.nan values as is.

property axes#

Return a list representing the axes of the DataFrame.

DataFrame.axes returns a list of two elements: element zero is the row index and element one is the columns.

Examples#

>>> import cudf
>>> cdf1 = cudf.DataFrame()
>>> cdf1["key"] = [0,0,1,1]
>>> cdf1["k2"] = [1,2,2,3]
>>> cdf1["val"] = [1,2,3,4]
>>> cdf1["temp"] = [-1,2,2,3]
>>> cdf1.axes
[RangeIndex(start=0, stop=4, step=1),
    Index(['key', 'k2', 'val', 'temp'], dtype='object')]

diff(periods=1, axis=0)#

First discrete difference of element.

Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is element in previous row).

Parameters#

periodsint, default 1: Periods to shift for calculating difference, accepts negative values.
axis{0 or ‘index’, 1 or ‘columns’}, default 0: Take difference over rows (0) or columns (1). Only row-wise (0) shift is supported.

Returns#

DataFrame: First differences of the DataFrame.

Examples#

>>> import cudf
>>> gdf = cudf.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                       'b': [1, 1, 2, 3, 5, 8],
...                       'c': [1, 4, 9, 16, 25, 36]})
>>> gdf
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> gdf.diff(periods=2)
      a     b     c
0  <NA>  <NA>  <NA>
1  <NA>  <NA>  <NA>
2     2     1     8
3     2     2    12
4     2     3    16
5     2     5    20

drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)#

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters#

subsetcolumn label or sequence of labels, optional: Only consider certain columns for identifying duplicates, by default use all of the columns.
keep{‘first’, ‘last’, False}, default ‘first’: Determines which duplicates (if any) to keep. - ‘first’ : Drop duplicates except for the first occurrence. - ‘last’ : Drop duplicates except for the last occurrence. - False : Drop all duplicates.
inplacebool, default False: Whether to drop duplicates in place or to return a copy.
ignore_indexbool, default False: If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns#

DataFrame or None: DataFrame with duplicates removed or None if inplace=True.

Examples#

Consider a dataset containing ramen ratings.

>>> import cudf
>>> df = cudf.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0

pop(item)#: Return a column and drop it from the DataFrame.

rename(mapper=None, index=None, columns=None, axis=0, copy=True, inplace=False, level=None, errors='ignore')#

Alter column and index labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

DataFrame.rename supports two calling conventions:

(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={0/'index' or 1/'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Parameters#

mapperdict-like or function, default None

Optional dict-like or function transformations to apply to the index or column values, depending on the selected axis.

indexdict-like, default None

Optional dict-like transformations to apply to the index axis’ values. Does not support functions for axis 0 yet.

columnsdict-like or function, default None

Optional dict-like or function transformations to apply to the columns axis’ values.

axisint, default 0

Axis to rename with mapper: 0 or ‘index’ for the index, 1 or ‘columns’ for the columns.

copyboolean, default True

Also copy underlying data.

inplaceboolean, default False

If False, return a new DataFrame. If True, assign columns without copying.

levelint or level name, default None

In case of a MultiIndex, only rename labels in the specified level.

errors{‘raise’, ‘ignore’, ‘warn’}, default ‘ignore’

Only ‘ignore’ is supported. Controls raising of exceptions on invalid data for the provided dtype.

raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object.
warn : prints last exceptions as warnings and return original object.

Returns#

DataFrame

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df
   A  B
0  1  4
1  2  5
2  3  6

Rename columns using a mapping:

>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6

Rename index using a mapping:

>>> df.rename(index={0: 10, 1: 20, 2: 30})
    A  B
10  1  4
20  2  5
30  3  6

add_prefix(prefix, axis=None)#

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters#

prefixstr: The string to add before each label.

Returns#

Series or DataFrame: New Series with updated labels or DataFrame with updated labels.

Examples#

Series

>>> s = cudf.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

DataFrame

>>> df = cudf.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_prefix('col_')
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6

add_suffix(suffix, axis=None)#

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters#

suffixstr: The string to add after each label.

Returns#

Series or DataFrame: New Series with updated labels or DataFrame with updated labels.

Examples#

Series

>>> s = cudf.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

DataFrame

>>> df = cudf.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_suffix('_col')
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6

agg(aggs, axis=None)#

Aggregate using one or more operations over the specified axis.

Parameters#

aggsIterable (set, list, string, tuple or dict)

Function to use for aggregating data. Accepted types are:

string name, e.g. "sum"
list of functions, e.g. ["sum", "min", "max"]
dict mapping axis labels to the operation to apply per column, e.g. {"a": "sum"}

axis: Not yet supported.

Returns#

Aggregation ResultSeries or DataFrame: When DataFrame.agg is called with single agg, Series is returned. When DataFrame.agg is called with several aggs, DataFrame is returned.
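
The three accepted forms of aggs and the shape of what they return can be illustrated with a plain-Python model. This is a sketch under stated assumptions, not the library's code: agg_model and OPS are hypothetical names, a dict of lists stands in for the DataFrame, a flat dict stands in for a returned Series, a nested dict stands in for a returned DataFrame, and the dict form only handles one operation name per column.

```python
from statistics import mean

# Hypothetical lookup from operation names to plain-Python reducers.
OPS = {"sum": sum, "min": min, "max": max, "mean": mean}

def agg_model(columns, aggs):
    """Model the result shape of agg() for a dict of column lists."""
    if isinstance(aggs, str):
        # single operation: one value per column (Series-like)
        return {name: OPS[aggs](vals) for name, vals in columns.items()}
    if isinstance(aggs, dict):
        # per-column operation: one value per listed column (Series-like)
        return {name: OPS[op](columns[name]) for name, op in aggs.items()}
    # several operations: one row per operation (DataFrame-like)
    return {op: {name: OPS[op](vals) for name, vals in columns.items()}
            for op in aggs}

data = {"a": [1, 2, 3], "b": [4, 5, 6]}
single = agg_model(data, "sum")
several = agg_model(data, ["min", "max"])
per_column = agg_model(data, {"a": "sum", "b": "max"})
print(single)      # {'a': 6, 'b': 15}
print(several)     # {'min': {'a': 1, 'b': 4}, 'max': {'a': 3, 'b': 6}}
print(per_column)  # {'a': 6, 'b': 6}
```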

nlargest(n, columns, keep='first')#

Return the first n rows ordered by columns in descending order.

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

Parameters#

nint

Number of rows to return.

columnslabel or list of labels

Column label(s) to order by.

keep{‘first’, ‘last’}, default ‘first’

Where there are duplicate values:

first : prioritize the first occurrence(s)
last : prioritize the last occurrence(s)

Returns#

DataFrame: The first n rows ordered by the given columns in descending order.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'population': [59000000, 65000000, 434000,
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI
>>> df.nlargest(3, 'population')
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT
>>> df.nlargest(3, 'population', keep='last')
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

nsmallest(n, columns, keep='first')#

Return the first n rows ordered by columns in ascending order.

Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.

Parameters#

nint

Number of items to retrieve.

columnslist or str

Column name or names to order by.

keep{‘first’, ‘last’}, default ‘first’

Where there are duplicate values:

first : take the first occurrence.
last : take the last occurrence.

Returns#

DataFrame

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'population': [59000000, 65000000, 434000,
...                                   434000, 434000, 337000, 337000,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru         337000      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nsmallest to select the three rows having the smallest values in column “population”.

>>> df.nsmallest(3, 'population')
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS

When using keep='last', ties are resolved in reverse order:

>>> df.nsmallest(3, 'population', keep='last')
          population  GDP alpha-2
Anguilla       11300  311      AI
Tuvalu         11300   38      TV
Nauru         337000  182      NR

swaplevel(i=-2, j=-1, axis=0)#

Swap level i with level j. Calling this method does not change the ordering of the values.

Parameters#

iint or str, default -2: First level of index to be swapped.
jint or str, default -1: Second level of index to be swapped.
axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

Examples#

>>> import cudf
>>> midx = cudf.MultiIndex(levels=[['llama', 'cow', 'falcon'],
...   ['speed', 'weight', 'length'],['first','second']],
...   codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2],
...             [0, 0, 0, 0, 0, 0, 1, 1, 1]])
>>> cdf = cudf.DataFrame(index=midx, columns=['big', 'small'],
...  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...         [250, 150], [1.5, 0.8], [320, 250], [1, 0.8], [0.3, 0.2]])

>>> cdf
                             big  small
     llama  speed  first    45.0   30.0
            weight first   200.0  100.0
            length first     1.5    1.0
     cow    speed  first    30.0   20.0
            weight first   250.0  150.0
            length first     1.5    0.8
     falcon speed  second  320.0  250.0
            weight second    1.0    0.8
            length second    0.3    0.2

>>> cdf.swaplevel()
                             big  small
     llama  first  speed    45.0   30.0
                   weight  200.0  100.0
                   length    1.5    1.0
     cow    first  speed    30.0   20.0
                   weight  250.0  150.0
                   length    1.5    0.8
     falcon second speed   320.0  250.0
                   weight    1.0    0.8
                   length    0.3    0.2

transpose()#: Transpose index and columns.

Returns#

a new (ncol x nrow) dataframe. self is (nrow x ncol)

property T#: Transpose index and columns.

Returns#

a new (ncol x nrow) dataframe. self is (nrow x ncol)
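
The shape relationship can be seen with a nested list standing in for the frame. This is a plain-Python illustration of transposition only, not the library's implementation.

```python
# Two rows, three columns: shape (nrow, ncol) == (2, 3).
rows = [
    [1, 2, 3],
    [4, 5, 6],
]

# zip(*rows) regroups values by column, which is the transpose.
transposed = [list(col) for col in zip(*rows)]

shape_before = (len(rows), len(rows[0]))
shape_after = (len(transposed), len(transposed[0]))
print(transposed)                 # [[1, 4], [2, 5], [3, 6]]
print(shape_before, shape_after)  # (2, 3) (3, 2)
```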

melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index: bool = True)#

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Parameters#

frameDataFrame
id_varstuple, list, or ndarray, optional: Column(s) to use as identifier variables. default: None

value_varstuple, list, or ndarray, optional: Column(s) to unpivot. default: all columns that are not set as id_vars.
var_namescalar: Name to use for the variable column. default: frame.columns.name or ‘variable’
value_namestr: Name to use for the value column. default: ‘value’

Returns#

outDataFrame: Melted result
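
The wide-to-long reshaping can be sketched in plain Python. The model below is an assumption-laden illustration rather than the library's implementation: melt_model is a hypothetical helper, a dict of lists stands in for the DataFrame, the result is a list of row dicts, and the row order (all rows of the first value column, then the next) follows the pandas convention.

```python
def melt_model(columns, id_vars, value_vars=None,
               var_name="variable", value_name="value"):
    """Unpivot a dict of column lists into a list of long-format rows."""
    if value_vars is None:
        # default: every column that is not an identifier
        value_vars = [c for c in columns if c not in id_vars]
    n_rows = len(next(iter(columns.values())))
    long_rows = []
    for var in value_vars:
        for i in range(n_rows):
            row = {key: columns[key][i] for key in id_vars}
            row[var_name] = var               # which column the value came from
            row[value_name] = columns[var][i]
            long_rows.append(row)
    return long_rows

wide = {"id": [1, 2], "x": [10, 20], "y": [30, 40]}
long_rows = melt_model(wide, id_vars=["id"])
for row in long_rows:
    print(row)
# {'id': 1, 'variable': 'x', 'value': 10}
# {'id': 2, 'variable': 'x', 'value': 20}
# {'id': 1, 'variable': 'y', 'value': 30}
# {'id': 2, 'variable': 'y', 'value': 40}
```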

merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), indicator=False, validate=None)#

Merge GPU DataFrame objects by performing a database-style join operation by columns or indexes.

Parameters#

rightDataFrame

onlabel or list; defaults to None

Column or index level names to join on. These must be found in both DataFrames.

If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

how{‘left’, ‘outer’, ‘inner’, ‘leftsemi’, ‘leftanti’}, default ‘inner’

Type of merge to be performed.

left : use only keys from left frame, similar to a SQL left outer join.
right : not supported.
outer : use union of keys from both frames, similar to a SQL full outer join.
inner : use intersection of keys from both frames, similar to a SQL inner join.
leftsemi : similar to inner join, but only returns columns from the left dataframe and ignores all columns from the right dataframe.
leftanti : returns only the rows (and columns) of the left dataframe for non-matched records. This is the exact opposite of a leftsemi join.

left_onlabel or list, or array-like

Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

right_onlabel or list, or array-like

Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

left_indexbool, default False

Use the index from the left DataFrame as the join key(s).

right_indexbool, default False

Use the index from the right DataFrame as the join key.

sortbool, default False

Sort the resulting dataframe by the columns that were merged on, starting from the left.

suffixes: Tuple[str, str], defaults to (‘_x’, ‘_y’)

Suffixes applied to overlapping column names on the left and right sides

Returns#

merged : DataFrame

Examples#

>>> import cudf
>>> df_a = cudf.DataFrame()
>>> df_a['key'] = [0, 1, 2, 3, 4]
>>> df_a['vals_a'] = [float(i + 10) for i in range(5)]
>>> df_b = cudf.DataFrame()
>>> df_b['key'] = [1, 2, 4]
>>> df_b['vals_b'] = [float(i+10) for i in range(3)]
>>> df_merged = df_a.merge(df_b, on=['key'], how='left')
>>> df_merged.sort_values('key')
   key  vals_a  vals_b
3    0    10.0
0    1    11.0    10.0
1    2    12.0    11.0
4    3    13.0
2    4    14.0    12.0

Merging on categorical variables is only allowed in certain cases

Categorical variable typecasting logic depends on both how and the specifics of the categorical variables to be merged. Merging categorical variables when only one side is ordered is ambiguous and not allowed. Merging when both categoricals are ordered is allowed, but only when the categories are exactly equal and have equal ordering, and will result in the common dtype. When both sides are unordered, the result categorical depends on the kind of join:

For inner joins, the result will be the intersection of the categories.
For left or right joins, the result will be the left or right dtype respectively. This extends to semi and anti joins.
For outer joins, the result will be the union of categories from both sides.
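
The leftsemi and leftanti options of how are less familiar than the SQL-style joins, so here is a plain-Python model of which left rows each one keeps. It is a sketch of the described semantics only: left_semi and left_anti are hypothetical helpers, lists of dicts stand in for the two frames, and the data mirrors df_a and df_b from the example above.

```python
def left_semi(left_rows, right_rows, key):
    """Keep left rows whose key appears in the right side; left columns only."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]

def left_anti(left_rows, right_rows, key):
    """Keep left rows whose key does not appear in the right side."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]

left = [{"key": k, "vals_a": float(k + 10)} for k in range(5)]
right = [{"key": k, "vals_b": float(i + 10)} for i, k in enumerate([1, 2, 4])]

semi = left_semi(left, right, "key")
anti = left_anti(left, right, "key")
print([row["key"] for row in semi])  # [1, 2, 4]
print([row["key"] for row in anti])  # [0, 3]
```

Together the two results partition the left rows, which is the sense in which leftanti is the opposite of leftsemi.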

join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False, validate: str | None = None)#

Join columns with other DataFrame on index or on a key column.

Parameters#

otherDataFrame

howstr

Only accepts “left”, “right”, “inner”, “outer”.

lsuffix, rsuffixstr

The suffixes to add to the left (lsuffix) and right (rsuffix) column names when avoiding conflicts.

sortbool

Set to True to ensure sorted ordering.

validatestr, optional

If specified, checks if join is of specified type.

“one_to_one” or “1:1”: check if join keys are unique in both left and right datasets.
“one_to_many” or “1:m”: check if join keys are unique in left dataset.
“many_to_one” or “m:1”: check if join keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.

Currently not supported.

Returns#

joined : DataFrame
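
Since validate is currently not supported, the uniqueness checks it describes would have to be done by hand. The sketch below models those checks in plain Python on lists of join keys; is_unique and check_validate are hypothetical helpers, not part of the library.

```python
from collections import Counter

def is_unique(keys):
    """True when no key value occurs more than once."""
    return all(count == 1 for count in Counter(keys).values())

def check_validate(left_keys, right_keys, validate):
    """Return True when the keys satisfy the requested join type."""
    if validate in ("one_to_one", "1:1"):
        return is_unique(left_keys) and is_unique(right_keys)
    if validate in ("one_to_many", "1:m"):
        return is_unique(left_keys)
    if validate in ("many_to_one", "m:1"):
        return is_unique(right_keys)
    if validate in ("many_to_many", "m:m"):
        return True  # allowed, but nothing is checked
    raise ValueError(f"unknown validate value: {validate!r}")

left_keys = [0, 1, 2]
right_keys = [1, 1, 2]  # key 1 is duplicated on the right
results = {v: check_validate(left_keys, right_keys, v)
           for v in ("1:1", "1:m", "m:1", "m:m")}
print(results)  # {'1:1': False, '1:m': True, 'm:1': False, 'm:m': True}
```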

groupby(by=None, axis=0, level=None, as_index=True, sort=<no_default>, group_keys=False, observed=True, dropna=True)#

Group using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters#

bymapping, function, label, or list of labels: Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If a cupy array is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
levelint, level name, or sequence of such, default None: If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_indexbool, default True: For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sortbool, default False: Sort result by group key. Unlike pandas, cudf defaults to False for better performance. Note that this does not influence the order of observations within each group; groupby preserves the order of rows within each group.
group_keysbool, optional: When calling apply and the by argument produces a like-indexed result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise. This argument has no effect if the result produced is not like-indexed with respect to the input.

Returns#

DataFrameGroupBy: Returns a DataFrameGroupBy object that contains information about the groups.

Examples#

Series

>>> ser = cudf.Series([390., 350., 30., 20.],
...                 index=['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...                 name="Max Speed")
>>> ser
Falcon    390.0
Falcon    350.0
Parrot     30.0
Parrot     20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0, sort=True).mean()
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(ser > 100, sort=True).mean()
Max Speed
False     25.0
True     370.0
Name: Max Speed, dtype: float64

DataFrame

>>> import cudf
>>> import pandas as pd
>>> df = cudf.DataFrame({
...     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...     'Max Speed': [380., 370., 24., 26.],
... })
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal'], sort=True).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = cudf.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...     index=index)
>>> df
                Max Speed
Animal Type
Falcon Captive      390.0
        Wild         350.0
Parrot Captive       30.0
        Wild          20.0
>>> df.groupby(level=0, sort=True).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type", sort=True).mean()
        Max Speed
Type
Captive      210.0
Wild         185.0

>>> df = cudf.DataFrame({'A': 'a a b'.split(),
...                      'B': [1,2,3],
...                      'C': [4,6,5]})
>>> g1 = df.groupby('A', group_keys=False, sort=True)
>>> g2 = df.groupby('A', group_keys=True, sort=True)

Notice that g1 and g2 both have two groups, a and b, and differ only in their group_keys argument. Calling apply in various ways, we can get different grouping results:

>>> g1[['B', 'C']].apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0

In the above, the groups are not part of the index. We can have them included by using g2 where group_keys=True:

>>> g2[['B', 'C']].apply(lambda x: x / x.sum())
            B    C
A
a 0  0.333333  0.4
  1  0.666667  0.6
b 2  1.000000  1.0
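
As a rough host-side illustration of what the two apply calls above compute, the following plain-Python sketch splits rows by key, divides each value by its group sum, and optionally prefixes the result labels with the group key. The helper grouped_normalize and its list-based layout are hypothetical and are not part of the cudf API.

```python
from collections import defaultdict


def grouped_normalize(keys, values, group_keys=False):
    """Divide each value by the sum of its group (hypothetical helper)."""
    # Split: collect the row positions that belong to each key.
    groups = defaultdict(list)
    for pos, key in enumerate(keys):
        groups[key].append(pos)

    result = {}
    # sorted() mirrors sort=True in the example above.
    for key in sorted(groups):
        total = sum(values[pos] for pos in groups[key])
        for pos in groups[key]:
            # Combine: label is (key, pos) when group_keys=True, else pos.
            label = (key, pos) if group_keys else pos
            result[label] = values[pos] / total
    return result


keys = ["a", "a", "b"]
col_b = [1, 2, 3]

without_keys = grouped_normalize(keys, col_b, group_keys=False)
with_keys = grouped_normalize(keys, col_b, group_keys=True)
print(without_keys)
print(with_keys)
```

The values are identical in both results; only the labels differ, which matches the difference between g1 and g2.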

query(expr, local_dict=None)#

Query with a boolean expression using Numba to compile a GPU kernel.

See pandas.DataFrame.query().

Parameters#

exprstr

A boolean expression. Names in expression refer to columns. index can be used instead of index name, but this is not supported for MultiIndex.

Names starting with @ refer to Python variables.

An output value will be null if any of the input values are null regardless of expression.

local_dictdict

A dictionary containing the local variables to be used in the query.

Returns#

filtered : DataFrame

Examples#

>>> df = cudf.DataFrame({
...     "a": [1, 2, 2],
...     "b": [3, 4, 5],
... })
>>> expr = "(a == 2 and b == 4) or (b == 3)"
>>> df.query(expr)
   a  b
0  1  3
1  2  4

DateTime conditionals:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> df.query('datetimes==@search_date')
   datetimes
1 2018-10-08

Using local_dict:

>>> import numpy as np
>>> import datetime
>>> df = cudf.DataFrame()
>>> data = np.array(['2018-10-07', '2018-10-08'], dtype='datetime64')
>>> df['datetimes'] = data
>>> search_date2 = datetime.datetime.strptime('2018-10-08', '%Y-%m-%d')
>>> df.query('datetimes==@search_date',
...          local_dict={'search_date': search_date2})
   datetimes
1 2018-10-08
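
The lookup rules described above (bare names refer to columns, names starting with @ refer to Python variables, null inputs give a null output) can be sketched on the CPU as follows. This is only an illustration under simplifying assumptions: query_rows is a hypothetical helper, it uses Python's eval rather than a compiled GPU kernel, and it treats a null in any column of a row as a null result.

```python
import re


def query_rows(columns, expr, local_dict=None):
    """Return positions of rows for which expr is true (CPU sketch only)."""
    local_dict = local_dict or {}
    # Rewrite "@name" to a prefixed name that is resolved from local_dict.
    py_expr = re.sub(r"@(\w+)", r"__local_\1", expr)
    env_locals = {f"__local_{k}": v for k, v in local_dict.items()}

    n_rows = len(next(iter(columns.values())))
    kept = []
    for i in range(n_rows):
        row = {name: col[i] for name, col in columns.items()}
        # Simplification: a null anywhere in the row yields a null result,
        # so the row is not selected.
        if any(value is None for value in row.values()):
            continue
        if eval(py_expr, {"__builtins__": {}}, {**row, **env_locals}):
            kept.append(i)
    return kept


table = {"a": [1, 2, 2], "b": [3, 4, 5]}
print(query_rows(table, "(a == 2 and b == 4) or (b == 3)"))  # [0, 1]
print(query_rows(table, "a == @target", {"target": 2}))      # [1, 2]
```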

apply(func, axis=1, raw=False, result_type=None, args=(), by_row: Literal[False, 'compat'] = 'compat', engine: Literal['python', 'numba'] = 'python', engine_kwargs: dict[str, bool] | None = None, **kwargs)#

Apply a function along an axis of the DataFrame. apply relies on Numba to JIT compile func. Thus the allowed operations within func are limited to those supported by the CUDA Python Numba target. For more information, see the cuDF guide to user defined functions.

Some string functions and methods are supported. Refer to the guide to UDFs for details.

Parameters#

funcfunction

Function to apply to each row.

axis{0 or ‘index’, 1 or ‘columns’}, default 1

Axis along which the function is applied.

0 or ‘index’: apply function to each column (not yet supported).
1 or ‘columns’: apply function to each row.

raw: bool, default False

Not yet supported

result_type: {‘expand’, ‘reduce’, ‘broadcast’, None}, default None

Not yet supported

args: tuple

Positional arguments to pass to func in addition to the dataframe.

by_rowFalse or “compat”, default “compat”

Only has an effect when func is a listlike or dictlike of funcs and the func isn’t a string. If “compat”, will first try, if possible, to translate the func into pandas methods (e.g. Series().apply(np.sum) will be translated to Series().sum()). If that doesn’t work, will try calling apply again with by_row=True, and if that fails, will call apply again with by_row=False (backward compatible). If False, the funcs will be passed the whole Series at once.

Currently not supported.

engine{‘python’, ‘numba’}, default ‘python’

Unused. Added for compatibility with pandas.

engine_kwargsdict

Unused. Added for compatibility with pandas.

**kwargs

Additional keyword arguments to pass to func.

Examples#

Simple function of a single variable which could be NA:

>>> def f(row):
...     if row['a'] is cudf.NA:
...             return 0
...     else:
...             return row['a'] + 1
...
>>> df = cudf.DataFrame({'a': [1, cudf.NA, 3]})
>>> df.apply(f, axis=1)
0    2
1    0
2    4
dtype: int64

Function of multiple variables will operate in a null aware manner:

>>> def f(row):
...     return row['a'] - row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, cudf.NA, 3, cudf.NA],
...     'b': [5, 6, cudf.NA, cudf.NA]
... })
>>> df.apply(f)
0      -4
1    <NA>
2    <NA>
3    <NA>
dtype: int64
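
The null-aware behaviour shown above, where any arithmetic involving a missing operand yields a missing result, can be mimicked on the host with a small stand-in object. The _NAType class below is a hypothetical illustration and is not cudf.NA; it only supports addition and subtraction.

```python
class _NAType:
    """Minimal stand-in for a missing value (not cudf.NA itself)."""

    def __repr__(self):
        return "<NA>"

    def _propagate(self, other):
        # Any arithmetic with a missing operand stays missing.
        return self

    __add__ = __radd__ = __sub__ = __rsub__ = _propagate


NA = _NAType()


def f(row):
    return row["a"] - row["b"]


rows = [
    {"a": 1, "b": 5},
    {"a": NA, "b": 6},
    {"a": 3, "b": NA},
    {"a": NA, "b": NA},
]
results = [f(row) for row in rows]
print(results)  # [-4, <NA>, <NA>, <NA>]
```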

Functions may conditionally return NA as in pandas:

>>> def f(row):
...     if row['a'] + row['b'] > 3:
...             return cudf.NA
...     else:
...             return row['a'] + row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [2, 1, 1]
... })
>>> df.apply(f, axis=1)
0       3
1       3
2    <NA>
dtype: int64

Mixed types are allowed, but will return the common type, rather than object as in pandas:

>>> def f(row):
...     return row['a'] + row['b']
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [0.5, cudf.NA, 3.14]
... })
>>> df.apply(f, axis=1)
0     1.5
1    <NA>
2    6.14
dtype: float64

Functions may also return scalar values, however the result will be promoted to a safe type regardless of the data:

>>> def f(row):
...     if row['a'] > 3:
...             return row['a']
...     else:
...             return 1.5
...
>>> df = cudf.DataFrame({
...     'a': [1, 3, 5]
... })
>>> df.apply(f, axis=1)
0    1.5
1    1.5
2    5.0
dtype: float64

Ops against N columns are supported generally:

>>> def f(row):
...     v, w, x, y, z = (
...         row['a'], row['b'], row['c'], row['d'], row['e']
...     )
...     return x + (y - (z / w)) % v
...
>>> df = cudf.DataFrame({
...     'a': [1, 2, 3],
...     'b': [4, 5, 6],
...     'c': [cudf.NA, 4, 4],
...     'd': [8, 7, 8],
...     'e': [7, 1, 6]
... })
>>> df.apply(f, axis=1)
0    <NA>
1     4.8
2     5.0
dtype: float64

UDFs manipulating string data are allowed, as long as they neither modify strings in place nor create new strings. For example, the following UDF is allowed:

>>> def f(row):
...     st = row['str_col']
...     scale = row['scale']
...     if len(st) == 0:
...             return -1
...     elif st.startswith('a'):
...             return 1 - scale
...     elif 'example' in st:
...             return 1 + scale
...     else:
...             return 42
...
>>> df = cudf.DataFrame({
...     'str_col': ['', 'abc', 'some_example'],
...     'scale': [1, 2, 3]
... })
>>> df.apply(f, axis=1)
0   -1
1   -1
2    4
dtype: int64

However, the following UDF is not allowed since it includes an operation that requires the creation of a new string: a call to the upper method. Methods that are not supported in this manner will raise an AttributeError.

>>> def f(row):
...     st = row['str_col'].upper()
...     return 'ABC' in st
>>> df.apply(f, axis=1)

For a complete list of supported functions and methods that may be used to manipulate string data, see the UDF guide, <https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html>

applymap(func: Callable[[Any], Any], na_action: str | None = None, **kwargs) → DataFrame#

Apply a function to a DataFrame elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters#

funccallable: Python function, returns a single value from a single value.
na_action{None, ‘ignore’}, default None: If ‘ignore’, propagate NaN values, without passing them to func.

Returns#

DataFrame: Transformed DataFrame.

map(func: Callable[[Any], Any], na_action: str | None = None, **kwargs) → DataFrame#

Apply a function to a DataFrame elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters#

funccallable: Python function, returns a single value from a single value.
na_action{None, ‘ignore’}, default None: If ‘ignore’, propagate NaN values, without passing them to func.

Returns#

DataFrame: Transformed DataFrame.
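
A minimal host-side sketch of the elementwise semantics and of na_action='ignore' is shown below. It operates on a plain dict of lists with None as the missing value; map_elementwise is a hypothetical helper used only for illustration and does not reflect how the library executes func.

```python
def map_elementwise(columns, func, na_action=None):
    """Apply func to every element of a dict-of-lists table (sketch)."""
    if na_action not in (None, "ignore"):
        raise ValueError("na_action must be None or 'ignore'")
    out = {}
    for name, values in columns.items():
        out[name] = [
            # With na_action='ignore', missing values pass through untouched.
            value if (na_action == "ignore" and value is None) else func(value)
            for value in values
        ]
    return out


table = {"x": [1, None, 3], "y": [10, 20, None]}
doubled = map_elementwise(table, lambda v: v * 2, na_action="ignore")
print(doubled)  # {'x': [2, None, 6], 'y': [20, 40, None]}
```

Without na_action='ignore', the lambda in this sketch would receive None and raise a TypeError, which is why the option exists.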

apply_rows(func, incols, outcols, kwargs, pessimistic_nulls=True, cache_key=None)#

Apply a row-wise user defined function.

Parameters#

dfDataFrame: The source dataframe.
funcfunction: The transformation function that will be executed on the CUDA GPU.
incols: list or dict: A list of names of input columns that match the function arguments. Or, a dictionary mapping input column names to their corresponding function arguments such as {‘col1’: ‘arg1’}.
outcols: dict: A dictionary of output column names and their dtype.
kwargs: dict: name-value of extra arguments. These values are passed directly into the function.
pessimistic_nullsbool: Whether or not apply_rows output should be null when any corresponding input is null. If False, all outputs will be non-null, but will be the result of applying func against the underlying column data, which may be garbage.

Examples#

The user function should loop over the columns and set the output for each row. Loop execution order is arbitrary, so each iteration of the loop MUST be independent of the others.

When func is invoked, the array args corresponding to the input/output are strided so as to improve GPU parallelism. The loop in the function resembles serial code, but executes concurrently in multiple threads.

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame()
>>> nelem = 3
>>> df['in1'] = np.arange(nelem)
>>> df['in2'] = np.arange(nelem)
>>> df['in3'] = np.arange(nelem)

Define input columns for the kernel

>>> in1 = df['in1']
>>> in2 = df['in2']
>>> in3 = df['in3']
>>> def kernel(in1, in2, in3, out1, out2, kwarg1, kwarg2):
...     for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
...         out1[i] = kwarg2 * x - kwarg1 * y
...         out2[i] = y - kwarg1 * z

Call .apply_rows with the name of the input columns, the name and dtype of the output columns, and, optionally, a dict of extra arguments.

>>> df.apply_rows(kernel,
...               incols=['in1', 'in2', 'in3'],
...               outcols=dict(out1=np.float64, out2=np.float64),
...               kwargs=dict(kwarg1=3, kwarg2=4))
   in1  in2  in3 out1 out2
0    0    0    0  0.0  0.0
1    1    1    1  1.0 -2.0
2    2    2    2  2.0 -4.0
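
Because the kernel body above is ordinary Python, its arithmetic can be checked on the host with plain lists before running it on the GPU. The sketch below runs the same loop serially; the pre-allocated out1 and out2 lists stand in for the output columns described by outcols.

```python
def kernel(in1, in2, in3, out1, out2, kwarg1, kwarg2):
    # Same body as the kernel above; here it runs serially on the CPU.
    for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
        out1[i] = kwarg2 * x - kwarg1 * y
        out2[i] = y - kwarg1 * z


nelem = 3
in1 = list(range(nelem))
in2 = list(range(nelem))
in3 = list(range(nelem))
# Stand-ins for the output columns named in outcols.
out1 = [0.0] * nelem
out2 = [0.0] * nelem

kernel(in1, in2, in3, out1, out2, kwarg1=3, kwarg2=4)
print(out1)  # [0, 1, 2]
print(out2)  # [0, -2, -4]
```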

apply_chunks(func, incols, outcols, kwargs=None, pessimistic_nulls=True, chunks=None, blkct=None, tpb=None)#

Transform user-specified chunks using the user-provided function.

Parameters#

dfDataFrame: The source dataframe.
funcfunction: The transformation function that will be executed on the CUDA GPU.
incols: list or dict: A list of names of input columns that match the function arguments. Or, a dictionary mapping input column names to their corresponding function arguments such as {‘col1’: ‘arg1’}.
outcols: dict: A dictionary of output column names and their dtype.
kwargs: dict: name-value of extra arguments. These values are passed directly into the function.
pessimistic_nullsbool: Whether or not apply_chunks output should be null when any corresponding input is null. If False, all outputs will be non-null, but will be the result of applying func against the underlying column data, which may be garbage.
chunksint or Series-like: If it is an int, it is the chunksize. If it is an array, it contains the integer offset for the start of each chunk. The span of the i-th chunk is data[chunks[i] : chunks[i + 1]] for any i + 1 < chunks.size, or data[chunks[i]:] for i == len(chunks) - 1.
tpbint; optional: The threads-per-block for the underlying kernel. If not specified (Default), uses Numba .forall(...) built-in to query the CUDA Driver API to determine optimal kernel launch configuration. Specify 1 to emulate serial execution for each chunk; this is a good starting point but inefficient. The maximum possible value is limited by the available CUDA GPU resources.
blkctint; optional: The number of blocks for the underlying kernel. If not specified (Default) and tpb is not specified (Default), uses Numba .forall(...) built-in to query the CUDA Driver API to determine optimal kernel launch configuration. If not specified (Default) and tpb is specified, uses chunks as the number of blocks.
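
The two forms of the chunks argument described above can be made concrete with a small host-side sketch that turns either form into (start, stop) spans. The helper chunk_spans is hypothetical and exists only to illustrate the slicing rule.

```python
def chunk_spans(nrows, chunks):
    """Return (start, stop) pairs for each chunk (hypothetical helper)."""
    if isinstance(chunks, int):
        # An int is the chunk size: starts are 0, chunks, 2 * chunks, ...
        starts = list(range(0, nrows, chunks))
    else:
        # Otherwise it holds the starting offset of each chunk.
        starts = list(chunks)
    spans = []
    for i, start in enumerate(starts):
        # The last chunk runs to the end of the data.
        stop = starts[i + 1] if i + 1 < len(starts) else nrows
        spans.append((start, stop))
    return spans


print(chunk_spans(10, 4))          # [(0, 4), (4, 8), (8, 10)]
print(chunk_spans(10, [0, 3, 7]))  # [(0, 3), (3, 7), (7, 10)]
```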

Examples#

For tpb > 1, func is executed by tpb number of threads concurrently. To access the thread id and count, use numba.cuda.threadIdx.x and numba.cuda.blockDim.x, respectively (See numba CUDA kernel documentation).

In the example below, the kernel is invoked concurrently on each specified chunk. The kernel computes the corresponding output for the chunk.

By looping over the range range(cuda.threadIdx.x, in1.size, cuda.blockDim.x), the kernel function can be used with any tpb in an efficient manner.

>>> from numba import cuda
>>> @cuda.jit
... def kernel(in1, in2, in3, out1):
...      for i in range(cuda.threadIdx.x, in1.size, cuda.blockDim.x):
...          x = in1[i]
...          y = in2[i]
...          z = in3[i]
...          out1[i] = x * y + z
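
To see why the strided range works for any tpb, the loop can be simulated on the host: each simulated thread id starts at its own offset and steps by the block size, so every element is visited exactly once. This is a serial sketch with a hypothetical helper, not GPU code.

```python
def covered_indices(size, block_dim):
    """Count how often each index is visited across all thread ids."""
    visits = [0] * size
    for thread_idx in range(block_dim):  # stands in for cuda.threadIdx.x
        for i in range(thread_idx, size, block_dim):
            visits[i] += 1
    return visits


for tpb in (1, 4, 32):
    # Every element is handled by exactly one thread, whatever tpb is.
    assert covered_indices(10, tpb) == [1] * 10

visits_tpb4 = covered_indices(10, 4)
print(visits_tpb4)
```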

Parameters#

columnssequence of str: The names of the columns to be hashed. Must have at least one name.
npartsint: Number of output partitions
keep_indexboolean: Whether to keep the index or drop it

Returns#

partitioned: list of DataFrame
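
The parameters above describe splitting rows into nparts groups according to a hash of the chosen columns. A minimal host-side sketch of that idea follows; partition_by_key_hash is a hypothetical helper, and zlib.crc32 is only a stand-in, since the hash function actually used by the library may differ.

```python
import zlib


def partition_by_key_hash(rows, columns, nparts):
    """Split rows into nparts lists by hashing the values in columns."""
    if not columns:
        raise ValueError("columns must contain at least one name")
    partitions = [[] for _ in range(nparts)]
    for row in rows:
        key = repr(tuple(row[c] for c in columns)).encode()
        # zlib.crc32 is a stand-in; the library's hash function may differ.
        partitions[zlib.crc32(key) % nparts].append(row)
    return partitions


rows = [{"k": k, "v": v} for v, k in enumerate("abcabcab")]
parts = partition_by_key_hash(rows, columns=["k"], nparts=3)
for idx, part in enumerate(parts):
    print(idx, part)
```

Rows with equal key values always land in the same partition, which is the property that makes hash partitioning useful before joins or group-wise work.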

info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)#

Print a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

Parameters#

verbosebool, optional: Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.
bufwritable buffer, defaults to sys.stdout: Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
max_colsint, optional: When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.
memory_usagebool, str, optional: Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting. True always shows memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection, memory usage is estimated from the column dtype and the number of rows, assuming values consume the same amount of memory for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.
null_countsbool, optional: Whether to show the non-null counts. By default, this is shown only if the frame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.

Returns#

None: This method prints a summary of a DataFrame and returns None.

Examples#

>>> import cudf
>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = cudf.DataFrame({"int_col": int_values,
...                     "text_col": text_values,
...                     "float_col": float_values})
>>> df
   int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  epsilon       1.00

Prints information of all columns:

>>> df.info(verbose=True)
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 130.0+ bytes

Prints a summary of columns count and its dtypes but not per column information:

>>> df.info(verbose=False)
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 130.0+ bytes

Pipe output of DataFrame.info to a buffer instead of sys.stdout and print buffer contents:

>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> print(buffer.getvalue())
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 130.0+ bytes

The memory_usage parameter enables deep introspection mode, which is especially useful for large DataFrames and for fine-tuning memory optimization:

>>> import numpy as np
>>> rng = np.random.default_rng(seed=0)
>>> random_strings_array = rng.choice(['a', 'b', 'c'], 10 ** 6)
>>> df = cudf.DataFrame({
...     'column_1': rng.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_2': rng.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_3': rng.choice(['a', 'b', 'c'], 10 ** 6)
... })
>>> df.info(memory_usage='deep')
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 14.3 MB
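
The "human-readable units (base-2 representation)" mentioned for memory_usage can be sketched as repeated division by 1024. The helper format_bytes below is illustrative only, and the 15,000,000-byte input is a hypothetical total chosen for the example, not a measured value.

```python
def format_bytes(num_bytes):
    """Render a byte count with base-2 units (illustrative helper)."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if size < 1024.0:
            return f"{size:.1f} {unit}"
        size /= 1024.0
    return f"{size:.1f} PB"


print(format_bytes(130))         # 130.0 bytes
print(format_bytes(15_000_000))  # 14.3 MB
```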

describe(percentiles=None, include=None, exclude=None)#

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters#

percentileslist-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include‘all’, list-like of dtypes or None(default), optional

A list of data types to include in the result. Ignored for Series. Here are the options:

‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
None (default) : The result will include all numeric columns.

excludelist-like of dtypes or None (default), optional

A list of data types to omit from the result. Ignored for Series. Here are the options:

A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'
None (default) : The result will exclude nothing.

Returns#

output_frameSeries or DataFrame: Summary statistics of the Series or Dataframe provided.

Notes#

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For string dtype or datetime dtype, the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples#

Describing a Series containing numeric values.

>>> import cudf
>>> import numpy as np
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> s
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
>>> s.describe()
count    10.00000
mean      5.50000
std       3.02765
min       1.00000
25%       3.25000
50%       5.50000
75%       7.75000
max      10.00000
dtype: float64
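
The numbers in the output above can be reproduced on the host with the standard library, which makes the definitions explicit: std is the sample standard deviation, and the percentiles interpolate linearly between neighbouring points. This is a cross-check sketch only and says nothing about how the library computes the values internally.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

summary = {
    "count": float(len(values)),
    "mean": statistics.mean(values),
    # Sample standard deviation (divides by n - 1).
    "std": statistics.stdev(values),
    "min": float(min(values)),
    "max": float(max(values)),
}
# method="inclusive" interpolates linearly between the two nearest points.
q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
summary.update({"25%": q25, "50%": q50, "75%": q75})

for name, value in summary.items():
    print(f"{name:<6}{value:.5f}")
```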

Describing a categorical Series.

>>> s = cudf.Series(['a', 'b', 'a', 'b', 'c', 'a'], dtype='category')
>>> s
0    a
1    b
2    a
3    b
4    c
5    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.describe()
count     6
unique    3
top       a
freq      3
dtype: object

Describing a timestamp Series.

>>> s = cudf.Series([
...   "2000-01-01",
...   "2010-01-01",
...   "2010-01-01"
... ], dtype="datetime64[s]")
>>> s
0   2000-01-01
1   2010-01-01
2   2010-01-01
dtype: datetime64[s]
>>> s.describe()
count                     3
mean    2006-09-01 08:00:00
min     2000-01-01 00:00:00
25%     2004-12-31 12:00:00
50%     2010-01-01 00:00:00
75%     2010-01-01 00:00:00
max     2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = cudf.DataFrame({"categorical": cudf.Series(['d', 'e', 'f'],
...                         dtype='category'),
...                      "numeric": [1, 2, 3],
...                      "object": ['a', 'b', 'c']
... })
>>> df
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')
       categorical numeric object
count            3     3.0      3
unique           3    <NA>      3
top              d    <NA>      a
freq             1    <NA>      1
mean          <NA>     2.0   <NA>
std           <NA>     1.0   <NA>
min           <NA>     1.0   <NA>
25%           <NA>     1.5   <NA>
50%           <NA>     2.0   <NA>
75%           <NA>     2.5   <NA>
max           <NA>     3.0   <NA>

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              d      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])
       categorical numeric
count            3     3.0
unique           3    <NA>
top              d    <NA>
freq             1    <NA>
mean          <NA>     2.0
std           <NA>     1.0
min           <NA>     1.0
25%           <NA>     1.5
50%           <NA>     2.0
75%           <NA>     2.5
max           <NA>     3.0

to_pandas(*, nullable: bool = False, arrow_type: bool = False) → DataFrame#

Convert to a Pandas DataFrame.

Parameters#

nullableBoolean, Default False: If nullable is True, the resulting columns in the dataframe will have the corresponding nullable pandas dtype. If there is no corresponding nullable pandas dtype, the resulting dtype will be a regular pandas dtype. If nullable is False, null values in the resulting columns are converted to either np.nan or None, depending on the dtype.
arrow_typebool, Default False: Return the columns with a pandas.ArrowDtype

Returns#

out : Pandas DataFrame

Notes#

nullable and arrow_type cannot both be set to True.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [0, 1, 2], 'b': [-3, 2, 0]})
>>> pdf = df.to_pandas()
>>> pdf
   a  b
0  0 -3
1  1  2
2  2  0
>>> type(pdf)
<class 'pandas.core.frame.DataFrame'>

nullable=True converts the result to pandas nullable types:

>>> df = cudf.DataFrame({'a': [0, None, 2], 'b': [True, False, None]})
>>> df
      a      b
0     0   True
1  <NA>  False
2     2   <NA>
>>> pdf = df.to_pandas(nullable=True)
>>> pdf
      a      b
0     0   True
1  <NA>  False
2     2   <NA>
>>> pdf.dtypes
a      Int64
b    boolean
dtype: object
>>> pdf = df.to_pandas(nullable=False)
>>> pdf
     a      b
0  0.0   True
1  NaN  False
2  2.0   None
>>> pdf.dtypes
a    float64
b     object
dtype: object

arrow_type=True converts the result to pandas.ArrowDtype:

>>> df.to_pandas(arrow_type=True).dtypes
a    int64[pyarrow]
b     bool[pyarrow]
dtype: object

classmethod from_pandas(dataframe, nan_as_null=<no_default>)#

Convert from a Pandas DataFrame.

Parameters#

dataframePandas DataFrame object: A Pandas DataFrame object which has to be converted to cuDF DataFrame.
nan_as_nullbool, Default True: If True, converts np.nan values to null values. If False, leaves np.nan values as is.

Raises#

TypeError for invalid input type.

Examples#

>>> import cudf
>>> import pandas as pd
>>> data = [[0,1], [1,2], [3,4]]
>>> pdf = pd.DataFrame(data, columns=['a', 'b'], dtype=int)
>>> cudf.from_pandas(pdf)
   a  b
0  0  1
1  1  2
2  3  4

classmethod from_arrow(table)#

Convert from PyArrow Table to DataFrame.

Parameters#

tablePyArrow Table Object: PyArrow Table Object which has to be converted to cudf DataFrame.

Raises#

TypeError for invalid input type.

Returns#

cudf DataFrame

Examples#

>>> import cudf
>>> import pyarrow as pa
>>> data = pa.table({"a":[1, 2, 3], "b":[4, 5, 6]})
>>> cudf.DataFrame.from_arrow(data)
   a  b
0  1  4
1  2  5
2  3  6

to_arrow(preserve_index=None) → Table#

Convert to a PyArrow Table.

Parameters#

preserve_indexbool, optional: Whether the index column and its metadata should be saved. The default of None will store the index as a column, except for a RangeIndex, which is stored as metadata only. Setting preserve_index to True will force a RangeIndex to be materialized.

Returns#

PyArrow Table

Examples#

>>> import cudf
>>> df = cudf.DataFrame(
...     {"a":[1, 2, 3], "b":[4, 5, 6]}, index=[1, 2, 3])
>>> df.to_arrow()
pyarrow.Table
a: int64
b: int64
index: int64
----
a: [[1,2,3]]
b: [[4,5,6]]
index: [[1,2,3]]
>>> df.to_arrow(preserve_index=False)
pyarrow.Table
a: int64
b: int64
----
a: [[1,2,3]]
b: [[4,5,6]]

to_records(index=True, column_dtypes=None, index_dtypes=None)#

Convert to a numpy recarray.

Parameters#

indexbool: Whether to include the index in the output.
column_dtypesstr, type, dict, default None: If a string or type, the data type to store all columns. If a dictionary, a mapping of column names and indices (zero-indexed) to specific data types. Currently not supported.
index_dtypesstr, type, dict, default None: If a string or type, the data type to store all index levels. If a dictionary, a mapping of index level names and indices (zero-indexed) to specific data types. This mapping is applied only if index=True. Currently not supported.

Returns#

numpy recarray

classmethod from_records(data, index=None, exclude=None, columns=None, coerce_float: bool = False, nrows: int | None = None, nan_as_null=False)#

Convert structured or record ndarray to DataFrame.

Parameters#

data : numpy structured dtype or recarray of ndim=2
indexstr, array-like: The name of the index column in data. If None, the default index is used.

excludesequence, default None: Columns or fields to exclude. Currently not implemented.
columnslist of str: List of column names to include.
coerce_floatbool, default False: Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets. Currently not implemented.
nrowsint, default None: Number of rows to read if data is an iterator. Currently not implemented.

Returns#

DataFrame

interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)#

Interpolate data values between some points.

Parameters#

methodstr, default ‘linear’: Interpolation technique to use. Currently, only ‘linear’ is supported.

‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘index’, ‘values’: linearly interpolate using the index as an x-axis. Unsorted indices can lead to erroneous results.
axisint, default 0: Axis to interpolate along. Currently, only ‘axis=0’ is supported.
inplacebool, default False: Update the data in place if possible.

Returns#

Series or DataFrame: Returns the same object type as the caller, interpolated at some or all NaN values
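
The ‘linear’ method described above, which ignores the index and treats values as equally spaced, can be sketched in plain Python as follows. The helper interpolate_linear is hypothetical; it uses None for missing values and fills only interior gaps, leaving the handling of leading and trailing missing values out of scope.

```python
def interpolate_linear(values):
    """Fill interior None gaps, treating positions as equally spaced."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    # Walk consecutive pairs of known points and fill what lies between.
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


print(interpolate_linear([1.0, None, None, 4.0, None, 8.0]))
# [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
```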

quantile(q=0.5, axis=0, numeric_only=True, interpolation=None, method='single', columns=None, exact=True)#

Return values at the given quantile.

Parameters#

qfloat or array-like

0 <= q <= 1, the quantile(s) to compute

axisint

axis is a NON-FUNCTIONAL parameter

numeric_onlybool, default True

If False, the quantile of datetime and timedelta data will be computed as well.

interpolation{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}

This parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j. Default is 'linear' for method="single", and 'nearest' for method="table".

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

lower: i.

higher: j.

nearest: i or j whichever is nearest.

midpoint: (i + j) / 2.

method{‘single’, ‘table’}, default ‘single’

Whether to compute quantiles per-column (‘single’) or over all columns (‘table’). When ‘table’, the only allowed interpolation methods are ‘nearest’, ‘lower’, and ‘higher’.

columnslist of str

List of column names to include.

exactboolean

Whether to use approximate or exact quantile algorithm.

Returns#

Series or DataFrame

If q is an array or numeric_only is set to False, a DataFrame will be returned where index is q, the columns are the columns of self, and the values are the quantile.

If q is a float, a Series will be returned where the index is the columns of self and the values are the quantiles.

Examples#

>>> import cupy as cp
>>> import cudf
>>> df = cudf.DataFrame(cp.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                   columns=['a', 'b'])
>>> df
   a    b
0  1    1
1  2   10
2  3  100
3  4  100
>>> df.quantile(0.1)
a    1.3
b    3.7
Name: 0.1, dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
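
The interpolation rules listed under Parameters can be checked by hand. The following plain-Python sketch is not the library's implementation; the function name quantile_sketch is hypothetical, and ‘nearest’ is left out because its tie-breaking rule is not stated above. It reproduces the ‘linear’ values of the example for columns a and b.

```python
import math


def quantile_sketch(values, q, interpolation="linear"):
    """Quantile of a non-empty list using one of four interpolation rules."""
    data = sorted(values)
    pos = q * (len(data) - 1)            # fractional position in the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    i, j = data[lo], data[hi]            # the two surrounding data points
    fraction = pos - lo
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    raise ValueError(f"unsupported interpolation: {interpolation}")


a = [1, 2, 3, 4]
b = [1, 10, 100, 100]
print(quantile_sketch(a, 0.1), quantile_sketch(b, 0.1))  # compare with df.quantile(0.1)
print(quantile_sketch(a, 0.5), quantile_sketch(b, 0.5))  # compare with the 0.5 row
```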

isin(values)#

Whether each element in the DataFrame is contained in values.

Parameters#

valuesiterable, Series, DataFrame or dict: The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.

Returns#

DataFrame: DataFrame of booleans showing whether each element in the DataFrame is contained in values.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                     index=['falcon', 'dog'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0

When values is a list, check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings):

>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True

When values is a dict, we can pass values to check for each column separately:

>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True

When values is a Series or DataFrame, the index and column labels must match. Here ‘falcon’ matches on both columns, while ‘dog’ has no matching row label in other, so its row is False.

>>> other = cudf.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]},
...                         index=['spider', 'falcon'])
>>> df.isin(other)
        num_legs  num_wings
falcon      True       True
dog        False      False

count(axis=0, numeric_only=False)#

Count non-NA cells for each column or row.

The values None, NaN, NaT are considered NA.

Returns#

Series: For each column/row the number of non-NA/null entries.

Examples#

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame({"Person":
...        ["John", "Myla", "Lewis", "John", "Myla"],
...        "Age": [24., np.nan, 21., 33, 26],
...        "Single": [False, True, True, True, False]})
>>> df.count()
Person    5
Age       4
Single    5
dtype: int64
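
As a rough illustration of the counting rule, the plain-Python sketch below counts non-missing entries per column for the same data. It is not the library's implementation; the helper is_na is hypothetical and only handles None and float NaN.

```python
import math


def is_na(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


columns = {
    "Person": ["John", "Myla", "Lewis", "John", "Myla"],
    "Age": [24.0, float("nan"), 21.0, 33.0, 26.0],
    "Single": [False, True, True, True, False],
}
# One count per column: the number of entries that are not missing.
counts = {name: sum(not is_na(v) for v in col) for name, col in columns.items()}
print(counts)
```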

mode(axis=0, numeric_only=False, dropna=True)#

Get the mode(s) of each element along the selected axis.

The mode of a set of values is the value that appears most often. It can be multiple values.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to iterate over while searching for the mode:

0 or ‘index’ : get mode of each column
1 or ‘columns’ : get mode of each row.

numeric_onlybool, default False

If True, only apply to numeric columns.

dropnabool, default True

Don’t consider counts of NA/NaN/NaT.

Returns#

DataFrame: The modes of each column or row.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({
...     "species": ["bird", "mammal", "arthropod", "bird"],
...     "legs": [2, 4, 8, 2],
...     "wings": [2.0, None, 0.0, None]
... })
>>> df
     species  legs wings
0       bird     2   2.0
1     mammal     4  <NA>
2  arthropod     8   0.0
3       bird     2  <NA>

By default, missing values are not considered, and the modes of wings are both 0 and 2. The second row of species and legs contains NA because each of those columns has only one mode, while the result has two rows.

>>> df.mode()
  species  legs  wings
0    bird     2    0.0
1    <NA>  <NA>    2.0

Setting dropna=False, NA values are considered and they can be the mode (like for wings).

>>> df.mode(dropna=False)
  species  legs wings
0    bird     2  <NA>

Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.

>>> df.mode(numeric_only=True)
   legs  wings
0     2    0.0
1  <NA>    2.0
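
The effect of dropna on a single column can be sketched in plain Python. This is an illustration only, not the library's implementation; the helper name modes is hypothetical and None stands in for a missing value.

```python
from collections import Counter


def modes(values, dropna=True):
    """Return every most-frequent value of one column."""
    if dropna:
        values = [v for v in values if v is not None]
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    winners = [v for v, c in counts.items() if c == top]
    # Sort the non-missing winners; a winning None is kept at the end.
    present = sorted(v for v in winners if v is not None)
    return present + [v for v in winners if v is None]


wings = [2.0, None, 0.0, None]
print(modes(wings))                # both 0.0 and 2.0 tie once missing values are dropped
print(modes(wings, dropna=False))  # the missing value is the most frequent entry
```

A column with a single mode yields a shorter list than one with two modes, which is why the DataFrame result pads the shorter columns with NA.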

all(axis=0, bool_only=None, skipna=True, **kwargs)#

Return whether all elements are True in DataFrame.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

0 or ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 or ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.

skipna: bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

Returns#

Series

Notes#

Parameters currently not supported are bool_only.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 0, 10, 10]})
>>> df.all()
a     True
b    False
dtype: bool

any(axis=0, bool_only=None, skipna=True, **kwargs)#

Return whether any element is True in the DataFrame.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

0 or ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 or ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.

skipna: bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

Returns#

Series

Notes#

Parameters currently not supported are bool_only.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 0, 10, 10]})
>>> df.any()
a    True
b    True
dtype: bool
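
The skipna rules for all and any are easiest to see on a single column. The sketch below is plain Python rather than the library's implementation; the helper names are hypothetical and None stands in for NA. It follows the description above: with skipna=True missing values are dropped first, and with skipna=False they count as True.

```python
def _prepare(values, skipna):
    if skipna:
        return [v for v in values if v is not None]
    # Per the parameter description: with skipna=False an NA is treated as True.
    return [True if v is None else v for v in values]


def reduce_all(values, skipna=True):
    """True if every remaining element is truthy (True for an empty column)."""
    return all(bool(v) for v in _prepare(values, skipna))


def reduce_any(values, skipna=True):
    """True if at least one remaining element is truthy (False for an empty column)."""
    return any(bool(v) for v in _prepare(values, skipna))


print(reduce_all([None, None]), reduce_any([None, None]))        # all-NA column
print(reduce_all([7, 0, 10, 10]), reduce_any([7, 0, 10, 10]))    # column b of the example
```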

select_dtypes(include=None, exclude=None)#

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters#

includestr or list: which columns to include based on dtypes
excludestr or list: which columns to exclude based on dtypes

Returns#

DataFrame: The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Raises#

ValueError

If both of include and exclude are empty
If include and exclude have overlapping elements

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0
4  1   True  1.0
5  2  False  2.0
>>> df.select_dtypes(include='bool')
       b
0   True
1  False
2   True
3  False
4   True
5  False
>>> df.select_dtypes(include=['float64'])
     c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
>>> df.select_dtypes(exclude=['int'])
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0

to_parquet(path, engine='cudf', compression='snappy', index=None, partition_cols=None, partition_file_name=None, partition_offsets=None, statistics='ROWGROUP', metadata_file_path=None, int96_timestamps=False, row_group_size_bytes=None, row_group_size_rows=None, max_page_size_bytes=None, max_page_size_rows=None, storage_options=None, return_metadata=False, use_dictionary=True, header_version='1.0', skip_compression=None, column_encoding=None, column_type_length=None, output_as_binary=None, *args, **kwargs)#

Write a DataFrame to the parquet format.

Parameters#

pathstr or list of str: File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset. Use list of str with partition_offsets to write parts of the dataframe to different files.
compression{{‘snappy’, ‘ZSTD’, ‘LZ4’, None}}, default ‘snappy’: Name of the compression to use; case insensitive. Use None for no compression.
indexbool, default None: If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe’s index(es) will be saved, however, instead of being saved as values any RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.
partition_colslist, optional, default None: Column names by which to partition the dataset. Columns are partitioned in the order they are given.
partition_file_namestr, optional, default None: File name to use for partitioned datasets. Different partitions will be written to different directories, but all files will have this name. If nothing is specified, a random uuid4 hex string will be used for each file. This parameter is only supported by ‘cudf’ engine, and will be ignored by other engines.
partition_offsetslist, optional, default None: Offsets to partition the dataframe by. Should be used when path is list of str. Should be a list of integers of size len(path) + 1
statistics{{‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}}, default ‘ROWGROUP’: Level at which column statistics should be included in file.
metadata_file_pathstr, optional, default None: If specified, this function will return a binary blob containing the footer metadata of the written parquet file. The returned blob will have the chunk.file_path field set to the metadata_file_path for each chunk. When using with partition_offsets, should be same size as len(path)
int96_timestampsbool, default False: If True, write timestamps in int96 format. This will convert timestamps from timestamp[ns], timestamp[ms], timestamp[s], and timestamp[us] to the int96 format, which is the number of Julian days and the number of nanoseconds since midnight of 1970-01-01. If False, timestamps will not be altered.
row_group_size_bytes: integer, default None: Maximum size of each row group of the output. If None, no limit on row group size will be used.
row_group_size_rows: integer or None, default None: Maximum number of rows of each row group of the output. If None, 1000000 will be used.
max_page_size_bytes: integer or None, default None: Maximum uncompressed size of each page of the output. If None, 524288 (512KB) will be used.
max_page_size_rows: integer or None, default None: Maximum number of rows of each page of the output. If None, 20000 will be used.
max_dictionary_size: integer or None, default None: Maximum size of the dictionary page for each output column chunk. Dictionary encoding for column chunks that exceeds this limit will be disabled. If None, 1048576 (1MB) will be used.
storage_optionsdict, optional, default None: Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
return_metadatabool, default False: Return parquet metadata for written data. Returned metadata will include the file path metadata (relative to root_path). To request a metadata binary blob when using partition_cols, pass return_metadata=True instead of specifying metadata_file_path.
use_dictionarybool, default True: When False, prevents the use of dictionary encoding for Parquet page data. When True, dictionary encoding is preferred subject to max_dictionary_size constraints.
header_version{{‘1.0’, ‘2.0’}}, default “1.0”: Controls whether to use version 1.0 or version 2.0 page headers when encoding. Version 1.0 is more portable, but version 2.0 enables the use of newer encoding schemes.
force_nullable_schemabool, default False: If True, writes all columns as nullable in the schema. If False, columns are written as nullable if they contain null values, otherwise as non-nullable.
skip_compressionset, optional, default None: If a column name is present in the set, that column will not be compressed, regardless of the compression setting.
column_encodingdict, optional, default None: Sets the page encoding to use on a per-column basis. The key is a column name, and the value is one of: ‘PLAIN’, ‘DICTIONARY’, ‘DELTA_BINARY_PACKED’, ‘DELTA_LENGTH_BYTE_ARRAY’, ‘DELTA_BYTE_ARRAY’, ‘BYTE_STREAM_SPLIT’, or ‘USE_DEFAULT’.
column_type_lengthdict, optional, default None: Specifies the width in bytes of FIXED_LEN_BYTE_ARRAY column elements. The key is a column name and the value is an integer. The named column will be output as unannotated binary (i.e. the column will behave as if output_as_binary was set).
output_as_binaryset, optional, default None: If a column name is present in the set, that column will be output as unannotated binary, rather than the default ‘UTF-8’.
store_schemabool, default False: If True, writes arrow schema to Parquet file footer’s key-value metadata section to faithfully round-trip duration types with arrow. This cannot be used with int96_timestamps enabled as int96 timestamps are deprecated in arrow. Also, all decimal32 and decimal64 columns will be converted to decimal128 as arrow only supports decimal128 and decimal256 types.
**kwargs: Additional parameters will be passed to execution engines other than cudf.
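
The relationship between a list of paths and partition_offsets can be pictured with a plain-Python sketch. It does not call the library. The helper split_by_offsets is hypothetical, and it assumes that consecutive offsets delimit the half-open row range written to each path, which is consistent with the required length of len(path) + 1 but is not spelled out above.

```python
def split_by_offsets(rows, paths, partition_offsets):
    """Map each path to the slice of rows delimited by consecutive offsets."""
    if len(partition_offsets) != len(paths) + 1:
        raise ValueError("partition_offsets must have len(paths) + 1 entries")
    # Assumption: path k receives rows[offsets[k]:offsets[k + 1]].
    return {
        path: rows[start:stop]
        for path, start, stop in zip(paths, partition_offsets, partition_offsets[1:])
    }


rows = list(range(10))
parts = split_by_offsets(rows, ["part0.parquet", "part1.parquet"], [0, 4, 10])
print({path: len(chunk) for path, chunk in parts.items()})
```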

Parameters#

pathstr: File path

Parameters#

path_or_bufstr or file handle, default None: File path or object, if None is provided the result is returned as a string.
sepchar, default ‘,’: Delimiter to be used.
na_repstr, default ‘’: String to use for null entries
columnslist of str, optional: Columns to write
headerbool, default True: Write out the column names
indexbool, default True: Write out the index as a column
encodingstr, default ‘utf-8’: A string representing the encoding to use in the output file. Only ‘utf-8’ is currently supported.
compressionstr, None: A string representing the compression scheme to use in the output file. Compression while writing csv is not currently supported.
lineterminatorstr, optional: The newline character or character sequence to use in the output file. Defaults to os.linesep.
chunksizeint or None, default None: Rows to write at a time
storage_optionsdict, optional, default None: Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.

Returns#

None or str: If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.

Notes#

Follows the standard of Pandas csv.QUOTE_NONNUMERIC for all output.
The default behaviour is to write all rows of the dataframe at once. This can lead to memory or overflow errors for large tables. If this happens, consider setting the chunksize argument to some reasonable fraction of the total rows in the dataframe.
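
The idea behind chunksize can be sketched with the standard library alone: rows are written a slice at a time, so only one slice has to be formatted at once, and the resulting text is the same. This is an illustration under that assumption, not the library's writer; the helper write_csv_in_chunks is hypothetical and uses Python's default csv quoting.

```python
import csv
import io


def write_csv_in_chunks(header, rows, chunksize):
    """Write the header, then the rows in slices of at most chunksize rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for start in range(0, len(rows), chunksize):
        writer.writerows(rows[start:start + chunksize])
    return buf.getvalue()


rows = [(0, 1.0, "a"), (1, 3.3, "b"), (2, 2.2, "c"), (3, 4.4, "d")]
chunked = write_csv_in_chunks(["x", "y", "z"], rows, chunksize=2)
at_once = write_csv_in_chunks(["x", "y", "z"], rows, chunksize=len(rows))
print(chunked)
```

A smaller chunksize trades a few more write calls for a lower peak memory footprint.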

Examples#

Write a dataframe to csv.

>>> import cudf
>>> filename = 'foo.csv'
>>> df = cudf.DataFrame({'x': [0, 1, 2, 3],
...                      'y': [1.0, 3.3, 2.2, 4.4],
...                      'z': ['a', 'b', 'c', 'd']})
>>> df = df.set_index(cudf.Series([3, 2, 1, 0]))
>>> df.to_csv(filename)

Parameters#

fnamestr: File path or object where the ORC dataset will be stored.
compression{{ ‘snappy’, ‘ZSTD’, ‘ZLIB’, ‘LZ4’, None }}, default ‘snappy’: Name of the compression to use; case insensitive. Use None for no compression.
statistics: str {{ “ROWGROUP”, “STRIPE”, None }}, default “ROWGROUP”: The granularity with which column statistics must be written to the file.
stripe_size_bytes: integer or None, default None: Maximum size of each stripe of the output. If None, 67108864 (64MB) will be used.
stripe_size_rows: integer or None, default None: Maximum number of rows of each stripe of the output. If None, 1000000 will be used.
row_index_stride: integer or None, default None: Row index stride (maximum number of rows in each row group). If None, 10000 will be used.
cols_as_map_typelist of column names or None, default None: A list of column names which should be written as map type in the ORC file. Note that this option only affects columns of ListDtype. Names of other column types will be ignored.
storage_optionsdict, optional, default None: Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
indexbool, default None: If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True the dataframe’s index(es) will be saved, however, instead of being saved as values any RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.

Parameters#

levelint, str, list default -1: Level(s) to stack from the column axis onto the index axis, defined as one index or label, or a list of indices or labels.
dropnabool, default True: Whether to drop rows in the resulting Frame/Series with missing values. When multiple levels are specified, dropna==False is unsupported.

Returns#

DataFrame or Series: Stacked dataframe or series.

Notes#

The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).

Examples#

Single level columns

>>> df_single_level_cols = cudf.DataFrame([[0, 1], [2, 3]],
...                                     index=['cat', 'dog'],
...                                     columns=['weight', 'height'])

Stacking a dataframe with a single level column axis returns a Series:

>>> df_single_level_cols
     weight height
cat       0      1
dog       2      3
>>> df_single_level_cols.stack()
cat  height    1
     weight    0
dog  height    3
     weight    2
dtype: int64

Multi level columns: simple case

>>> import pandas as pd
>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('weight', 'pounds')])
>>> df_multi_level_cols1 = cudf.DataFrame([[1, 2], [2, 4]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol1)

Stacking a dataframe with a multi-level column axis:

>>> df_multi_level_cols1
     weight
         kg    pounds
cat       1        2
dog       2        4
>>> df_multi_level_cols1.stack()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

Missing values

>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('height', 'm')])
>>> df_multi_level_cols2 = cudf.DataFrame([[1.0, 2.0], [3.0, 4.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked dataframe typically has more values than the original dataframe. Missing values are filled with NULLs:

>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack()
       weight height
cat kg    1.0   <NA>
    m    <NA>    2.0
dog kg    3.0   <NA>
    m    <NA>    4.0

Prescribing the level(s) to be stacked

The first parameter controls which level or levels are stacked:

>>> df_multi_level_cols2.stack(0)
            kg     m
cat height  <NA>   2.0
    weight   1.0  <NA>
dog height  <NA>   4.0
    weight   3.0  <NA>

>>> df_multi_level_cols2.stack([0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64

cov(min_periods=None, ddof: int = 1, numeric_only: bool = False)#

Compute the covariance matrix of a DataFrame.

Parameters#

min_periodsint, optional: Minimum number of observations required per pair of columns to have a valid result. Currently not supported.
ddofint, default 1: Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_onlybool, default False: Include only float, int or boolean data. Currently not supported.

Returns#

cov : DataFrame
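
To show how ddof enters the calculation, the sketch below computes one entry of a covariance matrix in plain Python. It is not the library's implementation and the function name covariance is hypothetical; it only illustrates the N - ddof divisor described above.

```python
def covariance(x, y, ddof=1):
    """Covariance of two equally long sequences with divisor N - ddof."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / (n - ddof)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(covariance(x, y))          # sample covariance, divisor N - 1
print(covariance(x, y, ddof=0))  # population covariance, divisor N
```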

corr(method='pearson', min_periods=None, numeric_only: bool = False)#

Compute the correlation matrix of a DataFrame.

Parameters#

method{‘pearson’, ‘spearman’}, default ‘pearson’

Method used to compute correlation:

pearson : Standard correlation coefficient
spearman : Spearman rank correlation

min_periodsint, optional

Minimum number of observations required per pair of columns to have a valid result.

Returns#

DataFrame: The requested correlation matrix.
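
The difference between the two methods can be sketched in plain Python: Spearman correlation is the Pearson coefficient computed on ranks. This is an illustration, not the library's implementation. The helper names are hypothetical, and giving tied values their average rank is an assumption of the sketch.

```python
import math


def pearson(x, y):
    """Standard correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # monotonic in x but not linear
print(pearson(x, y), spearman(x, y))
```

Because y increases whenever x does, the rank correlation is perfect even though the linear correlation is slightly below 1.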

to_struct(name=None)#

Return a struct Series composed of the columns of the DataFrame.

Parameters#

name: optional: Name of the resulting Series

Notes#

Note: a copy of the columns is made.

keys()#

Get the columns. This is index for Series, columns for DataFrame.

Returns#

Index: Columns of DataFrame.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'one' : [1, 2, 3], 'five' : ['a', 'b', 'c']})
>>> df
   one five
0    1    a
1    2    b
2    3    c
>>> df.keys()
Index(['one', 'five'], dtype='object')
>>> df = cudf.DataFrame(columns=[0, 1, 2, 3])
>>> df
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
>>> df.keys()
Index([0, 1, 2, 3], dtype='int64')

itertuples(index=True, name='Pandas')#

Iteration is unsupported.

See iteration for more information.

iterrows()#

Iteration is unsupported.

See iteration for more information.

pivot(*, columns, index=<no_default>, values=<no_default>)#

Return reshaped DataFrame organized by the given index and column values.

Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame.

Parameters#

columnsscalar or list of scalars, optional: Column label(s) used to construct the columns of the result.
indexscalar or list of scalars, optional: Column label(s) used to construct the index of the result.
valuescolumn name or list of column names, optional: Column(s) whose values are rearranged to produce the result. If not specified, all remaining columns of the DataFrame are used.

Returns#

DataFrame

Examples#

>>> a = cudf.DataFrame()
>>> a['a'] = [1, 1, 2, 2]
>>> a['b'] = ['a', 'b', 'a', 'b']
>>> a['c'] = [1, 2, 3, 4]
>>> a.pivot(index='a', columns='b')
   c
b  a  b
a
1  1  2
2  3  4

Pivot with missing values in result:

>>> a = cudf.DataFrame()
>>> a['a'] = [1, 1, 2]
>>> a['b'] = [1, 2, 3]
>>> a['c'] = ['one', 'two', 'three']
>>> a.pivot(index='a', columns='b')
          c
    b     1     2      3
    a
    1   one   two   <NA>
    2  <NA>  <NA>  three

pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=None, margins_name='All', observed=False, sort=True)#

Create a spreadsheet-style pivot table as a DataFrame.

Parameters#

data : DataFrame
values : column name or list of column names to aggregate, optional
index : list of column names

Values to group by in the rows.

columnslist of column names: Values to group by in the columns.
aggfuncstr or dict, default “mean”: If dict is passed, the key is column to aggregate and value is function name.
fill_valuescalar, default None: Value to replace missing values with (in the resulting pivot table, after aggregation).

margins : Not supported
dropna : Not supported
margins_name : Not supported
observed : Not supported
sort : Not supported

Returns#

DataFrame: An Excel style pivot table.

unstack(level=-1, fill_value=None, sort: bool = True)#

Pivot one or more levels of the (necessarily hierarchical) index labels.

Pivots the specified levels of the index labels of df to the innermost levels of the columns labels of the result.

If the index of df has multiple levels, returns a DataFrame with the specified level of the index pivoted to the column levels.
If the index of df has a single level, returns a Series with all column levels pivoted to the index levels.

Parameters#

df : DataFrame
level : level name or index, list-like

Integer, name or list of such, specifying one or more levels of the index to pivot

fill_value: Non-functional argument provided for compatibility with Pandas.
sortbool, default True: Sort the level(s) in the resulting MultiIndex columns.

Returns#

Series or DataFrame

Examples#

>>> df = cudf.DataFrame()
>>> df['a'] = [1, 1, 1, 2, 2]
>>> df['b'] = [1, 2, 3, 1, 2]
>>> df['c'] = [5, 6, 7, 8, 9]
>>> df['d'] = ['a', 'b', 'a', 'd', 'e']
>>> df = df.set_index(['a', 'b', 'd'])
>>> df
       c
a b d
1 1 a  5
  2 b  6
  3 a  7
2 1 d  8
  2 e  9

Unstacking level ‘a’:

>>> df.unstack('a')
        c
a       1     2
b d
1 a     5  <NA>
  d  <NA>     8
2 b     6  <NA>
  e  <NA>     9
3 a     7  <NA>

Unstacking level ‘d’ :

>>> df.unstack('d')
        c
d       a     b     d     e
a b
1 1     5  <NA>  <NA>  <NA>
  2  <NA>     6  <NA>  <NA>
  3     7  <NA>  <NA>  <NA>
2 1  <NA>  <NA>     8  <NA>
  2  <NA>  <NA>  <NA>     9

Unstacking multiple levels:

>>> df.unstack(['b', 'd'])
      c
b     1           2           3
d     a     d     b     e     a
a
1     5  <NA>     6  <NA>     7
2  <NA>     8  <NA>     9  <NA>

Unstacking single level index dataframe:

>>> df = cudf.DataFrame({('c', 1): [1, 2, 3], ('c', 2):[9, 8, 7]})
>>> df.unstack()
c  1  0    1
      1    2
      2    3
   2  0    9
      1    8
      2    7
dtype: int64

explode(column, ignore_index=False)#

Transform each element of a list-like to a row, replicating index values.

Parameters#

columnstr: Column to explode.
ignore_indexbool, default False: If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns#

DataFrame

Examples#

>>> import cudf
>>> df = cudf.DataFrame({
...     "a": [[1, 2, 3], [], None, [4, 5]],
...     "b": [11, 22, 33, 44],
... })
>>> df
           a   b
0  [1, 2, 3]  11
1         []  22
2       None  33
3     [4, 5]  44
>>> df.explode('a')
      a   b
0     1  11
0     2  11
0     3  11
1  <NA>  22
2  <NA>  33
3     4  44
3     5  44
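
The row replication shown above can be mimicked in plain Python. The sketch is not the library's implementation; the helper explode_rows is hypothetical and None stands in for a missing value. It yields the same seven rows as the example, including one row each for the empty list and the missing list.

```python
def explode_rows(index, list_column, other_column):
    """Return (index, element, other) tuples, one per list element."""
    rows = []
    for idx, items, other in zip(index, list_column, other_column):
        # An empty or missing list still produces one row with a missing element.
        for item in (items if items else [None]):
            rows.append((idx, item, other))
    return rows


exploded = explode_rows(
    [0, 1, 2, 3],
    [[1, 2, 3], [], None, [4, 5]],
    [11, 22, 33, 44],
)
for row in exploded:
    print(row)
```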

pct_change(periods=1, fill_method=<no_default>, limit=<no_default>, freq=None, **kwargs)#

Calculates the percent change between sequential elements in the DataFrame.

Parameters#

periodsint, default 1: Periods to shift for forming percent change.
fill_methodstr, default ‘ffill’: How to handle NAs before computing percent changes.

Deprecated since version 24.04: All options of fill_method are deprecated except fill_method=None.
limitint, optional: The number of consecutive NAs to fill before stopping. Not yet implemented.

Deprecated since version 24.04: limit is deprecated.
freqstr, optional: Increment to use from time series API. Not yet implemented.
**kwargs: Additional keyword arguments are passed into DataFrame.shift.

Returns#

DataFrame
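
No example is given for this method, so here is a minimal plain-Python sketch of the arithmetic for one column. It is not the library's implementation; the function name pct_change_sketch is hypothetical, it handles positive periods only, and it ignores fill_method, limit, and freq.

```python
def pct_change_sketch(values, periods=1):
    """Fractional change of each element relative to the one `periods` earlier."""
    out = []
    for k, current in enumerate(values):
        if k < periods:
            out.append(None)  # no earlier element to compare against
        else:
            previous = values[k - periods]
            out.append((current - previous) / previous)
    return out


prices = [100.0, 110.0, 99.0]
print(pct_change_sketch(prices))
print(pct_change_sketch(prices, periods=2))
```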

nunique(axis=0, dropna: bool = True) → Series#

Count number of distinct elements in specified axis. Return Series with number of distinct elements. Can ignore NaN values.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
dropnabool, default True: Don’t include NaN in the counts.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})
>>> df.nunique()
A    3
B    2
dtype: int64

interleave_columns()#

Interleave Series columns of a table into a single column.

Converts the column major table cols into a row major column.

Parameters#

cols : input Table containing columns to interleave.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({0: ['A1', 'A2', 'A3'], 1: ['B1', 'B2', 'B3']})
>>> df
    0   1
0  A1  B1
1  A2  B2
2  A3  B3
>>> df.interleave_columns()
0    A1
1    B1
2    A2
3    B2
4    A3
5    B3
dtype: object

Returns#

The interleaved columns as a single column
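
The column-major to row-major conversion amounts to reading the table row by row. A plain-Python sketch, not the library's implementation and with a hypothetical helper name, looks like this:

```python
def interleave(columns):
    """Flatten equally long columns row by row into a single list."""
    return [value for row in zip(*columns) for value in row]


interleaved = interleave([["A1", "A2", "A3"], ["B1", "B2", "B3"]])
print(interleaved)
```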

eval(expr: str, inplace: bool = False, **kwargs)#

Evaluate a string describing operations on DataFrame columns.

Operates on columns only, not specific rows or elements.

Parameters#

exprstr: The expression string to evaluate.
inplacebool, default False: If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
**kwargs: Not supported.

Returns#

DataFrame, Series, or None: Series if a single column is returned (the typical use case), DataFrame if any assignment statements are included in expr, or None if inplace=True.

Examples#

>>> df = cudf.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed though by default the original DataFrame is not modified.

>>> df.eval('C = A + B')
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Use inplace=True to modify the original DataFrame.

>>> df.eval('C = A + B', inplace=True)
>>> df
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7

Multiple columns can be assigned to using multi-line expressions:

>>> df.eval(
...     '''
... C = A + B
... D = A - B
... '''
... )
   A   B   C  D
0  1  10  11 -9
1  2   8  10 -6
2  3   6   9 -3
3  4   4   8  0
4  5   2   7  3

value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)#

Return a Series containing counts of unique rows in the DataFrame.

Parameters#

subset: list-like, optional: Columns to use when counting unique combinations.
normalize: bool, default False: Return proportions rather than frequencies.
sort: bool, default True: Sort by frequencies.
ascending: bool, default False: Sort in ascending order.
dropna: bool, default True: Don’t include counts of rows that contain NA values.

Returns#

Series

Notes#

The returned Series will have a MultiIndex with one level per input column. By default, rows that contain any NA values are omitted from the result. By default, the resulting Series will be in descending order so that the first element is the most frequently-occurring row.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'num_legs': [2, 4, 4, 6],
...                    'num_wings': [2, 0, 0, 0]},
...                    index=['falcon', 'dog', 'cat', 'ant'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0
>>> df.value_counts().sort_index()
num_legs  num_wings
2         2            1
4         0            2
6         0            1
Name: count, dtype: int64
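
Counting unique rows and the effect of normalize can be sketched in plain Python by treating each row as a tuple. This is an illustration only, not the library's implementation; the helper name row_counts is hypothetical.

```python
from collections import Counter


def row_counts(rows, normalize=False):
    """Count identical rows, most frequent first; optionally return proportions."""
    ordered = Counter(rows).most_common()  # descending by frequency
    if normalize:
        total = len(rows)
        return [(row, n / total) for row, n in ordered]
    return ordered


rows = [(2, 2), (4, 0), (4, 0), (6, 0)]   # (num_legs, num_wings) per animal
print(row_counts(rows))
print(row_counts(rows, normalize=True))
```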

to_pylibcudf(copy: bool = False) → tuple[Table, dict]#

Convert this DataFrame to a pylibcudf.Table.

Parameters#

copybool: Whether or not to generate a new copy of the underlying device data

Returns#

pylibcudf.Table: A pylibcudf.Table referencing the same data.
dict: Dict of metadata (includes column names and dataframe indices)

Notes#

User requests to convert to pylibcudf must assume that the data may be modified afterwards.

classmethod from_pylibcudf(table: Table, metadata: dict) → Self#

Create a DataFrame from a pylibcudf.Table.

Parameters#

tablepylibcudf.Table: The input Table.
metadatadict: Metadata necessary to reconstruct the dataframe

Returns#

tablecudf.DataFrame: A cudf.DataFrame referencing the columns in the pylibcudf.Table.
metadatalist[str]: Dict of metadata (includes column names and dataframe indices)

Notes#

This function will generate a DataFrame which contains a tuple of columns pointing to the same columns the input table points to. It will directly access the data and mask buffers of the pylibcudf columns, so the newly created object is not tied to the lifetime of the original pylibcudf.Table.

abs()#

Return a Series/DataFrame with absolute numeric value of each element.

This function only applies to elements that are all numeric.

Returns#

DataFrame/Series: Absolute value of each element.

Examples#

Absolute numeric values in a Series

>>> s = cudf.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

add(other, axis='columns', level=None, fill_value=None)#

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

Equivalent to frame + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.add(b)
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.add(b, fill_value=0)
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
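
The `fill_value` rule can be sketched in plain Python, assuming labelled data is modelled as a dict and `None` stands for a missing value. The helper `add_with_fill` is hypothetical and only mirrors the behaviour described above: a value missing on exactly one side is replaced by `fill_value`, while a value missing on both sides stays missing.

```python
def add_with_fill(a, b, fill_value=None):
    """Element-wise a + b over the union of labels; None marks a missing value."""
    result = {}
    for label in sorted(set(a) | set(b)):
        x, y = a.get(label), b.get(label)
        if fill_value is not None and (x is None) != (y is None):
            # Exactly one side is missing: substitute fill_value for it.
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        result[label] = None if x is None or y is None else x + y
    return result


a = {'a': 1, 'b': 1, 'c': 1, 'd': None}
b = {'a': 1, 'b': None, 'd': 1, 'e': None}

plain = add_with_fill(a, b)
filled = add_with_fill(a, b, fill_value=0)
```

With these inputs, `filled` has the value 2 for `a`, 1 for `b`, `c`, and `d`, and `None` for `e`, matching the Series example above.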

argsort(by=None, axis=0, kind='quicksort', order=None, ascending=True, na_position='last') → ndarray#

Return the integer indices that would sort the Series values.

Parameters#

bystr or list of str, default None: Name or list of names to sort by. If None, sort by all columns.
axis{0 or “index”}: Has no effect but is accepted for compatibility with numpy.
kind{‘mergesort’, ‘quicksort’, ‘heapsort’, ‘stable’}, default ‘quicksort’: Choice of sorting algorithm. See numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms. Only quicksort is supported in cuDF.
orderNone: Has no effect but is accepted for compatibility with numpy.
ascendingbool or list of bool, default True: If True, sort values in ascending order, otherwise descending.
na_position{‘first’ or ‘last’}, default ‘last’: Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.

Returns#

cupy.ndarray: The indices sorted based on input.

Examples#

Series

>>> import cudf
>>> s = cudf.Series([3, 1, 2])
>>> s
0    3
1    1
2    2
dtype: int64
>>> s.argsort()
0    1
1    2
2    0
dtype: int32
>>> s[s.argsort()]
1    1
2    2
0    3
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'foo': [3, 1, 2]})
>>> df.argsort()
array([1, 2, 0], dtype=int32)

Index

>>> import cudf
>>> idx = cudf.Index([3, 1, 2])
>>> idx.argsort()
array([1, 2, 0], dtype=int32)
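
For intuition, the same index computation can be written with the Python standard library. This is only a sketch of what "indices that would sort the values" means; it says nothing about how the GPU sort is implemented.

```python
values = [3, 1, 2]

# Positions of the elements in ascending order of their values.
order = sorted(range(len(values)), key=values.__getitem__)

# Reordering the values by those positions yields the sorted sequence.
sorted_values = [values[i] for i in order]
```

Here `order` is `[1, 2, 0]` and `sorted_values` is `[1, 2, 3]`.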

backfill(value=None, axis=None, inplace=None, limit=None)#: Synonym for Series.fillna() with method='bfill'.

Deprecated since version 23.06: Use DataFrame.bfill/Series.bfill instead.

Returns#

Object with missing values filled or None if inplace=True.

bfill(value=None, axis=None, inplace=None, limit=None, limit_area=None)#: Synonym for Series.fillna() with method='bfill'.

Returns#

Object with missing values filled or None if inplace=True.

clip(lower=None, upper=None, axis=1, inplace=False)#

Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis. Currently only axis=1 is supported.

Parameters#

lowerscalar or array_like, default None: Minimum threshold value. All values below this threshold will be set to it. If it is None, there will be no clipping based on lower. In case of Series/Index, lower is expected to be a scalar or an array of size 1.
upperscalar or array_like, default None: Maximum threshold value. All values above this threshold will be set to it. If it is None, there will be no clipping based on upper. In case of Series, upper is expected to be a scalar or an array of size 1.

inplace : bool, default False

Returns#

Clipped DataFrame/Series/Index/MultiIndex

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"a":[1, 2, 3, 4], "b":['a', 'b', 'c', 'd']})
>>> df.clip(lower=[2, 'b'], upper=[3, 'c'])
   a  b
0  2  b
1  2  b
2  3  c
3  3  c

>>> df.clip(lower=None, upper=[3, 'c'])
   a  b
0  1  a
1  2  b
2  3  c
3  3  c

>>> df.clip(lower=[2, 'b'], upper=None)
   a  b
0  2  b
1  2  b
2  3  c
3  4  d

>>> df.clip(lower=2, upper=3, inplace=True)
>>> df
   a  b
0  2  2
1  2  3
2  3  3
3  3  3

>>> import cudf
>>> sr = cudf.Series([1, 2, 3, 4])
>>> sr.clip(lower=2, upper=3)
0    2
1    2
2    3
3    3
dtype: int64

>>> sr.clip(lower=None, upper=3)
0    1
1    2
2    3
3    3
dtype: int64

>>> sr.clip(lower=2, upper=None, inplace=True)
>>> sr
0    2
1    2
2    3
3    4
dtype: int64
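
The scalar-threshold case can be summarised by a small plain-Python sketch. The helper `clip_values` is hypothetical and handles only scalar `lower` and `upper`, not the array-like thresholds described above.

```python
def clip_values(values, lower=None, upper=None):
    """Return values with entries outside [lower, upper] set to the boundary."""
    clipped = []
    for value in values:
        if lower is not None and value < lower:
            value = lower
        if upper is not None and value > upper:
            value = upper
        clipped.append(value)
    return clipped


both = clip_values([1, 2, 3, 4], lower=2, upper=3)
upper_only = clip_values([1, 2, 3, 4], upper=3)
lower_only = clip_values([1, 2, 3, 4], lower=2)
```

Passing `None` for either bound disables clipping on that side, as in the Series examples above.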

convert_dtypes(infer_objects: bool = True, convert_string: bool = True, convert_integer: bool = True, convert_boolean: bool = True, convert_floating: bool = True, dtype_backend=None) → Self#

Convert columns to the best possible nullable dtypes.

If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type. Otherwise, convert to an appropriate floating type.

All other dtypes are always returned as-is as all dtypes in cudf are nullable.

copy(deep: bool = True) → Self#

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below). When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Parameters#

deepbool, default True: Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.

Returns#

copySeries or DataFrame: Object type matches caller.

Examples#

>>> s = cudf.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64
>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

Shallow copy versus default (deep) copy:

>>> s = cudf.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Updates to the data shared by the shallow copy and the original are reflected in both; the deep copy remains unchanged.

>>> s['a'] = 3
>>> shallow['b'] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64
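
The difference between the two modes comes down to whether the underlying buffer is shared. The sketch below models that with a hypothetical `TinyFrame` class that wraps a dict; it is an analogy for the behaviour shown above, not a description of how hipDF stores device data.

```python
import copy


class TinyFrame:
    """Hypothetical stand-in for a frame: a thin wrapper around a data dict."""

    def __init__(self, data):
        self.data = data

    def copy(self, deep=True):
        # deep=True duplicates the data; deep=False shares the same object.
        return TinyFrame(copy.deepcopy(self.data) if deep else self.data)


s = TinyFrame({'a': 1, 'b': 2})
deep = s.copy()
shallow = s.copy(deep=False)

s.data['a'] = 3        # visible through the shallow copy
shallow.data['b'] = 4  # visible through the original
```

After the two assignments, `s` and `shallow` both hold `{'a': 3, 'b': 4}`, while `deep` still holds `{'a': 1, 'b': 2}`.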

cummax(axis=None, *args, **kwargs)#

Return cumulative max of the IndexedFrame.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

IndexedFrame

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cummax()
0    1
1    5
2    5
3    5
4    5
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cummax()
   a   b
0  1   7
1  2   8
2  3   9
3  4  10

cummin(axis=None, *args, **kwargs)#

Return cumulative min of the IndexedFrame.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

IndexedFrame

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cummin()
0    1
1    1
2    1
3    1
4    1
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cummin()
   a  b
0  1  7
1  1  7
2  1  7
3  1  7

cumprod(axis=None, *args, **kwargs)#

Return cumulative product of the IndexedFrame.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

IndexedFrame

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cumprod()
0      1
1      5
2     10
3     40
4    120
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cumprod()
    a     b
0   1     7
1   2    56
2   6   504
3  24  5040

cumsum(axis=None, *args, **kwargs)#

Return cumulative sum of the IndexedFrame.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

IndexedFrame

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cumsum()
0     1
1     6
2     8
3    12
4    15
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cumsum()
    a   b
0   1   7
1   3  15
2   6  24
3  10  34
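
The four cumulative reductions (cummax, cummin, cumprod, cumsum) differ only in the binary operation that is accumulated. As a CPU-side sketch using `itertools.accumulate` from the standard library (an illustration of the semantics, with no null handling):

```python
from itertools import accumulate
import operator

values = [1, 5, 2, 4, 3]

running_sum = list(accumulate(values))                 # default operation is addition
running_max = list(accumulate(values, max))
running_min = list(accumulate(values, min))
running_prod = list(accumulate(values, operator.mul))
```

For this input, `running_sum` is `[1, 6, 8, 12, 15]`, `running_max` is `[1, 5, 5, 5, 5]`, `running_min` is `[1, 1, 1, 1, 1]`, and `running_prod` is `[1, 5, 10, 40, 120]`.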

div(other, axis='columns', level=None, fill_value=None)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.truediv(1)
           angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.truediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64

divide(other, axis='columns', level=None, fill_value=None)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.truediv(1)
           angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.truediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64

dot(other, reflect=False)#

Get dot product of frame and other, (binary operator dot).

Among flexible wrappers (add, sub, mul, div, mod, pow, dot) to arithmetic operators: +, -, *, /, //, %, **, @.

Parameters#

otherSequence, Series, or DataFrame: Any multiple element data structure, or list-like object.
reflectbool, default False: If True, swap the order of the operands. See https://docs.python.org/3/reference/datamodel.html#object.__ror__ for more information on when this is necessary.

Returns#

scalar, Series, or DataFrame: The result of the operation.

Examples#

>>> import cudf
>>> df = cudf.DataFrame([[1, 2, 3, 4],
...                      [5, 6, 7, 8]])
>>> df @ df.T
    0    1
0  30   70
1  70  174
>>> s = cudf.Series([1, 1, 1, 1])
>>> df @ s
0    10
1    26
dtype: int64
>>> [1, 2, 3, 4] @ s
10
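
The numbers in this example can be checked by hand with a small pure-Python sketch. The helpers `matmul` and `transpose` are written for this illustration only and operate on lists of rows.

```python
def transpose(matrix):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, column)) for column in columns]
            for row in a]


frame = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
ones = [1, 1, 1, 1]

gram = matmul(frame, transpose(frame))
row_dots = [sum(x * y for x, y in zip(row, ones)) for row in frame]
```

Here `gram` is `[[30, 70], [70, 174]]` and `row_dots` is `[10, 26]`, the same values as `df @ df.T` and `df @ s` above.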

drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')#

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

Parameters#

labelssingle label or list-like: Index or column labels to drop.
axis{0 or ‘index’, 1 or ‘columns’}, default 0: Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
indexsingle label or list-like: Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
columnssingle label or list-like: Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
levelint or level name, optional: For MultiIndex, level from which the labels will be removed.
inplacebool, default False: If False, return a copy. Otherwise, do operation inplace and return None.
errors{‘ignore’, ‘raise’}, default ‘raise’: If ‘ignore’, suppress error and only existing labels are dropped.

Returns#

DataFrame or Series: DataFrame or Series without the removed index or column labels.

Raises#

KeyError: If any of the labels is not found in the selected axis.

Examples#

Series

>>> s = cudf.Series([1,2,3], index=['x', 'y', 'z'])
>>> s
x    1
y    2
z    3
dtype: int64

Drop labels x and z

>>> s.drop(labels=['x', 'z'])
y    2
dtype: int64

Drop a label from the second level in MultiIndex Series.

>>> midx = cudf.MultiIndex.from_product([[0, 1, 2], ['x', 'y']])
>>> s = cudf.Series(range(6), index=midx)
>>> s
0  x    0
   y    1
1  x    2
   y    3
2  x    4
   y    5
dtype: int64
>>> s.drop(labels='y', level=1)
0  x    0
1  x    2
2  x    4
Name: 2, dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({"A": [1, 2, 3, 4],
...                      "B": [5, 6, 7, 8],
...                      "C": [10, 11, 12, 13],
...                      "D": [20, 30, 40, 50]})
>>> df
   A  B   C   D
0  1  5  10  20
1  2  6  11  30
2  3  7  12  40
3  4  8  13  50

Drop columns

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  1  20
1  2  30
2  3  40
3  4  50
>>> df.drop(columns=['B', 'C'])
   A   D
0  1  20
1  2  30
2  3  40
3  4  50

Drop rows by index

>>> df.drop([0, 1])
   A  B   C   D
2  3  7  12  40
3  4  8  13  50

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = cudf.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = cudf.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
       length    1.5    1.0
cow    speed    30.0   20.0
       weight  250.0  150.0
       length    1.5    0.8
falcon speed   320.0  250.0
       weight    1.0    0.8
       length    0.3    0.2
>>> df.drop(index='cow', columns='small')
                 big
lama   speed    45.0
       weight  200.0
       length    1.5
falcon speed   320.0
       weight    1.0
       length    0.3
>>> df.drop(index='length', level=1)
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
cow    speed    30.0   20.0
       weight  250.0  150.0
falcon speed   320.0  250.0
       weight    1.0    0.8

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False, ignore_index: bool = False)#

Drop rows (or columns) containing nulls from the DataFrame.

Parameters#

axis{0, 1}, optional: Whether to drop rows (axis=0, default) or columns (axis=1) containing nulls.
how{“any”, “all”}, optional: Specifies how to decide whether to drop a row (or column). any (default) drops rows (or columns) containing at least one null value. all drops only rows (or columns) containing all null values.
thresh: int, optional: If specified, drops every row (or column) containing fewer than thresh non-null values.
subsetlist, optional: List of columns to consider when dropping rows (all columns are considered by default). Alternatively, when dropping columns, subset is a list of rows to consider.
inplacebool, default False: If True, do operation inplace and return None.
ignore_indexbool, default False: If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns#

Copy of the DataFrame with rows/columns containing nulls dropped.

Examples#

>>> import cudf
>>> import numpy as np
>>> df = cudf.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": ['Batmobile', None, 'Bullwhip'],
...                    "born": [np.datetime64("1940-04-25"),
...                             np.datetime64("NaT"),
...                             np.datetime64("NaT")]})
>>> df
       name        toy                 born
0    Alfred  Batmobile  1940-04-25 00:00:00
1    Batman       <NA>                 <NA>
2  Catwoman   Bullwhip                 <NA>

Drop the rows where at least one element is null.

>>> df.dropna()
     name        toy       born
0  Alfred  Batmobile 1940-04-25

Drop the columns where at least one element is null.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are null.

>>> df.dropna(how='all')
       name        toy                 born
0    Alfred  Batmobile  1940-04-25 00:00:00
1    Batman       <NA>                 <NA>
2  Catwoman   Bullwhip                 <NA>

Keep only the rows with at least 2 non-null values.

>>> df.dropna(thresh=2)
       name        toy                 born
0    Alfred  Batmobile  1940-04-25 00:00:00
2  Catwoman   Bullwhip                 <NA>

Define in which columns to look for null values.

>>> df.dropna(subset=['name', 'born'])
     name        toy       born
0  Alfred  Batmobile 1940-04-25

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)
>>> df
     name        toy       born
0  Alfred  Batmobile 1940-04-25
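
The interaction of `how` and `thresh` for row-wise dropping can be sketched in plain Python. The helper `dropna_rows` is hypothetical, uses `None` for a null, and assumes (as an illustration choice) that `thresh` takes precedence when it is given.

```python
def dropna_rows(rows, how="any", thresh=None):
    """Keep rows according to the 'how' / 'thresh' rules; None marks a null."""
    kept = []
    for row in rows:
        non_null = sum(value is not None for value in row)
        if thresh is not None:
            keep = non_null >= thresh       # at least thresh non-null values
        elif how == "any":
            keep = non_null == len(row)     # drop if any value is null
        elif how == "all":
            keep = non_null > 0             # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


rows = [
    ("Alfred", "Batmobile", "1940-04-25"),
    ("Batman", None, None),
    ("Catwoman", "Bullwhip", None),
]

any_null = dropna_rows(rows)
all_null = dropna_rows(rows, how="all")
at_least_two = dropna_rows(rows, thresh=2)
```

With this data, `any_null` keeps only the Alfred row, `all_null` keeps all three rows, and `at_least_two` keeps the Alfred and Catwoman rows.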

duplicated(subset=None, keep: Literal['first', 'last', False] = 'first') → Series#

Return boolean Series denoting duplicate rows.

Considering certain columns is optional.

Parameters#

subsetcolumn label or sequence of labels, optional: Only consider certain columns for identifying duplicates; by default, use all of the columns.
keep{‘first’, ‘last’, False}, default ‘first’: Determines which duplicates (if any) to mark.

'first': Mark duplicates as True except for the first occurrence.
'last': Mark duplicates as True except for the last occurrence.
False: Mark all duplicates as True.

Returns#

Series: Boolean series indicating duplicated rows.

Examples#

Consider a dataset containing ramen product ratings.

>>> import cudf
>>> df = cudf.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Maggie', 'Maggie', 'Maggie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2   Maggie   cup     3.5
3   Maggie  pack    15.0
4   Maggie  pack     5.0

By default, for each set of duplicated values, the first occurrence is set to False and all others to True.

>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool

By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool

By setting keep to False, all duplicates are True.

>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool

To find duplicates on specific column(s), use subset.

>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool
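
The three `keep` modes can be reproduced with a short plain-Python sketch. The helper `mark_duplicates` is hypothetical and treats each row as a hashable tuple; it illustrates the marking rules only.

```python
from collections import Counter


def mark_duplicates(rows, keep="first"):
    """Return one boolean per row, True where the row counts as a duplicate."""
    if keep is False:
        counts = Counter(rows)
        return [counts[row] > 1 for row in rows]
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last', or False")
    # For keep='last', scan from the end so the last occurrence is seen first.
    ordered = rows if keep == "first" else rows[::-1]
    seen = set()
    flags = []
    for row in ordered:
        flags.append(row in seen)
        seen.add(row)
    return flags if keep == "first" else flags[::-1]


rows = [
    ("Yum Yum", "cup", 4.0),
    ("Yum Yum", "cup", 4.0),
    ("Maggie", "cup", 3.5),
    ("Maggie", "pack", 15.0),
    ("Maggie", "pack", 5.0),
]

first = mark_duplicates(rows)
last = mark_duplicates(rows, keep="last")
every = mark_duplicates(rows, keep=False)
by_brand = mark_duplicates([(row[0],) for row in rows])
```

Restricting each row to the brand field, as in `by_brand`, plays the role of `subset=['brand']`.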

property empty#

Indicator whether DataFrame or Series is empty.

True if DataFrame/Series is entirely empty (no items), meaning any of the axes are of length 0.

Returns#

outbool: If DataFrame/Series is empty, return True, if not return False.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'A' : []})
>>> df
Empty DataFrame
Columns: [A]
Index: []
>>> df.empty
True

If we only have null values in our DataFrame, it is not considered empty! We will need to drop the nulls to make the DataFrame empty:

>>> df = cudf.DataFrame({'A' : [None, None]})
>>> df
      A
0  <NA>
1  <NA>
>>> df.empty
False
>>> df.dropna().empty
True

Non-empty and empty Series example:

>>> s = cudf.Series([1, 2, None])
>>> s
0       1
1       2
2    <NA>
dtype: int64
>>> s.empty
False
>>> s = cudf.Series([])
>>> s
Series([], dtype: float64)
>>> s.empty
True

eq(other, axis='columns', level=None, fill_value=None)#

Get Equal to of DataFrame or Series and other, element-wise (binary operator eq).

Equivalent to frame == other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.eq(1)
           angles  degrees
circle      False    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.eq(b)
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.eq(b, fill_value=0)
a    True
b   False
c   False
d   False
e    <NA>
dtype: bool

Provide exponential weighted (EW) functions.

Available EW functions: mean()

Exactly one of the parameters com, span, halflife, or alpha must be provided.

Parameters#

comfloat, optional: Specify decay in terms of center of mass, \(\alpha = 1 / (1 + com)\), for \(com \geq 0\).
spanfloat, optional: Specify decay in terms of span, \(\alpha = 2 / (span + 1)\), for \(span \geq 1\).
halflifefloat, str, timedelta, optional: Specify decay in terms of half-life, \(\alpha = 1 - \exp\left(-\ln(2) / halflife\right)\), for \(halflife > 0\).
alphafloat, optional: Specify smoothing factor \(\alpha\) directly, \(0 < \alpha \leq 1\).
min_periodsint, default 0: Not Supported
adjustbool, default True: Controls assumptions about the first value in the sequence. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html for details.
ignore_nabool, default False: Not Supported
axis{0, 1}, default 0: Not Supported
timesstr, np.ndarray, Series, default None: Not Supported

Returns#

ExponentialMovingWindow object

Notes#

cuDF input data may contain both nulls and NaN values. For the purposes of this method they are treated identically: nulls in cuDF affect the result the same way that NaN values would in the equivalent pandas method.

Examples#

>>> df = cudf.DataFrame({'B': [0, 1, 2, cudf.NA, 4]})
>>> df
      B
0     0
1     1
2     2
3  <NA>
4     4
>>> df.ewm(com=0.5).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213

>>> df.ewm(com=0.5, adjust=False).mean()
          B
0  0.000000
1  0.666667
2  1.555556
3  1.555556
4  3.650794
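
The decay formulas in the parameter list, and the effect of `adjust`, can be checked with a small plain-Python sketch. The helpers `alpha_from` and `ewm_mean` are hypothetical; they assume a numeric halflife, perform no range validation, and handle only sequences without missing values.

```python
import math


def alpha_from(com=None, span=None, halflife=None, alpha=None):
    """Convert exactly one decay specification into the smoothing factor alpha."""
    given = [p for p in (com, span, halflife, alpha) if p is not None]
    if len(given) != 1:
        raise ValueError("exactly one of com, span, halflife, alpha is required")
    if com is not None:
        return 1.0 / (1.0 + com)
    if span is not None:
        return 2.0 / (span + 1.0)
    if halflife is not None:
        return 1.0 - math.exp(-math.log(2.0) / halflife)
    return alpha


def ewm_mean(values, alpha, adjust=True):
    """Exponentially weighted mean of a sequence with no missing values."""
    out = []
    if adjust:
        # Weighted average with weights (1 - alpha) ** i over past values.
        numerator = denominator = 0.0
        for x in values:
            numerator = x + (1.0 - alpha) * numerator
            denominator = 1.0 + (1.0 - alpha) * denominator
            out.append(numerator / denominator)
    else:
        # Recursive form: y_t = (1 - alpha) * y_{t-1} + alpha * x_t.
        previous = None
        for x in values:
            previous = x if previous is None else (1.0 - alpha) * previous + alpha * x
            out.append(previous)
    return out


a = alpha_from(com=0.5)                        # 2/3
adjusted = ewm_mean([0, 1, 2], a)
recursive = ewm_mean([0, 1, 2], a, adjust=False)
```

For the first three rows of the example above, `adjusted` is approximately `[0.0, 0.75, 1.615385]` and `recursive` is approximately `[0.0, 0.666667, 1.555556]`.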

ffill(value=None, axis=None, inplace=None, limit=None, limit_area: Literal['inside', 'outside', None] = None)#: Synonym for Series.fillna() with method='ffill'.

Returns#

Object with missing values filled or None if inplace=True.

first(offset)#

Select initial periods of time series data based on a date offset.

When having a DataFrame with sorted dates as index, this function can select the first few rows based on a date offset.

Parameters#

offset: str: The offset length of the data that will be selected. For instance, ‘1M’ will display all rows having their index within the first month.

Returns#

Series or DataFrame: A subset of the caller.

Raises#

TypeError: If the index is not a DatetimeIndex

Examples#

>>> i = cudf.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = cudf.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2
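
The selection rule for a day-based offset can be sketched with the standard `datetime` module. The helper `first_rows` is hypothetical and assumes a sorted index and an offset expressed as a whole number of days; other offset strings such as '1M' are not modelled.

```python
from datetime import date, timedelta


def first_rows(index, days):
    """Return the index entries that fall within `days` of the first entry."""
    if not index:
        return []
    end = index[0] + timedelta(days=days)
    # The end point itself is excluded.
    return [stamp for stamp in index if stamp < end]


index = [date(2018, 4, 9), date(2018, 4, 11), date(2018, 4, 13), date(2018, 4, 15)]
selected = first_rows(index, 3)
```

Here `selected` contains 2018-04-09 and 2018-04-11, the two rows returned by `ts.first('3D')` above.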

floordiv(other, axis='columns', level=None, fill_value=None)#

Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

Equivalent to frame // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.floordiv(1)
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.floordiv(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.floordiv(b, fill_value=0)
a                      1
b    9223372036854775807
c    9223372036854775807
d                      0
e                   <NA>
dtype: int64

ge(other, axis='columns', level=None, fill_value=None)#

Get Greater than or equal to of DataFrame or Series and other, element-wise (binary operator ge).

Equivalent to frame >= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.ge(1)
           angles  degrees
circle      False     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.ge(b)
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.ge(b, fill_value=0)
a    True
b    True
c    True
d   False
e    <NA>
dtype: bool

gt(other, axis='columns', level=None, fill_value=None)#

Get Greater than of DataFrame or Series and other, element-wise (binary operator gt).

Equivalent to frame > other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.gt(1)
           angles  degrees
circle      False     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.gt(b)
a   False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.gt(b, fill_value=0)
a   False
b    True
c    True
d   False
e    <NA>
dtype: bool

hash_values(method: Literal['murmur3', 'xxhash64', 'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512'] = 'murmur3', seed: int | None = None) → Series#

Compute the hash of values in this column.

Parameters#

method{‘murmur3’, ‘xxhash32’, ‘xxhash64’, ‘md5’, ‘sha1’, ‘sha224’, ‘sha256’, ‘sha384’, ‘sha512’}, default ‘murmur3’

Hash function to use:

murmur3: MurmurHash3 hash function
xxhash32: xxHash32 hash function
xxhash64: xxHash64 hash function
md5: MD5 hash function
sha1: SHA-1 hash function
sha224: SHA-224 hash function
sha256: SHA-256 hash function
sha384: SHA-384 hash function
sha512: SHA-512 hash function

seedint, optional

Seed value to use for the hash function. This parameter is only supported for ‘murmur3’, ‘xxhash32’, and ‘xxhash64’.

Returns#

Series: A Series with hash values.

Examples#

Series

>>> import cudf
>>> series = cudf.Series([10, 120, 30])
>>> series
0     10
1    120
2     30
dtype: int64
>>> series.hash_values(method="murmur3")
0   -1930516747
1     422619251
2    -941520876
dtype: int32
>>> series.hash_values(method="md5")
0    7be4bbacbfdb05fb3044e36c22b41e8b
1    947ca8d2c5f0f27437f156cfbfab0969
2    d0580ef52d27c043c8e341fd5039b166
dtype: object
>>> series.hash_values(method="murmur3", seed=42)
0    2364453205
1     422621911
2    3353449140
dtype: uint32

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({"a": [10, 120, 30], "b": [0.0, 0.25, 0.50]})
>>> df
     a     b
0   10  0.00
1  120  0.25
2   30  0.50
>>> df.hash_values(method="murmur3")
0    -330519225
1    -397962448
2   -1345834934
dtype: int32
>>> df.hash_values(method="md5")
0    57ce879751b5169c525907d5c563fae1
1    948d6221a7c4963d4be411bcead7e32b
2    fe061786ea286a515b772d91b0dfcd70
dtype: object

head(n=5)#

Return the first n rows. This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].

Parameters#

nint, default 5: Number of rows to select.

Returns#

DataFrame or Series: The first n rows of the caller object.

Examples#

Series

>>> ser = cudf.Series(['alligator', 'bee', 'falcon',
... 'lion', 'monkey', 'parrot', 'shark', 'whale', 'zebra'])
>>> ser
0    alligator
1          bee
2       falcon
3         lion
4       monkey
5       parrot
6        shark
7        whale
8        zebra
dtype: object

Viewing the first 5 lines

>>> ser.head()
0    alligator
1          bee
2       falcon
3         lion
4       monkey
dtype: object

Viewing the first n lines (three in this case)

>>> ser.head(3)
0    alligator
1          bee
2       falcon
dtype: object

For negative values of n

>>> ser.head(-3)
0    alligator
1          bee
2       falcon
3         lion
4       monkey
5       parrot
dtype: object

DataFrame

>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> df.head(2)
   key   val
0    0  10.0
1    1  11.0
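
The positional rule for head, including negative n, matches ordinary Python slicing. The sketch below uses a plain list in place of a Series; the helper name head is only for illustration and does not reflect how the library implements the method.

```python
def head(rows, n=5):
    """Return the first n rows; for negative n, return all rows except the last |n|."""
    return rows[:n]

animals = ["alligator", "bee", "falcon", "lion", "monkey",
           "parrot", "shark", "whale", "zebra"]

first_five = head(animals)          # default n=5
first_three = head(animals, 3)
all_but_last_three = head(animals, -3)
print(first_five)
print(first_three)
print(all_but_last_three)
```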

property iloc#

Select values by position.

Examples#

Series

>>> import cudf
>>> s = cudf.Series([10, 20, 30])
>>> s
0    10
1    20
2    30
dtype: int64
>>> s.iloc[2]
30

DataFrame

Selecting rows and columns by position.

>>> df = cudf.DataFrame({'a': range(20),
...                      'b': range(20),
...                      'c': range(20)})

Select a single row using an integer index.

>>> df.iloc[1]
a    1
b    1
c    1
Name: 1, dtype: int64

Select multiple rows using a list of integers.

>>> df.iloc[[0, 2, 9, 18]]
      a    b    c
 0    0    0    0
 2    2    2    2
 9    9    9    9
18   18   18   18

Select rows using a slice.

>>> df.iloc[3:10:2]
     a    b    c
3    3    3    3
5    5    5    5
7    7    7    7
9    9    9    9

Select both rows and columns.

>>> df.iloc[[1, 3, 5, 7], 2]
1    1
3    3
5    5
7    7
Name: c, dtype: int64

Setting values in the first four rows using iloc.

>>> df.iloc[:4] = 0
>>> df
   a  b  c
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
[10 more rows]

property index#: Get the labels for the rows.

isna()#

Identify missing values.

Return a boolean same-sized object indicating if the values are <NA>. <NA> values get mapped to True values. Everything else gets mapped to False values. <NA> values include:

Values where null mask is set.
NaN in float dtype.
NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index: Mask of bool values for each element in the object that indicates whether an element is an NA value.

Examples#

Show which entries in a DataFrame are NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
3    False
4    False
dtype: bool

Show which entries in an Index are NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.isna()
array([False, False,  True,  True, False, False])
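
The classification rules above (nulls and NaN count as missing; empty strings and infinities do not) can be illustrated with a small pure-Python predicate. This is a sketch under assumptions: None stands in for a masked null, the helper name is_na is hypothetical, and NaT handling is left out.

```python
import math

def is_na(value):
    """Return True for a null (None) or a floating-point NaN, False otherwise."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)

values = [5.0, None, float("nan"), float("inf"), float("-inf"), ""]
flags = [is_na(v) for v in values]
print(flags)
```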

isnull()#

Identify missing values.

Return a boolean same-sized object indicating if the values are <NA>. <NA> values get mapped to True values. Everything else gets mapped to False values. <NA> values include:

Values where null mask is set.
NaN in float dtype.
NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index: Mask of bool values for each element in the object that indicates whether an element is an NA value.

Examples#

Show which entries in a DataFrame are NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
3    False
4    False
dtype: bool

Show which entries in an Index are NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.isna()
array([False, False,  True,  True, False, False])

kurt(axis=0, skipna=True, numeric_only=False, **kwargs)#

Return Fisher’s unbiased kurtosis of a sample.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values when computing the result.
numeric_onlybool, default False: If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.

Returns#

Series or scalar

Examples#

Series

>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4])
>>> series.kurtosis()
-1.1999999999999904

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.kurt()
a   -1.2
b   -1.2
dtype: float64
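
For reference, the value -1.2 in the examples above can be reproduced with the standard bias-corrected sample excess kurtosis. The sketch below assumes that formula; it is a plain-Python check, not the library's GPU implementation, and the function name fisher_kurtosis is hypothetical.

```python
def fisher_kurtosis(values):
    """Bias-corrected sample excess kurtosis (0.0 for a normal distribution)."""
    n = len(values)
    if n < 4:
        raise ValueError("at least four values are required")
    mean = sum(values) / n
    # Biased second and fourth central moments.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    g2 = m4 / m2 ** 2 - 3.0  # biased excess kurtosis
    # Small-sample correction; note m2 is zero for constant input, which fails here.
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

result_a = fisher_kurtosis([1, 2, 3, 4])
result_b = fisher_kurtosis([7, 8, 9, 10])
print(result_a, result_b)
```

Both columns give the same result because kurtosis depends only on the shape of the deviations from the mean, not on the location of the data.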

kurtosis(axis=0, skipna=True, numeric_only=False, **kwargs)#

Return Fisher’s unbiased kurtosis of a sample.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values when computing the result.
numeric_onlybool, default False: If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.

Returns#

Series or scalar

Examples#

Series

>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4])
>>> series.kurtosis()
-1.1999999999999904

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.kurt()
a   -1.2
b   -1.2
dtype: float64

last(offset)#

Select final periods of time series data based on a date offset.

Given a DataFrame whose index holds sorted dates, this function selects the last few rows based on a date offset.

Parameters#

offset: str: The offset length of the data that will be selected. For instance, ‘3D’ will display all rows having their index within the last 3 days.

Returns#

Series or DataFrame: A subset of the caller.

Raises#

TypeError: If the index is not a DatetimeIndex

Examples#

>>> i = cudf.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = cudf.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
>>> ts.last('3D')
            A
2018-04-13  3
2018-04-15  4
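
The selection in this example can be sketched with the standard library. The sketch assumes that rows are kept when their date is strictly later than the final date minus the offset, which reproduces the output above; the helper name last_rows and the (date, value) tuple layout are illustrative only.

```python
from datetime import date, timedelta

def last_rows(rows, offset):
    """rows: list of (date, value) pairs sorted by date; offset: a timedelta."""
    if not rows:
        return []
    cutoff = rows[-1][0] - offset  # final date minus the offset
    # Assumption: the cutoff itself is excluded (strict comparison).
    return [(d, v) for d, v in rows if d > cutoff]

ts = [
    (date(2018, 4, 9), 1),
    (date(2018, 4, 11), 2),
    (date(2018, 4, 13), 3),
    (date(2018, 4, 15), 4),
]
selected = last_rows(ts, timedelta(days=3))
print(selected)
```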

le(other, axis='columns', level=None, fill_value=None)#

Get Less than or equal to of DataFrame or Series and other, element-wise (binary operator le).

Equivalent to frame <= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.le(1)
        angles  degrees
circle       True    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.le(b)
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.le(b, fill_value=0)
a    True
b   False
c   False
d    True
e    <NA>
dtype: bool

property loc#

Select rows and columns by label or boolean mask.

Examples#

Series

>>> import cudf
>>> series = cudf.Series([10, 11, 12], index=['a', 'b', 'c'])
>>> series
a    10
b    11
c    12
dtype: int64
>>> series.loc['b']
11

DataFrame

DataFrame with string index.

>>> df
   a  b
a  0  5
b  1  6
c  2  7
d  3  8
e  4  9

Select a single row by label.

>>> df.loc['a']
a    0
b    5
Name: a, dtype: int64

Select multiple rows and a single column.

>>> df.loc[['a', 'c', 'e'], 'b']
a    5
c    7
e    9
Name: b, dtype: int64

Selection by boolean mask.

>>> df.loc[df.a > 2]
   a  b
d  3  8
e  4  9

Setting values using loc.

>>> df.loc[['a', 'c', 'e'], 'a'] = 0
>>> df
   a  b
a  0  5
b  1  6
c  0  7
d  3  8
e  0  9

lt(other, axis='columns', level=None, fill_value=None)#

Get Less than of DataFrame or Series and other, element-wise (binary operator lt).

Equivalent to frame < other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.lt(1)
        angles  degrees
circle       True    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.lt(b)
a   False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.lt(b, fill_value=0)
a   False
b   False
c   False
d    True
e    <NA>
dtype: bool

mask(cond, other=None, inplace: bool = False, axis=None, level=None) → Self | None#

Replace values where the condition is True.

Parameters#

condbool Series/DataFrame, array-like

Where cond is False, keep the original value. Where True, replace with corresponding value from other. Callables are not supported.

other: scalar, list of scalars, Series/DataFrame

Entries where cond is True are replaced with corresponding value from other. Callables are not supported. Default is None.

For a DataFrame, other must be a scalar, an array-like of scalars, or a DataFrame with the same dimensions as self.

For a Series, other must be a scalar or a Series-like object of the same length.

inplacebool, default False

Whether to perform the operation in place on the data.

Returns#

Same type as caller

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"A":[1, 4, 5], "B":[3, 5, 8]})
>>> df.mask(df % 2 == 0, [-1, -1])
   A  B
0  1  3
1 -1  5
2  5 -1

>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.mask(ser > 2, 10)
0    10
1    10
2     2
3     1
4     0
dtype: int64
>>> ser.mask(ser > 2)
0    <NA>
1    <NA>
2       2
3       1
4       0
dtype: int64
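
The element-wise rule behind mask can be sketched with lists. This is an illustration only: None stands in for <NA>, other is restricted to a scalar, and the helper name mask is hypothetical.

```python
def mask(values, cond, other=None):
    """Keep each value where cond is False; replace it with other where cond is True."""
    return [other if c else v for v, c in zip(values, cond)]

ser = [4, 3, 2, 1, 0]
cond = [v > 2 for v in ser]

replaced = mask(ser, cond, 10)  # scalar replacement
nulled = mask(ser, cond)        # default other=None yields missing values
print(replaced)
print(nulled)
```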

max(axis=0, skipna=True, numeric_only=False, **kwargs)#

Return the maximum of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values when computing the result.
numeric_only: bool, default False: If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.max()
a     4
b    10
dtype: int64

mean(axis=0, skipna=True, numeric_only=False, **kwargs)#

Return the mean of the values for the requested axis.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’}: Axis for the function to be applied on.
skipnabool, default True: Exclude NA/null values when computing the result.
numeric_onlybool, default False: If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.
**kwargs: Additional keyword arguments to be passed to the function.

Returns#

mean : Series or DataFrame (if level specified)

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.mean()
a    2.5
b    8.5
dtype: float64

median(axis=<no_default>, skipna=True, numeric_only=None, **kwargs)#

Return the median of the values for the requested axis.

Parameters#

axis{index (0), columns (1)}: Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
skipnabool, default True: Exclude NA/null values when computing the result.
numeric_onlybool, default False: If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.

Returns#

scalar

Examples#

>>> import cudf
>>> ser = cudf.Series([10, 25, 3, 25, 24, 6])
>>> ser
0    10
1    25
2     3
3    25
4    24
5     6
dtype: int64
>>> ser.median()
17.0

min(axis=0, skipna=True, numeric_only=False, **kwargs)#

Return the minimum of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values when computing the result.
numeric_only: bool, default False: If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> min_series = df.min()
>>> min_series
a    1
b    7
dtype: int64
>>> min_series.min()
1

mod(other, axis='columns', level=None, fill_value=None)#

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

Equivalent to frame % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.mod(1)
        angles  degrees
circle          0        0
triangle        0        0
rectangle       0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.mod(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.mod(b, fill_value=0)
a             0
b    4294967295
c    4294967295
d             0
e          <NA>
dtype: int64

mul(other, axis='columns', level=None, fill_value=None)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

Equivalent to frame * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.multiply(1)
        angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.multiply(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.multiply(b, fill_value=0)
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64

multiply(other, axis='columns', level=None, fill_value=None)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

Equivalent to frame * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.multiply(1)
        angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.multiply(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.multiply(b, fill_value=0)
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64

nans_to_nulls()#

Convert NaN values (if any) to nulls.

Returns#

DataFrame or Series

Examples#

Series

>>> import cudf, numpy as np
>>> series = cudf.Series([1, 2, np.nan, None, 10], nan_as_null=False)
>>> series
0     1.0
1     2.0
2     NaN
3    <NA>
4    10.0
dtype: float64
>>> series.nans_to_nulls()
0     1.0
1     2.0
2    <NA>
3    <NA>
4    10.0
dtype: float64

DataFrame

>>> df = cudf.DataFrame()
>>> df['a'] = cudf.Series([1, None, np.nan], nan_as_null=False)
>>> df['b'] = cudf.Series([None, 3.14, np.nan], nan_as_null=False)
>>> df
      a     b
0   1.0  <NA>
1  <NA>  3.14
2   NaN   NaN
>>> df.nans_to_nulls()
      a     b
0   1.0  <NA>
1  <NA>  3.14
2  <NA>  <NA>
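
The distinction between NaN (a floating-point value) and null (a missing entry) can be sketched in plain Python. Here None stands in for <NA>; the helper name nans_to_nulls mirrors the method but is only an illustration.

```python
import math

def nans_to_nulls(values):
    """Replace floating-point NaN entries with None; leave everything else unchanged."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in values]

series = [1.0, 2.0, float("nan"), None, 10.0]
converted = nans_to_nulls(series)
print(converted)
```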

ne(other, axis='columns', level=None, fill_value=None)#

Get Not equal to of DataFrame or Series and other, element-wise (binary operator ne).

Equivalent to frame != other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.ne(1)
        angles  degrees
circle       True     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.ne(b)
a    False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.ne(b, fill_value=0)
a   False
b    True
c    True
d    True
e    <NA>
dtype: bool

notna()#

Identify non-missing values.

Return a boolean same-sized object indicating if the values are not <NA>. Non-missing values get mapped to True. <NA> values get mapped to False values. <NA> values include:

Values where null mask is set.
NaN in float dtype.
NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index: Mask of bool values for each element in the object that indicates whether an element is not an NA value.

Examples#

Show which entries in a DataFrame are not NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
3     True
4     True
dtype: bool

Show which entries in an Index are not NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.notna()
array([ True,  True, False, False,  True,  True])

notnull()#

Identify non-missing values.

Return a boolean same-sized object indicating if the values are not <NA>. Non-missing values get mapped to True. <NA> values get mapped to False values. <NA> values include:

Values where null mask is set.
NaN in float dtype.
NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index: Mask of bool values for each element in the object that indicates whether an element is not an NA value.

Examples#

Show which entries in a DataFrame are not NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
3     True
4     True
dtype: bool

Show which entries in an Index are not NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.notna()
array([ True,  True, False, False,  True,  True])

pad(value=None, axis=None, inplace=None, limit=None)#: Synonym for Series.fillna() with method='ffill'.

Deprecated since version 23.06: Use DataFrame.ffill/Series.ffill instead.

Returns#

Object with missing values filled or None if inplace=True.

pipe(func, *args, **kwargs)#

Apply func(self, *args, **kwargs).

Parameters#

funcfunction: Function to apply to the Series/DataFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple, where data_keyword is a string naming the keyword argument of callable that expects the Series/DataFrame.
argsiterable, optional: Positional arguments passed into func.
kwargsmapping, optional: A dictionary of keyword arguments passed into func.

Returns#

object : the return type of func.

Examples#

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> func(g(h(df), arg1=a), arg2=b, arg3=c)

You can write

>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:

>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
...  )
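
The dispatch rule for pipe, including the (callable, data_keyword) tuple form, can be sketched as a free function. This is a minimal illustration rather than the library's code; the helpers scale and shift, and the ValueError raised on a keyword clash, are assumptions made for the example.

```python
def pipe(obj, func, *args, **kwargs):
    """Call func with obj as the first argument, or under data_keyword for a tuple."""
    if isinstance(func, tuple):
        func, data_keyword = func
        if data_keyword in kwargs:
            # Assumed behavior: refuse to overwrite an explicitly passed keyword.
            raise ValueError(f"{data_keyword} is both the pipe target and a keyword argument")
        kwargs[data_keyword] = obj
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)

def scale(values, factor):
    # Takes the data as its first argument.
    return [v * factor for v in values]

def shift(amount, data):
    # Takes the data as its second argument, so it needs the tuple form.
    return [v + amount for v in data]

scaled = pipe([1, 2, 3], scale, factor=10)
result = pipe(scaled, (shift, "data"), amount=1)
print(result)
```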

pow(other, axis='columns', level=None, fill_value=None)#

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

Equivalent to frame ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.pow(1)
        angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.pow(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.pow(b, fill_value=0)
a       1
b       1
c       1
d       0
e    <NA>
dtype: int64

prod(axis=<no_default>, skipna=True, dtype=None, numeric_only=False, min_count=0, **kwargs)#

Return product of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

dtype: data type

Data type to cast the result to.

numeric_onlybool, default False

If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.

min_count: int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

The default is 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.product()
a      24
b    5040
dtype: int64
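
The interaction of skipna and min_count can be sketched for a single column in plain Python. The sketch is illustrative only: None stands in for <NA> both as input and as the missing result, and the helper name prod mirrors the method without reflecting its implementation.

```python
import math

def prod(values, skipna=True, min_count=0):
    """Product of the non-missing values; None if too few valid values are present."""
    valid = [v for v in values if v is not None]
    if not skipna and len(valid) < len(values):
        return None  # a missing value propagates when it is not skipped
    if len(valid) < min_count:
        return None  # fewer than min_count valid values
    return math.prod(valid)  # the empty product is 1

print(prod([1, 2, 3, 4]))
print(prod([None, None]))
print(prod([None, None], min_count=1))
```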

product(axis=<no_default>, skipna=True, dtype=None, numeric_only=False, min_count=0, **kwargs)#

Return product of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

dtype: data type

Data type to cast the result to.

numeric_onlybool, default False

If True, include only float, int, and boolean columns. If False, raise an error if there are non-numeric columns.

min_count: int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

The default is 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.product()
a      24
b    5040
dtype: int64

radd(other, axis='columns', level=None, fill_value=None)#

Get addition of DataFrame or Series and other, element-wise (binary operator radd).

Equivalent to other + frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.radd(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.radd(b)
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.radd(b, fill_value=0)
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
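
The alignment and fill_value rules can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper binop_with_fill is hypothetical, Series are modeled as dicts from label to value, and None stands in for both a null entry and a label absent from one input.

```python
import operator


def binop_with_fill(left, right, op, fill_value=None):
    """Hypothetical helper: align two label->value dicts and apply op."""
    result = {}
    for label in sorted(set(left) | set(right)):
        lhs, rhs = left.get(label), right.get(label)
        if lhs is None and rhs is None:
            result[label] = None  # missing on both sides stays missing
            continue
        if fill_value is not None:
            lhs = fill_value if lhs is None else lhs
            rhs = fill_value if rhs is None else rhs
        result[label] = None if lhs is None or rhs is None else op(lhs, rhs)
    return result


a = {'a': 1, 'b': 1, 'c': 1, 'd': None}
b = {'a': 1, 'b': None, 'd': 1, 'e': None}

# a.radd(b) computes b + a, so b is passed as the left operand.
print(binop_with_fill(b, a, operator.add))
print(binop_with_fill(b, a, operator.add, fill_value=0))
```

Passing a different function from the operator module (for example operator.sub or operator.truediv) models the other reflected operators in the same way, with other always on the left.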

rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False)#

Compute numerical data ranks (1 through n) along axis.

By default, equal values are assigned a rank that is the average of the ranks of those values.

Parameters#

axis: {0 or ‘index’}, default 0: Index to direct ranking.
method: {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’: How to rank the group of records that have the same value (i.e. ties):
- average: average rank of the group
- min: lowest rank in the group
- max: highest rank in the group
- first: ranks assigned in order they appear in the array
- dense: like ‘min’, but rank always increases by 1 between groups.
numeric_only: bool, default False: For DataFrame objects, rank only numeric columns if set to True.
na_option: {‘keep’, ‘top’, ‘bottom’}, default ‘keep’: How to rank NaN values:
- keep: assign NaN rank to NaN values
- top: assign smallest rank to NaN values if ascending
- bottom: assign highest rank to NaN values if ascending.
ascending: bool, default True: Whether or not the elements should be ranked in ascending order.
pct: bool, default False: Whether or not to display the returned rankings in percentile form.

Returns#

same type as caller: Return a Series or DataFrame with data ranks as values.
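
The tie-breaking methods listed above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper rank_values is hypothetical, it ranks a single list in ascending order, and it does not model null handling (na_option) or pct.

```python
def rank_values(values, method="average"):
    """Hypothetical helper: rank a list of non-null values, ascending."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0.0] * len(values)
    dense = 0
    pos = 0
    while pos < len(order):
        # Find the end of the current group of tied values.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        dense += 1
        for offset, idx in enumerate(order[pos:end + 1]):
            if method == "average":
                ranks[idx] = (pos + 1 + end + 1) / 2
            elif method == "min":
                ranks[idx] = float(pos + 1)
            elif method == "max":
                ranks[idx] = float(end + 1)
            elif method == "first":
                ranks[idx] = float(pos + 1 + offset)
            elif method == "dense":
                ranks[idx] = float(dense)
            else:
                raise ValueError(f"unknown method: {method}")
        pos = end + 1
    return ranks


data = [10, 20, 10, 30]
for method in ("average", "min", "max", "first", "dense"):
    print(method, rank_values(data, method))
```

The two tied values of 10 occupy positions 1 and 2, so they receive 1.5 under average, 1 under min, 2 under max, and 1 and 2 in order of appearance under first.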

rdiv(other, axis='columns', level=None, fill_value=None)#

Get floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

Equivalent to other / frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.rtruediv(1)
             angles   degrees
circle          inf  0.002778
triangle   0.333333  0.005556
rectangle  0.250000  0.002778

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.rtruediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.rtruediv(b, fill_value=0)
a     1.0
b     0.0
c     0.0
d     Inf
e    <NA>
dtype: float64

repeat(repeats, axis=None)#

Repeats elements consecutively.

Returns a new object of the caller's type (DataFrame/Series) where each element of the current object is repeated consecutively a given number of times.

Parameters#

repeats: int, or array of ints: The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty object.

Returns#

Series/DataFrame: A newly created object of same type as caller with repeated elements.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
>>> df
   a   b
0  1  10
1  2  20
2  3  30
>>> df.repeat(3)
   a   b
0  1  10
0  1  10
0  1  10
1  2  20
1  2  20
1  2  20
2  3  30
2  3  30
2  3  30

Repeat on Series

>>> s = cudf.Series([0, 2])
>>> s
0    0
1    2
dtype: int64
>>> s.repeat([3, 4])
0    0
0    0
0    0
1    2
1    2
1    2
1    2
dtype: int64
>>> s.repeat(2)
0    0
0    0
1    2
1    2
dtype: int64

replace(to_replace=None, value=<no_default>, inplace=False, limit=None, regex=False, method=<no_default>)#

Replace values given in to_replace with value.

Parameters#

to_replace: numeric, str or list-like

Value(s) to replace.

numeric or str:
- Values equal to to_replace will be replaced with value.
list of numeric or str:
- If value is also list-like, to_replace and value must be of the same length.
dict:
- Dicts can be used to specify different replacement values for different existing values. For example, {‘a’: ‘b’, ‘y’: ‘z’} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the value parameter should be None.

value: scalar, dict, list-like, str, default None

Value to replace any values matching to_replace with.

inplace: bool, default False

If True, perform the operation in place.

Raises#

TypeError

If to_replace is not a scalar, array-like, dict, or None
If to_replace is a dict and value is not a list, dict, or Series

ValueError

If a list is passed to to_replace and value but they are not the same length.

Returns#

result: Series: Series after replacement. The mask and index are preserved.

Examples#

Series

Scalar to_replace and value

>>> import cudf
>>> s = cudf.Series([0, 1, 2, 3, 4])
>>> s
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

List-like to_replace

>>> s.replace([1, 2], 10)
0     0
1    10
2    10
3     3
4     4
dtype: int64

dict-like to_replace

>>> s.replace({1:5, 3:50})
0     0
1     5
2     2
3    50
4     4
dtype: int64
>>> s = cudf.Series(['b', 'a', 'a', 'b', 'a'])
>>> s
0     b
1     a
2     a
3     b
4     a
dtype: object
>>> s.replace({'a': None})
0       b
1    <NA>
2    <NA>
3       b
4    <NA>
dtype: object

If the types of the values in to_replace and value do not match the dtype of the series, cudf behaves differently from pandas: the mismatched pairs are silently ignored:

>>> s = cudf.Series(['b', 'a', 'a', 'b', 'a'])
>>> s
0    b
1    a
2    a
3    b
4    a
dtype: object
>>> s.replace('a', 1)
0    b
1    a
2    a
3    b
4    a
dtype: object
>>> s.replace(['a', 'c'], [1, 2])
0    b
1    a
2    a
3    b
4    a
dtype: object

DataFrame

Scalar to_replace and value

>>> import cudf
>>> df = cudf.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like to_replace

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

dict-like to_replace

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
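
How the scalar, list-like, and dict forms of to_replace map onto replacements can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper replace_values is hypothetical, it works on a flat list rather than a Series, and it does not model the column-keyed dict form or the silent handling of type mismatches.

```python
def replace_values(data, to_replace, value=None):
    """Hypothetical helper: apply replace-style rules to a flat list."""
    if isinstance(to_replace, dict):
        mapping = dict(to_replace)  # old value -> new value
    elif isinstance(to_replace, (list, tuple)):
        if isinstance(value, (list, tuple)):
            if len(value) != len(to_replace):
                raise ValueError("to_replace and value must have the same length")
            mapping = dict(zip(to_replace, value))  # pairwise replacement
        else:
            mapping = {old: value for old in to_replace}  # one value for all
    else:
        mapping = {to_replace: value}  # single scalar replacement
    # Each element is looked up once, so replacements do not chain.
    return [mapping.get(x, x) for x in data]


s = [0, 1, 2, 3, 4]
print(replace_values(s, 0, 5))
print(replace_values(s, [1, 2], 10))
print(replace_values(s, {1: 5, 3: 50}))
print(replace_values(s, [0, 1, 2, 3], [4, 3, 2, 1]))
```

Because each element is looked up only once, the last call maps 0 to 4 without then re-mapping the original 4, which matches the list-like DataFrame example above.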

resample(rule, axis=0, closed: Literal['right', 'left'] | None = None, label: Literal['right', 'left'] | None = None, convention: Literal['start', 'end', 's', 'e'] = 'start', kind=None, on=None, level=None, origin='start_day', offset=None, group_keys: bool = False)#

Convert the frequency of (“resample”) the given time series data.

Parameters#

rule: str: The offset string representing the frequency to use. Note that DateOffset objects are not yet supported.
closed: {“right”, “left”}, default None: Which side of bin interval is closed. The default is “left” for all frequency offsets except for “M” and “W”, which have a default of “right”.
label: {“right”, “left”}, default None: Which bin edge label to label bucket with. The default is “left” for all frequency offsets except for “M” and “W”, which have a default of “right”.
on: str, optional: For a DataFrame, column to use instead of the index for resampling. The column must be datetime-like.
level: str or int, optional: For a MultiIndex, level to use instead of the index for resampling. The level must be datetime-like.

Returns#

A Resampler object

Examples#

First, we create a time series with 1 minute intervals:

>>> index = cudf.date_range(start="2001-01-01", periods=10, freq="1T")
>>> sr = cudf.Series(range(10), index=index)
>>> sr
2001-01-01 00:00:00    0
2001-01-01 00:01:00    1
2001-01-01 00:02:00    2
2001-01-01 00:03:00    3
2001-01-01 00:04:00    4
2001-01-01 00:05:00    5
2001-01-01 00:06:00    6
2001-01-01 00:07:00    7
2001-01-01 00:08:00    8
2001-01-01 00:09:00    9
dtype: int64

Downsampling to 3 minute intervals, followed by a “sum” aggregation:

>>> sr.resample("3T").sum()
2001-01-01 00:00:00     3
2001-01-01 00:03:00    12
2001-01-01 00:06:00    21
2001-01-01 00:09:00     9
dtype: int64

Use the right side of each interval to label the bins:

>>> sr.resample("3T", label="right").sum()
2001-01-01 00:03:00     3
2001-01-01 00:06:00    12
2001-01-01 00:09:00    21
2001-01-01 00:12:00     9
dtype: int64

Close the right side of the interval instead of the left:

>>> sr.resample("3T", closed="right").sum()
2000-12-31 23:57:00     0
2001-01-01 00:00:00     6
2001-01-01 00:03:00    15
2001-01-01 00:06:00    24
dtype: int64

Upsampling to 30 second intervals:

>>> sr.resample("30s").asfreq()[:5]  # show the first 5 rows
2001-01-01 00:00:00       0
2001-01-01 00:00:30    <NA>
2001-01-01 00:01:00       1
2001-01-01 00:01:30    <NA>
2001-01-01 00:02:00       2
dtype: int64

Upsample and fill nulls using the “bfill” method:

>>> sr.resample("30s").bfill()[:5]
2001-01-01 00:00:00    0
2001-01-01 00:00:30    1
2001-01-01 00:01:00    1
2001-01-01 00:01:30    2
2001-01-01 00:02:00    2
dtype: int64

Resampling by a specified column of a Dataframe:

>>> df = cudf.DataFrame({
...     "price": [10, 11, 9, 13, 14, 18, 17, 19],
...     "volume": [50, 60, 40, 100, 50, 100, 40, 50],
...     "week_starting": cudf.date_range(
...         "2018-01-01", periods=8, freq="7D"
...     )
... })
>>> df
   price  volume week_starting
0     10      50    2018-01-01
1     11      60    2018-01-08
2      9      40    2018-01-15
3     13     100    2018-01-22
4     14      50    2018-01-29
5     18     100    2018-02-05
6     17      40    2018-02-12
7     19      50    2018-02-19
>>> df.resample("M", on="week_starting").mean()
               price     volume
week_starting
2018-01-31      11.4  60.000000
2018-02-28      18.0  63.333333
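
The effect of closed and label on bin assignment can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper resample_sum is hypothetical, bins are anchored at midnight of the first timestamp (mirroring origin='start_day'), and empty bins are simply omitted.

```python
from datetime import datetime, timedelta


def resample_sum(times, values, width, closed="left", label="left"):
    """Hypothetical helper: sum values into fixed-width time bins."""
    origin = times[0].replace(hour=0, minute=0, second=0, microsecond=0)
    w = int(width.total_seconds())
    bins = {}
    for t, v in zip(times, values):
        offset = int((t - origin).total_seconds())
        if closed == "left":
            k = offset // w            # bin covers [k*w, (k+1)*w)
        else:
            k = -(-offset // w) - 1    # bin covers (k*w, (k+1)*w]
        edge = k if label == "left" else k + 1
        key = origin + timedelta(seconds=edge * w)
        bins[key] = bins.get(key, 0) + v
    return dict(sorted(bins.items()))


times = [datetime(2001, 1, 1) + timedelta(minutes=i) for i in range(10)]
values = list(range(10))
three_min = timedelta(minutes=3)

left_closed = resample_sum(times, values, three_min)
right_labeled = resample_sum(times, values, three_min, label="right")
right_closed = resample_sum(times, values, three_min, closed="right")
print(list(left_closed.values()))   # [3, 12, 21, 9]
print(list(right_closed.values()))  # [0, 6, 15, 24]
```

With closed="right", the timestamp 00:00:00 falls on the right edge of the previous bin, which is why that result starts at 2000-12-31 23:57:00.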

rfloordiv(other, axis='columns', level=None, fill_value=None)#

Get integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

Equivalent to other // frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.rfloordiv(1)
                        angles  degrees
circle     9223372036854775807        0
triangle                     0        0
rectangle                    0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.rfloordiv(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rfloordiv(b, fill_value=0)
a                      1
b                      0
c                      0
d    9223372036854775807
e                   <NA>
dtype: int64

rmod(other, axis='columns', level=None, fill_value=None)#

Get modulo of DataFrame or Series and other, element-wise (binary operator rmod).

Equivalent to other % frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.rmod(1)
               angles  degrees
circle     4294967295        1
triangle            1        1
rectangle           1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.rmod(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmod(b, fill_value=0)
a             0
b             0
c             0
d    4294967295
e          <NA>
dtype: int64

rmul(other, axis='columns', level=None, fill_value=None)#

Get multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

Equivalent to other * frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.rmul(1)
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.rmul(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmul(b, fill_value=0)
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64

rolling(window, min_periods=None, center: bool = False, win_type: str | None = None, on=None, axis=0, closed: str | None = None, step: int | None = None, method: str = 'single')#

Rolling window calculations.

Parameters#

window: int, offset or a BaseIndexer subclass: Size of the window, i.e., the number of observations used to calculate the statistic. For datetime indexes, an offset can be provided instead of an int. The offset must be convertible to a timedelta. As opposed to a fixed window size, each window will be sized to accommodate observations within the time period specified by the offset. If a BaseIndexer subclass is passed, calculates the window boundaries based on the defined get_window_bounds method.
min_periods: int, optional: The minimum number of observations in the window that are required to be non-null, so that the result is non-null. If not provided or None, min_periods is equal to the window size.
center: bool, optional: If True, the result is set at the center of the window. If False (default), the result is set at the right edge of the window.

Returns#

Rolling object.

Examples#

>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4])

Rolling sum with window size 2.

>>> print(a.rolling(2).sum())
0
1    3
2    5
3
4
dtype: int64

Rolling sum with window size 2 and min_periods 1.

>>> print(a.rolling(2, min_periods=1).sum())
0    1
1    3
2    5
3    3
4    4
dtype: int64

Rolling count with window size 3.

>>> print(a.rolling(3).count())
0    1
1    2
2    3
3    2
4    2
dtype: int64

Rolling count with window size 3, but with the result set at the center of the window.

>>> print(a.rolling(3, center=True).count())
0    2
1    3
2    2
3    2
4    1
dtype: int64

Rolling max with variable window size specified by an offset; only valid for datetime index.

>>> import numpy as np
>>> import pandas as pd
>>> a = cudf.Series(
...     [1, 9, 5, 4, np.nan, 1],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ]
... )

>>> print(a.rolling('2s').max())
2019-01-01T09:00:00.000    1
2019-01-01T09:00:01.000    9
2019-01-01T09:00:02.000    9
2019-01-01T09:00:04.000    4
2019-01-01T09:00:07.000
2019-01-01T09:00:08.000    1
dtype: int64

Apply custom function on the window with the apply method

>>> import numpy as np
>>> import math
>>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
>>> def some_func(A):
...     b = 0
...     for a in A:
...         b = b + math.sqrt(a)
...     return b
...
>>> print(b.rolling(3, min_periods=1).apply(some_func))
0     4.0
1     9.0
2    15.0
3    18.0
4    21.0
5    24.0
dtype: float64

And this also works for window rolling set by an offset

>>> import pandas as pd
>>> c = cudf.Series(
...     [16, 25, 36, 49, 64, 81],
...     index=[
...          pd.Timestamp('20190101 09:00:00'),
...          pd.Timestamp('20190101 09:00:01'),
...          pd.Timestamp('20190101 09:00:02'),
...          pd.Timestamp('20190101 09:00:04'),
...          pd.Timestamp('20190101 09:00:07'),
...          pd.Timestamp('20190101 09:00:08')
...      ],
...     dtype=np.float64
... )
>>> print(c.rolling('2s').apply(some_func))
2019-01-01T09:00:00.000     4.0
2019-01-01T09:00:01.000     9.0
2019-01-01T09:00:02.000    11.0
2019-01-01T09:00:04.000     7.0
2019-01-01T09:00:07.000     8.0
2019-01-01T09:00:08.000    17.0
dtype: float64
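
The roles of window, min_periods, and center for fixed-size windows can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper rolling_apply is hypothetical, None stands in for null, and count is modeled with min_periods=0, an assumption chosen to reproduce the outputs above.

```python
def rolling_apply(values, window, func, min_periods=None, center=False):
    """Hypothetical helper: apply func to the non-null values of each window."""
    if min_periods is None:
        min_periods = window  # default: the window must be fully non-null
    ahead = (window - 1) // 2 if center else 0  # elements taken after position i
    out = []
    for i in range(len(values)):
        stop = i + ahead + 1
        start = stop - window
        chunk = values[max(start, 0):min(stop, len(values))]
        valid = [v for v in chunk if v is not None]
        out.append(func(valid) if len(valid) >= min_periods else None)
    return out


a = [1, 2, 3, None, 4]
print(rolling_apply(a, 2, sum))                               # [None, 3, 5, None, None]
print(rolling_apply(a, 2, sum, min_periods=1))                # [1, 3, 5, 3, 4]
print(rolling_apply(a, 3, len, min_periods=0))                # [1, 2, 3, 2, 2]
print(rolling_apply(a, 3, len, min_periods=0, center=True))   # [2, 3, 2, 2, 1]
```

Windows near the edges are truncated, so with the default min_periods the first window - 1 results are null.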

round(decimals=0, how='half_even')#

Round to a variable number of decimal places.

Parameters#

decimals: int, dict, Series: Number of decimal places to round each column to. This parameter must be an int for a Series. For a DataFrame, a dict or a Series are also valid inputs. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
how: str, optional: Type of rounding. Can be either “half_even” (default) or “half_up” rounding.

Returns#

Series or DataFrame: A Series or DataFrame with the affected columns rounded to the specified number of decimal places.

Examples#

Series

>>> s = cudf.Series([0.1, 1.4, 2.9])
>>> s.round()
0    0.0
1    1.0
2    3.0
dtype: float64

DataFrame

>>> df = cudf.DataFrame(
...     [(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
...     columns=['dogs', 'cats'],
... )
>>> df
   dogs  cats
0  0.21  0.32
1  0.01  0.67
2  0.66  0.03
3  0.21  0.18

By providing an integer each column is rounded to the same number of decimal places.

>>> df.round(1)
   dogs  cats
0   0.2   0.3
1   0.0   0.7
2   0.7   0.0
3   0.2   0.2

With a dict, the number of places for specific columns can be specified with the column names as keys and the number of decimal places as values.

>>> df.round({'dogs': 1, 'cats': 0})
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

Using a Series, the number of places for specific columns can be specified with the column names as the index and the number of decimal places as the values.

>>> decimals = cudf.Series([0, 1], index=['cats', 'dogs'])
>>> df.round(decimals)
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0
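
The difference between the two how modes only shows up on exact ties. A minimal sketch using Python's decimal module follows. It is an illustration only, not the library's implementation: the helper round_decimal is hypothetical, it assumes non-negative decimals, and it rounds the decimal string form of the input, whereas GPU rounding operates on binary floats and may differ for values that are not exactly representable. For negative inputs, ROUND_HALF_UP here rounds ties away from zero; how the library treats that case is not stated above.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_decimal(x, decimals=0, how="half_even"):
    """Hypothetical helper: round x to a non-negative number of decimals."""
    mode = {"half_even": ROUND_HALF_EVEN, "half_up": ROUND_HALF_UP}[how]
    quantum = Decimal(10) ** -decimals  # e.g. Decimal('0.01') for decimals=2
    return float(Decimal(str(x)).quantize(quantum, rounding=mode))


print(round_decimal(2.5))                     # 2.0 (tie goes to the even digit)
print(round_decimal(2.5, how="half_up"))      # 3.0
print(round_decimal(0.125, 2))                # 0.12
print(round_decimal(0.125, 2, how="half_up")) # 0.13
print(round_decimal(0.67, 0))                 # 1.0 (not a tie, both modes agree)
```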

rpow(other, axis='columns', level=None, fill_value=None)#

Get exponential of DataFrame or Series and other, element-wise (binary operator rpow).

Equivalent to other ** frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.rpow(1)
           angles  degrees
circle          1        1
triangle        1        1
rectangle       1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.rpow(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rpow(b, fill_value=0)
a       1
b       0
c       0
d       1
e    <NA>
dtype: int64

rsub(other, axis='columns', level=None, fill_value=None)#

Get subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

Equivalent to other - frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.rsub(1)
           angles  degrees
circle          1     -359
triangle       -2     -179
rectangle      -3     -359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.rsub(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rsub(b, fill_value=0)
a       0
b      -1
c      -1
d       1
e    <NA>
dtype: int64

rtruediv(other, axis='columns', level=None, fill_value=None)#

Get floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

Equivalent to other / frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

other: scalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axis: int or string: Only 0 is supported for Series; 1 or columns is supported for DataFrame.
level: int or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_value: float or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing, the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.rtruediv(1)
             angles   degrees
circle          inf  0.002778
triangle   0.333333  0.005556
rectangle  0.250000  0.002778

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.rtruediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.rtruediv(b, fill_value=0)
a     1.0
b     0.0
c     0.0
d     Inf
e    <NA>
dtype: float64

sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)#

Return a random sample of items from an axis of object.

If reproducible results are required, a random number generator may be provided via the random_state parameter. This function will always produce the same sample given an identical random_state.

Parameters#

n: int, optional: Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
frac: float, optional: Fraction of axis items to return. Cannot be used with n.
replace: bool, default False: Allow or disallow sampling of the same row more than once. replace == True is not supported for axis = 1/"columns". replace == False is not supported for axis = 0/"index" when weights is specified and random_state is None or a cupy random state.
weights: ndarray-like, optional: Default None for uniform probability distribution over rows to sample from. If an ndarray is passed, the length of weights should equal the number of rows to sample from, and the weights will be normalized to have a sum of 1. Unlike pandas, index alignment is currently not performed.
random_state: int, numpy/cupy RandomState, or None, default None: If None, default cupy random state is chosen. If int, the seed for the default cupy random state. If RandomState, rows-to-sample are generated from the RandomState.
axis: {0 or index, 1 or columns, None}, default None: Axis to sample. Accepts axis number or name. Default is stat axis for given data type (0 for Series and DataFrames). Series doesn’t support axis=1.
ignore_index: bool, default False: If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns#

Series or DataFrame: A new object of same type as caller containing n items randomly sampled from the caller object.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"a": [1, 2, 3, 4, 5]})
>>> df.sample(3)
   a
1  2
3  4
0  1

>>> sr = cudf.Series([1, 2, 3, 4, 5])
>>> sr.sample(10, replace=True)
3    4
0    1
3    4
4    5
0    1
4    5
0    1
1    2
2    3
1    2
dtype: int64

>>> df = cudf.DataFrame(
...     {"a": [1, 2], "b": [2, 3], "c": [3, 4], "d": [4, 5]}
... )
>>> df.sample(2, axis=1)
   a  c
0  1  3
1  2  4

scale()#

Scale values to [0, 1] in float64.

Returns#

DataFrame or Series: Values scaled to [0, 1].

Examples#

>>> import cudf
>>> series = cudf.Series([10, 11, 12, 0.5, 1])
>>> series
0    10.0
1    11.0
2    12.0
3     0.5
4     1.0
dtype: float64
>>> series.scale()
0    0.826087
1    0.913043
2    1.000000
3    0.000000
4    0.043478
dtype: float64
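
The output above is consistent with min-max normalization, (x - min) / (max - min). A minimal sketch in plain Python follows. It is an illustration only, not the library's implementation: the helper min_max_scale is hypothetical, and it does not handle nulls or a constant input (which would divide by zero).

```python
def min_max_scale(values):
    """Hypothetical helper: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 11, 12, 0.5, 1])
print([round(v, 6) for v in scaled])
# [0.826087, 0.913043, 1.0, 0.0, 0.043478]
```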

searchsorted(values, side: Literal['left', 'right'] = 'left', sorter=None, ascending: bool = True, na_position: Literal['first', 'last'] = 'last') → ScalarLike | cupy.ndarray#

Find indices where elements should be inserted to maintain order.

Parameters#

values: Frame (shape must be consistent with self): Values to be hypothetically inserted into self.
side: str {‘left’, ‘right’}, optional, default ‘left’: If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index.
sorter: 1-D array-like, optional: Optional array of integer indices that sort self into ascending order. They are typically the result of np.argsort. Currently not supported.
ascending: bool, optional, default True: Sorted Frame is in ascending order (otherwise descending).
na_position: str {‘last’, ‘first’}, optional, default ‘last’: Position of null values in sorted order.

Returns#

1-D cupy array of insertion points

Examples#

>>> s = cudf.Series([1, 2, 3])
>>> s.searchsorted(4)
3
>>> s.searchsorted([0, 4])
array([0, 3], dtype=int32)
>>> s.searchsorted([1, 3], side='left')
array([0, 2], dtype=int32)
>>> s.searchsorted([1, 3], side='right')
array([1, 3], dtype=int32)

If the values are not monotonically sorted, wrong locations may be returned:

>>> s = cudf.Series([2, 1, 3])
>>> s.searchsorted(1)
0   # wrong result, correct would be 1

>>> df = cudf.DataFrame({'a': [1, 3, 5, 7], 'b': [10, 12, 14, 16]})
>>> df
   a   b
0  1  10
1  3  12
2  5  14
3  7  16
>>> values_df = cudf.DataFrame({'a': [0, 2, 5, 6],
... 'b': [10, 11, 13, 15]})
>>> values_df
   a   b
0  0  10
1  2  11
2  5  13
3  6  15
>>> df.searchsorted(values_df, ascending=False)
array([4, 4, 4, 0], dtype=int32)
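
The meaning of side for a single ascending column can be sketched with Python's bisect module. This is an illustration only, not the library's implementation: the helper searchsorted_list is hypothetical and does not model the ascending, na_position, or multi-column behavior.

```python
from bisect import bisect_left, bisect_right


def searchsorted_list(sorted_values, values, side="left"):
    """Hypothetical helper: insertion points into an ascending list."""
    find = bisect_left if side == "left" else bisect_right
    return [find(sorted_values, v) for v in values]


s = [1, 2, 3]
print(searchsorted_list(s, [0, 4]))                # [0, 3]
print(searchsorted_list(s, [1, 3], side="left"))   # [0, 2]
print(searchsorted_list(s, [1, 3], side="right"))  # [1, 3]

# Binary search assumes sorted input; unsorted data gives a misleading answer.
print(searchsorted_list([2, 1, 3], [1]))           # [0]
```

For a value already present, side='left' returns the position before the existing entries and side='right' the position after them.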

shift(periods=1, freq=None, axis=0, fill_value=None, suffix: str | None = None)#

Shift values by periods positions.

property size: int#

Return the number of elements in the underlying data.

Returns#

size : Size of the DataFrame / Index / Series / MultiIndex

Examples#

Size of an empty dataframe is 0.

>>> import cudf
>>> df = cudf.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df.size
0
>>> df = cudf.DataFrame(index=[1, 2, 3])
>>> df
Empty DataFrame
Columns: []
Index: [1, 2, 3]
>>> df.size
0

DataFrame with values

>>> df = cudf.DataFrame({'a': [10, 11, 12],
...         'b': ['hello', 'rapids', 'ai']})
>>> df
    a       b
0  10   hello
1  11  rapids
2  12      ai
>>> df.size
6
>>> df.index
RangeIndex(start=0, stop=3)
>>> df.index.size
3

Size of an Index

>>> index = cudf.Index([])
>>> index
Index([], dtype='float64')
>>> index.size
0
>>> index = cudf.Index([1, 2, 3, 10])
>>> index
Index([1, 2, 3, 10], dtype='int64')
>>> index.size
4

Size of a MultiIndex

>>> midx = cudf.MultiIndex(
...                 levels=[["a", "b", "c", None], ["1", None, "5"]],
...                 codes=[[0, 0, 1, 2, 3], [0, 2, 1, 1, 0]],
...                 names=["x", "y"],
...             )
>>> midx
MultiIndex([( 'a',  '1'),
            ( 'a',  '5'),
            ( 'b', <NA>),
            ( 'c', <NA>),
            (<NA>,  '1')],
           names=['x', 'y'])
>>> midx.size
5

skew(axis=0, skipna=True, numeric_only=False, **kwargs)#

Return unbiased Fisher-Pearson skew of a sample.

Parameters#

skipna: bool, default True: Exclude NA/null values when computing the result.
numeric_only: bool, default False: If True, include only float, int, and boolean columns. If False, raise an error if any column is non-numeric.

Returns#

Series

Examples#

Series

>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4, 5, 6, 6])
>>> series
0    1
1    2
2    3
3    4
4    5
5    6
6    6
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 8, 10, 10]})
>>> df.skew()
a    0.00000
b   -0.37037
dtype: float64
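
For reference, the adjusted Fisher-Pearson coefficient can be computed by hand. The sketch below is a plain-Python illustration of the standard formula, assuming no missing values; it is not the library's implementation, and `sample_skew` is a hypothetical helper name.

```python
import math


def sample_skew(values):
    """Adjusted Fisher-Pearson skewness G1 (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # biased 2nd central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # biased 3rd central moment
    g1 = m3 / m2 ** 1.5                            # biased skewness
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)   # bias correction


skew_a = sample_skew([3, 2, 3, 4])
skew_b = sample_skew([7, 8, 10, 10])
print(round(skew_a, 5), round(skew_b, 5))
```

Column `a` is symmetric around its mean, so its skew is zero; column `b` has a longer tail toward small values, so its skew is negative.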

sort_index(axis=0, level=None, ascending=True, inplace=False, kind=None, na_position='last', sort_remaining=True, ignore_index=False, key=None)#

Sort object by labels (along an axis).

Parameters#

axis{0 or ‘index’, 1 or ‘columns’}, default 0: The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
levelint or level name or list of ints or list of level names: If not None, sort on values in specified index level(s). This is only useful in the case of MultiIndex.
ascendingbool, default True: Sort ascending vs. descending.
inplacebool, default False: If True, perform operation in-place.
kindsorting method such as quicksort and others: Not yet supported.
na_position{‘first’, ‘last’}, default ‘last’: Puts NaNs at the beginning if first; last puts NaNs at the end.
sort_remainingbool, default True: When sorting a multiindex on a subset of its levels, should entries be lexsorted by the remaining (non-specified) levels as well?
ignore_indexbool, default False: If True, the index is replaced with a RangeIndex.
keycallable, optional: If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.

Returns#

Frame or None

Examples#

Series

>>> import cudf
>>> series = cudf.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
>>> series
3    a
2    b
1    c
4    d
dtype: object
>>> series.sort_index()
1    c
2    b
3    a
4    d
dtype: object

Sort Descending

>>> series.sort_index(ascending=False)
4    d
3    a
2    b
1    c
dtype: object

DataFrame

>>> df = cudf.DataFrame(
... {"b":[3, 2, 1], "a":[2, 1, 3]}, index=[1, 3, 2])
>>> df.sort_index(axis=0)
   b  a
1  3  2
2  1  3
3  2  1
>>> df.sort_index(axis=1)
   a  b
1  2  3
3  1  2
2  3  1

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)#

Sort by the values along either axis.

Parameters#

bystr or list of str: Name or list of names to sort by.
ascendingbool or list of bool, default True: Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
na_position{‘first’, ‘last’}, default ‘last’: ‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end
ignore_indexbool, default False: If True, the original index labels are discarded and the result is labeled 0, 1, …, n - 1.
keycallable, optional: Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently. Currently not supported.

Returns#

Frame : Frame with sorted values.

Examples#

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['a'] = [0, 1, 2]
>>> df['b'] = [-3, 2, 0]
>>> df.sort_values('b')
   a  b
0  0 -3
2  2  0
1  1  2
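
When `by` and `ascending` are lists, each column gets its own sort direction. A minimal host-side sketch of that rule follows, using Python's stable sort on a list of dict rows; `sort_rows` is a hypothetical helper and says nothing about how the GPU sort is implemented.

```python
def sort_rows(rows, by, ascending):
    """Sort a list of dict rows by several keys, each with its own direction."""
    ordered = list(rows)
    # Python's sort is stable, so apply the keys from last to first:
    # the first key in `by` ends up as the primary ordering.
    for name, asc in reversed(list(zip(by, ascending))):
        ordered.sort(key=lambda row: row[name], reverse=not asc)
    return ordered


rows = [{"a": 0, "b": -3}, {"a": 1, "b": 2}, {"a": 2, "b": 0}]
by_b = sort_rows(rows, by=["b"], ascending=[True])
print([r["a"] for r in by_b])

ties = [{"k": 1, "v": 1}, {"k": 1, "v": 2}, {"k": 0, "v": 3}]
mixed = sort_rows(ties, by=["k", "v"], ascending=[True, False])
print([(r["k"], r["v"]) for r in mixed])
```

The `ascending` list must have one entry per name in `by`, which is why the sketch zips the two together.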

squeeze(axis: Literal['index', 'columns', 0, 1, None] = None)#

Squeeze 1 dimensional axis objects into scalars.

Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’, None}, default None: A specific axis to squeeze. By default, all length-1 axes are squeezed. For Series this parameter is unused and defaults to None.

Returns#

DataFrame, Series, or scalar: The projection after squeezing axis or all the axes.

Examples#

>>> primes = cudf.Series([2, 3, 5, 7])

Slicing might produce a Series with a single value:

>>> even_primes = primes[primes % 2 == 0]
>>> even_primes
0    2
dtype: int64

>>> even_primes.squeeze()
2

Squeezing objects with more than one value in every axis does nothing:

>>> odd_primes = primes[primes % 2 == 1]
>>> odd_primes
1    3
2    5
3    7
dtype: int64

>>> odd_primes.squeeze()
1    3
2    5
3    7
dtype: int64

Squeezing is even more effective when used with DataFrames.

>>> df = cudf.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
>>> df
   a  b
0  1  2
1  3  4

Slicing a single column will produce a DataFrame with the columns having only one value:

>>> df_a = df[["a"]]
>>> df_a
   a
0  1
1  3

So the columns can be squeezed down, resulting in a Series:

>>> df_a.squeeze("columns")
0    1
1    3
Name: a, dtype: int64

Slicing a single row from a single column will produce a single scalar DataFrame:

>>> df_0a = df.loc[df.index < 1, ["a"]]
>>> df_0a
   a
0  1

Squeezing the rows produces a single scalar Series:

>>> df_0a.squeeze("index")
a    1
Name: 0, dtype: int64

Squeezing all axes will project directly into a scalar:

>>> df_0a.squeeze()
1
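
The squeezing rules with the default `axis=None` can be summarized in a few lines of plain Python. This is a sketch over nested lists, assuming a table is a non-empty list of equally long rows; `squeeze_table` is a hypothetical name, not part of the API.

```python
def squeeze_table(table):
    """Drop every length-1 axis of a table given as a list of rows."""
    n_rows = len(table)
    n_cols = len(table[0])
    if n_rows == 1 and n_cols == 1:
        return table[0][0]                 # single element -> scalar
    if n_cols == 1:
        return [row[0] for row in table]   # single column -> 1-D sequence
    if n_rows == 1:
        return list(table[0])              # single row -> 1-D sequence
    return table                           # nothing to squeeze


print(squeeze_table([[1]]))
print(squeeze_table([[1], [3]]))
print(squeeze_table([[1, 2], [3, 4]]))
```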

std(axis=<no_default>, skipna=True, ddof=1, numeric_only=False, **kwargs)#

Return sample standard deviation of the DataFrame.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof: int, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only: bool, default False: If True, include only float, int, and boolean columns. If False, raise an error if any column is non-numeric.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.std()
a    1.290994
b    1.290994
dtype: float64
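
The effect of `ddof` is easiest to see by writing the divisor out explicitly. The following plain-Python sketch assumes no missing values and uses a hypothetical helper name, `std_with_ddof`.

```python
import math


def std_with_ddof(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))


a = [1, 2, 3, 4]
sample_std = std_with_ddof(a)              # divisor N - 1 (the default)
population_std = std_with_ddof(a, ddof=0)  # divisor N
print(round(sample_std, 6), round(population_std, 6))
```

`ddof=1` gives the sample estimate shown in the example above; `ddof=0` gives the population value, which is always slightly smaller.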

sub(other, axis='columns', level=None, fill_value=None)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

Equivalent to frame - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.sub(1)
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.sub(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.sub(b, fill_value=0)
a       0
b       1
c       1
d      -1
e    <NA>
dtype: int64
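
The interaction between index alignment and `fill_value` can be traced step by step in plain Python. In this sketch each Series is modeled as a label-to-value dict and `None` stands for a missing value; `sub_with_fill` is a hypothetical helper that follows the rules in the parameter description, not the library's code.

```python
def sub_with_fill(left, right, fill_value=None):
    """Subtract two label->value dicts, aligning on the union of labels."""
    result = {}
    for label in sorted(set(left) | set(right)):
        x, y = left.get(label), right.get(label)
        if x is None and y is None:
            result[label] = None  # missing on both sides stays missing
            continue
        if fill_value is not None:
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        result[label] = None if x is None or y is None else x - y
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
plain = sub_with_fill(a, b)
filled = sub_with_fill(a, b, fill_value=0)
print(plain)
print(filled)
```

Label `e` stays missing even with `fill_value=0` because neither input has a value there.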

subtract(other, axis='columns', level=None, fill_value=None)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

Equivalent to frame - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.sub(1)
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.sub(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.sub(b, fill_value=0)
a       0
b       1
c       1
d      -1
e    <NA>
dtype: int64

sum(axis=<no_default>, skipna=True, dtype=None, numeric_only=False, min_count=0, **kwargs)#

Return sum of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

dtype: data type

Data type to cast the result to.

numeric_onlybool, default False

If True, includes only float, int, boolean columns. If False, will raise error in-case there are non-numeric columns.

min_count: int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

The default is 0, which means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.sum()
a    10
b    34
dtype: int64
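
The `min_count` rule is easy to misread, so here is a small plain-Python sketch of it. It models a column as a list with `None` for nulls and always skips nulls (the `skipna=True` case); `sum_with_min_count` is a hypothetical helper name.

```python
def sum_with_min_count(values, min_count=0):
    """Sum the non-null values, or return None if too few are present."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_count:
        return None  # not enough valid values: result is NA
    return sum(valid)


print(sum_with_min_count([1, 2, 3, 4]))
print(sum_with_min_count([None, None]))               # default min_count=0
print(sum_with_min_count([None, None], min_count=1))  # requires one valid value
```

With the default `min_count=0` an all-null column sums to 0; raising `min_count` to 1 turns that result into a missing value instead.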

tail(n=5)#

Returns the last n rows as a new DataFrame or Series.

Examples#

DataFrame

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> df.tail(2)
   key   val
3    3  13.0
4    4  14.0

Series

>>> import cudf
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.tail(2)
3    1
4    0
dtype: int64

take(indices, axis=0)#

Return a new frame containing the rows specified by indices.

Parameters#

indicesarray-like: Array of ints indicating which positions to take.

axis : Unsupported

Returns#

outSeries or DataFrame: New object with desired subset of rows.

Examples#

Series

>>> s = cudf.Series(['a', 'b', 'c', 'd', 'e'])
>>> s.take([2, 0, 4, 3])
2    c
0    a
4    e
3    d
dtype: object

DataFrame

>>> a = cudf.DataFrame({'a': [1.0, 2.0, 3.0],
...                    'b': cudf.Series(['a', 'b', 'c'])})
>>> a.take([0, 2, 2])
     a  b
0  1.0  a
2  3.0  c
2  3.0  c
>>> a.take([True, False, True])
     a  b
0  1.0  a
2  3.0  c

tile(count: int)#

Repeats the rows count times to form a new Frame.

Parameters#

self : Input table containing the columns to tile.
count : Number of times to tile “rows”. Must be non-negative.

Examples#

>>> import cudf
>>> df = cudf.DataFrame([[8, 4, 7], [5, 2, 3]])
>>> count = 2
>>> df.tile(count)
   0  1  2
0  8  4  7
1  5  2  3
0  8  4  7
1  5  2  3

Returns#

The indexed frame containing the tiled “rows”.
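
Tiling repeats the whole block of rows, in order, rather than repeating each row in place. A plain-Python sketch of that ordering, with rows modeled as lists and a hypothetical helper name `tile_rows`:

```python
def tile_rows(rows, count):
    """Repeat the full sequence of rows `count` times."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return list(rows) * count


tiled = tile_rows([[8, 4, 7], [5, 2, 3]], 2)
for row in tiled:
    print(row)
```

The result lists rows 0, 1, 0, 1, which is why the index in the example above repeats in that pattern.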

to_cupy(dtype: Dtype | None = None, copy: bool = False, na_value=None) → cupy.ndarray#

Convert the Frame to a CuPy array.

Parameters#

dtypestr or numpy.dtype, optional: The dtype to pass to numpy.asarray().
copybool, default False: Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_cupy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.
na_valueAny, default None: The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.

Returns#

cupy.ndarray

to_dlpack()#

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack.

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters#

cudf_obj : DataFrame, Series, Index, or Column

Returns#

pycapsule_objPyCapsule: Output DLPack tensor pointer which is encapsulated in a PyCapsule object.

to_hdf(path_or_buf, key, *args, **kwargs)#

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

To add another DataFrame or Series to an existing HDF file, use append mode and a different key.

For more information see the user guide.

Parameters#

path_or_bufstr or pandas.HDFStore

File path or HDFStore object.

keystr

Identifier for the group in the store.

mode{‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

‘w’: write, a new file is created (an existing file with the same name would be deleted).
‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.
‘r+’: similar to ‘a’, but the file must already exist.

format{‘fixed’, ‘table’}, default ‘fixed’

Possible values:

‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable.
‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

appendbool, default False

For Table formats, append the input data to the existing.

data_columnslist of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

complevel{0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib{‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32bool, default False

If applying compression use the fletcher32 checksum.

dropnabool, default False

If true, ALL nan rows will not be written to store.

errorsstr, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

to_json(path_or_buf=None, *args, **kwargs)#

Convert the cuDF object to a JSON string.

Parameters#

path_or_bufstring or file handle, optional

File path or object. If not specified, the result is returned as a string.

engine{‘auto’, ‘cudf’, ‘pandas’}, default ‘auto’

Parser engine to use. If ‘auto’ is passed, the pandas engine will be selected.

orientstring

Indication of expected JSON string format.

Series
- default is ‘index’
- allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}
DataFrame
- default is ‘columns’
- allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}
The format of the JSON string
- ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
- ‘records’ : list like [{column -> value}, … , {column -> value}]
- ‘index’ : dict like {index -> {column -> value}}
- ‘columns’ : dict like {column -> {index -> value}}
- ‘values’ : just the values array
- ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records'.

date_format{None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precisionint, default 10

The number of decimal places to use when encoding floating point values.

force_asciibool, default True

Force encoded string to be ASCII.

date_unitstring, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handlercallable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serializable object.

linesbool, default False

If ‘orient’ is ‘records’, write out line-delimited JSON. Raises ValueError for any other ‘orient’, since the other formats are not list-like.

compression{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

indexbool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
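
The `orient` layouts are easier to compare side by side. The sketch below builds the shapes described above for a tiny two-row table using only Python's standard `json` module; it illustrates the structure of each layout, not the library's writer, and the exact whitespace of real output may differ. The sample table is an assumption made for the example.

```python
import json

# A small table: two rows, two columns (sample data for illustration).
index = [0, 1]
columns = ["a", "b"]
data = [[1, "x"], [2, "y"]]

# 'records': list like [{column -> value}, ...]
records = [dict(zip(columns, row)) for row in data]

# 'split': dict like {'index': [...], 'columns': [...], 'data': [...]}
split = {"index": index, "columns": columns, "data": data}

# 'index': dict like {index -> {column -> value}} (JSON keys are strings)
by_index = {str(i): dict(zip(columns, row)) for i, row in zip(index, data)}

# 'columns': dict like {column -> {index -> value}}
by_column = {
    col: {str(i): row[j] for i, row in zip(index, data)}
    for j, col in enumerate(columns)
}

records_json = json.dumps(records)
# lines=True with orient='records': one JSON object per line.
lines_json = "\n".join(json.dumps(record) for record in records)
print(records_json)
print(lines_json)
```

Only the ‘records’ layout is a list, which is why line-delimited output is restricted to it.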

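to_numpy(dtype: Dtype | None = None, copy: bool = True, na_value=None) → numpy.ndarray#

Convert the Frame to a NumPy array.

Parameters#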

dtypestr or numpy.dtype, optional: The dtype to pass to numpy.asarray().
copybool, default True: Whether to ensure that the returned value is not a view on another array. This parameter must be True since cuDF must copy device memory to host to provide a numpy array.
na_valueAny, default None: The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.

Returns#

numpy.ndarray

to_string()#

Convert to string

cuDF uses Pandas internals for efficient string formatting. Set formatting options using pandas string formatting options and cuDF objects will print identically to Pandas objects.

cuDF supports null/None as a value in any column type, which is transparently supported during this output process.

Examples#

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2]
>>> df['val'] = [float(i + 10) for i in range(3)]
>>> df.to_string()
'   key   val\n0    0  10.0\n1    1  11.0\n2    2  12.0'

truediv(other, axis='columns', level=None, fill_value=None)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame: Any single or multiple element data structure, or list-like object.
axisint or string: Only 0 is supported for series, 1 or columns supported for dataframe
levelint or name: Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.
fill_valuefloat or None, default None: Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series: Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )

>>> df.truediv(1)
           angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])

>>> a.truediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64

truncate(before=None, after=None, axis=0, copy=True)#

Truncate a Series or DataFrame before and after some index value.

This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.

Parameters#

beforedate, str, int: Truncate all rows before this index value.
afterdate, str, int: Truncate all rows after this index value.
axis{0 or ‘index’, 1 or ‘columns’}, optional: Axis to truncate. Truncates the index (rows) by default.
copybool, default True: Return a copy of the truncated section.

Returns#

The truncated Series or DataFrame.

Notes#

If the index being truncated contains only datetime values, before and after may be specified as strings instead of Timestamps.

Examples#

Series

>>> import cudf
>>> cs1 = cudf.Series([1, 2, 3, 4])
>>> cs1
0    1
1    2
2    3
3    4
dtype: int64

>>> cs1.truncate(before=1, after=2)
1    2
2    3
dtype: int64

>>> import cudf
>>> dates = cudf.date_range(
...     '2021-01-01 23:45:00', '2021-01-01 23:46:00', freq='s'
... )
>>> cs2 = cudf.Series(range(len(dates)), index=dates)
>>> cs2
2021-01-01 23:45:00     0
2021-01-01 23:45:01     1
2021-01-01 23:45:02     2
2021-01-01 23:45:03     3
2021-01-01 23:45:04     4
2021-01-01 23:45:05     5
2021-01-01 23:45:06     6
2021-01-01 23:45:07     7
2021-01-01 23:45:08     8
2021-01-01 23:45:09     9
2021-01-01 23:45:10    10
2021-01-01 23:45:11    11
2021-01-01 23:45:12    12
2021-01-01 23:45:13    13
2021-01-01 23:45:14    14
2021-01-01 23:45:15    15
2021-01-01 23:45:16    16
2021-01-01 23:45:17    17
2021-01-01 23:45:18    18
2021-01-01 23:45:19    19
2021-01-01 23:45:20    20
2021-01-01 23:45:21    21
2021-01-01 23:45:22    22
2021-01-01 23:45:23    23
2021-01-01 23:45:24    24
...
2021-01-01 23:45:56    56
2021-01-01 23:45:57    57
2021-01-01 23:45:58    58
2021-01-01 23:45:59    59
dtype: int64

>>> cs2.truncate(
...     before="2021-01-01 23:45:18", after="2021-01-01 23:45:27"
... )
2021-01-01 23:45:18    18
2021-01-01 23:45:19    19
2021-01-01 23:45:20    20
2021-01-01 23:45:21    21
2021-01-01 23:45:22    22
2021-01-01 23:45:23    23
2021-01-01 23:45:24    24
2021-01-01 23:45:25    25
2021-01-01 23:45:26    26
2021-01-01 23:45:27    27
dtype: int64

>>> cs3 = cudf.Series({'A': 1, 'B': 2, 'C': 3, 'D': 4})
>>> cs3
A    1
B    2
C    3
D    4
dtype: int64

>>> cs3.truncate(before='B', after='C')
B    2
C    3
dtype: int64

DataFrame

>>> df = cudf.DataFrame({
...     'A': ['a', 'b', 'c', 'd', 'e'],
...     'B': ['f', 'g', 'h', 'i', 'j'],
...     'C': ['k', 'l', 'm', 'n', 'o']
... }, index=[1, 2, 3, 4, 5])
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o

>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n

>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j

>>> import cudf
>>> dates = cudf.date_range(
...     '2021-01-01 23:45:00', '2021-01-01 23:46:00', freq='s'
... )
>>> df2 = cudf.DataFrame(data={'A': 1, 'B': 2}, index=dates)
>>> df2.head()
                     A  B
2021-01-01 23:45:00  1  2
2021-01-01 23:45:01  1  2
2021-01-01 23:45:02  1  2
2021-01-01 23:45:03  1  2
2021-01-01 23:45:04  1  2

>>> df2.truncate(
...     before="2021-01-01 23:45:18", after="2021-01-01 23:45:27"
... )
                     A  B
2021-01-01 23:45:18  1  2
2021-01-01 23:45:19  1  2
2021-01-01 23:45:20  1  2
2021-01-01 23:45:21  1  2
2021-01-01 23:45:22  1  2
2021-01-01 23:45:23  1  2
2021-01-01 23:45:24  1  2
2021-01-01 23:45:25  1  2
2021-01-01 23:45:26  1  2
2021-01-01 23:45:27  1  2
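
As the examples show, both `before` and `after` are inclusive bounds on the index labels. A plain-Python sketch of that filtering rule follows, with a Series modeled as parallel lists of labels and values; `truncate_labels` is a hypothetical helper, and it assumes the labels are already sorted.

```python
def truncate_labels(labels, values, before=None, after=None):
    """Keep entries whose label lies in [before, after], both ends inclusive."""
    return [
        (label, value)
        for label, value in zip(labels, values)
        if (before is None or label >= before)
        and (after is None or label <= after)
    ]


ints = truncate_labels([0, 1, 2, 3], [1, 2, 3, 4], before=1, after=2)
strs = truncate_labels(list("ABCD"), [1, 2, 3, 4], before="B", after="C")
print(ints)
print(strs)
```

Leaving either bound as `None` keeps everything on that side.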

property values: ndarray#

Return a CuPy representation of the DataFrame.

Only the values in the DataFrame will be returned, the axes labels will be removed.

Returns#

cupy.ndarray: The values of the DataFrame.

property values_host: ndarray#

Return a NumPy representation of the data.

Only the values in the DataFrame will be returned, the axes labels will be removed.

Returns#

numpy.ndarray: A host representation of the underlying data.

var(axis=<no_default>, skipna=True, ddof=1, numeric_only=False, **kwargs)#

Return unbiased variance of the DataFrame.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters#

axis: {index (0), columns(1)}: Axis for the function to be applied on.
skipna: bool, default True: Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof: int, default 1: Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only: bool, default False: If True, include only float, int, and boolean columns. If False, raise an error if any column is non-numeric.

Returns#

Series

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.var()
a    1.666667
b    1.666667
dtype: float64
