hipdf.Series

Applies to Linux

class hipdf.Series(data=None, index=None, dtype=None, name=None, copy=False, nan_as_null=True)#

Bases: SingleColumnFrame, IndexedFrame, Serializable

One-dimensional GPU array (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as null/NaN).

Operations between Series (+, -, /, *, **) align values based on their associated index values; the operands need not be the same length. The result index will be the sorted union of the two indexes.

Series objects are used as columns of DataFrame.
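
As a rough illustration of the alignment rule described above, the following pure-Python sketch models two Series as dicts that map unique labels to values. The helper name `aligned_add` is hypothetical and is not part of the library. The sketch assumes unique, mutually comparable labels, and it assumes that a label missing from either operand yields a null (modeled here as `None`), as in pandas-style alignment.

```python
def aligned_add(left, right):
    """Add two label->value mappings with index alignment.

    Hypothetical helper for illustration only; assumes unique, sortable labels.
    """
    # The result index is the sorted union of both indexes.
    labels = sorted(set(left) | set(right))
    result = {}
    for label in labels:
        a = left.get(label)
        b = right.get(label)
        # Assumption: a label missing from either side produces a null (None).
        result[label] = None if a is None or b is None else a + b
    return result


s1 = {"a": 1, "b": 2, "c": 3}
s2 = {"b": 10, "c": 20, "d": 30}
print(aligned_add(s1, s2))  # {'a': None, 'b': 12, 'c': 23, 'd': None}
```

The two inputs have the same length here, but the same logic applies when they differ in length, because only the labels drive the alignment.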

Parameters#

data : array-like, Iterable, dict, or scalar value

Contains data stored in Series.

index : array-like or Index (1d)

Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If both a dict and index sequence are used, the index will override the keys found in the dict.

dtype : str, numpy.dtype, or ExtensionDtype, optional

Data type for the output Series. If not specified, this will be inferred from data.

name : str, optional

The name to give to the Series.

copy : bool, default False

Copy input data. Only affects Series or 1d ndarray input.

nan_as_null : bool, default True

If None/True, converts np.nan values to null values. If False, leaves np.nan values as is.

__init__(data=None, index=None, dtype=None, name=None, copy=False, nan_as_null=True)#

Methods

__init__([data, index, dtype, name, copy, ...])

abs()

Return a Series/DataFrame with absolute numeric value of each element.

add(other[, level, fill_value, axis])

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

add_prefix(prefix)

Prefix labels with string prefix.

add_suffix(suffix)

Suffix labels with string suffix.

all([axis, bool_only, skipna, level])

Return whether all elements are True in DataFrame.

any([axis, bool_only, skipna, level])

Return whether any element is True in DataFrame.

append(to_append[, ignore_index, ...])

Append values from another Series or array-like object.

apply(func[, convert_dtype, args])

Apply a scalar function to the values of a Series.

argsort([axis, kind, order, ascending, ...])

Return the integer indices that would sort the Series values.

astype(dtype[, copy, errors])

Cast the object to the given dtype.

autocorr([lag])

Compute the lag-N autocorrelation.

backfill([value, axis, inplace, limit])

Synonym for Series.fillna() with method='bfill'.

between(left, right[, inclusive])

Return boolean Series equivalent to left <= series <= right.

bfill([value, axis, inplace, limit])

Synonym for Series.fillna() with method='bfill'.

clip([lower, upper, inplace, axis])

Trim values at input threshold(s).

convert_dtypes([infer_objects, ...])

Convert columns to the best possible nullable dtypes.

copy([deep])

Make a copy of this object's indices and data.

corr(other[, method, min_periods])

Calculates the sample correlation between two Series, excluding missing values.

count([level])

Return number of non-NA/null observations in the Series

cov(other[, min_periods])

Compute covariance with Series, excluding missing values.

cummax([axis, skipna])

Return cumulative max of the Series.

cummin([axis, skipna])

Return cumulative min of the Series.

cumprod([axis, skipna])

Return cumulative product of the Series.

cumsum([axis, skipna])

Return cumulative sum of the Series.

describe([percentiles, include, exclude, ...])

Generate descriptive statistics.

deserialize(header, frames)

Generate an object from a serialized representation.

device_deserialize(header, frames)

Perform device-side deserialization tasks.

device_serialize()

Serialize data and metadata associated with device memory.

diff([periods])

First discrete difference of element.

digitize(bins[, right])

Return the indices of the bins to which each value belongs.

div(other[, level, fill_value, axis])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

divide(other[, level, fill_value, axis])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

dot(other[, reflect])

Get dot product of frame and other, (binary operator dot).

drop([labels, axis, index, columns, level, ...])

Drop specified labels from rows or columns.

drop_duplicates([keep, inplace, ignore_index])

Return Series with duplicate values removed.

dropna([axis, inplace, how])

Return a Series with null values removed.

duplicated([keep])

Indicate duplicate Series values.

eq(other[, level, fill_value, axis])

Get Equal to of DataFrame or Series and other, element-wise (binary operator eq).

equals(other)

Test whether two objects contain the same elements.

explode([ignore_index])

Transform each element of a list-like to a row, replicating index values.

factorize([sort, na_sentinel, use_na_sentinel])

Encode the input values as integer labels.

ffill([value, axis, inplace, limit])

Synonym for Series.fillna() with method='ffill'.

fillna([value, method, axis, inplace, limit])

Fill null values with value or specified method.

first(offset)

Select initial periods of time series data based on a date offset.

floordiv(other[, level, fill_value, axis])

Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

from_arrow(array)

Create from PyArrow Array/ChunkedArray.

from_categorical(categorical[, codes])

Creates from a pandas.Categorical

from_masked_array(data, mask[, null_count])

Create a Series with null-mask.

from_pandas(s[, nan_as_null])

Convert from a Pandas Series.

ge(other[, level, fill_value, axis])

Get Greater than or equal to of DataFrame or Series and other, element-wise (binary operator ge).

groupby([by, axis, level, as_index, sort, ...])

Group using a mapper or by a Series of columns.

gt(other[, level, fill_value, axis])

Get Greater than of DataFrame or Series and other, element-wise (binary operator gt).

hash_values([method, seed])

Compute the hash of values in this column.

head([n])

Return the first n rows.

host_deserialize(header, frames)

Perform device-side deserialization tasks.

host_serialize()

Serialize data and metadata associated with host memory.

interpolate([method, axis, limit, inplace, ...])

Interpolate data values between some points.

isin(values)

Check whether values are contained in Series.

isna()

Identify missing values.

isnull()

Identify missing values.

items()

Iteration is unsupported.

iteritems()

Iteration is unsupported.

keys()

Return alias for index.

kurt([axis, skipna, level, numeric_only])

Return Fisher's unbiased kurtosis of a sample.

kurtosis([axis, skipna, level, numeric_only])

Return Fisher's unbiased kurtosis of a sample.

last(offset)

Select final periods of time series data based on a date offset.

le(other[, level, fill_value, axis])

Get Less than or equal to of DataFrame or Series and other, element-wise (binary operator le).

lt(other[, level, fill_value, axis])

Get Less than of DataFrame or Series and other, element-wise (binary operator lt).

map(arg[, na_action])

Map values of Series according to input correspondence.

mask(cond[, other, inplace])

Replace values where the condition is True.

max([axis, skipna, level, numeric_only])

Return the maximum of the values in the DataFrame.

mean([axis, skipna, level, numeric_only])

Return the mean of the values for the requested axis.

median([axis, skipna, level, numeric_only])

Return the median of the values for the requested axis.

memory_usage([index, deep])

Return the memory usage of an object.

min([axis, skipna, level, numeric_only])

Return the minimum of the values in the DataFrame.

mod(other[, level, fill_value, axis])

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

mode([dropna])

Return the mode(s) of the dataset.

mul(other[, level, fill_value, axis])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

multiply(other[, level, fill_value, axis])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

nans_to_nulls()

Convert nans (if any) to nulls

ne(other[, level, fill_value, axis])

Get Not equal to of DataFrame or Series and other, element-wise (binary operator ne).

nlargest([n, keep])

Returns a new Series of the n largest elements.

notna()

Identify non-missing values.

notnull()

Identify non-missing values.

nsmallest([n, keep])

Returns a new Series of the n smallest elements.

nunique([dropna])

Return count of unique values for the column.

pad([value, axis, inplace, limit])

Synonym for Series.fillna() with method='ffill'.

pct_change([periods, fill_method, limit, freq])

Calculates the percent change between sequential elements in the Series.

pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs).

pow(other[, level, fill_value, axis])

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

prod([axis, skipna, dtype, level, ...])

Return product of the values in the DataFrame.

product([axis, skipna, dtype, level, ...])

Return product of the values in the DataFrame.

quantile([q, interpolation, exact, quant_index])

Return values at the given quantile.

radd(other[, level, fill_value, axis])

Get Addition of DataFrame or Series and other, element-wise (binary operator radd).

rank([axis, method, numeric_only, ...])

Compute numerical data ranks (1 through n) along axis.

rdiv(other[, level, fill_value, axis])

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

reindex(*args, **kwargs)

Conform Series to new index.

rename([index, copy])

Alter Series name

repeat(repeats[, axis])

Repeats elements consecutively.

replace([to_replace, value])

Replace values given in to_replace with value.

resample(rule[, axis, closed, label, ...])

Convert the frequency of ("resample") the given time series data.

reset_index([level, drop, name, inplace])

Reset the index of the Series, or a level of it.

rfloordiv(other[, level, fill_value, axis])

Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

rmod(other[, level, fill_value, axis])

Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).

rmul(other[, level, fill_value, axis])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

rolling(window[, min_periods, center, axis, ...])

Rolling window calculations.

round([decimals, how])

Round to a variable number of decimal places.

rpow(other[, level, fill_value, axis])

Get Exponential of DataFrame or Series and other, element-wise (binary operator rpow).

rsub(other[, level, fill_value, axis])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

rtruediv(other[, level, fill_value, axis])

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

sample([n, frac, replace, weights, ...])

Return a random sample of items from an axis of object.

scale()

Scale values to [0, 1] in float64

searchsorted(values[, side, ascending, ...])

Find indices where elements should be inserted to maintain order

serialize()

Generate an equivalent serializable representation of an object.

shift([periods, freq, axis, fill_value])

Shift values by periods positions.

skew([axis, skipna, level, numeric_only])

Return unbiased Fisher-Pearson skew of a sample.

sort_index([axis])

Sort object by labels (along an axis).

sort_values([axis, ascending, inplace, ...])

Sort by the values along either axis.

std([axis, skipna, level, ddof, numeric_only])

Return sample standard deviation of the DataFrame.

sub(other[, level, fill_value, axis])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

subtract(other[, level, fill_value, axis])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

sum([axis, skipna, dtype, level, ...])

Return sum of the values in the DataFrame.

tail([n])

Returns the last n rows as a new DataFrame or Series

take(indices[, axis])

Return a new frame containing the rows specified by indices.

tile(count)

Repeats the rows count times to form a new Frame.

to_arrow()

Convert to a PyArrow Array.

to_cupy([dtype, copy, na_value])

Convert the Frame to a CuPy array.

to_dict([into])

Convert Series to {label -> value} dict or dict-like object.

to_dlpack()

Converts a cuDF object into a DLPack tensor.

to_frame([name])

Convert Series into a DataFrame

to_hdf(path_or_buf, key, *args, **kwargs)

Write the contained data to an HDF5 file using HDFStore.

to_json([path_or_buf])

Convert the cuDF object to a JSON string.

to_list()

to_numpy([dtype, copy, na_value])

Convert the Frame to a NumPy array.

to_pandas([index, nullable])

Convert to a Pandas Series.

to_string()

Convert to string

tolist()

transpose()

Return the transpose, which is by definition self.

truediv(other[, level, fill_value, axis])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

truncate([before, after, axis, copy])

Truncate a Series or DataFrame before and after some index value.

unique()

Returns unique values of this Series.

update(other)

Modify Series in place using values from passed Series.

value_counts([normalize, sort, ascending, ...])

Return a Series containing counts of unique values.

var([axis, skipna, level, ddof, numeric_only])

Return unbiased variance of the DataFrame.

where(cond[, other, inplace])

Replace values where the condition is False.

Attributes

T

Return the transpose, which is by definition self.

axes

Return a list representing the axes of the Series.

cat

Accessor object for categorical properties of the Series values.

data

The gpu buffer for the data

dt

Accessor object for datetime-like properties of the Series values.

dtype

The dtype of the Series.

empty

Indicator whether DataFrame or Series is empty.

has_nulls

Indicator whether Series contains null values.

hasnans

Return True if there are any NaNs or nulls.

iloc

Select values by position.

index

Get the labels for the rows.

is_monotonic

Return boolean if values in the object are monotonically increasing.

is_monotonic_decreasing

Return boolean if values in the object are monotonically decreasing.

is_monotonic_increasing

Return boolean if values in the object are monotonically increasing.

is_unique

Return boolean if values in the object are unique.

list

List methods for Series

loc

Select rows and columns by label or boolean mask.

name

Get the name of this object.

ndim

Number of dimensions of the underlying data, by definition 1.

null_count

Number of null values

nullable

A boolean indicating whether a null-mask is needed

nullmask

The gpu buffer for the null-mask

shape

Get a tuple representing the dimensionality of the Index.

size

Return the number of elements in the underlying data.

str

Vectorized string functions for Series and Index.

struct

Struct methods for Series

valid_count

Number of non-null values

values

Return a CuPy representation of the DataFrame.

values_host

Return a NumPy representation of the data.

classmethod from_categorical(categorical, codes=None)#

Creates from a pandas.Categorical

Parameters#

categorical : pandas.Categorical

Contains data stored in a pandas Categorical.

codes : array-like, optional

The category codes of this categorical. If codes are defined, they are used instead of categorical.codes.

Returns#

Series

A cudf categorical series.

Examples#

>>> import cudf
>>> import pandas as pd
>>> pd_categorical = pd.Categorical(pd.Series(['a', 'b', 'c', 'a'], dtype='category'))
>>> pd_categorical
['a', 'b', 'c', 'a']
Categories (3, object): ['a', 'b', 'c']
>>> series = cudf.Series.from_categorical(pd_categorical)
>>> series
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
classmethod from_masked_array(data, mask, null_count=None)#

Create a Series with null-mask. This is equivalent to:

Series(data).set_mask(mask, null_count=null_count)

Parameters#

data : 1D array-like

The values. Null values must not be skipped. They can appear as garbage values.

mask : 1D array-like

The null-mask. Valid values are marked as 1; otherwise 0. The mask bit given the data index idx is computed as:

(mask[idx // 8] >> (idx % 8)) & 1

null_count : int, optional

The number of null values. If None, it is calculated automatically.

Returns#

Series

Examples#

>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4, None])
>>> a
0       1
1       2
2       3
3    <NA>
4       4
5    <NA>
dtype: int64
>>> b = cudf.Series([10, 11, 12, 13, 14])
>>> cudf.Series.from_masked_array(data=b, mask=a._column.mask)
0      10
1      11
2      12
3    <NA>
4      14
dtype: int64
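
The mask bit formula given in the Parameters section can be checked with a small pure-Python sketch. The helper names `pack_validity` and `is_valid` are hypothetical and not part of the library. The sketch only models the bit layout (least-significant bit first within each byte) on a plain `bytes` object; an actual device buffer may carry additional padding.

```python
def pack_validity(flags):
    """Pack an iterable of booleans into a bitmask, LSB-first within each byte."""
    flags = list(flags)
    mask = bytearray((len(flags) + 7) // 8)
    for idx, valid in enumerate(flags):
        if valid:
            mask[idx // 8] |= 1 << (idx % 8)
    return bytes(mask)


def is_valid(mask, idx):
    """Return the mask bit for data index idx: 1 means valid, 0 means null."""
    return (mask[idx // 8] >> (idx % 8)) & 1


# Validity pattern of [1, 2, 3, None, 4, None]: rows 3 and 5 are null.
mask = pack_validity([True, True, True, False, True, False])
print(bin(mask[0]))                            # 0b10111
print([is_valid(mask, i) for i in range(6)])   # [1, 1, 1, 0, 1, 0]
```

Row 3 maps to bit 3 of byte 0, which is clear, so that row is reported as null regardless of the value stored in the data buffer at that position.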
classmethod from_pandas(s, nan_as_null=<no_default>)#

Convert from a Pandas Series.

Parameters#

s : Pandas Series object

A Pandas Series object which has to be converted to cuDF Series.

nan_as_null : bool, default None

If None/True, converts np.nan values to null values. If False, leaves np.nan values as is.

Raises#

TypeError for invalid input type.

Examples#

>>> import cudf
>>> import pandas as pd
>>> import numpy as np
>>> data = [10, 20, 30, np.nan]
>>> pds = pd.Series(data, dtype='float64')
>>> cudf.Series.from_pandas(pds)
0    10.0
1    20.0
2    30.0
3    <NA>
dtype: float64
>>> cudf.Series.from_pandas(pds, nan_as_null=False)
0    10.0
1    20.0
2    30.0
3     NaN
dtype: float64
property is_unique#

Return boolean if values in the object are unique.

Returns#

bool

property dt#

Accessor object for datetime-like properties of the Series values.

Examples#

>>> s = cudf.Series(cudf.date_range(
...   start='2001-02-03 12:00:00',
...   end='2001-02-03 14:00:00',
...   freq='1H'))
>>> s.dt.hour
0    12
1    13
dtype: int16
>>> s.dt.second
0    0
1    0
dtype: int16
>>> s.dt.day
0    3
1    3
dtype: int16

Returns#

A Series indexed like the original Series.

Raises#

TypeError if the Series does not contain datetimelike values.

property axes#

Return a list representing the axes of the Series.

Series.axes returns a list containing the row index.

Examples#

>>> import cudf
>>> csf1 = cudf.Series([1, 2, 3, 4])
>>> csf1.axes
[RangeIndex(start=0, stop=4, step=1)]
property hasnans#

Return True if there are any NaNs or nulls.

Returns#

out : bool

If Series has at least one NaN or null value, return True, if not return False.

Examples#

>>> import cudf
>>> import numpy as np
>>> series = cudf.Series([1, 2, np.nan, 3, 4], nan_as_null=False)
>>> series
0    1.0
1    2.0
2    NaN
3    3.0
4    4.0
dtype: float64
>>> series.hasnans
True

hasnans returns True for the presence of any NA values:

>>> series = cudf.Series([1, 2, 3, None, 4])
>>> series
0       1
1       2
2       3
3    <NA>
4       4
dtype: int64
>>> series.hasnans
True
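
The two examples above cover both kinds of missing value: a floating-point NaN kept as-is with nan_as_null=False, and a null. The following pure-Python sketch models the difference between a NaN-or-null check and a null-only check. The function names are hypothetical stand-ins, with `None` modeling a null; they are not the library implementation.

```python
import math


def has_nulls_model(values):
    """True if any entry is a null (modeled as None)."""
    return any(v is None for v in values)


def hasnans_model(values):
    """True if any entry is a null or a floating-point NaN."""
    return any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )


with_nan = [1.0, 2.0, float("nan"), 3.0, 4.0]   # NaN kept as NaN
with_null = [1, 2, 3, None, 4]                  # one null entry

print(hasnans_model(with_nan), has_nulls_model(with_nan))    # True False
print(hasnans_model(with_null), has_nulls_model(with_null))  # True True
```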
drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')#

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

Parameters#

labels : single label or list-like

Index or column labels to drop.

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

index : single label or list-like

Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).

columns : single label or list-like

Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

level : int or level name, optional

For MultiIndex, level from which the labels will be removed.

inplace : bool, default False

If False, return a copy. Otherwise, do operation inplace and return None.

errors : {‘ignore’, ‘raise’}, default ‘raise’

If ‘ignore’, suppress error and only existing labels are dropped.

Returns#

DataFrame or Series

DataFrame or Series without the removed index or column labels.

Raises#

KeyError

If any of the labels is not found in the selected axis.

See Also#

DataFrame.loc : Label-location based indexer for selection by label.

DataFrame.dropna : Return DataFrame with labels on given axis omitted where (all or any) data are missing.

DataFrame.drop_duplicates : Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Series.reindex : Return only specified index labels of Series.

Series.dropna : Return series without null values.

Series.drop_duplicates : Return series with duplicate values removed.

Examples#

Series

>>> s = cudf.Series([1,2,3], index=['x', 'y', 'z'])
>>> s
x    1
y    2
z    3
dtype: int64

Drop labels x and z

>>> s.drop(labels=['x', 'z'])
y    2
dtype: int64

Drop a label from the second level in MultiIndex Series.

>>> midx = cudf.MultiIndex.from_product([[0, 1, 2], ['x', 'y']])
>>> s = cudf.Series(range(6), index=midx)
>>> s
0  x    0
   y    1
1  x    2
   y    3
2  x    4
   y    5
dtype: int64
>>> s.drop(labels='y', level=1)
0  x    0
1  x    2
2  x    4
Name: 2, dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({"A": [1, 2, 3, 4],
...                      "B": [5, 6, 7, 8],
...                      "C": [10, 11, 12, 13],
...                      "D": [20, 30, 40, 50]})
>>> df
   A  B   C   D
0  1  5  10  20
1  2  6  11  30
2  3  7  12  40
3  4  8  13  50

Drop columns

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  1  20
1  2  30
2  3  40
3  4  50
>>> df.drop(columns=['B', 'C'])
   A   D
0  1  20
1  2  30
2  3  40
3  4  50

Drop rows by index

>>> df.drop([0, 1])
   A  B   C   D
2  3  7  12  40
3  4  8  13  50

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = cudf.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = cudf.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
       length    1.5    1.0
cow    speed    30.0   20.0
       weight  250.0  150.0
       length    1.5    0.8
falcon speed   320.0  250.0
       weight    1.0    0.8
       length    0.3    0.2
>>> df.drop(index='cow', columns='small')
                 big
lama   speed    45.0
       weight  200.0
       length    1.5
falcon speed   320.0
       weight    1.0
       length    0.3
>>> df.drop(index='length', level=1)
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
cow    speed    30.0   20.0
       weight  250.0  150.0
falcon speed   320.0  250.0
       weight    1.0    0.8
tolist()#
to_list()#
to_dict(into: type[dict] = <class 'dict'>) → dict#

Convert Series to {label -> value} dict or dict-like object.

Parameters#

into : class, default dict

The collections.abc.Mapping subclass to use as the return object. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.

Returns#

collections.abc.Mapping

Key-value representation of Series.

Examples#

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.to_dict()
{0: 1, 1: 2, 2: 3, 3: 4}
>>> from collections import OrderedDict, defaultdict
>>> s.to_dict(OrderedDict)
OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
>>> dd = defaultdict(list)
>>> s.to_dict(dd)
defaultdict(<class 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})
append(to_append, ignore_index=False, verify_integrity=False)#

Append values from another Series or array-like object. If ignore_index=True, the index is reset.

Parameters#

to_append : Series or list/tuple of Series

Series to append with self.

ignore_index : boolean, default False

If True, do not use the index.

verify_integrity : bool, default False

This parameter is currently not supported.

Returns#

Series

A new concatenated series

See Also#

cudf.concat : General function to concatenate DataFrame or Series objects.

Examples#

>>> import cudf
>>> s1 = cudf.Series([1, 2, 3])
>>> s2 = cudf.Series([4, 5, 6])
>>> s1
0    1
1    2
2    3
dtype: int64
>>> s2
0    4
1    5
2    6
dtype: int64
>>> s1.append(s2)
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
>>> s3 = cudf.Series([4, 5, 6], index=[3, 4, 5])
>>> s3
3    4
4    5
5    6
dtype: int64
>>> s1.append(s3)
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With ignore_index set to True:

>>> s1.append(s2, ignore_index=True)
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
reindex(*args, **kwargs)#

Conform Series to new index.

Parameters#

index : Index, Series-convertible, default None

New labels / index to conform to; should be specified using keywords.

method : Not supported

copy : boolean, default True

level : Not supported

fill_value : Value to use for missing values. Defaults to NA, but can be any “compatible” value.

limit : Not supported

tolerance : Not supported

Returns#

Series with changed index.

Examples#

>>> import cudf
>>> series = cudf.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
>>> series
a    10
b    20
c    30
d    40
dtype: int64
>>> series.reindex(['a', 'b', 'y', 'z'])
a      10
b      20
y    <NA>
z    <NA>
dtype: int64
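
In the example above, labels that are absent from the original index come back as <NA>. A pure-Python model of the conform step, including the fill_value parameter described in the Parameters section, is sketched below. `reindex_model` is a hypothetical helper that operates on a dict with unique labels; it is not the library implementation.

```python
def reindex_model(series, new_index, fill_value=None):
    """Conform a label->value mapping to new_index.

    Labels absent from the original mapping receive fill_value
    (None stands in for NA here).
    """
    return {label: series.get(label, fill_value) for label in new_index}


series = {"a": 10, "b": 20, "c": 30, "d": 40}
print(reindex_model(series, ["a", "b", "y", "z"]))
# {'a': 10, 'b': 20, 'y': None, 'z': None}
print(reindex_model(series, ["a", "b", "y", "z"], fill_value=0))
# {'a': 10, 'b': 20, 'y': 0, 'z': 0}
```

Labels present in the original but not requested ('c' and 'd' here) are dropped from the result.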
reset_index(level=None, drop=False, name=None, inplace=False)#

Reset the index of the Series, or a level of it.

Parameters#

level : int, str, tuple, or list, default None

Only remove the given levels from the index. Removes all levels by default.

drop : bool, default False

Do not try to insert index into dataframe columns. This resets the index to the default integer index.

name : object, optional

The name to use for the column containing the original Series values. Uses self.name by default. This argument is ignored when drop is True.

inplace : bool, default False

Modify the DataFrame in place (do not create a new object).

Returns#

Series or DataFrame or None

Series with the new index or None if inplace=True. For Series, when drop is False (the default), a DataFrame is returned. The newly created columns will come first in the DataFrame, followed by the original Series values. When drop is True, a Series is returned. In either case, if inplace=True, no value is returned.

Examples#

>>> series = cudf.Series(['a', 'b', 'c', 'd'], index=[10, 11, 12, 13])
>>> series
10    a
11    b
12    c
13    d
dtype: object
>>> series.reset_index()
   index  0
0     10  a
1     11  b
2     12  c
3     13  d
>>> series.reset_index(drop=True)
0    a
1    b
2    c
3    d
dtype: object

You can also use reset_index with MultiIndex.

>>> s2 = cudf.Series(
...             range(4), name='foo',
...             index=cudf.MultiIndex.from_tuples([
...                     ('bar', 'one'), ('bar', 'two'),
...                     ('baz', 'one'), ('baz', 'two')],
...                     names=['a', 'b']
...      ))
>>> s2
a    b
bar  one    0
     two    1
baz  one    2
     two    3
Name: foo, dtype: int64
>>> s2.reset_index(level='a')
       a  foo
b
one  bar    0
two  bar    1
one  baz    2
two  baz    3
to_frame(name=None)#

Convert Series into a DataFrame

Parameters#

name : str, default None

Name to be used for the column

Returns#

DataFrame

cudf DataFrame

Examples#

>>> import cudf
>>> series = cudf.Series(['a', 'b', 'c', None, 'd'], name='sample', index=[10, 11, 12, 13, 15])
>>> series
10       a
11       b
12       c
13    <NA>
15       d
Name: sample, dtype: object
>>> series.to_frame()
   sample
10      a
11      b
12      c
13   <NA>
15      d
memory_usage(index=True, deep=False)#

Return the memory usage of an object.

Parameters#

deep : bool

The deep parameter is ignored and is only included for pandas compatibility.

Returns#

The total bytes used.

map(arg, na_action=None) → Series#

Map values of Series according to input correspondence.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

Parameters#

arg : function, collections.abc.Mapping subclass or Series

Mapping correspondence.

na_action : {None, ‘ignore’}, default None

If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.

Returns#

Series

Same index as caller.

Examples#

>>> s = cudf.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0      cat
1      dog
2     <NA>
3   rabbit
dtype: object

map accepts a dict or a Series. Values that are not found in the dict are converted to null (<NA>); default values in dicts are currently not supported:

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2     <NA>
3     <NA>
dtype: object

It also accepts numeric functions:

>>> s = cudf.Series([1, 2, 3, 4, np.nan])
>>> s.map(lambda x: x ** 2)
0       1
1       4
2       9
3      16
4    <NA>
dtype: int64

Notes#

Please note map currently only supports fixed-width numeric type functions.

__getitem__(arg)#
iteritems()#

Iteration is unsupported.

See iteration for more information.

items()#

Iteration is unsupported.

See iteration for more information.

property cat#

Accessor object for categorical properties of the Series values. Be aware that assigning to categories is an in-place operation, while all methods return new categorical data by default.

Parameters#

column : Column

parent : Series or CategoricalIndex

Examples#

>>> s = cudf.Series([1,2,3], dtype='category')
>>> s
0    1
1    2
2    3
dtype: category
Categories (3, int64): [1, 2, 3]
>>> s.cat.categories
Int64Index([1, 2, 3], dtype='int64')
>>> s.cat.reorder_categories([3,2,1])
0    1
1    2
2    3
dtype: category
Categories (3, int64): [3, 2, 1]
>>> s.cat.remove_categories([1])
0    <NA>
1       2
2       3
dtype: category
Categories (2, int64): [2, 3]
>>> s.cat.set_categories(list('abcde'))
0    <NA>
1    <NA>
2    <NA>
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
>>> s.cat.as_ordered()
0    1
1    2
2    3
dtype: category
Categories (3, int64): [1 < 2 < 3]
>>> s.cat.as_unordered()
0    1
1    2
2    3
dtype: category
Categories (3, int64): [1, 2, 3]
property str#

Vectorized string functions for Series and Index.

This mimics the pandas df.str interface. Nulls stay null unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.

property list#

List methods for Series

property struct#

Struct methods for Series

property dtype#

The dtype of the Series.

property valid_count#

Number of non-null values

property null_count#

Number of null values

property nullable#

A boolean indicating whether a null-mask is needed

property has_nulls#

Indicator whether Series contains null values.

Returns#

out : bool

If Series has at least one null value, return True, if not return False.

Examples#

>>> import cudf
>>> series = cudf.Series([1, 2, None, 3, 4])
>>> series
0       1
1       2
2    <NA>
3       3
4       4
dtype: int64
>>> series.has_nulls
True
>>> series.dropna().has_nulls
False
dropna(axis=0, inplace=False, how=None)#

Return a Series with null values removed.

Parameters#

axis : {0 or ‘index’}, default 0

There is only one axis to drop values from.

inplace : bool, default False

If True, do operation inplace and return None.

how : str, optional

Not in use. Kept for compatibility.

Returns#

Series

Series with null entries dropped from it.

See Also#

Series.isna : Indicate null values.

Series.notna : Indicate non-null values.

Series.fillna : Replace null values.

cudf.DataFrame.dropna : Drop rows or columns which contain null values.

cudf.Index.dropna : Drop null indices.

Examples#

>>> import cudf
>>> ser = cudf.Series([1, 2, None])
>>> ser
0       1
1       2
2    <NA>
dtype: int64

Drop null values from a Series.

>>> ser.dropna()
0    1
1    2
dtype: int64

Keep the Series with valid entries in the same variable.

>>> ser.dropna(inplace=True)
>>> ser
0    1
1    2
dtype: int64

Empty strings are not considered null values. None is considered a null value.

>>> ser = cudf.Series(['', None, 'abc'])
>>> ser
0
1    <NA>
2     abc
dtype: object
>>> ser.dropna()
0
2    abc
dtype: object
drop_duplicates(keep='first', inplace=False, ignore_index=False)#

Return Series with duplicate values removed.

Parameters#

keep{‘first’, ‘last’, False}, default ‘first’

Method to handle dropping duplicates:

  • ‘first’ : Drop duplicates except for the first occurrence.

  • ‘last’ : Drop duplicates except for the last occurrence.

  • False : Drop all duplicates.

inplacebool, default False

If True, performs operation inplace and returns None.

Returns#

Series or None

Series with duplicates dropped or None if inplace=True.

Examples#

>>> s = cudf.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
...               name='animal')
>>> s
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

With the keep parameter, the selection behavior of duplicated values can be changed. The value ‘first’ keeps the first occurrence for each set of duplicated entries. The default value of keep is ‘first’. Note that the order of the returned rows is not guaranteed to be sorted.

>>> s.drop_duplicates()
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

The value ‘last’ for parameter keep keeps the last occurrence for each set of duplicated entries.

>>> s.drop_duplicates(keep='last')
1       cow
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

The value False for parameter keep discards all sets of duplicated entries. Setting the value of ‘inplace’ to True performs the operation inplace and returns None.

>>> s.drop_duplicates(keep=False, inplace=True)
>>> s
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
fillna(value=None, method=None, axis=None, inplace=False, limit=None)#

Fill null values with value or specified method.

Parameters#

valuescalar, Series-like or dict

Value to use to fill nulls. If Series-like, null values are filled with values in corresponding indices. A dict can be used to provide different values to fill nulls in different columns. Cannot be used with method.

method{‘ffill’, ‘bfill’}, default None

Method to use for filling null values in the dataframe or series. ffill propagates the last non-null values forward to the next non-null value. bfill propagates backward with the next non-null value. Cannot be used with value.

Returns#

resultDataFrame, Series, or Index

Copy with nulls filled.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, None], 'b': [3, None, 5]})
>>> df
      a     b
0     1     3
1     2  <NA>
2  <NA>     5
>>> df.fillna(4)
   a  b
0  1  3
1  2  4
2  4  5
>>> df.fillna({'a': 3, 'b': 4})
   a  b
0  1  3
1  2  4
2  3  5

fillna on a Series object:

>>> ser = cudf.Series(['a', 'b', None, 'c'])
>>> ser
0       a
1       b
2    <NA>
3       c
dtype: object
>>> ser.fillna('z')
0    a
1    b
2    z
3    c
dtype: object

fillna also supports in-place operation:

>>> ser.fillna('z', inplace=True)
>>> ser
0    a
1    b
2    z
3    c
dtype: object
>>> df.fillna({'a': 3, 'b': 4}, inplace=True)
>>> df
   a  b
0  1  3
1  2  4
2  3  5

fillna with a fill method specified:

>>> ser = cudf.Series([1, None, None, 2, 3, None, None])
>>> ser.fillna(method='ffill')
0    1
1    1
2    1
3    2
4    3
5    3
6    3
dtype: int64
>>> ser.fillna(method='bfill')
0       1
1       2
2       2
3       2
4       3
5    <NA>
6    <NA>
dtype: int64
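
The ffill and bfill results above can be modelled in plain Python. This is a minimal sketch of the semantics only, with None standing in for a null; the helper names ffill and bfill are hypothetical, and nothing here reflects how the library performs the fill on the GPU.

```python
def ffill(values):
    """Propagate the last non-null value forward (None marks a null)."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def bfill(values):
    """Propagate the next non-null value backward."""
    # A backward fill is a forward fill over the reversed sequence.
    return ffill(values[::-1])[::-1]


data = [1, None, None, 2, 3, None, None]
print(ffill(data))  # [1, 1, 1, 2, 3, 3, 3]
print(bfill(data))  # [1, 2, 2, 2, 3, None, None]
```

Trailing nulls survive bfill (and leading nulls survive ffill) because there is no later (or earlier) non-null value to propagate.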
between(left, right, inclusive='both') → Series#

Return boolean Series equivalent to left <= series <= right.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. pandas treats NA values as False; in the examples below, null elements instead propagate to the result as <NA>.

Parameters#

leftscalar or list-like

Left boundary.

rightscalar or list-like

Right boundary.

inclusive{“both”, “neither”, “left”, “right”}

Include boundaries. Whether to set each bound as closed or open.

Returns#

Series

Series representing whether each element is between left and right (inclusive).

See Also#

Series.gt : Greater than of series and other.

Series.lt : Less than of series and other.

Notes#

This function is equivalent to (left <= ser) & (ser <= right)

Examples#

>>> import cudf
>>> s = cudf.Series([2, 0, 4, 8, None])

Boundary values are included by default:

>>> s.between(1, 4)
0     True
1    False
2     True
3    False
4     <NA>
dtype: bool

With inclusive set to "neither" boundary values are excluded:

>>> s.between(1, 4, inclusive="neither")
0     True
1    False
2    False
3    False
4     <NA>
dtype: bool

left and right can be any scalar value:

>>> s = cudf.Series(['Alice', 'Bob', 'Carol', 'Eve'])
>>> s.between('Anna', 'Daniel')
0    False
1     True
2     True
3    False
dtype: bool
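
The four inclusive options differ only in whether each bound uses a strict or non-strict comparison. The sketch below models that in plain Python; the helper between and the _BOUNDS table are hypothetical, and nulls (None here) are passed through to mirror the <NA> entries in the examples above rather than any documented rule.

```python
import operator

# Map each `inclusive` option to (left comparison, right comparison).
_BOUNDS = {
    "both": (operator.le, operator.le),
    "neither": (operator.lt, operator.lt),
    "left": (operator.le, operator.lt),
    "right": (operator.lt, operator.le),
}


def between(values, left, right, inclusive="both"):
    """Element-wise left <= v <= right, with bounds opened per `inclusive`."""
    lower, upper = _BOUNDS[inclusive]
    return [
        None if v is None else lower(left, v) and upper(v, right)
        for v in values
    ]


s = [2, 0, 4, 8, None]
print(between(s, 1, 4))                       # [True, False, True, False, None]
print(between(s, 1, 4, inclusive="neither"))  # [True, False, False, False, None]
```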
all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)#

Return whether all elements are True in DataFrame.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

  • 0 or ‘index’ : reduce the index, return a Series whose index is the original column labels.

  • 1 or ‘columns’ : reduce the columns, return a Series whose index is the original index.

  • None : reduce all axes, return a scalar.

skipna: bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

Returns#

Series

Notes#

Parameters currently not supported are bool_only, level.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 0, 10, 10]})
>>> df.all()
a     True
b    False
dtype: bool
any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)#

Return whether any element is True in DataFrame.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

  • 0 or ‘index’ : reduce the index, return a Series whose index is the original column labels.

  • 1 or ‘columns’ : reduce the columns, return a Series whose index is the original index.

  • None : reduce all axes, return a scalar.

skipna: bool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

Returns#

Series

Notes#

Parameters currently not supported are bool_only, level.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 0, 10, 10]})
>>> df.any()
a    True
b    True
dtype: bool
to_pandas(index=True, nullable=False, **kwargs)#

Convert to a Pandas Series.

Parameters#

indexBoolean, Default True

If index is True, converts the index of cudf.Series and sets it to the pandas.Series. If index is False, no index conversion is performed and pandas.Series will assign a default index.

nullableBoolean, Default False

If nullable is True, the resulting series will have the corresponding nullable Pandas dtype. If no corresponding nullable Pandas dtype exists, the resulting dtype will be a regular pandas dtype. If nullable is False, null values are converted to np.nan or None, depending on the dtype.

Returns#

out : Pandas Series

Examples#

>>> import cudf
>>> ser = cudf.Series([-3, 2, 0])
>>> pds = ser.to_pandas()
>>> pds
0   -3
1    2
2    0
dtype: int64
>>> type(pds)
<class 'pandas.core.series.Series'>

The nullable parameter controls whether the resulting dtype is a Pandas nullable dtype:

>>> ser = cudf.Series([10, 20, None, 30])
>>> ser
0      10
1      20
2    <NA>
3      30
dtype: int64
>>> ser.to_pandas(nullable=True)
0      10
1      20
2    <NA>
3      30
dtype: Int64
>>> ser.to_pandas(nullable=False)
0    10.0
1    20.0
2     NaN
3    30.0
dtype: float64
property data#

The gpu buffer for the data

Returns#

out : The GPU buffer of the Series.

Examples#

>>> import cudf
>>> import numpy as np
>>> series = cudf.Series([1, 2, 3, 4])
>>> series
0    1
1    2
2    3
3    4
dtype: int64
>>> np.array(series.data.memoryview())
array([1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,
       0, 0, 4, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
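
The byte pattern above is what four int64 values look like when each is stored in 8 bytes with the least significant byte first. As a hedged illustration (assuming little-endian storage, which the output above suggests but the text does not state), the same bytes can be reproduced with the standard library:

```python
import struct

# Pack four 64-bit signed integers in little-endian byte order.
raw = struct.pack("<4q", 1, 2, 3, 4)

print(len(raw))   # 32 bytes: 4 values * 8 bytes each
print(list(raw))  # each value occupies 8 bytes, low byte first
```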
property nullmask#

The gpu buffer for the null-mask

astype(dtype, copy=False, errors='raise', **kwargs)#

Cast the object to the given dtype.

Parameters#

dtypedata type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire DataFrame object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copybool, default False

Return a deep copy when copy=True. Note that copy=False is the default, so changes to values may then propagate to other cudf objects.

errors{‘raise’, ‘ignore’, ‘warn’}, default ‘raise’

Control raising of exceptions on invalid data for provided dtype.

  • raise : allow exceptions to be raised

  • ignore : suppress exceptions. On error return original object.

**kwargs : extra arguments to pass on to the constructor

Returns#

DataFrame/Series

Examples#

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})
>>> df
    a  b
0  10  1
1  20  2
2  30  3
>>> df.dtypes
a    int64
b    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes
a    int32
b    int32
dtype: object

Cast a to float32 using a dictionary:

>>> df.astype({'a': 'float32'}).dtypes
a    float32
b      int64
dtype: object
>>> df.astype({'a': 'float32'})
      a  b
0  10.0  1
1  20.0  2
2  30.0  3

Series

>>> import cudf
>>> series = cudf.Series([1, 2], dtype='int32')
>>> series
0    1
1    2
dtype: int32
>>> series.astype('int64')
0    1
1    2
dtype: int64

Convert to categorical type:

>>> series.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> cat_dtype = cudf.CategoricalDtype(categories=[2, 1], ordered=True)
>>> series.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False (enabled by default) and changing data on a new Series will propagate changes:

>>> s1 = cudf.Series([1, 2])
>>> s1
0    1
1    2
dtype: int64
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1
0    10
1     2
dtype: int64
sort_index(axis=0, *args, **kwargs)#

Sort object by labels (along an axis).

Parameters#

axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.

levelint or level name or list of ints or list of level names

If not None, sort on values in specified index level(s). This is only useful in the case of MultiIndex.

ascendingbool, default True

Sort ascending vs. descending.

inplacebool, default False

If True, perform operation in-place.

kindsorting method such as quick sort and others.

Not yet supported.

na_position{‘first’, ‘last’}, default ‘last’

Puts NaNs at the beginning if first; last puts NaNs at the end.

sort_remainingbool, default True

When sorting a multiindex on a subset of its levels, should entries be lexsorted by the remaining (non-specified) levels as well?

ignore_indexbool, default False

If True, index will be replaced with RangeIndex.

keycallable, optional

If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.

Returns#

Frame or None

Notes#

Difference from pandas:
  • Not supporting: kind, sort_remaining=False

Examples#

Series

>>> import cudf
>>> series = cudf.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
>>> series
3    a
2    b
1    c
4    d
dtype: object
>>> series.sort_index()
1    c
2    b
3    a
4    d
dtype: object

Sort Descending

>>> series.sort_index(ascending=False)
4    d
3    a
2    b
1    c
dtype: object

DataFrame

>>> df = cudf.DataFrame(
... {"b":[3, 2, 1], "a":[2, 1, 3]}, index=[1, 3, 2])
>>> df.sort_index(axis=0)
   b  a
1  3  2
2  1  3
3  2  1
>>> df.sort_index(axis=1)
   a  b
1  2  3
3  1  2
2  3  1
sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)#

Sort by the values along either axis.

Parameters#

ascendingbool or list of bool, default True

Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.

na_position{‘first’, ‘last’}, default ‘last’

‘first’ puts nulls at the beginning, ‘last’ puts nulls at the end

ignore_indexbool, default False

If True, index will not be sorted.

Returns#

Series : Series with sorted values.

Notes#

Difference from pandas:
  • Support axis=’index’ only.

  • Not supporting: inplace, kind

Examples#

>>> import cudf
>>> s = cudf.Series([1, 5, 2, 4, 3])
>>> s.sort_values()
0    1
2    2
4    3
3    4
1    5
dtype: int64
nlargest(n=5, keep='first')#

Returns a new Series of the n largest elements.

Parameters#

nint, default 5

Return this many descending sorted values.

keep{‘first’, ‘last’}, default ‘first’

When there are duplicate values that cannot all fit in a Series of n elements:

  • first : return the first n occurrences in order of appearance.

  • last : return the last n occurrences in reverse order of appearance.

Returns#

Series

The n largest values in the Series, sorted in decreasing order.

Examples#

>>> import cudf
>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> series = cudf.Series(countries_population)
>>> series
Italy         59000000
France        65000000
Malta           434000
Maldives        434000
Brunei          434000
Iceland         337000
Nauru            11300
Tuvalu           11300
Anguilla         11300
Montserrat        5200
dtype: int64
>>> series.nlargest()
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64
>>> series.nlargest(3)
France    65000000
Italy     59000000
Malta       434000
dtype: int64
>>> series.nlargest(3, keep='last')
France    65000000
Italy     59000000
Brunei      434000
dtype: int64
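
The effect of keep on tied values can be modelled with a stable sort. The following is a plain-Python sketch of the selection rule only; n_largest is a hypothetical helper operating on a dict of label to value, not the library's implementation.

```python
def n_largest(data, n=5, keep="first"):
    """Model of nlargest on a dict of label -> value (hypothetical helper)."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    pairs = list(data.items())
    if keep == "last":
        # Visit later occurrences first so that ties favor them.
        pairs.reverse()
    # sorted() is stable, so tied values keep their visiting order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


population = {"Italy": 59000000, "France": 65000000, "Malta": 434000,
              "Maldives": 434000, "Brunei": 434000, "Iceland": 337000}
print(n_largest(population, 3))
# [('France', 65000000), ('Italy', 59000000), ('Malta', 434000)]
print(n_largest(population, 3, keep="last"))
# [('France', 65000000), ('Italy', 59000000), ('Brunei', 434000)]
```

Malta, Maldives and Brunei tie at 434000, so only the visiting order decides which one fills the third slot.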
nsmallest(n=5, keep='first')#

Returns a new Series of the n smallest elements.

Parameters#

nint, default 5

Return this many ascending sorted values.

keep{‘first’, ‘last’}, default ‘first’

When there are duplicate values that cannot all fit in a Series of n elements:

  • first : return the first n occurrences in order of appearance.

  • last : return the last n occurrences in reverse order of appearance.

Returns#

Series

The n smallest values in the Series, sorted in increasing order.

Examples#

>>> import cudf
>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Brunei": 434000, "Malta": 434000,
...                         "Maldives": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = cudf.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Brunei        434000
Malta         434000
Maldives      434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The n smallest elements where n=5 by default.

>>> s.nsmallest()
Montserrat    5200
Nauru        11300
Tuvalu       11300
Anguilla     11300
Iceland     337000
dtype: int64

The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.

>>> s.nsmallest(3)
Montserrat   5200
Nauru       11300
Tuvalu      11300
dtype: int64

The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept since they are the last with value 11300 based on the index order.

>>> s.nsmallest(3, keep='last')
Montserrat   5200
Anguilla    11300
Tuvalu      11300
dtype: int64
argsort(axis=0, kind='quicksort', order=None, ascending=True, na_position='last')#

Return the integer indices that would sort the Series values.

Parameters#

bystr or list of str, default None

Name or list of names to sort by. If None, sort by all columns.

axis{0 or “index”}

Has no effect but is accepted for compatibility with numpy.

kind{‘mergesort’, ‘quicksort’, ‘heapsort’, ‘stable’}, default ‘quicksort’

Choice of sorting algorithm. See numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms. Only quicksort is supported in cuDF.

orderNone

Has no effect but is accepted for compatibility with numpy.

ascendingbool or list of bool, default True

If True, sort values in ascending order, otherwise descending.

na_position{‘first’ or ‘last’}, default ‘last’

Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.

Returns#

cupy.ndarray: The indices sorted based on input.

Examples#

Series

>>> import cudf
>>> s = cudf.Series([3, 1, 2])
>>> s
0    3
1    1
2    2
dtype: int64
>>> s.argsort()
0    1
1    2
2    0
dtype: int32
>>> s[s.argsort()]
1    1
2    2
0    3
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'foo': [3, 1, 2]})
>>> df.argsort()
array([1, 2, 0], dtype=int32)

Index

>>> import cudf
>>> idx = cudf.Index([3, 1, 2])
>>> idx.argsort()
array([1, 2, 0], dtype=int32)
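
As a rough mental model, argsort returns the positions that, when used to reorder the values, yield them in sorted order. A plain-Python sketch of that idea follows; the argsort helper here is hypothetical and ignores nulls and na_position.

```python
def argsort(values, ascending=True):
    """Return the positions that would sort `values` (stable)."""
    return sorted(range(len(values)), key=values.__getitem__,
                  reverse=not ascending)


values = [3, 1, 2]
order = argsort(values)
print(order)                       # [1, 2, 0]
print([values[i] for i in order])  # [1, 2, 3]
```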

replace(to_replace=None, value=None, *args, **kwargs)#

Replace values given in to_replace with value.

Parameters#

to_replacenumeric, str or list-like

Value(s) to replace.

  • numeric or str:
    • values equal to to_replace will be replaced with value

  • list of numeric or str:
    • If value is also list-like, to_replace and value must be of same length.

  • dict:
    • Dicts can be used to specify different replacement values for different existing values. For example, {‘a’: ‘b’, ‘y’: ‘z’} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be None.

valuescalar, dict, list-like, str, default None

Value to replace any values matching to_replace with.

inplacebool, default False

If True, perform the operation in place.

See Also#

Series.fillna

Raises#

TypeError
  • If to_replace is not a scalar, array-like, dict, or None

  • If to_replace is a dict and value is not a list, dict, or Series

ValueError
  • If a list is passed to to_replace and value but they are not the same length.

Returns#

resultSeries

Series after replacement. The mask and index are preserved.

Notes#

Parameters that are currently not supported are: limit, regex, method

Examples#

Series

Scalar to_replace and value

>>> import cudf
>>> s = cudf.Series([0, 1, 2, 3, 4])
>>> s
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> s.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64

List-like to_replace

>>> s.replace([1, 2], 10)
0     0
1    10
2    10
3     3
4     4
dtype: int64

dict-like to_replace

>>> s.replace({1:5, 3:50})
0     0
1     5
2     2
3    50
4     4
dtype: int64
>>> s = cudf.Series(['b', 'a', 'a', 'b', 'a'])
>>> s
0     b
1     a
2     a
3     b
4     a
dtype: object
>>> s.replace({'a': None})
0       b
1    <NA>
2    <NA>
3       b
4    <NA>
dtype: object

If the types of the values in to_replace and value do not match the type of the actual series, cudf behaves differently from pandas and silently ignores the mismatched pairs:

>>> s = cudf.Series(['b', 'a', 'a', 'b', 'a'])
>>> s
0    b
1    a
2    a
3    b
4    a
dtype: object
>>> s.replace('a', 1)
0    b
1    a
2    a
3    b
4    a
dtype: object
>>> s.replace(['a', 'c'], [1, 2])
0    b
1    a
2    a
3    b
4    a
dtype: object

DataFrame

Scalar to_replace and value

>>> import cudf
>>> df = cudf.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like to_replace

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e

dict-like to_replace

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
update(other)#

Modify Series in place using values from passed Series. Uses non-NA values from passed Series to make updates. Aligns on index.

Parameters#

other : Series, or object coercible into Series

Examples#

>>> import cudf
>>> import numpy as np
>>> s = cudf.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.update(cudf.Series([4, 5, 6]))
>>> s
0    4
1    5
2    6
dtype: int64
>>> s = cudf.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object
>>> s.update(cudf.Series(['d', 'e'], index=[0, 2]))
>>> s
0    d
1    b
2    e
dtype: object
>>> s = cudf.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.update(cudf.Series([4, 5, 6, 7, 8]))
>>> s
0    4
1    5
2    6
dtype: int64

If other contains NaNs the corresponding values are not updated in the original Series.

>>> s = cudf.Series([1.0, 2.0, 3.0])
>>> s
0    1.0
1    2.0
2    3.0
dtype: float64
>>> s.update(cudf.Series([4.0, np.nan, 6.0], nan_as_null=False))
>>> s
0    4.0
1    2.0
2    6.0
dtype: float64

other can also be a non-Series object type that is coercible into a Series

>>> s = cudf.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.update([4, np.nan, 6])
>>> s
0    4
1    2
2    6
dtype: int64
>>> s = cudf.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.update({1: 9})
>>> s
0    1
1    9
2    3
dtype: int64
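
The rules illustrated above (align on index, skip nulls in other, ignore labels that the original Series does not have) can be summarized in a short plain-Python model. This is a sketch of the semantics only, using dicts of label to value; update and _is_null are hypothetical helpers.

```python
import math


def _is_null(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def update(target, other):
    """Overwrite `target` in place for shared labels, skipping nulls in `other`."""
    for label, value in other.items():
        if label in target and not _is_null(value):
            target[label] = value


s = {0: 1, 1: 2, 2: 3}
update(s, {0: 4, 1: float("nan"), 2: 6, 3: 7})
print(s)  # {0: 4, 1: 2, 2: 6}
```

Label 1 keeps its old value because the incoming value is NaN, and label 3 is dropped because it does not exist in the target.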
apply(func, convert_dtype=True, args=(), **kwargs)#

Apply a scalar function to the values of a Series. Similar to pandas.Series.apply.

apply relies on Numba to JIT compile func. Thus the allowed operations within func are limited to those supported by the CUDA Python Numba target. For more information, see the cuDF guide to user defined functions.

Some string functions and methods are supported. Refer to the guide to UDFs for details.

Parameters#

funcfunction

Scalar Python function to apply.

convert_dtypebool, default True

In cuDF, this parameter is always True. Because cuDF does not support arbitrary object dtypes, the result will always be the common type as determined by numba based on the function logic and argument types. See examples for details.

argstuple

Positional arguments passed to func after the series value.

**kwargs

Not supported

Returns#

resultSeries

The mask and index are preserved.

Notes#

UDFs are cached in memory to avoid recompilation. The first call to the UDF will incur compilation overhead. func may call nested functions only if they are decorated with numba.cuda.jit(device=True); otherwise numba will raise a typing error.

Examples#

Apply a basic function to a series:

>>> sr = cudf.Series([1,2,3])
>>> def f(x):
...     return x + 1
>>> sr.apply(f)
0    2
1    3
2    4
dtype: int64

Apply a basic function to a series with nulls:

>>> sr = cudf.Series([1,cudf.NA,3])
>>> def f(x):
...     return x + 1
>>> sr.apply(f)
0       2
1    <NA>
2       4
dtype: int64

Use a function that behaves conditionally, depending on whether the value is null:

>>> sr = cudf.Series([1,cudf.NA,3])
>>> def f(x):
...     if x is cudf.NA:
...         return 42
...     else:
...         return x - 1
>>> sr.apply(f)
0     0
1    42
2     2
dtype: int64

Results will be upcast to the common dtype required, as derived from the UDF's logic. Note that this common type is returned even if the data passed in would not produce any values of that dtype:

>>> sr = cudf.Series([1,cudf.NA,3])
>>> def f(x):
...     return x + 1.5
>>> sr.apply(f)
0     2.5
1    <NA>
2     4.5
dtype: float64

UDFs manipulating string data are allowed, as long as they neither modify strings in place nor create new strings. For example, the following UDF is allowed:

>>> def f(st):
...     if len(st) == 0:
...         return -1
...     elif st.startswith('a'):
...         return 1
...     elif 'example' in st:
...         return 2
...     else:
...         return 3
...
>>> sr = cudf.Series(['', 'abc', 'some_example'])
>>> sr.apply(f)  
0   -1
1    1
2    2
dtype: int64

However, the following UDF is not allowed since it includes an operation that requires the creation of a new string: a call to the upper method. Methods that are not supported in this manner will raise an AttributeError.

>>> def f(st):
...     new = st.upper()
...     return 'ABC' in new
...
>>> sr.apply(f)  

For a complete list of supported functions and methods that may be used to manipulate string data, see the UDF guide, <https://docs.rapids.ai/api/cudf/stable/user_guide/guide-to-udfs.html>

count(level=None)#

Return number of non-NA/null observations in the Series

Returns#

int

Number of non-null values in the Series.

Notes#

The level parameter is currently not supported.

Examples#

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.count()
5
mode(dropna=True)#

Return the mode(s) of the dataset.

Always returns Series even if only one value is returned.

Parameters#

dropnabool, default True

Don’t consider counts of NA/NaN/NaT.

Returns#

Series

Modes of the Series in sorted order.

Examples#

>>> import cudf
>>> series = cudf.Series([7, 6, 5, 4, 3, 2, 1])
>>> series
0    7
1    6
2    5
3    4
4    3
5    2
6    1
dtype: int64
>>> series.mode()
0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

We can include <NA> values in mode by passing dropna=False.

>>> series = cudf.Series([7, 4, 3, 3, 7, None, None])
>>> series
0       7
1       4
2       3
3       3
4       7
5    <NA>
6    <NA>
dtype: int64
>>> series.mode()
0    3
1    7
dtype: int64
>>> series.mode(dropna=False)
0       3
1       7
2    <NA>
dtype: int64
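
The two results above follow from a simple counting rule: every value that reaches the highest count is a mode, and dropna decides whether nulls take part in the count. A plain-Python sketch of that rule, with None as the null and mode as a hypothetical helper:

```python
from collections import Counter


def mode(values, dropna=True):
    """Return all most-frequent values, sorted, with a null mode last."""
    if dropna:
        values = [v for v in values if v is not None]
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    modes = [v for v, c in counts.items() if c == top]
    # Sort the non-null modes; a null mode, if present, goes last.
    return sorted(v for v in modes if v is not None) + [
        v for v in modes if v is None
    ]


data = [7, 4, 3, 3, 7, None, None]
print(mode(data))                # [3, 7]
print(mode(data, dropna=False))  # [3, 7, None]
```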
round(decimals=0, how='half_even')#

Round to a variable number of decimal places.

Parameters#

decimalsint, dict, Series

Number of decimal places to round each column to. This parameter must be an int for a Series. For a DataFrame, a dict or a Series are also valid inputs. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

howstr, optional

Type of rounding. Can be either “half_even” (default) or “half_up” rounding.

Returns#

Series or DataFrame

A Series or DataFrame with the affected columns rounded to the specified number of decimal places.

Examples#

Series

>>> s = cudf.Series([0.1, 1.4, 2.9])
>>> s.round()
0    0.0
1    1.0
2    3.0
dtype: float64
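
The two how options only differ on exact ties. The sketch below illustrates the difference with Python's decimal module, on the assumption that “half_even” and “half_up” correspond to decimal's ROUND_HALF_EVEN and ROUND_HALF_UP for positive values; round_value and MODES are hypothetical names, and inputs are given as strings so that the ties are exact rather than subject to binary floating-point representation.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Assumed correspondence between the `how` strings and decimal's modes.
MODES = {"half_even": ROUND_HALF_EVEN, "half_up": ROUND_HALF_UP}


def round_value(text, decimals=0, how="half_even"):
    """Round a decimal string to `decimals` places using the given tie rule."""
    quantum = Decimal(1).scaleb(-decimals)  # 10 ** -decimals
    return Decimal(text).quantize(quantum, rounding=MODES[how])


for text in ("0.5", "1.5", "2.5"):
    print(text, round_value(text), round_value(text, how="half_up"))
# 0.5 0 1
# 1.5 2 2
# 2.5 2 3
```

Half-even sends ties to the nearest even digit, which avoids a systematic upward bias when many ties are rounded.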

DataFrame

>>> df = cudf.DataFrame(
...     [(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
...     columns=['dogs', 'cats'],
... )
>>> df
   dogs  cats
0  0.21  0.32
1  0.01  0.67
2  0.66  0.03
3  0.21  0.18

By providing an integer each column is rounded to the same number of decimal places.

>>> df.round(1)
   dogs  cats
0   0.2   0.3
1   0.0   0.7
2   0.7   0.0
3   0.2   0.2

With a dict, the number of places for specific columns can be specified with the column names as keys and the number of decimal places as values.

>>> df.round({'dogs': 1, 'cats': 0})
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

Using a Series, the number of places for specific columns can be specified with the column names as the index and the number of decimal places as the values.

>>> decimals = cudf.Series([0, 1], index=['cats', 'dogs'])
>>> df.round(decimals)
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0
cov(other, min_periods=None)#

Compute covariance with Series, excluding missing values.

Parameters#

otherSeries

Series with which to compute the covariance.

Returns#

float

Covariance between Series and other normalized by N-1 (unbiased estimator).

Notes#

min_periods parameter is not yet supported.

Examples#

>>> import cudf
>>> ser1 = cudf.Series([0.9, 0.13, 0.62])
>>> ser2 = cudf.Series([0.12, 0.26, 0.51])
>>> ser1.cov(ser2)
-0.015750000000000004
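
For reference, the N-1 normalization can be written out by hand. This is a minimal plain-Python sketch of the sample covariance formula, with pairs containing a None dropped to mirror the exclusion of missing values; the cov helper is hypothetical and is not the library's code path.

```python
def cov(xs, ys):
    """Sample covariance of two equal-length sequences, normalized by N-1."""
    # Keep only positions where both values are present.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


print(cov([0.9, 0.13, 0.62], [0.12, 0.26, 0.51]))  # approximately -0.01575
```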
transpose()#

Return the transpose, which is by definition self.

property T#

Return the transpose, which is by definition self.

duplicated(keep='first')#

Indicate duplicate Series values.

Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except the first or all except the last occurrence of duplicates can be indicated.

Parameters#

keep{‘first’, ‘last’, False}, default ‘first’

Determines which duplicates (if any) to mark:

  • 'first' : Mark duplicates as True except for the first occurrence.

  • 'last' : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Returns#

Series[bool]

Series indicating whether each value has occurred in the preceding values.

See Also#

Index.duplicated : Equivalent method on cudf.Index.

DataFrame.duplicated : Equivalent method on cudf.DataFrame.

Series.drop_duplicates : Remove duplicate values from Series.

Examples#

By default, for each set of duplicated values, the first occurrence is set to False and all others to True:

>>> import cudf
>>> animals = cudf.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> animals.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool

which is equivalent to

>>> animals.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True:

>>> animals.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool

By setting keep to False, all duplicates are marked True:

>>> animals.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool
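
The three keep modes can be expressed compactly in plain Python. The following sketch models the marking rule only; duplicated is a hypothetical helper working on a list, not the library's GPU implementation.

```python
from collections import Counter


def duplicated(values, keep="first"):
    """Return a list of booleans marking duplicates according to `keep`."""
    if keep is False:
        # Every value that occurs more than once is marked.
        counts = Counter(values)
        return [counts[v] > 1 for v in values]
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last' or False")
    # For 'last', scan from the end so the final occurrence is seen first.
    order = values if keep == "first" else values[::-1]
    seen, flags = set(), []
    for v in order:
        flags.append(v in seen)
        seen.add(v)
    return flags if keep == "first" else flags[::-1]


animals = ["lama", "cow", "lama", "beetle", "lama"]
print(duplicated(animals))               # [False, False, True, False, True]
print(duplicated(animals, keep="last"))  # [True, False, True, False, False]
print(duplicated(animals, keep=False))   # [True, False, True, False, True]
```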
corr(other, method='pearson', min_periods=None)#

Calculates the sample correlation between two Series, excluding missing values.

Parameters#

otherSeries

Series with which to compute the correlation.

method{‘pearson’, ‘spearman’}, default ‘pearson’

Method used to compute correlation:

  • pearson : Standard correlation coefficient

  • spearman : Spearman rank correlation

min_periodsint, optional

Minimum number of observations needed to have a valid result.

Examples#

>>> import cudf
>>> ser1 = cudf.Series([0.9, 0.13, 0.62])
>>> ser2 = cudf.Series([0.12, 0.26, 0.51])
>>> ser1.corr(ser2, method="pearson")
-0.20454263717316112
>>> ser1.corr(ser2, method="spearman")
-0.5
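
As a rough host-side illustration of the two methods, the sketch below computes the Pearson coefficient directly and the Spearman coefficient as Pearson applied to ranks. The helper names (pearson, average_ranks, corr) are hypothetical, plain lists stand in for Series, and None stands in for a missing value.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def corr(x, y, method="pearson"):
    # Drop pairs in which either side is missing, as the method description states.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs, ys = [a for a, _ in pairs], [b for _, b in pairs]
    if method == "spearman":
        xs, ys = average_ranks(xs), average_ranks(ys)
    return pearson(xs, ys)


ser1 = [0.9, 0.13, 0.62]
ser2 = [0.12, 0.26, 0.51]
r_pearson = corr(ser1, ser2)
r_spearman = corr(ser1, ser2, method="spearman")
```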
autocorr(lag=1)#

Compute the lag-N autocorrelation. This method computes the Pearson correlation between the Series and its shifted self.

Parameters#

lagint, default 1

Number of lags to apply before performing autocorrelation.

Returns#

resultfloat

The Pearson correlation between self and self.shift(lag).

Examples#

>>> import cudf
>>> s = cudf.Series([0.25, 0.5, 0.2, -0.05, 0.17])
>>> s.autocorr()
0.1438853844...
>>> s.autocorr(lag=2)
-0.9647548490...
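
The relationship between autocorr and shift can be sketched on a plain list: pairing each element with the one lag positions earlier is the same as correlating the data with its shifted self, once the unpaired leading elements are dropped. The autocorr function below is a hypothetical host-side model that assumes a non-negative lag and no missing values.

```python
import math


def autocorr(values, lag=1):
    """Pearson correlation between values and values shifted by `lag` (lag >= 0)."""
    x = values[lag:]                    # elements that have a partner `lag` steps back
    y = values[:len(values) - lag]      # those partners
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


s = [0.25, 0.5, 0.2, -0.05, 0.17]
lag1 = autocorr(s)
lag2 = autocorr(s, lag=2)
```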
isin(values)#

Check whether values are contained in Series.

Parameters#

valuesset or list-like

The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

Returns#

resultSeries

Series of booleans indicating if each element is in values.

Raises#

TypeError

If values is a string

Examples#

>>> import cudf
>>> s = cudf.Series(['lama', 'cow', 'lama', 'beetle', 'lama',
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:

>>> s.isin(['lama'])
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

Strings and integers are distinct and are therefore not comparable:

>>> cudf.Series([1]).isin(['1'])
0    False
dtype: bool
>>> cudf.Series([1.1]).isin(['1.1'])
0    False
dtype: bool
unique()#

Returns unique values of this Series.

Returns#

Series

A series with only the unique values.

Examples#

>>> import cudf
>>> series = cudf.Series(['a', 'a', 'b', None, 'b', None, 'c'])
>>> series
0       a
1       a
2       b
3    <NA>
4       b
5    <NA>
6       c
dtype: object
>>> series.unique()
0       a
1       b
2    <NA>
3       c
dtype: object
value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)#

Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters#

normalizebool, default False

If True then the object returned will contain the relative frequencies of the unique values.

sortbool, default True

Sort by frequencies.

ascendingbool, default False

Sort in ascending order.

binsint, optional

Rather than counting values, group them into half-open bins. This only works with numeric data.

dropnabool, default True

Don’t include counts of NaN and None.

Returns#

resultSeries

Series containing counts of unique values.

See Also#

Series.count

Number of non-NA elements in a Series.

cudf.DataFrame.count

Number of non-NA elements in a DataFrame.

Examples#

>>> import cudf
>>> sr = cudf.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, None])
>>> sr
0     1.0
1     2.0
2     2.0
3     3.0
4     3.0
5     3.0
6    <NA>
dtype: float64
>>> sr.value_counts()
3.0    3
2.0    2
1.0    1
dtype: int64

The order of the counts can be changed by passing ascending=True:

>>> sr.value_counts(ascending=True)
1.0    1
2.0    2
3.0    3
dtype: int64

With normalize set to True, returns the relative frequency by dividing all values by the sum of values.

>>> sr.value_counts(normalize=True)
3.0    0.500000
2.0    0.333333
1.0    0.166667
dtype: float64

To include NA value counts, pass dropna=False:

>>> sr = cudf.Series([1.0, 2.0, 2.0, 3.0, None, 3.0, 3.0, None])
>>> sr
0     1.0
1     2.0
2     2.0
3     3.0
4    <NA>
5     3.0
6     3.0
7    <NA>
dtype: float64
>>> sr.value_counts(dropna=False)
3.0     3
2.0     2
<NA>    2
1.0     1
dtype: int64

Grouping numeric values into bins with bins=3:

>>> import numpy as np
>>> s = cudf.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(bins=3)
(2.0, 3.0]      2
(0.996, 2.0]    2
(3.0, 4.0]      1
dtype: int64
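
The effect of normalize, ascending, and dropna can be modeled with collections.Counter. This is a hedged host-side sketch: the value_counts helper is hypothetical, None stands in for a null, bins is not modeled, and the order of tied counts here (first seen wins) is an assumption that the library may not guarantee.

```python
from collections import Counter


def value_counts(values, normalize=False, ascending=False, dropna=True):
    """Return (value, count) pairs ordered by count."""
    items = [v for v in values if v is not None] if dropna else list(values)
    counts = Counter(items)
    total = len(items)
    # sorted() is stable, so values with equal counts keep first-seen order.
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=not ascending)
    if normalize:
        # Relative frequency: each count divided by the number of counted items.
        return [(value, count / total) for value, count in ordered]
    return ordered


sr = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, None]
default_counts = value_counts(sr)
ascending_counts = value_counts(sr, ascending=True)
relative = value_counts(sr, normalize=True)
with_nulls = value_counts([1.0, 2.0, 2.0, 3.0, None, 3.0, 3.0, None], dropna=False)
```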
quantile(q=0.5, interpolation='linear', exact=True, quant_index=True)#

Return values at the given quantile.

Parameters#

qfloat or array-like, default 0.5 (50% quantile)

0 <= q <= 1, the quantile(s) to compute

interpolation{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}

This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

  • linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

  • lower: i.

  • higher: j.

  • nearest: i or j whichever is nearest.

  • midpoint: (i + j) / 2.

exactboolean

Whether to use approximate or exact quantile algorithm.

quant_indexboolean

Whether to use the list of quantiles as index.

Returns#

float or Series

If q is an array, a Series will be returned where the index is q and the values are the quantiles, otherwise a float will be returned.

Examples#

>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4])
>>> series
0    1
1    2
2    3
3    4
dtype: int64
>>> series.quantile(0.5)
2.5
>>> series.quantile([0.25, 0.5, 0.75])
0.25    1.75
0.50    2.50
0.75    3.25
dtype: float64
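
The interpolation options can be illustrated on a sorted list. The sketch below assumes the quantile position is q * (n - 1), the convention used by NumPy and pandas; the quantile helper is hypothetical, and its tie-breaking rule for 'nearest' is an assumption rather than documented behavior.

```python
import math


def quantile(values, q, interpolation="linear"):
    """Quantile of a list of numbers using the documented interpolation names."""
    if not 0 <= q <= 1:
        raise ValueError("q must satisfy 0 <= q <= 1")
    data = sorted(values)
    pos = q * (len(data) - 1)              # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    i, j = data[lo], data[hi]              # the two surrounding data points
    fraction = pos - lo
    if interpolation == "linear":
        return i + (j - i) * fraction
    if interpolation == "lower":
        return i
    if interpolation == "higher":
        return j
    if interpolation == "midpoint":
        return (i + j) / 2
    if interpolation == "nearest":
        return i if fraction <= 0.5 else j  # tie-breaking is an assumption
    raise ValueError(f"unknown interpolation: {interpolation!r}")


series = [1, 2, 3, 4]
median = quantile(series, 0.5)
quartiles = [quantile(series, q) for q in (0.25, 0.5, 0.75)]
```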
describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)#

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters#

percentileslist-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include‘all’, list-like of dtypes or None (default), optional

A list of data types to include in the result. Ignored for Series. Here are the options:

  • ‘all’ : All columns of the input will be included in the output.

  • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types, submit numpy.number. To limit it instead to object columns, submit the object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.

  • None (default) : The result will include all numeric columns.

excludelist-like of dtypes or None (default), optional

A list of data types to omit from the result. Ignored for Series. Here are the options:

  • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types, submit numpy.number. To exclude object columns, submit the object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.

  • None (default) : The result will exclude nothing.

datetime_is_numericbool, default False

For DataFrame input, this also controls whether datetime columns are included by default.

Deprecated since version 23.04: datetime_is_numeric is deprecated and will be removed in a future version of cudf.

Returns#

output_frameSeries or DataFrame

Summary statistics of the Series or DataFrame provided.

Notes#

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For string dtype or datetime dtype, the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples#

Describing a Series containing numeric values.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> s
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
>>> s.describe()
count    10.00000
mean      5.50000
std       3.02765
min       1.00000
25%       3.25000
50%       5.50000
75%       7.75000
max      10.00000
dtype: float64
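
The numeric summary can be reproduced on the host with the standard statistics module, which may help when checking what each row means. The describe_numeric helper is hypothetical; it assumes the sample standard deviation and linearly interpolated quartiles (the "inclusive" method of statistics.quantiles), which agree with the values printed above.

```python
import statistics


def describe_numeric(values):
    """Summary statistics for a list of numbers, ignoring None."""
    data = [v for v in values if v is not None]
    q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": float(len(data)),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),   # sample standard deviation (n - 1)
        "min": float(min(data)),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": float(max(data)),
    }


summary = describe_numeric([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```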

Describing a categorical Series.

>>> s = cudf.Series(['a', 'b', 'a', 'b', 'c', 'a'], dtype='category')
>>> s
0    a
1    b
2    a
3    b
4    c
5    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.describe()
count     6
unique    3
top       a
freq      3
dtype: object

Describing a timestamp Series.

>>> import numpy as np
>>> s = cudf.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s
0   2000-01-01
1   2010-01-01
2   2010-01-01
dtype: datetime64[s]
>>> s.describe()
count                     3
mean    2006-09-01 08:00:00
min     2000-01-01 00:00:00
25%     2004-12-31 12:00:00
50%     2010-01-01 00:00:00
75%     2010-01-01 00:00:00
max     2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = cudf.DataFrame({"categorical": cudf.Series(['d', 'e', 'f'],
...                         dtype='category'),
...                      "numeric": [1, 2, 3],
...                      "object": ['a', 'b', 'c']
... })
>>> df
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')
       categorical numeric object
count            3     3.0      3
unique           3    <NA>      3
top              d    <NA>      a
freq             1    <NA>      1
mean          <NA>     2.0   <NA>
std           <NA>     1.0   <NA>
min           <NA>     1.0   <NA>
25%           <NA>     1.5   <NA>
50%           <NA>     2.0   <NA>
75%           <NA>     2.5   <NA>
max           <NA>     3.0   <NA>

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              d      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])
       categorical numeric
count            3     3.0
unique           3    <NA>
top              d    <NA>
freq             1    <NA>
mean          <NA>     2.0
std           <NA>     1.0
min           <NA>     1.0
25%           <NA>     1.5
50%           <NA>     2.0
75%           <NA>     2.5
max           <NA>     3.0
digitize(bins, right=False)#

Return the indices of the bins to which each value belongs.

Notes#

Monotonicity of bins is assumed and not checked.

Parameters#

binsnp.array

1-D monotonically increasing array with the same type as this Series.

rightbool

Indicates whether each interval includes the right or the left bin edge.

Returns#

A new Series containing the indices.

Examples#

>>> import cudf
>>> s = cudf.Series([0.2, 6.4, 3.0, 1.6])
>>> bins = cudf.Series([0.0, 1.0, 2.5, 4.0, 10.0])
>>> inds = s.digitize(bins)
>>> inds
0    1
1    4
2    3
3    2
dtype: int32
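
For increasing bins, the bin index is a binary search. The sketch below uses the standard bisect module and assumes the same edge convention as numpy.digitize (right=False means bins[i-1] <= x < bins[i]); the digitize helper is hypothetical and, like the method, does not check that bins is monotonic.

```python
from bisect import bisect_left, bisect_right


def digitize(values, bins, right=False):
    """Index of the bin each value falls into, for increasing `bins`."""
    # right=False: a value equal to an edge belongs to the bin that starts there.
    # right=True:  a value equal to an edge belongs to the bin that ends there.
    locate = bisect_left if right else bisect_right
    return [locate(bins, v) for v in values]


bins = [0.0, 1.0, 2.5, 4.0, 10.0]
inds = digitize([0.2, 6.4, 3.0, 1.6], bins)
edge_left_closed = digitize([1.0], bins)               # 1.0 starts bin 2
edge_right_closed = digitize([1.0], bins, right=True)  # 1.0 ends bin 1
```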
diff(periods=1)#

First discrete difference of element.

Calculates the difference of a Series element compared with another element in the Series (default is element in previous row).

Parameters#

periodsint, default 1

Periods to shift for calculating difference, accepts negative values.

Returns#

Series

First differences of the Series.

Examples#

>>> import cudf
>>> series = cudf.Series([1, 1, 2, 3, 5, 8])
>>> series
0    1
1    1
2    2
3    3
4    5
5    8
dtype: int64

Difference with previous row

>>> series.diff()
0    <NA>
1       0
2       1
3       1
4       2
5       3
dtype: int64

Difference with 3rd previous row

>>> series.diff(periods=3)
0    <NA>
1    <NA>
2    <NA>
3       2
4       4
5       6
dtype: int64

Difference with following row

>>> series.diff(periods=-1)
0       0
1      -1
2      -1
3      -2
4      -3
5    <NA>
dtype: int64
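
The role of periods, including negative values, can be seen in a small list-based model. The diff helper below is a hypothetical sketch that assumes the input has no missing values; None marks positions that have no partner element.

```python
def diff(values, periods=1):
    """Difference between each element and the one `periods` positions earlier."""
    n = len(values)
    out = []
    for i in range(n):
        j = i - periods                  # partner index; a negative period looks ahead
        out.append(values[i] - values[j] if 0 <= j < n else None)
    return out


series = [1, 1, 2, 3, 5, 8]
previous_row = diff(series)
third_previous = diff(series, periods=3)
following_row = diff(series, periods=-1)
```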
groupby(by=None, axis=0, level=None, as_index=True, sort=<no_default>, group_keys=False, squeeze=False, observed=True, dropna=True)#

Group using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters#

bymapping, function, label, or list of labels

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict values will be used to determine the groups (the Series’ values are first aligned; see the .align() method). If a cupy array is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Note that a tuple is interpreted as a (single) key.

levelint, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels.

as_indexbool, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

sortbool, default False

Sort result by group key. Unlike pandas, cudf defaults to False for better performance. Note that this does not influence the order of observations within each group; groupby preserves the order of rows within each group.

group_keysbool, optional

When calling apply and the by argument produces a like-indexed result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise. This argument has no effect if the result produced is not like-indexed with respect to the input.

Returns#

SeriesGroupBy

Returns a SeriesGroupBy object that contains information about the groups.

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([390., 350., 30., 20.],
...                 index=['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...                 name="Max Speed")
>>> ser
Falcon    390.0
Falcon    350.0
Parrot     30.0
Parrot     20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(ser > 100).mean()
Max Speed
False     25.0
True     370.0
Name: Max Speed, dtype: float64

DataFrame

>>> import cudf
>>> import pandas as pd
>>> df = cudf.DataFrame({
...     'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...     'Max Speed': [380., 370., 24., 26.],
... })
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = cudf.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...     index=index)
>>> df
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()
        Max Speed
Type
Wild         185.0
Captive      210.0
>>> df = cudf.DataFrame({'A': 'a a b'.split(),
...                      'B': [1,2,3],
...                      'C': [4,6,5]})
>>> g1 = df.groupby('A', group_keys=False)
>>> g2 = df.groupby('A', group_keys=True)

Notice that g1 and g2 each have two groups, a and b, and differ only in their group_keys argument. Calling apply in various ways, we can get different grouping results:

>>> g1[['B', 'C']].apply(lambda x: x / x.sum())
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0

In the above, the groups are not part of the index. We can have them included by using g2 where group_keys=True:

>>> g2[['B', 'C']].apply(lambda x: x / x.sum())
            B    C
A
a 0  0.333333  0.4
  1  0.666667  0.6
b 2  1.000000  1.0
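
The split, apply, and combine steps described for groupby can be sketched with a dictionary of lists. The groupby_mean helper is hypothetical and only models a mean aggregation; the keys play the role of by (or of an index level), and sort mirrors the sort parameter.

```python
def groupby_mean(keys, values, sort=False):
    """Group `values` by the parallel list `keys` and average each group."""
    groups = {}
    for key, value in zip(keys, values):          # split
        groups.setdefault(key, []).append(value)
    names = sorted(groups) if sort else list(groups)
    # apply the aggregation to each group, then combine into one mapping
    return {name: sum(groups[name]) / len(groups[name]) for name in names}


labels = ["Falcon", "Falcon", "Parrot", "Parrot"]
speeds = [390.0, 350.0, 30.0, 20.0]
by_label = groupby_mean(labels, speeds)
by_flag = groupby_mean([v > 100 for v in speeds], speeds, sort=True)
```

With sort=False the groups appear in first-seen order in this model, which is an assumption of the sketch and not a guarantee stated for the library.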
rename(index=None, copy=True)#

Alter the Series name.

Change Series.name to a scalar value.

Parameters#

indexScalar, optional

Scalar to alter the Series.name attribute

copyboolean, default True

Also copy underlying data

Returns#

Series

Notes#

Differences from pandas:

  • Only scalar values are supported for changing the name attribute.

  • The inplace and level parameters are not supported.

Examples#

>>> import cudf
>>> series = cudf.Series([10, 20, 30])
>>> series
0    10
1    20
2    30
dtype: int64
>>> series.name
>>> renamed_series = series.rename('numeric_series')
>>> renamed_series
0    10
1    20
2    30
Name: numeric_series, dtype: int64
>>> renamed_series.name
'numeric_series'
add_prefix(prefix)#

Prefix labels with string prefix.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters#

prefixstr

The string to add before each label.

Returns#

Series or DataFrame

New Series with updated labels or DataFrame with updated labels.

See Also#

Series.add_suffix

Suffix row labels with string ‘suffix’.

DataFrame.add_suffix

Suffix column labels with string ‘suffix’.

Examples#

Series

>>> s = cudf.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_prefix('item_')
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64

DataFrame

>>> df = cudf.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_prefix('col_')
   col_A  col_B
0      1      3
1      2      4
2      3      5
3      4      6
add_suffix(suffix)#

Suffix labels with string suffix.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters#

suffixstr

The string to add after each label.

Returns#

Series or DataFrame

New Series with updated labels or DataFrame with updated labels.

See Also#

Series.add_prefix

Prefix row labels with string ‘prefix’.

DataFrame.add_prefix

Prefix column labels with string ‘prefix’.

Examples#

Series

>>> s = cudf.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_suffix('_item')
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64

DataFrame

>>> df = cudf.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})
>>> df
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_suffix('_col')
   A_col  B_col
0      1      3
1      2      4
2      3      5
3      4      6
keys()#

Return alias for index.

Returns#

Index

Index of the Series.

Examples#

>>> import cudf
>>> sr = cudf.Series([10, 11, 12, 13, 14, 15])
>>> sr
0    10
1    11
2    12
3    13
4    14
5    15
dtype: int64
>>> sr.keys()
RangeIndex(start=0, stop=6, step=1)
>>> sr = cudf.Series(['a', 'b', 'c'])
>>> sr
0    a
1    b
2    c
dtype: object
>>> sr.keys()
RangeIndex(start=0, stop=3, step=1)
>>> sr = cudf.Series([1, 2, 3], index=['a', 'b', 'c'])
>>> sr
a    1
b    2
c    3
dtype: int64
>>> sr.keys()
StringIndex(['a' 'b' 'c'], dtype='object')
explode(ignore_index=False)#

Transform each element of a list-like to a row, replicating index values.

Parameters#

ignore_indexbool, default False

If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns#

Series

Examples#

>>> import cudf
>>> s = cudf.Series([[1, 2, 3], [], None, [4, 5]])
>>> s
0    [1, 2, 3]
1           []
2         None
3       [4, 5]
dtype: list
>>> s.explode()
0       1
0       2
0       3
1    <NA>
2    <NA>
3       4
3       5
dtype: int64
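
The way explode replicates index labels, and how it treats empty lists and nulls, can be modeled with (label, value) pairs. The explode helper below is a hypothetical host-side sketch of the behavior shown in the example above.

```python
def explode(index, values, ignore_index=False):
    """Expand list elements into rows, repeating the row label for each element."""
    rows = []
    for label, item in zip(index, values):
        if item is None or len(item) == 0:
            rows.append((label, None))   # an empty list or a null yields one null row
        else:
            rows.extend((label, element) for element in item)
    if ignore_index:
        # Relabel rows 0, 1, ..., n - 1.
        rows = [(i, value) for i, (_, value) in enumerate(rows)]
    return rows


exploded = explode([0, 1, 2, 3], [[1, 2, 3], [], None, [4, 5]])
relabeled = explode([0, 1, 2, 3], [[1, 2, 3], [], None, [4, 5]], ignore_index=True)
```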
pct_change(periods=1, fill_method='ffill', limit=None, freq=None)#

Calculates the percent change between sequential elements in the Series.

Parameters#

periodsint, default 1

Periods to shift for forming percent change.

fill_methodstr, default ‘ffill’

How to handle NAs before computing percent changes.

limitint, optional

The number of consecutive NAs to fill before stopping. Not yet implemented.

freqstr, optional

Increment to use from time series API. Not yet implemented.

Returns#

Series
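
No example is given for this method, so the following is only a sketch of the arithmetic under stated assumptions: missing values (None) are forward-filled first, as fill_method='ffill' suggests, and each result is (current - previous) / previous, where previous is the element periods positions earlier. The pct_change helper is hypothetical, models only 'ffill', and does not guard against a zero denominator.

```python
def pct_change(values, periods=1, fill_method="ffill"):
    """Fractional change between each element and the one `periods` earlier."""
    filled = list(values)
    if fill_method == "ffill":
        last = None
        for i, v in enumerate(filled):
            if v is None:
                filled[i] = last         # carry the last valid value forward
            else:
                last = v
    out = []
    for i, current in enumerate(filled):
        j = i - periods
        previous = filled[j] if 0 <= j < len(filled) else None
        if current is None or previous is None:
            out.append(None)
        else:
            out.append((current - previous) / previous)
    return out


changes = pct_change([100.0, 110.0, None, 121.0])
```

Here the missing third value is filled with 110.0 before the changes are computed, so the third result is 0.0 rather than missing.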

where(cond, other=None, inplace=False)#

Replace values where the condition is False.

Parameters#

condbool Series/DataFrame, array-like

Where cond is True, keep the original value. Where False, replace with corresponding value from other. Callables are not supported.

otherscalar, list of scalars, Series/DataFrame

Entries where cond is False are replaced with the corresponding value from other. Callables are not supported. Default is None.

A DataFrame accepts only a scalar, an array-like of scalars, or a DataFrame with the same dimensions as self.

A Series accepts only a scalar or a Series-like object of the same length.

inplacebool, default False

Whether to perform the operation in place on the data.

Returns#

Same type as caller

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"A":[1, 4, 5], "B":[3, 5, 8]})
>>> df.where(df % 2 == 0, [-1, -1])
   A  B
0 -1 -1
1  4 -1
2 -1  8
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.where(ser > 2, 10)
0     4
1     3
2    10
3    10
4    10
dtype: int64
>>> ser.where(ser > 2)
0       4
1       3
2    <NA>
3    <NA>
4    <NA>
dtype: int64
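
The selection rule for a Series can be written as a single pass over parallel lists. The where helper below is a hypothetical host-side model: cond is a list of booleans, other is a scalar or a list of the same length, and None stands in for a null.

```python
def where(values, cond, other=None):
    """Keep values[i] where cond[i] is True, otherwise take the replacement."""
    if isinstance(other, (list, tuple)):
        if len(other) != len(values):
            raise ValueError("other must have the same length as values")
        replacements = other
    else:
        replacements = [other] * len(values)   # broadcast a scalar (or None)
    return [v if keep else r for v, keep, r in zip(values, cond, replacements)]


ser = [4, 3, 2, 1, 0]
mask = [v > 2 for v in ser]
filled = where(ser, mask, 10)
nulled = where(ser, mask)
```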
abs()#

Return a Series/DataFrame with absolute numeric value of each element.

This function only applies to elements that are all numeric.

Returns#

DataFrame/Series

Absolute value of each element.

Examples#

Absolute numeric values in a Series

>>> s = cudf.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64
add(other, level=None, fill_value=None, axis=0)#

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

Equivalent to frame + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or columns is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.add(b)
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.add(b, fill_value=0)
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
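
The interaction between index alignment and fill_value can be modeled with dictionaries keyed by label. In this hypothetical sketch a label absent from one operand and a None value are both treated as missing, the result uses the sorted union of labels, and a position missing on both sides stays missing, as described for fill_value.

```python
def add(a, b, fill_value=None):
    """Label-aligned addition of two dicts with optional fill for missing values."""
    result = {}
    for label in sorted(set(a) | set(b)):
        x, y = a.get(label), b.get(label)
        if x is None and y is None:
            result[label] = None             # missing on both sides stays missing
            continue
        if fill_value is not None:
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        result[label] = None if x is None or y is None else x + y
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
plain = add(a, b)
filled = add(a, b, fill_value=0)
```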
backfill(value=None, axis=None, inplace=None, limit=None)#

Synonym for Series.fillna() with method='bfill'.

Deprecated since version 23.06: Use DataFrame.bfill/Series.bfill instead.

Returns#

Object with missing values filled or None if inplace=True.

bfill(value=None, axis=None, inplace=None, limit=None)#

Synonym for Series.fillna() with method='bfill'.

Returns#

Object with missing values filled or None if inplace=True.
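
Backward filling replaces each missing value with the next valid value that follows it. A minimal list-based sketch, with a hypothetical bfill helper and None standing in for a null; trailing nulls stay null because nothing follows them. The limit parameter is not modeled.

```python
def bfill(values):
    """Fill each None with the next non-None value to its right."""
    out = list(values)
    next_valid = None
    for i in range(len(out) - 1, -1, -1):   # walk from the end toward the start
        if out[i] is None:
            out[i] = next_valid
        else:
            next_valid = out[i]
    return out


filled = bfill([None, 1, None, None, 4, None])
```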

clip(lower=None, upper=None, inplace=False, axis=1)#

Trim values at input threshold(s).

Assigns values outside the boundary to the boundary values. Thresholds can be scalars or array-like; in the latter case the clipping is performed element-wise along the specified axis. Currently only axis=1 is supported.

Parameters#

lowerscalar or array_like, default None

Minimum threshold value. All values below this threshold will be set to it. If it is None, there will be no clipping based on lower. In case of Series/Index, lower is expected to be a scalar or an array of size 1.

upperscalar or array_like, default None

Maximum threshold value. All values above this threshold will be set to it. If it is None, there will be no clipping based on upper. In case of Series, upper is expected to be a scalar or an array of size 1.

inplacebool, default False

Whether to perform the operation in place on the data.

Returns#

Clipped DataFrame/Series/Index/MultiIndex

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"a":[1, 2, 3, 4], "b":['a', 'b', 'c', 'd']})
>>> df.clip(lower=[2, 'b'], upper=[3, 'c'])
   a  b
0  2  b
1  2  b
2  3  c
3  3  c
>>> df.clip(lower=None, upper=[3, 'c'])
   a  b
0  1  a
1  2  b
2  3  c
3  3  c
>>> df.clip(lower=[2, 'b'], upper=None)
   a  b
0  2  b
1  2  b
2  3  c
3  4  d
>>> df.clip(lower=2, upper=3, inplace=True)
>>> df
   a  b
0  2  2
1  2  3
2  3  3
3  3  3
>>> import cudf
>>> sr = cudf.Series([1, 2, 3, 4])
>>> sr.clip(lower=2, upper=3)
0    2
1    2
2    3
3    3
dtype: int64
>>> sr.clip(lower=None, upper=3)
0    1
1    2
2    3
3    3
dtype: int64
>>> sr.clip(lower=2, upper=None, inplace=True)
>>> sr
0    2
1    2
2    3
3    4
dtype: int64
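
For scalar thresholds on a Series, clipping reduces to two comparisons per element. The clip helper below is a hypothetical list-based sketch; it returns a new list and does not model inplace or array-like thresholds.

```python
def clip(values, lower=None, upper=None):
    """Limit each value to the closed range [lower, upper]; None disables a bound."""
    out = []
    for v in values:
        if lower is not None and v < lower:
            v = lower        # raise values below the minimum threshold
        if upper is not None and v > upper:
            v = upper        # lower values above the maximum threshold
        out.append(v)
    return out


sr = [1, 2, 3, 4]
both = clip(sr, lower=2, upper=3)
upper_only = clip(sr, upper=3)
lower_only = clip(sr, lower=2)
```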
convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True, dtype_backend=None)#

Convert columns to the best possible nullable dtypes.

If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type. Otherwise, convert to an appropriate floating type.

All other dtypes are always returned as-is as all dtypes in cudf are nullable.

copy(deep: bool = True) → Self#

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below). When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Parameters#

deepbool, default True

Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.

Returns#

copySeries or DataFrame

Object type matches caller.

Examples#

>>> s = cudf.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64
>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

Shallow copy versus default (deep) copy:

>>> s = cudf.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Updates to the data shared by the shallow copy and the original are reflected in both; the deep copy remains unchanged.

>>> s['a'] = 3
>>> shallow['b'] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64
cummax(axis=None, skipna=True)#

Return cumulative max of the Series.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

Series

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cummax()
0    1
1    5
2    5
3    5
4    5
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cummax()
   a   b
0  1   7
1  2   8
2  3   9
3  4  10
cummin(axis=None, skipna=True)#

Return cumulative min of the Series.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

Series

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cummin()
0    1
1    1
2    1
3    1
4    1
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cummin()
   a  b
0  1  7
1  1  7
2  1  7
3  1  7
cumprod(axis=None, skipna=True)#

Return cumulative product of the Series.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

Series

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cumprod()
0      1
1      5
2     10
3     40
4    120
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cumprod()
    a     b
0   1     7
1   2    56
2   6   504
3  24  5040
cumsum(axis=None, skipna=True)#

Return cumulative sum of the Series.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns#

Series

Examples#

Series

>>> import cudf
>>> ser = cudf.Series([1, 5, 2, 4, 3])
>>> ser.cumsum()
0     1
1     6
2     8
3    12
4    15
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.cumsum()
    a   b
0   1   7
1   3  15
2   6  24
3  10  34
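
The four cumulative methods share one pattern: a running value combined with each new element by max, min, multiplication, or addition. The sketch below models that pattern on a list. The cumulative helper is hypothetical, and its treatment of skipna (nulls are skipped but stay null when skipna=True, and every position from the first null onward becomes null when skipna=False) is an assumption borrowed from pandas, not something this page states.

```python
import operator


def cumulative(values, op, skipna=True):
    """Running reduction of `values` with the binary function `op`."""
    out, running, blocked = [], None, False
    for v in values:
        if v is None:
            out.append(None)
            blocked = blocked or not skipna   # with skipna=False a null poisons the rest
        elif blocked:
            out.append(None)
        else:
            running = v if running is None else op(running, v)
            out.append(running)
    return out


data = [1, 5, 2, 4, 3]
running_max = cumulative(data, max)
running_min = cumulative(data, min)
running_prod = cumulative(data, operator.mul)
running_sum = cumulative(data, operator.add)
```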
div(other, level=None, fill_value=None, axis=0)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or columns is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.truediv(1)
           angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.truediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64
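
The alignment and fill_value rules can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: each Series is modelled as a dict from label to value, None stands in for <NA>, and the helper name flex_truediv is hypothetical. A label missing from one side is treated like a null on that side, and a result stays null when both sides are missing.

```python
import math


def flex_truediv(left, right, fill_value=None):
    """Divide two label->value mappings aligned on the sorted union of labels."""
    result = {}
    for label in sorted(set(left) | set(right)):
        x = left.get(label)   # None covers both "label absent" and "null value"
        y = right.get(label)
        if x is None and y is None:
            result[label] = None          # missing on both sides stays missing
            continue
        if fill_value is not None:
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        if x is None or y is None:
            result[label] = None          # no fill_value given, so null propagates
        elif y == 0:
            # Mimic IEEE float division instead of raising ZeroDivisionError.
            result[label] = math.nan if x == 0 else math.copysign(math.inf, x)
        else:
            result[label] = x / y
    return result


a = {'a': 1, 'b': 1, 'c': 1, 'd': None}
b = {'a': 1, 'b': None, 'd': 1, 'e': None}
print(flex_truediv(a, b))
# {'a': 1.0, 'b': None, 'c': None, 'd': None, 'e': None}
print(flex_truediv(a, b, fill_value=0))
# {'a': 1.0, 'b': inf, 'c': inf, 'd': 0.0, 'e': None}
```

The two printed results mirror the Series example above: without fill_value only label a has data on both sides, and with fill_value=0 the substituted zeros produce Inf where they land in the divisor.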
divide(other, level=None, fill_value=None, axis=0)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.truediv(1)
        angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.truediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64
dot(other, reflect=False)#

Get dot product of frame and other, (binary operator dot).

Among flexible wrappers (add, sub, mul, div, mod, pow, dot) to arithmetic operators: +, -, *, /, //, %, **, @.

Parameters#

otherSequence, Series, or DataFrame

Any multiple element data structure, or list-like object.

reflectbool, default False

If True, swap the order of the operands. See https://docs.python.org/3/reference/datamodel.html#object.__ror__ for more information on when this is necessary.

Returns#

scalar, Series, or DataFrame

The result of the operation.

Examples#

>>> import cudf
>>> df = cudf.DataFrame([[1, 2, 3, 4],
...                      [5, 6, 7, 8]])
>>> df @ df.T
    0    1
0  30   70
1  70  174
>>> s = cudf.Series([1, 1, 1, 1])
>>> df @ s
0    10
1    26
dtype: int64
>>> [1, 2, 3, 4] @ s
10
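
The arithmetic behind the three results above can be checked with a short pure-Python sketch. It only illustrates the sums of products involved and is not how the library computes them on the GPU. The helper names dot, matvec and matmul_transpose are hypothetical.

```python
def dot(left, right):
    """Dot product of two equal-length sequences of numbers."""
    if len(left) != len(right):
        raise ValueError("operands must have the same length")
    return sum(x * y for x, y in zip(left, right))


def matvec(rows, vector):
    """Matrix @ vector, where the matrix is given as a list of rows."""
    return [dot(row, vector) for row in rows]


def matmul_transpose(rows):
    """Matrix @ Matrix.T: entry (i, j) is the dot product of row i and row j."""
    return [[dot(r1, r2) for r2 in rows] for r1 in rows]


rows = [[1, 2, 3, 4], [5, 6, 7, 8]]
ones = [1, 1, 1, 1]
print(matmul_transpose(rows))   # [[30, 70], [70, 174]]
print(matvec(rows, ones))       # [10, 26]
print(dot([1, 2, 3, 4], ones))  # 10
```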
property empty#

Indicator whether DataFrame or Series is empty.

True if DataFrame/Series is entirely empty (no items), meaning any of the axes are of length 0.

Returns#

outbool

If DataFrame/Series is empty, return True, if not return False.

Notes#

If DataFrame/Series contains only null values, it is still not considered empty. See the example below.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'A' : []})
>>> df
Empty DataFrame
Columns: [A]
Index: []
>>> df.empty
True

If we only have null values in our DataFrame, it is not considered empty! We will need to drop the nulls to make the DataFrame empty:

>>> df = cudf.DataFrame({'A' : [None, None]})
>>> df
      A
0  <NA>
1  <NA>
>>> df.empty
False
>>> df.dropna().empty
True

Non-empty and empty Series example:

>>> s = cudf.Series([1, 2, None])
>>> s
0       1
1       2
2    <NA>
dtype: int64
>>> s.empty
False
>>> s = cudf.Series([])
>>> s
Series([], dtype: float64)
>>> s.empty
True
eq(other, level=None, fill_value=None, axis=0)#

Get Equal to of DataFrame or Series and other, element-wise (binary operator eq).

Equivalent to frame == other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.eq(1)
        angles  degrees
circle      False    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.eq(b)
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.eq(b, fill_value=0)
a    True
b   False
c   False
d   False
e    <NA>
dtype: bool
equals(other)#

Test whether two objects contain the same elements.

This function allows two objects to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type.

Parameters#

otherIndex, Series, DataFrame

The other object to be compared with.

Returns#

bool

True if all elements are the same in both objects, False otherwise.

Examples#

>>> import cudf

Comparing Series with equals:

>>> s = cudf.Series([1, 2, 3])
>>> other = cudf.Series([1, 2, 3])
>>> s.equals(other)
True
>>> different = cudf.Series([1.5, 2, 3])
>>> s.equals(different)
False

Comparing DataFrames with equals:

>>> df = cudf.DataFrame({1: [10], 2: [20]})
>>> df
    1   2
0  10  20
>>> exactly_equal = cudf.DataFrame({1: [10], 2: [20]})
>>> exactly_equal
    1   2
0  10  20
>>> df.equals(exactly_equal)
True

For two DataFrames to compare equal, the types of column values must be equal, but the types of column labels need not:

>>> different_column_type = cudf.DataFrame({1.0: [10], 2.0: [20]})
>>> different_column_type
   1.0  2.0
0   10   20
>>> df.equals(different_column_type)
True
factorize(sort=False, na_sentinel=None, use_na_sentinel=None)#

Encode the input values as integer labels.

Parameters#

sortbool, default False

Sort uniques and shuffle codes to maintain the relationship.

na_sentinelnumber, default -1

Value to indicate missing category.

Deprecated since version 23.04: The na_sentinel argument is deprecated and will be removed in a future version of cudf. Specify use_na_sentinel as either True or False.

use_na_sentinelbool, default True

If True, the sentinel -1 will be used for NA values. If False, NA values will be encoded as non-negative integers and will not drop the NA from the uniques of the values.

Returns#

(labels, cats)(cupy.ndarray, cupy.ndarray or Index)
  • labels contains the encoded values

  • cats contains the categories in order that the N-th item corresponds to the (N-1) code.

Examples#

>>> import cudf
>>> s = cudf.Series(['a', 'a', 'c'])
>>> codes, uniques = s.factorize()
>>> codes
array([0, 0, 1], dtype=int8)
>>> uniques
StringIndex(['a' 'c'], dtype='object')
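
How sort and use_na_sentinel interact can be sketched in plain Python. This is an assumption-laden illustration rather than the library's algorithm: uniques are taken in order of first appearance when sort=False, None stands in for <NA>, and placing the null last among sorted uniques is a choice made for this sketch only. The helper name factorize_sketch is hypothetical.

```python
def factorize_sketch(values, sort=False, use_na_sentinel=True):
    """Return (codes, uniques) for a list in which None marks a null."""
    uniques = []
    position = {}
    for v in values:
        if v is None and use_na_sentinel:
            continue  # nulls get the sentinel and are left out of uniques
        if v not in position:
            position[v] = len(uniques)
            uniques.append(v)
    if sort:
        ordered = sorted(u for u in uniques if u is not None)
        if None in position:
            ordered.append(None)  # assumption of this sketch: null sorts last
        uniques = ordered
        position = {u: i for i, u in enumerate(uniques)}
    codes = [
        -1 if (v is None and use_na_sentinel) else position[v]
        for v in values
    ]
    return codes, uniques


print(factorize_sketch(['a', 'a', 'c']))
# ([0, 0, 1], ['a', 'c'])
print(factorize_sketch(['b', None, 'a'], sort=True))
# ([1, -1, 0], ['a', 'b'])
print(factorize_sketch(['b', None, 'a'], use_na_sentinel=False))
# ([0, 1, 2], ['b', None, 'a'])
```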
ffill(value=None, axis=None, inplace=None, limit=None)#

Synonym for Series.fillna() with method='ffill'.

Returns#

Object with missing values filled or None if inplace=True.

first(offset)#

Select initial periods of time series data based on a date offset.

When having a DataFrame with sorted dates as index, this function can select the first few rows based on a date offset.

Parameters#

offset: str

The offset length of the data that will be selected. For instance, ‘1M’ will display all rows having their index within the first month.

Returns#

Series or DataFrame

A subset of the caller.

Raises#

TypeError

If the index is not a DatetimeIndex

Examples#

>>> i = cudf.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = cudf.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2
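
The selection rule in the example above can be sketched with the standard library. This sketch assumes that, for a day-based offset, a row is kept when its date is strictly earlier than the first date plus the offset, which is what the output above suggests. The helper name first_rows is hypothetical, and the offset is passed as a number of days rather than a string such as '3D'.

```python
from datetime import date, timedelta


def first_rows(rows, days):
    """Keep (date, value) rows that fall within `days` of the first date.

    `rows` must already be sorted by date.
    """
    if not rows:
        return []
    cutoff = rows[0][0] + timedelta(days=days)
    return [(d, v) for d, v in rows if d < cutoff]


start = date(2018, 4, 9)
# Four rows, two days apart, like date_range(..., periods=4, freq='2D').
rows = [(start + timedelta(days=2 * i), i + 1) for i in range(4)]
selected = first_rows(rows, days=3)
for d, v in selected:
    print(d.isoformat(), v)
# 2018-04-09 1
# 2018-04-11 2
```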
floordiv(other, level=None, fill_value=None, axis=0)#

Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

Equivalent to frame // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.floordiv(1)
        angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.floordiv(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.floordiv(b, fill_value=0)
a                      1
b    9223372036854775807
c    9223372036854775807
d                      0
e                   <NA>
dtype: int64
classmethod from_arrow(array)#

Create from PyArrow Array/ChunkedArray.

Parameters#

arrayPyArrow Array/ChunkedArray

PyArrow Object which has to be converted.

Raises#

TypeError for invalid input type.

Returns#

SingleColumnFrame

Examples#

>>> import cudf
>>> import pyarrow as pa
>>> cudf.Index.from_arrow(pa.array(["a", "b", None]))
StringIndex(['a' 'b' None], dtype='object')
>>> cudf.Series.from_arrow(pa.array(["a", "b", None]))
0       a
1       b
2    <NA>
dtype: object
ge(other, level=None, fill_value=None, axis=0)#

Get Greater than or equal to of DataFrame or Series and other, element-wise (binary operator ge).

Equivalent to frame >= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.ge(1)
        angles  degrees
circle      False     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.ge(b)
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.ge(b, fill_value=0)
a    True
b    True
c    True
d   False
e    <NA>
dtype: bool
gt(other, level=None, fill_value=None, axis=0)#

Get Greater than of DataFrame or Series and other, element-wise (binary operator gt).

Equivalent to frame > other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.gt(1)
        angles  degrees
circle      False     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.gt(b)
a   False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.gt(b, fill_value=0)
a   False
b    True
c    True
d   False
e    <NA>
dtype: bool
hash_values(method='murmur3', seed=None)#

Compute the hash of values in this column.

Parameters#

method{‘murmur3’, ‘md5’}, default ‘murmur3’

Hash function to use:

  • murmur3: MurmurHash3 hash function.

  • md5: MD5 hash function.

seedint, optional

Seed value to use for the hash function. This only has an effect for the following supported hash functions:

  • murmur3: MurmurHash3 hash function.

Returns#

Series

A Series with hash values.

Examples#

Series

>>> import cudf
>>> series = cudf.Series([10, 120, 30])
>>> series
0     10
1    120
2     30
dtype: int64
>>> series.hash_values(method="murmur3")
0   -1930516747
1     422619251
2    -941520876
dtype: int32
>>> series.hash_values(method="md5")
0    7be4bbacbfdb05fb3044e36c22b41e8b
1    947ca8d2c5f0f27437f156cfbfab0969
2    d0580ef52d27c043c8e341fd5039b166
dtype: object
>>> series.hash_values(method="murmur3", seed=42)
0    2364453205
1     422621911
2    3353449140
dtype: uint32

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({"a": [10, 120, 30], "b": [0.0, 0.25, 0.50]})
>>> df
     a     b
0   10  0.00
1  120  0.25
2   30  0.50
>>> df.hash_values(method="murmur3")
0    -330519225
1    -397962448
2   -1345834934
dtype: int32
>>> df.hash_values(method="md5")
0    57ce879751b5169c525907d5c563fae1
1    948d6221a7c4963d4be411bcead7e32b
2    fe061786ea286a515b772d91b0dfcd70
dtype: object
head(n=5)#

Return the first n rows. This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. For negative values of n, this function returns all rows except the last |n| rows, equivalent to df[:n].

Parameters#

nint, default 5

Number of rows to select.

Returns#

DataFrame or Series

The first n rows of the caller object.

Examples#

Series

>>> ser = cudf.Series(['alligator', 'bee', 'falcon',
... 'lion', 'monkey', 'parrot', 'shark', 'whale', 'zebra'])
>>> ser
0    alligator
1          bee
2       falcon
3         lion
4       monkey
5       parrot
6        shark
7        whale
8        zebra
dtype: object

Viewing the first 5 lines

>>> ser.head()
0    alligator
1          bee
2       falcon
3         lion
4       monkey
dtype: object

Viewing the first n lines (three in this case)

>>> ser.head(3)
0    alligator
1          bee
2       falcon
dtype: object

For negative values of n

>>> ser.head(-3)
0    alligator
1          bee
2       falcon
3         lion
4       monkey
5       parrot
dtype: object

DataFrame

>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> df.head(2)
   key   val
0    0  10.0
1    1  11.0
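
Because head selects purely by position, its behaviour for positive and negative n matches ordinary Python slicing. The sketch below uses a plain list in place of a Series, and the helper name head_sketch is hypothetical.

```python
def head_sketch(rows, n=5):
    """Positional equivalent of head(n): a plain slice from the start."""
    return rows[:n]


animals = ['alligator', 'bee', 'falcon', 'lion', 'monkey',
           'parrot', 'shark', 'whale', 'zebra']

print(head_sketch(animals))      # first five entries
print(head_sketch(animals, 3))   # ['alligator', 'bee', 'falcon']
print(head_sketch(animals, -3))  # everything except the last three
```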
property iloc#

Select values by position.

Examples#

Series

>>> import cudf
>>> s = cudf.Series([10, 20, 30])
>>> s
0    10
1    20
2    30
dtype: int64
>>> s.iloc[2]
30

DataFrame

Selecting rows and column by position.

>>> df = cudf.DataFrame({'a': range(20),
...                      'b': range(20),
...                      'c': range(20)})

Select a single row using an integer index.

>>> df.iloc[1]
a    1
b    1
c    1
Name: 1, dtype: int64

Select multiple rows using a list of integers.

>>> df.iloc[[0, 2, 9, 18]]
      a    b    c
 0    0    0    0
 2    2    2    2
 9    9    9    9
18   18   18   18

Select rows using a slice.

>>> df.iloc[3:10:2]
     a    b    c
3    3    3    3
5    5    5    5
7    7    7    7
9    9    9    9

Select both rows and columns.

>>> df.iloc[[1, 3, 5, 7], 2]
1    1
3    3
5    5
7    7
Name: c, dtype: int64

Setting values in rows using iloc.

>>> df.iloc[:4] = 0
>>> df
   a  b  c
0  0  0  0
1  0  0  0
2  0  0  0
3  0  0  0
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
[10 more rows]
property index#

Get the labels for the rows.

interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)#

Interpolate data values between some points.

Parameters#

methodstr, default ‘linear’

Interpolation technique to use. Currently, only ‘linear’ is supported.

  • ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.

  • ‘index’, ‘values’: Linearly interpolate using the index as an x-axis. Unsorted indices can lead to erroneous results.

axisint, default 0

Axis to interpolate along. Currently, only ‘axis=0’ is supported.

inplacebool, default False

Update the data in place if possible.

Returns#

Series or DataFrame

Returns the same object type as the caller, interpolated at some or all NaN values
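
What the ‘linear’ method means can be sketched in plain Python: values are treated as equally spaced, and each run of nulls between two known values is filled along a straight line. This is a simplified illustration, not the library's implementation. In particular it leaves nulls before the first or after the last known value untouched, which may differ from what hipdf does. The helper name interpolate_linear is hypothetical, and None stands in for a null.

```python
def interpolate_linear(values):
    """Fill interior None entries by linear interpolation over positions."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    # Walk over consecutive pairs of known positions and fill the gap between them.
    for lo, hi in zip(known, known[1:]):
        step = (out[hi] - out[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            out[i] = out[lo] + step * (i - lo)
    return out


print(interpolate_linear([1.0, None, None, 4.0]))   # [1.0, 2.0, 3.0, 4.0]
print(interpolate_linear([None, 2, None, 6, None])) # [None, 2, 4.0, 6, None]
```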

property is_monotonic#

Return boolean if values in the object are monotonically increasing.

This property is an alias for is_monotonic_increasing.

Returns#

bool

property is_monotonic_decreasing#

Return boolean if values in the object are monotonically decreasing.

Returns#

bool

property is_monotonic_increasing#

Return boolean if values in the object are monotonically increasing.

Returns#

bool
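
The two monotonicity checks can be sketched as pairwise comparisons of neighbouring values. The sketch assumes the non-strict definition used by pandas, where repeated values do not break monotonicity; whether hipdf treats ties the same way is an assumption here. The helper names are hypothetical.

```python
def monotonic_increasing(values):
    """True if every value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


def monotonic_decreasing(values):
    """True if every value is <= the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))


print(monotonic_increasing([1, 2, 2, 3]))  # True
print(monotonic_decreasing([3, 2, 2, 1]))  # True
print(monotonic_increasing([1, 3, 2]))     # False
print(monotonic_decreasing([1, 3, 2]))     # False
```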

isna()#

Identify missing values.

Return a boolean same-sized object indicating if the values are <NA>. <NA> values get mapped to True values. Everything else gets mapped to False values. <NA> values include:

  • Values where null mask is set.

  • NaN in float dtype.

  • NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index

Mask of bool values for each element in the object that indicates whether an element is an NA value.

Examples#

Show which entries in a DataFrame are NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
3    False
4    False
dtype: bool

Show which entries in an Index are NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.isna()
array([False, False,  True,  True, False, False])
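
The classification rule listed above (null and NaN count as missing; empty strings and infinities do not) can be sketched per element in plain Python. None stands in for a masked null here, NaT is not modelled, and the helper name is_na is hypothetical.

```python
import math


def is_na(value):
    """True for None (null) and float NaN; False for everything else."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


values = [5.0, 6.0, float("nan"), math.inf, -math.inf, "", None]
na_mask = [is_na(v) for v in values]
print(na_mask)  # [False, False, True, False, False, False, True]
```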
isnull()#

Identify missing values.

Return a boolean same-sized object indicating if the values are <NA>. <NA> values get mapped to True values. Everything else gets mapped to False values. <NA> values include:

  • Values where null mask is set.

  • NaN in float dtype.

  • NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index

Mask of bool values for each element in the object that indicates whether an element is an NA value.

Examples#

Show which entries in a DataFrame are NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
3    False
4    False
dtype: bool

Show which entries in an Index are NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.isna()
array([False, False,  True,  True, False, False])
kurt(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#

Return Fisher’s unbiased kurtosis of a sample.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

Returns#

Series or scalar

Notes#

Parameters currently not supported are level and numeric_only

Examples#

Series

>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4])
>>> series.kurtosis()
-1.1999999999999904

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.kurt()
a   -1.2
b   -1.2
dtype: float64
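
For reference, the value -1.2 in the examples above can be reproduced with the standard bias-corrected sample excess kurtosis formula. The sketch below is plain Python and assumes this is the estimator meant by "Fisher's unbiased kurtosis"; it is not the library's implementation. The helper name kurtosis_sketch is hypothetical, and returning NaN for fewer than four values or for constant data is a choice made for this sketch.

```python
def kurtosis_sketch(values):
    """Bias-corrected excess kurtosis (0.0 for a normal distribution)."""
    data = [v for v in values if v is not None]  # skipna=True behaviour
    n = len(data)
    if n < 4:
        return float("nan")  # the estimator needs at least four observations
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data)  # sum of squared deviations
    m4 = sum((x - mean) ** 4 for x in data)  # sum of fourth-power deviations
    if m2 == 0:
        return float("nan")  # undefined for constant data in this sketch
    variance = m2 / (n - 1)  # unbiased sample variance
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * m4 / variance ** 2 - correction


print(round(kurtosis_sketch([1, 2, 3, 4]), 6))    # -1.2
print(round(kurtosis_sketch([7, 8, 9, 10]), 6))   # -1.2
```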
kurtosis(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#

Return Fisher’s unbiased kurtosis of a sample.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

Returns#

Series or scalar

Notes#

Parameters currently not supported are level and numeric_only

Examples#

Series

>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4])
>>> series.kurtosis()
-1.1999999999999904

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.kurt()
a   -1.2
b   -1.2
dtype: float64
last(offset)#

Select final periods of time series data based on a date offset.

When having a DataFrame with sorted dates as index, this function can select the last few rows based on a date offset.

Parameters#

offset: str

The offset length of the data that will be selected. For instance, ‘3D’ will display all rows having their index within the last 3 days.

Returns#

Series or DataFrame

A subset of the caller.

Raises#

TypeError

If the index is not a DatetimeIndex

Examples#

>>> i = cudf.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = cudf.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
>>> ts.last('3D')
            A
2018-04-13  3
2018-04-15  4
le(other, level=None, fill_value=None, axis=0)#

Get Less than or equal to of DataFrame or Series and other, element-wise (binary operator le).

Equivalent to frame <= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.le(1)
        angles  degrees
circle       True    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.le(b)
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.le(b, fill_value=0)
a    True
b   False
c   False
d    True
e    <NA>
dtype: bool
property loc#

Select rows and columns by label or boolean mask.

Examples#

Series

>>> import cudf
>>> series = cudf.Series([10, 11, 12], index=['a', 'b', 'c'])
>>> series
a    10
b    11
c    12
dtype: int64
>>> series.loc['b']
11

DataFrame

DataFrame with string index.

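>>> df = cudf.DataFrame(
...     {'a': range(5), 'b': range(5, 10)},
...     index=['a', 'b', 'c', 'd', 'e']
... )
>>> df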
   a  b
a  0  5
b  1  6
c  2  7
d  3  8
e  4  9

Select a single row by label.

>>> df.loc['a']
a    0
b    5
Name: a, dtype: int64

Select multiple rows and a single column.

>>> df.loc[['a', 'c', 'e'], 'b']
a    5
c    7
e    9
Name: b, dtype: int64

Selection by boolean mask.

>>> df.loc[df.a > 2]
   a  b
d  3  8
e  4  9

Setting values using loc.

>>> df.loc[['a', 'c', 'e'], 'a'] = 0
>>> df
   a  b
a  0  5
b  1  6
c  0  7
d  3  8
e  0  9
lt(other, level=None, fill_value=None, axis=0)#

Get Less than of DataFrame or Series and other, element-wise (binary operator lt).

Equivalent to frame < other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.lt(1)
        angles  degrees
circle       True    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.lt(b)
a   False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.lt(b, fill_value=0)
a   False
b   False
c   False
d    True
e    <NA>
dtype: bool
mask(cond, other=None, inplace=False)#

Replace values where the condition is True.

Parameters#

condbool Series/DataFrame, array-like

Where cond is False, keep the original value. Where True, replace with corresponding value from other. Callables are not supported.

other: scalar, list of scalars, Series/DataFrame

Entries where cond is True are replaced with corresponding value from other. Callables are not supported. Default is None.

DataFrame expects only Scalar or array like with scalars or dataframe with same dimension as self.

Series expects only scalar or series like with same length

inplacebool, default False

Whether to perform the operation in place on the data.

Returns#

Same type as caller

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"A":[1, 4, 5], "B":[3, 5, 8]})
>>> df.mask(df % 2 == 0, [-1, -1])
   A  B
0  1  3
1 -1  5
2  5 -1
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.mask(ser > 2, 10)
0    10
1    10
2     2
3     1
4     0
dtype: int64
>>> ser.mask(ser > 2)
0    <NA>
1    <NA>
2       2
3       1
4       0
dtype: int64
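
The replacement rule for a Series with a scalar other can be sketched in one list comprehension. This plain-Python illustration covers only the scalar case; None stands in for <NA>, and the helper name mask_sketch is hypothetical.

```python
def mask_sketch(values, cond, other=None):
    """Replace values[i] with `other` wherever cond[i] is True."""
    if len(values) != len(cond):
        raise ValueError("cond must have the same length as values")
    return [other if c else v for v, c in zip(values, cond)]


ser = [4, 3, 2, 1, 0]
cond = [v > 2 for v in ser]
print(mask_sketch(ser, cond, 10))  # [10, 10, 2, 1, 0]
print(mask_sketch(ser, cond))      # [None, None, 2, 1, 0]
```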
max(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#

Return the maximum of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

level: int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only: bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

Returns#

Series

Notes#

Parameters currently not supported are level, numeric_only.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.max()
a     4
b    10
dtype: int64
mean(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#

Return the mean of the values for the requested axis.

Parameters#

axis{0 or ‘index’, 1 or ‘columns’}

Axis for the function to be applied on.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

**kwargs

Additional keyword arguments to be passed to the function.

Returns#

mean : Series or DataFrame (if level specified)

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.mean()
a    2.5
b    8.5
dtype: float64
median(axis=None, skipna=True, level=None, numeric_only=None, **kwargs)#

Return the median of the values for the requested axis.

Parameters#

skipnabool, default True

Exclude NA/null values when computing the result.

Returns#

scalar

Notes#

Parameters currently not supported are level and numeric_only.

Examples#

>>> import cudf
>>> ser = cudf.Series([10, 25, 3, 25, 24, 6])
>>> ser
0    10
1    25
2     3
3    25
4    24
5     6
dtype: int64
>>> ser.median()
17.0
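
The result 17.0 is the mean of the two middle values (10 and 24) once the data is sorted. The sketch below reproduces it with the standard library and illustrates skipna under the assumption that skipna=False yields a missing result when any null is present. The helper name median_sketch is hypothetical, and None stands in for <NA>.

```python
import statistics


def median_sketch(values, skipna=True):
    """Median of a list in which None marks a null entry."""
    if not skipna and any(v is None for v in values):
        return None  # assumption: a null makes the result missing
    data = [v for v in values if v is not None]
    return statistics.median(data) if data else None


print(median_sketch([10, 25, 3, 25, 24, 6]))        # 17.0
print(median_sketch([1, None, 3]))                  # 2.0
print(median_sketch([1, None, 3], skipna=False))    # None
```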
min(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#

Return the minimum of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

level: int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

numeric_only: bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

Returns#

Series

Notes#

Parameters currently not supported are level, numeric_only.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.min()
a    1
b    7
dtype: int64
mod(other, level=None, fill_value=None, axis=0)#

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

Equivalent to frame % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.mod(1)
           angles  degrees
circle          0        0
triangle        0        0
rectangle       0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.mod(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.mod(b, fill_value=0)
a             0
b    4294967295
c    4294967295
d             0
e          <NA>
dtype: int64
mul(other, level=None, fill_value=None, axis=0)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

Equivalent to frame * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.multiply(1)
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.multiply(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.multiply(b, fill_value=0)
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64
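
The interaction between index alignment and fill_value can be modelled in plain Python. This is only a sketch of the rule described above, not the library's implementation: aligned_binop is a hypothetical helper, each Series is modelled as a dict from label to value, and None stands in for a null.

```python
import operator


def aligned_binop(left, right, op, fill_value=None):
    """Apply op label by label over the sorted union of both indexes."""
    result = {}
    for label in sorted(set(left) | set(right)):
        x = left.get(label)   # None if the label is absent or the value is null
        y = right.get(label)
        # fill_value replaces a missing operand only when the other one is present.
        if fill_value is not None and (x is None) != (y is None):
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        result[label] = None if x is None or y is None else op(x, y)
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
print(aligned_binop(a, b, operator.mul))
print(aligned_binop(a, b, operator.mul, fill_value=0))
```

Label e stays missing in both calls because neither operand has a value there, which matches the Series example above.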
multiply(other, level=None, fill_value=None, axis=0)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

Equivalent to frame * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.multiply(1)
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.multiply(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.multiply(b, fill_value=0)
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64
property name#

Get the name of this object.

nans_to_nulls()#

Convert nans (if any) to nulls

Returns#

DataFrame or Series

Examples#

Series

>>> import cudf, numpy as np
>>> series = cudf.Series([1, 2, np.nan, None, 10], nan_as_null=False)
>>> series
0     1.0
1     2.0
2     NaN
3    <NA>
4    10.0
dtype: float64
>>> series.nans_to_nulls()
0     1.0
1     2.0
2    <NA>
3    <NA>
4    10.0
dtype: float64

DataFrame

>>> df = cudf.DataFrame()
>>> df['a'] = cudf.Series([1, None, np.nan], nan_as_null=False)
>>> df['b'] = cudf.Series([None, 3.14, np.nan], nan_as_null=False)
>>> df
      a     b
0   1.0  <NA>
1  <NA>  3.14
2   NaN   NaN
>>> df.nans_to_nulls()
      a     b
0   1.0  <NA>
1  <NA>  3.14
2  <NA>  <NA>
property ndim#

Number of dimensions of the underlying data, by definition 1.

ne(other, level=None, fill_value=None, axis=0)#

Get Not equal to of DataFrame or Series and other, element-wise (binary operator ne).

Equivalent to frame != other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the comparison.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.ne(1)
           angles  degrees
circle       True     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.ne(b)
a    False
b     <NA>
c     <NA>
d     <NA>
e     <NA>
dtype: bool
>>> a.ne(b, fill_value=0)
a    False
b     True
c     True
d     True
e     <NA>
dtype: bool
notna()#

Identify non-missing values.

Return a boolean same-sized object indicating if the values are not <NA>. Non-missing values get mapped to True. <NA> values get mapped to False values. <NA> values include:

  • Values where null mask is set.

  • NaN in float dtype.

  • NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index

Mask of bool values for each element in the object that indicates whether an element is not an NA value.

Examples#

Show which entries in a DataFrame are not NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
3     True
4     True
dtype: bool

Show which entries in an Index are not NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.notna()
array([ True,  True, False, False,  True,  True])
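
The rule above (nulls and NaN are missing, while empty strings and infinities are not) can be sketched in pure Python. The function below is a hypothetical stand-in for illustration only; it models a null as None and does not cover NaT.

```python
import math


def notna(values):
    """True where a value is present; None and float NaN count as missing."""
    return [
        not (v is None or (isinstance(v, float) and math.isnan(v)))
        for v in values
    ]


print(notna([5.0, 6.0, float("nan"), math.inf, -math.inf]))
print(notna(["Alfred", "", None]))
```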
notnull()#

Identify non-missing values.

Return a boolean same-sized object indicating if the values are not <NA>. Non-missing values get mapped to True. <NA> values get mapped to False values. <NA> values include:

  • Values where null mask is set.

  • NaN in float dtype.

  • NaT in datetime64 and timedelta64 types.

Characters such as empty strings '' or inf in case of float are not considered <NA> values.

Returns#

DataFrame/Series/Index

Mask of bool values for each element in the object that indicates whether an element is not an NA value.

Examples#

Show which entries in a DataFrame are not NA.

>>> import cudf
>>> import numpy as np
>>> import pandas as pd
>>> df = cudf.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
    age                        born    name        toy
0     5                        <NA>  Alfred       <NA>
1     6  1939-05-27 00:00:00.000000  Batman  Batmobile
2  <NA>  1940-04-25 00:00:00.000000              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = cudf.Series([5, 6, np.nan, np.inf, -np.inf])
>>> ser
0     5.0
1     6.0
2    <NA>
3     Inf
4    -Inf
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
3     True
4     True
dtype: bool

Show which entries in an Index are not NA.

>>> idx = cudf.Index([1, 2, None, np.nan, 0.32, np.inf])
>>> idx
Float64Index([1.0, 2.0, <NA>, <NA>, 0.32, Inf], dtype='float64')
>>> idx.notna()
array([ True,  True, False, False,  True,  True])
nunique(dropna: bool = True)#

Return count of unique values for the column.

Parameters#

dropnabool, default True

Don’t include NaN in the counts.

Returns#

int

Number of unique values in the column.
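
No example accompanies this method, so here is a minimal pure-Python sketch of what dropna changes. The helper is hypothetical, models a null as None, and assumes that all nulls count as a single distinct value when dropna=False, as in pandas.

```python
def nunique(values, dropna=True):
    """Count distinct values; nulls (None) are counted once only if dropna is False."""
    present = {v for v in values if v is not None}
    has_null = any(v is None for v in values)
    return len(present) + (1 if has_null and not dropna else 0)


data = [1, 2, 2, None, None]
print(nunique(data))                # distinct non-null values only
print(nunique(data, dropna=False))  # nulls add one extra distinct value
```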

pad(value=None, axis=None, inplace=None, limit=None)#

Synonym for Series.fillna() with method='ffill'.

Deprecated since version 23.06: Use DataFrame.ffill/Series.ffill instead.

Returns#

Object with missing values filled or None if inplace=True.
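
Because pad is deprecated in favour of ffill, it may help to see what a forward fill does. The sketch below is a pure-Python model with a hypothetical helper name, not the library's code; None stands in for a null and the limit parameter is not modelled.

```python
def ffill(values):
    """Replace each null with the most recent non-null value before it."""
    filled = []
    last = None  # stays None until the first non-null value is seen
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


print(ffill([None, 1, None, None, 4, None]))
```

Leading nulls stay null because there is no earlier value to propagate.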

pipe(func, *args, **kwargs)#

Apply func(self, *args, **kwargs).

Parameters#

funcfunction

Function to apply to the Series/DataFrame/Index. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple, where data_keyword is a string naming the keyword of callable that expects the Series/DataFrame/Index.

argsiterable, optional

Positional arguments passed into func.

kwargsmapping, optional

A dictionary of keyword arguments passed into func.

Returns#

object : the return type of func.

Examples#

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> func(g(h(df), arg1=a), arg2=b, arg3=c)

You can write

>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:

>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
...  )
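
The snippets above are schematic, so the following self-contained sketch shows the dispatch rule on a plain list. The free function pipe and the helpers scale and clip are hypothetical names used for illustration; the tuple handling mirrors the (callable, data_keyword) convention described in Parameters and is not taken from the library's source.

```python
def pipe(obj, func, *args, **kwargs):
    """Call func with obj as its first argument, or under a named keyword."""
    if isinstance(func, tuple):
        func, data_keyword = func
        if data_keyword in kwargs:
            raise ValueError(f"{data_keyword!r} is both the pipe target and a keyword argument")
        kwargs[data_keyword] = obj
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)


def scale(data, factor):
    return [x * factor for x in data]


def clip(upper, data):
    # data is the second parameter, so it must be passed by keyword
    return [min(x, upper) for x in data]


values = [1, 2, 3]
result = pipe(pipe(values, scale, factor=10), (clip, "data"), upper=25)
print(result)
```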
pow(other, level=None, fill_value=None, axis=0)#

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

Equivalent to frame ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.pow(1)
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.pow(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.pow(b, fill_value=0)
a       1
b       1
c       1
d       0
e    <NA>
dtype: int64
prod(axis=<no_default>, skipna=True, dtype=None, level=None, numeric_only=None, min_count=0, **kwargs)#

Return product of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

dtype: data type

Data type to cast the result to.

min_count: int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

The default is 0, which means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.

Returns#

Series

Notes#

Parameters currently not supported are level and numeric_only.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.product()
a      24
b    5040
dtype: int64
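
The effect of min_count is easiest to see on an all-null input. The sketch below is a pure-Python model of the rule described above, with a hypothetical helper and None standing in for a null; it always skips nulls (skipna=True).

```python
import math


def prod(values, min_count=0):
    """Product of the non-null values, or None if fewer than min_count are present."""
    present = [v for v in values if v is not None]
    if len(present) < min_count:
        return None
    return math.prod(present)  # the empty product is 1


print(prod([1, 2, 3, 4]))
print(prod([None, None]))               # empty product
print(prod([None, None], min_count=1))  # too few valid values
```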
product(axis=<no_default>, skipna=True, dtype=None, level=None, numeric_only=None, min_count=0, **kwargs)#

Return product of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

dtype: data type

Data type to cast the result to.

min_count: int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

The default is 0, which means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.

Returns#

Series

Notes#

Parameters currently not supported are level and numeric_only.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.product()
a      24
b    5040
dtype: int64
radd(other, level=None, fill_value=None, axis=0)#

Get Addition of DataFrame or Series and other, element-wise (binary operator radd).

Equivalent to other + frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.radd(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.radd(b)
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.radd(b, fill_value=0)
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)#

Compute numerical data ranks (1 through n) along axis.

By default, equal values are assigned a rank that is the average of the ranks of those values.

Parameters#

axis{0 or ‘index’}, default 0

Index to direct ranking.

method{‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’

How to rank the group of records that have the same value (i.e. ties):

  • average: average rank of the group

  • min: lowest rank in the group

  • max: highest rank in the group

  • first: ranks assigned in the order they appear in the array

  • dense: like ‘min’, but rank always increases by 1 between groups

numeric_onlybool, optional

For DataFrame objects, rank only numeric columns if set to True.

na_option{‘keep’, ‘top’, ‘bottom’}, default ‘keep’

How to rank NaN values:

  • keep: assign NaN rank to NaN values

  • top: assign smallest rank to NaN values if ascending

  • bottom: assign highest rank to NaN values if ascending

ascendingbool, default True

Whether or not the elements should be ranked in ascending order.

pctbool, default False

Whether or not to display the returned rankings in percentile form.

Returns#

same type as caller

Return a Series or DataFrame with data ranks as values.
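
No example accompanies this method, so the tie-breaking methods are sketched below in pure Python. This is an illustrative model with a hypothetical helper, not the library's implementation; it ranks in ascending order and does not handle nulls, na_option or pct.

```python
def rank(values, method="average"):
    """Rank values from 1 to n, resolving ties according to method."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    ranks = [0.0] * len(values)
    dense = 0
    pos = 0
    while pos < len(order):
        # Find the end of the run of equal values starting at pos.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        dense += 1
        for offset, i in enumerate(order[pos:end + 1]):
            if method == "average":
                ranks[i] = (pos + 1 + end + 1) / 2
            elif method == "min":
                ranks[i] = float(pos + 1)
            elif method == "max":
                ranks[i] = float(end + 1)
            elif method == "first":
                ranks[i] = float(pos + 1 + offset)
            elif method == "dense":
                ranks[i] = float(dense)
            else:
                raise ValueError(f"unknown method: {method}")
        pos = end + 1
    return ranks


data = [10, 20, 10, 30]
for method in ("average", "min", "max", "first", "dense"):
    print(method, rank(data, method))
```

The two tied 10s occupy positions 1 and 2, so average gives both 1.5, min gives 1, max gives 2, first gives 1 then 2, and dense gives 1 with the next group ranked 2.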

rdiv(other, level=None, fill_value=None, axis=0)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

Equivalent to other / frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rtruediv(1)
             angles   degrees
circle          inf  0.002778
triangle   0.333333  0.005556
rectangle  0.250000  0.002778

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.rtruediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.rtruediv(b, fill_value=0)
a     1.0
b     0.0
c     0.0
d     Inf
e    <NA>
dtype: float64
repeat(repeats, axis=None)#

Repeats elements consecutively.

Returns a new object of the caller's type (DataFrame/Series) in which each element of the current object is repeated consecutively a given number of times.

Parameters#

repeatsint, or array of ints

The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty object.

Returns#

Series/DataFrame

A newly created object of same type as caller with repeated elements.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
>>> df
   a   b
0  1  10
1  2  20
2  3  30
>>> df.repeat(3)
   a   b
0  1  10
0  1  10
0  1  10
1  2  20
1  2  20
1  2  20
2  3  30
2  3  30
2  3  30

Repeat on Series

>>> s = cudf.Series([0, 2])
>>> s
0    0
1    2
dtype: int64
>>> s.repeat([3, 4])
0    0
0    0
0    0
1    2
1    2
1    2
1    2
dtype: int64
>>> s.repeat(2)
0    0
0    0
1    2
1    2
dtype: int64
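
The difference between a scalar and an array-like repeats can be modelled in pure Python. The helper below is hypothetical and only illustrates that index labels are repeated together with their values; it is not the library's implementation.

```python
def repeat(labels, values, repeats):
    """Repeat each (label, value) pair; repeats is an int or one count per element."""
    if isinstance(repeats, int):
        repeats = [repeats] * len(values)
    if len(repeats) != len(values):
        raise ValueError("repeats must be a scalar or match the number of elements")
    out = []
    for label, value, n in zip(labels, values, repeats):
        out.extend([(label, value)] * n)
    return out


print(repeat([0, 1], [0, 2], [3, 4]))  # per-element counts
print(repeat([0, 1], [0, 2], 2))       # the same count for every element
```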
resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None)#

Convert the frequency of (“resample”) the given time series data.

Parameters#

rule: str

The offset string representing the frequency to use. Note that DateOffset objects are not yet supported.

closed: {“right”, “left”}, default None

Which side of bin interval is closed. The default is “left” for all frequency offsets except for “M” and “W”, which have a default of “right”.

label: {“right”, “left”}, default None

Which bin edge label to label bucket with. The default is “left” for all frequency offsets except for “M” and “W”, which have a default of “right”.

on: str, optional

For a DataFrame, column to use instead of the index for resampling. Column must be a datetime-like.

level: str or int, optional

For a MultiIndex, level to use instead of the index for resampling. The level must be a datetime-like.

Returns#

A Resampler object

Examples#

First, we create a time series with 1 minute intervals:

>>> index = cudf.date_range(start="2001-01-01", periods=10, freq="1T")
>>> sr = cudf.Series(range(10), index=index)
>>> sr
2001-01-01 00:00:00    0
2001-01-01 00:01:00    1
2001-01-01 00:02:00    2
2001-01-01 00:03:00    3
2001-01-01 00:04:00    4
2001-01-01 00:05:00    5
2001-01-01 00:06:00    6
2001-01-01 00:07:00    7
2001-01-01 00:08:00    8
2001-01-01 00:09:00    9
dtype: int64

Downsampling to 3 minute intervals, followed by a “sum” aggregation:

>>> sr.resample("3T").sum()
2001-01-01 00:00:00     3
2001-01-01 00:03:00    12
2001-01-01 00:06:00    21
2001-01-01 00:09:00     9
dtype: int64

Use the right side of each interval to label the bins:

>>> sr.resample("3T", label="right").sum()
2001-01-01 00:03:00     3
2001-01-01 00:06:00    12
2001-01-01 00:09:00    21
2001-01-01 00:12:00     9
dtype: int64

Close the right side of the interval instead of the left:

>>> sr.resample("3T", closed="right").sum()
2000-12-31 23:57:00     0
2001-01-01 00:00:00     6
2001-01-01 00:03:00    15
2001-01-01 00:06:00    24
dtype: int64

Upsampling to 30 second intervals:

>>> sr.resample("30s").asfreq()[:5]  # show the first 5 rows
2001-01-01 00:00:00       0
2001-01-01 00:00:30    <NA>
2001-01-01 00:01:00       1
2001-01-01 00:01:30    <NA>
2001-01-01 00:02:00       2
dtype: int64

Upsample and fill nulls using the “bfill” method:

>>> sr.resample("30s").bfill()[:5]
2001-01-01 00:00:00    0
2001-01-01 00:00:30    1
2001-01-01 00:01:00    1
2001-01-01 00:01:30    2
2001-01-01 00:02:00    2
dtype: int64

Resampling by a specified column of a Dataframe:

>>> df = cudf.DataFrame({
...     "price": [10, 11, 9, 13, 14, 18, 17, 19],
...     "volume": [50, 60, 40, 100, 50, 100, 40, 50],
...     "week_starting": cudf.date_range(
...         "2018-01-01", periods=8, freq="7D"
...     )
... })
>>> df
   price  volume week_starting
0     10      50    2018-01-01
1     11      60    2018-01-08
2      9      40    2018-01-15
3     13     100    2018-01-22
4     14      50    2018-01-29
5     18     100    2018-02-05
6     17      40    2018-02-12
7     19      50    2018-02-19
>>> df.resample("M", on="week_starting").mean()
               price     volume
week_starting
2018-01-31      11.4  60.000000
2018-02-28      18.0  63.333333

Notes#

Note that the dtype of the index (or the ‘on’ column if using ‘on=’) in the result will be of a frequency closest to the resampled frequency. For example, if resampling from nanoseconds to milliseconds, the index will be of dtype ‘datetime64[ms]’.
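
To make the default binning concrete, the sketch below reproduces the 3 minute "sum" example with the standard library only. It is a simplified model under stated assumptions: the helper name is hypothetical, bins are closed and labelled on the left, and bin edges are anchored at midnight of each sample's day (the origin='start_day' default).

```python
from datetime import datetime, timedelta


def downsample_sum(samples, bin_width):
    """Sum (timestamp, value) pairs into left-closed, left-labelled bins."""
    totals = {}
    for stamp, value in samples:
        midnight = stamp.replace(hour=0, minute=0, second=0, microsecond=0)
        n_bins = (stamp - midnight) // bin_width  # whole bins since midnight
        left_edge = midnight + n_bins * bin_width
        totals[left_edge] = totals.get(left_edge, 0) + value
    return dict(sorted(totals.items()))


start = datetime(2001, 1, 1)
samples = [(start + timedelta(minutes=i), i) for i in range(10)]
binned = downsample_sum(samples, timedelta(minutes=3))
for edge, total in binned.items():
    print(edge, total)
```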

rfloordiv(other, level=None, fill_value=None, axis=0)#

Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

Equivalent to other // frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rfloordiv(1)
                        angles  degrees
circle     9223372036854775807        0
triangle                     0        0
rectangle                    0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.rfloordiv(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rfloordiv(b, fill_value=0)
a                      1
b                      0
c                      0
d    9223372036854775807
e                   <NA>
dtype: int64
rmod(other, level=None, fill_value=None, axis=0)#

Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).

Equivalent to other % frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rmod(1)
               angles  degrees
circle     4294967295        1
triangle            1        1
rectangle           1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.rmod(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmod(b, fill_value=0)
a             0
b             0
c             0
d    4294967295
e          <NA>
dtype: int64
rmul(other, level=None, fill_value=None, axis=0)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

Equivalent to other * frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rmul(1)
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.rmul(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmul(b, fill_value=0)
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64
rolling(window, min_periods=None, center=False, axis=0, win_type=None)#

Rolling window calculations.

Parameters#

windowint, offset or a BaseIndexer subclass

Size of the window, i.e., the number of observations used to calculate the statistic. For datetime indexes, an offset can be provided instead of an int. The offset must be convertible to a timedelta. As opposed to a fixed window size, each window will be sized to accommodate observations within the time period specified by the offset. If a BaseIndexer subclass is passed, calculates the window boundaries based on the defined get_window_bounds method.

min_periodsint, optional

The minimum number of observations in the window that are required to be non-null, so that the result is non-null. If not provided or None, min_periods is equal to the window size.

centerbool, optional

If True, the result is set at the center of the window. If False (default), the result is set at the right edge of the window.

Returns#

Rolling object.

Examples#

>>> import cudf
>>> a = cudf.Series([1, 2, 3, None, 4])

Rolling sum with window size 2.

>>> print(a.rolling(2).sum())
0
1    3
2    5
3
4
dtype: int64

Rolling sum with window size 2 and min_periods 1.

>>> print(a.rolling(2, min_periods=1).sum())
0    1
1    3
2    5
3    3
4    4
dtype: int64

Rolling count with window size 3.

>>> print(a.rolling(3).count())
0    1
1    2
2    3
3    2
4    2
dtype: int64

Rolling count with window size 3, but with the result set at the center of the window.

>>> print(a.rolling(3, center=True).count())
0    2
1    3
2    2
3    2
4    1
dtype: int64

Rolling max with variable window size specified by an offset; only valid for datetime index.

>>> import numpy as np
>>> import pandas as pd
>>> a = cudf.Series(
...     [1, 9, 5, 4, np.nan, 1],
...     index=[
...         pd.Timestamp('20190101 09:00:00'),
...         pd.Timestamp('20190101 09:00:01'),
...         pd.Timestamp('20190101 09:00:02'),
...         pd.Timestamp('20190101 09:00:04'),
...         pd.Timestamp('20190101 09:00:07'),
...         pd.Timestamp('20190101 09:00:08')
...     ]
... )
>>> print(a.rolling('2s').max())
2019-01-01T09:00:00.000    1
2019-01-01T09:00:01.000    9
2019-01-01T09:00:02.000    9
2019-01-01T09:00:04.000    4
2019-01-01T09:00:07.000
2019-01-01T09:00:08.000    1
dtype: int64

Apply custom function on the window with the apply method

>>> import numpy as np
>>> import math
>>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
>>> def some_func(A):
...     b = 0
...     for a in A:
...         b = b + math.sqrt(a)
...     return b
...
>>> print(b.rolling(3, min_periods=1).apply(some_func))
0     4.0
1     9.0
2    15.0
3    18.0
4    21.0
5    24.0
dtype: float64

And this also works for window rolling set by an offset

>>> import pandas as pd
>>> c = cudf.Series(
...     [16, 25, 36, 49, 64, 81],
...     index=[
...          pd.Timestamp('20190101 09:00:00'),
...          pd.Timestamp('20190101 09:00:01'),
...          pd.Timestamp('20190101 09:00:02'),
...          pd.Timestamp('20190101 09:00:04'),
...          pd.Timestamp('20190101 09:00:07'),
...          pd.Timestamp('20190101 09:00:08')
...      ],
...     dtype=np.float64
... )
>>> print(c.rolling('2s').apply(some_func))
2019-01-01T09:00:00.000     4.0
2019-01-01T09:00:01.000     9.0
2019-01-01T09:00:02.000    11.0
2019-01-01T09:00:04.000     7.0
2019-01-01T09:00:07.000     8.0
2019-01-01T09:00:08.000    17.0
dtype: float64
rpow(other, level=None, fill_value=None, axis=0)#

Get Exponential power of DataFrame or Series and other, element-wise (binary operator rpow).

Equivalent to other ** frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rpow(1)
           angles  degrees
circle          1        1
triangle        1        1
rectangle       1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.rpow(b)
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rpow(b, fill_value=0)
a       1
b       0
c       0
d       1
e    <NA>
dtype: int64
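
The alignment and fill_value rules can be traced by hand. The following is a minimal sketch in plain Python, not the library's implementation: reflected_binop is a hypothetical helper, and each Series is modelled as a dict from label to value with None for a null.

```python
import operator


def reflected_binop(self_vals, other_vals, op, fill_value=None):
    """Compute op(other, self) label by label over the sorted union of labels."""
    result = {}
    for label in sorted(set(self_vals) | set(other_vals)):
        left = other_vals.get(label)   # reflected: `other` is the left operand
        right = self_vals.get(label)
        if fill_value is not None and (left is None) != (right is None):
            # Exactly one side is missing: substitute fill_value for it.
            left = fill_value if left is None else left
            right = fill_value if right is None else right
        # If both sides are missing, the result stays missing.
        result[label] = None if left is None or right is None else op(left, right)
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
print(reflected_binop(a, b, operator.pow))
print(reflected_binop(a, b, operator.pow, fill_value=0))
```

Label e stays missing even with fill_value=0 because neither input has a value there, matching the rule stated for fill_value.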
rsub(other, level=None, fill_value=None, axis=0)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

Equivalent to other - frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rsub(1)
           angles  degrees
circle          1     -359
triangle       -2     -179
rectangle      -3     -359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.rsub(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rsub(b, fill_value=0)
a       0
b      -1
c      -1
d       1
e    <NA>
dtype: int64
rtruediv(other, level=None, fill_value=None, axis=0)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

Equivalent to other / frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rtruediv(1)
             angles   degrees
circle          inf  0.002778
triangle   0.333333  0.005556
rectangle  0.250000  0.002778

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.rtruediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.rtruediv(b, fill_value=0)
a     1.0
b     0.0
c     0.0
d     Inf
e    <NA>
dtype: float64
sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)#

Return a random sample of items from an axis of object.

If reproducible results are required, a random number generator may be provided via the random_state parameter. This function will always produce the same sample given an identical random_state.

Notes#

When sampling from axis=0/'index', random_state can be either a numpy random state (numpy.random.RandomState) or a cupy random state (cupy.random.RandomState). When a numpy random state is used, the output is guaranteed to match the output of the corresponding pandas method call, but generating the sample may be slow. If exact pandas equivalence is not required, using a cupy random state will achieve better performance, especially when sampling a large number of items. The weights array should use the ndarray type that matches the random state.

Parameters#

nint, optional

Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

fracfloat, optional

Fraction of axis items to return. Cannot be used with n.

replacebool, default False

Allow or disallow sampling of the same row more than once. replace=True is not supported for axis=1/‘columns’. replace=False is not supported for axis=0/‘index’ when weights is specified and random_state is None or a cupy random state.

weightsndarray-like, optional

Default None for uniform probability distribution over rows to sample from. If an ndarray is passed, the length of weights should equal the number of rows to sample from, and the weights will be normalized to have a sum of 1. Unlike pandas, index alignment is currently not performed.

random_stateint, numpy/cupy RandomState, or None, default None

If None, default cupy random state is chosen. If int, the seed for the default cupy random state. If RandomState, rows-to-sample are generated from the RandomState.

axis{0 or index, 1 or columns, None}, default None

Axis to sample. Accepts axis number or name. Default is stat axis for given data type (0 for Series and DataFrames). Series doesn’t support axis=1.

ignore_indexbool, default False

If True, the resulting index will be labeled 0, 1, …, n - 1.

Returns#

Series or DataFrame

A new object of same type as caller containing n items randomly sampled from the caller object.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({"a": [1, 2, 3, 4, 5]})
>>> df.sample(3)
   a
1  2
3  4
0  1
>>> sr = cudf.Series([1, 2, 3, 4, 5])
>>> sr.sample(10, replace=True)
1    4
3    1
2    4
0    5
0    1
4    5
4    1
0    2
0    3
3    2
dtype: int64
>>> df = cudf.DataFrame(
...     {"a": [1, 2], "b": [2, 3], "c": [3, 4], "d": [4, 5]}
... )
>>> df.sample(2, axis=1)
   a  c
0  1  3
1  2  4
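
The reproducibility guarantee and the meaning of replace can be illustrated without a GPU. The following is a minimal sketch using Python's standard random module as a stand-in; it is not the library's sampling code and will not produce the same draws as a numpy or cupy random state. sample_rows is a hypothetical helper.

```python
import random


def sample_rows(rows, n, replace=False, seed=None):
    """Return (position, value) pairs for n randomly chosen rows."""
    rng = random.Random(seed)  # a fixed seed gives a repeatable sample
    positions = range(len(rows))
    if replace:
        picked = rng.choices(positions, k=n)  # the same row may appear twice
    else:
        picked = rng.sample(positions, n)     # every row appears at most once
    return [(i, rows[i]) for i in picked]


rows = [1, 2, 3, 4, 5]
print(sample_rows(rows, 3, seed=42))
print(sample_rows(rows, 10, replace=True, seed=42))
```

Requesting more items than there are rows is only possible with replace=True; in this sketch random.sample raises ValueError otherwise.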
scale()#

Scale values to [0, 1] in float64

Returns#

DataFrame or Series

Values scaled to [0, 1].

Examples#

>>> import cudf
>>> series = cudf.Series([10, 11, 12, 0.5, 1])
>>> series
0    10.0
1    11.0
2    12.0
3     0.5
4     1.0
dtype: float64
>>> series.scale()
0    0.826087
1    0.913043
2    1.000000
3    0.000000
4    0.043478
dtype: float64
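
The output above is consistent with min-max normalization, (x - min) / (max - min). The sketch below is a plain-Python illustration under that assumption, not the library's implementation; min_max_scale is a hypothetical helper and it assumes the values are not all equal.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0.0 and the maximum 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 11, 12, 0.5, 1])
print([round(x, 6) for x in scaled])
```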
searchsorted(values, side='left', ascending=True, na_position='last')#

Find indices where elements should be inserted to maintain order

Parameters#

valuesFrame (Shape must be consistent with self)

Values to be hypothetically inserted into self.

sidestr {‘left’, ‘right’} optional, default ‘left’

If ‘left’, the index of the first suitable location found is given. If ‘right’, the last such index is returned.

ascendingbool optional, default True

Sorted Frame is in ascending order (otherwise descending)

na_positionstr {‘last’, ‘first’} optional, default ‘last’

Position of null values in sorted order

Returns#

1-D cupy array of insertion points

Examples#

>>> s = cudf.Series([1, 2, 3])
>>> s.searchsorted(4)
3
>>> s.searchsorted([0, 4])
array([0, 3], dtype=int32)
>>> s.searchsorted([1, 3], side='left')
array([0, 2], dtype=int32)
>>> s.searchsorted([1, 3], side='right')
array([1, 3], dtype=int32)

If the values are not monotonically sorted, wrong locations may be returned:

>>> s = cudf.Series([2, 1, 3])
>>> s.searchsorted(1)
0   # wrong result, correct would be 1
>>> df = cudf.DataFrame({'a': [1, 3, 5, 7], 'b': [10, 12, 14, 16]})
>>> df
   a   b
0  1  10
1  3  12
2  5  14
3  7  16
>>> values_df = cudf.DataFrame({'a': [0, 2, 5, 6],
... 'b': [10, 11, 13, 15]})
>>> values_df
   a   b
0  0  10
1  2  11
2  5  13
3  6  15
>>> df.searchsorted(values_df, ascending=False)
array([4, 4, 4, 0], dtype=int32)
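
For the one-dimensional, ascending case, the meaning of side can be illustrated with Python's standard bisect module. This is a minimal sketch of the semantics, not the library's implementation; the searchsorted function below is a hypothetical stand-in that works on plain lists.

```python
from bisect import bisect_left, bisect_right


def searchsorted(sorted_values, values, side="left"):
    """Insertion points that keep an ascending list sorted."""
    find = bisect_left if side == "left" else bisect_right
    return [find(sorted_values, v) for v in values]


s = [1, 2, 3]
print(searchsorted(s, [0, 4]))                # before the first, after the last
print(searchsorted(s, [1, 3], side="left"))   # insert before equal elements
print(searchsorted(s, [1, 3], side="right"))  # insert after equal elements
```

The two sides differ only for values already present in the data: ‘left’ places the new element before its equals and ‘right’ places it after them.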
property shape#

Get a tuple representing the dimensionality of the Series.

shift(periods=1, freq=None, axis=0, fill_value=None)#

Shift values by periods positions.

property size#

Return the number of elements in the underlying data.

Returns#

size : Size of the DataFrame / Index / Series / MultiIndex

Examples#

Size of an empty dataframe is 0.

>>> import cudf
>>> df = cudf.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df.size
0
>>> df = cudf.DataFrame(index=[1, 2, 3])
>>> df
Empty DataFrame
Columns: []
Index: [1, 2, 3]
>>> df.size
0

DataFrame with values

>>> df = cudf.DataFrame({'a': [10, 11, 12],
...         'b': ['hello', 'rapids', 'ai']})
>>> df
    a       b
0  10   hello
1  11  rapids
2  12      ai
>>> df.size
6
>>> df.index
RangeIndex(start=0, stop=3)
>>> df.index.size
3

Size of an Index

>>> index = cudf.Index([])
>>> index
Float64Index([], dtype='float64')
>>> index.size
0
>>> index = cudf.Index([1, 2, 3, 10])
>>> index
Int64Index([1, 2, 3, 10], dtype='int64')
>>> index.size
4

Size of a MultiIndex

>>> midx = cudf.MultiIndex(
...                 levels=[["a", "b", "c", None], ["1", None, "5"]],
...                 codes=[[0, 0, 1, 2, 3], [0, 2, 1, 1, 0]],
...                 names=["x", "y"],
...             )
>>> midx
MultiIndex([( 'a',  '1'),
            ( 'a',  '5'),
            ( 'b', <NA>),
            ( 'c', <NA>),
            (<NA>,  '1')],
           names=['x', 'y'])
>>> midx.size
5
skew(axis=<no_default>, skipna=True, level=None, numeric_only=None, **kwargs)#

Return unbiased Fisher-Pearson skew of a sample.

Parameters#

skipna: bool, default True

Exclude NA/null values when computing the result.

Returns#

Series

Notes#

Parameters currently not supported are axis, level and numeric_only

Examples#

Series

>>> import cudf
>>> series = cudf.Series([1, 2, 3, 4, 5, 6, 6])
>>> series
0    1
1    2
2    3
3    4
4    5
5    6
6    6
dtype: int64

DataFrame

>>> import cudf
>>> df = cudf.DataFrame({'a': [3, 2, 3, 4], 'b': [7, 8, 10, 10]})
>>> df.skew()
a    0.00000
b   -0.37037
dtype: float64
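
The DataFrame result above can be reproduced with the adjusted Fisher-Pearson coefficient, G1 = g1 * sqrt(n(n-1)) / (n-2), where g1 = m3 / m2^1.5 and m2, m3 are the biased second and third central moments. The sketch below is a plain-Python illustration assuming that formula, not the library's implementation; adjusted_skew is a hypothetical helper and requires at least three values that are not all equal.

```python
import math


def adjusted_skew(values):
    """Adjusted Fisher-Pearson skewness of a sample (needs n >= 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # biased variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # biased third moment
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


print(adjusted_skew([3, 2, 3, 4]))              # symmetric data
print(round(adjusted_skew([7, 8, 10, 10]), 5))  # left-skewed data
```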
std(axis=<no_default>, skipna=True, level=None, ddof=1, numeric_only=None, **kwargs)#

Return sample standard deviation of the DataFrame.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns#

Series

Notes#

Parameters currently not supported are level and numeric_only

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.std()
a    1.290994
b    1.290994
dtype: float64
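
The role of ddof is easiest to see by computing the divisor N - ddof directly. The following is a minimal plain-Python sketch, not the library's implementation; variance and std are hypothetical helpers.

```python
import math


def variance(values, ddof=1):
    """Sum of squared deviations divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


def std(values, ddof=1):
    return math.sqrt(variance(values, ddof))


column = [1, 2, 3, 4]
print(round(std(column), 6))          # ddof=1: sample standard deviation
print(round(std(column, ddof=0), 6))  # ddof=0: population standard deviation
```

With ddof=1 the result matches the value shown for column a above; ddof=0 divides by N and therefore gives a smaller value.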
sub(other, level=None, fill_value=None, axis=0)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

Equivalent to frame - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.sub(1)
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.sub(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.sub(b, fill_value=0)
a       0
b       1
c       1
d      -1
e    <NA>
dtype: int64
subtract(other, level=None, fill_value=None, axis=0)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

Equivalent to frame - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.sub(1)
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.sub(b)
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.sub(b, fill_value=0)
a       0
b       1
c       1
d      -1
e    <NA>
dtype: int64
sum(axis=<no_default>, skipna=True, dtype=None, level=None, numeric_only=None, min_count=0, **kwargs)#

Return sum of the values in the DataFrame.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values when computing the result.

dtype: data type

Data type to cast the result to.

min_count: int, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

The default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.

Returns#

Series

Notes#

Parameters currently not supported are level, numeric_only.

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.sum()
a    10
b    34
dtype: int64
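
The interaction between skipna and min_count can be sketched in plain Python. This is an illustration of the rules described above, not the library's implementation; sum_with_min_count is a hypothetical helper and None stands in for a null.

```python
def sum_with_min_count(values, skipna=True, min_count=0):
    """Sum non-null values; return None when too few valid values exist."""
    if not skipna and any(v is None for v in values):
        return None  # a null propagates when nulls are not skipped
    valid = [v for v in values if v is not None]
    if len(valid) < min_count:
        return None
    return sum(valid)


print(sum_with_min_count([None, None]))               # all-null sums to 0
print(sum_with_min_count([None, None], min_count=1))  # too few valid values
print(sum_with_min_count([1, None, 3], min_count=2))  # two valid values suffice
```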
tail(n=5)#

Returns the last n rows as a new DataFrame or Series

Examples#

DataFrame

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2, 3, 4]
>>> df['val'] = [float(i + 10) for i in range(5)]  # insert column
>>> df.tail(2)
   key   val
3    3  13.0
4    4  14.0

Series

>>> import cudf
>>> ser = cudf.Series([4, 3, 2, 1, 0])
>>> ser.tail(2)
3    1
4    0
dtype: int64
take(indices, axis=0)#

Return a new frame containing the rows specified by indices.

Parameters#

indicesarray-like

Array of ints indicating which positions to take.

axis : Unsupported

Returns#

outSeries or DataFrame

New object with desired subset of rows.

Examples#

Series

>>> s = cudf.Series(['a', 'b', 'c', 'd', 'e'])
>>> s.take([2, 0, 4, 3])
2    c
0    a
4    e
3    d
dtype: object

DataFrame

>>> a = cudf.DataFrame({'a': [1.0, 2.0, 3.0],
...                    'b': cudf.Series(['a', 'b', 'c'])})
>>> a.take([0, 2, 2])
     a  b
0  1.0  a
2  3.0  c
2  3.0  c
>>> a.take([True, False, True])
     a  b
0  1.0  a
2  3.0  c
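
Positional selection keeps the original labels and allows repeats, as the integer examples above show. The following is a minimal plain-Python sketch of that behavior, not the library's implementation; take here is a hypothetical helper operating on parallel lists of labels and values.

```python
def take(labels, values, indices):
    """Select rows by integer position, keeping each row's original label."""
    return [(labels[i], values[i]) for i in indices]


labels = [0, 1, 2, 3, 4]
values = ["a", "b", "c", "d", "e"]
print(take(labels, values, [2, 0, 4, 3]))
print(take(labels, values, [0, 2, 2]))  # a position may be taken more than once
```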
tile(count)#

Repeats the rows count times to form a new Frame.

Parameters#

self : input Table containing columns to interleave.

count : Number of times to tile “rows”. Must be non-negative.

Examples#

>>> import cudf
>>> df = cudf.DataFrame([[8, 4, 7], [5, 2, 3]])
>>> count = 2
>>> df.tile(count)
   0  1  2
0  8  4  7
1  5  2  3
0  8  4  7
1  5  2  3

Returns#

The indexed frame containing the tiled “rows”.

to_arrow()#

Convert to a PyArrow Array.

Returns#

PyArrow Array

Examples#

>>> import cudf
>>> sr = cudf.Series(["a", "b", None])
>>> sr.to_arrow()
<pyarrow.lib.StringArray object at 0x7f796b0e7600>
[
  "a",
  "b",
  null
]
>>> ind = cudf.Index(["a", "b", None])
>>> ind.to_arrow()
<pyarrow.lib.StringArray object at 0x7f796b0e7750>
[
  "a",
  "b",
  null
]
to_cupy(dtype: Dtype | None = None, copy: bool = True, na_value=None) cupy.ndarray#

Convert the Frame to a CuPy array.

Parameters#

dtypestr or numpy.dtype, optional

The dtype to pass to numpy.asarray().

copybool, default False

Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_cupy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.

na_valueAny, default None

The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.

Returns#

cupy.ndarray

to_dlpack()#

Converts a cuDF object into a DLPack tensor.

DLPack is an open-source memory tensor structure: dmlc/dlpack.

This function takes a cuDF object and converts it to a PyCapsule object which contains a pointer to a DLPack tensor. This function deep copies the data into the DLPack tensor from the cuDF object.

Parameters#

cudf_obj : DataFrame, Series, Index, or Column

Returns#

pycapsule_objPyCapsule

Output DLPack tensor pointer which is encapsulated in a PyCapsule object.

to_hdf(path_or_buf, key, *args, **kwargs)#

Write the contained data to an HDF5 file using HDFStore.

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

To add another DataFrame or Series to an existing HDF file, use append mode and a different key.

For more information see the user guide.

Parameters#

path_or_bufstr or pandas.HDFStore

File path or HDFStore object.

keystr

Identifier for the group in the store.

mode{‘a’, ‘w’, ‘r+’}, default ‘a’

Mode to open file:

  • ‘w’: write, a new file is created (an existing file with the same name would be deleted).

  • ‘a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.

  • ‘r+’: similar to ‘a’, but the file must already exist.

format{‘fixed’, ‘table’}, default ‘fixed’

Possible values:

  • ‘fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable.

  • ‘table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

appendbool, default False

For Table formats, append the input data to the existing.

data_columnslist of columns or True, optional

List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See Query via Data Columns. Applicable only to format=’table’.

complevel{0-9}, optional

Specifies a compression level for data. A value of 0 disables compression.

complib{‘zlib’, ‘lzo’, ‘bzip2’, ‘blosc’}, default ‘zlib’

Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.

fletcher32bool, default False

If applying compression use the fletcher32 checksum.

dropnabool, default False

If True, rows in which all values are NaN are not written to the store.

errorsstr, default ‘strict’

Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

See Also#

cudf.read_hdf : Read from HDF file.

cudf.DataFrame.to_parquet : Write a DataFrame to the binary parquet format.

cudf.DataFrame.to_feather : Write out feather-format for DataFrames.

to_json(path_or_buf=None, *args, **kwargs)#

Convert the cuDF object to a JSON string. Note nulls and NaNs will be converted to null and datetime objects will be converted to UNIX timestamps.

Parameters#

path_or_bufstring or file handle, optional

File path or object. If not specified, the result is returned as a string.

engine{‘auto’, ‘cudf’, ‘pandas’}, default ‘auto’

Parser engine to use. If ‘auto’ is passed, the pandas engine will be selected.

orientstring

Indication of expected JSON string format.

  • Series
    • default is ‘index’

    • allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}

  • DataFrame
    • default is ‘columns’

    • allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}

  • The format of the JSON string
    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ‘records’ : list like [{column -> value}, … , {column -> value}]

    • ‘index’ : dict like {index -> {column -> value}}

    • ‘columns’ : dict like {column -> {index -> value}}

    • ‘values’ : just the values array

    • ‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}} describing the data, and the data component is like orient='records'.
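
The orient layouts listed above can be built by hand for a small table. The sketch below uses only Python's standard json module to show the shape of each layout as described in the list; it does not call to_json, and the index labels, column names, and data are made up for illustration.

```python
import json

index = ["r1", "r2"]        # hypothetical row labels
columns = ["a", "b"]        # hypothetical column names
data = [[1, "x"], [2, "y"]]

layouts = {
    "split": {"index": index, "columns": columns, "data": data},
    "records": [dict(zip(columns, row)) for row in data],
    "index": {i: dict(zip(columns, row)) for i, row in zip(index, data)},
    "columns": {
        c: {i: row[j] for i, row in zip(index, data)}
        for j, c in enumerate(columns)
    },
    "values": data,
}

for orient, payload in layouts.items():
    print(orient, json.dumps(payload))
```

Only ‘records’ and ‘values’ are list-like; the other layouts are keyed dictionaries.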

date_format{None, ‘epoch’, ‘iso’}

Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.

double_precisionint, default 10

The number of decimal places to use when encoding floating point values.

force_asciibool, default True

Force encoded string to be ASCII.

date_unitstring, default ‘ms’ (milliseconds)

The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.

default_handlercallable, default None

Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serializable object.

linesbool, default False

If orient is ‘records’, write out line-delimited JSON. A ValueError is raised for any other orient, since the other formats are not list-like.

compression{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}

A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.

indexbool, default True

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

See Also#

cudf.read_json

to_numpy(dtype: Dtype | None = None, copy: bool = True, na_value=None) numpy.ndarray#

Convert the Frame to a NumPy array.

Parameters#

dtypestr or numpy.dtype, optional

The dtype to pass to numpy.asarray().

copybool, default True

Whether to ensure that the returned value is not a view on another array. This parameter must be True since cuDF must copy device memory to host to provide a numpy array.

na_valueAny, default None

The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.

Returns#

numpy.ndarray

to_string()#

Convert to string

cuDF uses Pandas internals for efficient string formatting. Set formatting options using pandas string formatting options and cuDF objects will print identically to Pandas objects.

cuDF supports null/None as a value in any column type, which is transparently supported during this output process.

Examples#

>>> import cudf
>>> df = cudf.DataFrame()
>>> df['key'] = [0, 1, 2]
>>> df['val'] = [float(i + 10) for i in range(3)]
>>> df.to_string()
'   key   val\n0    0  10.0\n1    1  11.0\n2    2  12.0'
truediv(other, level=None, fill_value=None, axis=0)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters#

otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns#

DataFrame or Series

Result of the arithmetic operation.

Examples#

DataFrame

>>> df = cudf.DataFrame(
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.truediv(1)
           angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])
>>> a.truediv(b)
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64
truncate(before=None, after=None, axis=0, copy=True)#

Truncate a Series or DataFrame before and after some index value.

This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.

Parameters#

beforedate, str, int

Truncate all rows before this index value.

afterdate, str, int

Truncate all rows after this index value.

axis{0 or ‘index’, 1 or ‘columns’}, optional

Axis to truncate. Truncates the index (rows) by default.

copybool, default True

Return a copy of the truncated section.

Returns#

The truncated Series or DataFrame.

Notes#

If the index being truncated contains only datetime values, before and after may be specified as strings instead of Timestamps.

Examples#

Series

>>> import cudf
>>> cs1 = cudf.Series([1, 2, 3, 4])
>>> cs1
0    1
1    2
2    3
3    4
dtype: int64
>>> cs1.truncate(before=1, after=2)
1    2
2    3
dtype: int64
>>> import cudf
>>> dates = cudf.date_range(
...     '2021-01-01 23:45:00', '2021-01-01 23:46:00', freq='s'
... )
>>> cs2 = cudf.Series(range(len(dates)), index=dates)
>>> cs2
2021-01-01 23:45:00     0
2021-01-01 23:45:01     1
2021-01-01 23:45:02     2
2021-01-01 23:45:03     3
2021-01-01 23:45:04     4
2021-01-01 23:45:05     5
2021-01-01 23:45:06     6
2021-01-01 23:45:07     7
2021-01-01 23:45:08     8
2021-01-01 23:45:09     9
2021-01-01 23:45:10    10
2021-01-01 23:45:11    11
2021-01-01 23:45:12    12
2021-01-01 23:45:13    13
2021-01-01 23:45:14    14
2021-01-01 23:45:15    15
2021-01-01 23:45:16    16
2021-01-01 23:45:17    17
2021-01-01 23:45:18    18
2021-01-01 23:45:19    19
2021-01-01 23:45:20    20
2021-01-01 23:45:21    21
2021-01-01 23:45:22    22
2021-01-01 23:45:23    23
2021-01-01 23:45:24    24
...
2021-01-01 23:45:56    56
2021-01-01 23:45:57    57
2021-01-01 23:45:58    58
2021-01-01 23:45:59    59
dtype: int64
>>> cs2.truncate(
...     before="2021-01-01 23:45:18", after="2021-01-01 23:45:27"
... )
2021-01-01 23:45:18    18
2021-01-01 23:45:19    19
2021-01-01 23:45:20    20
2021-01-01 23:45:21    21
2021-01-01 23:45:22    22
2021-01-01 23:45:23    23
2021-01-01 23:45:24    24
2021-01-01 23:45:25    25
2021-01-01 23:45:26    26
2021-01-01 23:45:27    27
dtype: int64
>>> cs3 = cudf.Series({'A': 1, 'B': 2, 'C': 3, 'D': 4})
>>> cs3
A    1
B    2
C    3
D    4
dtype: int64
>>> cs3.truncate(before='B', after='C')
B    2
C    3
dtype: int64

DataFrame

>>> df = cudf.DataFrame({
...     'A': ['a', 'b', 'c', 'd', 'e'],
...     'B': ['f', 'g', 'h', 'i', 'j'],
...     'C': ['k', 'l', 'm', 'n', 'o']
... }, index=[1, 2, 3, 4, 5])
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o
>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n
>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j
>>> import cudf
>>> dates = cudf.date_range(
...     '2021-01-01 23:45:00', '2021-01-01 23:46:00', freq='s'
... )
>>> df2 = cudf.DataFrame(data={'A': 1, 'B': 2}, index=dates)
>>> df2.head()
                     A  B
2021-01-01 23:45:00  1  2
2021-01-01 23:45:01  1  2
2021-01-01 23:45:02  1  2
2021-01-01 23:45:03  1  2
2021-01-01 23:45:04  1  2
>>> df2.truncate(
...     before="2021-01-01 23:45:18", after="2021-01-01 23:45:27"
... )
                     A  B
2021-01-01 23:45:18  1  2
2021-01-01 23:45:19  1  2
2021-01-01 23:45:20  1  2
2021-01-01 23:45:21  1  2
2021-01-01 23:45:22  1  2
2021-01-01 23:45:23  1  2
2021-01-01 23:45:24  1  2
2021-01-01 23:45:25  1  2
2021-01-01 23:45:26  1  2
2021-01-01 23:45:27  1  2
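
As the examples show, both bounds are inclusive. The following is a minimal plain-Python sketch of that selection rule, not the library's implementation; truncate here is a hypothetical helper over parallel lists and assumes the labels are sorted.

```python
def truncate(labels, values, before=None, after=None):
    """Keep rows whose label lies in the closed interval [before, after]."""
    return [
        (k, v) for k, v in zip(labels, values)
        if (before is None or k >= before) and (after is None or k <= after)
    ]


print(truncate([0, 1, 2, 3], [1, 2, 3, 4], before=1, after=2))
print(truncate(["A", "B", "C", "D"], [1, 2, 3, 4], before="B", after="C"))
print(truncate([0, 1, 2, 3], [1, 2, 3, 4], after=1))  # only an upper bound
```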
property values#

Return a CuPy representation of the DataFrame.

Only the values in the DataFrame will be returned, the axes labels will be removed.

Returns#

cupy.ndarray

The values of the DataFrame.

property values_host#

Return a NumPy representation of the data.

Only the values in the DataFrame will be returned, the axes labels will be removed.

Returns#

numpy.ndarray

A host representation of the underlying data.

var(axis=<no_default>, skipna=True, level=None, ddof=1, numeric_only=None, **kwargs)#

Return unbiased variance of the DataFrame.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters#

axis: {index (0), columns(1)}

Axis for the function to be applied on.

skipna: bool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

ddof: int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns#

scalar

Notes#

Parameters currently not supported are level and numeric_only

Examples#

>>> import cudf
>>> df = cudf.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 8, 9, 10]})
>>> df.var()
a    1.666667
b    1.666667
dtype: float64