hipdf.io.parquet.ParquetDatasetWriter#
- class hipdf.io.parquet.ParquetDatasetWriter(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None)#
Bases: object
Write a parquet file or dataset incrementally
Parameters#
- path : str
A local directory path or S3 URL. Will be used as the root directory path while writing a partitioned dataset.
- partition_cols : list
Column names by which to partition the dataset. Columns are partitioned in the order they are given.
- index : bool, default None
If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, index(es) other than RangeIndex will be saved as columns.
- compression : {‘snappy’, None}, default ‘snappy’
Name of the compression to use. Use None for no compression.
- statistics : {‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}, default ‘ROWGROUP’
Level at which column statistics should be included in file.
- max_file_size : int or str, default None
A file size that cannot be exceeded by the writer. If the input is an int, it is interpreted as bytes. Size can also be given as a str in the form of “10 MB”, “1 GB”, etc. If this parameter is used, it is mandatory to pass file_name_prefix (see the sketch after this parameter list).
- file_name_prefix : str
This is a prefix to file names generated only when max_file_size is specified.
- storage_options : dict, optional, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
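A rough sketch of how the size-capping and remote-storage parameters above fit together. The bucket name, prefix, and credential values are placeholders, and the exact storage_options keys depend on the fsspec backend in use:
>>> cw = ParquetDatasetWriter(
...     "s3://my-bucket/dataset",       # placeholder S3 URL
...     partition_cols=["a"],
...     max_file_size="100 MB",         # cap each output file at roughly 100 MB
...     file_name_prefix="part",        # required whenever max_file_size is set
...     storage_options={"key": "<access-key>", "secret": "<secret-key>"},
... )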
Examples#
Using a context manager
>>> df1 = cudf.DataFrame({"a": [1, 1, 2, 2, 1], "b": [9, 8, 7, 6, 5]})
>>> df2 = cudf.DataFrame({"a": [1, 3, 3, 1, 3], "b": [4, 3, 2, 1, 0]})
>>> with ParquetDatasetWriter("./dataset", partition_cols=["a"]) as cw:
...     cw.write_table(df1)
...     cw.write_table(df2)
By manually calling close()
>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> cw.write_table(df2)
>>> cw.close()
Both methods will generate the same directory structure:
dataset/
    a=1
        <filename>.parquet
    a=2
        <filename>.parquet
    a=3
        <filename>.parquet
- __init__(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None) → None#
Methods
__init__(path, partition_cols[, index, ...])
close([return_metadata])  Close all open files and optionally return footer metadata as a binary blob
write_table(df)  Write a dataframe to the file/dataset
- __init__(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None) → None#
- write_table(df)#
Write a dataframe to the file/dataset
- close(return_metadata=False)#
Close all open files and optionally return footer metadata as a binary blob
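A minimal sketch of requesting the footer metadata on close, continuing the manual-close example above and treating the return value simply as an opaque binary blob:
>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> cw.write_table(df2)
>>> metadata = cw.close(return_metadata=True)  # binary footer metadata blob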