hipdf.io.parquet.ParquetDatasetWriter#
- class hipdf.io.parquet.ParquetDatasetWriter(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None)#
Bases: object
Write a parquet file or dataset incrementally
Parameters#
- path : str
A local directory path or S3 URL. Will be used as the root directory path while writing a partitioned dataset.
- partition_cols : list
Column names by which to partition the dataset. Columns are partitioned in the order they are given.
- index : bool, default None
If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, index(es) other than RangeIndex will be saved as columns.
- compression : {‘snappy’, None}, default ‘snappy’
Name of the compression to use. Use None for no compression.
- statistics : {‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}, default ‘ROWGROUP’
Level at which column statistics should be included in file.
- max_file_size : int or str, default None
A file size that cannot be exceeded by the writer. If the input is an int, it is interpreted as bytes. Size can also be given as a str in the form of “10 MB”, “1 GB”, etc. If this parameter is used, it is mandatory to pass file_name_prefix (see the sketch after this parameter list).
- file_name_prefix : str
This is a prefix to file names generated only when max_file_size is specified.
- storage_options : dict, optional, default None
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
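A rough sketch of how the size-capping and remote-storage parameters above fit together. The bucket name, prefix, and credential values are placeholders, and the exact storage_options keys depend on the fsspec backend in use:
>>> cw = ParquetDatasetWriter(
...     "s3://my-bucket/dataset",       # placeholder S3 URL
...     partition_cols=["a"],
...     max_file_size="100 MB",         # cap each output file at roughly 100 MB
...     file_name_prefix="part",        # required whenever max_file_size is set
...     storage_options={"key": "<access-key>", "secret": "<secret-key>"},
... )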
Examples#
Using a context manager
>>> df1 = cudf.DataFrame({"a": [1, 1, 2, 2, 1], "b": [9, 8, 7, 6, 5]})
>>> df2 = cudf.DataFrame({"a": [1, 3, 3, 1, 3], "b": [4, 3, 2, 1, 0]})
>>> with ParquetDatasetWriter("./dataset", partition_cols=["a"]) as cw:
...     cw.write_table(df1)
...     cw.write_table(df2)
By manually calling close()
>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> cw.write_table(df2)
>>> cw.close()
Both methods will generate the same directory structure:
dataset/
    a=1
        <filename>.parquet
    a=2
        <filename>.parquet
    a=3
        <filename>.parquet
- __init__(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None) → None#
Methods
__init__(path, partition_cols[, index, ...])
close([return_metadata])  Close all open files and optionally return footer metadata as a binary blob
write_table(df)  Write a dataframe to the file/dataset
- __init__(path, partition_cols, index=None, compression='snappy', statistics='ROWGROUP', max_file_size=None, file_name_prefix=None, storage_options=None) → None#
- write_table(df)#
Write a dataframe to the file/dataset
- close(return_metadata=False)#
Close all open files and optionally return footer metadata as a binary blob
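A minimal sketch of requesting the footer metadata on close, continuing the manual-close example above and treating the return value simply as an opaque binary blob:
>>> cw = ParquetDatasetWriter("./dataset", partition_cols=["a"])
>>> cw.write_table(df1)
>>> cw.write_table(df2)
>>> metadata = cw.close(return_metadata=True)  # binary footer metadata blob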