do you support write row group to existing dataset? #1763
Replies: 2 comments
-
Hi @braindevices! Skipping of columns during reads is already implemented with column-pruning pushdowns:

```python
df = daft.read_parquet("my-file.parquet")
df = df.select("col1")
df.collect()  # Will only read "col1" from the Parquet file
```

We are actually in the process of merging lots of functionality around skipping of rows. You can try it out by setting the relevant environment variable and then running:

```python
df = daft.read_parquet("my-file.parquet")
df = df.where(df["col1"] < 10)
# Will only read row groups where min/max statistics indicate that values
# of "col1" fall within the `df["col1"] < 10` filter
df.collect()
```

Note that we do not rely on a `_metadata` file for these statistics.
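As a side note on the min/max statistics mentioned above, here is a minimal sketch (using pyarrow directly, not Daft's internals) of how to inspect the per-row-group statistics that this kind of pushdown relies on; the file path and column name are the same placeholders used in the example above:

```python
import pyarrow.parquet as pq

# Open the file and look only at footer metadata, not the data pages.
pf = pq.ParquetFile("my-file.parquet")
col_idx = pf.schema_arrow.get_field_index("col1")

for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(col_idx).statistics
    if stats is not None and stats.has_min_max:
        # A reader can skip this row group entirely if the filter range
        # (e.g. col1 < 10) does not overlap [min, max].
        print(f"row group {rg}: min={stats.min}, max={stats.max}")
```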
-
With regard to your comments around "write out big dataset by fragments": Is your use-case here more of an "append" to an existing Parquet dataset? I believe we currently just naively write Parquet files to the specified location, which should be the intended "appending" behavior already. We are also adding support soon for Apache Iceberg and Hive table appends and overwrites.
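For illustration, a rough sketch of the "write more files into the same location" append pattern described above; the directory and column names are placeholders, and this assumes `write_parquet` adds new uniquely named files rather than replacing existing ones:

```python
import daft

# First "fragment" of the dataset.
daft.from_pydict({"col1": [1, 2, 3]}).write_parquet("my_dataset")

# Later, append a second fragment by writing to the same directory.
daft.from_pydict({"col1": [4, 5, 6]}).write_parquet("my_dataset")

# Read everything back; column pruning and filters still apply at read time.
df = daft.read_parquet("my_dataset/*.parquet")
df = df.where(df["col1"] < 5)
df.collect()
```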
-
Is your feature request related to a problem? Please describe.
This is a very important feature when we have to write out a big dataset by fragments.
I cannot find any info about this in your documentation.
A simple test shows it has the potential:
However, you do not seem to support `_common_metadata` and `_metadata`, so there are no useful stats.
Also, when the dataset is super big, we usually want to use some filters to limit data reading to certain columns/rows based on the stats.
It seems like Daft still lacks this kind of ability.
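For context, the `_common_metadata` / `_metadata` convention mentioned above can be illustrated with pyarrow; this is a hedged sketch of that pattern, not Daft functionality, and the paths and column names are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": [1, 2, 3, 4]})

# Collect per-file footer metadata while writing the dataset files.
collector = []
pq.write_to_dataset(table, root_path="my_dataset", metadata_collector=collector)

# _common_metadata holds only the schema; _metadata additionally holds the
# collected row-group statistics that readers can use for filtering.
pq.write_metadata(table.schema, "my_dataset/_common_metadata")
pq.write_metadata(table.schema, "my_dataset/_metadata", metadata_collector=collector)
```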