do you support write row group to existing dataset? #1763
Replies: 2 comments
-
Hi @braindevices! Skipping of columns during reads is already implemented with column-pruning pushdowns:

```python
df = daft.read_parquet("my-file.parquet")
df = df.select("col1")
df.collect()  # Will only read "col1" from the Parquet file
```

We are actually in the process of merging lots of functionality around skipping of rows. You can try it out by setting the relevant environment variable and then running:

```python
df = daft.read_parquet("my-file.parquet")
df = df.where(df["col1"] < 10)
# Will only read row groups where min/max statistics indicate that values
# of "col1" fall within the `df["col1"] < 10` filter
df.collect()
```

Note that we do not rely on a `_metadata` file for these statistics.
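As a side note on the min/max statistics mentioned above, here is a minimal sketch (using pyarrow directly, not Daft's internals) of how to inspect the per-row-group statistics that this kind of pushdown relies on; the file path and column name are the same placeholders used in the example above:

```python
import pyarrow.parquet as pq

# Open the file and look only at footer metadata, not the data pages.
pf = pq.ParquetFile("my-file.parquet")
col_idx = pf.schema_arrow.get_field_index("col1")

for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(col_idx).statistics
    if stats is not None and stats.has_min_max:
        # A reader can skip this row group entirely if the filter range
        # (e.g. col1 < 10) does not overlap [min, max].
        print(f"row group {rg}: min={stats.min}, max={stats.max}")
```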
-
With regard to your comments around "write out big dataset by fragments": Is your use-case here more of an "append" to an existing Parquet dataset? I believe we currently just naively write Parquet files to the specified location, which should be the intended "appending" behavior already. We are also adding support soon for Apache Iceberg and Hive table appends and overwrites.
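For illustration, a rough sketch of the "write more files into the same location" append pattern described above; the directory and column names are placeholders, and this assumes `write_parquet` adds new uniquely named files rather than replacing existing ones:

```python
import daft

# First "fragment" of the dataset.
daft.from_pydict({"col1": [1, 2, 3]}).write_parquet("my_dataset")

# Later, append a second fragment by writing to the same directory.
daft.from_pydict({"col1": [4, 5, 6]}).write_parquet("my_dataset")

# Read everything back; column pruning and filters still apply at read time.
df = daft.read_parquet("my_dataset/*.parquet")
df = df.where(df["col1"] < 5)
df.collect()
```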
-
Is your feature request related to a problem? Please describe.
This is a very important feature when we have to write out a big dataset by fragments.
I cannot find any info about this in your documentation.
A simple test shows it has the potential:
However, you do not seem to support `_common_metadata` and `_metadata`, so there are no useful stats.
Also, when the dataset is super big, we usually want to use some filters to limit data reading to certain columns/rows based on the stats.
It seems like Daft still lacks this kind of ability.
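For context, the `_common_metadata` / `_metadata` convention mentioned above can be illustrated with pyarrow; this is a hedged sketch of that pattern, not Daft functionality, and the paths and column names are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col1": [1, 2, 3, 4]})

# Collect per-file footer metadata while writing the dataset files.
collector = []
pq.write_to_dataset(table, root_path="my_dataset", metadata_collector=collector)

# _common_metadata holds only the schema; _metadata additionally holds the
# collected row-group statistics that readers can use for filtering.
pq.write_metadata(table.schema, "my_dataset/_common_metadata")
pq.write_metadata(table.schema, "my_dataset/_metadata", metadata_collector=collector)
```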