Describe This Problem
Usually the IO part of a query is the most time consuming, so reducing the time spent on it can improve query latency quite a lot.

In the current implementation we have already applied some tricks to optimize this, to name a few:

- concurrent reads, even within a single file
- min/max pruning (see the sketch below)
- custom bloom filter pruning

There is an awesome blog written by @tustvold and @alamb introducing some more advanced techniques to further improve read speed, which is definitely a must-read for developers in the Arrow ecosystem.
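To make the existing min/max pruning concrete, here is a minimal sketch using the Rust `parquet` crate, assuming an `Int64` column at leaf index 0 and a query range `[min_wanted, max_wanted]`; the statistics accessors (`min_opt`/`max_opt`) differ between arrow-rs versions, so treat this as illustrative rather than our actual implementation:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::file::statistics::Statistics;

fn read_pruned(path: &str, min_wanted: i64, max_wanted: i64) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Keep only row groups whose column-0 statistics can overlap [min_wanted, max_wanted].
    let keep: Vec<usize> = builder
        .metadata()
        .row_groups()
        .iter()
        .enumerate()
        .filter(|(_, rg)| match rg.column(0).statistics() {
            Some(Statistics::Int64(s)) => {
                // Missing bounds are treated conservatively: keep the row group.
                let min = s.min_opt().copied().unwrap_or(i64::MIN);
                let max = s.max_opt().copied().unwrap_or(i64::MAX);
                max >= min_wanted && min <= max_wanted
            }
            // No statistics (or an unexpected type): we must read the row group.
            _ => true,
        })
        .map(|(i, _)| i)
        .collect();

    // Only the selected row groups are read and decoded.
    let reader = builder.with_row_groups(keep).build()?;
    for batch in reader {
        let _batch = batch?;
        // ... process the batch ...
    }
    Ok(())
}
```

Row groups whose statistics cannot overlap the queried range are skipped before any of their pages are fetched or decoded.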
Proposal
Explore the ideas introduced in Querying Parquet with Millisecond Latency. Some notable ideas are projection and predicate pushdown (late materialization), page index pruning, and async IO pushdown.
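As a rough sketch of the late materialization idea from the blog (not our code), the following uses the `RowFilter` / `ArrowPredicateFn` API of the Rust `parquet` crate; the file name, column index, and threshold are hypothetical, and details may vary across versions:

```rust
use std::fs::File;

use arrow::array::{Array, BooleanArray, Int64Array};
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate only needs leaf column 0, so only that column is decoded
    // while evaluating the filter (late materialization).
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<Int64Array>()
            .expect("expected an Int64 column");
        // Keep rows where the value is >= 100 (illustrative threshold).
        Ok(BooleanArray::from_iter(
            col.iter().map(|v| Some(v.map_or(false, |v| v >= 100))),
        ))
    });

    // The remaining columns are decoded only for rows that passed the predicate.
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        let _batch = batch?;
        // ... process only the surviving rows ...
    }
    Ok(())
}
```

The predicate decodes only the filter column; the other columns are decoded just for the rows that pass, which is where the IO and CPU savings come from on selective queries.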
Additional Context
No response