Shouldn't stop just because a step returned no files #27

Merged: 1 commit, Jun 3, 2024

Conversation

nicklan
Contributor

@nicklan nicklan commented May 31, 2024

This was fun :). This fixes delta-io/delta-kernel-rs#233

Basically, if you push down a predicate, a batch that does include an Add file can still return no files to scan, because every file in it is filtered out. The kernel can't know this in advance, because it doesn't introspect the data until the engine asks it to extract the files. So in the case of running:

SELECT letter, number
FROM delta_scan('${DAT_PATH}/out/reader_tests/generated/basic_append/delta')
WHERE number < 2

the first batch included one file, but the predicate filtered it out, so nothing came out of that batch and resolved_files.size() == size_before was true. DuckDB then stopped looking for more files, even though there was one more file to scan: the one with the data we want! :)

The simple fix is to keep iterating until the kernel tells you there is definitely no more data.

There's a chance the kernel could optimize further and never return that first batch at all, but in general I think engines should keep iterating until scan_data_next returns false.
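To make the difference concrete, here is a minimal, self-contained sketch of the two loop shapes. It is not the extension's actual code: `MockScan`, `Next`, `ResolveStopOnEmpty`, and `ResolveUntilExhausted` are hypothetical names invented for illustration, and `MockScan` only imitates the contract described above (a call can succeed while contributing zero files, and a false return is the only signal that the scan is finished).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the kernel scan iterator; NOT the real kernel FFI.
struct MockScan {
    // Each inner vector holds the files of one batch that survive the predicate.
    std::vector<std::vector<std::string>> batches;
    std::size_t pos = 0;

    // Appends the next batch's surviving files to `out`.
    // Returns false only when the scan is exhausted; a batch may add nothing.
    bool Next(std::vector<std::string> &out) {
        if (pos >= batches.size()) {
            return false;
        }
        const std::vector<std::string> &batch = batches[pos++];
        out.insert(out.end(), batch.begin(), batch.end());
        return true;
    }
};

// Old behaviour (assumed shape): an empty batch is mistaken for end of scan.
std::vector<std::string> ResolveStopOnEmpty(MockScan scan) {
    std::vector<std::string> resolved_files;
    while (true) {
        const std::size_t size_before = resolved_files.size();
        if (!scan.Next(resolved_files)) {
            break;
        }
        if (resolved_files.size() == size_before) {
            break; // bug: later batches may still contain files
        }
    }
    return resolved_files;
}

// Fixed behaviour: only the iterator's own "no more data" signal ends the loop.
std::vector<std::string> ResolveUntilExhausted(MockScan scan) {
    std::vector<std::string> resolved_files;
    while (scan.Next(resolved_files)) {
        // Nothing to do here; empty batches are simply skipped.
    }
    return resolved_files;
}
```

With a scan whose first batch is fully filtered out and whose second batch holds one file, the first function returns nothing while the second returns that file, which matches the `WHERE number < 2` case above.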

@samansmink samansmink merged commit 23c7f56 into duckdb:main Jun 3, 2024
13 checks passed
@samansmink
Collaborator

thanks, @nicklan!

Development

Successfully merging this pull request may close these issues.

Filter pushdown issue
2 participants