Spillable broadcast/shuffled hash join #721

Open
richox opened this issue Dec 26, 2024 · 0 comments · May be fixed by #753
Labels
enhancement New feature or request

Comments

Collaborator

richox commented Dec 26, 2024

Is your feature request related to a problem? Please describe.
Currently, hash joins build a single monolithic in-memory hash table for joining, which may cause OOM when off-heap memory is small.

Describe the solution you'd like
Add a row/memory limit for building the hash table. When the limit is exceeded, switch to a spill-merge method:

  1. Shuffle the build-side data into N buckets (say, N = 1024).
  2. Build a separate hash table for each bucket; small buckets can be coalesced.
  3. Shuffle the probe side into the same N partitions.
  4. Read each probe partition and join it with the corresponding hash table.
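
The steps above can be sketched as follows. This is a minimal in-memory illustration in Python, not the project's implementation: the names `partition`, `partitioned_hash_join`, and `num_buckets` are hypothetical, rows are assumed to be `(key, value)` tuples, and plain lists stand in for spill files. Coalescing of small buckets (step 2) is omitted, and each bucket's hash table is built just before it is probed rather than ahead of time.

```python
from collections import defaultdict


def partition(rows, num_buckets):
    """Shuffle (key, value) rows into num_buckets lists by key hash.

    In a real spill path each bucket would be a spill file; lists are
    used here only to keep the sketch self-contained (assumption).
    """
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets


def partitioned_hash_join(build_rows, probe_rows, num_buckets=1024):
    """Inner join on key, using one small hash table per bucket."""
    build_buckets = partition(build_rows, num_buckets)  # step 1
    probe_buckets = partition(probe_rows, num_buckets)  # step 3
    joined = []
    for build_bucket, probe_bucket in zip(build_buckets, probe_buckets):
        if not build_bucket or not probe_bucket:
            continue  # nothing can match in this partition
        # Step 2: only this bucket's hash table is resident at a time.
        table = defaultdict(list)
        for key, value in build_bucket:
            table[key].append(value)
        # Step 4: probe the matching partition against that table.
        for key, probe_value in probe_bucket:
            for build_value in table.get(key, ()):
                joined.append((key, build_value, probe_value))
    return joined
```

Because both sides are partitioned with the same hash function and the same N, matching keys always land in the same bucket index, so peak memory is bounded by the largest bucket rather than by the whole build side.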

Describe alternatives you've considered
This solves the OOM problem in most cases. However, when the data is skewed, the shuffle does not help, because rows with the same key always land in the same bucket. In that situation we may fall back to sort-based joining.
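
As a rough illustration of the skew case, the sketch below counts rows per bucket and picks a strategy. The names `bucket_sizes`, `choose_strategy`, and `row_limit` are hypothetical, and deciding for the whole join (rather than per bucket) is an assumption; the issue does not specify either.

```python
from collections import Counter


def bucket_sizes(keys, num_buckets):
    """Count how many rows each hash bucket would receive."""
    counts = Counter(hash(key) % num_buckets for key in keys)
    return [counts.get(i, 0) for i in range(num_buckets)]


def choose_strategy(keys, num_buckets, row_limit):
    """Pick "hash" if every bucket fits under row_limit, else "sort_merge".

    row_limit is an assumed per-bucket cap standing in for the
    row/memory limit described in the proposal.
    """
    largest = max(bucket_sizes(keys, num_buckets), default=0)
    return "sort_merge" if largest > row_limit else "hash"
```

With one hot key repeated many times, increasing the number of buckets does not shrink the hot bucket, which is why a different join method is needed there.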


@richox richox added the enhancement New feature or request label Dec 26, 2024