Revert to using arrow tables for full valid values grid in check_tbl_values_required() #37

Open
annakrystalli opened this issue Sep 27, 2023 · 1 comment

@annakrystalli
Member

Previously I had been using arrow tables, which seem more memory-efficient and generally more performant, to optimise check_tbl_values_required(), which can be slow with larger files.

In d1e2861 I reverted this because I discovered that joins performed with arrow do not treat NA values as matches (as dplyr does by default), so rows with NA in the join keys were lost during inner joins. (See the issue reported here: apache/arrow#14907)
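
To make the difference concrete, here is a minimal sketch of the two matching behaviours. It is plain Python rather than the actual arrow or dplyr code: `inner_join()` is a hypothetical helper, its `na_matches` parameter is only named after the dplyr join argument, and the column names and rows are invented for illustration.

```python
def inner_join(left, right, key, na_matches="na"):
    """Inner-join two lists of dicts on `key` (hypothetical helper).

    na_matches="na"    -> missing (None) keys match each other,
                          mirroring dplyr's default behaviour.
    na_matches="never" -> missing keys never match anything,
                          mirroring SQL-style null semantics.
    """
    if na_matches not in ("na", "never"):
        raise ValueError("na_matches must be 'na' or 'never'")

    joined = []
    for l_row in left:
        for r_row in right:
            l_key, r_key = l_row[key], r_row[key]
            if l_key is None or r_key is None:
                # A missing key only matches another missing key,
                # and only when NA matching is enabled.
                matched = (
                    na_matches == "na" and l_key is None and r_key is None
                )
            else:
                matched = l_key == r_key
            if matched:
                joined.append({**l_row, **r_row})
    return joined


# Invented example data: one row has a missing join key on both sides.
submitted = [
    {"output_type_id": "0.5", "value": 10},
    {"output_type_id": None, "value": 7},
]
valid_grid = [
    {"output_type_id": "0.5", "required": True},
    {"output_type_id": None, "required": True},
]

with_na = inner_join(submitted, valid_grid, "output_type_id", na_matches="na")
without_na = inner_join(
    submitted, valid_grid, "output_type_id", na_matches="never"
)

print(len(with_na))     # both rows survive the join
print(len(without_na))  # the row with the missing key is dropped
```

With `na_matches="never"` the row whose key is missing silently disappears from the result, which is the kind of data loss described above.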

Hopefully, this will at some point be resolved. Once it is, changes in d1e2861 will need reverting to make the function more performant again.

@annakrystalli
Member Author

annakrystalli commented Sep 27, 2023

Discussion in arrow has moved to a separate issue: apache/arrow#37902

Projects
Status: Todo