-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add arbitrary matching function for joins #3445
Comments
See also #2738 for a related discussion. Two key things are:
|
I don't understand this part. You need to do equality check to match the rows in two dataframes or to do validation. Equality check is a binary operation. What difference would it make to swap one binary operation with another one? Could you may be provide the examples of slow and fast joins in the existing implementation?
Why not simply pass |
Existing implementation is always fast because it uses equality that is hashable (in short, as we do not always actually hash things). However, for a general equality operator it does not have to be hashable, as e.g. it might not be transitive. In that case you need O(n^2) comparisons. An example of such equality operator is
Because, in general, it does not allow us to check if |
I have seen the discussion in #3363, and it was mentioned that adding arbitrary matching functions is part of 1.7 milestone. I would like to ask, what is the status of this feature request, are there any plans to implement it soon?
In principle, I could try to cook up a PR on my own, but I am not that familiar with DataFrames internals. Also the details of the syntax should be discussed in advance.
So, what is the principial difficulty with adding this feature? The way I see it, we just need to overload the equality comparison, which is used to relate the rows from the two DataFrames. Everything else, including the validation can stay without changes.
Now, there is a separate matter of the transformation of the columns, that was requested in #3363. I think, it makes sense to distinguish transformation and matching.
We should really only match the columns of the same type between two dataframes, because we also need to match different rows inside the same dataframe to do the validation.
And transformation should only provide the means to bring the columns to the same type: we first transform columns in one or both dataframes to the same type, and then run matching on the results of the transforms.
Transformaton can be already done by the users themselves simply by adding extra column. On the other hand, it is the arbitrary matching that requires tinkering with internals, but treating it separately from transformation simplifies the realization.
What do you think about it?
The text was updated successfully, but these errors were encountered: