Implement column matching into dirty_cat #332
-
To summarize what was mentioned:
For instance, if we have a column "roads_in_Biarritz" and a column "roads_in_France", are they considered matched, given that one set is much larger than the other?
This blog post by @GaelVaroquaux is useful, as it shows that a distributional distance can be used to compute distances between columns. This code may be used to implement this metric. According to @du-phan and @dsleo, depending on whether the set size matters (see first question below), we may use Jaccard containment or Jaccard similarity with the datasketch package.
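To make the difference between the two measures concrete, here is a minimal sketch that computes both exactly on toy data. The function names and the synthetic road names are made up for illustration; this is not dirty_cat or datasketch code.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|: symmetric, penalizes a size mismatch."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_containment(query, reference):
    """|Q ∩ R| / |Q|: fraction of the query set found in the reference."""
    query, reference = set(query), set(reference)
    return len(query & reference) / len(query) if query else 0.0


# Hypothetical stand-ins for the two columns:
# every Biarritz road is also a French road.
roads_in_biarritz = {f"biarritz_road_{i}" for i in range(10)}
roads_in_france = roads_in_biarritz | {f"other_road_{i}" for i in range(990)}

similarity = jaccard_similarity(roads_in_biarritz, roads_in_france)
containment = jaccard_containment(roads_in_biarritz, roads_in_france)
print(similarity)   # 0.01
print(containment)  # 1.0
```

On this toy data the similarity is close to 0 while the containment of the small column in the large one is 1, so whether the two columns "match" depends entirely on which measure is chosen.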
-
Hi,
To complete Jovan's message above: when doing set matching, depending on the use case at hand, we might or might not be interested in the relative sizes of the sets.
For example, when looking for a pair of columns to be joined, we want the biggest set intersection possible, and thus Jaccard containment is the appropriate measure. Otherwise, if I want to know whether two columns are "similar", what matters is the ratio between the size of the intersection and the size of the union: I want this ratio to be as close to 1 as possible. In that case, Jaccard similarity is the appropriate measure.
From an algorithmic point of view, if we opt for an approximate method (hashing), the two cases call for quite different algorithms. The classical MinHash LSH only provides guarantees for Jaccard similarity.
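As an illustration of the hashing approach, here is a pure-Python sketch of MinHash, which estimates Jaccard similarity from fixed-size signatures. It is a simplified teaching version, not the datasketch implementation; the function names, the choice of `blake2b` as the hash family, and `num_perm=128` are assumptions made for the example.

```python
import hashlib


def _hash(value, seed):
    """64-bit hash of a value, salted with the seed (one hash function per seed)."""
    digest = hashlib.blake2b(
        str(value).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_perm=128):
    """For each seeded hash function, keep the minimum hash over the set.

    `values` must be non-empty.
    """
    values = set(values)
    return [min(_hash(v, seed) for v in values) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the signatures agree.

    For one hash function, the two minima coincide with probability
    |A ∩ B| / |A ∪ B|, so this fraction estimates the Jaccard similarity.
    """
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Two hypothetical columns sharing 100 of 300 distinct values (true Jaccard = 1/3).
col_a = {f"v{i}" for i in range(0, 200)}
col_b = {f"v{i}" for i in range(100, 300)}

sig_a = minhash_signature(col_a)
sig_b = minhash_signature(col_b)
estimate = estimate_jaccard(sig_a, sig_b)
print(estimate)  # an approximation of 1/3
```

The estimate targets the intersection-over-union ratio, which is why this family of sketches suits the "similar columns" use case. For the containment use case, datasketch provides a separate index, `MinHashLSHEnsemble`, which may be a better starting point.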
-
The discussion here will be useful to come up with ideas on a first implementation of column matching.
Column matching is a very common problem when assembling tables for machine learning.
For instance, given two tables, with corresponding columns but not matched (differing or missing column names), how can we match them?
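One possible starting point, sketched under simplifying assumptions: represent each table as a dict mapping a column name to its values, score every pair of columns by the Jaccard similarity of their distinct values, and greedily keep the best one-to-one pairs above a threshold. The function name `match_columns`, the threshold of 0.5, and the toy tables are hypothetical and not part of dirty_cat.

```python
def match_columns(left, right, threshold=0.5):
    """Greedy one-to-one matching of columns by Jaccard similarity of their values.

    `left` and `right` map column names to iterables of values.
    Returns a dict {left_column_name: right_column_name}.
    """
    scores = []
    for lname, lvals in left.items():
        for rname, rvals in right.items():
            lset, rset = set(lvals), set(rvals)
            union = lset | rset
            score = len(lset & rset) / len(union) if union else 0.0
            scores.append((score, lname, rname))

    # Best-scoring pairs first; each column is used at most once.
    scores.sort(reverse=True)
    matches, used_left, used_right = {}, set(), set()
    for score, lname, rname in scores:
        if score < threshold:
            break
        if lname in used_left or rname in used_right:
            continue
        matches[lname] = rname
        used_left.add(lname)
        used_right.add(rname)
    return matches


# Toy tables: the second one has uninformative column names.
table_a = {
    "city": ["Paris", "Lyon", "Nice", "Biarritz"],
    "country": ["France", "France", "France", "France"],
}
table_b = {
    "col_0": ["France", "France", "Spain"],
    "col_1": ["Lyon", "Paris", "Nice", "Madrid"],
}

matches = match_columns(table_a, table_b)
print(matches)  # {'city': 'col_1', 'country': 'col_0'}
```

Swapping the score for a containment ratio would favor joinable columns instead of similar ones; exact set operations like these would likely need to be replaced by sketches for large tables.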