Implement column matching into dirty_cat #332

jovan-stojanovic · 2022-09-08T10:10:18Z

jovan-stojanovic
Sep 8, 2022
Maintainer

This discussion is meant to gather ideas for a first implementation of column matching.

Column matching is a very common problem when assembling tables for machine learning.

For instance, given two tables whose columns correspond but are not matched (differing or missing column names), how can we match them?

jovan-stojanovic · 2022-09-08T10:29:56Z

jovan-stojanovic
Sep 8, 2022
Maintainer Author

To summarize what was mentioned:

How do we define the corresponding columns? (raised by @du-phan)

For instance, if we have a column "roads_in_Biarritz" and a column "roads_in_France", are they considered matched, given that one set is much larger than the other?

What metric should we use?

This blog post by @GaelVaroquaux is useful, as it shows that it is possible to use distributional distance to compute distances between columns. This code may be used to implement this metric.

According to @du-phan and @dsleo, depending on whether the set size matters (see the first question above), we may use Jaccard containment or Jaccard similarity with the datasketch package.

0 replies

du-phan · 2022-09-14T14:01:09Z

du-phan
Sep 14, 2022

Hi,

To add to Jovan's message above: when doing set matching, depending on the use case at hand, we might or might not be interested in the set intersection size, and the chosen metric (either Jaccard similarity or Jaccard containment) should reflect that.

For example, when looking for a pair of columns to join, we want the biggest set intersection possible, so Jaccard containment should be preferred, as it is not biased by the size of the union of the two sets. In the example above, roads_in_France is thus a good join candidate for roads_in_Biarritz, as a left join will cover most (if not all) of the values in the latter column.

Otherwise, if I want to know whether two columns are "similar", what matters is the ratio between the size of the intersection and the size of the union: I want this ratio to be as close to 1 as possible. In that case, Jaccard similarity is the right choice, and roads_in_France is not a good match for roads_in_Biarritz.
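To make the difference concrete, here is a minimal sketch of both metrics computed exactly on Python sets. The function names and the toy road names are made up for illustration; they are not part of dirty_cat or datasketch.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|: symmetric, and penalized by a size mismatch."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_containment(query, candidate):
    """|Q ∩ C| / |Q|: fraction of the query column covered by the candidate."""
    query, candidate = set(query), set(candidate)
    return len(query & candidate) / len(query) if query else 0.0


# Hypothetical toy data: the Biarritz roads are a small subset of the French roads.
roads_in_biarritz = {"Avenue de la Marne", "Rue Gambetta", "Avenue Foch"}
roads_in_france = roads_in_biarritz | {f"Road {i}" for i in range(97)}

# 3 shared values out of 100 distinct values overall: a poor "similar column" match.
similarity = jaccard_similarity(roads_in_biarritz, roads_in_france)

# 3 of the 3 Biarritz values are found in the French column: a good join candidate.
containment = jaccard_containment(roads_in_biarritz, roads_in_france)
```

Note that containment is asymmetric: swapping the arguments asks how much of roads_in_France is covered by roads_in_Biarritz, which is again only 3 out of 100.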

From an algorithmic point of view, if we opt for an approximate method (hashing), the two metrics call for quite different algorithms. Classical MinHash LSH only provides guarantees for Jaccard similarity; for Jaccard containment we need to look at, for example, LSH Ensemble.
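As a rough illustration of the hashing idea, here is a from-scratch MinHash sketch that estimates Jaccard similarity from fixed-size signatures. It is not the datasketch API and it leaves out the LSH indexing step; the salted BLAKE2 hashes, `num_perm=256`, and the function names are assumptions chosen for this example only.

```python
import hashlib


def minhash_signature(values, num_perm=256):
    """Keep, for each of `num_perm` salted hash functions, the minimum hash value.

    Assumes `values` is non-empty; values are compared through their string form.
    """
    distinct = {str(v).encode("utf-8") for v in values}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(v, digest_size=8, salt=salt).digest(), "little"
                )
                for v in distinct
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where the minima agree estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical columns sharing 500 of their 1500 distinct values (true Jaccard = 1/3).
col_a = {f"road_{i}" for i in range(0, 1000)}
col_b = {f"road_{i}" for i in range(500, 1500)}

estimate = estimate_jaccard(minhash_signature(col_a), minhash_signature(col_b))
```

The estimate only targets Jaccard similarity. It says nothing direct about containment, which is why a different scheme such as LSH Ensemble is needed for the join use case.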

1 reply

GaelVaroquaux Sep 14, 2022
Maintainer

I totally agree that we will need to expose union vs intersection at least.

We will worry about scalability later, and focus first on implementing a version that doesn't scale well but works on small datasets. However, I agree that in the long term we should probably look at probabilistic algorithms such as LSH.
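A brute-force version along those lines could look like the sketch below. It compares every pair of columns exactly, so it is quadratic in the number of columns. The `match_columns` name, the dict-of-lists table format, the choice of containment as the score, and the `threshold` default are all hypothetical, not an existing dirty_cat interface.

```python
def containment(query, candidate):
    """Exact Jaccard containment: share of distinct query values found in candidate."""
    query, candidate = set(query), set(candidate)
    return len(query & candidate) / len(query) if query else 0.0


def match_columns(left, right, threshold=0.5):
    """For each left column, return the best-scoring right column and its score.

    `left` and `right` map column names to lists of values. Every pair is
    compared, which is fine for small tables but does not scale.
    """
    matches = {}
    for left_name, left_values in left.items():
        scores = {
            right_name: containment(left_values, right_values)
            for right_name, right_values in right.items()
        }
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            matches[left_name] = (best, scores[best])
    return matches


# Hypothetical tables: same content, but the right table lost its column names.
left = {"town": ["Biarritz", "Paris", "Lyon"], "code": ["64", "75", "69"]}
right = {
    "col_0": ["69", "75", "64", "13"],
    "col_1": ["Paris", "Lyon", "Biarritz", "Marseille"],
}

matches = match_columns(left, right)
```

Swapping `containment` for a symmetric Jaccard similarity would be one way to expose the union vs intersection choice discussed above.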
