Adding the StringEncoder transformer #1159

rcap107 · 2024-11-26T16:17:05Z

This is a first draft of a PR to address #1121

I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.

Things left to do:

rcap107 · 2024-12-05T15:43:10Z

Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, and that raises issues with potentially sparse matrices.

@jeromedockes suggests using directly TruncatedSVD to begin with, rather than adding a check on the version.

Also, I am using tf-idf as vectorizer, should I use something else? Maybe HashVectorizer?

(writing this down so I don't forget)

GaelVaroquaux · 2024-12-09T14:25:47Z

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper: this is important for modeling choices (eg the hyper-parameters)

rcap107 · 2024-12-09T14:27:13Z

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper: this is important for modeling choices (eg the hyper-parameters)

Where can I find the benchmarks?

GaelVaroquaux · 2024-12-09T15:07:03Z

Actually, let's keep it simple, and use the CARTE datasets, they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark

You probably want to instanciate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that is either higest cardinality, or most "diverse entry" in the sense of https://arxiv.org/abs/2312.09634

Vincent-Maladiere · 2024-12-09T15:30:42Z

Should we also add this to the text encoder example, along the TextEncoder, MinHashEncoder and GapEncoder? It shows a tiny benchmark on the toxicity dataset.

rcap107 · 2024-12-09T15:32:48Z

Should we also add this to the text encoder example, along the TextEncoder, MinHashEncoder and GapEncoder? It shows a tiny benchmark on the toxicity dataset.

It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder

Vincent-Maladiere · 2024-12-09T15:42:11Z

That's very interesting!

jeromedockes

nice! 🎉 looking really good @rcap107 . it will be awesome to have this as the cost of the GapEncoder is the main pain point of the tablevectorizer ATM 🚀

I left a couple of minor comments, the main one is just a possible performance improvement if we use the vectorizer's fit_transform to avoid tokenizing the input twice during fit

the code is very close to ready and we have an example so once that's done what are the next steps? maybe some lightweight comparisons on a couple of datasets, then merge, then more extensive experiments to decide if it can be come the tablevectorizer's default?

CHANGES.rst

example_string_encoder.py

examples/02_text_with_string_encoders.py

jeromedockes · 2024-12-11T11:11:44Z

examples/02_text_with_string_encoders.py

+# %%
+# |TextEncoder| embeddings are very strong, but they are also quite expensive to
+# train. A simpler, faster alternative for encoding strings is the |StringEncoder|,
+# which works by first performing a tf-idf vectorization of the text, and then


maybe tf-idf (computing vectors of rescaled word counts + wikipedia link)

jeromedockes · 2024-12-11T11:14:43Z

skrub/_string_encoder.py

+        del y
+        self.pipe = Pipeline(
+            [
+                ("tfidf", TfidfVectorizer()),


I think @GaelVaroquaux suggested using a HashingVectorizer instead of tfidfvectorizer (i don't think this would require changes elswhere in your code)

skrub/_string_encoder.py

rcap107 · 2024-12-11T13:14:57Z

nice! 🎉 looking really good @rcap107 . it will be awesome to have this as the cost of the GapEncoder is the main pain point of the tablevectorizer ATM 🚀

I left a couple of minor comments, the main one is just a possible performance improvement if we use the vectorizer's fit_transform to avoid tokenizing the input twice during fit

I implemented a change, I'm not sure if it's what you meant..

the code is very close to ready and we have an example so once that's done what are the next steps? maybe some lightweight comparisons on a couple of datasets, then merge, then more extensive experiments to decide if it can be come the tablevectorizer's default?

Sounds good

Vincent-Maladiere · 2024-12-11T15:12:06Z

the code is very close to ready and we have an example so once that's done what are the next steps? maybe some lightweight comparisons on a couple of datasets, then merge, then more extensive experiments to decide if it can be come the tablevectorizer's default?

I like this plan as well!

GaelVaroquaux · 2024-12-11T18:07:21Z

Before we merge, I would love a bit of experimenting, with a fairly systematic eye on the choice of parameters.

For instance: HashingVectorizer vs TfIdfVectorizer? If we go for hashingVectorizer, should it be followed by a TfIdfTransformer or not? For the vectorizer, what should be the default values for the analyzer, the n_features and the n_gram range. I suspect that we want 'analyzer="char_wb"' and ngram_range=(3, 4) based on my experience.

These choices are actually very important, and we should drive them using systematic experimentation. I know that it is a lot of work, but it makes a huge difference.

rcap107 · 2024-12-12T15:08:06Z

So I went and ran a few experiments and I have some preliminary results. I reused some of the code from the main example, so the plots are not very good. I am working on a separate branch for the moment.

This was done on the toxicity dataset.

I modified the StringEncoder to have

vectorizer to choose between tfidf and hashing (mapped to TfidfVectorizer and HashingVectorizer). Default is tfidf
ngram_range, which can be any pair of integers >= (1,1). Default is (1,1).
tf_idf_followup, which is a boolean flag. If active, the HashingVectorizer is followed by a TfidfTransformer.
analyzer, which can be word, char, or char_wb. Default is word.
I also added:
n_features, which is a parameter for tfidf.
max_features, which is a parameter for hashing.

Then I tested these parameters:

configurations = {
    "ngram_range": [(1, 1), (3, 4)],
    "analyzer": ["word", "char", "char_wb"],
    "vectorizer": ["hashing", "tfidf"],
    "n_components": [30],
    "tf_idf_followup": [True, False],
}

Only tfidf

Only hashingvectorizer, no tfidf followup

Hashingvectorizer with tfidf followup

Summary is:

tfidf is faster
hashing + tfidf has the same prediction performance as tfidf, but it's much slower
hashing alone is worse than tfidf and is much slower
(1,1) is always much better than (3,4) for ngram_range

…string-encoder-bench

jeromedockes · 2024-12-12T18:35:27Z

thanks for those experiments! I'm surprised the (1, 1 ) ngram with "char" or "char_wb" can perform so well -- isn't it just counting individual characters?

GaelVaroquaux · 2024-12-12T18:42:07Z

I'm surprised the (1, 1 ) ngram with "char" or "char_wb" can perform so well -- isn't it just counting individual characters?

char_wb makes sense: it's counting words and characters

rcap107 · 2024-12-12T20:38:18Z

I ran a few more experiments, because the results looked suspicious, and I think what I have now makes much more sense.

Pareto plots (time is on a log scale):

Boxplots, test score on y-axis

Boxplots, time on y-axis

I don't know why the previous results were so off, I might have been reusing results across runs 🤔

What is consistent is that tf-idf is way faster and has similar performance if not better than hashing vectorizer.

GaelVaroquaux · 2024-12-12T23:57:06Z

OK, those results look great. Please do keep the associated data, as one day we may publish a paper on skrub 😁

If I can make an editorial decision: let's use the char_wb, ngram_range=(3, 4). It matches what we do in the GapEncoder.

Vincent-Maladiere · 2024-12-13T09:16:28Z

Thanks for running these bench @rcap107 !! Glad to see we have a clear winner for the TableVectorizer as @GaelVaroquaux said.

rcap107 · 2024-12-13T09:35:54Z

Great, then I'll clean up the script I am using and I'll put it somewhere, so I can find it if we decide to run more in-depth experiments (like by testing more tables, or changing the number of components or idk).

As for the StringEncoder, do I keep the HashingVectorizer at all? Or do I keep only the arguments relative to tf-idf?

jeromedockes · 2024-12-13T09:41:43Z

> I'm surprised the (1, 1 ) ngram with "char" or "char_wb" can perform so well -- isn't it just counting individual characters? char_wb makes sense: it's counting words and characters

IIUC correctly char_wb prevents char ngrams from crossing word boundaries but they're still only character ngrams no?

rcap107 · 2024-12-13T10:47:04Z

skrub/tests/test_string_encoder.py

+    df_module.assert_frame_equal(check_df, result)
+
+
+def test_hashing(encode_column, df_module):


I think this test is failing because the hashing vectorizer is not deterministic, if that's the case, I'm not sure how to handle it in the test other than exposing RNG everywhere.

I didn't bother because we haven't decided whether to keep the hashing vectorizer in the first place.

maybe in a first version we can keep just the tfidf

otherwise in the test you can just check the type and shape of the output and the content of self.pipe

rcap107 · 2024-12-13T10:57:53Z

For the record, this is the same example with the new defaults

Pretty good result I think

jeromedockes · 2024-12-13T11:08:52Z

nice!! do you mind trying the employee_salaries dataset too, as it has columns that are more like categories rather than text?

examples/02_text_with_string_encoders.py

jeromedockes · 2024-12-13T12:54:22Z

skrub/_string_encoder.py

+        # ERROR CHECKING
+        if self.analyzer not in ["char_wb", "char", "word"]:
+            raise ValueError(f"Unknown analyzer {self.analyzer}")
+
+        if not all(isinstance(x, int) and x > 0 for x in self.ngram_range):
+            raise ValueError(
+                "Values in `ngram_range` must be positive integers, "
+                f"found {self.ngram_range} instead."
+            )
+        if not len(self.ngram_range) == 2:
+            raise ValueError(
+                f"`ngram_range` must have length 2, found {len(self.ngram_range)}."
+            )
+
+        if not isinstance(self.n_components, int) and self.n_components > 0:
+            raise ValueError(
+                f"`n_components` must be a positive integer, found {self.n_components}"
+            )


don't the tfidf vectorizer and truncated svd do a similar validation already?

I left in only the first check, the other constraints are in there, but I think they're less strict

jeromedockes · 2024-12-13T12:59:17Z

skrub/tests/test_string_encoder.py

+    df_module.assert_frame_equal(check_df, result)
+
+
+def test_hashing(encode_column, df_module):


maybe in a first version we can keep just the tfidf

jeromedockes · 2024-12-13T13:00:25Z

skrub/tests/test_string_encoder.py

+    df_module.assert_frame_equal(check_df, result)
+
+
+def test_hashing(encode_column, df_module):


otherwise in the test you can just check the type and shape of the output and the content of self.pipe

rcap107 · 2024-12-13T13:32:43Z

nice!! do you mind trying the employee_salaries dataset too, as it has columns that are more like categories rather than text?

I used R2 as metric, and did not test HashingVectorizer. It seems like when columns are more categories than text, then the GapEncoder is doing a better job than all the alternatives.

Still, StringEncoder is much faster and the performance is pretty good. TextEncoder is not doing quite as well on the other hand.

Bottom line is, for more "categorical-looking" attributes, use the GapEncoder for better performance, or StringEncoder for decent performance and faster fit time, for text use TextEncoder for better performance or StringEncoder for decent performance and much faster fit time.

This is from a pretty quick and dirty set of experiments, we could also try something like the heuristics from Leo's paper to choose, but that's going to get way more complicated.

By the way, where should I put all the code I used for these results? So far I have it in my scratch folder.

Co-authored-by: Jérôme Dockès <[email protected]>

…df-pca

rcap107 added 12 commits November 21, 2024 10:56

Fixing changelog with correct account

ec37e13

Merge remote-tracking branch 'upstream/main'

b3dae47

Merge branch 'main' of github.com:skrub-data/skrub

99e5450

Initial commit

4f7e46e

Update

583250b

Merge branch 'main' of github.com:skrub-data/skrub

4a39f36

Merge branch 'main' of github.com:skrub-data/skrub

ee2f739

Merge branch 'main' into tfidf-pca

30ad689

Merge remote-tracking branch 'upstream/main' into tfidf-pca

d7f1cd7

Updated object and added test

8686d7f

quick update to changelog

eb4de97

Fixed test

96423ba

rcap107 added 7 commits December 7, 2024 12:20

Merge branch 'main' of github.com:skrub-data/skrub

e01637c

Replacing PCA with TruncatedSVD

3a1f6eb

Updated init

398f9db

Updated example to add StringEncoder

3a45f19

Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca

38a9f2d

Updating changelog.

51856b3

📝 Updating docstrings

58a3559

📝 Fixing example

8e4fce2

rcap107 added 2 commits December 9, 2024 16:20

✅ Fixing tests and renaming test file

afdb361

✅ Fixing coverage

6c6d884

🐛 Fixing the name of a variable

9366d90

jeromedockes reviewed Dec 11, 2024

View reviewed changes

rcap107 added 2 commits December 11, 2024 13:58

Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca

6b474c6

Addressing comments in review

e8f308e

rcap107 added 3 commits December 12, 2024 17:21

Updating code to benchmark

8ea92d8

Merge branch 'string-encoder-bench' of github.com:rcap107/skrub into …

c999abf

…string-encoder-bench

updating code

8411a83

rcap107 added 4 commits December 13, 2024 11:05

Updating script

190ce2a

a

a43488e

Removing some files used for prototyping

cdfaf1a

Added new parameters, fixed docstring, added error checking

c0c066f

rcap107 commented Dec 13, 2024

View reviewed changes

Removing an unnecessary file

887e047

jeromedockes reviewed Dec 13, 2024

View reviewed changes

rcap107 and others added 3 commits December 13, 2024 14:33

Update examples/02_text_with_string_encoders.py

af3b087

Co-authored-by: Jérôme Dockès <[email protected]>

Simplified error checking

2bb353d

Merge branch 'tfidf-pca' of https://github.com/rcap107/skrub into tfi…

bfb8c55

…df-pca

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding the StringEncoder transformer #1159

Adding the StringEncoder transformer #1159

rcap107 commented Nov 26, 2024 •

edited

Loading

rcap107 commented Dec 5, 2024

GaelVaroquaux commented Dec 9, 2024

rcap107 commented Dec 9, 2024

GaelVaroquaux commented Dec 9, 2024

Vincent-Maladiere commented Dec 9, 2024

rcap107 commented Dec 9, 2024 •

edited

Loading

Vincent-Maladiere commented Dec 9, 2024

jeromedockes left a comment

jeromedockes Dec 11, 2024

jeromedockes Dec 11, 2024

rcap107 commented Dec 11, 2024

Vincent-Maladiere commented Dec 11, 2024

GaelVaroquaux commented Dec 11, 2024

rcap107 commented Dec 12, 2024 •

edited

Loading

jeromedockes commented Dec 12, 2024

GaelVaroquaux commented Dec 12, 2024 via email

rcap107 commented Dec 12, 2024

GaelVaroquaux commented Dec 12, 2024

Vincent-Maladiere commented Dec 13, 2024

rcap107 commented Dec 13, 2024

jeromedockes commented Dec 13, 2024 via email

rcap107 Dec 13, 2024

jeromedockes Dec 13, 2024

jeromedockes Dec 13, 2024

rcap107 commented Dec 13, 2024

jeromedockes commented Dec 13, 2024

jeromedockes Dec 13, 2024

rcap107 Dec 13, 2024

jeromedockes Dec 13, 2024

jeromedockes Dec 13, 2024

rcap107 commented Dec 13, 2024

		df_module.assert_frame_equal(check_df, result)


		def test_hashing(encode_column, df_module):

Adding the StringEncoder transformer #1159

Are you sure you want to change the base?

Adding the StringEncoder transformer #1159

Conversation

rcap107 commented Nov 26, 2024 • edited Loading

rcap107 commented Dec 5, 2024

GaelVaroquaux commented Dec 9, 2024

rcap107 commented Dec 9, 2024

GaelVaroquaux commented Dec 9, 2024

Vincent-Maladiere commented Dec 9, 2024

rcap107 commented Dec 9, 2024 • edited Loading

Vincent-Maladiere commented Dec 9, 2024

jeromedockes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rcap107 commented Dec 11, 2024

Vincent-Maladiere commented Dec 11, 2024

GaelVaroquaux commented Dec 11, 2024

rcap107 commented Dec 12, 2024 • edited Loading

jeromedockes commented Dec 12, 2024

GaelVaroquaux commented Dec 12, 2024 via email

rcap107 commented Dec 12, 2024

GaelVaroquaux commented Dec 12, 2024

Vincent-Maladiere commented Dec 13, 2024

rcap107 commented Dec 13, 2024

jeromedockes commented Dec 13, 2024 via email

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rcap107 commented Dec 13, 2024

jeromedockes commented Dec 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rcap107 commented Dec 13, 2024

rcap107 commented Nov 26, 2024 •

edited

Loading

rcap107 commented Dec 9, 2024 •

edited

Loading

rcap107 commented Dec 12, 2024 •

edited

Loading