
Adding the StringEncoder transformer #1159

Open · wants to merge 36 commits into main

Conversation

@rcap107 (Contributor) commented Nov 26, 2024

This is a first draft of a PR to address #1121

I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.

Things left to do:

  • Testing
  • Parameter checking?
  • Default value for the PCA?
  • Docstrings
  • Deciding name of the features

@rcap107 (Contributor, Author) commented Dec 5, 2024

Tests fail on the minimum-requirements build because I am using PCA rather than TruncatedSVD for the decomposition, and PCA raises issues with potentially sparse matrices.

@jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a check on the scikit-learn version.

Also, I am using tf-idf as the vectorizer; should I use something else? Maybe HashingVectorizer?

(writing this down so I don't forget)
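The approach under discussion (a tf-idf vectorization followed by TruncatedSVD) can be sketched roughly as follows; the toy strings and parameter values here are placeholders for illustration, not the PR's actual defaults:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy strings standing in for a high-cardinality string column.
strings = ["Lille", "Lyon", "Lyons", "Paris", "Parisian", "Marseille"]

# TruncatedSVD accepts the sparse matrix produced by TfidfVectorizer,
# which is why it avoids the sparse-input issues PCA has under older
# scikit-learn versions.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
    ("svd", TruncatedSVD(n_components=2)),
])
embeddings = pipe.fit_transform(strings)
print(embeddings.shape)  # one 2-dimensional vector per input string
```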

@GaelVaroquaux (Member) commented

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyper-parameters).

@rcap107 (Contributor, Author) commented Dec 9, 2024

> I'm very happy to see this progressing.
>
> Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyper-parameters).

Where can I find the benchmarks?

@GaelVaroquaux (Member) commented

Actually, let's keep it simple and use the CARTE datasets; they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark

You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that has either the highest cardinality or the most "diverse" entries, in the sense of https://arxiv.org/abs/2312.09634).

@Vincent-Maladiere (Member) commented

Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder, and GapEncoder? It shows a small benchmark on the toxicity dataset.

@rcap107 (Contributor, Author) commented Dec 9, 2024

> Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder, and GapEncoder? It shows a small benchmark on the toxicity dataset.

It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.

[image: benchmark on the toxicity dataset]

@Vincent-Maladiere (Member) commented

That's very interesting!

@jeromedockes (Member) left a comment

Nice! 🎉 Looking really good @rcap107. It will be awesome to have this, as the cost of the GapEncoder is the main pain point of the TableVectorizer ATM 🚀

I left a couple of minor comments; the main one is a possible performance improvement: if we use the vectorizer's fit_transform, we avoid tokenizing the input twice during fit.

The code is very close to ready and we have an example, so once that's done what are the next steps? Maybe some lightweight comparisons on a couple of datasets, then merge, then more extensive experiments to decide if it can become the TableVectorizer's default?

(Resolved review threads on CHANGES.rst, example_string_encoder.py, and examples/02_text_with_string_encoders.py.)
# %%
# |TextEncoder| embeddings are very strong, but they are also quite expensive to
# train. A simpler, faster alternative for encoding strings is the |StringEncoder|,
# which works by first performing a tf-idf vectorization of the text, and then
A reviewer (Member) commented:

Maybe explain tf-idf as "computing vectors of rescaled word counts", plus a Wikipedia link.

del y
self.pipe = Pipeline(
[
("tfidf", TfidfVectorizer()),
A reviewer (Member) commented:

I think @GaelVaroquaux suggested using a HashingVectorizer instead of the TfidfVectorizer (I don't think this would require changes elsewhere in your code).

(Resolved review threads on skrub/_string_encoder.py.)
@rcap107 (Contributor, Author) commented Dec 11, 2024

> Nice! 🎉 Looking really good @rcap107. It will be awesome to have this, as the cost of the GapEncoder is the main pain point of the TableVectorizer ATM 🚀
>
> I left a couple of minor comments; the main one is a possible performance improvement: if we use the vectorizer's fit_transform, we avoid tokenizing the input twice during fit.

I implemented a change; I'm not sure if it's what you meant.

> The code is very close to ready and we have an example, so once that's done what are the next steps? Maybe some lightweight comparisons on a couple of datasets, then merge, then more extensive experiments to decide if it can become the TableVectorizer's default?

Sounds good.

@Vincent-Maladiere (Member) commented

> The code is very close to ready and we have an example, so once that's done what are the next steps? Maybe some lightweight comparisons on a couple of datasets, then merge, then more extensive experiments to decide if it can become the TableVectorizer's default?

I like this plan as well!

@GaelVaroquaux (Member) commented

Before we merge, I would love a bit of experimenting, with a fairly systematic eye on the choice of parameters.

For instance: HashingVectorizer vs TfidfVectorizer? If we go for the HashingVectorizer, should it be followed by a TfidfTransformer or not? For the vectorizer, what should be the default values for the analyzer, the n_features, and the ngram_range? Based on my experience, I suspect that we want `analyzer="char_wb"` and `ngram_range=(3, 4)`.

These choices are actually very important, and we should drive them using systematic experimentation. I know that it is a lot of work, but it makes a huge difference.
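The two vectorizer options being compared can be sketched side by side; the toy documents and the hashing dimensionality are placeholders, not proposed defaults:

```python
from sklearn.feature_extraction.text import (
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

docs = ["Paris", "Parisian", "London"]

# Option A: tf-idf directly (learns a vocabulary during fit).
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))

# Option B: stateless hashing, optionally rescaled by idf afterwards.
hashing_idf = make_pipeline(
    HashingVectorizer(analyzer="char_wb", ngram_range=(3, 4), n_features=2**10),
    TfidfTransformer(),
)

a = tfidf.fit_transform(docs)        # (3, learned vocabulary size)
b = hashing_idf.fit_transform(docs)  # (3, 1024): fixed hashing width
print(a.shape, b.shape)
```

The trade-off the thread discusses: hashing has a fixed output width and no vocabulary to store, while tf-idf's width depends on the data it was fit on.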

@rcap107 (Contributor, Author) commented Dec 12, 2024

So I went and ran a few experiments and I have some preliminary results. I reused some of the code from the main example, so the plots are not very polished. I am working on a separate branch for the moment.

This was done on the toxicity dataset.

I modified the StringEncoder to have:

  • vectorizer, to choose between tfidf and hashing (mapped to TfidfVectorizer and HashingVectorizer). Default is tfidf.
  • ngram_range, which can be any pair of integers >= (1, 1). Default is (1, 1).
  • tf_idf_followup, a boolean flag. If active, the HashingVectorizer is followed by a TfidfTransformer.
  • analyzer, which can be word, char, or char_wb. Default is word.

I also added:

  • max_features, which is a parameter of the TfidfVectorizer.
  • n_features, which is a parameter of the HashingVectorizer.

Then I tested these parameters:

configurations = {
    "ngram_range": [(1, 1), (3, 4)],
    "analyzer": ["word", "char", "char_wb"],
    "vectorizer": ["hashing", "tfidf"],
    "n_components": [30],
    "tf_idf_followup": [True, False],
}
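A grid like this can be expanded into individual runs with scikit-learn's ParameterGrid (assuming, as a sketch, that each resulting dict is passed straight to the encoder's constructor):

```python
from sklearn.model_selection import ParameterGrid

configurations = {
    "ngram_range": [(1, 1), (3, 4)],
    "analyzer": ["word", "char", "char_wb"],
    "vectorizer": ["hashing", "tfidf"],
    "n_components": [30],
    "tf_idf_followup": [True, False],
}

# Cartesian product of all parameter values: 2 * 3 * 2 * 1 * 2 = 24 runs.
grid = list(ParameterGrid(configurations))
print(len(grid))  # 24
```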

Only tfidf

Only HashingVectorizer, no tfidf followup
[image: benchmark results]

HashingVectorizer with tfidf followup
[image: benchmark results]

Summary:

  • tfidf is faster.
  • hashing + tfidf has the same prediction performance as tfidf, but it's much slower.
  • hashing alone is worse than tfidf and is much slower.
  • (1, 1) is always much better than (3, 4) for ngram_range.

@jeromedockes (Member) commented

Thanks for those experiments! I'm surprised the (1, 1) ngram with "char" or "char_wb" can perform so well -- isn't it just counting individual characters?

@GaelVaroquaux (Member) commented Dec 12, 2024 via email

@rcap107 (Contributor, Author) commented Dec 12, 2024

I ran a few more experiments, because the results looked suspicious, and I think what I have now makes much more sense.

Pareto plots (time is on a log scale):
[image: Pareto plots]

Boxplots, test score on the y-axis:
[image: boxplots]

Boxplots, time on the y-axis:
[image: boxplots]

I don't know why the previous results were so off; I might have been reusing results across runs 🤔

What is consistent is that tf-idf is much faster and has similar, if not better, performance than the HashingVectorizer.

@GaelVaroquaux (Member) commented

OK, those results look great. Please do keep the associated data, as one day we may publish a paper on skrub 😁

If I can make an editorial decision: let's use char_wb with ngram_range=(3, 4). It matches what we do in the GapEncoder.

@Vincent-Maladiere (Member) commented

Thanks for running these benchmarks @rcap107!! Glad to see we have a clear winner for the TableVectorizer, as @GaelVaroquaux said.

@rcap107 (Contributor, Author) commented Dec 13, 2024

Great, then I'll clean up the script I am using and put it somewhere, so I can find it if we decide to run more in-depth experiments (testing more tables, changing the number of components, and so on).

As for the StringEncoder, do I keep the HashingVectorizer at all? Or do I keep only the arguments relevant to tf-idf?

@jeromedockes (Member) commented Dec 13, 2024 via email

df_module.assert_frame_equal(check_df, result)


def test_hashing(encode_column, df_module):
@rcap107 (Contributor, Author) commented:

I think this test is failing because the hashing vectorizer is not deterministic. If that's the case, I'm not sure how to handle it in the test other than exposing the RNG everywhere.

I didn't bother because we haven't decided whether to keep the hashing vectorizer in the first place.
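For what it's worth, the HashingVectorizer itself is stateless, so a quick sanity check like the one below may help localize the flakiness; the guess that the randomness sits in the SVD step rather than the hashing is an assumption, not something established in the thread:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["one", "two", "three", "four"]

# HashingVectorizer is stateless: the same input always hashes to the
# same sparse matrix, with no fitted vocabulary and no RNG involved.
a = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 4)).transform(docs)
b = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 4)).transform(docs)
print((a != b).nnz)  # 0: the two sparse matrices are identical

# If the non-determinism instead comes from the decomposition step,
# pinning random_state on TruncatedSVD would make the test reproducible.
svd = TruncatedSVD(n_components=2, random_state=0)
```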

A reviewer (Member) commented:

maybe in a first version we can keep just the tfidf

A reviewer (Member) commented:

Otherwise, in the test you can just check the type and shape of the output and the content of self.pipe.

@rcap107 (Contributor, Author) commented Dec 13, 2024

For the record, this is the same example with the new defaults:
[image: updated benchmark]
[image: updated benchmark]

Pretty good result, I think.

@jeromedockes (Member) commented

Nice!! Do you mind trying the employee_salaries dataset too, as it has columns that are more like categories than free text?

(Resolved review thread on examples/02_text_with_string_encoders.py.)
Comment on lines 115 to 132
# ERROR CHECKING
if self.analyzer not in ["char_wb", "char", "word"]:
raise ValueError(f"Unknown analyzer {self.analyzer}")

if not all(isinstance(x, int) and x > 0 for x in self.ngram_range):
raise ValueError(
"Values in `ngram_range` must be positive integers, "
f"found {self.ngram_range} instead."
)
if not len(self.ngram_range) == 2:
raise ValueError(
f"`ngram_range` must have length 2, found {len(self.ngram_range)}."
)

        # Note: the condition needs parentheses; without them,
        # `not isinstance(...) and ... > 0` never triggers for
        # non-positive integers.
        if not (isinstance(self.n_components, int) and self.n_components > 0):
            raise ValueError(
                f"`n_components` must be a positive integer, found {self.n_components}"
            )
A reviewer (Member) commented:

Don't the TfidfVectorizer and TruncatedSVD do similar validation already?

@rcap107 (Contributor, Author) replied:

I left in only the first check; the other constraints are covered there, but I think they're less strict.


@rcap107 (Contributor, Author) commented Dec 13, 2024

> Nice!! Do you mind trying the employee_salaries dataset too, as it has columns that are more like categories than free text?

I used R2 as the metric, and did not test the HashingVectorizer. It seems that when columns are more like categories than text, the GapEncoder does a better job than all the alternatives.

[images: results on employee_salaries]

Still, the StringEncoder is much faster and its performance is pretty good. The TextEncoder, on the other hand, is not doing quite as well.

Bottom line: for more "categorical-looking" attributes, use the GapEncoder for better performance, or the StringEncoder for decent performance and a faster fit time; for text, use the TextEncoder for better performance, or the StringEncoder for decent performance and a much faster fit time.

This is from a pretty quick-and-dirty set of experiments; we could also try something like the heuristics from Leo's paper to choose, but that would get much more complicated.

By the way, where should I put all the code I used for these results? So far I have it in my scratch folder.

4 participants