Adding the StringEncoder transformer #1159
base: main
Conversation
Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, and that raises issues with potentially sparse matrices. @jeromedockes suggests using TruncatedSVD directly to begin with, rather than adding a check on the version. Also, I am using tf-idf as the vectorizer, should I use something else? Maybe HashingVectorizer? (writing this down so I don't forget)
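For reference, the approach under discussion boils down to a small scikit-learn pipeline; the sketch below only illustrates the idea (the toy strings and n_components are made up), it is not the PR's code:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# tf-idf produces a sparse matrix, which TruncatedSVD handles natively;
# PCA generally needs a dense input, hence the failures on older versions.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2)),
])
embeddings = pipe.fit_transform(["open str", "opened string", "other text"])
print(embeddings.shape)  # (3, 2)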
I'm very happy to see this progressing. Can you benchmark it on the experiments from Leo's paper: this is important for modeling choices (eg the hyper-parameters)
Where can I find the benchmarks?
Actually, let's keep it simple, and use the CARTE datasets, they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that is either highest cardinality, or most "diverse entry" in the sense of https://arxiv.org/abs/2312.09634).
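A rough sketch of that kind of benchmark pipeline (the column name is a placeholder, routing it through a ColumnTransformer is only one possible wiring, and it assumes the StringEncoder from this PR encodes a single column like the other skrub string encoders):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer
from skrub._string_encoder import StringEncoder  # class added in this PR

# Send the highest-cardinality / most "diverse" string column through the
# StringEncoder and let the TableVectorizer handle the remaining columns.
encoder = ColumnTransformer(
    [("diverse_text", StringEncoder(n_components=30), "some_diverse_column")],  # placeholder name
    remainder=TableVectorizer(),
)
model = make_pipeline(encoder, HistGradientBoostingRegressor())  # or the classifier, depending on the task
# model can then be cross-validated on each CARTE dataset.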
Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a tiny benchmark on the toxicity dataset.
It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder
That's very interesting!
nice! 🎉 looking really good @rcap107 . it will be awesome to have this as the cost of the GapEncoder is the main pain point of the tablevectorizer ATM 🚀
I left a couple of minor comments, the main one is just a possible performance improvement if we use the vectorizer's fit_transform
to avoid tokenizing the input twice during fit
the code is very close to ready and we have an example, so once that's done what are the next steps? maybe some lightweight comparisons on a couple of datasets, then merge, then more extensive experiments to decide if it can become the tablevectorizer's default?
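For clarity, the fit_transform suggestion amounts to something like the sketch below (the method bodies and the helper name are illustrative, not the PR's actual code):

def fit_transform(self, X, y=None):
    del y
    # fit_transform on the inner pipeline vectorizes the input once,
    # instead of fit() followed by a second pass in transform().
    result = self.pipe.fit_transform(X)
    return self._post_process(X, result)  # hypothetical helper for naming/wrapping

def transform(self, X):
    # transform still has to vectorize new data, but fit no longer pays
    # that cost twice on the training data.
    result = self.pipe.transform(X)
    return self._post_process(X, result)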
# %%
# |TextEncoder| embeddings are very strong, but they are also quite expensive to
# train. A simpler, faster alternative for encoding strings is the |StringEncoder|,
# which works by first performing a tf-idf vectorization of the text, and then
maybe tf-idf (computing vectors of rescaled word counts + wikipedia link)
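Purely as an illustration of the "rescaled word counts" wording (toy documents, not from the example):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

counts = CountVectorizer().fit_transform(docs)  # raw word counts
tfidf = TfidfVectorizer().fit_transform(docs)   # same counts, rescaled so that words
                                                # appearing in every document ("the")
                                                # get a lower weight
print(counts.toarray())
print(tfidf.toarray().round(2))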
skrub/_string_encoder.py
Outdated
del y
self.pipe = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
I think @GaelVaroquaux suggested using a HashingVectorizer instead of TfidfVectorizer (I don't think this would require changes elsewhere in your code)
I implemented a change, I'm not sure if it's what you meant..
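For context, a minimal sketch of what such a swap could look like (whether to chain a TfidfTransformer after the hashing step is exactly the open question raised below):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# HashingVectorizer is stateless and keeps memory bounded (a fixed number of
# hash buckets); an optional TfidfTransformer can re-introduce idf re-weighting.
pipe = Pipeline([
    ("hashing", HashingVectorizer(n_features=2**20)),
    ("tfidf", TfidfTransformer()),  # optional, see the discussion below
    ("svd", TruncatedSVD(n_components=30)),
])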
Sounds good
I like this plan as well!
Before we merge, I would love a bit of experimenting, with a fairly systematic eye on the choice of parameters. For instance: HashingVectorizer vs TfidfVectorizer? If we go for HashingVectorizer, should it be followed by a TfidfTransformer or not? For the vectorizer, what should be the default values for the analyzer, the n_features and the ngram_range? I suspect that we want analyzer="char_wb" and ngram_range=(3, 4) based on my experience. These choices are actually very important, and we should drive them using systematic experimentation. I know that it is a lot of work, but it makes a huge difference.
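A sketch of the kind of systematic sweep this calls for (dataset loading and scoring are left out; the grid simply mirrors the questions above and is not an agreed-upon list):

from itertools import product

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Candidate string-embedding pipelines to cross-validate on each benchmark table.
analyzers = ["word", "char", "char_wb"]
ngram_ranges = [(1, 1), (2, 4), (3, 4)]
vectorizers = {
    "tfidf": lambda a, ng: TfidfVectorizer(analyzer=a, ngram_range=ng),
    "hashing": lambda a, ng: HashingVectorizer(analyzer=a, ngram_range=ng),
}

candidates = {
    (name, analyzer, ngram_range): make_pipeline(
        make_vec(analyzer, ngram_range), TruncatedSVD(n_components=30)
    )
    for (name, make_vec), analyzer, ngram_range in product(
        vectorizers.items(), analyzers, ngram_ranges
    )
}
# Each candidate then replaces the text column's encoder inside the full
# TableVectorizer + HistGradientBoosting pipeline and gets cross-validated.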
thanks for those experiments! I'm surprised the (1, 1) ngram with "char" or "char_wb" can perform so well -- isn't it just counting individual characters?
> I'm surprised the (1, 1) ngram with "char" or "char_wb" can perform so well -- isn't it just counting individual characters?

char_wb makes sense: it's counting words and characters
OK, those results look great. Please do keep the associated data, as one day we may publish a paper on skrub 😁 If I can make an editorial decision: let's use char_wb, ngram_range=(3, 4). It matches what we do in the GapEncoder.
Thanks for running these benchmarks @rcap107 !! Glad to see we have a clear winner for the TableVectorizer as @GaelVaroquaux said.
Great, then I'll clean up the script I am using and put it somewhere, so I can find it if we decide to run more in-depth experiments (e.g. testing more tables, or changing the number of components). As for the StringEncoder, do I keep the HashingVectorizer at all? Or do I keep only the arguments relative to tf-idf?
> I'm surprised the (1, 1) ngram with "char" or "char_wb" can perform so well -- isn't it just counting individual characters?

> char_wb makes sense: it's counting words and characters

IIUC char_wb prevents char ngrams from crossing word boundaries but they're still only character ngrams, no?
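To make the distinction concrete (a small illustration, not part of the PR):

from sklearn.feature_extraction.text import CountVectorizer

text = ["skrub rocks"]
for analyzer in ("char", "char_wb"):
    vec = CountVectorizer(analyzer=analyzer, ngram_range=(3, 3))
    vec.fit(text)
    print(analyzer, sorted(vec.get_feature_names_out()))
# "char" ngrams can straddle the space (e.g. "b r"), while "char_wb" pads each
# word with spaces and never crosses the word boundary -- but both are still
# character ngrams, not word counts.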
df_module.assert_frame_equal(check_df, result)


def test_hashing(encode_column, df_module):
I think this test is failing because the hashing vectorizer is not deterministic; if that's the case, I'm not sure how to handle it in the test other than exposing the RNG everywhere.
I didn't bother because we haven't decided whether to keep the hashing vectorizer in the first place.
maybe in a first version we can keep just the tfidf
otherwise in the test you can just check the type and shape of the output and the content of self.pipe
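Something along these lines, for instance (a sketch only; the fixture name comes from the surrounding test file and the "tfidf" step name from the current diff, the rest is indicative and assumes StringEncoder is imported in the test module):

def test_output_shape_and_pipe(encode_column):
    # Check the output's shape and the fitted pipeline's steps rather than
    # exact embedding values, which depend on the decomposition.
    encoder = StringEncoder(n_components=2)
    result = encoder.fit_transform(encode_column)

    assert result.shape == (len(encode_column), 2)
    assert len(encoder.pipe.steps) == 2
    assert encoder.pipe.steps[0][0] == "tfidf"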
nice!! do you mind trying the employee_salaries dataset too, as it has columns that are more like categories rather than text?
skrub/_string_encoder.py
Outdated
# ERROR CHECKING
if self.analyzer not in ["char_wb", "char", "word"]:
    raise ValueError(f"Unknown analyzer {self.analyzer}")

if not all(isinstance(x, int) and x > 0 for x in self.ngram_range):
    raise ValueError(
        "Values in `ngram_range` must be positive integers, "
        f"found {self.ngram_range} instead."
    )
if not len(self.ngram_range) == 2:
    raise ValueError(
        f"`ngram_range` must have length 2, found {len(self.ngram_range)}."
    )

if not (isinstance(self.n_components, int) and self.n_components > 0):
    raise ValueError(
        f"`n_components` must be a positive integer, found {self.n_components}"
    )
don't the tfidf vectorizer and truncated svd do a similar validation already?
I left in only the first check; the other constraints are already validated in there, but I think those checks are less strict
Co-authored-by: Jérôme Dockès <[email protected]>
This is a first draft of a PR to address #1121
I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.
Things left to do: