Adding the StringEncoder transformer #1159

Open
wants to merge 36 commits into base: main
Changes from 22 commits
Commits
ec37e13
Fixing changelog with correct account
rcap107 Nov 21, 2024
b3dae47
Merge remote-tracking branch 'upstream/main'
rcap107 Nov 25, 2024
99e5450
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 26, 2024
4f7e46e
Initial commit
rcap107 Nov 26, 2024
583250b
Update
rcap107 Nov 27, 2024
4a39f36
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 27, 2024
ee2f739
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Nov 29, 2024
30ad689
Merge branch 'main' into tfidf-pca
rcap107 Nov 29, 2024
d7f1cd7
Merge remote-tracking branch 'upstream/main' into tfidf-pca
rcap107 Dec 5, 2024
8686d7f
Updated object and added test
rcap107 Dec 5, 2024
eb4de97
quick update to changelog
rcap107 Dec 5, 2024
96423ba
Fixed test
rcap107 Dec 5, 2024
e01637c
Merge branch 'main' of github.com:skrub-data/skrub
rcap107 Dec 7, 2024
3a1f6eb
Replacing PCA with TruncatedSVD
rcap107 Dec 9, 2024
398f9db
Updated init
rcap107 Dec 9, 2024
3a45f19
Updated example to add StringEncoder
rcap107 Dec 9, 2024
38a9f2d
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 9, 2024
51856b3
Updating changelog.
rcap107 Dec 9, 2024
58a3559
📝 Updating docstrings
rcap107 Dec 9, 2024
8e4fce2
📝 Fixing example
rcap107 Dec 9, 2024
afdb361
✅ Fixing tests and renaming test file
rcap107 Dec 9, 2024
6c6d884
✅ Fixing coverage
rcap107 Dec 9, 2024
9366d90
🐛 Fixing the name of a variable
rcap107 Dec 9, 2024
6b474c6
Merge branch 'main' of github.com:skrub-data/skrub into tfidf-pca
rcap107 Dec 11, 2024
e8f308e
Addressing comments in review
rcap107 Dec 11, 2024
8ea92d8
Updating code to benchmark
rcap107 Dec 12, 2024
c999abf
Merge branch 'string-encoder-bench' of github.com:rcap107/skrub into …
rcap107 Dec 12, 2024
8411a83
updating code
rcap107 Dec 12, 2024
190ce2a
Updating script
rcap107 Dec 13, 2024
a43488e
a
rcap107 Dec 13, 2024
cdfaf1a
Removing some files used for prototyping
rcap107 Dec 13, 2024
c0c066f
Added new parameters, fixed docstring, added error checking
rcap107 Dec 13, 2024
887e047
Removing an unnecessary file
rcap107 Dec 13, 2024
af3b087
Update examples/02_text_with_string_encoders.py
rcap107 Dec 13, 2024
2bb353d
Simplified error checking
rcap107 Dec 13, 2024
bfb8c55
Merge branch 'tfidf-pca' of https://github.com/rcap107/skrub into tfi…
rcap107 Dec 13, 2024
4 changes: 3 additions & 1 deletion CHANGES.rst
@@ -14,7 +14,9 @@ It is currently undergoing fast development and backward compatibility is not en

New features
------------

* The :class:`StringEncoder` encodes strings using tf-idf and truncated SVD
  decomposition and provides a cheaper alternative to :class:`TextEncoder`.
  :pr:`1159` by :user:`Riccardo Cappuzzo<rcap107>`.

Changes
-------
33 changes: 33 additions & 0 deletions example_string_encoder.py
@@ -0,0 +1,33 @@
# %% test string encoder
import polars as pl
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

from skrub._string_encoder import StringEncoder

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
column = pl.Series(name="this_column", values=corpus)

# %%

pipe = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("pca", PCA(n_components=2)),
    ]
)
# %%
a = pipe.fit_transform(corpus)

# %%
se = StringEncoder(2)

# %%
r = se.fit_transform(column)
# %%
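Note on this prototype: the tf-idf step produces a SciPy sparse matrix, which is presumably why a later commit in this PR (3a1f6eb, "Replacing PCA with TruncatedSVD") swaps the reducer. The sketch below uses only scikit-learn and the same toy corpus to show the sparse intermediate and the reduced output. The `random_state=0` argument is an assumption added here for reproducibility and is not part of the PR.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

# tf-idf returns a sparse matrix with one row per document
tfidf = TfidfVectorizer().fit_transform(corpus)
is_sparse = sparse.issparse(tfidf)

# TruncatedSVD accepts the sparse matrix directly, without densifying it
# (random_state is an assumption of this sketch, not a value from the PR)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print(is_sparse, tfidf.shape, reduced.shape)
```

The reduced array has one row per input string and `n_components` columns, which is the shape `StringEncoder` then wraps in a dataframe.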
31 changes: 28 additions & 3 deletions examples/02_text_with_string_encoders.py
@@ -17,6 +17,9 @@
.. |TextEncoder| replace::
    :class:`~skrub.TextEncoder`

.. |StringEncoder| replace::
    :class:`~skrub.StringEncoder`

.. |TableReport| replace::
    :class:`~skrub.TableReport`

@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
# We set ``n_components`` to 30; however, to achieve the best performance, we would
# need to find the optimal value for this hyperparameter using either |GridSearchCV|
# or |RandomizedSearchCV|. We skip this part to keep the computation time for this
# example small.
# small example.
#
# Recall that the ROC AUC is a metric that quantifies the ranking power of estimators,
# where a random estimator scores 0.5, and an oracle —providing perfect predictions—
@@ -221,6 +224,25 @@ def plot_box_results(named_results):

plot_box_results(results)

# %%
# |TextEncoder| embeddings are very strong, but they are also quite expensive to
# train. A simpler, faster alternative for encoding strings is the |StringEncoder|,
# which works by first performing a tf-idf vectorization of the text, and then
Review comment (Member): maybe tf-idf (computing vectors of rescaled word counts + wikipedia link)

# following it with TruncatedSVD to reduce the number of dimensions to, in this
# case, 30.
from skrub import StringEncoder

string_encoder = StringEncoder(n_components=30)

string_encoder_pipe = clone(gap_pipe).set_params(
    **{"tablevectorizer__high_cardinality": string_encoder}
)
string_encoder_results = cross_validate(string_encoder_pipe, X, y, scoring="roc_auc")
results.append(("StringEncoder", string_encoder_results))

plot_box_results(results)


# %%
# The performance of the |TextEncoder| is significantly stronger than that of
# the syntactic encoders, which is expected. But how long does it take to load
@@ -232,7 +254,7 @@ def plot_box_results(named_results):

def plot_performance_tradeoff(results):
    fig, ax = plt.subplots(figsize=(5, 4), dpi=200)
    markers = ["s", "o", "^"]
    markers = ["s", "o", "^", "x"]
    for idx, (name, result) in enumerate(results):
        ax.scatter(
            result["fit_time"],
@@ -293,8 +315,11 @@ def plot_performance_tradeoff(results):
# During the subsequent cross-validation iterations, the model is simply copied,
# which reduces computation time for the remaining folds.
#
# Interestingly, |StringEncoder| has a performance remarkably similar to that of
# |GapEncoder|, while being significantly faster.
# Conclusion
# ----------
# In conclusion, |TextEncoder| provides powerful vectorization for text, but at
# the cost of longer computation times and the need for additional dependencies,
# such as torch.
# such as torch. |StringEncoder| represents a simpler alternative that can provide
# good performance at a fraction of the cost of more complex methods.
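Note on the trade-off plot: it reads `result["fit_time"]` because `cross_validate` records the fitting time of every fold next to the score. The sketch below illustrates that on a scikit-learn-only stand-in for the encoder (tf-idf followed by truncated SVD). The toy texts, labels, `random_state`, and the choice of `LogisticRegression` are assumptions made for illustration and do not come from the example in this PR.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 10 texts per class, enough for the default 5-fold CV
texts = ["great product works well", "terrible product broke fast"] * 10
labels = [1, 0] * 10

pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)

# cross_validate returns a dict holding one array per recorded quantity
cv_results = cross_validate(pipe, texts, labels, scoring="roc_auc")

for key in ("fit_time", "score_time", "test_score"):
    print(key, cv_results[key].shape)
```

Plotting `fit_time` against `test_score` for each encoder gives the kind of cost versus quality comparison the conclusion refers to.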
2 changes: 2 additions & 0 deletions skrub/__init__.py
@@ -17,6 +17,7 @@
from ._reporting import TableReport, patch_display, unpatch_display
from ._select_cols import DropCols, SelectCols
from ._similarity_encoder import SimilarityEncoder
from ._string_encoder import StringEncoder
from ._table_vectorizer import TableVectorizer
from ._tabular_learner import tabular_learner
from ._text_encoder import TextEncoder
@@ -53,5 +54,6 @@
    "SelectCols",
    "DropCols",
    "TextEncoder",
    "StringEncoder",
    "column_associations",
]
132 changes: 132 additions & 0 deletions skrub/_string_encoder.py
@@ -0,0 +1,132 @@
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted

from . import _dataframe as sbd
from ._on_each_column import SingleColumnTransformer


class StringEncoder(SingleColumnTransformer):
    """Generate a lightweight string encoding of a given column using tf-idf \
    vectorization and truncated SVD.

    First, apply a tf-idf vectorization of the text, then reduce the dimensionality
    with a truncated SVD decomposition with the given number of components.

    New features will be named `{col_name}_{component}` if the series has a name,
    and `tsvd_{component}` if it does not.

    Parameters
    ----------
    n_components : int, default=30
        Number of components to be used for the truncated SVD decomposition.

    See Also
    --------
    MinHashEncoder :
        Encode string columns as a numeric array with the minhash method.
    GapEncoder :
        Encode string columns by constructing latent topics.
    SimilarityEncoder :
        Encode string columns as a numeric array with n-gram string similarity.
    TextEncoder :
        Encode string columns using pre-trained language models.

    Examples
    --------
    >>> import pandas as pd
    >>> from skrub import StringEncoder

    We will encode the comments using 2 components:

    >>> enc = StringEncoder(n_components=2)
    >>> X = pd.Series([
    ...   "The professor snatched a good interview out of the jaws of these questions.",
    ...   "Bookmarking this to watch later.",
    ...   "When you don't know the lyrics of the song except the chorus",
    ... ], name='video comments')

    >>> enc.fit_transform(X) # doctest: +SKIP
       video comments_0  video comments_1
    0      8.218069e-01      4.557474e-17
    1      6.971618e-16      1.000000e+00
    2      8.218069e-01     -3.046564e-16
    """

    def __init__(self, n_components=30):
        self.n_components = n_components

    def _transform(self, X):
        result = self.pipe.transform(sbd.to_numpy(X))

        result = sbd.make_dataframe_like(X, dict(zip(self.all_outputs_, result.T)))
        result = sbd.copy_index(X, result)

        return result

    def get_feature_names_out(self):
        """Get output feature names for transformation.

        Returns
        -------
        feature_names_out : list of str objects
            Transformed feature names.
        """
        return list(self.all_outputs_)

    def fit_transform(self, X, y=None):
        """Fit the encoder and transform a column.

        Parameters
        ----------
        X : Pandas or Polars series
            The column to transform.
        y : None
            Unused. Here for compatibility with scikit-learn.

        Returns
        -------
        X_out: Pandas or Polars dataframe with shape (len(X), n_components)
            The embedding representation of the input.
        """
        del y
        self.pipe = Pipeline(
            [
                ("tfidf", TfidfVectorizer()),
Review comment (Member): I think @GaelVaroquaux suggested using a HashingVectorizer instead of TfidfVectorizer (I don't think this would require changes elsewhere in your code)

                ("tsvd", TruncatedSVD(n_components=self.n_components)),
            ]
        )

        name = sbd.name(X)
        if not name:
            name = "tsvd"
        self.all_outputs_ = [f"{name}_{idx}" for idx in range(self.n_components)]

        self.pipe.fit(sbd.to_numpy(X))

        self._is_fitted = True

        return self.transform(X)

    def transform(self, X):
        """Transform a column.

        Parameters
        ----------
        X : Pandas or Polars series
            The column to transform.

        Returns
        -------
        X_out: Pandas or Polars dataframe with shape (len(X), n_components)
            The embedding representation of the input.
        """
        check_is_fitted(self)
        return self._transform(X)

    def __sklearn_is_fitted__(self):
        """
        Check fitted status and return a Boolean value.
        """
        return hasattr(self, "_is_fitted") and self._is_fitted
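Note on the review suggestion about `HashingVectorizer`: one possible shape for that variant, sketched with scikit-learn only, is to hash tokens into a fixed number of columns, re-weight them with `TfidfTransformer`, and then apply the same truncated SVD. This is not what the diff implements. The values `n_features=2**10`, `norm=None`, `alternate_sign=False`, and `random_state=0` are assumptions chosen to keep the example small and the hashed values interpretable as plain counts.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

pipe = Pipeline(
    [
        # Stateless hashing of tokens into a fixed number of columns;
        # norm=None and alternate_sign=False keep raw non-negative counts
        (
            "hashing",
            HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False),
        ),
        # Re-weight the hashed counts by inverse document frequency
        ("tfidf", TfidfTransformer()),
        # Same reduction step as in the diff (random_state is an assumption)
        ("tsvd", TruncatedSVD(n_components=2, random_state=0)),
    ]
)

embeddings = pipe.fit_transform(corpus)
print(embeddings.shape)
```

Because the hashing step keeps no vocabulary, the fitted state is limited to the idf weights and the SVD components, at the price of possible hash collisions between tokens.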
68 changes: 68 additions & 0 deletions skrub/tests/test_string_encoder.py
@@ -0,0 +1,68 @@
import pytest
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

from skrub import _dataframe as sbd
from skrub._string_encoder import StringEncoder


@pytest.fixture
def encode_column(df_module):
    corpus = [
        "this is the first document",
        "this document is the second document",
        "and this is the third one",
        "is this the first document",
    ]

    return df_module.make_column("col1", corpus)


def test_encoding(encode_column, df_module):
    pipe = Pipeline(
        [
            ("tfidf", TfidfVectorizer()),
            ("tsvd", TruncatedSVD(n_components=2)),
        ]
    )
    check = pipe.fit_transform(sbd.to_numpy(encode_column))

    names = [f"col1_{idx}" for idx in range(2)]

    check_df = df_module.make_dataframe(dict(zip(names, check.T)))

    se = StringEncoder(2)
    result = se.fit_transform(encode_column)

    # Converting dtypes to avoid nullable shenanigans
    check_df = sbd.pandas_convert_dtypes(check_df)
    result = sbd.pandas_convert_dtypes(result)

    df_module.assert_frame_equal(check_df, result)


def test_get_feature_names_out(encode_column, df_module):
    """Test that ``get_feature_names_out`` returns the correct feature names."""
    encoder = StringEncoder(n_components=4)

    encoder.fit(encode_column)
    expected_columns = ["col1_0", "col1_1", "col1_2", "col1_3"]
    assert encoder.get_feature_names_out() == expected_columns

    # Checking that a series with an empty name generates the proper column names
    X = df_module.make_column(
        None,
        [
            "this is the first document",
            "this document is the second document",
            "and this is the third one",
            "is this the first document",
        ],
    )

    encoder = StringEncoder(n_components=4)

    encoder.fit(X)
    expected_columns = ["tsvd_0", "tsvd_1", "tsvd_2", "tsvd_3"]
    assert encoder.get_feature_names_out() == expected_columns
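Note on `test_encoding`: it compares two independently fitted pipelines, and `TruncatedSVD` uses a randomized solver by default, so on larger inputs such a comparison could in principle depend on the seed. The sketch below shows one way to pin the reference side with scikit-learn alone. `make_reference_pipe` and its `seed` argument are hypothetical names introduced here, and `StringEncoder` as written in this diff does not expose a `random_state` parameter, so this illustrates the idea rather than a drop-in change to the test.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


def make_reference_pipe(n_components, seed):
    """Hypothetical helper: tf-idf followed by a seeded truncated SVD."""
    return Pipeline(
        [
            ("tfidf", TfidfVectorizer()),
            ("tsvd", TruncatedSVD(n_components=n_components, random_state=seed)),
        ]
    )


corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

# Two independent fits with the same seed follow the same random draws
first = make_reference_pipe(2, seed=0).fit_transform(corpus)
second = make_reference_pipe(2, seed=0).fit_transform(corpus)

same = np.allclose(first, second)
print(same)
```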