
Allow for Flexible Preprocessing #897

Open
wants to merge 4 commits into base: development

Conversation

@SourdoughCat (Contributor) commented on Jul 27, 2019


What does this PR do?

Too many things - so many, in fact, that I'm pretty sure there will be discussion about whether this is the right way to proceed. I'm putting it up here before I invest more time in it. This PR addresses several items (possibly too many), including:

#507
#836
Handling of categorical data #771 - I know it's not directly related, but it is the closest issue, and in my mind this would allow extending TPOT to use different encodings, e.g. via https://github.com/scikit-learn-contrib/categorical-encoding

All of these relate to how TPOT does preprocessing. Among other things, this PR re-introduces the "RandomTree" option in templates, which allows specifying templates of the form Transformer-RandomTree; that in turn allows things like My_Preprocessing-RandomTree.

The high-level approach is to inject additional preprocessing steps when _fit_init is called, which then alters the behaviour of TPOT.
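To make the injection idea concrete, here is a minimal sketch (illustrative only: the helper name `inject_preprocessing` is made up, and the config key simply mirrors the `tpot.builtins.PreprocessTransformer` entry shown later in this thread):

```python
# Hedged sketch (not TPOT's actual code): how a preprocessing operator
# could be injected into the search configuration before fitting.
def inject_preprocessing(config_dict, preprocess_config_dict):
    """Return a copy of config_dict with a PreprocessTransformer entry
    built from the user's column metadata."""
    merged = dict(config_dict)  # avoid mutating the caller's dict
    column_transform_dict = {}
    for key, value in preprocess_config_dict.items():
        # normalise scalar values to single-element lists, since every
        # config_dict value is a list of candidate parameter values
        column_transform_dict[key] = value if isinstance(value, list) else [value]
    merged['tpot.builtins.PreprocessTransformer'] = column_transform_dict
    return merged

base_config = {'sklearn.decomposition.PCA': {'svd_solver': ['randomized']}}
cfg = inject_preprocessing(base_config, {'numeric_columns': ['num2', 'num3']})
```

Copying rather than mutating the user's config_dict keeps the injection reversible, which is one way to address the "this injection could be dangerous" concern raised below.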

Where should the reviewer start?

tpot/base.py - I will add comments in the files as part of this PR with my thoughts...

Also, I can't get relative imports working, so help with the tpot.drivers.load_scoring_function part would be appreciated.

How should this PR be tested?

Happy to add new tests later, once the design is approved...

Any background context you want to provide?

NIL see above

What are the relevant issues?

#507
#836
Handling of categorical data #771

Screenshots (if appropriate)

Example of the API using a modified Iris dataset

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data.astype(np.float64),
    iris.target.astype(np.float64),
    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=1, population_size=5, verbosity=2, template="PCA-RandomTree")

X_train_df = pd.DataFrame(X_train, columns=["num1", "num2", "num3", "num4"])
X_train_df['text'] = np.random.choice(["hello", "world", "foo", "bar world", "bar hello"], X_train.shape[0])

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2,
                       template="PCA-LogisticRegression",
                       preprocess_config_dict={'numeric_columns': ["num2"]})
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2,
                       template="PCA-LogisticRegression",
                       preprocess_config_dict={
                           'numeric_columns': ["num2", "num3", "num4"]})
tpot2.fit(X_train_df, y_train)

tpot2 = TPOTClassifier(generations=1, population_size=5, verbosity=2,
                       template="PCA-LogisticRegression",
                       preprocess_config_dict={
                           'numeric_columns': ["num2"],
                           'text_columns': ['text']})
tpot2.fit(X_train_df, y_train)
Generation 1 - Current best internal CV score: 0.5908385093167702

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2']), iterated_power=2, svd_solver=randomized), C=20.0, dual=True, penalty=l2)

Generation 1 - Current best internal CV score: 0.9556935817805383

Best pipeline: LogisticRegression(PCA(PreprocessTransformer(input_matrix, numeric_columns=['num2', 'num3', 'num4']), iterated_power=9, svd_solver=randomized), C=20.0, dual=False, penalty=l1)

Generation 1 - Current best internal CV score: 0.5096273291925466

Questions:

  • Do the docs need to be updated? Yes - will do later
  • Does this PR add new (Python) dependencies? No

def load_scoring_function(scoring_func):
@SourdoughCat (author):
help for getting relative imports working would be appreciated here... tpot.drivers.load_scoring_function

A reviewer (Contributor) replied:
Hmm, directly importing via from ..driver import load_scoring_function will cause some conflicts with

from .tpot import TPOTClassifier, TPOTRegressor
from ._version import __version__

A workaround is to move this function to tpot/metrics.py and then add from ..metrics import load_scoring_function to tpot/builtins/preprocessing.py
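For context, a helper like this typically just resolves a dotted `module.func` path to a callable. A stdlib-only sketch (not TPOT's actual implementation) of what `load_scoring_function` might do:

```python
import importlib

def load_scoring_function(scoring_func):
    """Hypothetical stand-in: resolve a dotted 'module.func' string to
    the callable it names; pass anything else through unchanged."""
    if isinstance(scoring_func, str) and '.' in scoring_func:
        module_path, _, func_name = scoring_func.rpartition('.')
        module = importlib.import_module(module_path)
        return getattr(module, func_name)
    return scoring_func

# resolve a stdlib function by its dotted name
fn = load_scoring_function('math.sqrt')
```

Because it depends only on importlib, a helper like this has no import-time coupling to the rest of the package, which is why moving it to a leaf module such as tpot/metrics.py sidesteps the circular-import problem.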

column_transform_dict[k] = config_dict[k]
else:
column_transform_dict[k] = [config_dict[k]]
self._config_dict['tpot.builtins.PreprocessTransformer'] = column_transform_dict
@SourdoughCat (author):

This injection could be dangerous - do we have opinions on how it is supposed to be handled?

A reviewer (Contributor) replied:

I think preprocess_config_dict should be an argument of PreprocessTransformer instead of TPOT, and users should be able to customize it via config_dict.

@SourdoughCat (author):

Yes, this is certainly possible (and technically possible right now, with no changes to TPOT master) via the use of templates. I think the question arises in relation to #507: whether or not it's possible to have a "built-in" configuration with text.

Maybe the answer is that we can't.
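To make the config_dict route concrete, here is a hedged sketch of what a user-side configuration might look like (the parameter names under `PreprocessTransformer` are assumptions based on this PR's examples, not an existing TPOT API):

```python
# Hypothetical user-side configuration: the PreprocessTransformer entry
# carries the column metadata, and the template pins where the
# transformer appears in every evolved pipeline.
custom_config = {
    'tpot.builtins.PreprocessTransformer': {
        'numeric_columns': [['num2', 'num3', 'num4']],
        'text_columns': [['text']],
    },
    'sklearn.linear_model.LogisticRegression': {
        'C': [1.0, 20.0],
        'penalty': ['l1', 'l2'],
    },
}
# Usage would then be something like:
# tpot = TPOTClassifier(config_dict=custom_config,
#                       template='PreprocessTransformer-PCA-LogisticRegression')
```

The downside, as noted below, is that the user has to merge these entries into the full default config for every model they want in the search space.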

)

# override some settings...
if config_dict.get('impute', None) is False:
@SourdoughCat (author):

This line deals with #836; it might be overloading this PR and could be an item for later?

A reviewer (Contributor) replied:

I think it is more related to #889. I think we need to add imputation into config_dict too. We may allow TPOT to skip imputation if the pipeline only has XGBClassifier or XGBRegressor.
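The skip-imputation idea could be sketched like this (illustrative only; the set of NaN-tolerant estimators and the helper name are assumptions, not TPOT internals):

```python
# Hedged sketch: decide whether an imputer must be prepended to a
# pipeline. XGBoost estimators handle missing values natively, so a
# pipeline built only from them would not need imputation.
NAN_TOLERANT = {'XGBClassifier', 'XGBRegressor'}

def needs_imputation(step_class_names):
    """True unless every step in the pipeline handles NaNs natively."""
    return not all(name in NAN_TOLERANT for name in step_class_names)

# a pure-XGB pipeline could skip imputation...
skip_ok = not needs_imputation(['XGBClassifier'])
# ...but a pipeline that also contains PCA still needs it
must_impute = needs_imputation(['PCA', 'XGBClassifier'])
```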

@weixuanfu weixuanfu changed the base branch from master to development July 29, 2019 13:12
@SourdoughCat (author):

OK.

Given the higher-level comments, @weixuanfu, how should we proceed? In my opinion this PR is just too big; what we probably want is:

  • a PR that just adds back "RandomTree" as an option, so that we can do things like "PCA-RandomTree" or similar (we have to ensure that RandomTree is the very last step); doing so enables
  • another PR that proposes how the preprocessing transform should work; there are downsides to this, as it means someone using the preprocessing has to directly alter config options for every model, e.g.

(as per comments above)

X_train, y_train = ...
my_meta_data_info = {<insert meta data information>}
config_dict = {**my_meta_data_info, **TPOT_DEFAULT_CONFIG}
tpot = TPOTClassifier(config_dict=config_dict, template='Preprocess-RandomTree')
tpot.fit(X_train, y_train)

versus

(as per approach taken in this PR)

X_train, y_train = ...
tpot = TPOTClassifier(config_dict=None, preprocess_config_dict={<insert meta data>})
tpot.fit(X_train, y_train)
