Add the new feature of customized initial population. #162

Open
wants to merge 6 commits into
base: main

peiyanpan

What does this PR do?

Adds a feature that lets users specify a customized initial pipeline population for TPOT2.

Where should the reviewer start?

  • tpot2/tests/test_customized_iniPop.py
    Builds a SequentialPipeline from scalers, selectors, transformers_layer, inner_estimators_layer, and estimators, and shows how to pass the resulting individual to TPOTClassifier through the customized_initial_population parameter.
  • tpot2/config/get_configspace.py
    Adds a new set_node() function, which mainly handles adding new nodes to a pipeline.
  • tpot2/evolvers/base_evolver.py
    Adds checks that compare the number of customized initial individuals with the population size and determine how many individuals still need to be produced by the individual generator.
  • tpot2/tpot_estimator/estimator.py
    Adds passing of the customized_initial_population parameter.

How should this PR be tested?

The test code is at tpot2/tests/test_customized_iniPop.py:

pytest test_customized_iniPop.py

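def test_customized_iniPop():
    import tpot2
    # Import the submodules explicitly; `import sklearn` alone does not guarantee they are loaded.
    import sklearn
    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection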

    scorer = sklearn.metrics.get_scorer('roc_auc_ovo')

    X, y = sklearn.datasets.load_iris(return_X_y=True)

    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)

    from tpot2.config.get_configspace import set_node
    from tpot2.search_spaces.pipelines.union import UnionPipeline
    from tpot2.search_spaces.pipelines.choice import ChoicePipeline
    from tpot2.search_spaces.pipelines.sequential import SequentialPipeline
    from tpot2.config.get_configspace import get_search_space

    scalers = set_node("MinMaxScaler", {})
    selectors = set_node("SelectFwe", {'alpha': 0.0002381268562})
    transformers_layer = UnionPipeline([
        ChoicePipeline([
            set_node("SkipTransformer", {}),
        ]),
        get_search_space("Passthrough"),
    ])

    inner_estimators_layer = UnionPipeline([
        get_search_space("Passthrough"),
    ])

    estimators = set_node("HistGradientBoostingClassifier", {
        'early_stop': 'valid',
        'l2_regularization': 0.0011074158219,
        'learning_rate': 0.0050792320068,
        'max_depth': None,
        'max_features': 0.3430178535213,
        'max_leaf_nodes': 237,
        'min_samples_leaf': 63,
        'tol': 0.0001,
        'n_iter_no_change': 14,
        'validation_fraction': 0.2343285974496,
    })

    pipeline = SequentialPipeline(search_spaces=[
        scalers,
        selectors,
        transformers_layer,
        inner_estimators_layer,
        estimators,
    ])
    ind = pipeline.generate()

    est = tpot2.TPOTClassifier(
        search_space="linear",
        n_jobs=40,
        verbose=5,
        generations=1,
        population_size=5,
        customized_initial_population=[ind],
    )

    est.fit(X_train, y_train)

    print(str(est.fitted_pipeline_))

    print(scorer(est, X_test, y_test))

Any background context you want to provide?

In this version, users can specify a well-defined initial pipeline population, currently limited to pipelines of type SequentialPipeline. Seeding the search this way may improve algorithm performance and shorten evolution time.

Several Tips:

  1. SequentialPipeline pipelines can be built by following the examples in customized_initial_population.py and adapting them to TPOT2's config_dict.

  2. The number of customized initial pipelines relates to population_size as follows. If there are at least as many customized pipelines as the population size, only the first population_size of them are used; otherwise the remaining slots are filled by the individual generator:

    init_population_size = len(customized_initial_population)
    if self.cur_population_size <= init_population_size:
        initial_population = customized_initial_population[:self.cur_population_size]
    else:
        initial_population = [next(self.individual_generator) for _ in range(self.cur_population_size - init_population_size)]
        initial_population = customized_initial_population + initial_population

  3. The current version only applies when search_space is linear and the initial pipelines are of type SequentialPipeline. If you think our approach is appropriate, we will extend it in the near future to the case where search_space is graph and the pipeline is of type GraphPipeline.
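
The sizing rule in tip 2 can be sketched outside the evolver as a small standalone function. This is only an illustration under stated assumptions: `build_initial_population` and its arguments are hypothetical names, and plain strings stand in for TPOT2 individuals.

```python
from itertools import count
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def build_initial_population(
    custom: List[T], population_size: int, generator: Iterator[T]
) -> List[T]:
    """Combine user-supplied individuals with generated ones (hypothetical helper)."""
    if population_size <= len(custom):
        # Enough custom individuals: keep only the first `population_size`.
        return custom[:population_size]
    # Otherwise keep every custom individual and top up from the generator.
    missing = population_size - len(custom)
    return custom + [next(generator) for _ in range(missing)]


# Stand-in for the evolver's individual generator (an assumption for this demo).
gen = (f"random_{i}" for i in count())
print(build_initial_population(["a", "b"], 4, gen))       # ['a', 'b', 'random_0', 'random_1']
print(build_initial_population(["a", "b", "c"], 2, gen))  # ['a', 'b']
```

Note that with this rule a custom list longer than population_size is silently truncated.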

What are the relevant issues?

issue-61

Main Contributors

@peiyanpan @t-harden

Collaborator

perib commented Dec 3, 2024

I do like the idea of being able to specify an initial population. Thanks for your interest and contribution to the project!

Some notes:

  1. You modified the default PULL_REQUEST_TEMPLATE.md with your info. That file is intended to be copy-pasted into PRs (it should populate by default on GitHub). This change needs to be reverted to the original template.

  2. Bug - The custom initial population is never actually used. Line 447 of base_evolver overwrites the custom population. You can see the initial population with the following command and see that the custom individual is not there.

for ind in est.evaluated_individuals.iterrows():
    print(ind[1]['Instance'])

This could potentially be turned into a test. I'm pretty sure that the order of the custom initial population will match the order in the pandas df.

  3. initial_population = customized_initial_population[:self.cur_population_size] - I think if users pass in a list larger than the population size, they probably did that intentionally, and TPOT2 should just use the larger list as is. To me this is more intuitive. I recommend replacing it with initial_population = customized_initial_population

  4. I'm wondering whether the "set_node" function is necessary. It seems functionally equivalent to just calling "EstimatorNode". Also note that passing in a dictionary fixes the hyperparameters permanently, so they are never tuned. A ConfigurationSpace needs to be passed in for the hyperparameters to be learned. With set_node, you currently cannot specify an initial/default hyperparameter AND have it be learned/tuned later. Instead, I think that adding a parameter to EstimatorNode (and the wrapper pipelines, etc.) like 'default_hyperparameters' might be a better approach, and/or adding it as a parameter for get_node, though this gets weird/complicated with the wrapper pipelines...

  5. This also needs to be implemented in the steady-state evolver/estimator.
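
The check described in the bug note above (confirming that the custom individuals actually show up among the evaluated individuals) could be turned into a test along these lines. This is a rough sketch, not the project's test: `leading_instances` and `custom_population_was_used` are hypothetical helpers, the `'Instance'` column name is taken from the snippet in that note, and it relies on the unconfirmed expectation that the custom individuals come first and in the same order. Plain dicts stand in for DataFrame rows here.

```python
def leading_instances(rows, n):
    """Return the 'Instance' value of the first n (index, row) pairs."""
    found = []
    for _, row in rows:
        if len(found) >= n:
            break
        found.append(row["Instance"])
    return found


def custom_population_was_used(rows, custom, key=lambda x: x):
    """True if the first len(custom) evaluated individuals match `custom` in order.

    `key` maps an individual to something comparable; identity by default (an assumption).
    """
    expected = [key(ind) for ind in custom]
    actual = [key(ind) for ind in leading_instances(rows, len(custom))]
    return actual == expected


# Stand-in for est.evaluated_individuals.iterrows(): (index, row) pairs.
rows = list(enumerate([{"Instance": "custom_0"}, {"Instance": "random_0"}]))
print(custom_population_was_used(rows, ["custom_0"]))  # True
print(custom_population_was_used(rows, ["random_0"]))  # False
```

In a real test the rows would presumably come from `est.evaluated_individuals.iterrows()`, and `key` may need to map each individual to a stable identifier if the individuals do not compare equal directly.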


3 participants