Add the new feature of customized initial population. #162

peiyanpan · 2024-12-03T11:23:00Z

What does this PR do?

Adds a feature that lets users specify a customized initial pipeline population for TPOT2.

Where should the reviewer start?

tpot2/tests/test_customized_iniPop.py
Shows how to build a SequentialPipeline from scalers, selectors, transformers_layer, inner_estimators_layer, and estimators, and includes an example of passing it to TPOTClassifier through the customized_initial_population parameter.
tpot2/config/get_configspace.py
Adds a new set_node() function, which mainly contains operations for adding new nodes to a pipeline.
tpot2/evolvers/base_evolver.py
Adds checks that compare the number of customized initial individuals with the population size and determine how many individuals still need to be produced by the individual generator.
tpot2/tpot_estimator/estimator.py
Passes the customized_initial_population parameter through.

How should this PR be tested?

The test code is at tpot2/tests/test_customized_iniPop.py:

pytest test_customized_iniPop.py

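# Not decorated with @pytest.fixture: pytest does not collect fixtures as tests.
def test_customized_iniPop():
    import tpot2
    import sklearn
    import sklearn.datasets
    import sklearn.metrics          # needed for get_scorer
    import sklearn.model_selection  # needed for train_test_split

    scorer = sklearn.metrics.get_scorer('roc_auc_ovo')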

    X, y = sklearn.datasets.load_iris(return_X_y=True)

    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25)

    from tpot2.config.get_configspace import set_node
    from tpot2.search_spaces.pipelines.union import UnionPipeline
    from tpot2.search_spaces.pipelines.choice import ChoicePipeline
    from tpot2.search_spaces.pipelines.sequential import SequentialPipeline
    from tpot2.config.get_configspace import get_search_space

    scalers = set_node("MinMaxScaler", {})
    selectors = set_node("SelectFwe", {'alpha': 0.0002381268562})
    transformers_layer = UnionPipeline([
        ChoicePipeline([
            set_node("SkipTransformer", {}),
        ]),
        get_search_space("Passthrough"),
    ])

    inner_estimators_layer = UnionPipeline([
        get_search_space("Passthrough"),
    ])
    estimators = set_node("HistGradientBoostingClassifier", 
                        {'early_stop': 'valid', 
                        'l2_regularization': 0.0011074158219, 
                        'learning_rate': 0.0050792320068, 
                        'max_depth': None, 
                        'max_features': 0.3430178535213, 
                        'max_leaf_nodes': 237, 
                        'min_samples_leaf': 63, 
                        'tol': 0.0001, 
                        'n_iter_no_change': 14, 
                        'validation_fraction': 0.2343285974496})

    pipeline = SequentialPipeline(search_spaces=[
                                        scalers,
                                        selectors, 
                                        transformers_layer,
                                        inner_estimators_layer,
                                        estimators,
                                        ])
    ind = pipeline.generate()

    est = tpot2.TPOTClassifier(search_space="linear", n_jobs=40, verbose=5, generations=1, population_size=5, customized_initial_population=[ind])

    est.fit(X_train, y_train)

    print(str(est.fitted_pipeline_))

    print(scorer(est, X_test, y_test))

Any background context you want to provide?

In this version, users can specify a well-defined initial pipeline population, currently limited to the SequentialPipeline type. This update has the potential to improve algorithm performance and reduce evolutionary time.

Some tips:

These SequentialPipeline pipelines can be obtained by following the examples in customized_initial_population.py and adapting them to TPOT2's config_dict.

The number of customized initial pipelines relates to population_size as follows:

init_population_size = len(customized_initial_population)
if self.cur_population_size <= init_population_size:
    initial_population = customized_initial_population[:self.cur_population_size]
else:
    initial_population = [next(self.individual_generator) for _ in range(self.cur_population_size - init_population_size)]
    initial_population = customized_initial_population + initial_population
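
The sizing rule above can be exercised in isolation. The sketch below is a standalone restatement, not the code in base_evolver.py: the function name `build_initial_population` and its arguments are hypothetical, and a plain Python generator stands in for `self.individual_generator`.

```python
from itertools import count


def build_initial_population(customized_initial_population, population_size, individual_generator):
    """Combine user-supplied individuals with generated ones (hypothetical helper).

    Mirrors the rule in the PR description: truncate when too many custom
    individuals are given, otherwise top up from the generator.
    """
    n_custom = len(customized_initial_population)
    if population_size <= n_custom:
        # Enough custom individuals: keep only the first population_size.
        return list(customized_initial_population[:population_size])
    # Too few: generate the remainder and append it after the custom ones.
    n_missing = population_size - n_custom
    generated = [next(individual_generator) for _ in range(n_missing)]
    return list(customized_initial_population) + generated


# Stand-in generator producing labelled placeholders instead of pipelines.
generator = (f"random_{i}" for i in count())

truncated = build_initial_population(["a", "b", "c"], 2, generator)
topped_up = build_initial_population(["a"], 3, generator)
print(truncated)
print(topped_up)
```

Custom individuals always come first, so they keep their positions regardless of how many generated individuals follow.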

The current version only supports the case where the search space is linear and the initial pipelines are of type SequentialPipeline. If you think our approach is appropriate, we will extend it in the near future to the case where the search space is graph and the pipelines are of type GraphPipeline.

What are the relevant issues?

issue-61

Main Contributors

@peiyanpan @t-harden

perib · 2024-12-03T15:14:25Z

I do like the idea of being able to specify an initial population. Thanks for your interest and contribution to the project!

Some notes:

1. You modified the default PULL_REQUEST_TEMPLATE.md with your info. That file is intended to be copied and pasted into PRs (it should populate by default on GitHub). This change needs to be reverted to the original template.
2. Bug - The custom initial population is never actually used. Line 447 of base_evolver overwrites the custom population. You can print the initial population with the following command and see that the custom individual is not there.

for ind in est.evaluated_individuals.iterrows():
    print(ind[1]['Instance'])

This could potentially be turned into a test. I'm pretty sure that the order of the custom initial population will match the order in the pandas df.

3. initial_population = customized_initial_population[:self.cur_population_size] - I think if users pass in a list larger than the population size, they probably did that intentionally and TPOT2 should just use the larger list as is. To me this is more intuitive. I recommend replacing with initial_population = customized_initial_population

4. I'm wondering if the "set_node" function is necessary. It seems functionally equivalent to just calling "EstimatorNode." Also note that passing in a dictionary fixes the hyperparameters permanently, so they are never tuned. A ConfigurationSpace needs to be passed in for the hyperparameters to be learned. With set_node, you currently cannot specify an initial/default hyperparameter AND have it be learned/tuned later. Instead, I think that adding a parameter in EstimatorNode (and wrapper pipeline, etc.) like 'default_hyperparameters' might be a better approach. AND/OR maybe add it as a parameter for get_node, but this gets weird/complicated with the wrapper pipelines...
5. This also needs to be implemented in the steady state evolver/estimator.
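
Notes 2 and 3 could be sketched roughly as follows. This is an illustration only and has not been run against TPOT2: `use_custom_population_as_is` and `custom_population_leads` are hypothetical helper names, comparing individuals by `str()` is an assumption, and the check relies on the unverified guess that the custom individuals appear first and in order in `est.evaluated_individuals`.

```python
def use_custom_population_as_is(customized_initial_population, population_size, individual_generator):
    """Note 3: never truncate the user's list; only top up when it is short."""
    population = list(customized_initial_population)
    n_missing = max(0, population_size - len(population))
    population += [next(individual_generator) for _ in range(n_missing)]
    return population


def custom_population_leads(rows, customized_initial_population, key=str):
    """Note 2: check that the custom individuals are the first evaluated rows.

    `rows` is any iterable of (index, row) pairs where row["Instance"] holds
    the individual, e.g. est.evaluated_individuals.iterrows().
    `key` turns an individual into something comparable (assumed: str).
    """
    evaluated = [key(row["Instance"]) for _, row in rows]
    expected = [key(ind) for ind in customized_initial_population]
    return evaluated[:len(expected)] == expected


# Plain dicts stand in for DataFrame rows so the sketch runs without TPOT2.
fake_rows = [
    (0, {"Instance": "custom_pipeline"}),
    (1, {"Instance": "random_pipeline"}),
]
print(custom_population_leads(fake_rows, ["custom_pipeline"]))
print(use_custom_population_as_is(["a", "b", "c"], 2, iter([])))
```

In a real test, `rows` would be `est.evaluated_individuals.iterrows()` after `est.fit(...)`, and the assertion would be that `custom_population_leads(...)` returns True for the list passed as customized_initial_population.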

peiyanpan and others added 4 commits December 3, 2024 00:24

Add the new feature of customized initial population

2f22441

Add the new feature of customized initial population

4878b06

some changes

1a74df3

Add some detail

c983018

peiyanpan added 2 commits December 3, 2024 23:43

fix some bugs

0e3f4ba

enhance set_node

9ca2116
