Add the new feature of customized initial population. #162

Open
wants to merge 7 commits into
base: main
106 changes: 96 additions & 10 deletions PULL_REQUEST_TEMPLATE.md
@@ -1,30 +1,116 @@
[please review the [Contribution Guidelines](http://epistasislab.github.io/tpot/contributing/) prior to submitting your pull request. go ahead and delete this line if you've already reviewed said guidelines.]

## What does this PR do?


Adds support for a user-specified (customized) initial pipeline population in TPOT2.

## Where should the reviewer start?


- tpot2/tests/test_customized_iniPop.py
Builds a SequentialPipeline from scalers, selectors, transformers_layer, inner_estimators_layer, and estimators, and shows a sample that passes it to TPOTClassifier through the customized_initial_population parameter.
- tpot2/config/get_configspace.py
Adds a new set_node() function, which mainly handles adding new nodes to a pipeline.
- tpot2/evolvers/base_evolver.py
Adds checks that compare the size of the customized initial population with the population size, to decide how many additional individuals still need to be generated.
- tpot2/tpot_estimator/estimator.py
Passes the customized_initial_population parameter through.

## How should this PR be tested?

The test code is at tpot2/tests/test_customized_iniPop.py:

**pytest test_customized_iniPop.py**

```
import sklearn
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

import tpot2
from tpot2.config.get_configspace import get_search_space, set_node
from tpot2.search_spaces.pipelines.choice import ChoicePipeline
from tpot2.search_spaces.pipelines.sequential import SequentialPipeline
from tpot2.search_spaces.pipelines.union import UnionPipeline


# No @pytest.fixture here: a fixture-decorated function is not collected as a test.
def test_customized_iniPop():
    scorer = sklearn.metrics.get_scorer('roc_auc_ovo')

    X, y = sklearn.datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, train_size=0.75, test_size=0.25)

    # Nodes with fixed hyperparameters, one per step of the linear pipeline.
    scalers = set_node("MinMaxScaler", {})
    selectors = set_node("SelectFwe", {'alpha': 0.0002381268562})
    transformers_layer = UnionPipeline([
        ChoicePipeline([
            set_node("SkipTransformer", {}),
        ]),
        get_search_space("Passthrough"),
    ])
    inner_estimators_layer = UnionPipeline([
        get_search_space("Passthrough"),
    ])
    estimators = set_node("HistGradientBoostingClassifier",
                          {'early_stop': 'valid',
                           'l2_regularization': 0.0011074158219,
                           'learning_rate': 0.0050792320068,
                           'max_depth': None,
                           'max_features': 0.3430178535213,
                           'max_leaf_nodes': 237,
                           'min_samples_leaf': 63,
                           'tol': 0.0001,
                           'n_iter_no_change': 14,
                           'validation_fraction': 0.2343285974496})

    pipeline = SequentialPipeline(search_spaces=[
        scalers,
        selectors,
        transformers_layer,
        inner_estimators_layer,
        estimators,
    ])
    # One concrete individual used to seed the initial population.
    ind = pipeline.generate()

    est = tpot2.TPOTClassifier(search_space="linear", n_jobs=40, verbose=5,
                               generations=1, population_size=5,
                               customized_initial_population=[ind])
    est.fit(X_train, y_train)

    print(str(est.fitted_pipeline_))
    print(scorer(est, X_test, y_test))
```

## Any background context you want to provide?

In this version, users can specify a well-defined initial pipeline population, currently limited to the *SequentialPipeline* type. Seeding the search with such pipelines has the potential to improve algorithm performance and reduce evolution time.

Several tips:

1. These SequentialPipeline pipelines can be obtained by referencing the examples in customized_initial_population.py and modifying them according to TPOT2's config_dict.

2. We handle the relationship between the number of customized initial pipelines and population_size as follows:

```
init_population_size = len(customized_initial_population)
if self.cur_population_size <= init_population_size:
    initial_population = customized_initial_population[:self.cur_population_size]
else:
    initial_population = [next(self.individual_generator) for _ in range(self.cur_population_size - init_population_size)]
    initial_population = customized_initial_population + initial_population
```
3. The current version only applies when the search space is linear and the initial pipelines are of type SequentialPipeline. If you think our approach is appropriate, we will extend it in the near future to the case where the search space is graph and the pipelines are of type GraphPipeline.
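
The sizing rule in tip 2 can be sketched as a standalone function. This is a minimal illustration, not the code in base_evolver.py: `build_initial_population` and its arguments are hypothetical names, and a plain iterator of strings stands in for `self.individual_generator`.

```python
from itertools import count
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def build_initial_population(custom: List[T], population_size: int,
                             generator: Iterator[T]) -> List[T]:
    """Combine user-supplied individuals with generated ones (hypothetical helper)."""
    if population_size <= len(custom):
        # Enough custom individuals to fill every slot: keep only the first ones.
        return custom[:population_size]
    # Otherwise keep all custom individuals and fill the remaining slots.
    n_missing = population_size - len(custom)
    return custom + [next(generator) for _ in range(n_missing)]


# Stand-in generator that yields "gen0", "gen1", ...
gen = (f"gen{i}" for i in count())

padded = build_initial_population(["a", "b"], 4, gen)
truncated = build_initial_population(["a", "b", "c"], 2, gen)
print(padded)
print(truncated)
```

Under this rule, customized individuals beyond population_size are dropped without warning, so it may be worth ordering the list by priority.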


## What are the relevant issues?

[issue-61](https://github.com/EpistasisLab/tpot2/issues/61)

## Main Contributors

@peiyanpan @t-harden