Allow duplicate objects in Pipeline and ColumnTransformer #638

Open
PGijsbers opened this issue Mar 6, 2019 · 5 comments
@PGijsbers
Collaborator

Currently, neither Pipeline nor ColumnTransformer may contain two different steps with the same type of transformer when converted with sklearn_to_flow. I think this should be allowed.

Consider a scenario where I have a dataset with numeric and categorical values (e.g. feature 0 and 1, respectively), and wish to impute each with a different imputation strategy. I would use the following code (with openml on head of develop):

import openml
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
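# Assume a dataset with feature 0 being numeric, and feature 1 being nominal.
# Note: if the nominal feature is stored as strings, strategy='median' fails at fit time
# and 'most_frequent' would be needed; serialization fails before fitting either way.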
pipeline = Pipeline(
    [('preprocessing', ColumnTransformer(
        [('impute_numeric', SimpleImputer(strategy='mean'), [0]),
         ('impute_categorical', SimpleImputer(strategy='median'), [1])])),
     ('classifier', DecisionTreeClassifier())])
openml.flows.sklearn_to_flow(pipeline)

I would assume this should work, but it raises the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
    rval = _serialize_model(o)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 404, in _serialize_model
    _extract_information_from_model(model)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 512, in _extract_information_from_model
    rval = sklearn_to_flow(v, model)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
    rval = _serialize_model(o)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 408, in _serialize_model
    _check_multiple_occurence_of_component_in_flow(model, subcomponents)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 490, in _check_multiple_occurence_of_component_in_flow
    'trying to serialize %s.' % (visitee.name, model))
ValueError: Found a second occurence of component sklearn.impute.SimpleImputer when trying to serialize ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('impute_numeric', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0), [0]), ('impute_categorical', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0), [1])]).

Similarly, an error is raised if a Pipeline contains two steps of the same type.
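
To make the failure mode concrete, here is a minimal sketch of a duplicate-type check, assuming it amounts to counting class names across all nested steps. The helper `find_duplicate_component_types` and the stand-in classes are hypothetical; this is not the openml-python implementation of `_check_multiple_occurence_of_component_in_flow`, only an illustration of why two steps with different names can still collide on their type.

```python
from collections import Counter


# Stand-in classes so the sketch runs without scikit-learn installed.
class SimpleImputer:
    def __init__(self, strategy="mean"):
        self.strategy = strategy


class DecisionTreeClassifier:
    pass


class ColumnTransformer:
    def __init__(self, transformers):
        self.transformers = transformers  # list of (name, component, columns)


class Pipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, component)


def iter_components(component):
    """Yield the component and, recursively, every nested component."""
    yield component
    children = (
        getattr(component, "steps", None)
        or getattr(component, "transformers", None)
        or []
    )
    for entry in children:
        # entry[0] is the step name, entry[1] the component itself.
        yield from iter_components(entry[1])


def find_duplicate_component_types(root):
    """Return the sorted class names that occur more than once under root."""
    counts = Counter(type(c).__qualname__ for c in iter_components(root))
    return sorted(name for name, n in counts.items() if n > 1)


pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("impute_numeric", SimpleImputer(strategy="mean"), [0]),
        ("impute_categorical", SimpleImputer(strategy="median"), [1]),
    ])),
    ("classifier", DecisionTreeClassifier()),
])

duplicates = find_duplicate_component_types(pipeline)
print(duplicates)
```

The step names differ, but a check keyed on the component type alone reports the imputer twice, which matches the situation described in the error message above.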

What is the reason this error is raised? Is it simply not yet supported? Or should I be ordering my workflow differently, and if so, how?

@janvanrijn
Member

This is a problem with the OpenML Flow definition, which dates from the early days of OpenML (2012). There is currently no uniform way to specify which instance of a subflow a hyperparameter setting in a run belongs to. As a result, a complex flow that contains multiple instantiations of the same subflow cannot be reproduced reliably, which defeats the purpose of reproducible research.

Improving this on the server side has been on the agenda; however, no one has started programming or testing alternatives.
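
The ambiguity can be illustrated with a small sketch, assuming (purely for illustration) that a run stores hyperparameter settings keyed by component name and parameter name. This is not the actual OpenML schema; the tuples and key formats below are hypothetical. For contrast, the second mapping uses scikit-learn's `step__param` naming convention, which keys on the step name instead.

```python
# Two instances of the same component, configured differently.
steps = [
    ("impute_numeric", "sklearn.impute.SimpleImputer", {"strategy": "mean"}),
    ("impute_categorical", "sklearn.impute.SimpleImputer", {"strategy": "median"}),
]

# Keyed by component name: the second instance overwrites the first,
# so the run no longer records which imputer used which strategy.
by_component = {}
for _step_name, component, params in steps:
    for param, value in params.items():
        by_component[(component, param)] = value

# Keyed by step name: each instance keeps its own setting.
by_step = {}
for step_name, _component, params in steps:
    for param, value in params.items():
        by_step[f"{step_name}__{param}"] = value

print(by_component)
print(by_step)
```

With the component-keyed mapping only one `strategy` value survives, so the original configuration cannot be reconstructed from the run.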

@PGijsbers
Collaborator Author

Thanks, that clarifies a lot. Does it make sense to leave this issue open even though it will stay unresolved for now? Or should I close it, since 'we' on the package side cannot fix this until the definitions are updated?

@mfeurer
Collaborator

mfeurer commented Mar 7, 2019

I think closing and referencing the corresponding issue on the OpenML issue tracker is the way to go here: openml/OpenML#340

@mfeurer
Collaborator

mfeurer commented Oct 15, 2019

Reopening to show that this is a known issue.

mfeurer reopened this Oct 15, 2019

@PGijsbers
Collaborator Author

Marked it as wontfix because we won't (can't) fix this until we rework the flow definition.


3 participants