Allow duplicate objects in Pipeline and ColumnTransformer #638

PGijsbers · 2019-03-06T16:56:06Z

Currently neither Pipeline nor ColumnTransformer may contain two different steps with the same type of transformer. I think this should be allowed.

Consider a scenario where I have a dataset with numeric and categorical values (e.g. features 0 and 1, respectively) and wish to impute each with a different strategy. I would use the following code (with openml at the head of the develop branch):

import openml
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# Assume a dataset with feature 0 being numeric, and feature 1 being nominal
pipeline = Pipeline(
    [('preprocessing', ColumnTransformer(
        [('impute_numeric', SimpleImputer(strategy='mean'), [0]),
         ('impute_categorical', SimpleImputer(strategy='median'), [1])])),
     ('classifier', DecisionTreeClassifier())])
openml.flows.sklearn_to_flow(pipeline)

I would assume this should work, but it raises the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
    rval = _serialize_model(o)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 404, in _serialize_model
    _extract_information_from_model(model)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 512, in _extract_information_from_model
    rval = sklearn_to_flow(v, model)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
    rval = [sklearn_to_flow(element, parent_model) for element in o]
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
    rval = _serialize_model(o)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 408, in _serialize_model
    _check_multiple_occurence_of_component_in_flow(model, subcomponents)
  File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 490, in _check_multiple_occurence_of_component_in_flow
    'trying to serialize %s.' % (visitee.name, model))
ValueError: Found a second occurence of component sklearn.impute.SimpleImputer when trying to serialize ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('impute_numeric', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0), [0]), ('impute_categorical', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0), [1])]).

Similarly, an error is raised if a Pipeline contains two steps of the same type.
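To see in advance which components would trigger the error, a rough pre-check along the following lines might help. This is only a sketch: `component_types` and `duplicated_types` are hypothetical helpers, not part of openml or scikit-learn, and they approximate the converter's check by comparing class names rather than reproducing its internal logic. The stand-in classes exist only so the example runs without scikit-learn installed; the helpers rely on nothing more than the `steps` attribute of Pipeline and the `transformers` attribute of ColumnTransformer.

```python
from collections import Counter


def component_types(estimator):
    """Yield the qualified class name of an estimator and of every nested step."""
    cls = type(estimator)
    yield f"{cls.__module__}.{cls.__qualname__}"
    # Pipeline keeps (name, estimator) pairs in `steps`; ColumnTransformer keeps
    # (name, transformer, columns) triples in `transformers`.
    children = list(getattr(estimator, "steps", [])) + list(
        getattr(estimator, "transformers", [])
    )
    for entry in children:
        child = entry[1]
        if isinstance(child, str):  # 'drop' / 'passthrough' placeholders
            continue
        yield from component_types(child)


def duplicated_types(estimator):
    """Return the sorted class names that occur more than once."""
    counts = Counter(component_types(estimator))
    return sorted(name for name, n in counts.items() if n > 1)


# Stand-in classes (hypothetical) so the sketch runs without scikit-learn.
class SimpleImputer:
    def __init__(self, strategy):
        self.strategy = strategy


class ColumnTransformer:
    def __init__(self, transformers):
        self.transformers = transformers


preprocessing = ColumnTransformer([
    ("impute_numeric", SimpleImputer("mean"), [0]),
    ("impute_categorical", SimpleImputer("median"), [1]),
])
duplicates = duplicated_types(preprocessing)
print(duplicates)  # one entry, the SimpleImputer class name
```

With real scikit-learn objects the same call, `duplicated_types(pipeline)`, should list `sklearn.impute.SimpleImputer` for the pipeline above, assuming the class is still reported under that module path.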

What is the reason this error is raised? Is it simply not yet supported? Or should I be ordering my workflow differently, and if so, how?


janvanrijn · 2019-03-06T17:05:44Z

This is a limitation of the OpenML flow definition, which dates from the early days of OpenML (2012). There is currently no uniform way to specify which instance of a subflow a hyperparameter setting in a run belongs to, so a complex flow containing multiple instances of the same subflow does not allow for reproducible research.

It has been on the agenda to improve this on the server side; however, no one has started programming or testing alternatives.
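The ambiguity described above can be illustrated with a deliberately simplified model. The sketch below assumes that a run's hyperparameter settings are addressed by component class and parameter name; that keying scheme and the `record_settings` helper are assumptions made for illustration, not OpenML's actual schema or API.

```python
def record_settings(components):
    """Map (class name, parameter) -> value and collect conflicting entries."""
    settings = {}
    collisions = []
    for class_name, params in components:
        for param, value in params.items():
            key = (class_name, param)
            if key in settings and settings[key] != value:
                # A second instance of the same component overwrites the
                # first: the stored setting no longer identifies an instance.
                collisions.append((key, settings[key], value))
            settings[key] = value
    return settings, collisions


# The two imputers from the ColumnTransformer in the original report.
components = [
    ("sklearn.impute.SimpleImputer", {"strategy": "mean"}),
    ("sklearn.impute.SimpleImputer", {"strategy": "median"}),
]
settings, collisions = record_settings(components)
print(settings)
print(collisions)
```

Under this model only one `strategy` value survives, so the run could not be reconstructed from its recorded settings. A key that also carried the step name (for example `impute_numeric`) would remove the collision, but the thread does not say which approach a reworked flow definition would take.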

PGijsbers · 2019-03-06T18:30:44Z

Thanks, that clarifies a lot. Does it make sense to leave this issue open, given that it will go unresolved? Or should I close it, since 'we' on the package side cannot fix this until the definitions are updated?

mfeurer · 2019-03-07T08:14:44Z

I think closing and referencing the corresponding issue on the OpenML issue tracker is the way to go here: openml/OpenML#340

mfeurer · 2019-10-15T12:22:22Z

Reopening to show that this is a known issue.

PGijsbers · 2019-10-17T07:58:53Z

Marked it as wontfix because we won't (can't) fix this until we rework the flow definition.

mfeurer closed this as completed Mar 7, 2019

mfeurer mentioned this issue Oct 15, 2019

Allow several instances of same flow within one flow #826

Closed

mfeurer reopened this Oct 15, 2019

PGijsbers added the wontfix label Oct 17, 2019
