[ENH] Enhancing polars support by introducing `set_output` #399

julian-fong · 2024-06-21T18:56:58Z

Introduces the file set_output inside skpro.utils and a new test file test_set_output inside the tests folder. Part of sktime/enhancement-proposals#34 and the notes written in my mentorship programme.

In this pr:

I have introduced basic functions to convert pandas DataFrames with multi-index columns into pandas DataFrames with single-level columns and vice versa; these are stored in the polars adapter file skpro.datatypes._adapter.polars. In that file, convert_polars_to_pandas_with_index now checks whether there are melted multi-index columns (denoted via the "foo__bar" convention), and convert_pandas_to_polars_with_index now checks whether the pandas DataFrame has multi-index columns (as in the outputs of predict_interval and predict_quantiles). If so, these multi-index columns are melted down into single-level columns before converting to a polars DataFrame.
Created skpro.utils.set_output.check_output_config to ensure that the transformations set by users are aligned with the available skpro output data containers.
Created skpro.utils.set_output.transform_output to convert the resulting DataFrame into the user-specified or default data container. transform_output acts as a wrapper around the ordinary convert function, but also decides whether to convert based on the user-specified mtype or to fall back to the original mtype seen in fit.
I have introduced _config inside BaseProbaRegressor and a new method set_output, which mirrors sklearn's set_output for familiarity.
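
For readers unfamiliar with the "foo__bar" convention, here is a minimal sketch of the round trip between multi-index and single-level columns. It is not the code in this PR: the helper names flatten_columns and unflatten_columns and the example column labels are hypothetical, and it assumes every column has the same number of levels and that no level label contains the separator.

```python
import pandas as pd

SEP = "__"  # separator used by the "foo__bar" convention


def flatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose MultiIndex columns are joined into single strings."""
    if not isinstance(df.columns, pd.MultiIndex):
        return df.copy()
    out = df.copy()
    out.columns = [SEP.join(str(level) for level in col) for col in df.columns]
    return out


def unflatten_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Inverse of flatten_columns: split "a__b" names back into tuples."""
    if not any(SEP in str(col) for col in df.columns):
        return df.copy()
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(
        [tuple(str(col).split(SEP)) for col in df.columns]
    )
    return out


# hypothetical interval-style output with three column levels
pred_int = pd.DataFrame(
    [[0.1, 0.9], [0.2, 1.1]],
    columns=pd.MultiIndex.from_tuples(
        [("target", "0.9", "lower"), ("target", "0.9", "upper")]
    ),
)
flat = flatten_columns(pred_int)      # single-level columns, polars-friendly
restored = unflatten_columns(flat)    # MultiIndex columns again
```

Note that the round trip goes through strings, so non-string level labels (for example a float coverage of 0.9) would come back as strings unless they are cast explicitly.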

Relates to #342 #449

fkiraly · 2024-09-08T09:58:38Z

skpro/utils/set_output.py

+}
+
+
+def check_output_config(estimator):


I would make this function and the next one private, to avoid user calls and to allow us to make changes quickly.

fkiraly · 2024-09-08T10:00:31Z

skpro/utils/set_output.py

+
+
+def transform_output(
+    obj, valid, from_type, default_to_type, default_scitype, output_config, store


this function has many parameters - this is a code smell.

I would suggest removing the convert call and using this function only to get the from/to type. That way, we also do not need to replace convert with transform_output in the regressor.

fkiraly · 2024-09-08T10:02:59Z

skpro/regression/base/_base.py

@@ -38,6 +39,10 @@ class BaseProbaRegressor(BaseEstimator):
        "C_inner_mtype": "pd_DataFrame_Table",
    }

+    _config = {
+        "transform": "default"


maybe transform_output?
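
As an illustration of the config field under its suggested name, here is a minimal sketch of a set_output method whose signature follows sklearn's set_output(*, transform=None). ToyRegressor, _valid_outputs, and the accepted values are assumptions for illustration, not the implementation in this PR.

```python
class ToyRegressor:
    """Toy stand-in for an estimator carrying an output config."""

    # class-level default, using the field name suggested in the review
    _config = {"transform_output": "default"}
    # hypothetical set of accepted values
    _valid_outputs = ("default", "pandas", "polars")

    def __init__(self):
        # copy, so that one instance's setting does not mutate the class default
        self._config = dict(type(self)._config)

    def set_output(self, *, transform=None):
        """Set the output container; None leaves the config unchanged."""
        if transform is None:
            return self
        if transform not in self._valid_outputs:
            raise ValueError(
                f"transform must be one of {self._valid_outputs}, got {transform!r}"
            )
        self._config["transform_output"] = transform
        return self


a = ToyRegressor().set_output(transform="polars")
b = ToyRegressor()
```

Returning self keeps the call chainable, as in sklearn; the per-instance copy avoids the shared-mutable-default pitfall of a class-level dict.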

fkiraly

Looks great!

I would request two changes:

make the new utilities private
decouple transform_output from the output, and call it only if the setting is not the default

fkiraly · 2024-09-13T16:10:31Z

let me know in case you need a review

julian-fong · 2024-09-13T16:36:52Z

Another review would be quite helpful. I think I'll try to rewrite the description to make clearer what the goal of the pull request is and what is being added.

I'll also note that the new code that supports mi-columns and si-columns/indexes for polars lives inside adapters.polars and not inside the table conversion utilities. Instead, the polars conversion utilities for table import the conversion methods from adapters.polars and use them to convert between pandas and polars DataFrames.

So if needed, I can separate the polars conversion methods inside adapters.polars into different functions:

One for si column pandas to polars (for the table scitype)
One for si column polars to pandas (for the table scitype)
A new conversion method for mi column pandas to polars (for any new mi abstract data types that we decide upon)
A new conversion method for mi column polars to pandas, i.e. the reverse mapping (for any new mi abstract data types that we decide upon)
All of these methods will still be inside adapters.polars unless you have another recommendation on the placement of these methods. Either way, this should allow the scitype conversion methods to be firmly independent from each other.

I'll also note that although predict_var is only single column, there are still some failures in the CI, because one of the notebook-examples (the one with plot_crossplot_std)

julian-fong · 2024-09-13T16:45:41Z

I remember asking you before why in the base code the output conversion only existed inside the predict function and nothing else. Could you kindly remind me why?

And given that reasoning, do you think it makes sense for the other predict_* functions to have this line of code as well? Or should the output always be returned in the same data container as the input? If so, we can still allow data container type changes in these methods via set_output, but most likely in baby steps first, since this is quite a challenging design question to solve!

fkiraly · 2024-09-14T17:40:45Z

From our discussion, summarizing what I think are the key points for this PR, to avoid it moving around with minor changes:

a clear specification and use cases written down for set_output and the new config field, what it should do. Scenarios that we need to cover are pandas and polars inputs both.
this should, in particular, include a specification for what comes out of the proba methods if a polars output is produced
I would then add a section on how these outputs are getting mapped onto mtypes, and the datatypes module, a precise suggestion.

I am happy to write these, if you feel you are stuck - please let me know. Or, if they exist somewhere, links would be appreciated.

> I'll also note that the new code that supports mi-columns and si-columns/indexes for polars exists inside adapters.polars and not inside the table conversion utilities.

I see - I would though not suggest to split here too much for now, instead we could just add the

> I'll also note that although predict_var is only single column, there are still some failures in the CI, because one of the notebook-examples (the one with plot_crossplot_std)

This seems a bit suspicious, why would these changes impact the plotting function at all? I was under the impression that all existing code should run unchanged. Do you have an explanation for the failure?

fkiraly · 2024-09-14T17:43:00Z

> I remember asking you before why in the base code the output conversion only existed inside the predict function and nothing else. Could you kindly remind me why?

That is because of the different contracts of the underscore-methods, which are hopefully correctly stated in the extension template:

_predict gets input as X_inner_mtype and y_inner_mtype. It should return an output as y_inner_mtype type.
_predict_quantiles gets input as X_inner_mtype and y_inner_mtype. It needs to return the output in the specific pd.DataFrame based format, there is currently only a single choice for this.

Due to this, in the case of _predict, we need to convert the output too, if the returned y_pred should be of the same type as y seen in fit; but we do not need to do so for _predict_quantiles.
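
The difference between the two contracts can be sketched with a toy base class. ToyBase and MeanRegressor are hypothetical stand-ins and not skpro's BaseProbaRegressor; the sketch assumes the inner type is always pd.DataFrame and that the only other container is pd.Series.

```python
import pandas as pd


class ToyBase:
    """Toy base class; the inner type of y is always pd.DataFrame."""

    def fit(self, X, y):
        # remember the container seen in fit, then convert to the inner type
        self._y_is_series = isinstance(y, pd.Series)
        y_inner = y.to_frame() if self._y_is_series else y
        self._fit(X, y_inner)
        return self

    def predict(self, X):
        y_pred = self._predict(X)  # comes back in the inner type
        if self._y_is_series:
            # convert back, so the output matches the type of y seen in fit
            y_pred = y_pred.iloc[:, 0]
        return y_pred

    def predict_quantiles(self, X, alpha):
        # fixed pd.DataFrame based format, so no back-conversion is applied
        return self._predict_quantiles(X, alpha)


class MeanRegressor(ToyBase):
    """Predicts the training mean; quantiles are degenerate at the mean."""

    def _fit(self, X, y):
        self._mean = y.mean()
        self._cols = list(y.columns)

    def _predict(self, X):
        rows = [[self._mean[c] for c in self._cols]] * len(X)
        return pd.DataFrame(rows, index=X.index, columns=self._cols)

    def _predict_quantiles(self, X, alpha):
        cols = pd.MultiIndex.from_product([self._cols, alpha])
        rows = [[self._mean[var] for var, _ in cols]] * len(X)
        return pd.DataFrame(rows, index=X.index, columns=cols)


X = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
y = pd.Series([1.0, 2.0, 3.0], name="y")
reg = MeanRegressor().fit(X, y)
y_pred = reg.predict(X)                        # pd.Series, like y in fit
q_pred = reg.predict_quantiles(X, [0.1, 0.9])  # pd.DataFrame, fixed format
```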

fkiraly · 2024-09-14T17:43:45Z

> And given that reasoning, do you think it makes sense to have the other predict_* functions have this line of code also?

Yes, but weren't you already adding such a conversion in this PR? Which I think makes sense.

julian-fong · 2024-09-16T14:15:17Z

> _predict_quantiles gets input as X_inner_mtype and y_inner_mtype. It needs to return the output in the specific pd.DataFrame based format, there is currently only a single choice for this.

Ah, I see. So in your opinion, should predict_interval, predict_quantiles, and predict_var (for now) not trigger any convert function unless specified by the user via set_output? I.e.:

Scenario 1a: User passes in a pandas DataFrame as X_inner_mtype and y_inner_mtype and does not utilize set_output, thus the output is returned as pd.DataFrame

Scenario 1b: User passes in a polars DataFrame as X_inner_mtype and y_inner_mtype and does not utilize set_output, thus the output is returned as pd.DataFrame

Scenario 2a: User passes in a pandas DataFrame as X_inner_mtype and y_inner_mtype and utilizes set_output for polars, thus the output is returned as pl.DataFrame

Scenario 2b: User passes in a polars DataFrame as X_inner_mtype and y_inner_mtype and utilizes set_output for pandas, thus the output is returned as pd.DataFrame

instead of automatically trying to convert it back to the input type of X_inner_mtype and y_inner_mtype, i.e.:

Scenario 1: User passes in a pandas Series as y_inner_mtype and does not utilize set_output, thus the code will attempt to return the output as pd.Series as well, due to the passed input y_inner_mtype

Scenario 2: User passes in a polars DataFrame as X_inner_mtype and y_inner_mtype and does not utilize set_output, thus the code will attempt to return the output as pl.DataFrame as well, due to the passed input y_inner_mtype
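
The two sets of scenarios above differ only in what happens when set_output is not used. A minimal sketch of that decision follows; the function output_container and its mirror_input flag are hypothetical and only model the choice of container, not any conversion.

```python
def output_container(input_container, set_output="default", mirror_input=True):
    """Return the container name ("pandas" or "polars") for the output.

    mirror_input=False models scenarios 1a-2b: pandas unless set_output is used.
    mirror_input=True models scenarios 1-2: the output follows the input type.
    """
    if set_output != "default":
        # an explicit user choice wins under either policy
        return set_output
    return input_container if mirror_input else "pandas"


# first set of scenarios
s1a = output_container("pandas", mirror_input=False)
s1b = output_container("polars", mirror_input=False)
s2a = output_container("pandas", set_output="polars", mirror_input=False)
s2b = output_container("polars", set_output="pandas", mirror_input=False)

# second set: a polars input is mirrored when set_output is not used
s2 = output_container("polars", mirror_input=True)
```

Because the policies differ in a single branch, switching between them later should be a small change once the datatypes are in place.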

fkiraly · 2024-09-16T16:45:43Z

I would do the second, not the first set of scenarios, because it is consistent with the current input/output behaviour. The quoted line is just explaining the status quo.

I think once we have the datatype extension though, it will not be difficult to switch between the two.

julian-fong · 2024-09-16T17:57:02Z

> I would do the second, not the first set of scenarios, because it is consistent with input/output behaviour as current. The quoted line is just explaining the status quo.
>
> I think once we have the datatype extension though, it will not be difficult to switch between the two.

I see, then it would be a good idea to write out the various possible data container types for the proba scitype to facilitate this. Maybe polars and pandas dataframes would be enough for now?

julian-fong · 2024-09-17T03:25:18Z

Part of these key points can be found in the design doc

> a clear specification and use cases written down for set_output and the new config field, what it should do. Scenarios that we need to cover are pandas and polars inputs both.

This is covered in section 5.2. The main use cases for set_output would be 'pandas' and 'polars', with a third option named 'default' which is to indicate default behaviour.

> this should, in particular, include a specification for what comes out of the proba methods if a polars output is produced

Polars dataframe outputs are described in section 3 for proba methods predict_interval, predict_quantiles, and predict_var

> I would then add a section on how these outputs are getting mapped onto mtypes, and the datatypes module, a precise suggestion.

I may need some help with how the outputs get mapped onto mtypes and scitypes; I've described a solution in section 5.3. Help on how to design the mapping method, how to integrate data container support for the Proba scitype, and how to handle the various scenarios would be appreciated.

julian-fong and others added 30 commits May 26, 2024 15:54

create test_polars.py file

539fd51

updates

4712da9

initial commit

fe5333b

added polars eager table to allowed mtypes in base regressor

5f578fd

added draft version of testing fit and predict in polars dataframe

cf8a0d5

fixed to use skpro check soft dependencies

9357486

updated tests

1a23ee0

added test for predict_quantiles

89079f6

fixed naming of pandas datafarmes

02f699f

Merge branch 'sktime:main' into polars_support

c49ed0e

added test for check_polars_table

be084ef

updates to pr

5c3697e

updated estimator to be a pytest fixture for one estimator

32e700a

Merge branch 'sktime:main' into polars_support

0470817

bug fix

497e1ef

update

8d3b541

update

782e714

updates

39590f7

updates

20643c5

updates

05e96bf

updates

ad697a3

updates

00ac2bf

Merge branch 'sktime:main' into polars_support

78d5d46

Merge branch 'sktime:main' into polars_support

f464b7f

Merge branch 'sktime:main' into polars_support

5eba103

updates to remove unnecessary skipifs and changed the estimator used for testing

227d623

Merge branch 'sktime:main' into polars_support

0b51616

write several functions

cf227f4

added _config and a new method set_output

895044e

added check_transform_config

ec36ded

julian-fong and others added 2 commits August 19, 2024 09:02

enabling tests

d986683

Merge branch 'sktime:main' into polars_regression

032b2f2

fkiraly reviewed Sep 8, 2024

Merge branch 'main' into pr/399

a9cca9e

fkiraly requested changes Sep 8, 2024

fkiraly added enhancement module:regression probabilistic regression module labels Sep 8, 2024

julian-fong and others added 7 commits September 9, 2024 23:09

Merge branch 'sktime:main' into polars_regression

7870580

changed from transform to transform_output

4b28aaa

bug fix

45ad110

updates

f00a76b

made utils private and decoupled _transform_output from _convert functionality

59d747e

added test support for predict_var

8a2e14d

removed support for predict_quantiles and predict_interval as they are not table scitypes

bfb1ef6

fkiraly mentioned this pull request Sep 10, 2024

[ENH] mtype for multi-indexed pandas and polars dataframes #460

Open

commented out pred_var

572938a

updates

f985047

julian-fong closed this Jan 1, 2025
