[ENH] Polars adapter enhancements #449

julian-fong · 2024-08-14T15:06:27Z

Adds index support as part of #440, and syncs up the polars conversion utilities between skpro and sktime.

The corresponding sktime PR for the polars conversion utilities is sktime/sktime#6455.

In this PR:

When a pandas DataFrame is the from_type and a polars frame is the to_type, the conversion saves the index (assumed never to be a MultiIndex) by inserting it as a separate column named __index__. The resulting pandas DataFrame is then converted to a polars DataFrame.

In the inverse function, when converting from a polars DataFrame to a pandas DataFrame, if the column __index__ exists in the pandas DataFrame after conversion, that column is set as the index before the pandas DataFrame is returned.
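The two steps described above can be sketched with pandas alone. This is a minimal illustration, not the code in skpro/datatypes/_adapter/polars.py: the helper names index_to_column and column_to_index are hypothetical, and the hand-off to polars (for example via polars.from_pandas and the frame's to_pandas method) is only indicated in comments.

```python
import pandas as pd

INDEX_COL = "__index__"  # reserved column name from the PR description


def index_to_column(obj: pd.DataFrame) -> pd.DataFrame:
    """Move a single-level index into an ``__index__`` column (hypothetical helper)."""
    # rename_axis and reset_index both return new frames, so the result
    # has to be kept; calling obj.reset_index() alone changes nothing.
    flat = obj.rename_axis(INDEX_COL).reset_index()
    # The flat frame would be passed to polars at this point.
    return flat


def column_to_index(obj: pd.DataFrame) -> pd.DataFrame:
    """Inverse step: map the ``__index__`` column back to the index."""
    # obj is assumed to be the pandas frame obtained back from polars.
    if INDEX_COL not in obj.columns:
        return obj
    out = obj.set_index(INDEX_COL)
    out.index.name = None  # the original index was unnamed in this sketch
    return out


df = pd.DataFrame({"a": [1.0, 4.0, 0.5, -3.0]}, index=[10, 20, 30, 40])
flat = index_to_column(df)      # columns: ["__index__", "a"]
restored = column_to_index(flat)
```

Note that this sketch drops any index name, which is the limitation raised in the review comments.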

After this is merged, #447 will be implemented as a polars-only estimator. Tests will also be written to check polars input end to end, as well as pandas input and output through the polars estimator (i.e., pandas input into polars estimator -> polars predictions -> pandas output).

pranavvp16

Left some questions about how the index name is handled in the conversion utils. The code gets the job done currently, but if I convert a pandas DataFrame whose index has a name to polars and then convert it back to pandas, the index name is lost.

pranavvp16 · 2024-08-14T20:23:45Z

skpro/datatypes/_adapter/polars.py

+    from polars import from_pandas
+
+    obj.reset_index()
+    obj.rename(columns={"index": "__index__"})


How about we retain the index name too? This code won't work for DataFrames that already have a name for their index. Does skpro assume that a DataFrame will always have a default RangeIndex without a name, @fkiraly @julian-fong? There may be scenarios where a named index is passed, and the conversion util will fail in that case.

sktime handles this by appending the index name to the __index__ prefix. For example, if the DataFrame has an index named 1, the util converts it to __index__1 in the polars frame.
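The naming convention described here can be sketched as follows. This is an illustration under assumptions, not the sktime or skpro implementation: the helper names are hypothetical, only pandas is used, and the restored name always comes back as a string (an index named 1 would return as "1").

```python
import pandas as pd

PREFIX = "__index__"  # prefix convention described in the review comment


def index_to_named_column(obj: pd.DataFrame) -> pd.DataFrame:
    """Store the index as a column named ``__index__<name>`` (hypothetical helper)."""
    name = obj.index.name
    col = PREFIX if name is None else f"{PREFIX}{name}"
    return obj.rename_axis(col).reset_index()


def named_column_to_index(obj: pd.DataFrame) -> pd.DataFrame:
    """Restore the index and its name from an ``__index__<name>`` column."""
    index_cols = [c for c in obj.columns if isinstance(c, str) and c.startswith(PREFIX)]
    if not index_cols:
        return obj
    col = index_cols[0]
    out = obj.set_index(col)
    # Everything after the prefix is the original name; empty means unnamed.
    out.index.name = col[len(PREFIX):] or None
    return out


named = pd.DataFrame({"a": [1.0, 4.0]}, index=pd.Index([7, 9], name="id"))
flat_named = index_to_named_column(named)   # columns: ["__index__id", "a"]
back_named = named_column_to_index(flat_named)
```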

julian-fong · 2024-08-14

That's a good point, I'll make that change then.

fkiraly · 2024-08-16

Yes, I thought that's what is happening in the dask converter as well, is it?

julian-fong · 2024-08-15T01:30:20Z

Tests are failing in jobs because the __index__ column is being returned and there is no easy way to get rid of it, i.e.,

┌───────────┬──────┐
│ __index__ ┆ a    │
│ ---       ┆ ---  │
│ i64       ┆ f64  │
╞═══════════╪══════╡
│ 0         ┆ 1.0  │
│ 1         ┆ 4.0  │
│ 2         ┆ 0.5  │
│ 3         ┆ -3.0 │
└───────────┴──────┘

is returned when the test expects

┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 1.0  │
│ 4.0  │
│ 0.5  │
│ -3.0 │
└──────┘

@fkiraly any preferred method on how to solve this?

pranavvp16 · 2024-08-15T05:54:09Z

@julian-fong I know why the tests are failing here: you need to change the polars fixture example to match this convention. In skpro, the examples are located at skpro/datatypes/_table/_examples.py. What you can do is import the conversion util and use it to convert the pandas DataFrame that is being used.

fkiraly · 2024-08-16T10:48:40Z

@fkiraly any preferred method on how to solve this?

My answer would be to not include an __index__ column if the index is a RangeIndex, that is:

If you convert from pd_DataFrame_Table to the polars type:

check whether the index is a RangeIndex;
if yes, include no __index__ column; otherwise, include the index as an __index__[indexname] column.

fkiraly

Looks good - @pranavvp16, can you kindly review this, too?

julian-fong · 2024-08-17T11:44:48Z

I forgot to include the async discussion; it is summarized as follows.

In the code, we skip adding __index__ only if the index of the pandas DataFrame is the trivial index, i.e., RangeIndex(0, n_rows).

Otherwise, we build a new column called __index__* using pandas' built-in reset_index() function.
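The rule summarized above might look roughly like this. It is a sketch under assumptions rather than the merged code: the helper names has_trivial_index and prepare_for_polars are hypothetical, a named RangeIndex is treated as non-trivial here (the discussion does not say how that case is handled), and the actual conversion to polars is left out.

```python
import pandas as pd


def has_trivial_index(obj: pd.DataFrame) -> bool:
    """True if the index is the default RangeIndex(0, n_rows) with no name."""
    idx = obj.index
    return (
        isinstance(idx, pd.RangeIndex)
        and idx.start == 0
        and idx.step == 1
        and idx.stop == len(obj)
        and idx.name is None  # assumption: a named index is worth keeping
    )


def prepare_for_polars(obj: pd.DataFrame) -> pd.DataFrame:
    """Add an ``__index__*`` column only when the index carries information."""
    if has_trivial_index(obj):
        return obj  # nothing to save, so no extra column is created
    name = obj.index.name
    col = "__index__" if name is None else f"__index__{name}"
    return obj.rename_axis(col).reset_index()


trivial = pd.DataFrame({"a": [1.0, 4.0, 0.5, -3.0]})
shifted = trivial.iloc[1:]                      # RangeIndex(1, 4): not trivial
custom = pd.DataFrame({"a": [1.0, 4.0]}, index=[2, 5])

cols_trivial = list(prepare_for_polars(trivial).columns)
cols_shifted = list(prepare_for_polars(shifted).columns)
cols_custom = list(prepare_for_polars(custom).columns)
```

With this check, the fixture from the failing test (a single column a over a default index) would pass through without an __index__ column.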

pranavvp16 · 2024-08-17T12:26:28Z

Looks good - @pranavvp16, can you kindly review this, too?

LGTM!

julian-fong · 2024-08-17T13:42:05Z

Thank you for the reviews!

julian-fong added 5 commits August 13, 2024 16:00

initial commit

2cbb933

intial commit

c94b612

updated _convert

17c3126

updated to from_pandas

fc46414

removed duplicative code

ae9ee7b

pranavvp16 reviewed Aug 14, 2024

julian-fong added 4 commits August 14, 2024 17:05

fixed naming convention for indices to use __index__{col_name}

98ea699

fixed name to only include original index name in returned dataframe

446e180

refactored current polars tests and fixed code

c220f37

refactored lazy frames to use .collect_schema().names() to fix warning

e8cacd6

fkiraly assigned julian-fong Aug 15, 2024

julian-fong added 4 commits August 15, 2024 07:05

added conversion util for polars examples and removed commented code

29d3f40

refactored check_polars_frame to ignore __index__ columns and edited examples to use conversion methods

16e8f51

bug fix

8f97233

updated n_features calculation

3bf810c

julian-fong added 2 commits August 16, 2024 10:33

added code to not include __index__ if df.index is trivial

2d4d2d1

removed line

6498824

fkiraly approved these changes Aug 17, 2024

This was referenced Aug 18, 2024

[ENH] Enhancing polars support by introducing set_output #399

Closed

[ENH] Add polars version of dummy proba regressor #447

Closed

fkiraly merged commit e360e73 into sktime:main Aug 18, 2024
30 checks passed

fkiraly added the enhancement and module:datatypes (data containers, checkers & converters) labels Aug 18, 2024
