Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Polars adapter enhancements #449

Merged
merged 15 commits into from
Aug 18, 2024

Conversation

julian-fong
Copy link
Contributor

@julian-fong julian-fong commented Aug 14, 2024

adds index support as part of #440 and is used to sync up polars conversion utilities between skpro and sktime.

Correponding sktime pr for polars conversion utilities is sktime/sktime#6455.

In this pr:

If a pandas Dataframe is a from_type and polars frame is a to_type then during the conversion, we will save the index (assumed never to be in multi-index format) and insert it as an individual column with column name __index__. Then the resulting pandas dataframe will be converted to a polars dataframe.

In the inverse function, if we are converting from polars dataframe to pandas dataframe, if the column __index__ exists in the pandas dataframe post-conversion, then we will map that column to the index before returning the pandas Dataframe

After this is merged, #447 will be implemented as a polars only estimator. tests will also be written to check polars input end to end and pandas input and output through the polars estimator (i.e pandas input into polars estimator -> polars predictions -> pandas output)

Copy link

@pranavvp16 pranavvp16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions about how is index name being handled in the conversion utils. The code will get the job done currently, but suppose I convert a pandas DataFrame having a index with a name to polars, and then convert it back to pandas the index name would be lost.

from polars import from_pandas

obj.reset_index()
obj.rename(columns={"index": "__index__"})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we retain the index name too. This code won't work for data frames that already have a name for their index. Does skpro assumes that data frame will always have a default RangeIndex without any name @fkiraly @julian-fong ??, there may be scenarios where a index having name is passed and the conversion util will fail in this case.

How sktime handles this is by appending the name infront of the __index__ convention. So suppose if the DataFrame has a index named 1 the util will convert it to __index__1 in polars frame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats a good point, i'll make that change then

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought that's what is happening in the dask converter as well, is it?

@julian-fong
Copy link
Contributor Author

julian-fong commented Aug 15, 2024

Tests are failing in jobs because the __index__ is being returned and there is no easy way to get rid of it. i.e

┌───────────┬──────┐
│ __index__ ┆ a    │
│ ---       ┆ ---  │
│ i64       ┆ f64  │
╞═══════════╪══════╡
│ 0         ┆ 1.0  │
│ 1         ┆ 4.0  │
│ 2         ┆ 0.5  │
│ 3         ┆ -3.0 │
└───────────┴──────┘

is returned when it expects

┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 1.0  │
│ 4.0  │
│ 0.5  │
│ -3.0 │
└──────┘

@fkiraly any preferred method on how to solve this?

@pranavvp16
Copy link

@julian-fong I know why the tests are failing here, you need to change the polars fixture example to match this convention, in skpro the examples are located at skpra/datatypes/_table/_examples.py. What you can do here is import the conversion util and convert the pandas DataFrame that is being used

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 16, 2024

Tests are failing in jobs because the __index__ is being returned and there is no easy way to get rid of it. i.e

┌───────────┬──────┐
│ __index__ ┆ a    │
│ ---       ┆ ---  │
│ i64       ┆ f64  │
╞═══════════╪══════╡
│ 0         ┆ 1.0  │
│ 1         ┆ 4.0  │
│ 2         ┆ 0.5  │
│ 3         ┆ -3.0 │
└───────────┴──────┘

is returned when it expects

┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 1.0  │
│ 4.0  │
│ 0.5  │
│ -3.0 │
└──────┘

@fkiraly any preferred method on how to solve this?

My answer would be to not include an __index__ if it is a RangeIndex, that is:

If you convert from pd_DataFrame_Table to the polars type:

  1. check if index is RangeIndex
  2. if yes, include no __index__ column; otherwise, include index as __index__[indexname] column.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - @pranavvp16, can you kindly review this, too?

@julian-fong
Copy link
Contributor Author

julian-fong commented Aug 17, 2024

Tests are failing in jobs because the __index__ is being returned and there is no easy way to get rid of it. i.e

┌───────────┬──────┐
│ __index__ ┆ a    │
│ ---       ┆ ---  │
│ i64       ┆ f64  │
╞═══════════╪══════╡
│ 0         ┆ 1.0  │
│ 1         ┆ 4.0  │
│ 2         ┆ 0.5  │
│ 3         ┆ -3.0 │
└───────────┴──────┘

is returned when it expects

┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 1.0  │
│ 4.0  │
│ 0.5  │
│ -3.0 │
└──────┘

@fkiraly any preferred method on how to solve this?

My answer would be to not include an __index__ if it is a RangeIndex, that is:

If you convert from pd_DataFrame_Table to the polars type:

  1. check if index is RangeIndex
  2. if yes, include no __index__ column; otherwise, include index as __index__[indexname] column.

Forgot to include async discussion but it is summarized as follows.

In the code: we will only ignore the addition of __index__ if the index of the pandas DataFrame is the trivial index i.e RangeIndex(0,n_rows)

Otherwise, we will build a new column called __index__* using the reset_index() function built in pandas.

@pranavvp16
Copy link

Looks good - @pranavvp16, can you kindly review this, too?

LGTM!

@julian-fong
Copy link
Contributor Author

Thank you for the reviews!

@fkiraly fkiraly merged commit e360e73 into sktime:main Aug 18, 2024
30 checks passed
@fkiraly fkiraly added enhancement module:datatypes datatypes module: data containers, checkers & converters labels Aug 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement module:datatypes datatypes module: data containers, checkers & converters
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants