
Graphical representation for the users #51

Open
lionelkusch opened this issue Dec 10, 2024 · 23 comments
Labels
examples Question link to the examples

Comments

@lionelkusch
Collaborator

My summary of the discussion with @AngelReyero and @jpaillard:
For basic users:

  • showing the importance of features or groups of features (horizontal line with a point at the end); only the most important ones, given the high dimensionality
  • showing the distribution of the importance scores (box plot and/or violin plot)
  • showing the data (needs to be adjusted to the type of data)
  • showing the correlation of features (correlation plot ordered by a clustering, to highlight blocks of correlated features); a rough sketch of these plots follows after the lists

For more advanced users (not yet finalised):

  • comparison of the different methods (bar plot or box plot)
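
As a rough illustration of these ideas (not library code; the importance values, feature names and data below are all made up), a minimal matplotlib/scipy sketch of the point plot, the violin plot and the clustered correlation plot:

```python
# Rough sketch (not library code): made-up importances and data, just to
# illustrate the point plot, the violin plot and the clustered correlation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n_features, n_repeats = 8, 30
names = [f"x{i}" for i in range(n_features)]
importances = rng.normal(loc=np.arange(n_features), size=(n_repeats, n_features))
X = rng.standard_normal((200, n_features))

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14, 4))

# 1. Horizontal line with a point at the end, only the most important features.
mean_imp = importances.mean(axis=0)
top = np.argsort(mean_imp)[-5:]
ax1.hlines(range(len(top)), 0, mean_imp[top])
ax1.plot(mean_imp[top], range(len(top)), "o")
ax1.set_yticks(range(len(top)), [names[i] for i in top])
ax1.set_title("Mean importance (top features)")

# 2. Distribution of the importance scores.
ax2.violinplot(importances, showmedians=True)
ax2.set_xticks(range(1, n_features + 1), names)
ax2.set_title("Importance distribution")

# 3. Correlation matrix reordered by hierarchical clustering to show blocks.
corr = np.corrcoef(X, rowvar=False)
order = leaves_list(linkage(squareform(1 - np.abs(corr), checks=False), "average"))
ax3.imshow(corr[np.ix_(order, order)], vmin=-1, vmax=1, cmap="coolwarm")
ax3.set_title("Clustered correlation")

plt.tight_layout()
plt.show()
```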
@lionelkusch
Collaborator Author

Add graphics for "simple" linear dependence between variables.

lionelkusch added the examples label Dec 18, 2024
@lionelkusch
Collaborator Author

@bthirion
Contributor

We need to investigate whether this library indeed solves PDP, in which case we should reuse it rather than reimplement.

@lionelkusch
Collaborator Author

Do you want to include PDP as a model in the library?

@lionelkusch
Collaborator Author

I found that scikit-learn has already implemented it, but its graphical representation is quite limited.
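
For reference, a minimal sketch of the existing scikit-learn display (placeholder data and model; kind="both" overlays the individual ICE lines on the average curve):

```python
# Minimal sketch of scikit-learn's current partial dependence display.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Average curve plus individual (ICE) lines for the first two features.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1], kind="both")
plt.show()
```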

@bthirion
Contributor

We should start with a nice graphical example showcasing it. Note that you should talk to the sklearn team (e.g. @glemaitre); they want to rework the variable importance part.

@lionelkusch
Collaborator Author

@glemaitre

I found that scikit-learn has already implemented it, but its graphical representation is quite limited.

Currently, for partial dependence we have a graphical representation that shows the average, with an option to show the individual lines, and also a bar plot representation for categorical variables.

@lionelkusch I'm wondering which graphical representations you would suggest adding. I think we have more or less the ones presented in Christoph Molnar's book.

We should start with a nice graphical example showcasing it. Note that you should talk to the sklearn team (e.g. @glemaitre); they want to rework the variable importance part.

When it comes to the broader scope of variable importance, we indeed want to add to or rework some parts of scikit-learn:

  • With the help of the CZI grant, we aim to improve the displays by adding de facto standard visualizations that are currently missing:
    • Extending the displays by adding a .from_cv_results method. This method should take an ensemble of models (obtained by cross-validation) and show the distribution of the desired information (e.g. multiple ROC curves).
    • Adding standard visualizations that are currently missing in scikit-learn. For instance, a variable importance display could be one of them, showing the feature_importances_ or coef_ values. Combined with the previous item on cross-validation, we could show distributions of those.
  • Also within the CZI scope, we intend to rework/extend variable importance itself: no single method shines out there, and our current implementation exposing a feature_importances_ attribute is misleading. We would like to come up with a more generic function/method such that, given a model, you can compute the desired importance while being aware of its limitations. This one is a bit more long term, but I'm happy to discuss it because it can inform our design (a draft of some potential avenues: SLEP021: Unified API for computing feature importance, scikit-learn/enhancement_proposals#86).

@lionelkusch
Collaborator Author

[image: scikit-learn partial dependence plot]
On this graphical representation, I didn't notice the data distribution on the x-axis the first time I saw it.
For the 2D plot, the distribution of the data is only shown on the x-axis; it should be on the y-axis as well.
To improve it, you could change the colour of the data-distribution marks or add some space between them and the x-axis, because they can be confused with the tick marks. For the 2D plot, adding a white rectangle under them would make the distribution more readable.

Another possibility, instead of plotting each point of the dataset, is to plot a histogram. You can find some examples in the shap library:
[image: example plot from the shap library]

I found another library, ALEPython (https://github.com/blent-ai/ALEPython), which proposes a more complex representation, but I like it.
It includes the quartiles of the data distribution on the top and on the right. However, it can be too much information on the same graphic.
[image: ALEPython plot example]
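
To make the suggestion concrete, a sketch (not the scikit-learn API; it assumes a recent scikit-learn where partial_dependence returns grid_values) that computes the partial dependence and draws an explicit marginal histogram of the feature values above the curve, instead of the small tick marks:

```python
# Sketch: partial dependence curve with an explicit marginal histogram of the
# feature values, instead of tick marks that can be confused with the axis.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

pd_result = partial_dependence(model, X, features=[0])
grid = pd_result["grid_values"][0]
avg = pd_result["average"][0]

fig, (ax_hist, ax_pdp) = plt.subplots(
    2, 1, sharex=True, gridspec_kw={"height_ratios": [1, 4]}, figsize=(5, 5))
ax_hist.hist(X[:, 0], bins=30, color="lightgray")  # data distribution
ax_hist.set_ylabel("count")
ax_pdp.plot(grid, avg)                             # partial dependence
ax_pdp.set_xlabel("feature 0")
ax_pdp.set_ylabel("partial dependence")
plt.tight_layout()
plt.show()
```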

@glemaitre

Thanks for the example, I see what you mean now.

We need to go back to the following PR, but I think it partially addresses some of the first remarks: scikit-learn/scikit-learn#27388.

Maybe merging the quantile information into the marginal plot would be a way to ensure that:

  • we don't confuse it with the x- or y-axis
  • and we are explicit about what those ticks are

@lionelkusch
Collaborator Author

@glemaitre
I have looked at the pull request and SLEP021. I have a few questions, probably because I'm new to the topic of variable importance and not familiar with scikit-learn development.

As I understand it, feature_importances_ is only implemented for trees, based on Gini importance.
Why not rename feature_importances_ to gini_importances_?

If you want to generalize feature_importances_ to all models, from my point of view it's not possible, because some models don't handle data in a human-understandable way, e.g. nonlinear models or models with high-dimensional predictors. Suppose we use the definition that feature importance measures the impact of a feature on the mean prediction of an estimator. In that case, extracting feature importance is non-trivial, as some features do not have a constant overall effect (especially with non-linear models), which can be difficult to summarize in a single value. In addition, there is a trade-off between the importance of a variable in a sub-domain of the data and its overall importance, which is quite complex to capture.

@glemaitre

As I understand it, feature_importances_ is only implemented for trees, based on Gini importance.
Why not rename feature_importances_ to gini_importances_?

It is criterion agnostic, meaning that if you use entropy, it will be based on that criterion. It is called mean decrease in impurity (MDI), which is a more generic name. If we go with a renaming, then we should take the opportunity to make it possible to compute different importances rather than providing a single one (and one computed on the training set only, at that).

If you want to generalize feature_importances_ to all models

Actually, we want to go against generalizing. We think there is no single importance suitable for all models. However, we want to provide a common programmatic entry point for those methods so that the user can choose. For instance, the API should be flexible enough to let the user choose between MDI and permutation importance (or SHAP), and we should not pick a feature_importances_ for them when there is no "good" choice.
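
To make the entry-point idea concrete, here is how a user computes both importances with today's scikit-learn (a sketch with a toy dataset; the unified API in SLEP021 is only a proposal at this stage):

```python
# The two importances a user can already compute, which a common entry point
# would let them choose between explicitly.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# MDI (impurity-based), only available on the training data.
mdi = model.feature_importances_

# Permutation importance, computed here on a held-out set.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

print("MDI:", mdi)
print("Permutation:", perm.importances_mean)
```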

@lionelkusch
Collaborator Author

We just started a discussion on the API for the library in issue #104. If you follow the conversation, it will probably give you some additional hints.

@lionelkusch
Collaborator Author

It could be interesting to include a graphic like the Manhattan plot: https://en.wikipedia.org/wiki/Manhattan_plot
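
A quick sketch of what such a plot could look like for feature-level p-values (made-up p-values and a hypothetical grouping of features):

```python
# Manhattan-style plot: one -log10(p-value) per feature, coloured by a
# (hypothetical) feature group, with a Bonferroni significance threshold.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_features = 200
p_values = rng.uniform(size=n_features)            # made-up p-values
groups = np.repeat(np.arange(4), n_features // 4)  # hypothetical feature groups

fig, ax = plt.subplots(figsize=(8, 3))
for g in np.unique(groups):
    idx = np.where(groups == g)[0]
    ax.scatter(idx, -np.log10(p_values[idx]), s=10, label=f"group {g}")
ax.axhline(-np.log10(0.05 / n_features), color="red", linestyle="--",
           label="Bonferroni threshold")
ax.set_xlabel("feature index")
ax.set_ylabel("-log10(p-value)")
ax.legend(ncol=5, fontsize="small")
plt.tight_layout()
plt.show()
```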

@bthirion
Contributor

bthirion commented Jan 8, 2025

This corresponds to marginal tests, but you're right, it is part of the picture.

@lionelkusch
Collaborator Author

@bthirion
Contributor

Nice indeed.

@lionelkusch
Collaborator Author

This is a nice paper on visual representations of the results: https://arxiv.org/pdf/1610.00290

@lionelkusch
Collaborator Author

I found an R library specialised in the representation of variable importance: https://alaninglis.github.io/vivid/articles/vividVignette.html
I like their generalised partial dependence pairs plot, but I don't think it's adapted to high-dimensional data.

@lionelkusch
Collaborator Author

I found an example of PDP based on alluvial plots, with interactive examples: https://github.com/erblast/easyalluvial#partial-dependence-alluvial-plots
This could be an idea for representing results where groups of variables are present.

@lionelkusch
Collaborator Author

Another way is to use a parallel coordinate plot: https://github.com/simonpradel/visualize-hyperparameter
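
As a small illustration (made-up scores, hypothetical method names), pandas already provides a basic parallel coordinate plot:

```python
# Parallel coordinate plot: one line per (hypothetical) importance method,
# one vertical axis per feature, using made-up scores.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)
methods = ["permutation", "MDI", "CPI"]  # hypothetical method names
df = pd.DataFrame(rng.random((len(methods), 6)),
                  columns=[f"x{i}" for i in range(6)])
df["method"] = methods

parallel_coordinates(df, class_column="method")
plt.ylabel("importance score")
plt.tight_layout()
plt.show()
```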

@bthirion
Contributor

I think it does not scale really well when you increase the number of variables...
