Example for preprocessing.dictmapper.DictMapper and meta.outlier_classifier.OutlierClassifier #646

Merged
merged 7 commits into from
Mar 29, 2024
27 changes: 27 additions & 0 deletions sklego/meta/outlier_classifier.py
@@ -52,6 +52,33 @@ def fit(self, X, y=None):
ValueError
- If the underlying model is not an outlier detection model.
- If the underlying model does not have a `decision_function` method.

Example
Collaborator

Hey @anopsy, I have the same feedback as for DictMapper: if you could move the example up in the docstring, I think it would be easier and faster for folks to find when scrolling through the API documentation, without needing to step down into the .fit(..) method.

I think this example is ready to merge after that change

Contributor Author

I will do that!

-------
```py
from sklearn.ensemble import IsolationForest
from sklego.meta.outlier_classifier import OutlierClassifier

X = [[0], [0.5], [-1], [99]]
y = [0, 0, 0, 1]

isolation_forest = IsolationForest()
isolation_forest.fit(X)
detector_preds = isolation_forest.predict(X)
# array([ 1,  1,  1, -1])

outlier_clf = OutlierClassifier(isolation_forest)
_ = outlier_clf.fit(X, y)

preds = outlier_clf.predict([[100], [-0.5], [0.5], [1]])
# array([1., 0., 0., 0.])

proba_preds = outlier_clf.predict_proba([[100], [-0.5], [0.5], [1]])
# [[0.34946567 0.65053433]
# [0.79707913 0.20292087]
# [0.80275406 0.19724594]
# [0.80275406 0.19724594]]
```
"""
X, y = check_X_y(X, y, estimator=self)
if not self._is_outlier_model():
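The example in this hunk mixes two label conventions: scikit-learn outlier detectors such as `IsolationForest` return `1` for inliers and `-1` for outliers, while the `y` labels and the `OutlierClassifier` predictions shown use `0` for inliers and `1` for outliers. A minimal pure-Python sketch of that relabelling follows; `detector_to_class_labels` is a hypothetical helper written for illustration, not part of sklego.

```python
def detector_to_class_labels(detector_preds):
    """Map detector labels (1 = inlier, -1 = outlier) to class labels (0 = inlier, 1 = outlier)."""
    mapping = {1: 0, -1: 1}
    return [mapping[pred] for pred in detector_preds]


# Detector output from the docstring example above.
detector_preds = [1, 1, 1, -1]
class_labels = detector_to_class_labels(detector_preds)
print(class_labels)  # [0, 0, 0, 1]
```

The relabelled output lines up with the `y` used to fit the classifier in the example, which is why only the last training point counts as the positive class.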
76 changes: 76 additions & 0 deletions sklego/preprocessing/dictmapper.py
@@ -43,6 +43,41 @@ def fit(self, X, y=None):
-------
self : DictMapper
The fitted transformer.

Example
Collaborator

If you manage to show how it interacts with either sklego.preprocessing.ColumnSelector or sklearn.compose.ColumnTransformer, I believe it would be of great help and ready to merge.

Contributor Author

Yes, this double dict was making me really uncomfortable 😅 I'm on it

Contributor Author

I did the fixes, but I'm unable to push them. I need to figure out what's going on, and then I'll be back 😅

Contributor Author

Yes, I was also confused. However, I finally figured out what was happening and was able to push this time. The problem was that one of the files, sklego/model_selection.py, which I wasn't even working on, failed the ruff-format check and was reformatted by ruff. Since I hadn't worked on it, I decided to revert that change (git restore) and attempted to commit only the two files I had been working on. But after pushing, I received the message "Everything up-to-date" and didn't see any changes on my branch. Today, I decided to accept the formatting changes made by ruff, and I was finally able to push. I hope the formatter didn't break anything in model_selection.py. What's the proper way to handle this kind of problem in the future?

-------
```py
import pandas as pd
from sklego.preprocessing.dictmapper import DictMapper

X = pd.DataFrame({
    "city": ["Amsterdam", "Leiden", "Utrecht", "Amsterdam", "Haarlem"],
    "university": ["uva", "lei", "uu", "vu", "none"]
})

mapper = {
    # population
    "Amsterdam": 1_181_817,
    "Leiden": 130_181,
    "Utrecht": 367_984,
    "Haarlem": 165_396,
    # ranking
    "uva": 64,
    "lei": 214,
    "uu": 117,
    "vu": 105
}

dict_mapper = DictMapper(mapper, 0)
_ = dict_mapper.fit(X)

dict_mapper.n_features_in_
# 2
```
"""
X = check_array(
X,
@@ -72,6 +107,47 @@ def transform(self, X):
------
ValueError
If the number of columns from `X` differs from the number of columns when fitting.

Example
-------
```py
import pandas as pd
from sklego.preprocessing.dictmapper import DictMapper

X = pd.DataFrame({
    "city": ["Amsterdam", "Leiden", "Utrecht", "Amsterdam", "Haarlem"],
    "university": ["uva", "lei", "uu", "vu", "none"]
})

mapper = {
    # population
    "Amsterdam": 1_181_817,
    "Leiden": 130_181,
    "Utrecht": 367_984,
    "Haarlem": 165_396,
    # ranking
    "uva": 64,
    "lei": 214,
    "uu": 117,
    "vu": 105
}

dict_mapper = DictMapper(mapper, 0)
_ = dict_mapper.fit(X)

X_trans = dict_mapper.transform(X)
X_trans
# array([[1181817, 64],
# [ 130181, 214],
# [ 367984, 117],
# [1181817, 105],
# [ 165396, 0]])

```
"""
check_is_fitted(self, ["n_features_in_"])
X = check_array(
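The "double dict" raised in the review comes from applying a single dictionary to every column. A minimal pure-Python sketch of that elementwise lookup, next to a per-column variant that captures the idea behind the reviewer's `ColumnTransformer` suggestion, might look like this. `map_with_default` and `map_per_column` are hypothetical helpers for illustration only and are not sklego's implementation.

```python
def map_with_default(rows, mapper, default):
    """Replace every cell with mapper[cell], falling back to `default` for unknown keys."""
    return [[mapper.get(cell, default) for cell in row] for row in rows]


def map_per_column(rows, mappers, default):
    """Same lookup, but with one dictionary per column so keys cannot collide across columns."""
    return [[m.get(cell, default) for m, cell in zip(mappers, row)] for row in rows]


rows = [
    ["Amsterdam", "uva"],
    ["Leiden", "lei"],
    ["Utrecht", "uu"],
    ["Amsterdam", "vu"],
    ["Haarlem", "none"],
]
population = {"Amsterdam": 1_181_817, "Leiden": 130_181, "Utrecht": 367_984, "Haarlem": 165_396}
ranking = {"uva": 64, "lei": 214, "uu": 117, "vu": 105}

# One shared dictionary for all columns, as in the docstring example.
shared = map_with_default(rows, {**population, **ranking}, 0)

# One dictionary per column.
per_column = map_per_column(rows, [population, ranking], 0)

print(shared[-1])            # [165396, 0]
print(shared == per_column)  # True
```

Both variants agree here because no key appears in both dictionaries; with overlapping keys, only the per-column version would keep the two mappings apart.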
2 changes: 1 addition & 1 deletion sklego/preprocessing/outlier_remover.py
@@ -34,7 +34,7 @@ class OutlierRemover(TrainOnlyTransformerMixin, BaseEstimator):

isolation_forest = IsolationForest()
isolation_forest.fit(X)
-detector_preds = isolator_forest.predict(X)
+detector_preds = isolation_forest.predict(X)

outlier_remover = OutlierRemover(isolation_forest, refit=True)
outlier_remover.fit(X)