
Lr ml #106

Merged · 31 commits merged into main from lr_ml · Apr 17, 2024

Conversation

AKuederle
Contributor

No description provided.

from gaitlink.lrd import LrdUllrich


# TODO: this might be removed in the future, but I think it makes life easier, as people might not be familiar with tpcp.

I still don't like having that as a separate thing. I think it adds very little value and just removes flexibility.

But, I fully agree people need some guidance on how to do the optimization.

Can we do an example instead that shows, step by step, how to retrain or hyperparameter-optimize the model?

That would also be great to have, in case we need to retrain any of the models in the future.
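A minimal sketch of what such a retraining example might cover, using plain sklearn. The feature extraction and the gaitlink/tpcp wrapping are omitted; the pipeline stages and the random stand-in data are illustrative, not the actual LrdUllrich internals.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 6))     # stand-in for per-IC feature rows
y = rng.integers(0, 2, size=60)  # stand-in for left/right labels

# Retrain the scaler + classifier jointly and grid-search one hyperparameter.
pipe = Pipeline([("scaler", MinMaxScaler()), ("clf", SVC())])
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

A real example would replace the random arrays with features extracted from the reference data and wrap the search in the tpcp optimization interface.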

algo = LrdUllrich(**test_params)
data = pd.DataFrame([], columns=["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"])
ic_list = pd.DataFrame({"ic": []})
with pytest.raises(ValueError):

Empty input should not raise, but produce empty output.
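A sketch of the suggested behaviour (the function name and result schema are assumed for illustration, not the actual gaitlink API): on empty input, return an empty result with the expected columns instead of raising.

```python
import pandas as pd

def predict_lr(data: pd.DataFrame, ic_list: pd.DataFrame) -> pd.DataFrame:
    if data.empty or ic_list.empty:
        # Empty in, empty out - same schema as the normal result.
        return pd.DataFrame({"ic": pd.Series(dtype=int),
                             "lr_label": pd.Series(dtype=object)})
    raise NotImplementedError("real prediction omitted in this sketch")

empty = predict_lr(pd.DataFrame(), pd.DataFrame({"ic": []}))
print(empty.empty, list(empty.columns))
```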


Can we limit the number of models to the one that we actually need for now? A big problem will be reproducibility and usability of the models with future versions of sklearn. We need to somehow still ensure that the models can be retrained in the future. Hence, best to limit what data is required for that. See #89 for details. I think for now we only need the ms_project_all and ms_project_ms model.

algo = LrdUllrich(**params)
data = pd.DataFrame(np.zeros((100, 6)), columns=["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"])
ic_list = pd.DataFrame({"ic": []})
with pytest.raises(ValueError):

Same as above

"model": model, "scaler": scaler}


def __init__(self,

Use proper defaults instead of doing the None check later.

self.scaler = MinMaxScaler()

# Fit the scaler if it hasn't been fitted yet, otherwise just transform the data
if not hasattr(self.scaler, 'scale_'):

Might be more explicit to use sklearn's check_is_fitted helper.
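What that could look like: `check_is_fitted` raises `NotFittedError` on an unfitted estimator, so the fit-or-transform decision becomes explicit instead of probing for the `scale_` attribute by hand. The wrapper function here is illustrative.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils.validation import check_is_fitted

def scale(features: np.ndarray, scaler: MinMaxScaler) -> np.ndarray:
    try:
        check_is_fitted(scaler)
    except NotFittedError:
        return scaler.fit_transform(features)  # first call: fit, then transform
    return scaler.transform(features)          # already fitted: transform only

scaler = MinMaxScaler()
out1 = scale(np.array([[0.0], [2.0]]), scaler)  # fits the scaler
out2 = scale(np.array([[1.0]]), scaler)         # reuses the existing fit
print(out1.ravel(), out2.ravel())
```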

ics = ics.copy()

# Apply Butterworth filtering and extract the first and second derivatives.
gyr = data[["gyr_x", "gyr_z"]].rename(columns={"gyr_x": "v", "gyr_z": "ap"})

Do we need this renaming? The explicit names are not really used anywhere, right?

# Squash the multi index
signal_paras.columns = ["_".join(c) for c in signal_paras.columns]

# shift the last IC by 3 samples to make the second derivative work

Can we make that conditional? I.e. only shift if the last IC is really too close to the end?
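A stdlib sketch of that conditional version. The margin of 3 samples is taken from the "shift by 3" in the diff; the function name and list-based IC representation are illustrative.

```python
def clamp_last_ic(ics: list[int], n_samples: int, margin: int = 3) -> list[int]:
    """Shift the last IC only when it sits too close to the signal end."""
    if ics and ics[-1] > n_samples - 1 - margin:
        ics = ics[:-1] + [n_samples - 1 - margin]
    return ics

print(clamp_last_ic([120, 998], 1000))  # last IC too close -> [120, 996]
print(clamp_last_ic([120, 500], 1000))  # far from the end -> unchanged
```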


And we should document this as an edge case in the docstring of the algorithm.

from gaitlink.lrd import LrdUllrich


class TestMetaLrdUllrich(TestAlgorithmMixin):

Tests look great.

I would suggest we add a regression test for the two models that we will keep (see comment above), so that in case we retrain in the future, we see if the results change.

As an example for the regression test see here:

https://github.com/mobilise-d/gaitlink/blob/1a750978690a991ac40c367d17f624b5c63f9071/tests/test_icd/test_icd_ionescu.py#L54

I would suggest using the reference ICs as input to the LR algo for the test.


@classmethod
def _load_from_file(cls, model_name):
base_dir = Path(__file__).parent / os.getenv('MODEL_DIR', 'pretrained_models')

If you want to load files that are part of a package, this should be done using from importlib.resources import files. When a package is installed, it can be quite difficult to figure out what the correct file path is otherwise. We use that approach in data_transform._filter.py. To make that work with joblib, open the resource as a binary file object and pass that to joblib.load, instead of passing a filesystem path.
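An illustrative, self-contained version of that pattern. A throwaway package is built on the fly so the snippet runs anywhere; in gaitlink the package and model file names would differ, and the opened resource would be fed to joblib.load rather than pickle.

```python
import pathlib
import pickle
import sys
import tempfile
from importlib.resources import files

# Build a throwaway package with a bundled binary blob standing in for a
# pretrained model file (names here are illustrative).
tmp = tempfile.mkdtemp()
pkg_dir = pathlib.Path(tmp) / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "model.bin").write_bytes(pickle.dumps({"coef": [1.0, 2.0]}))
sys.path.insert(0, tmp)

# Resolve the resource relative to the package, not the filesystem, so it
# keeps working after the package is installed (e.g. from a wheel).
with files("demo_pkg").joinpath("model.bin").open("rb") as f:
    model = pickle.load(f)
print(model["coef"])
```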

smoothing_filter: BaseFilter = cf(ButterworthFilter(order=4, cutoff_freq_hz=(0.5, 2), filter_type="bandpass"))
) -> None:
super().__init__()
if model is None:

In the same way you set the ButterworthFilter as default you can also just set the other paras.

But in this case, as you want to have the defaults from the predefined parameters, you can use the set_defaults utility. Have a look here to see how to use it: https://github.com/mobilise-d/gaitlink/blob/89b32f4330e68df81f5db81737bfecd11bfc66db/gaitlink/wba/_wb_assembly.py#L127

"""
if isinstance(model, ClassifierMixin):
return model
raise TypeError(f"Unknown model type {type(model)}. The model must be of type {ClassifierMixin}")

The curly braces around ClassifierMixin might do strange things. You want the name of the class there, right?
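Concretely: interpolating a class object into an f-string prints its repr (`<class '...'>`), while `__name__` gives just the class name. A local stand-in class is used here instead of importing sklearn's `ClassifierMixin`.

```python
class ClassifierMixin:  # stand-in for sklearn.base.ClassifierMixin
    pass

msg_repr = f"The model must be of type {ClassifierMixin}"
msg_name = f"The model must be of type {ClassifierMixin.__name__}"
print(msg_repr)
print(msg_name)
```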

self.ic_list = ic_list
self.sampling_rate_hz = sampling_rate_hz

if data.empty:

Can be simplified to:

if data.empty or ic_list.empty:

mapping = {0: "left", 1: "right"}
prediction_per_gs = prediction_per_gs.replace(mapping)

self.ic_lr_list_ = pd.DataFrame({"ic": self.ic_list.values.flatten(), "lr_label": prediction_per_gs["lr_label"]})

.values is discouraged by the pandas docs. Use .to_numpy() instead.
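The suggested replacement, with column names mirroring the diff above:

```python
import pandas as pd

ic_list = pd.DataFrame({"ic": [10, 40, 75]})
# .to_numpy() is the documented way to get the underlying array.
flat = ic_list.to_numpy().flatten()
print(flat.tolist())  # [10, 40, 75]
```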

all_features = pd.concat(features, axis=0, ignore_index=True) if len(features) > 1 else features[0]

try:
check_is_fitted(self.scaler)

check_is_fitted is actually called internally by transform, so the error would happen naturally.

However, thinking about this: what is the scenario where I have an unfitted scaler but a fitted model? Isn't an unfitted scaler a sign that something went wrong?

import pandas as pd


def extract_ref_data(datapoint):

I think this can be simplified using the iter_gs utility and the reference_parameters_relative_to_wb_ property of the dataset.

    imu_data = datapoint.data["LowerBack"]
    ref = datapoint.reference_parameters_relative_to_wb_

    return zip(
        *[
            (data, ref.ic_list.loc[wb.id]["ic"], ref.ic_list.loc[wb.id]["lr_label"])
            for wb, data in iter_gs(imu_data, ref.wb_list)
        ]
    )

(have not tested, just off the top of my head)


Can we actually delete all the files we don't need?


Can you run the tests locally? A bunch of them are not passing :)

assert (output["lr_label"] == ["right", "left", "right"]).all()

# TODO: this needs checking
class testRegression:

A couple of notes:

  • This will only be recognized as a valid test if the name is uppercase -> TestRegression
  • If you want to, you can also compress this into one test by also parametrizing the params. Applying @pytest.mark.parametrize("config_name, config", [("msp_all", LrdUllrich.PredefinedParameters.msproject_all), ("msp_ms", LrdUllrich.PredefinedParameters.msproject_ms)]) in addition to the existing parametrize will run the test for all combinations of datasets and configs. I would suggest including the config name in the name of the snapshot file (second argument to snapshot.assert_match), so that the file can be easily identified later.
  • To keep things consistent, use the GsIterator as done in the other examples

@AKuederle

TODO: Squash-Merge!

@AKuederle AKuederle merged commit 6eb8a8d into main Apr 17, 2024
9 checks passed
@AKuederle AKuederle deleted the lr_ml branch April 17, 2024 13:17