
Properly handle pre-trained ML models for LR #89

Open · 2 of 4 tasks
AKuederle opened this issue Feb 9, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments


AKuederle commented Feb 9, 2024

Currently, we provide pre-trained sklearn models (#52) for LR detection. They are stored as pickle files. This is an issue, as future versions of sklearn might not be able to load these files.
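For illustration, this is roughly what loading such a model looks like (the file name below is a placeholder, not the actual path shipped in the package); it makes the implicit sklearn-version dependency visible:

import pickle

import sklearn

# Hypothetical file name; the real models are shipped inside the package.
with open("ms_project_all_all_model.pkl", "rb") as f:
    model = pickle.load(f)  # may fail (or silently change behavior) under a newer sklearn

print("Loaded under sklearn", sklearn.__version__)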

Ideally we should create a script that can retrain these models from scratch. However, the data the models are originally based on is not public.

  • Identify which models we actually need for the final pipelines -> Only MS_project_all_all
  • Identify the data -> MSProject data available (internally) on Escience -> Algorithm Test Study -> data -> USFD
  • Create dataset class for MSProject dataset
  • Create CV and retraining script

The original model was trained with the following parameters, optimized via a GridSearchCV (n_folds=5, scored for accuracy). The model was an SVC with a MinMaxScaler:

param_grid_svm_lin = [
    {'C': [0.01, 0.1, 1, 10, 100, 1000, 10000], 'kernel': ['linear']}]
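Based on these notes, a retraining sketch could look roughly as follows (X and y stand in for the non-public MSProject features and L/R labels; the svc__ prefixes are only needed because the grid is applied to a Pipeline rather than to the bare SVC):

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])
param_grid = [
    {"svc__C": [0.01, 0.1, 1, 10, 100, 1000, 10000], "svc__kernel": ["linear"]}
]

# 5-fold CV, optimized for accuracy, as in the original training.
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
# grid.fit(X, y)  # X, y: features and labels derived from the MSProject data
# best_model = grid.best_estimator_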

For the dataset:

  • There is a clinical info file on EScience which could be the basis for the index.
  • Further information from the original code (excerpt below):
if "msproject" in dataset["name"]:
        hc_list1 = ["C"+str(i) for i in range(1,9)] 
        hc_list2 = ["MS019","MS027","MS036","MS049","MS050","MS051","MS052","MS053","MS056","MS057","MS058","MS059","MS060","MS063","MS064","MS065"]
        if "all" in dataset["name"]:
            patient_list = patient_list
        elif "hc" in dataset["name"]:
            # not only folders with C belong to controls (referring to clinical info from e-science)
            patient_list = np.concatenate([hc_list1, hc_list2])
        else:
            hc_list = np.concatenate([hc_list1, hc_list2])
            patient_list = np.setdiff1d(patient_list,hc_list)
# exclude certain subjects due to data errors

if "MS021" in train_ids.values:
    train_ids = train_ids.loc[train_ids != "MS021"]
if "MS025" in train_ids.values:
    train_ids = train_ids.loc[train_ids != "MS025"]
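To tie this to the "Create dataset class for MSProject dataset" task, a rough tpcp Dataset sketch could look like the following. The clinical-info file name and its columns are assumptions; only the create_index mechanism is tpcp's actual API:

from typing import List, Optional

import pandas as pd
from tpcp import Dataset

# Subjects excluded due to data errors (see the excerpt above).
EXCLUDED_SUBJECTS = ["MS021", "MS025"]


class MSProjectDataset(Dataset):
    def __init__(
        self,
        data_path: str,
        *,
        groupby_cols: Optional[List[str]] = None,
        subset_index: Optional[pd.DataFrame] = None,
    ):
        self.data_path = data_path
        super().__init__(groupby_cols=groupby_cols, subset_index=subset_index)

    def create_index(self) -> pd.DataFrame:
        # The clinical info file on EScience would be the basis for the index;
        # the file name and column names here are placeholders.
        info = pd.read_csv(f"{self.data_path}/clinical_info.csv")
        info = info[~info["subject_id"].isin(EXCLUDED_SUBJECTS)]
        return info[["subject_id", "group"]].reset_index(drop=True)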

Further notes:

Even though this problem could be solved entirely with sklearn, it might be helpful to use the respective tpcp tooling, so we can reuse the pipelines we already have and also create the basis for L/R validation of any algorithm.
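As a sketch of what that could look like, the sklearn model could be wrapped in a tpcp OptimizablePipeline. The class name and the two feature-extraction helpers below are hypothetical placeholders; only the self_optimize/run contract is tpcp's actual API:

from typing import Optional

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from tpcp import OptimizablePipeline


def extract_features_and_labels(dataset):
    # Hypothetical helper: build the feature matrix and L/R labels from the dataset.
    raise NotImplementedError


def extract_features(datapoint):
    # Hypothetical helper: features for a single data point.
    raise NotImplementedError


class LRDetectionPipeline(OptimizablePipeline):
    model: Optional[SkPipeline]

    def __init__(self, model: Optional[SkPipeline] = None):
        self.model = model

    def self_optimize(self, dataset, **_):
        X, y = extract_features_and_labels(dataset)
        grid = GridSearchCV(
            SkPipeline([("scaler", MinMaxScaler()), ("svc", SVC())]),
            [{"svc__C": [0.01, 0.1, 1, 10, 100, 1000, 10000], "svc__kernel": ["linear"]}],
            cv=5,
            scoring="accuracy",
        )
        self.model = grid.fit(X, y).best_estimator_
        return self

    def run(self, datapoint):
        # Predict L/R labels for a single data point and store them as a result attribute.
        self.lr_labels_ = self.model.predict(extract_features(datapoint))
        return self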

AKuederle mentioned this issue Mar 13, 2024
AKuederle self-assigned this May 6, 2024
AKuederle (Contributor, Author) commented:

Notes:

  • We should keep the old models around (unless the newly trained ones turn out to be truly identical -> unlikely).
  • The full retraining code should be kept around as a no_exc example -> good documentation.

AKuederle removed their assignment May 6, 2024
AKuederle added the enhancement (New feature or request) label May 31, 2024