Currently we provide pre-trained Sklearn models (#52) for LR detection. They are stored as pickle files. This is an issue, as future versions of sklearn might not be able to load the files.
Ideally we should create a script that can retrain these models from scratch. However, the data the models are originally based on is not public.
Identify which models we actually need for the final pipelines -> Only MS_project_all_all
Identify the Data -> MSProject data available (internally) on Escience -> Algorothm Test Study -> data -> USFD
Create dataset class for MSProject dataset
Create CV and retraining script
The original model was trained with the following setup: optimized in a GridSearchCV (n_folds=5, optimized for accuracy); the model was an SVC with a MinMaxScaler (see the sketch below).
There is a clinical info file on EScience which could be the basis for the index.
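To make the retraining reproducible, the CV/retraining script would roughly have to recreate this setup. Below is a minimal sketch using plain sklearn; the parameter grid is only a placeholder, as the grid used for the original model is not documented here.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# MinMaxScaler + SVC, tuned in a 5-fold grid search optimized for accuracy,
# mirroring the description of the original model above.
model = Pipeline([("scaler", MinMaxScaler()), ("clf", SVC())])

# Placeholder grid -- the grid actually used for the original model is not documented here.
param_grid = {"clf__C": [0.1, 1, 10, 100], "clf__kernel": ["linear", "rbf"]}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
# grid_search.fit(features, labels)  # features/labels extracted from the MSProject data
```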
Further information from the original code
if"msproject"indataset["name"]:
hc_list1= ["C"+str(i) foriinrange(1,9)]
hc_list2= ["MS019","MS027","MS036","MS049","MS050","MS051","MS052","MS053","MS056","MS057","MS058","MS059","MS060","MS063","MS064","MS065"]
if"all"indataset["name"]:
patient_list=patient_listelif"hc"indataset["name"]:
# not only folders with C belong to controls (referring to clinical info from e-science)patient_list=np.concatenate([hc_list1, hc_list2])
else:
hc_list=np.concatenate([hc_list1, hc_list2])
patient_list=np.setdiff1d(patient_list,hc_list)
# exclude certain subjects due to data errorsif"MS021"intrain_ids.values:
train_ids=train_ids.loc[train_ids!="MS021"]
if"MS025"intrain_ids.values:
train_ids=train_ids.loc[train_ids!="MS025"]
Further notes:
Even though this problem could be solved completely using sklearn, it might be helpful to use the respective tpcp tooling, so that we can reuse the pipelines we already have and also create the basis for L/R validation of any algorithm (see the sketch below).
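A very rough sketch of how the retraining could be wired into tpcp so that cross-validation reuses the same tooling. All data-access details (`features`, `reference_lr_labels`) are hypothetical attributes the dataset would have to expose; this is not the actual LR detection interface.

```python
from typing import Optional

import numpy as np
from sklearn.pipeline import Pipeline as SkPipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from tpcp import OptimizablePipeline, OptiPara
from tpcp.optimize import Optimize
from tpcp.validate import cross_validate


class LrDetectionPipeline(OptimizablePipeline):
    # The trained sklearn model is the only "optimizable" part of this pipeline.
    model: OptiPara[Optional[SkPipeline]]

    def __init__(self, model: Optional[SkPipeline] = None):
        self.model = model

    def self_optimize(self, dataset, **kwargs):
        # Retrain the sklearn model on all datapoints of the train split.
        # `features` / `reference_lr_labels` are hypothetical attributes of the dataset.
        features = np.vstack([dp.features for dp in dataset])
        labels = np.hstack([dp.reference_lr_labels for dp in dataset])
        self.model = make_pipeline(MinMaxScaler(), SVC()).fit(features, labels)
        return self

    def run(self, datapoint):
        self.lr_labels_ = self.model.predict(datapoint.features)
        return self


def score(pipeline, datapoint):
    # Per-datapoint accuracy against the (hypothetical) reference labels.
    pipeline = pipeline.safe_run(datapoint)
    return float(np.mean(pipeline.lr_labels_ == datapoint.reference_lr_labels))


# 5-fold cross-validation with retraining in every fold.
results = cross_validate(
    Optimize(LrDetectionPipeline()),
    MsProjectDataset(data_folder="..."),  # dataset sketch from above
    scoring=score,
    cv=5,
)
```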