ver. 2021-12-18
names[suggestion] | objects |
---|---|
ames | data(ames) |
ames_split | initial_split() |
ames_train | training() |
ames_test | testing() |
ames_cv | mc_cv() vfold_cv() bootstraps() validation_split() |
ames_rec | recipes objects finalize_recipe() |
ames_spec | parsnip models finalize_model() |
ames_wflow | workflows objects finalize_workflow() update() |
ames_fit | fit() fit_members() |
ames_cv_res | fit_resample() |
ames_grid | grid_*() crossing() |
ames_tune_res | tune_*() |
ames_last_res | last_fit() |
*_param | select_best() parameters() |
*_perf | yardstick result collect_metrics() rank_results() |
Don't create | control_*() prep() bake() metric_set() |
Please let me know if you have any suggestions for improvement.
Sometimes debate about object naming conventions.
https://en.wikipedia.org/wiki/Naming_convention_(programming)
You're more comfortable with free writing, right?
To be honest, I am always tired of following the rules.
But the naming convention still exists.
The reasons for this are to reduce the stress of writing code and to make it easier to understand when re-reading it later.
If the name of the object does not correspond to its content, you will have to re-read all the code to understand the relationship between the object and its content.
If you are not the only one who might read the code, it is confusing to give it any name you like.
The object name and content must be linked, like your email address and yourself.
Some naming conventions are given in the style guides published by google and tidyverse.
https://google.github.io/styleguide/Rguide.html
Some of the rules are
- Variable names should be concise
- Function names are verbs
- Variable names are nouns
- Variable names should not contain a dot symbol
and so on.
Also, different style guides may have different recommended rules.
- Use BigCamelCase for variable names [google].
- Use lowercase letters and underscores(_) in variable names [tidyverse].
These are just rules, there is no right answer.
Although not listed here, some old R users may have followed the rule of using the dot symbol(.) in variable names.
What is the meaning of the "." (dot) in R?
And it is mentioned in the style guide that "it is very difficult to follow the rule".
Generally, variable names should be nouns and function names should be verbs. Strive for names that are concise and meaningful (this is not easy!).
-- The tidyverse style guide [2.1 Object names]
tidymodels and tidyverse, is a package led by Rstudio.
The members working on it are all familiar with the tidyverse and R rules.
The following Tidy Modeling with R: a.k.a. TMwR also contains a bit about the rules and philosophy.
https://www.tmwr.org/tidyverse.html#principles
However, when you read the code in TMwR, you may find that the objects are named differently even though their contents are the same.
Therefore, I would like to organize the relationship between names and contents in TMwR and suggest naming conventions to help you when you feel like writing tidy code.
I mainly checked this by looking at TMWR, but I also referred to blogs of people who wrote tidymodels code.
The names were given in the following relationships.
names | objects | names | objects | |
---|---|---|---|---|
*_split *_spl |
initial_split() | *_train | training() | |
spl | smooth.spline() | *_test | testing() | |
resample | mc_cv() | *_mod *_model *_spec |
parsnip models | |
*_folds | vfold_cv() | *_rec *_recipe |
recipes objects | |
rs | bootstraps() | *_rec_trained | prep() | |
*_val_set | validation_split() | *_processed | bake() | |
*_metrics | metric_set() | *_wfl *_wflow |
workflows objects | |
ctrl keep_pred ctrl_AAA |
control_AAA() | *_param | parameters() | |
*_grid | grid_AAA | *_tune | tune_grid() | |
grid_AAA | tune_race_AAA | *_fit | fit() | |
*_res | collect_metrics() collect_predictions() fit_resample() last_fit() and mode... |
I've been reading some tidymodels code that some people have written, and I've been thinking
- Cross-validation has variations in the names.
- The role of res is unclear.
- The use of variables for detailed settings.
I think these are the reasons why named objects are so complicated.
So I'll try to simplify them by dividing them into certain groups.
In order to make the following table easier to understand, I will assume that you are using data ames.
ver. 2021-12-18
names[suggestion] | objects |
---|---|
ames | data(ames) |
ames_split | initial_split() |
ames_train | training() |
ames_test | testing() |
ames_cv | mc_cv() vfold_cv() bootstraps() validation_split() |
ames_rec | recipes objects finalize_recipe() |
ames_spec | parsnip models finalize_model() |
ames_wflow | workflows objects finalize_workflow() update() |
ames_fit | fit() fit_members() |
ames_cv_res | fit_resample() |
ames_grid | grid_*() crossing() |
ames_tune_res | tune_*() |
ames_last_res | last_fit() |
*_param | select_best() parameters() |
*_perf | yardstick result collect_metrics() rank_results() |
Don't create | control_*() prep() bake() metric_set() |
Here is an explanation of why the table above is the way it is.
I used the term "spec" instead of "model".
while modeling with tidymodels, the data created by prep and bake was used for checking the data rather than creating models,
so I did not set a naming convention.
If you are going to use them for model input, it is recommended to use _train or _test.
you should type control_() directly in the "control argument".
The sentence of code will be longer, but since there are multiple types of control_(), there is no need to create complicated variables.
Selecting more metrics will make your code longer, but I don't think it's a big problem, so please don't create variables and write them into the "argument".
Although bootstraps can be used in a variety of ways, for the purpose of machine learning, let's use it as cross-validation data.
Although neither bootstraps nor validation_split has cv in the name of the function, they are both "rset class" and have in common that they can be entered into fit_resample, etc. As a side note, initial_split is an "rsplit class".
I defined how to use res. tune and fit_resample returns are the result of fit() with cross-validation data.
The results of the last_fit() has "last_fit class" and can be used with collect_metrics().
"resample_results class" and "tune_results class" can use collect_metrics() too.
In addition, it can be used as input for stacks::add_candidates() and tidyposterior::perf_mod().
I have differentiated the results measured using metrics that have been considered _res from "results" by using the name "performance".
Please let me know if there are any omissions.
Since tidymodels may change the function, this is not a decision but needs to be modified.
2021-12-19: First version submitted.