Feature Request: kfold_cv() Wrapper Function for Terminology Consistency and Usability #554

Open
msberends opened this issue Nov 5, 2024 · 0 comments

Dear Tidymodels Development Team,

First, thank you for the excellent work on the rsample package and the entire tidymodels ecosystem. Your contributions have significantly improved accessibility and usability for data science and machine learning in R, and your consistent, high-quality work is truly appreciated.

Feature Request: kfold_cv() Wrapper Function

I would like to suggest adding a kfold_cv() function as a wrapper for vfold_cv(). This request is aimed at enhancing both terminology consistency and user accessibility, as "k-fold cross-validation" is overwhelmingly the more common term in the literature and among practitioners.

It could simply be implemented as:

#' @rdname vfold_cv
#' @param v,k The number of partitions of the data set
#' @export
kfold_cv <- function(data, k = 10, repeats = 1, strata = NULL, breaks = 4, pool = 0.1, ...) {
  # Thin wrapper: forward all arguments to vfold_cv(), mapping k to v
  vfold_cv(data = data, v = k, repeats = repeats, strata = strata, breaks = breaks, pool = pool, ...)
}
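
For illustration, a call to the wrapper would be interchangeable with the existing function; only the argument name differs. (kfold_cv() is of course hypothetical here and refers to the sketch above.)

library(rsample)

# Hypothetical usage of the proposed wrapper (not part of rsample today;
# it refers to the sketch above):
set.seed(123)
folds_k <- kfold_cv(mtcars, k = 5, repeats = 2)

# The equivalent call with the existing function:
set.seed(123)
folds_v <- vfold_cv(mtcars, v = 5, repeats = 2)

# With the same seed, both calls produce the same resampling splits;
# only the argument name (k vs. v) differs.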

Rationale

  1. Standard Terminology Usage: The term "k-fold cross-validation" is widely recognized as the standard across publications and textbooks on machine learning and statistics. Here are some authoritative sources that use the term consistently:

    • Hastie, Tibshirani, and Friedman (2009) in The Elements of Statistical Learning specifically refer to "k-fold cross-validation" (p. 222, Springer) as a foundational resampling method.
    • James, Witten, Hastie, and Tibshirani (2013) in An Introduction to Statistical Learning also use "k-fold cross-validation" (p. 176, Springer), reflecting the term’s adoption in foundational texts.
    • Goodfellow, Bengio, and Courville (2016) in Deep Learning further emphasize "k-fold cross-validation" as a core method in machine learning (MIT Press, p. 120).

    The popularity of "k-fold" over "v-fold" is also evident in applied contexts, including online courses, tutorials, and the documentation of other machine learning libraries such as scikit-learn in Python, whose cross-validation splitter class is explicitly named KFold.

  2. User Familiarity and Accessibility: Most users, especially those new to tidymodels, might be more familiar with the term "k-fold" and may not immediately recognize "v-fold" as equivalent. This can lead to confusion, especially for those coming from other languages or tools where "k-fold" is standard. A kfold_cv() function could help bridge this gap, enhancing the user experience without altering any functionality.

  3. Maintaining Code Consistency and Readability: Many users adapt code from textbooks or other languages, where k is typically the identifier for the number of folds. Allowing both vfold_cv() and kfold_cv() would support more readable code for these users and may help streamline transitions for those migrating workflows into R and tidymodels (a short sketch follows this list).
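
As a brief sketch of point 3, consider a script ported from a textbook or another language where the number of folds is already named k. With the proposed (hypothetical) wrapper the ported line keeps its vocabulary, whereas today it requires a mental translation from k to v:

library(rsample)

# k is typically the existing identifier in code ported from textbooks
# or from Python:
k <- 10

# With the proposed wrapper, the ported line reads naturally:
folds <- kfold_cv(mtcars, k = k)

# Today, the same line must be translated to the v-based argument:
folds <- vfold_cv(mtcars, v = k)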

Thank you for considering this suggestion, and for your commitment to improving the tools available to the R and data science communities. Your work has set a high standard, and small enhancements like this can make an even greater impact on usability.
