---
output: github_document
always_allow_html: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
message = FALSE,
warning = FALSE
)
library(tidyverse)
library(tidyfit)
```
# tidyfit <img src="man/figures/logo.png" align="right" alt="" width="120" />
<!-- badges: start -->
<!-- badges: end -->
`tidyfit` is an `R` package that facilitates and automates linear regression and classification modeling in a tidy environment. The package includes several methods, such as Lasso, PLS and ElasticNet regressions. `tidyfit` builds on the `tidymodels` suite, but emphasizes automated modeling with a focus on grouped data, model comparisons, and high-volume analytics with standardized input/output interfaces. The objective is to make model fitting, cross-validation and model output simple and standardized across all methods, with any necessary method-specific transformations handled in the background.
## Installation
You can install the development version of tidyfit from [GitHub](https://github.com/) with:
```{r, eval=F}
# install.packages("devtools")
devtools::install_github("jpfitzinger/tidyfit")
library(tidyfit)
```
## Overview and usage
`tidyfit` includes 3 deceptively simple functions:
- `regress()`
- `classify()`
- `m()`
All three functions return a `tidyfit.models` frame, which is a tibble containing information about fitted regression and classification models. `regress` and `classify` perform regression and classification on tidy data. The functions ingest a tibble, prepare input data for the models by splitting groups, partitioning cross-validation slices, and handling any necessary adjustments and transformations. The data is ultimately passed to the model wrapper `m()`, which fits the models. The basic usage is as follows:
```{r, eval=F}
regress(
.data,
formula = y ~ x1 + x2,
mod1 = m(<args for underlying method>), mod2 = m(), ..., # Pass multiple model wrappers
.cv = "vfold", .cv_args = list(), .weights = "weight_col", # Additional settings
<further arguments>
)
```
The syntax is identical for `classify`.
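For instance, a minimal classification call might look like the following sketch (hedged: the data `df`, the grouping column `grp` and the binary response `y` are hypothetical placeholders):
```{r, eval=F}
# Hypothetical sketch: fit a penalized logistic classifier for each group,
# tuning the penalty via v-fold cross-validation
df %>%
  group_by(grp) %>%
  classify(y ~ ., m("lasso"), .cv = "vfold")
```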
`m` is a powerful wrapper for many different regression and classification techniques that can be used with `regress` and `classify`, or as a stand-alone function:
```{r, eval=F}
m(
<method>, # e.g. "lm" or "lasso"
formula, data, # not passed when used within regress or classify
... # Args passed to underlying method, e.g. to stats::lm for OLS regression
)
```
An important feature of `m()` is that all arguments can be passed as vectors, allowing generalized hyperparameter tuning or scenario analysis for any method:
- Passing a hyperparameter grid: `m("lasso", lambda = seq(0, 1, by = 0.1))`
- Different algorithms for robust regression: `m("robust", method = c("M", "MM"))`
- Forward vs. backward selection: `m("subset", method = c("forward", "backward"))`
- Logit and Probit models: `m("glm", family = list(binomial(link="logit"), binomial(link="probit")))`
Arguments that are meant to be vectors (e.g. weights) are recognized by the function and not interpreted as grids.
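Putting this together, here is a hedged sketch of hyperparameter tuning inside `regress` (the data `df` and the columns `y`, `x1` and `x2` are placeholders):
```{r, eval=F}
# Each lambda value becomes a separate model settings row in the
# resulting tidyfit.models frame, identified by its grid ID
df %>%
  regress(y ~ x1 + x2, m("lasso", lambda = seq(0, 1, by = 0.1)))
```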
## Methods implemented in `tidyfit`
`tidyfit` currently implements the following methods:
```{r, echo = F}
library(kableExtra)
tbl <- data.frame(
Method = c("OLS", "Generalized least squares", "Robust regression (e.g. Huber loss)",
"Quantile regression", "LASSO", "Ridge", "Adaptive LASSO", "ElasticNet",
"Gradient boosting regression", "Principal components regression", "Partial least squares",
"Hierarchical feature regression", "Best subset selection", "Bayesian regression", "Bayesian time-varying parameters regression", "Generalized mixed-effects regression", "Markov-switching regression"),
Name = c("lm", "glm", "robust", "quantile", "lasso", "ridge", "adalasso", "enet", "boost", "pcr", "plsr",
"hfr", "subset", "bayes", "tvp", "glmm", "mslm"),
Package = c("`stats::lm`", "`stats::glm`", "`MASS::rlm`", "`quantreg::rq`", "`glmnet::glmnet`", "`glmnet::glmnet`", "`glmnet::glmnet`", "`glmnet::glmnet`",
"`mboost::glmboost`", "`pls::plsr`", "`pls::pcr`", "`hfr::hfr`", "`bestglm::bestglm`", "`arm::bayesglm`", "`shrinkTVP::shrinkTVP`", "`lme4::glmer`", "`MSwM::msmFit`"),
Regression = c("yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"),
Classification = c("no", "yes", "no", "no", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "yes", "yes", "no", "yes", "no"),
group = c("Linear (generalized) regression or classification", "Linear (generalized) regression or classification", "Linear (generalized) regression or classification", "Linear (generalized) regression or classification", "Regression and classification with L1 and L2 penalties", "Regression and classification with L1 and L2 penalties", "Regression and classification with L1 and L2 penalties", "Regression and classification with L1 and L2 penalties", "Gradient boosting", "Factor regressions", "Factor regressions", "Factor regressions", "Best subset selection", "Bayesian regression", "Bayesian regression", "Mixed-effects modeling", "Specialized time series methods")
)
kbl(tbl[,1:5], align = "lcccc") %>%
pack_rows(index = table(fct_inorder(tbl$group))) %>%
kable_styling("striped")
```
See `?m` for additional information.
Note that the above list is not exhaustive, since some of the methods encompass multiple algorithms. For instance, "subset" can be used to perform forward, backward or exhaustive search selection using `leaps`. Similarly, "lasso" includes certain grouped LASSO implementations that can be fitted with `glmnet`.
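As a hedged illustration, the search strategy can itself be treated as a tuning dimension (assuming the `method` values follow `leaps::regsubsets`):
```{r, eval=F}
# Compare forward, backward and exhaustive subset search in one call
m("subset", method = c("forward", "backward", "exhaustive"))
```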
## A minimal workflow
In this section, a minimal workflow is used to demonstrate how the package works. For more detailed guides on specialized topics, or simply for further reading, follow these links:
- [Accessing fitted models](https://tidyfit.unchartedml.com/articles/Accessing_Fitted_Model_Objects.html)
- [Regularized regression](https://tidyfit.unchartedml.com/articles/Predicting_Boston_House_Prices.html) (Boston house price data)
- [Multinomial classification](https://tidyfit.unchartedml.com/articles/Multinomial_Classification.html) (iris data)
- [Rolling window regression for time series](https://tidyfit.unchartedml.com/articles/Rolling_Window_Time_Series_Regression.html) (factor data)
- [Time-varying parameters](https://tidyfit.unchartedml.com/articles/Time-varying_parameters_vs_rolling_windows.html) (factor data)
- [Bootstrap confidence intervals](https://tidyfit.unchartedml.com/articles/Bootstrapping_Confidence_Intervals.html)
- [coming soon] Fixed and Random effects
- [coming soon] Quantile regression
- [coming soon] Custom models
`tidyfit` includes a data set of financial Fama-French factor returns, freely available [here](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). The data set includes monthly industry returns for 10 industries, as well as monthly factor returns for 5 factors:
```{r example}
data <- tidyfit::Factor_Industry_Returns
data
```
We will only use a subset of the data to keep things simple:
```{r}
df_train <- data %>%
filter(Date > 201800 & Date < 202000)
df_test <- data %>%
filter(Date >= 202000)
```
Before beginning with the estimation, we activate the progress bar visualization. This allows us to gauge estimation progress along the way. `tidyfit` uses the `progressr` package internally to generate a progress bar:
```{r, eval=F}
progressr::handlers(global=TRUE)
```
For the purposes of this demonstration, the objective will be to fit an ElasticNet regression for each industry group and to compare the results to a robust regression. This can be done with `regress` after grouping the data. For grouped data, the functions `regress` and `classify` estimate models for each group independently:
```{r}
model_frame <- df_train %>%
group_by(Industry) %>%
regress(Return ~ .,
m("enet"),
m("robust", method = "MM", psi = MASS::psi.huber),
.cv = "initial_time_split", .mask = "Date")
```
Note that the penalty and mixing parameters are chosen automatically using a time series train/test split (`rsample::initial_time_split`), with a hyperparameter grid set by the package `dials`. See `?regress` for additional CV methods. A custom grid can easily be passed using `lambda = c(...)` and/or `alpha = c(...)` in the `m("enet")` wrapper.
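For example (the grid values below are purely illustrative):
```{r, eval=F}
# A sketch of a custom ElasticNet grid: lambda is the penalty parameter
# and alpha the L1/L2 mixing parameter
m("enet", lambda = c(0.001, 0.01, 0.1), alpha = c(0, 0.5, 1))
```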
The resulting `tidyfit.models` frame consists of 3 components:
1. A group of identifying columns. This includes the `Industry` column, the model name, and grid ID (the ID for the model settings used).
2. A `model_object` column. This column contains the fitted model, wrapped in an `R6` class, along with any additional information needed to perform predictions or access coefficients (e.g. the sample mean and s.d. for rescaling coefficients).
3. A `settings` column, as well as (if applicable) `messages` and `warnings` columns.
```{r}
subset_mod_frame <- model_frame %>%
filter(Industry %in% unique(Industry)[1:2])
subset_mod_frame
```
Let's unnest the settings column:
```{r}
subset_mod_frame %>%
tidyr::unnest(settings, keep_empty = TRUE)
```
The `tidyfit.models` frame can be used to access additional information. Specifically, we can do 3 things:
1. Access the fitted model
2. Predict
3. Access a tibble of estimated parameters
The **fitted tidyFit models** are stored as an `R6` class in the `model_object` column and can be addressed directly with generics such as `coef` or `summary`. The underlying object (e.g. an `lm` class fitted model) is given in `...$object` (see [here](https://tidyfit.unchartedml.com/articles/Accessing_Fitted_Model_Objects.html) for another example):
```{r}
subset_mod_frame %>%
mutate(fitted_model = map(model_object, ~.$object))
```
To **predict**, we need data with the same columns as the input data and simply use the generic `predict` function. Groups are respected, and if the response variable is in the data, it is included as a `truth` column in the resulting object:
```{r}
predict(subset_mod_frame, data)
```
Finally, we can obtain a **tidy frame of the coefficients** using the generic `coef` function:
```{r}
estimates <- coef(subset_mod_frame)
estimates
```
The estimates contain additional method-specific information that is nested in `model_info`. This can include standard errors, t-values and similar information:
```{r}
tidyr::unnest(estimates, model_info)
```
Suppose we would like to evaluate the relative performance of the two methods. This becomes exceedingly simple using the `yardstick` package:
```{r, fig.width=10, fig.height=5, fig.align="center"}
model_frame %>%
# Generate predictions
predict(df_test) %>%
# Calculate RMSE
yardstick::rmse(truth, prediction) %>%
# Plot
ggplot(aes(Industry, .estimate)) +
geom_col(aes(fill = model), position = position_dodge()) +
theme_bw()
```
The ElasticNet performs a little better (unsurprising really, given the small data set).
A **more detailed regression analysis of Boston house price data** using a panel of regularized regression estimators can be found [here](https://tidyfit.unchartedml.com/articles/Predicting_Boston_House_Prices.html).