-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
260 lines (191 loc) · 14.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
---
output: github_document
always_allow_html: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
message = FALSE,
warning = FALSE
)
library(tidyverse)
library(tidyfit)
```
# tidyfit <img src="man/figures/logo.png" align="right" alt="" width="120" />
<!-- badges: start -->
![CRAN](https://img.shields.io/cran/v/tidyfit?label=CRAN)
<!-- badges: end -->
`tidyfit` is an `R`-package that facilitates and automates linear and nonlinear regression and classification modeling in a tidy environment. The package includes methods such as the Lasso, PLS, time-varying parameter or Bayesian model averaging regressions, and many more. The aim is threefold:
1. Offer a **standardized input-output interface** for a broad set of "good" modeling packages. Unlike projects such as `caret`, the aim is not to cover a comprehensive set of available packages, but rather a curated list of the most useful ones.
2. Efficient **modeling with `tidy` data**, including verbs for regression and classification, automatic consideration of grouped data, and `tidy` output formats throughout.
3. **Augment and automate** methods to ensure statistical comparability (e.g. across coefficients of different linear estimators), sensible default settings, necessary transformations (e.g. standardizing features when required), automatic hyperparameter optimization, etc.
`tidyfit` builds on the `tidymodels` suite, but emphasizes automated modeling with a focus on grouped data, model comparisons, and high-volume analytics. The objective is to make model fitting, cross validation and model output very simple and standardized across all methods, with many necessary method-specific transformations handled in the background.
## Installation
`tidyfit` can be installed from CRAN or the development version from [GitHub](https://github.com/jpfitzinger/tidyfit) with:
```{r, eval=F}
# CRAN
install.packages("tidyfit")
# Dev version
# install.packages("devtools")
devtools::install_github("jpfitzinger/tidyfit")
library(tidyfit)
```
## Overview and usage
`tidyfit` includes 3 deceptively simple functions:
- `regress()`
- `classify()`
- `m()`
All 3 of these functions return a `tidyfit.models` frame, which is a data frame containing information about fitted regression and classification models. `regress` and `classify` perform regression and classification on tidy data. The functions ingest a `tibble`, prepare input data for the models by splitting groups, partitioning cross validation slices and handling any necessary adjustments and transformations. The data is ultimately passed to the model wrapper `m()` which fits the models.
[See the flowchart](https://tidyfit.residualmetrics.com/articles/Flowchart.html)
To illustrate basic usage, suppose we would like to fit a financial factor regression for 10 industries with exponential weighting, comparing a WLS and a weighted LASSO regression:
```{r, eval=F}
progressr::handlers(global=TRUE)
tidyfit::Factor_Industry_Returns %>%
group_by(Industry) %>% # Ensures that a model is fitted for each industry
mutate(Weight = 0.96^seq(n(), 1)) %>% # Exponential weights
# 'regress' allows flexible standardized regression analysis in a single line of code
regress(
Return ~ ., # Uses normal formula syntax
m("lasso"), # LASSO regression wrapper, 'lambda' grid set to default
m("lm"), # OLS wrapper (can add as many wrappers as necessary here)
.cv = "initial_time_split", # Cross-validation method for optimal 'lambda' in LASSO
.mask = "Date", # 'Date' columns should be excluded
.weights = "Weight" # Specifies the weights column
) -> models_df
# Get coefficients frame
coef(models_df)
```
The syntax is identical for `classify`.
`m` is a powerful wrapper for many different regression and classification techniques that can be used with `regress` and `classify`, or as a stand-alone function:
```{r, eval=F}
m(
<method>, # e.g. "lm" or "lasso"
formula, data, # not passed when used within regress or classify
... # Args passed to underlying method, e.g. to stats::lm for OLS regression
)
```
An important feature of `m()` is that all arguments can be passed as vectors, allowing generalized hyperparameter tuning or scenario analysis for any method:
- Passing a hyperparameter grid: `m("lasso", lambda = seq(0, 1, by = 0.1))`
- Different algorithms for robust regression: `m("robust", method = c("M", "MM"))`
- Forward vs. backward selection: `m("subset", method = c("forward", "backward"))`
- Logit vs. Probit models: `m("glm", family = list(binomial(link="logit"), binomial(link="probit")))`
Arguments that are meant to be vectors (e.g. weights) are recognized by the function and not interpreted as grids.
## Methods implemented in `tidyfit`
`tidyfit` currently implements the following methods:
```{r, echo = F}
library(kableExtra)
tbl <- data.frame(
Method = c("OLS", "GLS", "Robust regression (e.g. Huber loss)",
"Quantile regression", "ANOVA", "LASSO", "Ridge", "Group LASSO", "Adaptive LASSO", "ElasticNet",
"Gradient boosting regression", "Support vector machine", "Random forest", "Quantile random forest",
"Neural Network", "Principal components regression", "Partial least squares",
"Hierarchical feature regression", "Best subset selection", "Genetic Algorithm", "General-to-specific", "Bayesian regression", "Bayesian Ridge", "Bayesian Lasso", "Bayesian model averaging", "Bayesian Spike and Slab", "Bayesian time-varying parameters regression", "Generalized mixed-effects regression", "Markov-switching regression", "Pearson correlation", "Chi-squared test", "Minimum redundancy, maximum relevance", "ReliefF"),
Name = c("lm", "glm", "robust", "quantile", "anova", "lasso", "ridge", "group_lasso", "adalasso", "enet", "boost", "svm",
"rf", "quantile_rf",
"nnet", "pcr", "plsr",
"hfr", "subset", "genetic", "gets", "bayes", "bridge", "blasso", "bma", "spikeslab", "tvp", "glmm",
"mslm", "cor", "chisq", "mrmr", "relief"),
Package = c("`stats`", "`stats`", "`MASS`", "`quantreg`", "`stats`", "`glmnet`", "`glmnet`", "`gglasso`", "`glmnet`", "`glmnet`",
"`mboost`", "`e1071`", "`randomForest`", "`quantregForest`", "`nnet`", "`pls`", "`pls`", "`hfr`", "`bestglm`", "`gaselect`", "`gets`", "`arm`", "`monomvn`", "`monomvn`", "`BMS`", "`BoomSpikeSlab`", "`shrinkTVP`", "`lme4`", "`MSwM`", "`stats`", "`stats`", "`mRMRe`", "`CORElearn`"),
Regression = c("yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes"),
Classification = c("no", "yes", "no", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes"),
group = c("Linear (generalized) regression or classification", "Linear (generalized) regression or classification", "Linear (generalized) regression or classification", "Linear (generalized) regression or classification", "Linear (generalized) regression or classification", "Regression or classification with L1 and L2 penalties", "Regression or classification with L1 and L2 penalties", "Regression or classification with L1 and L2 penalties", "Regression or classification with L1 and L2 penalties", "Regression or classification with L1 and L2 penalties", "Machine Learning", "Machine Learning", "Machine Learning", "Machine Learning", "Machine Learning", "Factor regressions", "Factor regressions", "Factor regressions", "Subset selection", "Subset selection", "Subset selection", "Bayesian regression", "Bayesian regression", "Bayesian regression", "Bayesian regression", "Bayesian regression", "Bayesian regression", "Mixed-effects modeling", "Specialized time series methods", "Feature selection", "Feature selection", "Feature selection", "Feature selection")
)
kbl(tbl[,1:5], align = "lcccc") %>%
pack_rows(index = table(fct_inorder(tbl$group))) %>%
kable_styling("striped")
```
See `?m` for additional information.
It is important to note that the above list is not complete, since some of the methods encompass multiple algorithms. For instance, "subset" can be used to perform forward, backward or exhaustive search selection using `leaps`. Similarly, "lasso" includes certain grouped LASSO implementations that can be fitted with `glmnet`.
## A minimal workflow
In this section, a minimal workflow is used to demonstrate how the package works. For more detailed guides of specialized topics, or simply for further reading, follow these links:
- [The flowchart](https://tidyfit.residualmetrics.com/articles/Flowchart.html)
- [Regularized regression](https://tidyfit.residualmetrics.com/articles/Predicting_Boston_House_Prices.html) (Boston house price data)
- [Multinomial classification](https://tidyfit.residualmetrics.com/articles/Multinomial_Classification.html) (iris data)
- [Feature Selection](https://tidyfit.residualmetrics.com/articles/Feature_Selection.html) (macroeconomic data)
- [Accessing fitted models](https://tidyfit.residualmetrics.com/articles/Accessing_Fitted_Model_Objects.html)
- [Rolling window regression for time series](https://tidyfit.residualmetrics.com/articles/Rolling_Window_Time_Series_Regression.html) (factor data)
- [Time-varying parameters](https://tidyfit.residualmetrics.com/articles/Time-varying_parameters_vs_rolling_windows.html) (factor data)
- [Bootstrap confidence intervals](https://tidyfit.residualmetrics.com/articles/Bootstrapping_Confidence_Intervals.html)
`tidyfit` includes a data set of financial Fama-French factor returns freely available [here](https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). The data set includes monthly industry returns for 10 industries, as well as monthly factor returns for 5 factors:
```{r example}
data <- tidyfit::Factor_Industry_Returns
# Calculate excess return
data <- data %>%
mutate(Return = Return - RF) %>%
select(-RF)
data
```
We will only use a subset of the data to keep things simple:
```{r}
df_train <- data %>%
filter(Date > 201800 & Date < 202000)
df_test <- data %>%
filter(Date >= 202000)
```
Before beginning with the estimation, we can activate the progress bar visualization. This allows us to gauge estimation progress along the way. `tidyfit` uses the `progressr`-package internally to generate a progress bar --- run `progressr::handlers(global=TRUE)` to activate progress bars in your environment.
For purposes of this demonstration, the objective will be to fit an ElasticNet regression for each industry group, and compare results to a robust least squares regression. This can be done with `regress` after grouping the data. For grouped data, the functions `regress` and `classify` estimate models for each group independently:
```{r}
model_frame <- df_train %>%
group_by(Industry) %>%
regress(Return ~ .,
m("enet"),
m("robust", method = "MM", psi = MASS::psi.huber),
.cv = "initial_time_split", .mask = "Date")
```
Note that the penalty and mixing parameters are chosen automatically using a time series train/test split (`rsample::initial_time_split`), with a hyperparameter grid set by the package `dials`. See `?regress` for additional CV methods. A custom grid can easily be passed using `lambda = c(...)` and/or `alpha = c(...)` in the `m("enet")` wrapper.
The resulting `tidyfit.models` frame consists of 3 components:
1. A group of identifying columns. This includes the `Industry` column, the model name, and grid ID (the ID for the model settings used).
2. `estimator_fct`, `size (MB)` and `model_object` columns. These columns contain information on the model itself. The `model_object` is the fitted `tidyFit` model (an `R6` class that contains the model object as well as any additional information needed to perform predictions or access coefficients)
3. Nested `settings`, as well as (if applicable) `messages` and `warnings`.
```{r}
subset_mod_frame <- model_frame %>%
filter(Industry %in% unique(Industry)[1:2])
subset_mod_frame
```
Let's unnest the settings columns:
```{r}
subset_mod_frame %>%
tidyr::unnest(settings, keep_empty = TRUE)
```
The `tidyfit.models` frame can be used to access additional information. Specifically, we can do 4 things:
1. Access the fitted model
2. Predict
3. Access a data frame of estimated parameters
4. Use additional generics
The **fitted tidyFit models** are stored as an `R6` class in the `model_object` column and can be addressed directly with generics such as `coef` or `summary`. The underlying object (e.g. an `lm` class fitted model) is given in `...$object` (see [here](https://tidyfit.residualmetrics.com/articles/Accessing_Fitted_Model_Objects.html) for another example):
```{r}
subset_mod_frame %>%
mutate(fitted_model = map(model_object, ~.$object))
```
To **predict**, we need data with the same columns as the input data and simply use the generic `predict` function. Groups are respected and if the response variable is in the data, it is included as a `truth` column in the resulting object:
```{r}
predict(subset_mod_frame, data)
```
Finally, we can obtain a **tidy frame of the coefficients** using the generic `coef` function:
```{r}
estimates <- coef(subset_mod_frame)
estimates
```
The estimates contain additional method-specific information that is nested in `model_info`. This can include standard errors, t-values and similar information:
```{r}
tidyr::unnest(estimates, model_info)
```
Additional generics such as `fitted` or `resid` can be used to obtain more information on the models.
Suppose we would like to evaluate the relative performance of the two methods. This becomes exceedingly simple using the `yardstick` package:
```{r, fig.width=10, fig.height=5, fig.align="center"}
model_frame %>%
# Generate predictions
predict(df_test) %>%
# Calculate RMSE
yardstick::rmse(truth, prediction) %>%
# Plot
ggplot(aes(Industry, .estimate)) +
geom_col(aes(fill = model), position = position_dodge())
```
The ElasticNet performs a little better (unsurprising really, given the small data set).
A **more detailed regression analysis of Boston house price data** using a panel of regularized regression estimators can be found [here](https://tidyfit.residualmetrics.com/articles/Predicting_Boston_House_Prices.html).