---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# butcher <a href='https://butcher.tidymodels.org/'><img src='man/figures/logo.png' align="right" height="139" /></a>
<!-- badges: start -->
[![Codecov test coverage](https://codecov.io/gh/tidymodels/butcher/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/butcher?branch=main)
[![R-CMD-check](https://github.com/tidymodels/butcher/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/butcher/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
## Overview
Modeling pipelines in `R` occasionally result in fitted model objects that take up too much memory. There are two main culprits:
1. Heavy dependencies on formulas and closures that capture the enclosing environment in the modeling process; and
2. Lack of selectivity in the construction of the model object itself.
As a result, fitted model objects carry over components that are often redundant and not required for post-fit estimation activities. `butcher` makes it easy to axe parts of the fitted output that are no longer needed, without sacrificing much functionality from the original model object.
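To make the first culprit concrete, here is a minimal base R sketch of how a formula carries its enclosing environment into the fitted object. The names `fit_in_function`, `junk`, and `captured_env` are hypothetical and chosen only for illustration:

```r
# A model fitted inside a function: the formula is created in the
# function's environment, so that environment travels with the fit.
fit_in_function <- function() {
  junk <- runif(1e6)  # unrelated to the model, one million doubles
  lm(mpg ~ wt, data = mtcars)
}

fit <- fit_in_function()

# The terms object stores the environment in which the formula was created.
captured_env <- attr(terms(fit), ".Environment")

# The unrelated vector is still reachable through the fitted model.
ls(captured_env)
length(get("junk", envir = captured_env))
```

As long as `fit` is alive, `junk` cannot be garbage collected, and it is written out alongside the model when the object is saved.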
## Installation
Install the released version from CRAN:
```{r, eval = FALSE}
install.packages("butcher")
```
Or install the development version from [GitHub](https://github.com/):
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("tidymodels/butcher")
```
## Butchering
To make the most of your available memory, this package provides five S3 generics that remove parts of a model object:
- `axe_call()`: To remove the call object.
- `axe_ctrl()`: To remove controls associated with training.
- `axe_data()`: To remove the original training data.
- `axe_env()`: To remove environments.
- `axe_fitted()`: To remove fitted values.
As an example, we wrap a `lm` model:
```{r example}
library(butcher)
our_model <- function() {
some_junk_in_the_environment <- runif(1e6) # we didn't know about
lm(mpg ~ ., data = mtcars)
}
```
The `lm` object returned by our modeling pipeline has the following size:
```{r, warning = FALSE, message = FALSE}
library(lobstr)
obj_size(our_model())
```
When, in fact, it should only require:
```{r, warning = FALSE, message = FALSE}
small_lm <- lm(mpg ~ ., data = mtcars)
obj_size(small_lm)
```
To understand which part of our original model object is taking up the most memory, we leverage the `weigh()` function:
```{r, warning = FALSE, message = FALSE}
big_lm <- our_model()
butcher::weigh(big_lm)
```
The problem here is in the `terms` component of our `big_lm`. Because of how `lm` is implemented in the `stats` package, the environment (in which our model was made) was also carried along in the fitted output. To remove this (mostly) extraneous component, we can use `axe_env()`:
```{r, warning = FALSE, message = FALSE}
cleaned_lm <- butcher::axe_env(big_lm, verbose = TRUE)
```
Comparing it against our `small_lm`, we'll find:
```{r, warning = FALSE, message = FALSE}
butcher::weigh(cleaned_lm)
```
...it now takes up the same amount of memory as `small_lm`:
```{r, warning = FALSE, message = FALSE}
butcher::weigh(small_lm)
```
Axing the environment is not the only functionality of `butcher`. We can also remove the call, the training controls, the training data, and the fitted values with the other `axe_*()` functions, or simply run `butcher()` to execute all of these axing functions at once. Any kind of axing adds a butchered class to the model object's existing class(es), as well as a new attribute named `butcher_disabled` that lists any post-fit estimation functions that are disabled as a result.
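As a rough sketch of what this looks like in practice, the snippet below butchers an `lm` fit in one call and then inspects the result. The helper `fit_with_junk()` and the object names are hypothetical, and we assume the `lobstr` package is installed. No expected output is shown, since the exact class name and the list of disabled functions depend on the model type and the butcher version:

```r
library(butcher)

# Fit inside a function so the model drags along an unneeded object.
fit_with_junk <- function() {
  junk <- runif(1e6)
  lm(mpg ~ ., data = mtcars)
}

big_fit <- fit_with_junk()

# Run all available axe methods at once.
butchered_fit <- butcher(big_fit)

# Compare sizes before and after butchering.
size_before <- lobstr::obj_size(big_fit)
size_after <- lobstr::obj_size(butchered_fit)
size_before
size_after

# Inspect the bookkeeping that butcher adds to the object.
class(butchered_fit)
attr(butchered_fit, "butcher_disabled")

# Predictions on new data should be unaffected for this model.
new_rows <- mtcars[1:3, ]
all.equal(predict(big_fit, new_rows), predict(butchered_fit, new_rows))
```

Checking `attr(butchered_fit, "butcher_disabled")` before relying on a butchered object downstream is a cheap way to confirm that the functions you still need were not affected.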
## Model Object Coverage
Check out `vignette("available-axe-methods")` to see butcher's current coverage. If you are working with a new model object that could benefit from any kind of axing, we would love for you to make a pull request! See `vignette("adding-models-to-butcher")` for more guidelines, but in short, to contribute a set of axe methods:
1) Run `new_model_butcher(model_class = "your_object", package_name = "your_package")`
2) Use butcher helper functions `butcher::weigh()` and `butcher::locate()` to decide what to axe
3) Finalize edits to `R/your_object.R` and `tests/testthat/test-your_object.R`
4) Make a pull request!
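Step 2 might look like the sketch below, shown here on an `lm` fit as a stand-in for your own model object. We assume that `locate()` accepts a `name` argument such as `"env"` or `"call"`; check `?locate` and `?weigh` for the arguments supported by your installed version:

```r
library(butcher)

# A stand-in for the new model object you want to support.
fit <- lm(mpg ~ ., data = mtcars)

# List the components of the fitted object together with their sizes.
sizes <- weigh(fit)
head(sizes)

# Find where environments and the call are stored inside the object.
locate(fit, name = "env")
locate(fit, name = "call")
```

The largest entries reported by `weigh()` are usually the first candidates for an axe method, and `locate()` helps identify which element of the object to replace.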
## Contributing
This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on RStudio Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/butcher/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).