Saving lm model via vetiver object takes a lot of space #264

Closed
lschneiderbauer opened this issue Nov 30, 2023 · 3 comments

Comments

@lschneiderbauer

Hi,

Thank you for putting effort into trying to make life easier for ML people. :)
I am just experimenting with the vetiver package to see if we can make use of it, and am stumbling over some issues.

I set up a simple tidymodels workflow, fitted it to some data (~14 million records), created a vetiver object, and tried to persist it with vetiver_pin_write().
The problem I have is that the result takes ~ 1.4 GB on my hard disk.

Is this intentional? In our use case we really only need the (stored) model to make predictions and provide confidence intervals. For that, storing the fitted coefficients and their associated uncertainties should be enough, and I don't see why that should take 1.4 GB of space.

I tried experimenting with the model = FALSE parameter for lm(), but that only reduced the file size by about half. The remaining size seems to come from the fit$qr$qr object inside the fitted model. I can remove that manually, and the file size then drops to an acceptable level, but neither vetiver nor butcher does so automatically.

Do I have to live with the fact that trained models will take up a large amount of space, or are there measures I can take to get the size down to the order of a few KB?

@juliasilge
Member

This is a great question @lschneiderbauer. It's more about butcher than vetiver, so I will plan to move this issue over there. You can see what specifically we remove from an lm() model here, and notice that we don't remove the qr component. The reason is that component is needed for generating prediction intervals, which is something we typically want to retain for models.

I wonder if we should consider two levels of butchering, one that retains the ability to make all kinds of predictions and one that is less conservative and only retains the ability to make a very simple prediction.
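As a rough illustration of the two-level idea, here is a minimal base-R sketch. The helper `trim_lm()` and its level names are hypothetical and are not part of butcher or vetiver. It assumes that `predict()` on new data does not need the stored model frame or fitted values, and that interval predictions still need `residuals` and `qr$qr`.

```r
# Hypothetical helper (not a butcher function): trim an lm fit at one of two levels.
trim_lm <- function(fit, level = c("keep_intervals", "point_only")) {
  level <- match.arg(level)

  # Both levels drop pieces that predicting on new data does not use.
  fit$model <- NULL
  fit$fitted.values <- numeric(0)

  if (level == "point_only") {
    # Less conservative level: also drop what interval predictions rely on.
    fit$residuals <- numeric(0)
    fit$qr$qr <- matrix(0)
  }
  fit
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
new_data <- mtcars[1:3, ]

conservative <- trim_lm(fit, "keep_intervals")
aggressive <- trim_lm(fit, "point_only")

# Intervals from the conservative level, point predictions from the aggressive one.
ci_trimmed <- predict(conservative, new_data, interval = "confidence")
point_trimmed <- predict(aggressive, new_data)

# Compare in-memory sizes of the original and the two trimmed fits.
sizes <- c(
  original = as.numeric(object.size(fit)),
  keep_intervals = as.numeric(object.size(conservative)),
  point_only = as.numeric(object.size(aggressive))
)
print(sizes)
```

The trade-off is the one described above: the smaller object can no longer produce intervals, so the level has to be chosen before the model is stored.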

In the meantime, if I were you, I would probably use the butcher infrastructure to remove the components you want before creating a vetiver model, something like this:

library(butcher)
library(vetiver)

more_cars <- mtcars[rep(1:32, each = 1e4),]
cars_lm <- lm(mpg ~ ., data = more_cars)
weigh(cars_lm)
#> # A tibble: 25 × 2
#>    object         size
#>    <chr>         <dbl>
#>  1 qr.qr         54.0 
#>  2 residuals     28.4 
#>  3 fitted.values 28.4 
#>  4 effects        5.12
#>  5 model.mpg      2.56
#>  6 model.cyl      2.56
#>  7 model.disp     2.56
#>  8 model.hp       2.56
#>  9 model.drat     2.56
#> 10 model.wt       2.56
#> # ℹ 15 more rows

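axe_custom <- function(x) {
    ## you probably don't want residuals either:
    x <- butcher:::exchange(x, "residuals", numeric(0))
    ## qr$qr is the largest component, but prediction intervals need it,
    ## so only drop it if plain predictions are enough:
    x$qr <- butcher:::exchange(x$qr, "qr", matrix(0))
    x
}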

axed_lm <- axe_custom(cars_lm)
weigh(axed_lm)
#> # A tibble: 25 × 2
#>    object         size
#>    <chr>         <dbl>
#>  1 fitted.values 28.4 
#>  2 effects        5.12
#>  3 model.mpg      2.56
#>  4 model.cyl      2.56
#>  5 model.disp     2.56
#>  6 model.hp       2.56
#>  7 model.drat     2.56
#>  8 model.wt       2.56
#>  9 model.qsec     2.56
#> 10 model.vs       2.56
#> # ℹ 15 more rows

v <- vetiver_model(axed_lm, "custom-butchered-lm")
weigh(v)
#> # A tibble: 37 × 2
#>    object            size
#>    <chr>            <dbl>
#>  1 model.effects     5.12
#>  2 model.model.mpg   2.56
#>  3 model.model.cyl   2.56
#>  4 model.model.disp  2.56
#>  5 model.model.hp    2.56
#>  6 model.model.drat  2.56
#>  7 model.model.wt    2.56
#>  8 model.model.qsec  2.56
#>  9 model.model.vs    2.56
#> 10 model.model.am    2.56
#> # ℹ 27 more rows

Created on 2023-11-30 with reprex v2.0.2
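Before storing a model trimmed this way, it may be worth checking what still works. The sketch below uses plain assignment instead of butcher's internal `exchange()` (an assumption that the two are equivalent for this purpose) and compares predictions from the original and trimmed fits. The variable names are illustrative only.

```r
fit <- lm(mpg ~ ., data = mtcars)

# Drop the same two components as axe_custom(), using base R assignment.
axed <- fit
axed$residuals <- numeric(0)
axed$qr$qr <- matrix(0)

new_data <- head(mtcars)

# Point predictions only need the coefficients and the pivot, both still present.
point_ok <- isTRUE(all.equal(predict(axed, new_data), predict(fit, new_data)))

# Interval predictions need the R factor stored in qr$qr, so this call is
# expected to fail; capture the error instead of stopping the script.
interval_result <- tryCatch(
  predict(axed, new_data, interval = "prediction"),
  error = function(e) e
)
interval_fails <- inherits(interval_result, "error")

print(c(point_ok = point_ok, interval_fails = interval_fails))
```

If intervals are required, as in the original question, `qr$qr` and `residuals` have to stay in the stored object.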

@juliasilge
Member

Oops no, I can't transfer an issue from the rstudio org to the tidymodels org. I'll open a new issue over there.

@juliasilge
Member

Please feel free to add any details over at tidymodels/butcher#272 @lschneiderbauer! 🙌
