
Recipe with no preprocessing fails to predict from the test set #280

Closed
barisguven opened this issue Jan 11, 2025 · 5 comments

barisguven commented Jan 11, 2025

The problem

I was trying to compare the performance of two specifications of a linear regression model on the Ames data set: one with no preprocessing steps and one with several preprocessing steps. In both cases the formula was Sale_Price ~ ., and I used the recipe() function for both specifications. However, the fitted workflow built on the recipe without preprocessing failed to compute predictions. The reason seems to be that the fitted model encounters observations in the test set with factor levels that are not present in the training set (adding a step_dummy(all_nominal_predictors()) step removes the problem). Moreover, the problem does not occur when I construct the workflow with a formula instead. My code is below. The last two lines show that the template data that prep() prepares from the recipe without preprocessing is identical to the initial training data, except that the outcome variable, Sale_Price, is placed at the end of the data frame.

One would expect that a workflow with a recipe without preprocessing would lead to the same results as a workflow with a simple formula.

Reproducible example

library(tidymodels)

data(ames)

ames = mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split = initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train = training(ames_split)
ames_test = testing(ames_split)

ames_form = Sale_Price ~ .
ames_rec1 = recipe(Sale_Price ~ ., data = ames_train)

ames_rec2 = 
  ames_rec1 |>
  step_nzv(all_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# This workflow cannot predict
wflow1 = workflow() |>
  add_recipe(ames_rec1) |>
  add_model(linear_reg())

lm_fit1 = fit(wflow1, data = ames_train)
predict(lm_fit1, new_data = ames_test)

# This workflow can predict
wflow2 = workflow() |>
  add_recipe(ames_rec2) |>
  add_model(linear_reg())

lm_fit2 = fit(wflow2, data = ames_train)
predict(lm_fit2, new_data = ames_test)

# This workflow can predict too
wflow3 = workflow() |>
  add_formula(Sale_Price ~ .) |>
  add_model(linear_reg())

lm_fit3 = fit(wflow3, data = ames_train)
predict(lm_fit3, new_data = ames_test)


# The prepped recipe's template is just the training data, with the
# outcome moved to the last column
rec1_prep = prep(ames_rec1, training = ames_train)$template
all.equal(rec1_prep, ames_train |> relocate(Sale_Price, .after = Latitude))

@EmilHvitfeldt (Member)

Hello @barisguven 👋

The reason you are seeing a discrepancy between the three approaches is how workflow objects work.

If you use a recipe, workflows assumes that the recipe is handling all of the preprocessing. This is why you see an error in the first workflow: there are new factor levels in the test set that the recipe isn't handling.

In the last workflow, since a formula is used instead of a recipe, the workflow does some things for you. One of those things is to create dummy variables automatically. This handling is why you don't see any errors in that case.
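
If you want the recipe itself to tolerate unseen levels, a minimal sketch (the object names are just illustrative) is to add step_novel() before step_dummy(), so that levels not seen at training time are mapped to a reserved "new" level:

ames_rec_novel = recipe(Sale_Price ~ ., data = ames_train) |>
  step_novel(all_nominal_predictors()) |> # reserve a "new" level for unseen values
  step_dummy(all_nominal_predictors())

wflow_novel = workflow() |>
  add_recipe(ames_rec_novel) |>
  add_model(linear_reg())

lm_fit_novel = fit(wflow_novel, data = ames_train)
predict(lm_fit_novel, new_data = ames_test)

Whether collapsing all unseen levels into a single "new" level is appropriate is a separate modeling decision; the point is that the recipe, not the workflow, has to handle them.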

@barisguven (Author)

barisguven commented Jan 12, 2025

Thanks @EmilHvitfeldt for your response.

In all approaches, stats::lm() is the computational engine. I would expect the workflow with the formula and the workflow with the recipe without preprocessing to pass the same data (design matrix?) to lm(), which would estimate the same coefficients, which would then mean they predict the same values for the same observations in the test set. What am I missing?
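
For what it's worth, a sketch of how one could compare the underlying fits, assuming the objects from the reprex above; extract_fit_engine() pulls the raw lm object out of each fitted workflow:

# Compare the estimated coefficients of the two underlying lm fits
coef(extract_fit_engine(lm_fit1)) |> head()
coef(extract_fit_engine(lm_fit3)) |> head()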

@EmilHvitfeldt EmilHvitfeldt transferred this issue from tidymodels/recipes Jan 13, 2025

EmilHvitfeldt commented Jan 13, 2025

It is worth noting that you would see this same error if you didn't use tidymodels and used lm() directly:

library(tidymodels)

data(ames)

ames = mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split = initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train = training(ames_split)
ames_test = testing(ames_split)

ames_form = Sale_Price ~ .

lm_fit <- lm(ames_form, ames_train)
predict(lm_fit, ames_test)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Roof_Matl has new levels Roll

If you use a formula with a workflow, then some things happen to make the user's life easier.

One of them is that workflows figures out that parsnip has registered predictor_indicators = "traditional" for this model, telling it to generate dummy variables. See Tools to Register Models for more information. This is done to avoid the error we are seeing above with base lm().

If you use a recipe, this action isn't performed, as it is assumed that the recipe will handle it.
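
You can inspect this registration yourself; a quick sketch using parsnip's get_encoding() helper:

library(parsnip)

# One row per engine/mode; the predictor_indicators column shows how
# factor predictors are encoded ("traditional" = make dummy variables)
get_encoding("linear_reg")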

@barisguven (Author)

That explains it. Thank you!

@EmilHvitfeldt (Member)

You are welcome! Anytime.
