
Recipe with no preprocessing fails to predict from the test set #280

Closed
barisguven opened this issue Jan 11, 2025 · 5 comments

barisguven commented Jan 11, 2025

The problem

I was trying to compare the performance of two specifications of a linear regression model on the Ames data set: one with no preprocessing steps and one with several preprocessing steps. In both cases the formula was Sale_Price ~ ., and I used the recipe() function for both specifications. However, the fitted workflow built on the recipe without preprocessing failed to compute predictions. The reason seems to be that the fitted model encounters observations in the test set with factor levels that are not present in the training set (adding a step_dummy(all_nominal_predictors()) step removes the problem). Moreover, the problem does not occur when I construct the workflow with a formula instead. My code is below. The last two lines show that the template data that prep() prepares from the recipe without preprocessing is identical to the initial training data, except that the outcome variable, Sale_Price, is placed at the end of the data frame.

One would expect that a workflow with a recipe without preprocessing would lead to the same results as a workflow with a simple formula.

Reproducible example

library(tidymodels)

data(ames)

ames = mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split = initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train = training(ames_split)
ames_test = testing(ames_split)

ames_form = Sale_Price ~ .
ames_rec1 = recipe(Sale_Price ~ ., data = ames_train)

ames_rec2 = 
  ames_rec1 |>
  step_nzv(all_predictors()) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# This workflow cannot predict
wflow1 = workflow() |>
  add_recipe(ames_rec1) |>
  add_model(linear_reg())

lm_fit1 = fit(wflow1, data = ames_train)
predict(lm_fit1, new_data = ames_test)

# This workflow can predict
wflow2 = workflow() |>
  add_recipe(ames_rec2) |>
  add_model(linear_reg())

lm_fit2 = fit(wflow2, data = ames_train)
predict(lm_fit2, new_data = ames_test)

# This workflow can predict too
wflow3 = workflow() |>
  add_formula(Sale_Price ~ .) |>
  add_model(linear_reg())

lm_fit3 = fit(wflow3, data = ames_train)
predict(lm_fit3, new_data = ames_test)


# The prepped recipe's template is just the training data, with the
# outcome moved to the last column
rec1_prep = prep(ames_rec1, training = ames_train)$template
all.equal(rec1_prep, ames_train |> relocate(Sale_Price, .after = Latitude))

@EmilHvitfeldt (Member)

Hello @barisguven 👋

The reason you are seeing a discrepancy between the three approaches is how workflow objects work.

If you use a recipe, workflows assumes that the recipe is handling all of the preprocessing. This is why you see an error in the first workflow: there are new factor levels in the test set that the recipe isn't handling.

In the last workflow, since a formula is used instead of a recipe, the workflow does some things for you. One of those things is to create dummy variables automatically. This handling is why you don't see any errors in that case.
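
If you want the recipe itself to tolerate unseen levels, a minimal sketch (the object names are just illustrative) is to add step_novel() before step_dummy(), so that levels not seen at training time are mapped to a reserved "new" level:

ames_rec_novel = recipe(Sale_Price ~ ., data = ames_train) |>
  step_novel(all_nominal_predictors()) |> # reserve a "new" level for unseen values
  step_dummy(all_nominal_predictors())

wflow_novel = workflow() |>
  add_recipe(ames_rec_novel) |>
  add_model(linear_reg())

lm_fit_novel = fit(wflow_novel, data = ames_train)
predict(lm_fit_novel, new_data = ames_test)

Whether collapsing all unseen levels into a single "new" level is appropriate is a separate modeling decision; the point is that the recipe, not the workflow, has to handle them.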

@barisguven (Author)

barisguven commented Jan 12, 2025

Thanks @EmilHvitfeldt for your response.

In all approaches, stats::lm() is the computational engine. I would expect the workflow with the formula and the workflow with the recipe without preprocessing to pass the same data (design matrix?) to lm(), which would estimate the same coefficients, which would then mean they predict the same values for the same observations in the test set. What am I missing?
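
For what it's worth, a sketch of how one could compare the underlying fits, assuming the objects from the reprex above; extract_fit_engine() pulls the raw lm object out of each fitted workflow:

# Compare the estimated coefficients of the two underlying lm fits
coef(extract_fit_engine(lm_fit1)) |> head()
coef(extract_fit_engine(lm_fit3)) |> head()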

@EmilHvitfeldt EmilHvitfeldt transferred this issue from tidymodels/recipes Jan 13, 2025

EmilHvitfeldt commented Jan 13, 2025

It is worth noting that you would see this same error if you didn't use tidymodels and used lm() directly:

library(tidymodels)

data(ames)

ames = mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split = initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train = training(ames_split)
ames_test = testing(ames_split)

ames_form = Sale_Price ~ .

lm_fit <- lm(ames_form, ames_train)
predict(lm_fit, ames_test)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Roof_Matl has new levels Roll

If you use a formula with a workflow, then some things happen to make the user's life easier.

One of them is that workflows figures out that parsnip has registered predictor_indicators = "traditional" for this model, telling it to generate dummy variables. See Tools to Register Models for more information. This is done to avoid the error we are seeing above with base lm().

If you use a recipe, this action isn't performed, as it is assumed that the recipe will handle it.
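
You can inspect this registration yourself; a quick sketch using parsnip's get_encoding() helper:

library(parsnip)

# One row per engine/mode; the predictor_indicators column shows how
# factor predictors are encoded ("traditional" = make dummy variables)
get_encoding("linear_reg")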

@barisguven (Author)

That explains it. Thank you!

@EmilHvitfeldt (Member)

You are welcome! Anytime.
