Recipe with no preprocessing fails to predict from the test set #280
Comments
Hello @barisguven 👋 The reason you are seeing a discrepancy between the 3 approaches is how the workflow object works. If you use a recipe, workflows assumes that the recipe handles all of the preprocessing. This is why you see an error in the first workflow: there are new levels in the test set that the recipe isn't handling. In the last workflow, since a formula is used instead of a recipe, the workflow does some things for you. One of those things is to create dummy variables automatically, which is why you don't see any errors in that case.
Thanks @EmilHvitfeldt for your response. In all approaches, stats' lm is the computational engine. I would expect that the workflow with a formula and the workflow with a recipe without preprocessing would pass the same data (design matrix?) to lm, which would estimate the same coefficients, which would in turn mean that they predict the same values for the same observations in the test set. What am I missing?
It is worth noting that you would see this same error if you didn't use tidymodels for the model and called lm() directly:

library(tidymodels)
data(ames)
ames = mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(123)
ames_split = initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train = training(ames_split)
ames_test = testing(ames_split)
ames_form = Sale_Price ~ .
lm_fit <- lm(ames_form, ames_train)
predict(lm_fit, ames_test)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Roof_Matl has new levels Roll

If you use a formula with a workflow, then some things happen to make the user's life easier. One of them is that workflows figures out that we have set predictor_indicators = "traditional", which tells it to generate dummy variables. See Tools to Register Models for more information. This is done to avoid the error we are seeing above with base R. If you used a recipe, then this action isn't performed, as it is assumed that the recipe will handle it.
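The base R failure described above can be reproduced without the Ames data. The following is a minimal sketch on a made-up data frame (the names `train`, `test`, `y`, and `g` are hypothetical and not from the original reprex). It shows that `lm()` stores only the factor levels it saw during fitting in `xlevels`, and that `predict()` stops when `newdata` contains a level outside that set. The last part shows one possible manual way to handle the unseen level (turning it into `NA`); it is an illustration only, not a description of what workflows does internally.

```r
# Hypothetical toy data: level "c" occurs only in the test set.
train <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  g = factor(c("a", "b", "a", "b", "a", "b"), levels = c("a", "b", "c"))
)
test <- data.frame(g = factor(c("a", "c"), levels = c("a", "b", "c")))

fit <- lm(y ~ g, data = train)

# lm() drops unused levels, so only the levels seen in training are stored.
fit$xlevels

# predict() checks newdata against fit$xlevels and errors on unseen levels.
msg <- tryCatch(
  {
    predict(fit, newdata = test)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
msg

# One possible manual handling: mark unseen levels as missing before predicting.
test_known <- test
test_known$g[!test_known$g %in% fit$xlevels$g] <- NA
preds <- predict(fit, newdata = test_known)
preds
```

With this handling the row that carried the unseen level gets a missing prediction instead of stopping the whole call.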
That explains it. Thank you!
You are welcome! Anytime.
The problem
I was trying to compare the performance of two different specifications of a linear regression model using the Ames data set. I wanted to compare the specification with no preprocessing steps with the one that involves some preprocessing steps. In both cases, the formula was Sale_Price ~ . and I used the recipe() function for both specifications. However, the fitted workflow with the recipe without preprocessing failed to compute predictions. The reason seems to be that the fitted model comes across observations in the test set with factor levels that are not available in the training set (adding the step_dummy(all_nominal_predictors()) step removes the problem). Moreover, this problem does not occur when I construct the workflow with a formula instead. I have my code below. In the very last part of it, I have two lines of code to show that the template data the prep function prepares with the recipe without preprocessing is identical to the initial training data, except that Sale_Price, the outcome variable, is placed at the end of the data frame.

One would expect that a workflow with a recipe without preprocessing would lead to the same results as a workflow with a simple formula.
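The three setups described above can be sketched roughly as follows. This is not the original reprex (which is not preserved here) but a hedged reconstruction of the comparison: the data split is copied from the code in the comments, while the object names (`lm_spec`, `wf_formula`, `wf_plain`, `wf_dummy`, and so on) are assumptions made for illustration. The prediction from the step-free recipe is wrapped in `tryCatch()` so the script keeps running whether or not that call fails.

```r
library(tidymodels)

data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

lm_spec <- linear_reg() %>% set_engine("lm")

# 1. Formula preprocessor: workflows builds the model frame for you.
wf_formula <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(Sale_Price ~ .)

# 2. Recipe with no steps: factors are passed on as they are.
rec_plain <- recipe(Sale_Price ~ ., data = ames_train)
wf_plain <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_plain)

# 3. Recipe that creates the dummy variables itself.
rec_dummy <- rec_plain %>% step_dummy(all_nominal_predictors())
wf_dummy <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_dummy)

fit_formula <- fit(wf_formula, data = ames_train)
fit_plain <- fit(wf_plain, data = ames_train)
fit_dummy <- fit(wf_dummy, data = ames_train)

pred_formula <- predict(fit_formula, new_data = ames_test)
pred_dummy <- predict(fit_dummy, new_data = ames_test)

# Capture the error message (if any) instead of stopping the script.
pred_plain <- tryCatch(
  predict(fit_plain, new_data = ames_test),
  error = function(e) conditionMessage(e)
)

# The step-free recipe returns the training columns, with the outcome moved.
baked <- bake(prep(rec_plain), new_data = NULL)
setequal(names(baked), names(ames_train))
```

Comparing `pred_plain` with the other two results shows whether the step-free recipe is the only setup that hits the new-level problem on a given split.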
Reproducible example