speed improvement for `step_lencode_glm()` #232

EmilHvitfeldt · 2024-10-30T18:47:09Z

I think I found some evidence that we can significantly improve the speed of `step_lencode_glm()`.

The following shows a rough benchmark. A few things to note:

The two methods produce the same result, up to differences of about 10^-15.
The ordering of the values is not the same, but that doesn't matter since we left_join them on.
This only works for numeric outcomes, but it would be easy enough to extend to the other supported modes.
The old method scales linearly in time with the number of levels of x, while the new method runs at the same speed regardless of the number of levels.

library(embed)
n_obs <- 500000

data <- tibble(
  outcome = rnorm(n_obs),
  x = factor(sample(seq_len(100), n_obs, TRUE))
)

tictoc::tic("old")
res <- recipe(outcome ~ x, data = data) |>
  step_lencode_glm(x, outcome = vars(outcome)) |>
  prep()
tictoc::toc()
#> old: 8.327 sec elapsed


tictoc::tic("new")
tmp <- data |>
  summarise(value = mean(outcome), .by = x)
tictoc::toc()
#> new: 0.007 sec elapsed
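
The equivalence claim above can be checked on a small example. This is only a sketch, and it assumes the GLM in question is a no-intercept Gaussian model with the factor as its only predictor (the variable names `glm_est` and `mean_est` are made up for illustration). Under that assumption each coefficient is the mean of the outcome within one level, so it should match the per-level means up to floating point error.

```r
set.seed(1)
n_obs <- 1000
x <- factor(sample(seq_len(5), n_obs, TRUE))
outcome <- rnorm(n_obs)

# Assumed model: no-intercept Gaussian GLM, one coefficient per level of x
fit <- glm(outcome ~ 0 + x, family = gaussian)
glm_est <- unname(coef(fit))

# Per-level means, returned in the same order as levels(x)
mean_est <- unname(vapply(split(outcome, x), mean, numeric(1)))

# Compare the two sets of estimates and report the largest difference
all.equal(glm_est, mean_est)
max(abs(glm_est - mean_est))
```

The largest absolute difference should be tiny, in line with the differences of about 10^-15 reported above, though the exact value depends on the data.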


EmilHvitfeldt · 2025-01-29T20:49:22Z

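fast_lencode_glm <- function(x, y, wts = NULL) {
  # Collect the levels, the outcome, and the optional case weights
  data <- tibble::new_tibble(
    list(..level = x, values = y, wts = wts)
  )
  
  # One row per level: the (weighted) mean of the outcome within that level
  if (is.null(wts)) {
    res <- dplyr::summarise(data, ..value = mean(values), .by = ..level)
  } else {
    res <- dplyr::summarise(data, ..value = weighted.mean(values, wts), .by = ..level)
  }

  # Value for levels not seen in training: the mean of the per-level
  # estimates after trimming 10% of them from each end
  unseen <- tibble::new_tibble(
    list(
      ..level = "..new",
      ..value = mean(res$..value, trim = 0.1)
    )
  )

  dplyr::bind_rows(res, unseen)
}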
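
For context, here is a rough sketch of how a table shaped like the one returned above could be joined onto new data, with unseen levels falling back to the `"..new"` row. The `mapping` values and `new_data` are invented for illustration, and the recoding step is an assumption about how the fallback might be applied, not a description of what embed does internally.

```r
library(dplyr)

# Hypothetical encoding table in the shape returned by fast_lencode_glm()
mapping <- tibble(
  ..level = c("a", "b", "..new"),
  ..value = c(1.5, -0.5, 0.4)
)

# New data containing a level ("c") that is not in the table
new_data <- tibble(x = c("b", "a", "c"))

encoded <- new_data |>
  # Assumption: levels missing from the table are recoded to "..new"
  mutate(..level = if_else(x %in% mapping$..level, x, "..new")) |>
  # The row order of the table does not matter for the join
  left_join(mapping, by = "..level")

encoded
```

Because the values are matched by level in the join, the ordering of the rows in the encoding table has no effect on the result.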

They should be based on the number of rows, to make sure they always go over the estimate.

My proposal: calculate the probability as (2*n - 1) / (2*n) instead of 1.

n <- 100
p <- (n-1) / n
log(p / (1 - p))
#> [1] 4.59512

n <- 1000
p <- (n-1) / n
log(p / (1 - p))
#> [1] 6.906755

n <- 1000
p <- (n-1) / n
p1 <- (2*n-1) / (2*n)

log(p / (1 - p))
#> [1] 6.906755
log(p1 / (1 - p1))
#> [1] 7.600402
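
The proposal could be wrapped up along these lines. This is a sketch only: `capped_logit()` is a hypothetical helper, not a function in embed, and treating a probability of 0 symmetrically (as 1 / (2*n)) is an assumption that the proposal above does not state.

```r
# Hypothetical helper: log-odds for a level with `successes` events in `n` rows
capped_logit <- function(successes, n) {
  p <- successes / n
  upper <- (2 * n - 1) / (2 * n)  # proposed replacement for a probability of 1
  lower <- 1 / (2 * n)            # assumption: mirror image for a probability of 0
  # For whole-number counts only p = 0 and p = 1 fall outside [lower, upper]
  p <- pmin(pmax(p, lower), upper)
  log(p / (1 - p))
}

# All successes, one failure, and an even split, each out of 1000 rows
capped_logit(c(1000, 999, 500), 1000)
```

With n = 1000, the first value corresponds to the 7.600402 shown above and the second to 6.906755, so a level with only successes always gets a larger value than a level with a single failure.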

EmilHvitfeldt added the `feature` label (a feature request or enhancement) on Oct 30, 2024
