Improve performance of transpose_list() #396
base: master
Conversation
This improves the runtime significantly for loading data with many columns. The order of loop nesting, as well as a much more efficient binary search, does the trick.

In a real-world example, fetching ~300k rows with ~50 columns from MongoDB, this brings the query + load time from 70 seconds to ~40.

Microbenchmark with synthetic data on an AMD 5950X, 128 GB RAM, Fedora Linux 36, R 4.1.3, jsonlite 1.8.0.9000 commit 8085435:

```
> set.seed(1)
> rows <- 10000
> columns <- 100
> p_missing <- 0.2
>
> recordlist <- lapply(1:rows, function(rownum) {
+   row <- as.list(1:columns)
+   names(row) <- paste0("col_", row)
+   row[runif(columns) > p_missing]
+ })
> columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE,
+   use.names = FALSE))
```

Before this change:

```
> microbenchmark::microbenchmark(
+   jsonlite:::transpose_list(recordlist, columns),
+   times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 577.8338 589.4064 593.0518 591.6895 599.4221 607.3057    10
```

With this change:

```
> microbenchmark::microbenchmark(
+   jsonlite:::transpose_list(recordlist, columns),
+   times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 41.37537 43.22655 43.88987 43.76705 45.43552 46.81052    10
```
```
@@ -1,4 +1,12 @@
#' @useDynLib jsonlite C_transpose_list
transpose_list <- function(x, names) {
  .Call(C_transpose_list, x, names)
  # Sort names before entering C, allowing for a binary search
```
Sorting names lets us use a binary search in C. Sorting is a lot easier to do in R. If you are willing to add `{stringi}` or `{withr}` as dependencies, we could sort C-style without this ugly `Sys.setlocale()` dance. There may well be other, cleaner ways to do this.
```
for(size_t k = 0; k < Rf_length(listnames); k++){
  if(!strcmp(CHAR(STRING_ELT(listnames, k)), targetname)){
    SET_VECTOR_ELT(out, i, col);
    UNPROTECT(1);
```
I'm very inexperienced at integrating R and C. Please take extra care when reviewing my PROTECT() calls.
If the data contains a name that sorts below the smallest name being requested, the previous code would end up in an infinite loop.
Force-pushed from e43a8f6 to 89a3e99.
This improves the runtime significantly for loading data with many
columns. The order of loop nesting as well as a much more efficient
binary search does the trick.

In a real-world example, fetching ~300k rows with ~50 columns from
MongoDB, this brings the query + load time from 70 seconds to ~40.
It used to be ~10 seconds for the query, ~30 seconds in transpose_list,
and ~30 seconds simplifying columns; transpose_list now takes <2 seconds.

Microbenchmark with synthetic data on an AMD 5950X, 128 GB RAM, Fedora
Linux 36, R 4.1.3, jsonlite 1.8.0.9000 commit 8085435.