Hi @Rdatatable/project-members
does anybody know the expected time complexity of setDT, as a function of number of rows R and number of columns C?
I was expecting O(1), constant time overall, but now that I look at the documentation, it does not seem to explain the expected time complexity. ?setDT says
‘setDT’ converts lists (both named and unnamed) and data.frames to
data.tables _by reference_
...
The ‘setDT’
function takes care of this issue by allowing to convert ‘lists’ -
both named and unnamed lists and ‘data.frames’ _by reference_
instead. That is, the input object is modified in place, no copy
is being made.
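As a small illustration of the "no copy is being made" wording quoted above, here is a minimal sketch that compares the memory address of a column before and after the conversion. It assumes that `data.table::address()` reports the address of the column vector itself; the object names (`DF`, `addr_before`, `addr_after`) are illustrative only.

```r
library(data.table)

# A plain data.frame with one column (illustrative data).
DF <- data.frame(x = c(1.5, 2.5, 3.5))

# Address of the column vector before the conversion.
addr_before <- address(DF$x)

# Convert by reference; the object bound to DF becomes a data.table.
setDT(DF)

# Address of the same column after the conversion.
addr_after <- address(DF$x)

# If the column data was not copied, the two addresses should match.
same_column <- identical(addr_before, addr_after)
print(is.data.table(DF))
print(same_column)
```

This only checks that the column data is reused; it says nothing about how much per-column bookkeeping setDT does along the way.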
Empirically, our performance test "setDT improved in #5427" says that setDT is actually O(C), linear in the number of columns; see the result figure below (taken from one of the CI zip files). Is that expected? Why? Does setDT have to run some check for each column? If so, can some detail about that be added to the man page please?
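For anyone who wants to reproduce the column-scaling observation without the CI setup, a rough sketch is below. This is not the actual performance test; `time_setDT_cols` and `n_cols_grid` are hypothetical names, and a single `proc.time()` measurement per size is much noisier than the repeated timings atime produces, so treat the numbers as indicative only.

```r
library(data.table)

# Hypothetical helper: time one setDT() call on a list with n_cols columns.
time_setDT_cols <- function(n_cols, n_rows = 10L) {
  # Build a plain named list of n_cols integer vectors, each of length n_rows.
  L <- replicate(n_cols, seq_len(n_rows), simplify = FALSE)
  names(L) <- paste0("V", seq_len(n_cols))
  start <- proc.time()[["elapsed"]]
  setDT(L)  # converts the local list L in place
  elapsed <- proc.time()[["elapsed"]] - start
  # Sanity check that the conversion really happened.
  stopifnot(is.data.table(L), ncol(L) == n_cols)
  elapsed
}

# Number of rows is held fixed; only the number of columns varies.
n_cols_grid <- c(10L, 100L, 1000L, 10000L)
timings <- data.frame(
  n_cols = n_cols_grid,
  seconds = vapply(n_cols_grid, time_setDT_cols, numeric(1))
)
print(timings)
```

If setDT is linear in C, the `seconds` column should grow roughly in proportion to `n_cols` once the sizes are large enough to rise above timer resolution.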
I ran another benchmark varying the number of rows R, with a constant number of columns (1), and I got the result below which indicates setDT time complexity is constant with respect to number of rows R.
Code:
edit.data.table <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
r.res <- atime::atime_versions(
  pkg.path="~/R/data.table",
  pkg.edit.fun=edit.data.table,
  N=10^seq(1, 7, by=0.25),
  setup={
    DT <- data.table(i=1:N)
  },
  expr={
    data.table:::setattr(DT, "class", NULL)
    data.table:::setDT(DT)
  },
  Slow="c4a2085e35689a108d67dacb2f8261e4964d7e12", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/7cc4da4c1c8e568f655ab5167922dcdb75953801) in the PR (https://github.com/Rdatatable/data.table/pull/5427/commits) that fixes the issue
  Fast="af48a805e7a5026a0c2d0a7fd9b587fea5cfa3c4") # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/5427/commits) that fixes the issue
plot(r.res)
It does feel like something of an implementation detail to go into the exact complexity. We should at least mention that columns may be checked for inadmissible entries (#6658 is related).
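To make the per-column checking concrete, here is a hedged sketch of one such check that I believe exists for list input: rejecting POSIXlt elements. This is an assumption about current behavior, not something the man page promises; the exact message, and whether the check fires at all, may differ between data.table versions, so the code captures whatever happens instead of asserting a particular outcome.

```r
library(data.table)

# A list whose second element is POSIXlt (illustrative data).
L <- list(
  id = 1:2,
  when = as.POSIXlt(c("2024-01-01", "2024-01-02"), tz = "UTC")
)

# Capture the error message instead of aborting the script.
# If no error is raised, res is the string "converted".
res <- tryCatch(
  {
    setDT(L)
    "converted"
  },
  error = function(e) conditionMessage(e)
)
print(res)
```

A check like this has to look at every element of the input, which would be consistent with setDT being linear in the number of columns while staying constant in the number of rows.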