Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expected/documented time complexity of setDT? #6741

Closed
tdhock opened this issue Jan 20, 2025 · 1 comment · Fixed by #6756
Closed

expected/documented time complexity of setDT? #6741

tdhock opened this issue Jan 20, 2025 · 1 comment · Fixed by #6756

Comments

@tdhock
Copy link
Member

tdhock commented Jan 20, 2025

Hi @Rdatatable/project-members
does anybody know the expected time complexity of setDT, as a function of number of rows R and number of columns C?
I was expecting O(1), constant time overall, but now that I look at the documentation, it does not seem to explain the expected time complexity. ?setDT says

     ‘setDT’ converts lists (both named and unnamed) and data.frames to
     data.tables _by reference_
...
        The ‘setDT’
     function takes care of this issue by allowing to convert ‘lists’ -
     both named and unnamed lists and ‘data.frames’ _by reference_
     instead. That is, the input object is modified in place, no copy
     is being made.

Empirically, our perfomance test "setDT improved in #5427" says that setDT is actually O(C), linear in the number of columns, see result figure below. (taken from one of the CI zip files) Is that expected? Why? Does setDT have to run some check for each column? If so, can some detail about that be added to the man page please?
Image

I ran another benchmark varying the number of rows R, with a constant number of columns (1), and I got the result below which indicates setDT time complexity is constant with respect to number of rows R.
Image
Code:

edit.data.table <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
r.res <- atime::atime_versions(  
  pkg.path = "~/R/data.table",
  pkg.edit.fun = edit.data.table,
  N = 10^seq(1, 7, by = 0.25),
  setup = {
    DT <- data.table(i=1:N)
  },
  expr = {
    data.table:::setattr(DT, "class", NULL)
    data.table:::setDT(DT)
  },
  Slow = "c4a2085e35689a108d67dacb2f8261e4964d7e12", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/7cc4da4c1c8e568f655ab5167922dcdb75953801) in the PR (https://github.com/Rdatatable/data.table/pull/5427/commits) that fixes the issue
  Fast = "af48a805e7a5026a0c2d0a7fd9b587fea5cfa3c4") # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/5427/commits) that fixes the issue
plot(r.res)
@MichaelChirico
Copy link
Member

Yes, we iterate on columns to check for invalid inputs:

data.table/R/data.table.R

Lines 2922 to 2927 in 3a7ec2d

for (jj in seq_along(x)) {
if (length(dim(x[[jj]])) > 1L) {
.Call(Cwarn_matrix_column_r, jj)
break
}
}

That's for data.frames, done from R, here's for list input (as in your example):

for (R_len_t i = 0; i < LENGTH(x); ++i) {

It does feel like something of an implementation detail to go into the exact complexity. We should at least mention that columns may be checked for inadmissible entries (#6658 is related).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants