expected/documented time complexity of setDT? #6741

tdhock · 2025-01-20T17:10:03Z

Hi @Rdatatable/project-members
does anybody know the expected time complexity of setDT, as a function of number of rows R and number of columns C?
I was expecting O(1), constant time overall, but now that I look at the documentation, it does not seem to explain the expected time complexity. ?setDT says

     ‘setDT’ converts lists (both named and unnamed) and data.frames to
     data.tables _by reference_
...
        The ‘setDT’
     function takes care of this issue by allowing to convert ‘lists’ -
     both named and unnamed lists and ‘data.frames’ _by reference_
     instead. That is, the input object is modified in place, no copy
     is being made.

Empirically, our perfomance test "setDT improved in #5427" says that setDT is actually O(C), linear in the number of columns, see result figure below. (taken from one of the CI zip files) Is that expected? Why? Does setDT have to run some check for each column? If so, can some detail about that be added to the man page please?

I ran another benchmark varying the number of rows R, with a constant number of columns (1), and I got the result below which indicates setDT time complexity is constant with respect to number of rows R.

Code:

edit.data.table <- function(old.Package, new.Package, sha, new.pkg.path) {
  pkg_find_replace <- function(glob, FIND, REPLACE) {
    atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
  }
  Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
  Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
  new.Package_ <- paste0(Package_, "_", sha)
  pkg_find_replace(
    "DESCRIPTION",
    paste0("Package:\\s+", old.Package),
    paste("Package:", new.Package))
  pkg_find_replace(
    file.path("src", "Makevars.*in"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    Package_regex,
    new.Package_)
  pkg_find_replace(
    file.path("R", "onLoad.R"),
    sprintf('packageVersion\\("%s"\\)', old.Package),
    sprintf('packageVersion\\("%s"\\)', new.Package))
  pkg_find_replace(
    file.path("src", "init.c"),
    paste0("R_init_", Package_regex),
    paste0("R_init_", gsub("[.]", "_", new.Package_)))
  pkg_find_replace(
    "NAMESPACE",
    sprintf('useDynLib\\("?%s"?', Package_regex),
    paste0('useDynLib(', new.Package_))
}
r.res <- atime::atime_versions(  
  pkg.path = "~/R/data.table",
  pkg.edit.fun = edit.data.table,
  N = 10^seq(1, 7, by = 0.25),
  setup = {
    DT <- data.table(i=1:N)
  },
  expr = {
    data.table:::setattr(DT, "class", NULL)
    data.table:::setDT(DT)
  },
  Slow = "c4a2085e35689a108d67dacb2f8261e4964d7e12", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/7cc4da4c1c8e568f655ab5167922dcdb75953801) in the PR (https://github.com/Rdatatable/data.table/pull/5427/commits) that fixes the issue
  Fast = "af48a805e7a5026a0c2d0a7fd9b587fea5cfa3c4") # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/5427/commits) that fixes the issue
plot(r.res)

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2025-01-20T23:30:57Z

Yes, we iterate on columns to check for invalid inputs:

data.table/R/data.table.R

Lines 2922 to 2927 in 3a7ec2d

    
           for (jj in seq_along(x)) { 
        
             if (length(dim(x[[jj]])) > 1L) { 
        
               .Call(Cwarn_matrix_column_r, jj) 
        
               break 
        
             } 
        
           }

That's for data.frames, done from R, here's for list input (as in your example):

data.table/src/assign.c

Line 212 in 3a7ec2d

for (R_len_t i = 0; i < LENGTH(x); ++i) {

It does feel like something of an implementation detail to go into the exact complexity. We should at least mention that columns may be checked for inadmissible entries (#6658 is related).

MichaelChirico mentioned this issue Jan 23, 2025

Document essential complexity of setDT #6756

Merged

MichaelChirico added the documentation label Jan 23, 2025

MichaelChirico closed this as completed in #6756 Jan 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

expected/documented time complexity of setDT? #6741

expected/documented time complexity of setDT? #6741

tdhock commented Jan 20, 2025 •

edited

Loading

MichaelChirico commented Jan 20, 2025

expected/documented time complexity of setDT? #6741

expected/documented time complexity of setDT? #6741

Comments

tdhock commented Jan 20, 2025 • edited Loading

MichaelChirico commented Jan 20, 2025

tdhock commented Jan 20, 2025 •

edited

Loading