[R] Arrow write_parquet() breaks data.table ability to set columns by reference #45300

nicki-dese · 2025-01-19T10:45:13Z

Describe the bug, including details regarding any error messages, version, and platform.

As of version 17.0 of arrow, the .internal.selfref attribute is no longer saved with the data.table when using write_parquet(). This breaks data.table's ability to set columns by reference unless you re-cast it to a data.table (using setDT) after reading it in with read_parquet.

The bug can be replicated as follows (I'm on Windows 11, using version 4.4.2 of R).

library(arrow)          # version 18.1.0.1 
library(data.table)   # version 1.16.4

dt <- data.table(x = 1:3)

names(attributes(dt))

# returns
# "names"             "row.names"         "class"             ".internal.selfref"

#works, creating a new column by reference. 
dt[, y := letters[1:3]]

# save file using write_parquet
write_parquet(dt, "test.parquet")

# read file back in using read_parquet
dt_after_parquet <- read_parquet("test.parquet")

# this has stripped away the .internal.selfref attribute
names(attributes(dt_after_parquet))
# returns
# "names"             "row.names"         "class" 

# meaning that this works but with the following warning message.
dt_after_parquet[, z := 4:6]

# Warning message:
# In `[.data.table`(dt_after_parquet, , `:=`(z, 4:6)) :
#   Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the 
# data.table so that := can add this new column by reference. At an earlier point, 
# this data.table has been copied by R (or was created manually using structure() 
# or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy 
# the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames 
# and ?setattr. If this message doesn't help, please report your use case to the 
# data.table issue tracker so the root cause can be fixed or this message improved.

The behaviour of the .internal.selfref attribute is described on stackoverflow, here.

Despite not being an error, that warning message is actually significant, as a main strength of data.table is that it does not create copies of a dataframe during manipulation or creation of columns within that dataframe, the way other R packages do. Our workflow is to work with large data.tables in a targets pipeline, saving interim files as parquet files, which targets does using arrow::read_parquet - this bug slows down our projects, as data.table creates shallow copies of our large data.tables.

Note that in arrow version 16.0 and earlier this issue did not occur, data.tables make the round trip through read/write_parquet successfully, maintaining the ability to set columns by reference.

Component(s)

R

The text was updated successfully, but these errors were encountered:

amoeba · 2025-01-21T15:15:51Z

Thanks for the report @nicki-dese.

@nealrichardson does this seem like it could be related to #42143?

nealrichardson · 2025-01-21T15:45:37Z

.internal.selfref is a pointer to the data.table in memory. Once you write it to a file and read it back in, it's no longer valid--you're not at the same address in memory anymore.

amoeba · 2025-01-21T16:31:23Z

Thanks @nealrichardson.

@nicki-dese: I think the workflow you're after isn't something the arrow R package can support unfortunately. If data.table were to elevate that behavior to some sort of standard interface it might be possible to reconsider.

To re-gain the speed you had before, you may have to explore more of the functionality of the arrow R package or packages like duckdb. Generally speaking, writing a table out to Parquet and reading it back in is expensive and something the these packages can help you avoid. Happy to chat more here about this or the original report.

nicki-dese · 2025-01-22T23:36:10Z

Thank you @nealrichardson.

Over in the related issue at data.table, @rikivillalba confirmed your comment re inability to save the .internal.selfref pointer, they also made the following observation:

The key is in the help of Arrow's read_parquet
It returns a tibble, not a data.table object.
For some reason, that read function also sets the "class" attribute of the object to their original value of "data.table". So R recognizes it as a data.table even when not really created as is.

Assuming they're correct about the read function setting the class attribute, that seems like something that should be changed? either leaving the class as tibble, thus matching the docs, or updating the docs and adding an exception in the read function to not set the class to data.table, because it isn't.

nicki-dese · 2025-01-22T23:47:42Z

Thank you @amoeba, both for your info and offer of a chat. I have done a lot of exploring of arrow, and more recently duckdb. For the majority of our work, targets plus data.table with interim outputs saved as parquet via targets has worked really well. We often start our targets pipeline with arrow::open_dataset to filter our data before bringing it in to memory, which has been a game changer. However, open_dataset's schema inference from csvs is much worse than both fread and duckdb's, which has stopped our whole-hearted adoption.

nealrichardson · 2025-01-23T00:13:46Z

For some reason, that read function also sets the "class" attribute of the object to their original value of "data.table". So R recognizes it as a data.table even when not really created as is.

We remove some attributes in a few places for classes that get stored differently, like in https://github.com/apache/arrow/blob/main/r/R/metadata.R#L240-L258. We could also prune the "data.table" class if that's more correct--would you like to make a pull request to do that?

nicki-dese · 2025-01-23T02:26:47Z

Hi @nealrichardson, thank you for the suggestion. It sounds like pruning the data.table class would be more correct. Unfortunately I have never pulled or submitted a request (my use of R is in closed VM, with no access to the internet, so I don't know how to contribute to github other than submitting issues).

amoeba · 2025-01-24T16:15:46Z

Thanks for the discussion. I went ahead and created a PR at #45346.

nicki-dese added the Type: bug label Jan 19, 2025

github-actions bot added the Component: R label Jan 19, 2025

nicki-dese mentioned this issue Jan 19, 2025

Arrow write_parquet removes .internal.selfref, data.table warning message not helpful Rdatatable/data.table#6737

Open

amoeba changed the title ~~Arrow write_parquet() breaks data.table ability to set columns by reference~~ [R] Arrow write_parquet() breaks data.table ability to set columns by reference Jan 21, 2025

amoeba linked a pull request Jan 24, 2025 that will close this issue

GH-45300: [R] Remove data.table from class attribute in metadata #45346

Open

github-actions bot assigned amoeba Jan 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] Arrow write_parquet() breaks data.table ability to set columns by reference #45300

[R] Arrow write_parquet() breaks data.table ability to set columns by reference #45300

nicki-dese commented Jan 19, 2025 •

edited

Loading

amoeba commented Jan 21, 2025

nealrichardson commented Jan 21, 2025

amoeba commented Jan 21, 2025

nicki-dese commented Jan 22, 2025

nicki-dese commented Jan 22, 2025

nealrichardson commented Jan 23, 2025

nicki-dese commented Jan 23, 2025

amoeba commented Jan 24, 2025

[R] Arrow write_parquet() breaks data.table ability to set columns by reference #45300

[R] Arrow write_parquet() breaks data.table ability to set columns by reference #45300

Comments

nicki-dese commented Jan 19, 2025 • edited Loading

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

amoeba commented Jan 21, 2025

nealrichardson commented Jan 21, 2025

amoeba commented Jan 21, 2025

nicki-dese commented Jan 22, 2025

nicki-dese commented Jan 22, 2025

nealrichardson commented Jan 23, 2025

nicki-dese commented Jan 23, 2025

amoeba commented Jan 24, 2025

nicki-dese commented Jan 19, 2025 •

edited

Loading