Arrow write_parquet removes .internal.selfref, data.table warning message not helpful #6737

Open
nicki-dese opened this issue Jan 19, 2025 · 5 comments
Labels
message Messages, warnings, errors

Comments

@nicki-dese

The bug can be replicated as follows (I'm on Windows 11, using version 4.4.2 of R). It is new behaviour as of arrow 17.0.

library(arrow)          # version 18.1.0.1 
library(data.table)   # version 1.16.4

dt <- data.table(x = 1:3)

names(attributes(dt))

# returns
# "names"             "row.names"         "class"             ".internal.selfref"

# works, creating a new column by reference
dt[, y := letters[1:3]]

# save file using write_parquet
write_parquet(dt, "test.parquet")

# read file back in using read_parquet
dt_after_parquet <- read_parquet("test.parquet")

# this has stripped away the .internal.selfref attribute
names(attributes(dt_after_parquet))
# returns
# "names"             "row.names"         "class" 

# meaning that this works but with the following warning message.
dt_after_parquet[, z := 4:6]

# Warning message:
# In `[.data.table`(dt_after_parquet, , `:=`(z, 4:6)) :
#   Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the 
# data.table so that := can add this new column by reference. At an earlier point, 
# this data.table has been copied by R (or was created manually using structure() 
# or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy 
# the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames 
# and ?setattr. If this message doesn't help, please report your use case to the 
# data.table issue tracker so the root cause can be fixed or this message improved.

It took some effort to track down what was happening, because it was not obvious to me that writing a data.table to a file and reading it back was covered by the warning message. An added complication was that I was using targets, which called write_parquet/read_parquet in the background because I'd chosen to save my targets as parquet files.

I have reported the bug to arrow, here. I debated whether to cross-post, but given the request in the warning message itself, decided to. Please delete/close if this cross-posting was ill-advised.

@rikivillalba
Contributor

rikivillalba commented Jan 20, 2025

The key is in the help for Arrow's read_parquet: it returns a tibble, not a data.table object.
For some reason, that read function also sets the "class" attribute of the object back to its original value of "data.table". So R recognizes it as a data.table even though it was not really created as one.
When the warning says "or was created manually using structure() or similar", that includes manually setting the class to "data.table", as arrow does.
I wouldn't dismiss this, as it sounds reasonable to me that better advice in the message could be of value.
Meanwhile, use setDT(dt_after_parquet) after read_parquet.

#6494
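
A minimal sketch of that workaround, runnable without arrow. It assumes the object returned by read_parquet is equivalent to a plain data frame whose class has been set to "data.table" by hand; `fake_dt` and `has_selfref` are hypothetical names used only for illustration.

```r
library(data.table)

# Stand-in (assumption) for what read_parquet returned in this report:
# a data frame that claims to be a data.table but has no .internal.selfref.
df <- data.frame(x = 1:3, y = letters[1:3])
fake_dt <- structure(df, class = c("data.table", "data.frame"))

# Hypothetical helper: is the .internal.selfref attribute present?
has_selfref <- function(d) ".internal.selfref" %in% names(attributes(d))

before <- has_selfref(fake_dt)   # FALSE: the attribute is missing

setDT(fake_dt)                   # let data.table set up its internals
after <- has_selfref(fake_dt)    # TRUE: the attribute is back

# := can now add the column by reference
fake_dt[, z := 4:6]
```

With the real file, the equivalent call would be setDT(dt_after_parquet) straight after read_parquet, as suggested above.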

@rikivillalba rikivillalba added the message Messages, warnings, errors label Jan 20, 2025
@nicki-dese
Author

nicki-dese commented Jan 20, 2025

Hi @rikivillalba - good catch with read_parquet, I'd missed that in the arrow docs. What confuses me is that in arrow 16 and earlier, read_parquet used to keep .internal.selfref, and as far as I could tell the returned object behaved as a data.table. Are a "class" of "data.table" and an .internal.selfref necessary and sufficient to make a data frame a data.table? Or was read_parquet giving me a false sense of security, and the returned object was still a tibble masquerading as a data.table?

@rikivillalba
Contributor

rikivillalba commented Jan 20, 2025

Mmm, no. Perhaps earlier arrow versions restored all attributes and newer ones do not save/restore those starting with a dot. However, .internal.selfref cannot be "saved" because it is a true pointer (i.e. not serializable), so loading a data.table directly won't work without a warning unless you use data.table's native methods, i.e. setDT, as.data.table, or fread. data.table checks whether .internal.selfref is OK; if it is not, it prints the warning and corrects it.
There is the mentioned issue 6494 to possibly improve the warning message.
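
A small sketch of the pointer point above, using base R's saveRDS/readRDS as a stand-in for any on-disk round trip. That choice is an assumption for illustration only; it is not what arrow does internally, and the variable names are hypothetical.

```r
library(data.table)

dt <- data.table(x = 1:3)

# The self-reference is stored as an external pointer, not as ordinary data.
ptr_live <- attr(dt, ".internal.selfref")
ptr_type <- typeof(ptr_live)     # "externalptr"

# Round-trip through a file with base R serialization.
tmp <- tempfile(fileext = ".rds")
saveRDS(dt, tmp)
dt_reloaded <- readRDS(tmp)
ptr_reloaded <- attr(dt_reloaded, ".internal.selfref")

# The address held by the live object does not survive the round trip,
# so the reloaded attribute cannot be the same pointer.
same_pointer <- identical(ptr_live, ptr_reloaded)   # FALSE
```

This is why the attribute has to be rebuilt in memory (for example with setDT) rather than restored from the file.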

@tdhock
Member

tdhock commented Jan 20, 2025

this sounds like a regression that arrow should fix

@nicki-dese
Author

nicki-dese commented Jan 29, 2025

@rikivillalba, I shared your observations on read_parquet on the arrow issue, and they have confirmed that arrow keeps the data.table class attribute. Their solution (which seems correct to me) is to remove that attribute from the arrow metadata, and there is now an arrow pull request to fix it. With that change, a data.table that is saved and read back in will no longer have the class "data.table".

As far as this issue re the warning message is concerned, you said:

loading a data.table directly won't work without warning unless you use data.table native methods, i.e. setDT, as.data.table, or fread.

Would it be worth updating the warning message to clarify this?
