A common user complaint is that openxlsx2 uses a lot of memory. Unfortunately this is nothing that can be waved away, which is also a tiny bit annoying (after all, waving things away is easy). The issue is not visible with small to medium sized data frames, but it can become a problem surprisingly quickly, especially when writing a file to disk.
When and why does the issue appear?
Frequently users want to dump entire databases into spreadsheets. If one attempts this with openxlsx2, it is easy to see that this won't work, or at least that it demands plenty of available memory. This can be examined with a medium sized example:
```r
n  <- 100000
k  <- 100
df <- as.data.frame(matrix(rnorm(n * k), n, k))
```
This data frame has 10 million cells and requires ~80 MB of memory:

```r
# object size in MB
> as.numeric(object.size(df)) / (1024^2)
[1] 76.30597
```
When writing, we have to convert the data frame into characters (Office Open XML is, after all, just character based). This is costly. Our tiny data frame now requires roughly 10 times the amount of memory. This factor is not fixed and depends on the number of required digits. Integers are shorter and might not require that much memory, but as a rule of thumb: if one does not want to lose precision and cares about memory, converting to characters is a bad idea.
Oh, and we don't have the luxury of replacing df; we have to do the conversion on a copy of the data, so that the df object in the user's workspace is untouched. So we have df and a character copy of df (usually with an added row containing the column names).
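A quick way to get a feeling for this cost is to convert the data frame with plain as.character and compare sizes (a rough sketch; openxlsx2 may format numbers differently internally):

```r
# rough illustration of the character conversion cost; plain as.character,
# ignoring any number formatting openxlsx2 might apply internally
df_chr <- as.data.frame(lapply(df, as.character))
as.numeric(object.size(df_chr)) / (1024^2)  # several times the size of df
```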
As much as I have thought about this, there is no real alternative. It could be possible to keep the data numeric in the worksheet, but this would slow down writing, and unfortunately the OOXML structure allows a wild mix of character, numeric and expression cells in the same columns. Since we want to support everything, there is no real way around it. After all, 800 MB is not that bad ...
And when does it get really bad?
It gets really bad when writing the data into the worksheet. Internally we usually create two data frames: one for row attributes (like hidden state, row height etc.) and one for the cells in the row (column information goes into a different object and is negligible size-wise). We create additional data frames to get the size of dims using dims_to_dataframe(fill = TRUE), but this should not really matter; it only will if the data is huge or memory is really limited.
In the early days of openxlsx2, when the project was just a branch to evaluate the inclusion of pugixml, it became obvious that openxlsx ignores a lot of cell information. To fix this, we (better: I) came up with the idea of the cc data frame. This is basically an all-character data frame containing all cell-related content. It is all character because this is how we extract the data from the worksheet, and it seems reasonable not to tinker with it in any other way.
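To picture the idea, a single cc row looks roughly like this (a minimal sketch; the column names are illustrative and not necessarily the exact set openxlsx2 creates):

```r
# minimal sketch of a single cc row; column names are illustrative,
# not necessarily the exact set openxlsx2 creates
cc_sketch <- data.frame(
  r   = "A1",  # cell reference
  c_s = "",    # style index
  c_t = "",    # cell type, e.g. "inlineStr"
  v   = "",    # cell value
  f   = "",    # formula
  is  = "",    # inline string content
  stringsAsFactors = FALSE
)
str(cc_sketch)  # one row per cell, everything stored as character
```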
At the moment we have 9 columns per cell (down from 13: c_cm, c_ph, c_vm, and typ are no longer created per default). Let's write our data (the numeric data, not the character-converted copy) into the worksheet and have a look at cc.
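One way to have that look (assuming the cc data frame is reachable through the worksheet's sheet_data field; the exact path may differ between versions):

```r
# write the numeric df into a worksheet and inspect the resulting cc;
# the path to cc below is an assumption and may differ between versions
library(openxlsx2)
wb <- wb_workbook()$add_worksheet()$add_data(x = df)
cc <- wb$worksheets[[1]]$sheet_data$cc
as.numeric(object.size(cc)) / (1024^3)  # size in GB
```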
Well, how big is cc now? It is bad ... roughly 2.2 GB.
What can be done and what are potential issues?
Basically, storing data in an all-character data frame was a bad idea to begin with. Well, at least if one wants to write a lot of data and has memory restrictions (which is a great chance to recommend the following completely unrelated blog post: https://motherduck.com/blog/big-data-is-dead/).
Unfortunately the size of cc is highly correlated with the size of df: the more cells one wants to write, the larger cc becomes (each cell is a row in cc). Even an entirely blank cc data frame for the df object above requires almost a GB of memory.
Combining c attributes
Why not combine all c attributes into a single c_attr column, similar to how we handle f_attr (#1250)? Memory-wise this would probably save a lot: instead of having six different c columns, we would end up with only a single one.
Okay, let's do it? Not so fast. Obviously this will break a lot of stuff all over the place. In addition, to be able to interact with this combined attribute column, we would have to expand and collapse it multiple times, basically whenever any modification to a cell is applied. At the moment I'm unsure how this would impact performance, and how it impacts memory too. For example, when assigning a numfmt we have to alter the style information for potentially all cells, and blowing up c_attr at this point and later collapsing it again would potentially be just as bad as the current implementation (except that the current one would be faster, because it does not have to expand r="A" s="1" t="inlineStr" into a three-column structure that we can alter and then store again as a single string).
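The collapse/expand round trip itself is easy to picture; the snippet below is only a sketch of the idea, not the openxlsx2 API:

```r
# sketch of the collapse/expand round trip (illustrative, not the openxlsx2 API)
# collapse three attribute values into one c_attr string per cell ...
c_attr <- sprintf('r="%s" s="%s" t="%s"', "A1", "1", "inlineStr")

# ... and expand it again into separate fields whenever a cell is modified
parts <- regmatches(c_attr, gregexpr('[a-z_]+="[^"]*"', c_attr))[[1]]
vals  <- setNames(
  sub('^[a-z_]+="([^"]*)"$', "\\1", parts),
  sub('^([a-z_]+)=.*$', "\\1", parts)
)
vals  # named vector: r = "A1", s = "1", t = "inlineStr"
```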
An estimate of how much we could save suggests we would still end up at roughly 400 MB: not great yet, but a lot better than 2.2 GB. Oh, and we do not need the typ column, IIRC.
Change cc column type
Since much of the problem stems from having an all-character data frame, why not fall back to logical() or integer()? Both are similarly sized (IIRC, logical is stored just like integer in R), and both use less memory than character.
Converting to these types could bring some benefits, and it would not alter the data structure everywhere. It will still break things, because flushing a cell by assigning "" to a row in cc would no longer be possible, and a bunch of code would have to be rewritten. Will it bring a lot?
Probably not: the initial cc is 2.2 GB, and after conversion we end up with 2 GB.
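The limited gain is plausible when looking at how R stores blank columns; the standalone comparison below (not the actual cc object) shows the per-cell cost of character versus logical storage:

```r
# standalone illustration, not the actual cc object:
# a blank character column stores one pointer per cell (8 bytes on 64-bit),
# a logical column needs 4 bytes per cell
n_cells   <- 1e7
blank_chr <- character(n_cells)  # 10 million ""
blank_lgl <- rep(NA, n_cells)    # 10 million logical NA
object.size(blank_chr)
object.size(blank_lgl)
```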
Make use of database normalization or trim down cc?
Another approach would be to use normalization to trim down cc further (IIRC I once experimented with a slim cc branch). This way we could simply merge in columns that are filled with data only when needed: maybe the fraction of c_attr that is usually left blank (like c_ph or c_vm), or a slim cc when writing and a wider cc only when loading and only if a column is really required. After all, we create many blank cells and even blank columns in cc. (cc is already trimmed down in #1253 and #1254.)
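A sketch of what such normalization could look like (the layout is illustrative, not the current openxlsx2 structure): keep the frequently used columns in a slim cc and store rarely used attributes separately, merging only on demand.

```r
# illustrative layout, not the current openxlsx2 structure:
# frequently used cell columns stay in a slim cc ...
cc_slim  <- data.frame(r = c("A1", "B1"), v = c("1", "2"))
# ... rarely used attributes are stored only for the cells that need them
cc_extra <- data.frame(r = "B1", c_ph = "phonetic hint")
# merge to a wide view only when it is actually required
merge(cc_slim, cc_extra, by = "r", all.x = TRUE)
```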
Replace cc with something else?
There is a branch where the arrow package is used. But this increased the dependencies a lot, and we still want to be able to modify the data in the file (at the moment the branch only wrote the parquet file and imported it afterwards). In short, we would still have to rewrite many functions to be able to interact with this file type, and none of this seems pleasing.
Let's talk about wb_save()
When writing output files, memory consumption increases at least by a factor of 2. How come? This is caused by our internal pugixml library. When saving, the XML output is created in memory and, once completed, is written to disk. While bad for memory-intensive tasks, it guarantees that the XML output is valid, all special characters are escaped where required, and the encoding is valid UTF-8. Since XML is character based, the DOM tree pugixml creates for the worksheets is basically another copy of the cc data frame with additional metadata included. While this has benefits, it is a bottleneck memory-wise. The alternative to using pugixml to create the output would be to stream the cc data frame cell by cell directly into an XML file. A POC of such a streaming approach is available in #1255. It might be limited, and at the moment it is not clear when or if this will ever make it to the main branch.
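The streaming idea itself is simple; the sketch below only illustrates the concept (it is not the #1255 code), assumes illustrative column names r, c_s, c_t and v, and skips the escaping and encoding guarantees that pugixml otherwise provides:

```r
# conceptual sketch of streaming cells straight to disk, not the #1255 code;
# assumes illustrative column names r, c_s, c_t, v and ignores XML escaping
con <- file("sheet_fragment.xml", open = "w", encoding = "UTF-8")
writeLines("<sheetData>", con)
for (i in seq_len(nrow(cc))) {
  writeLines(
    sprintf('<c r="%s" s="%s" t="%s"><v>%s</v></c>',
            cc$r[i], cc$c_s[i], cc$c_t[i], cc$v[i]),
    con
  )
}
writeLines("</sheetData>", con)
close(con)
```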
But other packages are using much less memory!
Yes, but other packages have other goals. In openxlsx2 we want to be able to modify cells once we have written them into a worksheet, and we want to be able to load a workbook and alter it. Therefore we have to store our information somewhere in a way that we can access and modify. This is not required when simply dumping a data frame into an xlsx file; in that case even the costly conversion to character could be reduced to an on-the-fly conversion.
Conclusion
As much as I would like to please everyone, in this case I am a lot less interested. The package works very well for my own use cases; I am not sure I have ever had to write more than ten thousand rows, and that works fine. If this isn't true for you or your company, please go ahead and fork the package. I am willing to discuss ideas and pull requests, but I am not actively developing something to solve this memory issue.
At work I have access to various databases; at home I don't, and I do not need to. I understand that this is a luxury I have, but please understand that this open source work is a luxury that you and your company are using.