Why is openxlsx2 using so much memory? #1252

JanMarvin opened this issue Jan 27, 2025 · 0 comments

Labels: documentation ✍️ Improvements or additions to documentation, wontfix 🚫 This will not be worked on

JanMarvin commented Jan 27, 2025

A common user complaint is that openxlsx2 uses a lot of memory. Unfortunately this is nothing that can be waved away, which is also a tiny bit annoying (after all, waving things away is easy). The issue is not visible with small to medium sized data frames, but it can become a problem surprisingly quickly, especially when writing a file to disk.

When and why does the issue appear?

Frequently users want to dump entire databases into spreadsheets. If one attempts this with openxlsx2, it is easy to see that this won't work, or at least that it demands plenty of available memory. This can be examined with a medium sized example:

n <- 100000
k <- 100

df <- as.data.frame(matrix(rnorm(n*k), n, k))

This data frame has 10 million cells and requires ~80MB of memory:

# object size in MB
> as.numeric(object.size(df)) / (1024 ^2)
[1] 76.30597

When writing we have to convert the data frame into characters (Office Open XML after all is just character based). This is costly: our tiny data frame now requires roughly 10 times the memory. The factor is not fixed and depends on the number of required digits. Integers are shorter and might not require as much memory, but as a rule of thumb: if one does not want to lose precision and cares about memory, converting to characters is a bad idea.

> df[] <- lapply(df, as.character)
> as.numeric(object.size(df)) / (1024 ^2)
[1] 835.9633
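
The rule of thumb about digits can be made concrete: to round-trip a double through text without losing precision, up to 17 significant digits are needed (a quick sketch, nothing openxlsx2-specific):

x <- 1/3
as.numeric(format(x, digits = 15)) == x  # FALSE: 15 digits drop the last bits
as.numeric(format(x, digits = 17)) == x  # TRUE: 17 digits round-trip exactly
nchar(format(x, digits = 17))            # 19 characters for a single number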

Oh, and we don't have the luxury of replacing df; we have to do the conversion on a copy of the data, so that the df object in user space is untouched. So we end up with df and a character copy of df (usually with an added row containing the column names).

As much as I have thought about this, there is no real alternative. It might be possible to keep the data as numeric in the worksheet, but this would slow down writing, and unfortunately the OOXML structure allows a wild mix of character, numeric and expression cells in the same column. Since we want to support everything, there is no real way around the conversion. After all, 800MB is not that bad ...
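
For illustration, a hand-written OOXML fragment in which a single column legitimately mixes a numeric cell, an inline string and a formula:

<row r="1"><c r="A1"><v>1.5</v></c></row>                          <!-- numeric -->
<row r="2"><c r="A2" t="inlineStr"><is><t>text</t></is></c></row>  <!-- string -->
<row r="3"><c r="A3"><f>A1*2</f><v>3</v></c></row>                 <!-- formula -->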

And when does it get really bad?

It gets really bad when writing the data into the worksheet. Internally we usually create two data frames: one for row attributes (like hidden state, row height etc.) and one for the cells in the row (column information goes into a different object and is negligible size-wise). We create additional data frames to get the size of dims using dims_to_dataframe(fill = TRUE), but this should not really matter; it only will if the data is huge or memory is really limited.

In the early days of openxlsx2, when the project was just a branch to evaluate the inclusion of pugixml, it became obvious that openxlsx ignores a lot of cell information. To fix this, we (better: I) came up with the idea of the cc data frame. This is basically an all-character data frame containing all cell related contents. It is all character because this is how we extract the data from the worksheet, and it seems reasonable not to tinker with it in any other way.

The cc data frame is constructed as follows:

> cc <- openxlsx2:::empty_sheet_data_cc(n = 1)
> cc
  r row_r c_r c_s c_t c_cm c_ph c_vm v f f_attr is typ
1         

At the moment we have 13 columns per cell (c_cm, c_ph, c_vm, and typ are no longer created per default, leaving 9). Let's write our data (the numeric data, not the character-converted copy) into the worksheet and have a look at cc.

> wb <- openxlsx2::write_xlsx(x = df)
> cc <- wb$worksheets[[1]]$sheet_data$cc
> head(cc)
   r row_r c_r c_s       c_t c_cm c_ph c_vm v f f_attr                 is typ
1 A1     1   A     inlineStr                           <is><t>V1</t></is>   4
2 B1     1   B     inlineStr                           <is><t>V2</t></is>   4
3 C1     1   C     inlineStr                           <is><t>V3</t></is>   4
4 D1     1   D     inlineStr                           <is><t>V4</t></is>   4
5 E1     1   E     inlineStr                           <is><t>V5</t></is>   4
6 F1     1   F     inlineStr                           <is><t>V6</t></is>   4
> tail(cc)
                r  row_r c_r c_s c_t c_cm c_ph c_vm                 v f f_attr is typ
10000095 CQ100001 100001  CQ                        -1.07263895628018               2
10000096 CR100001 100001  CR                        0.561385104244743               2
10000097 CS100001 100001  CS                          2.0746709055076               2
10000098 CT100001 100001  CT                         1.37557376511109               2
10000099 CU100001 100001  CU                        0.827289072319456               2
10000100 CV100001 100001  CV                        -1.96342635559554               2

Well, how big is cc now? It is bad ... roughly 2.2GB, i.e. around 240 bytes for each of the 10 million cells:

> as.numeric(object.size(cc)) / (1024 ^3)
[1] 2.237213

What can be done and what are potential issues?

Basically, storing data in an all-character data frame was a bad idea to begin with. Well, at least if one wants to write a lot of data and has memory restrictions (which is a great chance to recommend the following completely unrelated blog post: https://motherduck.com/blog/big-data-is-dead/).
Unfortunately the size of cc scales directly with the size of df: the more cells one wants to write, the larger cc (each cell is a row in cc). Even an entirely blank cc data frame for the df object above requires almost a GB of memory.

> as.numeric(object.size(cc)) / (1024 ^3)
[1] 0.9685782
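
That number is easy to verify with back-of-the-envelope math: each character column is a vector of 8-byte pointers into R's string cache, and there are 13 columns:

> 13 * 1e7 * 8 / 1024^3
[1] 0.9685755

(the tiny difference is vector and attribute overhead).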

Combining c attributes

Why not combine all c attributes into a c_attr column, similar to how we handle f_attr (#1250)? Memory-wise this would probably save a lot: instead of six different c columns, we would end up with only a single one.

> cc <- openxlsx2:::create_char_dataframe(
+   c("r", "row_r", "c_r", "c_s", "c_t", "c_cm", "c_ph", "c_vm", "v", "f", "f_attr", "is", "typ"),
+   n = (n*k))
> as.numeric(object.size(cc)) / (1024 ^3)
[1] 0.9685782
> cc <- openxlsx2:::create_char_dataframe(
+   c("r", "row_r", "c_attr", "v", "f", "f_attr", "is", "typ"),
+   n = (n*k))
> as.numeric(object.size(cc)) / (1024 ^3)
[1] 0.5960483

Okay, let's do it? Not so fast. Obviously this will break a lot of stuff all over the place. In addition, to interact with this combined attribute column, we would have to expand and collapse it multiple times, basically whenever any modification to a cell is applied. At the moment I'm unsure how this would impact performance, and how it would impact memory as well. After all, when assigning e.g. a numfmt, we have to alter the style information for potentially all cells, and blowing up c_attr at this point and later collapsing it again could be just as bad as the current implementation (just that the current one would be faster, because it does not have to expand r="A" s="1" t="inlineStr" into a three column data frame that we can alter and then store as a single-line string again).
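
To illustrate the expand/collapse cycle (a minimal sketch with illustrative code, not how openxlsx2 handles this internally):

# one combined attribute string instead of three separate columns
c_attr <- 'r="A" s="1" t="inlineStr"'

# expand: split into name/value pairs so a single attribute can be edited
parts <- regmatches(c_attr, gregexpr('[a-z]+="[^"]*"', c_attr))[[1]]
vals  <- sub('^[a-z]+="([^"]*)"$', "\\1", parts)
names(vals) <- sub('^([a-z]+)=.*$', "\\1", parts)

# modify the style, then collapse back into a single-line string
vals["s"] <- "2"
c_attr <- paste0(names(vals), '="', vals, '"', collapse = " ")
# c_attr is now: r="A" s="2" t="inlineStr"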

An estimate of how much we could save:

> cc <- wb$worksheets[[1]]$sheet_data$cc
> # remove potentially blank cells
> cc$c_cm <- NULL
> cc$c_ph <- NULL
> cc$c_vm <- NULL
> cc$c_r <- NULL
> cc$c_s <- NULL
> as.numeric(object.size(cc)) / (1024 ^3)
[1] 1.864674

Not great yet, but still a saving of roughly 400MB. Oh, and we do not need the typ column IIRC.

Change cc column type

Since much of the problem stems from having an all-character data frame, why not fall back to logical() or integer()? Both are similarly sized (IIRC logical is internally stored like an integer in R), and both use less memory than a character.

> object.size(character(1))
112 bytes
> object.size(numeric(1))
56 bytes
> object.size(integer(1))
56 bytes
> object.size(logical(1))
56 bytes
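
Note that the 56 and 112 bytes above are mostly per-object header overhead; the per-element difference only shows at vector scale, where a character element is an 8-byte pointer into the string cache and an integer (or factor code) is 4 bytes. A quick check (sizes approximate):

n <- 1e7
as.numeric(object.size(character(n))) / (1024^2)          # ~76 MB, 8 bytes per element
as.numeric(object.size(factor(character(n)))) / (1024^2)  # ~38 MB, 4 bytes per element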

Converting to this could bring some benefits, and it would not alter the data structure everywhere. It will still break things, because flushing a cell by assigning "" to a row in cc would no longer be possible, and a bunch of code would have to be rewritten. Will this bring a lot?
Probably not: the initial cc is 2.2GB, and after converting the sparse c columns to factors (which store integer codes under the hood) we end up with 2GB:

> sel <- c("c_r", "c_s", "c_t", "c_cm", "c_ph", "c_vm")
> cc[sel] <- lapply(cc[sel], as.factor)
> as.numeric(object.size(cc)) / (1024 ^3)
[1] 2.013696

Make use of database normalization or trim down cc?

Another approach would be to use normalization to trim down cc further (IIRC I once experimented with a slim cc branch). This way we could simply merge in columns that are filled with data when they are needed, maybe the fraction of c_attr that is usually left blank (like c_ph or c_vm). Or a slim cc when writing and a wider cc only when loading, and only if a column is really required. After all, we create many blank cells and even entirely blank columns in cc (cc is already trimmed down in #1253, #1254).
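
A minimal sketch of that trimming idea (illustrative only, not what #1253 or #1254 implement):

# keep only the columns of cc that contain at least one non-blank entry
slim <- cc[, colSums(cc != "") > 0, drop = FALSE]

# re-create a blank column on demand, e.g. when loading actually needs c_ph
if (is.null(slim$c_ph)) slim$c_ph <- ""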

Replace cc with something else?

There is a branch where the arrow package is used. But this increased the dependencies a lot, and we still want to be able to modify the data in the file (at the moment it only wrote the parquet file and imported it afterwards). In short, we would still have to rewrite many functions to be able to interact with this file type. And none of this seems pleasing.

Let's talk about wb_save()

When writing output files, memory consumption increases at least by a factor of 2. How come? This is caused by our internal pugixml library. When saving, the XML output is created in memory and, once completed, written to disk. While bad for memory-intensive tasks, this guarantees that the XML output is valid, all special characters are escaped where required, and the encoding is valid UTF-8. Since XML is character based, the DOM tree created for the pugixml worksheets is basically another copy of the cc data frame with additional metadata included. While this has benefits, it is memory-wise a bottleneck. The alternative to using pugixml for the output would be to stream the cc data frame cell by cell directly into an XML file. A POC of such a streaming approach is available in #1255. It might be limited, and at the moment it is not clear when or if it will ever make it to the main branch.
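
For illustration, a bare-bones version of such a stream writer (a rough sketch, not the #1255 implementation; a real one would still have to escape special characters itself, something pugixml currently guarantees):

# stream one <row> at a time instead of building the whole DOM in memory;
# simplified: ignores inline strings, styles and XML escaping
con <- file("sheet1.xml", open = "w", encoding = "UTF-8")
writeLines("<worksheet><sheetData>", con)
for (rw in split(cc, as.integer(cc$row_r))) {
  cells <- sprintf('<c r="%s"><v>%s</v></c>', rw$r, rw$v)
  writeLines(sprintf('<row r="%s">%s</row>', rw$row_r[1],
                     paste(cells, collapse = "")), con)
}
writeLines("</sheetData></worksheet>", con)
close(con)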

But other packages are using much less memory!

Yes, but other packages have other goals to achieve. In openxlsx2 we want to be able to modify cells once we have written them into a worksheet, and we want to be able to load a workbook and alter it. Therefore we have to store our information somewhere in a form that we can access and modify. None of this is required when merely dumping a data frame into an xlsx file; even the costly conversion to character could then be reduced to an on-the-fly conversion.

Conclusion

As much as I would like to please everyone, in this case I am a lot less interested. The package works very well for my own use cases. I am not sure if I ever had to write more than ten thousand rows, and that works well. If this isn't true for you or your company, please go ahead and fork the package. I am willing to discuss ideas and pull requests, but I am not actively developing something to solve this memory issue.

At work I have access to various databases; at home I neither have nor need such access. I understand that this is a luxury I have, but please understand that this open source work is a luxury that you and your company are using.
