Building in ALT-REP to stringi #474
I was actually thinking about giving And, yes, I am sure that ALTREP should be supported too, I could use your help in the future, thanks! I took a look at your
Here's the code for that plot: (I updated it a bit to make it easier to include stringi in the plot output) Glad you're interested in ALTREP support. I'll fork stringi and add ALT-REP to a few functions as a proof of concept.
Okay, nice, I'll be happy to take a look at the prototype. I see that stringfish uses PCRE. To be a bit fairer, I think you should be comparing the timings for
Here's the plot setting
So the most significant speed-up would come from not using R's CHARSXP cache? (for the 1-threaded version) Another question: is your ALTREP-based stringfish framework compatible with read/saveRDS?
Yes, all ALTREP frameworks are inherently compatible with read/saveRDS. Here's the prototype: https://github.com/traversc/stringi Benchmark code: https://gist.github.com/traversc/f357c5f1a4b0368649849dd3d1f49d14 Dataset used for benchmark:
Running the benchmark:
PS: I think it's important to run the benchmark in separate R sessions for a true comparison. I've found that even after proper garbage collection, there seems to be a big difference when running a command multiple times in a row. But even if you run the benchmark in the same R session, ALTREP is faster.
Thanks for the working prototype.
Sounds good, looking forward to it!
Just coming in here waving 👋 -- @traversc and I were chatting about possible fast and lightweight (enough) containers for string vectors. I am currently working a bit with arrow objects that have character vectors (in their encoding of contiguous vector plus a vector of offsets, ie
Might be a good one! My three Aussie cents:
Possible issue worth considering: R's string cache is bypassed... (can be a good thing) Would be nice to agree on a common representation across many packages.
The And no null termination anywhere 😿 But that is what is out there and what would be most efficient to use. Now, we could of course define another representation standard but that would start as an uphill battle with wind in our face 😿
Both @eddelbuettel and I have run into bottlenecks dealing with large amounts of string processing. Existing ALTREP frameworks (e.g. in vroom, arrow, etc.) don't really help because materialization happens too often, e.g. whenever you use dplyr or data.table. So an ALTREP "common representation across many packages" is very much the goal and would be huge for the R community :) Figuring out the optimal representation does not need to be a bottleneck to getting started. We can hide the implementation behind a set of access and modify API functions and test out various implementations without too much work. Like @eddelbuettel I would also have some time to help. PS: I believe the RStudio folks would also be interested (and hopefully supportive)
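As a rough illustration of "hide the implementation behind a set of access and modify API functions", here is a minimal C++ sketch. All names in it (`StringStore`, `VectorStore`, `total_bytes`) are hypothetical and are not part of stringi, stringfish, or R; the point is only that calling code written against the interface would not need to change if the backend were swapped for a different representation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical access/modify API; callers only ever see this interface.
class StringStore {
public:
    virtual ~StringStore() = default;
    virtual std::size_t size() const = 0;
    virtual std::string_view get(std::size_t i) const = 0;
    virtual void set(std::size_t i, std::string_view value) = 0;
};

// One possible backend: a plain vector of std::string (assumed UTF-8).
class VectorStore final : public StringStore {
    std::vector<std::string> elems_;
public:
    explicit VectorStore(std::size_t n) : elems_(n) {}
    std::size_t size() const override { return elems_.size(); }
    std::string_view get(std::size_t i) const override {
        assert(i < elems_.size());
        return elems_[i];
    }
    void set(std::size_t i, std::string_view value) override {
        assert(i < elems_.size());
        elems_[i].assign(value.data(), value.size());
    }
};

// Example of backend-agnostic code: it uses only the interface above.
inline std::size_t total_bytes(const StringStore& store) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < store.size(); ++i) n += store.get(i).size();
    return n;
}
```

A different backend (for example one contiguous buffer plus offsets) could implement the same three methods, and `total_bytes` would work unchanged, which is what would make it cheap to test several representations.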
What about doing this at the R level, not just the package level? Maybe we should ping Tomas Kalibera and ask what he thinks about it... I'm a big proponent of unity.
I always have Duncan Murdoch in the back of my ear: "if something can be done at the package level ..." It's just easier that way. You raise a fair point. It may just make everything a tad harder to pull off.
I agree. We also need to think about how
We probably have to do what Arrow and others do, which is to use an extra (bitmap) vector to signal it. (It took me some time to come around to this as I actually truly madly deeply love how R has NA/NaN in ints and chars (and bools (!!) etc). But these days interop is likely as important, if not more so, and if we want to do this for 'medium data at scale' we probably have to go with the times and have Arrow interop anyway.) I'd have to double check but I think the offsets vector then simply repeats the last position, implying length zero for the NA element. But when nullability is set (Arrow makes that optional) then there is an additional vector flagging this.
A zero-length string with an additional NA-marker makes sense to me too.
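To make the layout discussed above concrete, here is a minimal C++ sketch of a string column stored as one contiguous byte buffer, an offsets vector, and a validity bitmap, where an NA element is a zero-length slot whose validity bit is cleared. `StringColumn` and its members are hypothetical names for illustration, not an Arrow or stringi type, and the least-significant-bit-first ordering in the bitmap is an assumption meant to mirror Arrow's convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical Arrow-style string column (illustration only).
struct StringColumn {
    std::string data;                      // all bytes back to back, no NUL terminators
    std::vector<std::int32_t> offsets{0};  // element i spans [offsets[i], offsets[i + 1])
    std::vector<std::uint8_t> validity;    // bit i set => element i is not NA

    std::size_t size() const { return offsets.size() - 1; }

    // Append a string, or std::nullopt for NA.
    void push_back(std::optional<std::string_view> s) {
        const std::size_t i = size();
        if (i % 8 == 0) validity.push_back(0);  // start a new bitmap byte
        if (s) {
            data.append(s->data(), s->size());
            validity[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
        // For NA nothing was appended, so this repeats the previous offset.
        offsets.push_back(static_cast<std::int32_t>(data.size()));
    }

    bool is_na(std::size_t i) const {
        assert(i < size());
        return (validity[i / 8] & (1u << (i % 8))) == 0;
    }

    // View into the shared buffer; valid only while the column is unchanged.
    std::string_view get(std::size_t i) const {
        assert(i < size());
        const auto begin = static_cast<std::size_t>(offsets[i]);
        const auto end = static_cast<std::size_t>(offsets[i + 1]);
        return std::string_view(data).substr(begin, end - begin);
    }
};
```

Note that an empty string and an NA both occupy zero bytes here; only the validity bit tells them apart, which is why the extra bitmap is needed at all.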
Hello all, I don't have any knowledge in C/C++ or other low-level languages (just dabbling a bit in Rust) but I thought you might be interested in the string type in I hope this fits in this conversation, and sorry for the spam otherwise
Thanks!
Have you considered implementing an ALT-REP string class? I think done properly, you'd see a large increase in performance across the board. There are many reasons why:
If there's interest, I'd be happy to develop and work on it.
To flesh it out a bit, I think you could use an ALT-REP class that's represented by standard STL structures:
You don't need to keep track of encoding, if you can assume UTF-8.
You'd probably want some global configuration parameter:
You'd have to replace every interaction with R string memory with a conditional.
And replace any comparison of address for testing string equality (not sure if stringi does so).
There are probably things I'm forgetting and it's a lot of work, but I think the task is clearly defined.
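To illustrate the point about address comparison: when every string goes through a global cache, two equal strings share one object, so comparing pointers is enough. Once strings live outside the cache, equal content can sit at different addresses and the bytes have to be compared instead. The sketch below is a toy model in plain C++; `InternPool`, `same_by_address`, and `same_by_content` are hypothetical names, and the pool is only a stand-in for the idea of a string cache, not R's actual CHARSXP implementation.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <unordered_set>

// Toy stand-in for a global string cache (not R's real implementation).
class InternPool {
    std::unordered_set<std::string> pool_;
public:
    // Returns a stable pointer; equal strings always yield the same pointer.
    const std::string* intern(std::string_view s) {
        return &*pool_.emplace(s).first;
    }
};

// Valid only if both strings came from the same cache.
inline bool same_by_address(const std::string* a, const std::string* b) {
    assert(a != nullptr && b != nullptr);
    return a == b;
}

// Works for any storage, at the cost of comparing the bytes.
inline bool same_by_content(std::string_view a, std::string_view b) {
    return a == b;
}
```

Under these assumptions, any code path that relied on the pointer test would need to switch to the content test whenever at least one operand bypasses the cache.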