Building in ALT-REP to stringi #474

traversc · 2022-03-24T20:07:45Z

Have you considered implementing an ALT-REP string class? I think done properly, you'd see a large increase in performance across the board. There are many reasons why:

Simpler data structures compared to R's heavy CHARSXP and R's global string cache
Short string optimization
The possibility of true multithreading (you can't multithread R internals)

If there's interest, I'd be happy to develop and work on it.

To flesh it out a bit, I think you could use an ALT-REP class that's represented by standard STL structures:

std::vector<std::string>

You don't need to keep track of encoding, if you can assume UTF-8.

You'd probably want some global configuration parameter:

stri_use_alt_rep(bool)

You'd have to replace every interaction with R string memory with a conditional.

CHAR
SET_STRING_ELT
STRING_ELT
mkChar*
Rf_allocVector(STRSXP,...)

And replace any comparison of address for testing string equality (not sure if stringi does so).

There are probably things I'm forgetting and it's a lot of work, but I think clearly defined.

The text was updated successfully, but these errors were encountered:

gagolews · 2022-03-25T00:37:01Z

I was actually thinking about giving stringi a major re-write for quite a long time. Now that the Windows-UCRT build of R assumes all strings are natively UTF-8, and the users are likely to transition to it in the future, it has started to make more sense.

And, yes, I am sure that ALTREP should be supported too, I could use your help in the future, thanks!

I took a look at your stringfish package – nice work! I was wondering what code did you use for generating the benchmarks summarised at https://github.com/traversc/stringfish/blob/master/vignettes/bench_v2.png ? I'd like to play with it a bit.

traversc · 2022-03-25T01:57:07Z

Here's the code for that plot:
https://github.com/traversc/stringfish/blob/master/inst/extra_tests/benchmark_test.r

(I updated it a bit to make it easier to include stringi in the plot output)

Glad you're interested in ALTREP support. I'll fork stringi and add in ALT-REP to a few functions, as a proof of concept.

gagolews · 2022-03-25T02:36:15Z

Okay, nice, I'll be happy to take a look at the prototype

I see that stringfish uses PCRE. To be a bit fairer, I think you should be comparing the timings for gsub and friends with perl=TRUE set. Can you generate the benchmark results featuring a base vs stringi vs stringfish comparison and post them here? Thanks

traversc · 2022-03-25T06:07:18Z

Here's the plot setting perl = TRUE. Much less of a difference for regex substitution, but still lots of performance to be gained in various places. I believe the difference between the blue and yellow bars would be what you could expect to gain in stringi.

gagolews · 2022-03-26T01:35:34Z

So the most significant speed-up gain would be due to not using the R's CHARSXP cache? (for the 1-threaded version)

Another question: is your ALTREP-based stringfish framework compatible with read/saveRDS?

traversc · 2022-03-27T01:00:50Z

Another question: is your ALTREP-based stringfish framework compatible with read/saveRDS?

Yes, all ALTREP frameworks are inherently compatible wiht read/saveRDS.

Here's the prototype: https://github.com/traversc/stringi

Benchmark code: https://gist.github.com/traversc/f357c5f1a4b0368649849dd3d1f49d14

Dataset used for benchmark:

cd ~/Downloads
wget https://data.deepai.org/enwik8.zip
unzip enwik8.zip

Running the benchmark:

Rscript bench_stringi.R
# 2.616 sec elapsed

Rscript bench_stringi_with_alt_rep.R
# 1.412 sec elapsed

# stringi CRAN version, comment out stri_use_alt_rep(FALSE)
Rscript bench_stringi.R
# 2.562 sec elapsed

PS: I think it's important to run the benchmark in separate R sessions for a true comparison. I've found that even after proper garbage collection, there seems to be a big difference running a command multiple times in a row.

But even if you run the benchmark in the same R session, ALTREP is faster.

gagolews · 2022-03-29T00:33:04Z

Thanks for the working prototype.
I will get back to you some time later this year when I hope I will be able to come up with a prototype of a re-write (probably something a'la stringi2 - that would be a separate project) of the package.
I definitely do not want to make any major changes is the upstream version of stringi, where stability is of great importance.

traversc · 2022-03-29T02:41:40Z

Sounds good, looking forward to it!

eddelbuettel · 2023-05-17T01:22:05Z

Just coming in here waving 👋 -- @traversc and I were chatting about possible fast and lightweight (enough) containers for string vectors. I am currently working a bit arrow objects that have character vectors (in their encoding of contiguous vector plus a vector of offsets, ie c("the", "black", "cat") becomes "theblackcat" and c(0,3,8,11) ) which is quick to pass around -- and it would be nice to have something quick and light and powerful to work with it. Of course once we mod the vector the contiguous blob may no longer be viable but at least for read access before that a std::vector<string_view> should be easy and could be ALTREP backed. Anyway, food for thought. Happy to help, time permitting.

gagolews · 2023-05-17T05:20:37Z

Might be a good one!

My three Aussie cents:

always use UTF-8
consider including null-bytes at the end of each substring (makes processing by other functions easier; the\0black\0cat\0 above

Possible issue worth considering: R's string cache is bypassed... (can be a good thing)

Would be nice to agree on a common representation across many packages.

eddelbuettel · 2023-05-17T12:17:12Z

The c("the", "black", "cat") with c(0,3,8,11) ) is a given and cannot be changed. I do not know who started it but it seems common, it is how TileDB stores on disks / retrieves and also what arrow does. (There is one fine difference whether vector offsets is length n or n+1 as shown here. The latter is easier and preferred.

And no null termination anywhere 😿 But that is what is out there and what would be most efficient to use.

Now, we could of course define another representation standard but that would start as an uphill battle with wind in our face 😿

traversc · 2023-05-17T19:46:08Z

Both @eddelbuettel and I have run into bottlenecks dealing with large amounts of string processing. Existing ALTREP frameworks (e.g. in vrooom, arrow, etc.) don't really help because materialization happens too often, e.g. whenever you use dplyr or data.table.

So an ALTREP "common representation across many packages" is very much the goal and would be huge for the R community :)

Figuring out the very best optimal representation does not need to be a bottleneck to getting started. We can hide the implementation behind a set of access and modify API functions and test out various implementations without too much work.

Like @eddelbuettel I would also have some time to help.

PS: I believe the Rstudio folks would also be interested (and hopefully supportive)

gagolews · 2023-05-17T22:59:35Z

What about doing this at the R, not just package level? Maybe we should ping Tomas Kalibera and ask what he thinks about it... I'm a big proponent of unity..

eddelbuettel · 2023-05-17T23:06:48Z

I always have Duncan Murdoch in the back of my ear: "if something can be done at the package level ..." It's just easier that way.

You raise a fair point. It may just make everything a tad harder to pull off.

gagolews · 2023-05-18T01:08:35Z

I agree.

We also need to think about how NA_character_s would be represented.
NAs in the index vector (like c(0,3,NA,8,11))? But then it would slow down string extraction for "sparse" vectors (with many missing strings).

eddelbuettel · 2023-05-18T01:15:52Z

We probably have to do what Arrow and others do with is an extra (bitmap) vector to signal it. nanoarrow has helpers.

(It took me some time to come around to this as I actually truly madly deeply love how R has NA/NaA in ints and chars (and bools (!!) etc). But these days interop is likely as if not more important and if we want to do this for 'medium data at scale' we probably have to go with the times and have Arrow interop anyway.)

I'd have to double check but I think in the offsets vector it then simply repeat the last position implying length zero of the NA element. But when nullabillity is set (Arrow makes that optional) then there is an additional vector flagging this.

gagolews · 2023-05-18T02:59:59Z

A zero-length string with an additional NA-marker makes sense to me too.
This will not paralyse other algorithms.

etiennebacher · 2024-02-06T09:03:36Z

Hello all, I don't have any knowledge in C/C++ or other low-level languages (just dabbling a bit in Rust) but I thought you might be interested in the string type in polars, since they rewrote it very recently and had important performance gains: https://pola.rs/posts/polars-string-type/

I hope this fits in this conversation, and sorry for the spam otherwise

gagolews · 2024-02-12T00:07:08Z

Thanks!

gagolews added the question label Nov 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Building in ALT-REP to stringi #474

Building in ALT-REP to stringi #474

traversc commented Mar 24, 2022

gagolews commented Mar 25, 2022

traversc commented Mar 25, 2022

gagolews commented Mar 25, 2022

traversc commented Mar 25, 2022

gagolews commented Mar 26, 2022

traversc commented Mar 27, 2022 •

edited

Loading

gagolews commented Mar 29, 2022

traversc commented Mar 29, 2022

eddelbuettel commented May 17, 2023

gagolews commented May 17, 2023

eddelbuettel commented May 17, 2023 •

edited

Loading

traversc commented May 17, 2023

gagolews commented May 17, 2023

eddelbuettel commented May 17, 2023

gagolews commented May 18, 2023

eddelbuettel commented May 18, 2023 •

edited

Loading

gagolews commented May 18, 2023

etiennebacher commented Feb 6, 2024

gagolews commented Feb 12, 2024

Building in ALT-REP to stringi #474

Building in ALT-REP to stringi #474

Comments

traversc commented Mar 24, 2022

gagolews commented Mar 25, 2022

traversc commented Mar 25, 2022

gagolews commented Mar 25, 2022

traversc commented Mar 25, 2022

gagolews commented Mar 26, 2022

traversc commented Mar 27, 2022 • edited Loading

gagolews commented Mar 29, 2022

traversc commented Mar 29, 2022

eddelbuettel commented May 17, 2023

gagolews commented May 17, 2023

eddelbuettel commented May 17, 2023 • edited Loading

traversc commented May 17, 2023

gagolews commented May 17, 2023

eddelbuettel commented May 17, 2023

gagolews commented May 18, 2023

eddelbuettel commented May 18, 2023 • edited Loading

gagolews commented May 18, 2023

etiennebacher commented Feb 6, 2024

gagolews commented Feb 12, 2024

traversc commented Mar 27, 2022 •

edited

Loading

eddelbuettel commented May 17, 2023 •

edited

Loading

eddelbuettel commented May 18, 2023 •

edited

Loading