GDAL arrow stream #545
Comments
Thanks for scoping this out and pointing me to the sources. It's only slightly different in gdalraster in the context of class GDALVector.
One difference is that a GDALVector object has a persistent connection to a vector data source, i.e., it holds pointers to an OGRLayer object and the GDALDataset that owns it. In the code you linked, much of what is done in, e.g.,

An ArrowArrayStream, or the ArrowSchema/ArrowArray objects it returns, cannot be used once the OGRLayer they were initialized from has been destroyed, usually at dataset closing. sf does not already have an object on the R side that holds dataset/layer pointers, so GDALStreamWrapper in sf/src/gdal_read_stream.cpp holds the pointers in its member variable

For just exposing the arrow stream, this is proof-of-concept only (it currently does not handle GDAL < 3.6, and it is undocumented and untested, with potential design flaws, etc.). The code additions are in src/gdalvector.cpp, mainly the exposed method

https://github.com/ctoney/gdalraster/tree/arrowstream

library(gdalraster)
#> GDAL 3.8.4, released 2024/02/08, GEOS 3.12.1, PROJ 9.3.1
dsn <- system.file("extdata/ynp_fires_1984_2022.gpkg", package = "gdalraster")
lyr <- new(GDALVector, dsn, "mtbs_perims")
stream = nanoarrow::nanoarrow_allocate_array_stream()
lyr$getArrowStream(stream)
stream$get_schema()
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 11
#> ..$ fid :<nanoarrow_schema int64>
#> .. ..$ format : chr "l"
#> .. ..$ name : chr "fid"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 0
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ event_id :<nanoarrow_schema string>
#> .. ..$ format : chr "u"
#> .. ..$ name : chr "event_id"
#> .. ..$ metadata :List of 1
#> .. .. ..$ GDAL:OGR:width: chr "254"
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ incid_name :<nanoarrow_schema string>
#> .. ..$ format : chr "u"
#> .. ..$ name : chr "incid_name"
#> .. ..$ metadata :List of 1
#> .. .. ..$ GDAL:OGR:width: chr "254"
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ incid_type :<nanoarrow_schema string>
#> .. ..$ format : chr "u"
#> .. ..$ name : chr "incid_type"
#> .. ..$ metadata :List of 1
#> .. .. ..$ GDAL:OGR:width: chr "254"
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ map_id :<nanoarrow_schema int64>
#> .. ..$ format : chr "l"
#> .. ..$ name : chr "map_id"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ burn_bnd_ac :<nanoarrow_schema int64>
#> .. ..$ format : chr "l"
#> .. ..$ name : chr "burn_bnd_ac"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ burn_bnd_lat:<nanoarrow_schema string>
#> .. ..$ format : chr "u"
#> .. ..$ name : chr "burn_bnd_lat"
#> .. ..$ metadata :List of 1
#> .. .. ..$ GDAL:OGR:width: chr "10"
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ burn_bnd_lon:<nanoarrow_schema string>
#> .. ..$ format : chr "u"
#> .. ..$ name : chr "burn_bnd_lon"
#> .. ..$ metadata :List of 1
#> .. .. ..$ GDAL:OGR:width: chr "10"
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ ig_date :<nanoarrow_schema date32>
#> .. ..$ format : chr "tdD"
#> .. ..$ name : chr "ig_date"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ ig_year :<nanoarrow_schema int32>
#> .. ..$ format : chr "i"
#> .. ..$ name : chr "ig_year"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> ..$ geom :<nanoarrow_schema ogc.wkb{binary}>
#> .. ..$ format : chr "z"
#> .. ..$ name : chr "geom"
#> .. ..$ metadata :List of 1
#> .. .. ..$ ARROW:extension:name: chr "ogc.wkb"
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULL
stream$release()
lyr$close()

Created on 2024-10-07 with reprex v2.1.1

cc: @joshyam-k
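A minimal sketch of the lifetime constraint described above, written in C++ with no GDAL dependency. The `ArrowArrayStream` struct follows the layout given in the Arrow C stream interface; everything else (`FakeDataset`, `KeepAlive`, `wrap_stream`, `demo_dataset_outlives_handle`) is a hypothetical name made up for illustration and is not the code in sf's `GDALStreamWrapper` or in the gdalraster branch. The idea shown is only the general pattern: an outer stream keeps a reference to the dataset owner, so the dataset cannot close until the consumer releases the stream.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Arrow C stream interface struct, laid out as in the Arrow specification.
// ArrowSchema and ArrowArray stay opaque: this sketch never inspects them.
struct ArrowSchema;
struct ArrowArray;
struct ArrowArrayStream {
    int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
    int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
    const char* (*get_last_error)(struct ArrowArrayStream*);
    void (*release)(struct ArrowArrayStream*);
    void* private_data;
};

// Hypothetical stand-in for the GDALDataset that owns the layer.
struct FakeDataset {
    bool* closed;
    explicit FakeDataset(bool* flag) : closed(flag) { *closed = false; }
    ~FakeDataset() { *closed = true; }  // plays the role of dataset closing
};

// Private state of the outer stream: the layer's stream plus its owner.
struct KeepAlive {
    ArrowArrayStream inner;
    std::shared_ptr<FakeDataset> owner;
};

static KeepAlive* state(ArrowArrayStream* s) {
    return static_cast<KeepAlive*>(s->private_data);
}
static int ka_get_schema(ArrowArrayStream* s, ArrowSchema* out) {
    return state(s)->inner.get_schema(&state(s)->inner, out);
}
static int ka_get_next(ArrowArrayStream* s, ArrowArray* out) {
    return state(s)->inner.get_next(&state(s)->inner, out);
}
static const char* ka_get_last_error(ArrowArrayStream* s) {
    return state(s)->inner.get_last_error(&state(s)->inner);
}
static void ka_release(ArrowArrayStream* s) {
    KeepAlive* p = state(s);
    p->inner.release(&p->inner);  // release the layer's stream first
    delete p;                     // then drop the dataset reference
    s->release = nullptr;         // mark the outer stream as released
    s->private_data = nullptr;
}

// Move `inner` into `out` and tie the dataset's lifetime to `out`.
void wrap_stream(ArrowArrayStream* inner, std::shared_ptr<FakeDataset> owner,
                 ArrowArrayStream* out) {
    assert(inner->release != nullptr);  // must be a live stream
    KeepAlive* p = new KeepAlive{*inner, std::move(owner)};
    inner->release = nullptr;  // the source struct no longer owns anything
    out->get_schema = ka_get_schema;
    out->get_next = ka_get_next;
    out->get_last_error = ka_get_last_error;
    out->release = ka_release;
    out->private_data = p;
}

// Stub inner stream: its release callback only records that it ran.
static void stub_release(ArrowArrayStream* s) {
    *static_cast<bool*>(s->private_data) = true;
    s->release = nullptr;
}

// Returns true if the dataset stays open until the stream is released.
bool demo_dataset_outlives_handle() {
    bool closed = false, inner_released = false;
    auto ds = std::make_shared<FakeDataset>(&closed);
    ArrowArrayStream inner{nullptr, nullptr, nullptr, stub_release,
                           &inner_released};
    ArrowArrayStream outer{};
    wrap_stream(&inner, ds, &outer);
    ds.reset();  // the caller drops its own handle
    bool open_while_streaming = !closed;
    outer.release(&outer);
    return open_while_streaming && inner_released && closed &&
           outer.release == nullptr;
}
```

In gdalraster the GDALVector object already holds the dataset and layer pointers, so a wrapper like this may be unnecessary there; the sketch is only meant to show why the stream and the layer stay entwined.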
Awesome, that clarifies a lot, thanks! I hadn't considered that the layer has to stay alive after it hands the stream over to Arrow, but the two objects are clearly still entwined. Would you consider a dependency on nanoarrow? I see that sf lists it under Suggests, so maybe that's the way to do it.
@mdsumner Agreed, we should add nanoarrow to Suggests for now if we only use it in examples and tests. I don't mind importing from it if other uses come up, though. I like that it's relatively straightforward to expose the stream and consume it in R using nanoarrow. I imagined eventually using the OGR ArrowStream interface as an alternative and optional way of reading for the
No other thoughts, I just want to keep up with recent developments and be able to explore while Arrow takes over everything 😀
I'm pretty keen to get Arrow stream support in. I tried doing it myself but got lost in the details of how this package is actually structured at the C++ level.
Dewey Dunnington wrote the required wrapper here, and I'd suggest just returning the arrow stream as the first step:
https://github.com/r-spatial/sf/blob/main/src/gdal_read_stream.cpp
Apart from GDAL (3.6.0 or later) and Rcpp, that code relies only on "sf/src/gdal_read.h", which has a declaration for an Rcpp list with options passed in by the user in CPL_ogr_layer_setup.
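Since the stream interface needs GDAL 3.6.0 or later, a version gate is one way to keep older builds working. The sketch below is self-contained and does not include GDAL headers: `compute_version` is a hypothetical helper that, as far as I understand it, mirrors the arithmetic of GDAL's `GDAL_COMPUTE_VERSION(maj, min, rev)` macro, and `arrow_stream_supported` is likewise a made-up name.

```cpp
// Hypothetical helper assumed to mirror GDAL_COMPUTE_VERSION(maj, min, rev),
// which encodes a version as maj*1000000 + min*10000 + rev*100.
constexpr int compute_version(int maj, int min, int rev) {
    return maj * 1000000 + min * 10000 + rev * 100;
}

// Hypothetical check: true if a GDAL version number (in the encoding above)
// is new enough for the OGR Arrow stream interface, i.e. at least 3.6.0.
constexpr bool arrow_stream_supported(int gdal_version_num) {
    return gdal_version_num >= compute_version(3, 6, 0);
}

// Compile-time examples: 3.8.4 qualifies, 3.5.3 does not.
static_assert(arrow_stream_supported(compute_version(3, 8, 4)),
              "GDAL 3.8.4 should qualify");
static_assert(!arrow_stream_supported(compute_version(3, 5, 3)),
              "GDAL 3.5.3 should not qualify");
```

In a real build against GDAL, the same comparison would more likely be a preprocessor guard on `GDAL_VERSION_NUM` around the stream code, with a clear error on older versions.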
I need some guidance on which files here this code should go in, which I think is easy for you (!), and I'm happy to flesh it out in terms of GDALVector and tests/docs once we have it in. Or I could push a branch with my naive attempt and ask you to review that?
I have it built into vapour here where I have an option to pull the stream and return a df, or just return the stream object. I got scared off including that because of other problems I was having, and fwiw the code here is a bit out of date compared to Dewey's version:
https://github.com/hypertidy/vapour/blob/main/inst/include/gdalarrowstream/gdalvectorstream.h
https://github.com/hypertidy/vapour/blob/8fb170c6364a0363a9e2793d7b837b216f4e4d79/R/read_stream_internal.R#L10