GDAL arrow stream #545

Open
mdsumner opened this issue Oct 5, 2024 · 4 comments

@mdsumner
Collaborator

mdsumner commented Oct 5, 2024

I'm pretty keen to get Arrow stream support in. I tried doing it but got lost in the details of how this package is actually structured at the C++ level.

Dewey Dunnington wrote the required wrapper here, and I'd suggest just returning the arrow stream as the first step:

https://github.com/r-spatial/sf/blob/main/src/gdal_read_stream.cpp

Apart from GDAL (>= 3.6.0) and Rcpp, that code relies only on "sf/src/gdal_read.h", which has a declaration for an Rcpp list with options passed in by the user in CPL_ogr_layer_setup.

I need some guidance on which files here this code should go in, which I think is easy for you (!), and I'm happy to flesh it out in terms of GDALVector and tests/docs once we have it in. Or I could push my naive attempt to a branch and ask you to review that?

I have it built into vapour here where I have an option to pull the stream and return a df, or just return the stream object. I got scared off including that because of other problems I was having, and fwiw the code here is a bit out of date compared to Dewey's version:

https://github.com/hypertidy/vapour/blob/main/inst/include/gdalarrowstream/gdalvectorstream.h

https://github.com/hypertidy/vapour/blob/8fb170c6364a0363a9e2793d7b837b216f4e4d79/R/read_stream_internal.R#L10

@ctoney
Collaborator

ctoney commented Oct 7, 2024

Thanks for scoping this out and pointing me to the sources. It's only slightly different in gdalraster in the context of class GDALVector.

> the details of how this package is actually structured at the C++ level

One difference is, a GDALVector object has a persistent connection to a vector data source, i.e., with pointers to an OGRLayer object and the GDALDataset that owns it. In the code you linked, much of what is done in, e.g., CPL_read_gdal_stream(), ogr_layer_setup() and the list it returns, already exists on a GDALVector object. For example, its layer may have been derived from a SQL query, or otherwise have spatial and/or attribute filters defined. The information gathered in CPL_read_gdal_stream() is available in the list returned by GDALVector::getLayerDefn() (includes the geom columns and their spatial ref as WKT) and by GDALVector::getFeatureCount(). We already have a wrapped OGRLayer to obtain all of that type of information in R.

An ArrowArrayStream, or the ArrowSchema/ArrowArray objects it returns, cannot be used once the OGRLayer they were initialized from has been destroyed, which usually happens when the dataset is closed. sf has no existing object on the R side that holds dataset/layer pointers, so GDALStreamWrapper in sf/src/gdal_read_stream.cpp holds them in its member variable Rcpp::List shelter_; they are obtained from CPL_ogr_layer_setup() and passed in to GDALStreamWrapper::Make(). This keeps those objects alive in R while the stream object is in use. The GDALDataset is eventually closed in the destructor for GDALStreamWrapper. None of that is needed separately if the stream is implemented in the exposed class GDALVector, since it already persists the dataset/layer pointers in an R object.
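To make the lifetime point concrete, here is a minimal sketch of the sheltering idea in plain C++, with no GDAL or Rcpp involved. `MockDataset`, `ShelteredStream`, `LifetimeReport` and `demo_shelter()` are hypothetical names invented for illustration; they are not the types used in sf or gdalraster, and a `std::shared_ptr` stands in for whatever actually keeps the dataset alive.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GDALDataset that owns one layer.
struct MockDataset {
    std::vector<int> layer_rows;  // stand-in for the OGRLayer's features
    bool* closed;                 // set to true when the dataset is destroyed

    MockDataset(std::vector<int> rows, bool* closed_flag)
        : layer_rows(std::move(rows)), closed(closed_flag) {
        *closed = false;
    }
    ~MockDataset() { *closed = true; }  // plays the role of closing the dataset
};

// Hypothetical stream that shelters the dataset it reads from.
class ShelteredStream {
public:
    explicit ShelteredStream(std::shared_ptr<MockDataset> ds)
        : shelter_(std::move(ds)) {}

    // Writes the next row to *out; returns false once the layer is exhausted.
    bool next(int* out) {
        assert(out != nullptr);
        if (pos_ >= shelter_->layer_rows.size()) return false;
        *out = shelter_->layer_rows[pos_++];
        return true;
    }

private:
    std::shared_ptr<MockDataset> shelter_;  // keeps the dataset alive
    std::size_t pos_ = 0;
};

struct LifetimeReport {
    bool readable_after_reset;
    bool closed_while_streaming;
    bool closed_after_stream;
};

// Drops the caller's handle first, then the stream, and records what happens.
LifetimeReport demo_shelter() {
    bool closed = false;
    LifetimeReport r{};
    {
        auto ds = std::make_shared<MockDataset>(std::vector<int>{1, 2, 3}, &closed);
        ShelteredStream stream(ds);
        ds.reset();  // the caller lets go of the dataset
        int v = 0;
        r.readable_after_reset = stream.next(&v) && v == 1;
        r.closed_while_streaming = closed;
    }  // stream destroyed here: last reference gone, dataset "closed"
    r.closed_after_stream = closed;
    return r;
}
```

Calling `demo_shelter()` shows the stream still readable after the caller drops its handle, with the dataset closing only once the stream itself is gone. With an object that already persists the dataset/layer pointers, the stream would not need its own shelter.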

For just exposing the Arrow stream, here is a proof of concept only (currently does not handle GDAL < 3.6, undocumented, untested, potential design flaws, etc.). The code additions are in src/gdalvector.cpp, mainly the exposed method GDALVector::getArrowStream() at line 1623 and wrappers for the callbacks at line 2962, plus an updated class declaration in src/gdalvector.h:

https://github.com/ctoney/gdalraster/tree/arrowstream
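For readers unfamiliar with the Arrow C stream interface, the following is a rough sketch of what forwarding callback wrappers can look like in general. It is not the code in the branch above. The `ArrowArrayStream` layout follows the Arrow specification; `StreamHolder`, `export_wrapped_stream()`, `MockState` and `make_mock_stream()` are hypothetical names, and the mock inner stream exists only so the sketch compiles and runs without GDAL.

```cpp
#include <cassert>

struct ArrowSchema;  // opaque here: the wrappers only forward pointers
struct ArrowArray;

// Arrow C stream interface struct (layout per the Arrow specification).
struct ArrowArrayStream {
    int (*get_schema)(ArrowArrayStream*, ArrowSchema* out);
    int (*get_next)(ArrowArrayStream*, ArrowArray* out);
    const char* (*get_last_error)(ArrowArrayStream*);
    void (*release)(ArrowArrayStream*);
    void* private_data;
};

// Hypothetical holder for the inner stream (e.g. one filled in by the layer).
struct StreamHolder {
    ArrowArrayStream inner;
};

static int wrap_get_schema(ArrowArrayStream* s, ArrowSchema* out) {
    auto* h = static_cast<StreamHolder*>(s->private_data);
    return h->inner.get_schema(&h->inner, out);
}

static int wrap_get_next(ArrowArrayStream* s, ArrowArray* out) {
    auto* h = static_cast<StreamHolder*>(s->private_data);
    return h->inner.get_next(&h->inner, out);
}

static const char* wrap_get_last_error(ArrowArrayStream* s) {
    auto* h = static_cast<StreamHolder*>(s->private_data);
    return h->inner.get_last_error(&h->inner);
}

static void wrap_release(ArrowArrayStream* s) {
    auto* h = static_cast<StreamHolder*>(s->private_data);
    if (h->inner.release != nullptr) h->inner.release(&h->inner);
    delete h;
    s->private_data = nullptr;
    s->release = nullptr;  // a null release marks the stream as released
}

// Moves `inner` into a holder and points the caller-allocated `out` at the wrappers.
void export_wrapped_stream(ArrowArrayStream* inner, ArrowArrayStream* out) {
    assert(inner != nullptr && out != nullptr);
    auto* h = new StreamHolder{*inner};
    inner->release = nullptr;  // ownership has moved into the holder
    out->get_schema = wrap_get_schema;
    out->get_next = wrap_get_next;
    out->get_last_error = wrap_get_last_error;
    out->release = wrap_release;
    out->private_data = h;
}

// Mock inner stream, used only to exercise the wrappers.
struct MockState {
    int next_calls = 0;
    bool released = false;
};
static int mock_get_schema(ArrowArrayStream*, ArrowSchema*) { return 0; }
static int mock_get_next(ArrowArrayStream* s, ArrowArray*) {
    static_cast<MockState*>(s->private_data)->next_calls++;
    return 0;
}
static const char* mock_get_last_error(ArrowArrayStream*) { return nullptr; }
static void mock_release(ArrowArrayStream* s) {
    static_cast<MockState*>(s->private_data)->released = true;
    s->release = nullptr;
}
ArrowArrayStream make_mock_stream(MockState* state) {
    return ArrowArrayStream{mock_get_schema, mock_get_next, mock_get_last_error,
                            mock_release, state};
}
```

The caller-allocated `out` struct plays the same role as the stream allocated with nanoarrow in the example: the producer fills in the function pointers, and the consumer calls `release` when done.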

library(gdalraster)
#> GDAL 3.8.4, released 2024/02/08, GEOS 3.12.1, PROJ 9.3.1

dsn <- system.file("extdata/ynp_fires_1984_2022.gpkg", package = "gdalraster")
lyr <- new(GDALVector, dsn, "mtbs_perims")

stream <- nanoarrow::nanoarrow_allocate_array_stream()
lyr$getArrowStream(stream)

stream$get_schema()
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 11
#>   ..$ fid         :<nanoarrow_schema int64>
#>   .. ..$ format    : chr "l"
#>   .. ..$ name      : chr "fid"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 0
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ event_id    :<nanoarrow_schema string>
#>   .. ..$ format    : chr "u"
#>   .. ..$ name      : chr "event_id"
#>   .. ..$ metadata  :List of 1
#>   .. .. ..$ GDAL:OGR:width: chr "254"
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ incid_name  :<nanoarrow_schema string>
#>   .. ..$ format    : chr "u"
#>   .. ..$ name      : chr "incid_name"
#>   .. ..$ metadata  :List of 1
#>   .. .. ..$ GDAL:OGR:width: chr "254"
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ incid_type  :<nanoarrow_schema string>
#>   .. ..$ format    : chr "u"
#>   .. ..$ name      : chr "incid_type"
#>   .. ..$ metadata  :List of 1
#>   .. .. ..$ GDAL:OGR:width: chr "254"
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ map_id      :<nanoarrow_schema int64>
#>   .. ..$ format    : chr "l"
#>   .. ..$ name      : chr "map_id"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ burn_bnd_ac :<nanoarrow_schema int64>
#>   .. ..$ format    : chr "l"
#>   .. ..$ name      : chr "burn_bnd_ac"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ burn_bnd_lat:<nanoarrow_schema string>
#>   .. ..$ format    : chr "u"
#>   .. ..$ name      : chr "burn_bnd_lat"
#>   .. ..$ metadata  :List of 1
#>   .. .. ..$ GDAL:OGR:width: chr "10"
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ burn_bnd_lon:<nanoarrow_schema string>
#>   .. ..$ format    : chr "u"
#>   .. ..$ name      : chr "burn_bnd_lon"
#>   .. ..$ metadata  :List of 1
#>   .. .. ..$ GDAL:OGR:width: chr "10"
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ ig_date     :<nanoarrow_schema date32>
#>   .. ..$ format    : chr "tdD"
#>   .. ..$ name      : chr "ig_date"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ ig_year     :<nanoarrow_schema int32>
#>   .. ..$ format    : chr "i"
#>   .. ..$ name      : chr "ig_year"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>   ..$ geom        :<nanoarrow_schema ogc.wkb{binary}>
#>   .. ..$ format    : chr "z"
#>   .. ..$ name      : chr "geom"
#>   .. ..$ metadata  :List of 1
#>   .. .. ..$ ARROW:extension:name: chr "ogc.wkb"
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL

stream$release()

lyr$close()

Created on 2024-10-07 with reprex v2.1.1

cc: @joshyam-k

@mdsumner
Collaborator Author

mdsumner commented Oct 7, 2024

Awesome, that clarifies a lot, thanks! I hadn't considered that the layer has to stay alive after it hands the stream over to Arrow, but the two objects are clearly still entwined. Would you consider a dependency on nanoarrow? I see that sf has it in Suggests, so maybe that's the way to do it.

@ctoney
Collaborator

ctoney commented Oct 9, 2024

@mdsumner Agree, we should add nanoarrow as Suggests for now if we only use it in examples and tests. I don't mind importing from it if other uses come up though. I like that it's relatively straightforward to expose the stream and consume in R using nanoarrow.

I imagined eventually using the OGR ArrowStream interface as an alternative, optional way of reading for the GDALVector::fetch() method. Ideally there would be a corresponding write. I don't plan to work on that until the rest of the vector API is complete. Do you have other thoughts along those lines, as far as using the Arrow interface beyond the first step of exposing the stream?
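As a rough idea of what reading through the stream involves on the consumer side, here is a minimal sketch of the batch loop defined by the Arrow C stream interface: call `get_next` until it hands back an array whose `release` is null. The struct layouts follow the Arrow specification; `count_rows()`, `MockBatches` and `make_mock_stream()` are hypothetical names, and the mock producer stands in for a real layer so the sketch runs without GDAL. This says nothing about how fetch() would actually be implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ArrowSchema;  // not needed for this sketch

// Arrow C data interface array struct (layout per the Arrow specification).
struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

// Arrow C stream interface struct (layout per the Arrow specification).
struct ArrowArrayStream {
    int (*get_schema)(ArrowArrayStream*, ArrowSchema* out);
    int (*get_next)(ArrowArrayStream*, ArrowArray* out);
    const char* (*get_last_error)(ArrowArrayStream*);
    void (*release)(ArrowArrayStream*);
    void* private_data;
};

// Drains the stream batch by batch; returns the total row count, or -1 on error.
int64_t count_rows(ArrowArrayStream* stream) {
    assert(stream != nullptr && stream->release != nullptr);
    int64_t total = 0;
    for (;;) {
        ArrowArray batch{};
        if (stream->get_next(stream, &batch) != 0) return -1;
        if (batch.release == nullptr) break;  // released array signals end of stream
        total += batch.length;
        batch.release(&batch);  // the consumer releases each batch it received
    }
    return total;
}

// Mock producer, used only to exercise the loop.
struct MockBatches {
    std::vector<int64_t> lengths;  // row count of each batch to emit
    std::size_t pos = 0;
    int released_batches = 0;
};
static void mock_array_release(ArrowArray* a) {
    static_cast<MockBatches*>(a->private_data)->released_batches++;
    a->release = nullptr;
}
static int mock_get_next(ArrowArrayStream* s, ArrowArray* out) {
    auto* m = static_cast<MockBatches*>(s->private_data);
    *out = ArrowArray{};  // all zero, so release is null unless set below
    if (m->pos < m->lengths.size()) {
        out->length = m->lengths[m->pos++];
        out->release = mock_array_release;
        out->private_data = m;
    }
    return 0;
}
static int mock_get_schema(ArrowArrayStream*, ArrowSchema*) { return 0; }
static const char* mock_get_last_error(ArrowArrayStream*) { return nullptr; }
static void mock_stream_release(ArrowArrayStream* s) { s->release = nullptr; }
ArrowArrayStream make_mock_stream(MockBatches* m) {
    return ArrowArrayStream{mock_get_schema, mock_get_next, mock_get_last_error,
                            mock_stream_release, m};
}
```

A real reader would convert each batch's buffers into R vectors instead of just counting rows, and would also consult `get_schema` and `get_last_error`.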

@mdsumner
Collaborator Author

No other thoughts, I just want to be up with recent moves and be able to explore while Arrow takes over everything 😀
