st_collect(), st_as_sf(), and default conversion from Arrow to R #21
Comments
Two thoughts come to mind:
library(bcdata)
airports <- bcdc_query_geodata("bc-airports")
ap <- airports %>%
filter(LOCALITY == "Vancouver") %>%
select(AIRPORT_NAME) %>%
collect()
class(ap)
[1] "bcdc_sf" "sf" "tbl_df" "tbl" "data.frame"
That's clever! I had never looked into the internals of bcdata but always wished there was an nsdata I could use to make lake maps. I'll definitely be using that for blog posts/reprexes! I suppose the underlying tension is the (maybe just my?) desire for geoarrow to provide a flexible and lightning-fast developer-facing interface but also provide users with some convenience functions.
I hope you don't mind me chiming in here. Just thought I would add a voice for
query <- open_dataset_sf("nc.parquet") %>%
filter(grepl("^A", NAME)) %>%
select(NAME, geometry) %>%
collect() # or st_as_sf()
vs
query <- open_dataset("nc.parquet") %>%
filter(grepl("^A", NAME)) %>%
select(NAME, geometry) %>%
st_collect() # or st_as_sf()
In the end it probably doesn't really matter... but I kind of like defining the expected output upfront - and if someone is not using
Actually one other thought occurred to me... you could define
Love the chiming in! I really like the geo-specific dataset opener concept...much easier for an sf user to follow because we'd get the geometry column handling that's used there. It looks like we'd be able to draw from bcdata's code for this. That said, it's also a lot of work to implement a dplyr backend! My personal development priority right now is at a very low level, but I'm happy to prototype something if there are others willing to take it forward. This could probably be generalized to wrap any dplyr backend with a geometry column, too (e.g., PostGIS, OGR layer via SQL, tibble with a non-sfc geometry column), although at that point there starts to be some significant overlap with sf. Keep the ideas coming and let me know what I can do to move things forward!
Yeah, I definitely wondered about alignment with any other work/conversations for a dplyr backend. There is this which I somehow missed before now: https://github.com/hypertidy/lazysf. It makes sense that the low-level stuff is the priority right now. Unfortunately for the time being I probably don't have much time to actually contribute meaningfully, but the conversations are fun and good to record for the future!
collect() shouldn't return sf; it should always be closest to native, with explicit conversions. lazysf is pretty rubbish, but that distinction to st_as_sf() was pretty clear. Why not a column of smart pointers to GDAL features? In RODBC days a data frame couldn't even print a raw column without erroring; it was stifling ... I just wanted to read from Manifold and push stuff around ... then {blob} came along and lazy_tbl, and now there's even {pool} for keeping pointers alive! So much is possible, this is exciting!
Agreed...I think we may also be able to get some sf-like behaviour by implementing methods like
One of the cool parts about GDAL's RFC86 is that for some drivers a GDAL feature is never instantiated (e.g., for the gpkg driver it just copies the blob directly from sqlite), which is what makes it so much faster. From the R side, we really want to avoid a
That's awesome...I will take a closer look!
Right now, geoarrow doesn't convert to sf by default and instead maintains a zero-copy shell around the ChunkedArray from whence it came. This is instantaneous and is kind of like ALTREP for geometry, since we can't do ALTREP on lists like Arrow does for character, integer, and factor. This is up to 10x faster and prevents a full copy of the geometry column. I also rather like that it maintains neutrality between terra, sf, vapour, wk, or others that may come along in the future...who are we to guess where the user wants to put the geometry column next? The destination could be Arrow itself (e.g., via group_by() %>% write_dataset()), or the column could get dropped, filtered, or rearranged before calling an sf method.

However, 99% of the time a user just wants an sf object. After #20 we can use sf::st_as_sf() on an arrow_dplyr_query to collect() it into an sf object, and @boshek suggested st_collect(), which is a way better name and is more explicit than a st_as_sf(). There's also st_geometry(), st_crs(), st_bbox(), and st_as_crs() methods for the geoarrow_vctr column; however, we still get an awkward error if we collect() and then try to convert to sf:

That might be solvable in sf, although I'd like to give the current implementation a chance to get tested to collect feedback on whether this is or is not a problem for anybody before committing to the current zero-copy-shell-by-default.
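
For illustration only, the st_collect() naming idea could be sketched as a thin wrapper that forwards to sf::st_as_sf(). This is a hypothetical helper, not something this thread confirms exists in sf or geoarrow. It assumes the object passed in already has an st_as_sf() method (as described above for arrow_dplyr_query after #20). A plain data frame of points stands in for the Arrow query so the sketch runs with sf alone.

```r
library(sf)

# Hypothetical helper (an assumption for illustration, not a confirmed API):
# a more explicit name that simply forwards to sf::st_as_sf(). It works for
# any object that has an st_as_sf() method.
st_collect <- function(x, ...) {
  sf::st_as_sf(x, ...)
}

# Stand-in input so the sketch runs without Arrow: a plain data frame of points.
pts <- data.frame(name = c("a", "b"), x = c(0, 1), y = c(0, 1))

# st_as_sf() on a data.frame builds point geometries from the coords columns.
result <- st_collect(pts, coords = c("x", "y"), crs = 4326)
class(result)
```

The wrapper adds no behaviour of its own; the only design choice is the name, which signals up front that the pipeline ends in an sf object rather than in whatever collect() returns natively.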