Add sfarrow to install_geospatial.sh #692
Conversation
Add `geoarrow` dep
add `sfarrow`
@yuvipanda ☝️ |
Thanks @ranchodeluxe! Can you update the PR description with a more detailed note about:
I think that would help the maintainers of this project make a decision on whether it should be included here or not. I personally think this is a good place for it :D |
Thanks @yuvipanda |
Thanks, @ranchodeluxe! @cboettig what do you think? :) |
The failing check looks unrelated |
supporting geoparquet makes sense, though I think that's important mostly because geoparquet files might be somewhat large, and users may want to extract a subset from them without downloading the whole file. They can do that with geoparquet using the same /vsicurl/ mechanism, a la:

```r
sf::read_sf("/vsicurl/https://data.source.coop/cholmes/eurocrops/geoparquet-projected/NL_2020_EC21.parquet",
            wkt_filter = subset_polygon)
```

for some subsetting polygon. Unfortunately, it looks like the gdal build in the most recent LTS doesn't include the Arrow/Parquet drivers.

So, I see the pro argument as a stop-gap measure while LTS gdal doesn't have arrow drivers (which might be resolved in future LTS releases), but I'm worried the package gives the wrong impression that somehow sf can't handle geoparquet natively. Thoughts on this? Have you tried |
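For readers unfamiliar with `wkt_filter`: `subset_polygon` above is just a WKT string. A minimal, hypothetical way to build one with sf is sketched below; the bounding-box coordinates are made up and would need to be expressed in the layer's CRS.

```r
# Hypothetical sketch: build a WKT polygon to pass as wkt_filter.
# Coordinates are made up and must match the layer's CRS.
library(sf)

bb <- st_bbox(c(xmin = 5.0, ymin = 52.0, xmax = 5.5, ymax = 52.5), crs = 4326)
subset_polygon <- st_as_text(st_as_sfc(bb))
```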
Disclaimer: I haven't used R for a few years, but hopefully I can add some GeoParquet context.
Reading GeoParquet through GDAL can be a lot slower than reading through a Parquet driver. For example, in GeoPandas, reading with the default fiona engine through GDAL is about 75x slower than with the Parquet driver:

```python
In [1]: import geopandas as gpd

In [2]: %time gdf = gpd.read_file('./nz-building-outlines.parquet', engine="fiona")
CPU times: user 5min 29s, sys: 33.6 s, total: 6min 2s
Wall time: 6min 15s

In [3]: %time gdf = gpd.read_parquet('./nz-building-outlines.parquet')
CPU times: user 4.4 s, sys: 1.02 s, total: 5.43 s
Wall time: 5.02 s
```

(This is a 410MB file, downloaded from here.)

The reason for this is that GeoParquet and GeoPandas are both columnar (I assume the same holds on the R side), while the traditional GDAL/OGR feature API is row-based, so the data has to be transposed from columns to rows and back again.

If you use GDAL's new Arrow interface defined in RFC 86, it would likely be much faster, because there's no transpose back and forth (not sure if this has been implemented in R anywhere). But this is likely still slower than using a Parquet driver directly, because GDAL's RFC 86 implementation uses a fixed row group size (currently set to 65536 by default). So if this row group size doesn't match the underlying row group size in the Parquet files, GDAL will have to slice the original row groups into the new size, with extra copies.

Just a note that the current GeoParquet spec doesn't yet include spatial indexing/partitioning support, so this will still fetch the entire file into memory and then do a filter. |
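Not from the thread, but for anyone who wants to check how a particular file is chunked: the arrow R package can report a Parquet file's row-group layout. This sketch assumes arrow's `ParquetFileReader` interface and a local copy of the file benchmarked above.

```r
# Sketch: inspect a Parquet file's row-group layout with the arrow R package.
library(arrow)

reader <- ParquetFileReader$create("nz-building-outlines.parquet")
reader$num_rows        # total number of rows in the file
reader$num_row_groups  # how the file is chunked; relevant to the fixed
                       # 65536-row batches discussed above
```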
@cboettig brings up a good point that we will likely want the direct parquet support in either/both of these. @kylebarron brings up another good point that we probably need both OGR and direct Parquet support. Maybe for now we just roll with `sfarrow`. |
That doesn't apply to the Arrow & Parquet drivers, whose specialized GetNextArrowArray() implementations map directly onto arrow::ExportRecordBatch(). I'd be surprised if going through GDAL with the ArrowArray interface were measurably slower than using a specialized Parquet reader. |
Ah, apologies! You are right! pyogrio using the Arrow API is just about the same speed as pyarrow. |
Huge thanks to everyone, many great points here and I'm learning a lot from all of you. I think this raises several different issues, all of which are worth consideration, from the immediate concern of adding geoparquet parsing to bigger questions about the rest of the geospatial stack.

@kylebarron, thanks for raising the point about row-wise vs columnar parsing in GDAL; I was unaware of that and, as you show, it definitely makes a difference. Here on the R side, for comparison, reading the same file:

```r
bench::bench_time({
  x <- sf::read_sf("/vsicurl/https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet")
})
# process    real
#   2.11m   2.38m

bench::bench_time({
  sfarrow::st_read_parquet("https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet")
})
# process    real
#   36.1s   45.2s
```

Perhaps I need to be telling GDAL to use columnar reads to get better performance (at least on the current GDAL)? Anyway, I agree with the general argument that it makes sense to support both direct parsers and GDAL, and the former may be particularly compelling at least for the LTS-pinned images.

I am still trying to wrap my head around support for lazy operations outside of RAM here, which seems important. For instance, note that:

```r
library(duckdbfs) # remotes::install_github("cboettig/duckdbfs")
library(dplyr)

bench::bench_time({
  x <-
    open_dataset("https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet") |>
    mutate(geometry = ST_GeomFromWKB(geometry)) |>
    to_sf()
})
# process    real
#     37s   38.2s
```

Moreover, this is compelling I think since the lazy operations mean that those first two commands (open_dataset + mutate) take about 20 ms, and we can add a wide array of additional spatial operations to the SQL (or dplyr) commands, e.g. spatial filters, before any data is pulled into R (see the sketch after this comment).

Action-wise, I think it makes sense to add `sfarrow` here.

**Adjusting the geospatial stack**

Beyond adding `sfarrow`, there is the larger question of how we handle the underlying geospatial system libraries. So far I have preferred to keep the apt repos on the core tags contained to the official LTS repos for long-term support, security, and compatibility.

Thanks everyone for sharing your expertise, all this input helps a lot and Rocker has always been driven by the community of users and devs! |
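To make the lazy-filter idea above concrete, here is a hypothetical sketch (not from the thread). It assumes DuckDB's spatial extension is available to duckdbfs and that dbplyr passes the `ST_*` calls through to SQL; the area-of-interest polygon is made up and would need to be in the dataset's CRS.

```r
# Hypothetical sketch: lazily filter by an area of interest before materializing.
library(duckdbfs)
library(dplyr)

# Made-up WKT polygon; coordinates must match the dataset's CRS.
aoi <- "POLYGON ((174.7 -36.9, 174.9 -36.9, 174.9 -36.7, 174.7 -36.7, 174.7 -36.9))"

buildings <- open_dataset(
  "https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet"
) |>
  mutate(geometry = ST_GeomFromWKB(geometry)) |>
  filter(ST_Within(geometry, ST_GeomFromText(!!aoi)))  # still lazy, evaluated by DuckDB

x <- to_sf(buildings)  # only now are the matching rows pulled into R as an sf object
```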
Specifically for GDAL, I would rally for always being up to date with the latest release; staying behind in versions has never made sense to me. For lazy ops in R, I recommend using osgeo.gdal via reticulate, because R simply doesn't have access to the facilities GDAL provides (except in a few chosen cases across disparate packages, and even then the really best ones are throttled or masked), but the native Python interface obviously does. It's similar for other packages: rasterio, rioxarray etc. have good features, but they aren't the native facility, so please take care to ensure you are leveraging and benchmarking the actual GDAL library. |
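A minimal sketch of that suggestion (not from the thread), assuming the `osgeo` Python bindings are installed in whichever Python environment reticulate picks up, and that the GDAL build includes the Parquet driver:

```r
# Sketch: call the native GDAL Python API from R via reticulate.
library(reticulate)

gdal <- import("osgeo.gdal")
gdal$UseExceptions()

ds  <- gdal$OpenEx("/vsicurl/https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet")
lyr <- ds$GetLayer()
lyr$GetFeatureCount()
```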
As an aside, newer gdal releases have much improved TileDB support (see e.g. https://gdal.org/drivers/raster/tiledb.html), opening the door to parallel / cloud-native read/write ops and more. Most of what my colleagues do in geospatial applications is in Python, but I think we could have compelling and performant examples in R. As I am not much of a geospatial user myself, I may not be the best person to drive that, but if someone would like to poke at it, I'd be more than happy to help keep the geospatial stats space multilingual and multi-API. |
Picking this back up: is there any way we can get a more recent GDAL in these images? The sad news is I can't understand your build pipeline well enough to put in a PR myself 😄 |
Multiple ideas here.
I need to explore/discuss a bit more about maintaining binary compatibility with RSPM / PPM builds. As I outlined above, I like keeping the apt repos on the core tags contained to the official LTS repos for long-term support, security, and compatibility, but maybe there are more options here we should be considering. @gaborcsardi, do you have any thoughts about rocker maintaining compatibility with Posit's ubuntu:latest binaries while supporting more recent GDAL releases? Are the binaries Posit provides for 22.04 always built against the gdal in 22.04? |
Well there is
17 seconds all in, can add |
Given that we currently automatically track and record GDAL and other releases in our Dockerfile, I don't see how installing the latest GDAL and other releases diverges from the purpose of this repository. (I think it could be treated the same as RStudio Server)
However, I believe that the R packages will have to be source installed, as they will obviously be incompatible with the binaries provided by PPM. |
Thanks @eitsupi -- right, there are two threads here. I think the open question is how to make the most recent gdal version more widely available (e.g. on the binder images). As we have both noted, we already have the script to do this in the repo. I guess technically this change is only needed for packages binding these libraries, and I'm still wondering if there is a more clever option here, which is why I tagged @gaborcsardi. It would be wonderful to have an image that could offer the benefits of access to the latest official GDAL release while still benefiting from binary builds. I can move this to a new thread, since none of this is really relevant to adding sfarrow, which doesn't even have compiled code. |
Thanks for making the PR here, @ranchodeluxe! |
Yes, the Posit binaries for 22.04 are built using the GDAL etc. packages in 22.04, I am pretty sure. If the newer GDAL packages do not break the API and ABI, then they will work with them, otherwise they might not. |
Hello, some scientists on the NASA VEDA platform using this image on RStudio wanted to read/write geoparquet files but didn't see the common libraries for interacting with geoparquet files installed, so this PR adds them.

`geoarrow` doesn't have a CRAN release yet, so let's add `sfarrow` for starters, please. The `sfarrow` lib can read/write parquet files with simple feature geometries.
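For context, a minimal sketch of the round trip `sfarrow` enables, using sf's bundled North Carolina example data (the output path is arbitrary):

```r
# Sketch: write an sf object to (Geo)Parquet with sfarrow and read it back.
library(sf)
library(sfarrow)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
st_write_parquet(nc, "nc.parquet")
nc2 <- st_read_parquet("nc.parquet")
```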