Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sfarrow to install_geospatial.sh #692

Merged
merged 2 commits into from
Sep 9, 2023

Conversation

ranchodeluxe
Copy link
Contributor

@ranchodeluxe ranchodeluxe commented Aug 30, 2023

Hello, some scientists on the NASA VEDA platform using this image on RStudio wanted to read/write geoparquet files but didn't see the common libraries for interacting with geoparquet files installed. So adding them in this PR. geoarrow doesn't have a CRAN so let's add sfarrow for starters please.

  • GeoParquet is an OGC standard that adds geospatial types (Point, Line, Polygon) to Parquet files
  • sfarrow lib can read/write parquet files with simple geometry features

@ranchodeluxe
Copy link
Contributor Author

@yuvipanda ☝️

@yuvipanda
Copy link
Contributor

yuvipanda commented Aug 30, 2023

Thanks @ranchodeluxe! Can you update the PR description with a more detailed note about:

  1. What is geoparquet, and why / where it is being used?
  2. What functionality does the sfarrow package provide?

I think that would help the maintainers of this project make a decision on wether it should be included here or not.

I personally think this is a good place for it :D

@ranchodeluxe
Copy link
Contributor Author

Thanks @yuvipanda

@yuvipanda
Copy link
Contributor

Thanks, @ranchodeluxe!

@cboettig what do you think? :)

@yuvipanda
Copy link
Contributor

The failing check looks unrelated

@cboettig
Copy link
Member

supporting geoparquet makes sense, though sf (and terra) can already read and write geoparquet because GDAL supports geoparquet, including over the gdal virtual filesystem.

I think that's important because geoparquet files might be somewhat large, and users may want to extract a subset from them without downloading them, and they can do that with geoparquet using the same sf (or terra) already in use for reading any of the scores of other spatial vector formats,

a la

sf::read_sf("/vsicurl/https://data.source.coop/cholmes/eurocrops/geoparquet-projected/NL_2020_EC21.parquet",
            wkt_filter = subset_polygon)

for some subsetting polygon. Unfortunately it looks like our gdal build in the most recent rocker/geospatial doesn't have arrow driver support, though rocker/geospatial:dev-osgeo, which builds the latest gdal, does. (users can run /rocker_scripts/experimental/install_dev-osgeo as root on an existing geospatial container to install this too). I guess that means we won't get gdal with libarrow driver support if we stick to installing gdal from the ubuntu LTS repos (which we prefer as the default to help ensure the container is compatible with R package binaries from RSPM, which also build almost exclusively from LTS repos).

The sfarrow package doesn't mention virtual filesystem access, and it's read method doesn't seem to support spatial filters or spatial queries, like the terra and sf read methods do? Maybe it does have support for these and I'm missing it?

So, I see the pro argument as a stop-gap measure while LTS gdal doesn't have arrow drivers (which might be resolved in future LTS releases), but I'm worried the package gives the wrong impression that somehow sf can't handle geoparquet natively.

Thoughts on this? Have you tried rocker/geospatial:dev-osgeo? Overall gdal is moving pretty quickly, e.g. it has support for Zarr drivers, and gdal 3.8 will have support for PMTiles...

@kylebarron
Copy link

kylebarron commented Sep 1, 2023

Disclaimer: I haven't used R for a few years, but hopefully I can add some GeoParquet context.

supporting geoparquet makes sense, though sf (and terra) can already read and write geoparquet because GDAL supports geoparquet, including over the gdal virtual filesystem.

Reading GeoParquet through GDAL can be a lot slower than reading through a Parquet driver. For example in GeoPandas, reading with the default fiona engine through GDAL is 75x slower than with the Parquet driver:

In [1]: import geopandas as gpd

In [2]: %time gdf = gpd.read_file('./nz-building-outlines.parquet', engine="fiona")
CPU times: user 5min 29s, sys: 33.6 s, total: 6min 2s
Wall time: 6min 15s

In [3]: %time gdf = gpd.read_parquet('./nz-building-outlines.parquet')
CPU times: user 4.4 s, sys: 1.02 s, total: 5.43 s
Wall time: 5.02 s

(This is a 410MB file, downloaded from here)

The reason for this is because GeoParquet and GeoPandas are both columnar (I assume sf is also columnar) but the default way to read data from GDAL is its GetNextFeature API. This means that GDAL has to transpose from column-oriented memory to row-oriented memory, and then Python has to transpose back from row-oriented to column-oriented! For most source datasets that are row based, this doesn't matter, but for Parquet it does! (The fiona-based engine for GeoPandas is also slow because it uses a for loop in Python, not C, to assemble the output table. Not sure if there's a vectorized gdal reader in R?).

If you use GDAL's new Arrow interface defined in RFC 86, then it would likely be much faster, because there's no transpose back and forth (Not sure if this has been implemented in R anywhere). But this is likely still slower than using a Parquet driver directly because GDAL's RFC 86 implementation uses a fixed row group size (currently set to 65536 by default). So if this row group size doesn't match the underlying row group size in the Parquet files, GDAL will have to slice the original row group into the new size with extra copies.

I think that's important because geoparquet files might be somewhat large, and users may want to extract a subset from them without downloading them, and they can do that with geoparquet using the same sf (or terra) already in use for reading any of the scores of other spatial vector formats,

a la

sf::read_sf("/vsicurl/https://data.source.coop/cholmes/eurocrops/geoparquet-projected/NL_2020_EC21.parquet",
            wkt_filter = subset_polygon)

Just a note that the current GeoParquet spec doesn't yet include spatial indexing/partitioning support, so this will still fetch the entire file in memory, and then do a filter.

@wildintellect
Copy link

@cboettig brings up a good point, that we actually will likely want the direct parquet support in either/both the dev-osgeo and/or geospatial_unstable images. Since we'll also want to make sure that we are getting newer GDAL/OGR versions too, to get Zarr etc..., and to avoid future issues like rspatial/terra#666 <- this is actually the primary reason Ubuntugis Repos exist (and are compatible with LTS)

@kylebarron brings up another good point that we probably need both OGR and direct Parquet support for sf. This feels more like a documentation/examples problem.

Maybe for now we just roll sfarrow into dev-osgeo and geospatial_unstable. Then the question is, what will it take to get one of these out of experimental? For all the projects with NASA and others I consult on one of these will always be the recommended base image due to the aging of GDAL/OGR, 2 years between LTS is way too long to wait for new GDAL versions (even R packages change dramatically in that time period).

@rouault
Copy link

rouault commented Sep 1, 2023

. But this is likely still slower than using a Parquet driver directly because GDAL's RFC 86 implementation uses a fixed row group size (currently set to 65536 by default)

That doesn't apply to the Arrow & Parquet drivers whose specialized GetNextArrowArray() directly map on arrow::ExportRecordBatch(). I'd be surprised going through GDAL with the ArrowArray interface would be measurably slower than using a specialized Parquet reader.

@kylebarron
Copy link

kylebarron commented Sep 1, 2023

That doesn't apply to the Arrow & Parquet drivers whose specialized GetNextArrowArray() directly map on arrow::ExportRecordBatch(). I'd be surprised going through GDAL with the ArrowArray interface would be measurably slower than using a specialized Parquet reader.

Ah, apologies! You are right! pyogrio using the Arrow API is just about the same speed as pyarrow.

In [5]: %time _gdf = pyogrio.read_dataframe('./nz-building-outlines.parquet')
CPU times: user 27.5 s, sys: 1.99 s, total: 29.5 s
Wall time: 32.1 s

In [9]: %timeit _gdf = pyogrio.read_dataframe('./nz-building-outlines.parquet', use_arrow=True)
2.84 s ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit _gdf = gpd.read_parquet('./nz-building-outlines.parquet')
2.53 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@cboettig
Copy link
Member

cboettig commented Sep 2, 2023

Huge thanks to everyone, many great points here and I'm learning a lot from all of you. I think this raises several different issues all of which are worth consideration, from the immediate concern of adding sfarrow or other mechanisms for geoparquet to the standard rocker/geospatial distribution, to adding support for more recent gdal in something other than the dev-osgeo, graduating dev-osgeo out of experimental status, resolving the strategy between using ubuntugis-PPA (also currently in experimental scripts but not included in the default build matrix), default LTS repos, and source builds, etc).

geoparquet parsing

@kylebarron, thanks for raising the point about rowise vs columnar parsing in GDAL, I was unaware of that and as you show, it definitely makes a difference. Here on the R side, for comparison, in rocker/geospatial:dev-osgeo, I see about 45 sec via sfarrow vs over 2.38 minutes in sf via default GDAL (i.e. a factor of ~ 3x)

bench::bench_time({
x <- sf::read_sf("/vsicurl/https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet") 
})
# process    real 
#  2.11m   2.38m 

bench::bench_time({
+ sfarrow::st_read_parquet("https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet")
+ })
# process    real 
#  36.1s   45.2s 

Perhaps I need to be telling GDAL to use columnar reads to get better performance (at least on the current GDAL)? Anyway, I agree with the general argument that it makes sense to support both direct parsers and GDAL, and the former may be particularly compelling at least for the LTS-pinned rocker/geospatial images.

I am still trying to wrap my head around support for lazy operations outside of RAM here, which seems important. For instance, note that duckdb (with it's spatial extension enabled) can provide lazy reads of geoparquet over the https connection, including support for spatial subsetting and a wide array of other options, e.g.

library(duckdbfs) # remotes::install_github("cboettig/duckdbfs")
library(dplyr)
bench::bench_time({
   x <- 
     open_dataset("https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet") |>
     mutate(geometry = ST_GeomFromWKB(geometry)) |>
     to_sf()
 })
# process    real 
#    37s   38.2s 

Moreover, this is compelling I think since the lazy operations mean that those first two commands (open_dataset + mutate) are about 20ms, and we can add a wide array of additional spatial operations to the SQL (or dplyr) commands, e.g. mutate(area = ST_AREA(geometry)), before calling to_sf(), which executes and collects the result as an sf object in RAM. On the downside, that workflow is less familiar and duckdb spatial support is incomplete and evolving.

Action-wise, I think it makes sense to add sfarrow as proposed to the standard rocker/geospatial install, but we need better discussion and documentation as a community in navigating the best way to read geoparquet. (With gdal 3.8, pmtiles may come into this question as well?)

Adjusting the geospatial stack

Beyond adding sfarrow, I think there's a need to be able to serve more recent versions of GDAL. I'm reluctant to alter the recipe of the main rocker/geospatial:latest / and versioned tags to use PPA or source builds rather than installing GDAL from the default LTS repos for several reasons. In general we have avoided PPAs on the core versioned stack because (a) LTS repos have long maintenance window and stable versions, ensuring frozen dockerfile recipes can rebuild years later (and receive more scrutiny on security, etc), this can help ensure compatibility with binary R packages built by Posit's Package Manager, which (I think, with rare exceptions) compiles against these. We've had issues in the past when a user will install a binary R package from PPM that binds gdal etc, and build another from source that binds more recent gdal or other dev libs installed from a PPA, and thus creates version conflicts and unexpected errors that aren't easily resolved. We definitely need to provide an image with more up-to-date gdal/geos/proj libs, but I think doing so in an explicit tag on geospatial (or even a new image name if that's preferred) would be ideal.

So far dev-osgeo has sought to fill this gap by building those libs from source as you know. The build recipes could be modified (I don't think we're getting all gdal extensions there!) or a case could be made to use the PPA route here too. (We've also had users ask about pre-release gdal cc @mdsumner ). Instead of building from source or PPA, we could also do a multistage build / COPY from one of the official Ubuntu-based osgeo docker images, which might better avoid duplicated effort? Meanwhile, maybe we could start by moving the dev-osgeo recipe out of "experimental", as it has at least proven pretty stable since the migration of gdal to cmake-based builds... I'm totally open to ideas here, but worth more discussion!

Thanks everyone for sharing your expertise, all this input helps a lot and Rocker has always been driven by the community of users and devs!

@mdsumner
Copy link
Contributor

mdsumner commented Sep 2, 2023

Specifically for GDAL, I would rally for always being up to date with latest release, staying behind in versions has never made sense to me.

For lazy ops in R, I recommend using osgeo.gdal via reticulate because R simply doesn't have have access to the facilities GDAL provides (except in a few chosen cases across disparate packages, and even then the really best ones are throttled or masked), but the native python interface obviously does.

It's similar for other packages, rasterio, rioxarray etc have good features but they aren't the native facility so please take care to ensure you are leveraging and benchmarking the actual GDAL library.

@eddelbuettel
Copy link
Member

As an aside, newer gdal releases have much improved TileDB support, see eg https://gdal.org/drivers/raster/tiledb.html opening the door to parallel / cloud-native read/write ops and more. Most of what my colleagues do in geospatial applications is in Python but I think we could have compelling and performant examples in R. But as I am not much of a geospatial user myself I may not be the best person to drive that. If someone would like to poke I'd be more than happy to help keep the geospatial stats space multilingual and multi-API.

@ranchodeluxe
Copy link
Contributor Author

Picking this back up

Is there anyway we can get a rocker/binder:unstable build based on rocker/geospatial:unstable in order to get the newer GDAL to test out? That would be magical ✨

The sad news is I can't understand your build pipeline enough to put in a PR myself 😄

@cboettig
Copy link
Member

cboettig commented Sep 8, 2023

Multiple ideas here.

  • Going to merge this as I think adding sfarrow is sensible and straight-forward, even while there are other options for geoparquet as well.

  • Separately, yes we need a newer gdal. I think rocker/geospatial:unstable (PPA-flavor) isn't built regularly and is way out of date, but rocker/geospatial:dev-osgeo currently builds latest releases (per their GitHub) of GDAL, GEOS and PROJ from source. (Aside but rocker/versioned stack should move to to 24.04 eventually, but that's still a ways out and won't be a solution to stay current on GDAL).

I need to explore/discuss a bit more about maintaining binary compatibility with RSPM / PPM builds. As I outlined above, I like keeping the apt repos on the core tags contained to the official LTS repos for long term support, security, and compatibility, but maybe there's more options here we should be considering. @gaborcsardi do you have any thoughts about rocker maintaining compatibility with Posit's ubuntu:latest binaries while supporting more recent GDAL releases? Are the binaries Posit provides for 22.04 always being built against the gdal in 22.04?

@eddelbuettel
Copy link
Member

eddelbuettel commented Sep 8, 2023

Well there is r2u too:

edd@rob:~$ docker run --rm -ti rocker/r2u Rscript -e 'system.time(install.packages("sf")); library("sf")' 
Get:1 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB] 
[.... lots omitted ...]
Setting up r-cran-sf (1.0-14-1.ca2204.1) ...
Processing triggers for libc-bin (2.35-0ubuntu3.1) ...
   user  system elapsed 
  7.403   2.249  17.394 
Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
edd@rob:~$ 

17 seconds all in, can add arrow and many things equally easily. I like it as a building block.

@eitsupi
Copy link
Member

eitsupi commented Sep 8, 2023

Given that we currently automatically track and record GDAL and other releases in our Dockerfile, I don't see how installing the latest GDAL and other releases diverges from the purpose of this repository. (I think it could be treated the same as RStudio Server)

ENV PROJ_VERSION=9.3.0
ENV GDAL_VERSION=3.7.1
ENV GEOS_VERSION=3.12.0

However, I believe that the R packages will have to be source installed, as they will obviously be incompatible with the binaries provided by PPM.

@cboettig
Copy link
Member

cboettig commented Sep 9, 2023

Thanks @eitsupi -- right, there are two threads here. I think the open question is how to make the most recent gdal version more widely available (e.g. on the binder images). As we have both noted, we already have the script to do this in dev-osgeo, and we have also noted that the script has the necessary but unfortunate consequence of 'turning off' binary builds by changing the default repo away from PPM to point back to the default CRAN cloud CDN.

I guess technically this change is only needed for packages binding these libraries, and I'm still wondering if there is a more clever option here, which is why I tagged @gaborcsardi. It would be wonderful to have an image that could offer both the benefits of having the access to the latest official GDAL release while still benefiting from binary builds.

I can move this to a new thread since none of this is really relevant to adding sfarrow, which doesn't even have compiled code.

@cboettig cboettig merged commit 3741678 into rocker-org:master Sep 9, 2023
32 of 33 checks passed
@yuvipanda
Copy link
Contributor

Thanks for making the PR here, @ranchodeluxe!

@gaborcsardi
Copy link

Yes, the Posit binaries for 22.04 are built using the GDAL etc. packages in 22.04, I am pretty sure. If the newer GDAL packages do not break the API and ABI, then they will work with them, otherwise they might not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants