Introduce bounding box column definition #191
* Add documentation to the top-level GeoParquet description and definition.
* Add the geometry_bbox definition to the json schema
* Add a few tests. Verify with `pytest test_json_schema.py`
How do people feel about the bbox column field names? When putting this together I noticed that geopandas
Thank you for writing this up! Just a few comments to get the discussion going.
Sorry for missing your earlier comment before you got to the PR, but I do think that nesting the "covering" or "simplified_geometry" or "auxiliary_columns" or "something_better" is a more future-proof option: this proposal is for a single struct column that represents the minimum bounding rectangle, but we might need other encodings (or multiple columns for one encoding) and I think it will better communicate that those concepts are related if they are grouped in the same object.
A bounding box column MUST be a Parquet struct
It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.
I think a struct is the best way to go here, but it is worth noting that another option would be to refer to top-level columns. The only reason to do this is that some Parquet scanners can't leverage column statistics for struct columns (although this might not matter for Parquet scanners that will be using this column). This should get revealed in our testing of the bbox column concept!
For three dimensions the additional fields `zmin` and `zmax` MUST be present.
In GeoPackage the dimensions of the envelope are independent of the geometry...while some indexes care about 3D, most indexes only care about 2D and it might be an attractive option to only ever write an XY bounding box here.
Parquet indeed does not speak about "struct" columns, and this is unfortunately one of those areas that is completely underdocumented. I think the relevant term is "group", but in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types that is only used to explain the LIST and MAP logical types. When translating an Arrow struct column, I get the following Parquet schema:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a struct column with four child fields and wrap it in a table
    struct_arr = pa.StructArray.from_arrays([[1, 2]]*4, names=["xmin", "xmax", "ymin", "ymax"])
    table = pa.table({"bbox": struct_arr})

    # Write the table and inspect the resulting Parquet schema
    pq.write_table(table, "test_bbox_schema.parquet")
    pq.read_metadata("test_bbox_schema.parquet").schema

So something like "a group field with 4 child fields with names ..."?
Change the geometry_bbox to the broader "covering" section. Update tests and examples. Made some documentation updates:
* Parquet schema -> group
* Do not require zmin/zmax if geometries have 3 dimensions
Sounds good. I just made the updates to go to that original proposal with the "covering" section. I admit to not having a better word so let's stick with it for now.
Good call. I went with @jorisvandenbossche's language here. And I agree, the Parquet docs/website are really lacking on this.
Agreed. We can change as necessary depending on the outcome of that testing.
Thanks. I changed the language to read
@jorisvandenbossche Thanks! I made some updates and went with this wording.
I'm just thinking that the use of a struct/group with child fields could potentially be a complication for some implementations. I mean the "flat" model is quite a common one among geospatial formats. I struggled a bit to deal with Parquet nested constructs when mapping them to the OGR data model. I've bitten that bullet now, so the current proposal is perfectly fine to me. Just wanted to raise that potential issue for other "simple" hypothetical implementations. But perhaps dealing with Parquet means that dealing with nested constructs is an implicit requirement.
What about specifically listing the names of the paths in the schema:

    "covering": {
        "box": {
            "xmin": "bbox.minx",
            "ymin": "bbox.miny",
            "xmax": "bbox.maxx",
            "ymax": "bbox.maxy"
        }
    }

Then it can be straightforward to either use dot-access notation as above, or to refer to columns at the top-level:

    "covering": {
        "box": {
            "xmin": "minx",
            "ymin": "miny",
            "xmax": "maxx",
            "ymax": "maxy"
        }
    }

This also means that a struct point column could be defined using something like

    "covering": {
        "box": {
            "xmin": "geometry.x",
            "ymin": "geometry.y",
            "xmax": "geometry.x",
            "ymax": "geometry.y"
        }
    }

And a geoarrow linestring column could be defined using something like

    "covering": {
        "box": {
            "xmin": "geometry.list.element.list.element.x",
            "ymin": "geometry.list.element.list.element.y",
            "xmax": "geometry.list.element.list.element.x",
            "ymax": "geometry.list.element.list.element.y"
        }
    }

(The exact syntax is up for future discussion; that's what pyarrow.parquet shows for the
Do we want to enable this? Because that gives different abilities. I would expect this covering to be actual min/max values, while for geoarrow that would be a list of values. While for row group statistics this will work the same, that is not the case for plain row filtering.
I think the question is whether Given that creating bounding boxes from a geometry column (once in memory) should be very fast, I think the main goal should be on solidifying a flexible way to describe the MBR of the chunk in the row group metadata, without defining what the values of that column must be. I like the above because it's flexible among struct and top-level columns while also being future proof to a geoarrow encoding. (Though to be clear, I'm not arguing for this PR to explicitly allow geoarrow; for now we can only document top-level columns and a struct column)
One nice outcome of this approach is it avoids the duplication (triplication?) of point coordinates that @rouault raised in #188 (comment). Although I think I share in @jorisvandenbossche's concern that allowing it to point to non-primitive float columns makes simple row filtering more complicated.
Agreed on that: I would expect the content of the xmin/... field to be a scalar to KISS
Just chiming in with my 2c. DuckDB can't do predicate pushdowns into struct columns yet (hopefully someday!), and I imagine other simpler readers may also struggle to fully utilize nested types, so I would think the "flat" representation is the most practical/has the lowest "barrier of entry". Although from a conceptual standpoint it's kind of a bummer because I feel like this is the perfect use case for structs :/.
Just popping in with a 👍 to this proposal. GDAL’s new support for column stats in 3.8.0 combined with spatially-clustered rows and relatively-small row groups has shown excellent performance in my tests with country-scale OSM datasets. I hadn’t seen the work on structs here so I was experimenting with separate xmin/ymin/xmax/ymax columns. They seem to be equivalent for read performance.
I like the upside of being able to handle both structs and top level columns, and handling points without duplication would definitely be a win. But I worry about the complexity - a reader would need to at least know about all the options, and if it's not capable of predicate pushdown would have to know that some fields wouldn't work. And ideally a reader would notify the user in some way that it's not working / implemented. And writer implementations would have to make choices of how to do it. And either push that decision on the user, or else we recommend a 'default'. So if we just allow any field specified then it feels like we just sorta punt on the question of which approach to do, and we push the complexity out to the ecosystem. And it'd be harder to just look at a parquet file and know just by looking at the data if it's got a bbox - if we have one set approach of fields of set names then you can be pretty sure if a dataset has that then it is implementing a spatial optimization.

So I think I lean towards first picking one way and seeing if it will work for everyone. But it could be interesting to use this way to specify that one way if we want to future proof it a bit - like we list the path names, but we align on just having one locked in value for the first release, like how we did the geometry encoding. Maybe that's what you were proposing? I wasn't sure if the idea was people could specify any field they wanted to serve as the box, or if it'd just be a way to specify a set of options.
One thing that I do think should accompany this is some recommendations on how to actually use this construct to accelerate spatial querying. Like talk through different indexing options, explain a bit about how you need to order the file and how the row group stats work. I think this would likely fit best in some sort of 'best practice' document that sits alongside the spec. And I don't think it needs to be part of this PR, but would be good to have in place for a first 1.1-beta release. I'd be up to take a first crack at it, but would be good for others who are deeper into spatial indexing and parquet to help out.
It seems like top-level columns are the only way for this proposal to fulfill its goal (allow GeoParquet files to be written such that querying a large file for a small area does not have awful performance) for at least one engine we care about (DuckDB)? The other engine I know can't do this is cudf (less of a priority perhaps). It is a bummer because a struct column is definitely the right concept for this. That would mean:
A future version of the spec could allow nesting with
Agreed. We had discussed surveying existing Parquet readers prior to making a decision, but it feels like we already know the answer without having to do more investigation...? As @paleolimbot said, we could go with @kylebarron's suggestion of having the column specified per box element, but the spec could enforce that they MUST point to fields in the root of the schema. We always have the option later to relax that constraint without breaking backward compatibility. Separately, I like the idea of the increased flexibility but wonder if it allows too many personal preferences to exist in the ecosystem... Someone could write:
because they prefer
Perhaps a prefix?
Is that a problem? I don't think readers should be assuming anything about the column names except what is defined in the metadata. If someone wants
As someone that spends a lot of time writing SQL, the real interface of a Parquet file for me is the table schema's columns and data types because most SQL query engines can't read the GeoParquet metadata - at least not for a while. If I want to reuse queries across different GeoParquet datasets, they'll have to be dynamic over column names which is a pain especially if determining the columns requires reading the data before I know how to query it. I guess in my opinion, the more standardized the Parquet schema is - column names and types - and the less flexible the metadata allows it to be, the easier GeoParquet will be to work with and the more successful it will be. This is a very similar discussion to #169 as well. The broader difference of philosophy is beyond the scope of this PR but a fun one that I think will keep coming up :) Either way, I think Chris's proposal to constrain the allowed values could make this work and future-proof to make it more open-ended later.
Thanks everyone for the feedback and good discussion. Here's my read on what would be a good path forward:

### Specify every bounding box column

We can adopt @kylebarron's idea to define the bounding box like this:

### Constraints on the initial release that can be relaxed later

#### Top-Level Columns

Since duckdb and others don't do predicate pushdown on nested columns, the initial definition should require the box columns to be at the schema root. Otherwise we'll be negating most performance benefits for some systems.

#### Restricting Column Names

Per @cholmes's comment, we can start by restricting the column names and pinning them to be

#### What to do about files with multiple geometries?

We can't require the same names for a different geometry column. I like @paleolimbot's idea of constraining the names to be a prefix that is the geometry column name + "_". So the bbox columns for

Thoughts? I can make the PR changes soon if there's agreement on the direction here.
@jwass - I'm not sure if this is what you are suggesting as well, but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns named like
@tschaub That's not quite what I was suggesting since I am proposing adding this to the metadata. That way the systems that can interact with the metadata can still find it in a definitive way. We could define the bounding box more as convention for now without touching the geo metadata yet, and/or add it to the compatible parquet spec. But it seemed like most folks wanted it in there.
If we don't make a change to the geo metadata, it would give us a chance to actually demonstrate that the min/max columns have some value in real use cases - without needing a breaking release if we find out that the initial proposal wasn't quite right.
The min/max columns create a single-level spatial index and are absolutely useful in selection use cases. This example shows a 44MB GeoParquet file for the country of Ghana with only data for the town of Kumasi selected:

Preview, showing in-Kumasi row groups in blue vs. non-Kumasi row groups in gray:

With row groups of 100 features and everything sorted by quadkey of one feature corner we use the column stats alone to identify 106 row groups out of 1,001 total that must be read. This saves almost 90% of required reads. I did some tests with other space-filling curves and they don’t make a substantial difference. It’s the min/max column stats and the clustering alone that do it.

For larger source files this would be a requirement to use the network efficiently. Worldwide OSM road and building data can be dozens or hundreds of GB in size and I wouldn’t be able to afford a table scan to efficiently query a region.
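The row-group selection described above can be sketched in plain Python. This is only an illustration, not any reader's actual implementation: the `RowGroupStats` container and the function names are hypothetical, and the statistics are assumed to have already been read from the Parquet footer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max column statistics for one row group (hypothetical container)."""
    xmin_min: float  # smallest value in the xmin column
    ymin_min: float  # smallest value in the ymin column
    xmax_max: float  # largest value in the xmax column
    ymax_max: float  # largest value in the ymax column


def may_intersect(stats, query):
    """Return True if any row in the group could intersect the query box.

    `query` is a tuple (xmin, ymin, xmax, ymax).
    """
    qxmin, qymin, qxmax, qymax = query
    return (
        stats.xmin_min <= qxmax
        and stats.xmax_max >= qxmin
        and stats.ymin_min <= qymax
        and stats.ymax_max >= qymin
    )


def select_row_groups(all_stats, query):
    """Indices of the row groups that must be read for the query box."""
    return [i for i, s in enumerate(all_stats) if may_intersect(s, query)]


# Three spatially clustered row groups laid out west to east.
groups = [
    RowGroupStats(0.0, 0.0, 10.0, 10.0),
    RowGroupStats(10.0, 0.0, 20.0, 10.0),
    RowGroupStats(20.0, 0.0, 30.0, 10.0),
]
print(select_row_groups(groups, (12.0, 2.0, 15.0, 4.0)))  # [1]
```

The tighter the spatial clustering of rows within each group, the smaller each group's aggregated box and the more groups a query can skip.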
I don't have anything additional to add here!
My only question is what happens when there is a period in `column_name`. I don't think it's hard to imagine that it would live with the lefthand side (i.e., not one of the struct fields), but also not particularly important to solve right now!
@paleolimbot This came up today and @jorisvandenbossche mentioned it might be common in R. My guess is in some SQL engines you'd refer to a column like that with quotes like:
We could require something similar such that periods are nested groups but when they are within quotes they're just part of the column name. The quotes would just have to be properly escaped in the JSON.
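A minimal sketch of that quoting idea, assuming a hypothetical syntax in which unquoted periods separate nested fields and double quotes protect periods that belong to a column name. The function name `split_path` and the exact rules are assumptions for illustration; no escape for a literal quote character is handled.

```python
def split_path(path):
    """Split a dotted column path into parts, honoring double quotes."""
    parts = []
    current = []
    in_quotes = False
    for ch in path:
        if ch == '"':
            # Toggle quoted mode; the quote character itself is dropped.
            in_quotes = not in_quotes
        elif ch == "." and not in_quotes:
            # An unquoted period ends the current path element.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_quotes:
        raise ValueError(f"unterminated quote in path: {path!r}")
    parts.append("".join(current))
    return parts


print(split_path("bbox.xmin"))        # ['bbox', 'xmin']
print(split_path('"my.col".xmin'))    # ['my.col', 'xmin']
```

Even this small parser has edge cases to specify, which is part of why listing each path element as a separate string is attractive.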
Describe bounding box as the "coordinate range" of the geometry which is language used by the GeoJSON spec
A more robust way to specify a "nested path", is to specify each part of the path as a separate string. In Python we would naturally use a tuple for that, like I was quickly looking into how I would implement this bbox filter with pyarrow, and there I will actually need to convert the dotted path we have here to a tuple or list of paths, because to specify a filter when reading I can do something like That means that I would need to split the dotted path. A simple
I was also thinking to that
Or, as we want all the xmin/ymin/xmax/ymax to be subfields of the same field:

    "covering": {
        "bbox": {
            "struct_field": "bbox",
            "xmin_subfield": "xmin",
            "ymin_subfield": "ymin",
            "xmax_subfield": "xmax",
            "ymax_subfield": "ymax"
        }
    }
If we want to be the most future-proof, maybe we should just go with the array:
Kind of annoying but it should make things much easier for readers that need to find the right field in the schema as @jorisvandenbossche pointed out. We can constrain the json schema so that each element has length 2 and fix the final element which gives lots of flexibility to relax those later. I'm good with it.
Update bbox so that each element is an array. For example: ["bbox", "xmin"] which represents a top-level bbox struct field with an xmin member beneath it
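As a rough sketch of how a reader might consume the array form, the snippet below checks the constraints mentioned in this thread (each path has length 2 and its final element is fixed) and then walks a nested row along a path. The function names are hypothetical, plain dicts stand in for Parquet groups, and the check that all four paths share one parent column is an assumption drawn from the discussion rather than confirmed spec text.

```python
REQUIRED_KEYS = ("xmin", "ymin", "xmax", "ymax")


def validate_bbox_covering(bbox):
    """Check a `covering.bbox` entry and return its struct column name."""
    columns = set()
    for key in REQUIRED_KEYS:
        path = bbox.get(key)
        if not isinstance(path, list) or len(path) != 2:
            raise ValueError(f"{key}: expected a 2-element path, got {path!r}")
        column, member = path
        if member != key:
            raise ValueError(f"{key}: final path element must be {key!r}")
        columns.add(column)
    # Assumption: all four paths point into the same top-level column.
    if len(columns) != 1:
        raise ValueError("all bbox paths must point into the same column")
    return columns.pop()


def resolve(row, path):
    """Walk a nested row (dicts standing in for Parquet groups) along a path."""
    value = row
    for part in path:
        value = value[part]
    return value


covering = {
    "bbox": {
        "xmin": ["bbox", "xmin"],
        "ymin": ["bbox", "ymin"],
        "xmax": ["bbox", "xmax"],
        "ymax": ["bbox", "ymax"],
    }
}
row = {"bbox": {"xmin": 1.0, "ymin": 2.0, "xmax": 3.0, "ymax": 4.0}}
print(validate_bbox_covering(covering["bbox"]))                       # bbox
print([resolve(row, covering["bbox"][k]) for k in REQUIRED_KEYS])     # [1.0, 2.0, 3.0, 4.0]
```

Because each path element is its own string, a column name containing a period needs no quoting or splitting.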
Accidentally committed in last version for testing
Updated to new format as described above. @jorisvandenbossche I re-requested a review just to make sure this is in line with your thinking here.
FYI, I've a GDAL implementation of this in OSGeo/gdal#9185
format-specs/geoparquet.md
Outdated

##### bbox covering encoding

Including a per-row bounding box can be useful for accelerating spatial queries by allowing consumers to inspect row group bounding box summary statistics. Furthermore a bounding box may be used to avoid complex spatial operations by first checking for bounding box overlaps. This field captures the column name and fields containing the bounding box of the geometry for every row.
I think that pages should be also mentioned besides row groups, as bbox column also works for page level indexes. Wrote a longer rant about this in #188 (comment)
Just updated to reflect this. Thanks!
@csringhofer Can you take a look at 40ebb37?
Pretty small change just to also mention page indexes in addition to row groups. There aren't any other discussions of row groups in the spec so that's the only change. Let me know if you think that works.
```
"covering": {
    "bbox": {
        "xmin": ["bbox", "xmin"],
```
At first the column name "bbox" was confusing to me as it is the same as the json struct name. Maybe "bbox_col" would be clearer? Afterwards it could be added that if there is a single geometry column, then the recommended bbox column name is simply "bbox".
@csringhofer Are you referring to the example as in:
"covering": {
"bbox": {
"xmin": ["bbox_col", "xmin"],
...
I'm hesitant to change it because our recommendation really is to call it "bbox". I agree it's a bit confusing. If there's anything to rename it might be the "bbox" under covering. It used to be called just "box" in earlier versions of the PR but now that it's just the bbox columns, I put it back. I'm open to other ideas though.
It's indeed a bit confusing here in the example, but for the actual spec I would also keep "bbox" both for the recommended column name and for the key here in the metadata.
Maybe another example could be added with bbox column for multiple geometry columns.
It is also not clear what is the recommended name in that case - there is an example with "any_column", but using something like "geom_column_name_bbox" seems clearer to me.
> but using something like "geom_column_name_bbox" seems clearer to me.
FWIW, that's the convention I've used in the GDAL writer
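To make the naming discussion concrete, here is a small sketch that builds `covering` entries for one or several geometry columns. It assumes the recommended name `bbox` when there is a single geometry column and the `<geometry column>_bbox` convention mentioned above otherwise, and it assumes `covering` sits inside each geometry column's metadata entry. The helper names are hypothetical.

```python
BBOX_KEYS = ("xmin", "ymin", "xmax", "ymax")


def bbox_column_name(geometry_column, all_geometry_columns):
    """Pick a bbox column name (assumed convention, see lead-in)."""
    if len(all_geometry_columns) == 1:
        return "bbox"
    return f"{geometry_column}_bbox"


def covering_for(bbox_column):
    """Build the covering entry pointing at the given bbox struct column."""
    return {"bbox": {key: [bbox_column, key] for key in BBOX_KEYS}}


def build_columns_metadata(geometry_columns):
    """Map each geometry column to a metadata entry holding its covering."""
    return {
        name: {"covering": covering_for(bbox_column_name(name, geometry_columns))}
        for name in geometry_columns
    }


single = build_columns_metadata(["geometry"])
multi = build_columns_metadata(["geom_a", "geom_b"])
print(single["geometry"]["covering"]["bbox"]["xmin"])  # ['bbox', 'xmin']
print(multi["geom_b"]["covering"]["bbox"]["ymax"])     # ['geom_b_bbox', 'ymax']
```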
format-specs/geoparquet.md
Outdated

### Bounding Box Columns

A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE`. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.
Maybe it could be added that the coordinates must have the same type, e.g. all FLOAT or all DOUBLE
About repetition: while the struct's repetition must be the same as the geometry column's, the nested fields' repetition must be "required", right? So they can never be null if their parent is not null.
> Maybe it could be added that the coordinates must have the same type, e.g. all FLOAT or all DOUBLE

I concur with that! This is an assumption I've actually made in my GDAL implementation
Updated.
Made the update in 40ebb37. Let me know if okay or not.
* Mention that bounding boxes are useful for both row group statistics and page indexes
* Require that bbox encoding columns be all of the same type (float or double)
Can we merge this in? Any remaining work to do? If I don't hear in a day or two I'm going to hit the green button.
format-specs/geoparquet.md
Outdated

### Bounding Box Columns

A bounding box column MUST be a Parquet group field with 4 child fields named `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range. For three dimensions the additional fields `zmin` and `zmax` MAY be present but are not required. The fields MUST be of Parquet type `FLOAT` or `DOUBLE` and all columns MUST use the same type. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.
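The rules in this paragraph can be expressed as a small checker. This is a sketch under stated assumptions: the schema is modeled as a plain dict from child field name to Parquet physical type name, repetitions are plain strings, and the function names are hypothetical rather than part of any GeoParquet tooling.

```python
REQUIRED_FIELDS = {"xmin", "xmax", "ymin", "ymax"}
ALLOWED_TYPES = {"FLOAT", "DOUBLE"}


def check_bbox_column(fields, bbox_repetition, geometry_repetition):
    """Return a list of rule violations (empty when the column conforms).

    `fields` maps child field name -> Parquet physical type name.
    """
    problems = []
    missing = REQUIRED_FIELDS - set(fields)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    bad_types = {n: t for n, t in fields.items() if t not in ALLOWED_TYPES}
    if bad_types:
        problems.append(f"fields must be FLOAT or DOUBLE: {bad_types}")
    if len(set(fields.values())) > 1:
        problems.append("all fields must use the same type")
    if bbox_repetition != geometry_repetition:
        problems.append("bbox repetition must match the geometry column")
    return problems


def check_row(geometry, bbox):
    """A row has a bbox value if and only if it has a geometry value."""
    return (geometry is None) == (bbox is None)


ok = {"xmin": "DOUBLE", "xmax": "DOUBLE", "ymin": "DOUBLE", "ymax": "DOUBLE"}
mixed = dict(ok, xmin="FLOAT")
print(check_bbox_column(ok, "optional", "optional"))     # []
print(check_bbox_column(mixed, "optional", "optional"))  # ['all fields must use the same type']
print(check_row(None, None), check_row(b"wkb", None))    # True False
```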
> `xmin`, `xmax`, `ymin`, and `ymax` representing the geometry's coordinate range.
I am confused about the semantics in case of spherical geometries.
"range" suggests to me that xmin should be always <= xmax, but this is not true in the spherical case, right? How to represent a bbox that crosses the 180.0° line of longitude or contains a pole? Or such a bbox cannot be represented?
It would be nice to add some guidance/warning about interpreting in the spherical case.
For the bounding box at the file-level metadata (https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#bbox), we refer to the GeoJSON spec:
For geometries in a geographic coordinate reference system, longitude and latitude values are listed for the most southwesterly coordinate followed by values for the most northeasterly coordinate. This follows the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for a set of geometries that cross the antimeridian.
Would that work here as well / provide sufficient information?
Of course, in contrast with the bbox metadata for the full file which is just a JSON array of 4 numbers, here we need to give the numbers explicit field names. The current proposal uses "xmin", "xmax", etc, which is not ideal for the geographical case.
(in general a simple bbox might not be ideal for geographical data anyway, and the current proposal leaves it open to add other "covering" types later)
Good point, @csringhofer.
I agree with @jorisvandenbossche that we'd just defer to how GeoJSON defines bbox for anti-meridian crossings. (There's another discussion about also adopting GeoJSON's recommendation to split geometries at the anti-meridian but that's probably further out).
I suppose we could rename the bbox fields to "south", "west", etc. but it feels off to me and I think xmin, xmax is still the right name with the caveats Joris listed. It's also worth mentioning that naive queries against the bbox won't be effective for anti-meridian crossings, including row group filtering optimizations. But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately including row group filtering
I added this sentence to the docs: As with the top-level [bbox](#bbox) column, the values follow the GeoJSON specification (RFC 7946, section 5), which also describes how to represent the bbox for geometries that cross the antimeridian.
> But I think engines with specific knowledge of geospatial data could handle anti-meridian crossings appropriately including row group filtering

I assume that would require that the geometries crossing the anti-meridian be in dedicated row groups. If you start mixing geometries crossing the A.M. and geometries not crossing it, then the min(minx), min(miny), max(maxx), max(maxy) statistics aren't going to make any sense.
For example, if you have a geometry [-10,-10,10,10] and a [170,-10,-170,10] (crossing A-M) in a single row group, then the row group stats are going to be [-10,-10,10,10] and thus the geometry crossing A-M will not be selected.
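The arithmetic in that example can be checked with a short sketch. The helper names are hypothetical; boxes are (xmin, ymin, xmax, ymax) tuples, and the aggregation mimics what min/max column statistics would record for a row group.

```python
def row_group_stats(boxes):
    """Aggregate per-row boxes the way min/max column statistics would."""
    return (
        min(b[0] for b in boxes),  # min of the xmin column
        min(b[1] for b in boxes),  # min of the ymin column
        max(b[2] for b in boxes),  # max of the xmax column
        max(b[3] for b in boxes),  # max of the ymax column
    )


def naive_overlap(stats, query):
    """Plain interval-overlap test with no antimeridian awareness."""
    sxmin, symin, sxmax, symax = stats
    qxmin, qymin, qxmax, qymax = query
    return sxmin <= qxmax and sxmax >= qxmin and symin <= qymax and symax >= qymin


boxes = [
    (-10, -10, 10, 10),    # ordinary geometry
    (170, -10, -170, 10),  # crosses the antimeridian, so xmin > xmax
]
stats = row_group_stats(boxes)
print(stats)  # (-10, -10, 10, 10)

# A query just west of the antimeridian lies inside the second box,
# yet the aggregated statistics say the row group can be skipped.
query = (175, -5, 179, 5)
print(naive_overlap(stats, query))  # False
```

The crossing geometry's xmin of 170 and xmax of -170 are both absorbed by the other row's values, so its extent leaves no trace in the statistics.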
Hmm, yes, that's a good point. Silently not selecting a row if you are not aware of this, doesn't sound good.
Shall we for now just say that this feature (bbox column) doesn't support A-M crossing geometries, and thus cannot be used for data including such geometries?
Good point, @rouault. We discussed this at the last GeoParquet meeting and decided that we'll just say that antimeridian crossings aren't supported for now. I'll also make an issue to track that
Created #198 to continue that discussion.
Thanks for the last updates!
Going to merge this now, we can always iterate a bit further on the details in follow-ups (such as the antimeridian case for which you already opened an issue)
Introduces the per-row bounding box definition following the discussion from #188.
This initial proposal goes with a definition that looks like:
`pytest test_json_schema.py`
`poetry run python generate_example.py` and `poetry run python update_example_schemas.py`