Introduce bounding box column definition #191

Merged
merged 15 commits into from
Mar 11, 2024
Changes from 2 commits
Binary file modified examples/example.parquet
Binary file not shown.
3 changes: 3 additions & 0 deletions examples/example_metadata.json
@@ -108,6 +108,9 @@
},
"edges": "planar",
"encoding": "WKB",
"geometry_bbox": {
"column": "bbox"
},
"geometry_types": [
"Polygon",
"MultiPolygon"
16 changes: 16 additions & 0 deletions format-specs/geoparquet.md
@@ -58,6 +58,8 @@ Each geometry column in the dataset MUST be included in the `columns` field abov
| edges | string | Name of the coordinate system for the edges. Must be one of `"planar"` or `"spherical"`. The default value is `"planar"`. |
| bbox | \[number] | Bounding Box of the geometries in the file, formatted according to [RFC 7946, section 5](https://tools.ietf.org/html/rfc7946#section-5). |
| epoch | number | Coordinate epoch in case of a dynamic CRS, expressed as a decimal year. |
| geometry_bbox | object | Object specifying a column name of a [Bounding Box Column](#bounding-box-columns). |


#### crs

@@ -134,6 +136,20 @@ For non-geographic coordinate reference systems, the items in the bbox are minim

The bbox values are in the same coordinate reference system as the geometry.

#### geometry_bbox

Including a per-row bounding box can accelerate spatial queries by allowing consumers to inspect row group bounding box summary statistics. Furthermore, a bounding box may be used to avoid complex spatial operations by first checking for bounding box overlaps. This field captures the name of a column containing the bounding box of the geometry for every row.
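
As a rough illustration of the overlap pre-check mentioned above, the following sketch uses plain Python dictionaries keyed by `xmin`, `ymin`, `xmax`, and `ymax`. The function name `bbox_intersects` and the sample rows are hypothetical and not part of this specification; the sketch also assumes `xmin <= xmax` and `ymin <= ymax`, so boxes that cross the antimeridian are not handled.

```python
def bbox_intersects(a, b):
    """Return True if two 2D bounding boxes overlap or touch."""
    return (
        a["xmin"] <= b["xmax"]
        and a["xmax"] >= b["xmin"]
        and a["ymin"] <= b["ymax"]
        and a["ymax"] >= b["ymin"]
    )


# Hypothetical per-row bounding boxes, as they might be read from a bbox column.
rows = [
    {"id": 1, "bbox": {"xmin": 0.0, "ymin": 0.0, "xmax": 1.0, "ymax": 1.0}},
    {"id": 2, "bbox": {"xmin": 5.0, "ymin": 5.0, "xmax": 6.0, "ymax": 6.0}},
]
query = {"xmin": 0.5, "ymin": 0.5, "xmax": 2.0, "ymax": 2.0}

# Only rows whose bounding box overlaps the query need the exact geometry test.
candidates = [row["id"] for row in rows if bbox_intersects(row["bbox"], query)]
```

Rows that fail this cheap test can be skipped before any geometry is decoded from WKB.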

The format of `geometry_bbox` is `{"column": "column_name"}` where `column_name` MUST exist in the Parquet file and meet the criteria in the [Bounding Box Column](#bounding-box-columns) definition.

Note: the column named in this field should not be confused with the [`bbox`](#bbox) field, which contains a single bounding box covering the geometries of this column over the whole GeoParquet file.
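
A reader might resolve the bounding box column roughly as follows. This is a minimal sketch that assumes the `column` key required by `schema.json` in this change; the helper name `bbox_column_for` and the trimmed metadata payload are hypothetical.

```python
import json

# Trimmed, hypothetical "geo" metadata modelled on the test template in this change.
geo_metadata = json.loads(
    """
    {
      "columns": {
        "geometry": {
          "encoding": "WKB",
          "geometry_types": [],
          "geometry_bbox": {"column": "bbox"}
        }
      }
    }
    """
)


def bbox_column_for(metadata, geometry_column):
    """Return the bounding box column name for a geometry column, or None."""
    column_meta = metadata["columns"][geometry_column]
    geometry_bbox = column_meta.get("geometry_bbox")
    if geometry_bbox is None:
        # geometry_bbox is not present, so there is no per-row bbox column.
        return None
    return geometry_bbox["column"]


bbox_column = bbox_column_for(geo_metadata, "geometry")
```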

### Bounding Box Columns

A bounding box column MUST be a Parquet struct with required fields `xmin`, `xmax`, `ymin`, and `ymax`. For three dimensions the additional fields `zmin` and `zmax` MUST be present. The fields MUST be of Parquet type `FLOAT` or `DOUBLE`. The repetition of a bounding box column MUST match the geometry column's [repetition](#repetition). A row MUST contain a bounding box value if and only if the row contains a geometry value. In cases where the geometry is optional and a row does not contain a geometry value, the row MUST NOT contain a bounding box value.

The bounding box column MUST be at the root of the schema. The bounding box column MUST NOT be nested in a group.
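
The row-level rules above could be checked along these lines. This is only a sketch over plain Python values: the helper `check_bbox_row` is hypothetical, a Python `float` stands in loosely for Parquet `FLOAT` or `DOUBLE`, and schema-level rules (repetition, root placement) are not covered.

```python
REQUIRED_2D = ("xmin", "xmax", "ymin", "ymax")
REQUIRED_3D = REQUIRED_2D + ("zmin", "zmax")


def check_bbox_row(geometry, bbox, three_d=False):
    """Return a list of rule violations for one row (empty if there are none)."""
    problems = []
    # A bounding box value is allowed if and only if a geometry value is present.
    if (geometry is None) != (bbox is None):
        problems.append("bbox must be present if and only if geometry is present")
    if bbox is not None:
        required = REQUIRED_3D if three_d else REQUIRED_2D
        for field in required:
            # Assumption: any non-float value (including int) is reported.
            if not isinstance(bbox.get(field), float):
                problems.append(f"missing or non-float field: {field}")
    return problems


valid_bbox = {"xmin": 0.0, "xmax": 1.0, "ymin": 0.0, "ymax": 1.0}
ok = check_bbox_row(b"\x01", valid_bbox)
orphan = check_bbox_row(None, valid_bbox)
```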

### Additional information

#### Feature identifiers
10 changes: 10 additions & 0 deletions format-specs/schema.json
@@ -71,6 +71,16 @@
},
"epoch": {
"type": "number"
},
"geometry_bbox": {
"type": "object",
"required": ["column"],
"properties": {
"column": {
"type": "string",
"minLength": 1
}
}
}
}
}
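
To make the constraints in this schema fragment concrete, here is a hand-rolled sketch that mirrors them (an object with a required, non-empty string `column`). The function `validate_geometry_bbox` is hypothetical and only approximates what a JSON Schema validator would do with the fragment above.

```python
def validate_geometry_bbox(value):
    """Mirror the schema fragment: object, required 'column', string of length >= 1."""
    if not isinstance(value, dict):
        return False
    column = value.get("column")
    return isinstance(column, str) and len(column) >= 1


results = {
    "valid": validate_geometry_bbox({"column": "bbox"}),
    "missing_column": validate_geometry_bbox({}),
    "empty_column": validate_geometry_bbox({"column": ""}),
}
```

The two failing inputs correspond to the `empty_geometry_bbox` and `empty_geometry_bbox_column` cases added to `scripts/test_json_schema.py` in this change.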
15 changes: 11 additions & 4 deletions scripts/generate_example.py
@@ -8,6 +8,7 @@
>>> import json, pprint, pyarrow.parquet as pq
>>> pprint.pprint(json.loads(pq.read_schema("example.parquet").metadata[b"geo"]))
"""
from collections import OrderedDict
import json
import pathlib

@@ -19,6 +20,14 @@

df = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
df = df.to_crs("ogc:84")

geometry_bbox = df.bounds.rename(
OrderedDict(
[("minx", "xmin"), ("miny", "ymin"), ("maxx", "xmax"), ("maxy", "ymax")]
),
axis=1,
)
df["bbox"] = geometry_bbox.to_dict("records")
table = pa.Table.from_pandas(df.head().to_wkb())


@@ -39,14 +48,12 @@ def get_version() -> str:
"crs": json.loads(df.crs.to_json()),
"edges": "planar",
"bbox": [round(x, 4) for x in df.total_bounds],
"geometry_bbox": {"column": "bbox"},
},
},
}

schema = (
table.schema
.with_metadata({"geo": json.dumps(metadata)})
)
schema = table.schema.with_metadata({"geo": json.dumps(metadata)})
table = table.cast(schema)

pq.write_table(table, HERE / "../examples/example.parquet")
15 changes: 15 additions & 0 deletions scripts/test_json_schema.py
@@ -41,6 +41,9 @@ def get_version() -> str:
"geometry": {
"encoding": "WKB",
"geometry_types": [],
"geometry_bbox": {
"column": "bbox",
},
},
},
}
@@ -210,6 +213,18 @@ def get_version() -> str:
metadata["columns"]["geometry"]["epoch"] = "2015.1"
invalid_cases["epoch_string"] = metadata

# Geometry Bbox

metadata = copy.deepcopy(metadata_template)
metadata["columns"]["geometry"]["geometry_bbox"].pop("column")
invalid_cases["empty_geometry_bbox"] = metadata


metadata = copy.deepcopy(metadata_template)
metadata["columns"]["geometry"]["geometry_bbox"]["column"] = ""
invalid_cases["empty_geometry_bbox_column"] = metadata



# # Tests
