Parquet: implement using and writing bounding box column, for faster spatial filtering #9185
Cool! For example, you currently use the RTree from GeoPackage. Another option could be to use GEOS (if available in the GDAL build) to calculate the HilbertCode (GEOS >= 3.11) for each bbox value, and then use the Arrow C++ APIs to sort the data on those values before writing to Parquet.
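To make the Hilbert-code idea above concrete, here is a minimal pure-Python sketch. It does not call GEOS or Arrow: `hilbert_index` is the textbook grid-to-curve mapping, and `bbox_hilbert_code`, the `extent` argument and the default `order` are hypothetical names and choices made for this illustration only.

```python
def hilbert_index(order, x, y):
    """Distance along a Hilbert curve of cell (x, y) in a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so that the next level is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def bbox_hilbert_code(bbox, extent, order=16):
    """Hilbert code of the center of bbox, quantized on a grid covering extent.

    bbox and extent are (xmin, ymin, xmax, ymax) tuples (assumed layout).
    """
    xmin, ymin, xmax, ymax = bbox
    ex0, ey0, ex1, ey1 = extent
    n = 1 << order

    def quantize(v, lo, hi):
        if hi <= lo:
            return 0
        cell = int((v - lo) / (hi - lo) * n)
        return min(max(cell, 0), n - 1)

    cx = (xmin + xmax) / 2.0
    cy = (ymin + ymax) / 2.0
    return hilbert_index(order, quantize(cx, ex0, ex1), quantize(cy, ey0, ey1))


if __name__ == "__main__":
    extent = (0.0, 0.0, 4.0, 4.0)
    bboxes = [(3.0, 0.0, 4.0, 1.0), (0.0, 0.0, 1.0, 1.0), (0.0, 3.0, 1.0, 4.0)]
    # Sorting on the code places spatially close boxes next to each other.
    for b in sorted(bboxes, key=lambda b: bbox_hilbert_code(b, extent, order=2)):
        print(b, bbox_hilbert_code(b, extent, order=2))
```

Note that sorting this way needs every key before the first row can be written, which is the memory concern raised in the reply.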
For the sake of simplicity, I'd prefer not to have to support several methods... We might switch to something better, but I'd see that as an implementation detail. The current implementation might be tunable. I'm not super convinced that the points where I flush row groups are ideal for bbox compactness (there is sometimes significant overlap between different row groups), but I couldn't find an obvious way to improve things.
Do you have pointers to the API doc for that sorting API? I'd assume that you have to ingest the whole file into memory? Besides the "simplicity" of re-using the GeoPackage RTree, one of its advantages is that it can work with files much larger than RAM. Users have, for example, run into RAM issues when generating very large FlatGeobuf files, because the driver builds its packed Hilbert R*Tree in memory, which requires something like 84 bytes of RAM per feature, and that becomes a problem if your number of features reaches 1 billion...
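The memory concern above is simple arithmetic: at roughly 84 bytes per feature, 1 billion features need about 84 GB of RAM for the index alone. A small sketch follows; the 84-byte figure is the one quoted in the comment, and the function name and the feature counts are only illustrative.

```python
BYTES_PER_FEATURE = 84  # approximate per-feature cost quoted in the comment above


def index_ram_bytes(feature_count, bytes_per_feature=BYTES_PER_FEATURE):
    """Rough RAM needed to hold an in-memory spatial index entry for every feature."""
    return feature_count * bytes_per_feature


if __name__ == "__main__":
    for count in (3_200_000, 100_000_000, 1_000_000_000):
        gib = index_ram_bytes(count) / 2**30
        print(f"{count:>13,} features -> {gib:8.2f} GiB")
```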
… fields and the FID column is not the first one
…Ignored on reading side
Defaults to NO

Documentation:
```
- .. lco:: SORT_BY_BBOX
     :choices: YES, NO
     :default: NO
     :since: 3.9

     Whether features should be sorted based on the bounding box of their
     geometries, before being written in the final file. Sorting them enables
     faster spatial filtering on reading, by grouping together spatially close
     features in the same group of rows.

     Note however that enabling this option involves creating a temporary
     GeoPackage file (in the same directory as the final Parquet file),
     and thus requires temporary storage (possibly up to several times the size
     of the final Parquet file, depending on Parquet compression) and additional
     processing time.

     The efficiency of spatial filtering depends on the ROW_GROUP_SIZE. If it
     is too large, too many features that are not spatially close will be grouped
     together. If it is too small, the file size will increase, and extra
     processing time will be necessary to browse through the row groups.

     Note also that when this option is enabled, the Arrow writing API (which
     is for example triggered when using ogr2ogr to convert from Parquet to Parquet),
     falls back to the generic implementation, which does not support advanced
     Arrow types (lists, maps, etc.).
```

Experiments with the canonical https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet dataset:

* Generation of datasets:

// Organize in row groups of 65,536 features, no BBOX, no sorting
```
$ time ogr2ogr out_no_bbox.parquet nz-building-outlines.parquet -progress -lco WRITE_COVERING_BBOX=NO
0...10...20...30...40...50...60...70...80...90...100 - done.

real 0m4,457s
```

// Organize in row groups of 65,536 features, add BBOX columns, no sorting
```
$ time ogr2ogr out_unsorted.parquet nz-building-outlines.parquet -progress
0...10...20...30...40...50...60...70...80...90...100 - done.

real 0m5,408s
```

// Organize in row groups of max 65,536 features, add BBOX columns, sort using RTree
```
$ time ogr2ogr out_sorted.parquet nz-building-outlines.parquet -progress -lco SORT_BY_BBOX=YES
0...10...20...30...40...50...60...70...80...90...100 - done.

real 0m40,311s
```

// Organize in row groups of max 16,384 features, add BBOX columns, sort using RTree
```
$ time ogr2ogr out_sorted_16384.parquet nz-building-outlines.parquet -progress -lco SORT_BY_BBOX=YES -lco ROW_GROUP_SIZE=16384
0...10...20...30...40...50...60...70...80...90...100 - done.

real 0m44,149s
```

* File sizes:
```
out_no_bbox.parquet       436,475,127
out_unsorted.parquet      504,120,728
out_sorted.parquet        489,507,910
out_sorted_16384.parquet  492,760,561
```

* Spatial filter selecting a single feature:
```
$ time ogrinfo out_no_bbox.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real 0m1,302s

$ time ogrinfo out_unsorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real 0m0,947s

$ time ogrinfo out_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real 0m0,278s

$ time ogrinfo out_sorted_16384.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real 0m0,183s
```

* Spatial filter selecting ~ 470,000 features (out of a total of 3.2 million):
```
$ time ogrinfo out_no_bbox.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real 0m1,957s

$ time ogrinfo out_unsorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real 0m1,718s

$ time ogrinfo out_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real 0m1,067s

$ time ogrinfo out_sorted_16384.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real 0m1,021s
```
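Why sorting helps on the reading side can be sketched with a toy model. The code below is not the driver's implementation: it assumes a reader that knows the overall bounding box of each row group and skips groups that do not intersect the spatial filter. The tile-based ordering stands in for the RTree-based sort, and all names are hypothetical.

```python
def union(bboxes):
    """Smallest (xmin, ymin, xmax, ymax) box covering all the given boxes."""
    return (
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    )


def intersects(a, b):
    """True if boxes a and b overlap or touch."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def row_group_extents(bboxes, row_group_size):
    """Bounding box of each consecutive run of row_group_size features."""
    return [
        union(bboxes[i:i + row_group_size])
        for i in range(0, len(bboxes), row_group_size)
    ]


def row_groups_to_read(extents, query):
    """Indices of the row groups a reader cannot skip for this spatial filter."""
    return [i for i, e in enumerate(extents) if intersects(e, query)]


# 256 point features on a 16 x 16 grid, stored as degenerate bounding boxes.
features = [(float(x), float(y), float(x), float(y))
            for y in range(16) for x in range(16)]

# Scattered order: a fixed permutation that spreads neighbours far apart.
scattered = [features[(37 * k) % 256] for k in range(256)]

# Spatially grouped order: one 4 x 4 tile of the grid per row group of 16.
grouped = sorted(features, key=lambda b: (int(b[1]) // 4, int(b[0]) // 4))

query = (0.5, 0.5, 1.5, 1.5)  # selects the single feature at (1, 1)
hits_scattered = row_groups_to_read(row_group_extents(scattered, 16), query)
hits_grouped = row_groups_to_read(row_group_extents(grouped, 16), query)

if __name__ == "__main__":
    print("scattered order:", len(hits_scattered), "row groups read")
    print("grouped order:  ", len(hits_grouped), "row groups read")
```

In this model a smaller row group size gives tighter extents but more groups to inspect, which mirrors the ROW_GROUP_SIZE trade-off described in the documentation.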