Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing Row in Shapefile using ST_Read() #414

Closed
jwass opened this issue Sep 26, 2024 · 3 comments · Fixed by #415
Closed

Missing Row in Shapefile using ST_Read() #414

jwass opened this issue Sep 26, 2024 · 3 comments · Fixed by #415

Comments

@jwass
Copy link

jwass commented Sep 26, 2024

duckdb is omitting a row from some shapefiles. Here's an example using NYC Open Data

$ curl -O "https://data.cityofnewyork.us/api/geospatial/tv64-9x69?method=export&format=Shapefile"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 41104    0 41104    0     0  97830      0 --:--:-- --:--:-- --:--:-- 97866

$ unzip tv64-9x69
Archive:  tv64-9x69
  inflating: geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.dbf
  inflating: geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.cpg
  inflating: geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.shp
  inflating: geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.shx
  inflating: geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.prj

The shapefile has 5 rows.

$ ogr2ogr -if 'ESRI Shapefile' -f CSV output.csv geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.shp
$ cat output.csv
boro_code,boro_name,shape_area,shape_leng
3.0000000000000000000000000000000,Brooklyn,2697660957.6599998474121093750000000000000,236441.8174230000004172325134277343750
5.0000000000000000000000000000000,Staten Island,2851517714.9800000190734863281250000000000,220557.4760759999917354434728622436523
1.0000000000000000000000000000000,Manhattan,944328627.3769999742507934570312500000000,203495.3184499999915715306997299194336
2.0000000000000000000000000000000,Bronx,1598501129.5699999332427978515625000000000,188162.4827079999959096312522888183594
4.0000000000000000000000000000000,Queens,4962897942.3900003433227539062500000000000,459302.9055159999988973140716552734375

Duckdb only shows 4 rows.

$ ~/duckdb --version
v1.1.1 af39bd0dcf

$ echo "INSTALL SPATIAL; LOAD SPATIAL; SELECT * FROM ST_READ('geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.shp');" | ~/duckdb
┌───────────┬───────────────┬───────────────┬───────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ boro_code │   boro_name   │  shape_area   │  shape_leng   │                                                                                                                                                                                                                                                                                                                  geom                                                                                                                                                                                                                                                                                                                   │
│  double   │    varchar    │    double     │    double     │                                                                                                                                                                                                                                                                                                                geometry                                                                                                                                                                                                                                                                                                                 │
├───────────┼───────────────┼───────────────┼───────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│       3.0 │ Brooklyn      │ 2697660957.66 │ 236441.817423 │ POLYGON ((-73.92722485839732 40.725332797858556, -73.92653850981335 40.724065200861965, -73.92534378298296 40.721637982465985, -73.92417994115488 40.72001999582064, -73.92427812361059 40.71853611888922, -73.92312683417494 40.717023660709394, -73.92283177836151 40.716636030802356, -73.92271117629188 40.71647759153867, -73.92273875723602 40.71615785672431, -73.92432687084326 40.71556819492673, -73.9240590973705 40.714111559644955, -73.92404011334023 40.71400831262894, -73.92313936283307 40.71339472809052, -73.92223466477681 40.71285595386036, -73.92181880372804 40.71178206795015, -73.92154601695684 40.711043…  │
│       5.0 │ Staten Island │ 2851517714.98 │ 220557.476076 │ POLYGON ((-74.05581438861887 40.64970879790647, -74.05619414135802 40.63927748254579, -74.05623556634002 40.63813939513815, -74.0563014878872 40.6370585732503, -74.05646822584012 40.6317466905475, -74.05661769631568 40.627284070550076, -74.05534110428691 40.62513123369629, -74.05352244692213 40.62209079578645, -74.05235184918668 40.620133622308884, -74.04996267761109 40.61625667425286, -74.04906155737653 40.61482391679318, -74.04806555855417 40.613240224258114, -74.0474892630543 40.61233098164059, -74.04644311547133 40.61075134673787, -74.04388942485573 40.60691293977066, -74.04384055466234 40.606840052295…  │
│       1.0 │ Manhattan     │ 944328627.377 │  203495.31845 │ MULTIPOLYGON (((-74.04387761573885 40.69018767537152, -74.0435059601254 40.68968735963638, -74.04273533893009 40.69005019142024, -74.04278433380006 40.69012097669119, -74.04270428426766 40.690155204644356, -74.04255372037308 40.68996275928963, -74.0426392937119 40.68992817641339, -74.0426938081918 40.68999725910724, -74.04346752310265 40.68963699010349, -74.04351637245855 40.68919103374238, -74.04364078627414 40.68876655957018, -74.04397458556184 40.68858240705594, -74.0443852177728 40.68851617840272, -74.04478399040363 40.6885956601159, -74.04627538937642 40.689327425897034, -74.04680284898575 40.68995325…  │
│       2.0 │ Bronx         │ 1598501129.57 │ 188162.482708 │ POLYGON ((-73.86477258283523 40.90201244187387, -73.86305085664789 40.9015080840389, -73.86279850645575 40.901535521721, -73.8626620866681 40.901459187897046, -73.86245772627257 40.901358396514496, -73.86224439595213 40.90126770653601, -73.86202342518322 40.90118777618785, -73.8617962600684 40.901119122288335, -73.86156438318812 40.901062122833935, -73.86134100834322 40.90111157030154, -73.86109329992773 40.9010320752821, -73.8608813844133 40.90096407584999, -73.86078380097729 40.90093304486161, -73.86068622778949 40.90090201330024, -73.86050354628262 40.90084609261496, -73.86027030967344 40.90077469541309…  │
└───────────┴───────────────┴───────────────┴───────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Queens is missing.

Also tried using Python/fiona which also properly has 5 rows:

In [1]: import fiona

In [2]: with fiona.open('geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.shp') as f:
   ...:     features = list(f)
   ...:

In [3]: features
Out[3]:
[fiona.Feature(geometry=fiona.Geometry(coordinates=[[(-73.92722485839732, 40.725332797858556), ...]], type='Polygon'), id='0', properties=fiona.Properties(boro_code=3.0, boro_name='Brooklyn', shape_area=2697660957.66, shape_leng=236441.817423)),
 fiona.Feature(geometry=fiona.Geometry(coordinates=[[(-74.05581438861887, 40.64970879790647), ...]], type='Polygon'), id='1', properties=fiona.Properties(boro_code=5.0, boro_name='Staten Island', shape_area=2851517714.98, shape_leng=220557.476076)),
 fiona.Feature(geometry=fiona.Geometry(coordinates=[[[(-74.04387761573885, 40.69018767537152), ...]], ...], type='MultiPolygon'), id='2', properties=fiona.Properties(boro_code=1.0, boro_name='Manhattan', shape_area=944328627.377, shape_leng=203495.31845)),
 fiona.Feature(geometry=fiona.Geometry(coordinates=[[(-73.86477258283523, 40.90201244187387), ...]], type='Polygon'), id='3', properties=fiona.Properties(boro_code=2.0, boro_name='Bronx', shape_area=1598501129.57, shape_leng=188162.482708)),
 fiona.Feature(geometry=fiona.Geometry(coordinates=[[(-73.77895751354573, 40.811713192208586), ...]], type='Polygon'), id='4', properties=fiona.Properties(boro_code=4.0, boro_name='Queens', shape_area=4962897942.39, shape_leng=459302.905516))]

In [4]: len(features)
Out[4]: 5
@jwass
Copy link
Author

jwass commented Sep 27, 2024

I discovered the undocumented function ST_ReadSHP() which does not exhibit this issue.

D select * from st_readshp('geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.shp');
┌───────────┬───────────────┬───────────────┬───────────────┬────────────────────────────────────────────────────────────────┐
│ boro_code │   boro_name   │  shape_area   │  shape_leng   │                              geom                              │
│  double   │    varchar    │    double     │    double     │                            geometry                            │
├───────────┼───────────────┼───────────────┼───────────────┼────────────────────────────────────────────────────────────────┤
│       3.0 │ Brooklyn      │ 2697660957.66 │ 236441.817423 │ POLYGON ((-73.92722485839732 40.725332797858556, -73.9265385…  │
│       5.0 │ Staten Island │ 2851517714.98 │ 220557.476076 │ POLYGON ((-74.05581438861887 40.64970879790647, -74.05619414…  │
│       1.0 │ Manhattan     │ 944328627.377 │  203495.31845 │ MULTIPOLYGON (((-74.04387761573885 40.69018767537152, -74.04…  │
│       2.0 │ Bronx         │ 1598501129.57 │ 188162.482708 │ POLYGON ((-73.86477258283523 40.90201244187387, -73.86305085…  │
│       4.0 │ Queens        │ 4962897942.39 │ 459302.905516 │ POLYGON ((-73.77895751354573 40.811713192208586, -73.7637120…  │
└───────────┴───────────────┴───────────────┴───────────────┴────────────────────────────────────────────────────────────────┘

Also get the same correct result using select * from 'geo_export_4fe3aed9-1fcc-46be-b58b-e1be1f31bd1b.shp'; which I'm guessing ends up in the same code path as the st_readshp() call.

@Maxxen
Copy link
Member

Maxxen commented Sep 27, 2024

Yeah, this was quite a ride to chase down but it turns out that in our file system abstraction layer we plug into GDAL, we did not have the same behavior regarding when a file handle should return EOF as was expected by GDAL.

GDAL seems to consider a file handle to be "set" to EOF once it tries to read past the end of the file (e.g. read() does not return the same amount of bytes as requested, and no error is set), and resets the EOF state whenever a seek() is issued. In contrast we always considered EOF == true whenever tell() == file size regardless of how previous reads behaved.

The ST_ReadSHP function uses our own filesystem directly, and also does not do the same piecewise reading until EOF to count records in the .dbf as the GDAL driver seems to do - if that's good or not I can't say for sure.

Should be solved in #415

@jwass
Copy link
Author

jwass commented Sep 27, 2024

Awesome. Thanks for tracking it down and for the explanation.

@jwass jwass closed this as completed Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants