GeoParquet fails to read Hive-partitioned data from Azure #11309

Open
iferencik opened this issue Nov 20, 2024 · 1 comment · May be fixed by #11310
iferencik commented Nov 20, 2024

What is the bug?

According to docs and ogr_parquet.py GDAL should be able to read partitioned data.

However, the docs also say: "This support is only enabled if the driver is built against the arrowdataset C++ library."

I am not sure how this can be checked, other than with:

ogrinfo --formats | grep Arrow
  Arrow -vector- (rw+v): (Geo)Arrow IPC File Format / Stream (*.arrow, *.feather, *.arrows, *.ipc)

or

ogrinfo --formats | grep Parquet
  Parquet -vector- (rw+v): (Geo)Parquet (*.parquet)
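
Neither listing says whether arrowdataset is compiled in. A possible check, sketched here under an assumption: the helper in ogr_parquet.py appears to test for an ARROW_DATASET metadata item on the Parquet driver, and `ogrinfo --format Parquet` should print the driver's metadata. The function name `has_arrow_dataset` is made up for illustration, and the exact output layout of `--format` may differ between GDAL versions.

```shell
# Succeeds if the driver details read from stdin mention ARROW_DATASET.
# Assumption: the Parquet driver advertises arrowdataset support through an
# ARROW_DATASET metadata item, as the check in ogr_parquet.py suggests.
has_arrow_dataset() {
    grep -q 'ARROW_DATASET'
}

# Intended usage (requires GDAL on PATH):
#   ogrinfo --format Parquet | has_arrow_dataset && echo "arrowdataset: yes"
```

If the item is absent, directory (partitioned) datasets would not be expected to open at all, independent of the Azure problem reported here.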

Steps to reproduce the issue

  1. list overture data

    • list themes
         az storage blob list --account-name overturemapswestus2 --container-name release --output table  --prefix 2024-11-13.0/ --delimiter "/"
      
                  Name                                Blob Type    Blob Tier    Length    Content Type    Last Modified    Snapshot
         ----------------------------------  -----------  -----------  --------  --------------  ---------------  ----------
         2024-11-13.0/theme=addresses/
         2024-11-13.0/theme=base/
         2024-11-13.0/theme=buildings/
         2024-11-13.0/theme=divisions/
         2024-11-13.0/theme=places/
         2024-11-13.0/theme=transportation/
    • list divisions
         az storage blob list --account-name overturemapswestus2 --container-name release --output table  --prefix 2024-11-13.0/theme=divisions/ --delimiter "/"
    
        Name                                                  Blob Type    Blob Tier    Length    Content Type    Last Modified    Snapshot
        ----------------------------------------------------  -----------  -----------  --------  --------------  ---------------  ----------
        2024-11-13.0/theme=divisions/type=division/
        2024-11-13.0/theme=divisions/type=division_area/
        2024-11-13.0/theme=divisions/type=division_boundary/
    
    
    • list partitions
       az storage blob list --account-name overturemapswestus2 --container-name release --output table  --prefix 2024-11-13.0/theme=divisions/type=division_area/
    
       Name                                                                                                               Blob Type    Blob Tier    Length      Content Type              Last Modified              Snapshot
       -----------------------------------------------------------------------------------------------------------------  -----------  -----------  ----------  ------------------------  -------------------------  ----------
       2024-11-13.0/theme=divisions/type=division_area/part-00000-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet  BlockBlob    Hot          1303206504  application/octet-stream  2024-11-13T18:36:57+00:00
       2024-11-13.0/theme=divisions/type=division_area/part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet  BlockBlob    Hot          977614904   application/octet-stream  2024-11-13T18:36:49+00:00
       2024-11-13.0/theme=divisions/type=division_area/part-00002-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet  BlockBlob    Hot          781317207   application/octet-stream  2024-11-13T18:40:24+00:00
    
    
  2. read one file

    ogrinfo "PARQUET:/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet" -al -so
    INFO: Open of `PARQUET:/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet'
          using driver `Parquet' successful.
    
    Layer name: part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd
    Geometry: Multi Polygon
    Feature Count: 332188
    Extent: (-180.000000, -4.899520) - (180.000000, 71.588953)
    Layer SRS WKT:
    GEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            MEMBER["World Geodetic System 1984 (G2296)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        CS[ellipsoidal,2],
            AXIS["geodetic latitude (Lat)",north,
                ORDER[1],
                ANGLEUNIT["degree",0.0174532925199433]],
            AXIS["geodetic longitude (Lon)",east,
                ORDER[2],
                ANGLEUNIT["degree",0.0174532925199433]],
        USAGE[
            SCOPE["Horizontal component of 3D system."],
            AREA["World."],
            BBOX[-90,-180,90,180]],
        ID["EPSG",4326]]
    Data axis to CRS axis mapping: 2,1
    Geometry Column = geometry
    id: String (0.0)
    country: String (0.0)
    version: Integer (0.0)
    sources: String(JSON) (0.0)
    subtype: String (0.0)
    class: String (0.0)
    names.primary: String (0.0)
    names.common: String(JSON) (0.0)
    names.rules: String(JSON) (0.0)
    wikidata: String (0.0)
    division_ids: StringList (0.0)
    is_disputed: Integer(Boolean) (0.0)
    perspectives.mode: String (0.0)
    perspectives.countries: StringList (0.0)
    local_type: String(JSON) (0.0)
    region: String (0.0)
    hierarchies: String(JSON) (0.0)
    parent_division_id: String (0.0)
    norms.driving_side: String (0.0)
    population: Integer (0.0)
    capital_division_ids: StringList (0.0)
    capital_of_divisions: String(JSON) (0.0)
    division_id: String (0.0)
    
  3. read partitioned data

    ogrinfo "PARQUET:/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/" -al -so
    ERROR 1: parquet::arrow::OpenFile() failed

    ogrinfo "/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/" -al -so
    ERROR 4: `/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/' not recognized as being in a supported file format.
    ogrinfo failed - unable to open '/vsicurl/https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/'.
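
For reference, "Hive partitioned" here means the partition columns are encoded as key=value directory names in the blob path, so the directory opened in step 3 corresponds to theme=divisions and type=division_area. A small sketch that pulls those pairs out of one of the listed blob names (the function name `hive_partitions` is hypothetical, and it only does string splitting, not any GDAL call):

```shell
# Hypothetical helper: print the Hive partition keys encoded in a blob path,
# one key=value pair per line.
hive_partitions() {
    printf '%s\n' "$1" | tr '/' '\n' | grep '='
}

hive_partitions "2024-11-13.0/theme=divisions/type=division_area/part-00001-be2f62f1-1d1a-4846-8fae-516229e2b6df-c000.zstd.parquet"
```

For the path above this prints theme=divisions and type=division_area on separate lines; the release prefix and the part file name contain no "=" and are skipped.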

Versions and provenance

ogrinfo --version
GDAL 3.10.0, released 2024/11/01

Additional context

I am trying to efficiently read Parquet files within a bbox directly from Azure.

@rouault rouault self-assigned this Nov 20, 2024
rouault added a commit to rouault/gdal that referenced this issue Nov 20, 2024
rouault commented Nov 20, 2024

Proper fix in #11310

Workaround with existing versions:

    AZURE_NO_SIGN_REQUEST=YES AZURE_STORAGE_ACCOUNT=overturemapswestus2 ogrinfo "PARQUET:/vsiaz/release/2024-11-13.0/theme=divisions/type=division_area//" --debug on -al -so

Note the doubled trailing slash. The workaround is imperfect: it causes the layer name to be an empty string, so when converting to other formats with ogr2ogr you need to pass -nln some_layer_name.
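
To apply the same workaround to another public blob directory, the /vsicurl/ URL from the report has to be rewritten into the /vsiaz/ form. A minimal sketch of that rewrite, assuming the usual https://ACCOUNT.blob.core.windows.net/CONTAINER/PATH layout; the function name `azure_url_to_vsiaz` is made up for illustration, and the ogr2ogr line in the comments uses placeholder bbox values and a placeholder output name:

```shell
# Hypothetical helper: derive the storage account and the /vsiaz/ dataset
# name (with the doubled trailing slash) from a public Azure blob URL.
azure_url_to_vsiaz() {
    url=${1%/}              # drop one trailing slash, if present
    rest=${url#https://}    # ACCOUNT.blob.core.windows.net/CONTAINER/PATH
    host=${rest%%/*}        # ACCOUNT.blob.core.windows.net
    account=${host%%.*}     # ACCOUNT
    path=${rest#*/}         # CONTAINER/PATH
    printf 'AZURE_STORAGE_ACCOUNT=%s\n' "$account"
    printf 'PARQUET:/vsiaz/%s//\n' "$path"
}

azure_url_to_vsiaz "https://overturemapswestus2.blob.core.windows.net/release/2024-11-13.0/theme=divisions/type=division_area/"

# The two printed values would then be used along these lines (not run here;
# bbox values and output name are placeholders):
#   AZURE_NO_SIGN_REQUEST=YES AZURE_STORAGE_ACCOUNT=overturemapswestus2 \
#     ogr2ogr -spat XMIN YMIN XMAX YMAX -nln some_layer_name out.gpkg \
#     "PARQUET:/vsiaz/release/2024-11-13.0/theme=divisions/type=division_area//"
```

The explicit -nln is there because of the empty layer name mentioned above; -spat is the standard ogr2ogr spatial filter and is only a guess at how the bbox read from the original report would be expressed.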

rouault added a commit to rouault/gdal that referenced this issue Nov 20, 2024