Libpostal parsing Indonesian addresses with poor accuracy #645

AsfarHorani · 2023-10-25T08:47:59Z

Hi!

I was checking out libpostal, and saw something that could be improved.

My country is

Pakistan, but I was working on Indonesian data for a project.

Here's how I'm using libpostal

Created a Docker image.

Here's what I did

Jl. Arif Rahman Hakim No.5, Mataram, Nusa Tenggara Bar, 83127, Indonesia

Here's what I got

Result:

{
  "road": "jl. arif rahman hakim no.5 mataram nusa tenggara bar",
  "postcode": "83127",
  "country": "indonesia"
}

Here's what I was expecting

Result:
Street: Jl. Arif Rahman Hakim No.5
City: Mataram
State/Province: Nusa Tenggara Bar
Postal Code: 83127
Country: Indonesia

For parsing issues, please answer "yes" or "no" to all that apply.

Does the input address exist in OpenStreetMap?
yes: https://www.openstreetmap.org/search?query=Jl.%20Arif%20Rahman%20Hakim%20Mataram#map=17/-8.59163/116.10771
Do all the toponyms exist in OSM (city, state, region names, etc.)?
no
If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
no
If the address does not contain city, region, etc., does adding those fields to the input improve the result?
Might be
If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?
I don't know

Here's what I think could be improved

Update the data

albarrentine · 2024-02-14T23:38:55Z

I think it has trouble with the "Bar" part (maybe it's listed differently in OSM/GeoNames, etc.). This works fine:

Jl. Arif Rahman Hakim No.5, Mataram, Nusa Tenggara, 83127, Indonesia

{
  "road": "jl. arif rahman hakim",
  "house_number": "no.5",
  "city": "mataram",
  "state_district": "nusa tenggara",
  "postcode": "83127",
  "country": "indonesia"
}
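The observation above, that the parse comes out clean once "Bar" is dropped, can be turned into a small preprocessing step. The following is only a sketch under assumptions: the pattern `TRAILING_BAR` and the helper `strip_trailing_bar` are hypothetical names, not part of libpostal, and the regex only covers a trailing "Bar" token at the end of a comma-separated field.

```python
import re

# Hypothetical pattern (an assumption, not a libpostal feature): a standalone
# "Bar" token at the end of a comma-separated field, e.g. "Nusa Tenggara Bar".
# The leading \s+ keeps words that merely end in "bar" (e.g. "Akbar") intact.
TRAILING_BAR = re.compile(r"\s+Bar(?=\s*(?:,|$))", flags=re.IGNORECASE)


def strip_trailing_bar(address):
    """Return (cleaned_address, removed_tokens).

    removed_tokens lists what was stripped so it can be added back to the
    parsed result after parsing, if needed.
    """
    removed = [m.group(0).strip() for m in TRAILING_BAR.finditer(address)]
    cleaned = TRAILING_BAR.sub("", address)
    return cleaned, removed


cleaned, removed = strip_trailing_bar(
    "Jl. Arif Rahman Hakim No.5, Mataram, Nusa Tenggara Bar, 83127, Indonesia"
)
print(cleaned)
print(removed)
```

The cleaned string is what would then be handed to the parser (for example `parse_address` from the pypostal bindings, if those are installed), and the removed token can be re-attached to the matching component afterwards.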

AsfarHorani · 2024-02-15T18:44:57Z

But that was just a sample address that I shared with you. Classification is poor in 90% of cases for Indonesian localities.

albarrentine · 2024-02-16T16:39:43Z

We will probably need to check the training data. In the address_parser CLI you can type .print_features followed by some test addresses, and it will print a JSON representation of what the model is doing for every word in the input, so you can test different formulations. It could be that the municipality names from OSM are substantially different, or that the data follows a convention different from how things are tagged in OSM/GeoNames, etc.

In some cases there are things that can be done in preprocessing, like extracting a pattern with a regex. For instance, if all the cities in the test set had something like “Bar” at the end, and those in the training set did not, it’s easy to write a regex to remove that before parsing, let the model handle the rest, and optionally add it back later.

If most of the test addresses are comma-separated, you can also walk backward through the string, adding one comma-separated phrase at a time to the parse until something becomes inconsistent (for example, two non-adjacent phrases labeled as “road”). If that happens, try throwing out that component and reparsing. Generally, for something like an admin area there’s often a reference database/search index to look it up in.
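The backward walk over comma-separated phrases could look roughly like the sketch below. It is one interpretation of the suggestion, not libpostal code: `has_split_label` and `parse_backward` are hypothetical helpers, the phrase that triggers the inconsistency is assumed to be the one to throw out, and `parse` is any callable returning a list of `(value, label)` pairs (the shape returned by `parse_address` in the pypostal bindings).

```python
def has_split_label(pairs, label="road"):
    """True if `label` occurs on two components that are not adjacent."""
    positions = [i for i, (_, lab) in enumerate(pairs) if lab == label]
    return any(b - a > 1 for a, b in zip(positions, positions[1:]))


def parse_backward(address, parse, label="road"):
    """Grow the parse from the end of the address, one phrase at a time.

    `parse` is a callable taking a string and returning (value, label) pairs.
    A phrase whose addition makes `label` appear on non-adjacent components
    is dropped (an assumption about which component to throw out).
    Returns (final_parse, dropped_phrases).
    """
    phrases = [p.strip() for p in address.split(",") if p.strip()]
    kept = []     # accepted phrases, in their original left-to-right order
    dropped = []  # phrases that made the parse inconsistent
    for phrase in reversed(phrases):
        candidate = [phrase] + kept
        if has_split_label(parse(", ".join(candidate)), label):
            dropped.append(phrase)
        else:
            kept = candidate
    return parse(", ".join(kept)), dropped
```

With pypostal installed, `parse` would presumably be `postal.parser.parse_address`. The dropped phrases are candidates for a lookup in a reference database or search index rather than being discarded outright.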

rjurney · 2024-05-03T00:23:45Z

@AsfarHorani have you tried the new Senzing parsing model? Check out the README instructions. It may help, though that isn't clear from the latest metrics: https://github.com/Senzing/libpostal-data/blob/main/files/stats/v1.1.0/Parsing_comparison_v1_1_0.md
