Inconsistent parsing results US address. #19

Open
TerranceNHanlon opened this issue Apr 26, 2024 · 4 comments

Comments

@TerranceNHanlon

Apologies if this is the wrong medium for this question, but I'm at a wall. I'm getting inconsistent parsing results across my environments, which makes this difficult to debug.

For example, this address (a fake street address, but a real city, state, and zip) parses incorrectly in my Docker instance (Debian), but parses correctly when I run it locally (M1 macOS).

1111 main street, Chapel Hill, North Carolina 27516

In Docker it seems to mis-handle the state North Carolina and appends "north" to the city value:

{
    "label": "house_number",
    "value": "1111"
},
{
    "label": "road",
    "value": "main street"
},
{
    "label": "city",
    "value": "chapel hill north"
},
{
    "label": "state",
    "value": "carolina"
},
{
    "label": "postcode",
    "value": "27516"
}

While my local instance returns:

{
    "label": "house_number",
    "value": "1111"
},
{
    "label": "road",
    "value": "main street"
},
{
    "label": "city",
    "value": "chapel hill"
},
{
    "label": "state",
    "value": "north carolina"
},
{
    "label": "postcode",
    "value": "27516"
}

Each environment is consistent with itself. I have not recompiled my local (correctly parsing) instance, but both the Docker instance and my local instance use the same forked version of libpostal when compiling and configuring.

I imagine this is an open-ended and hard-to-answer question, but I'm wondering if this has been seen before. I would appreciate any insight into why the results differ and why the state isn't recognized. Thanks in advance.
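To make the difference between the two results explicit, the two parses can be compared label by label. The following is a minimal sketch, not part of libpostal: the helper names `to_mapping` and `diff_parses` are hypothetical, and it assumes each label appears at most once per parse (a repeated label would be overwritten).

```python
def to_mapping(components):
    """Turn a list of {"label": ..., "value": ...} objects into a label -> value dict.

    Assumption: each label occurs at most once in a parse.
    """
    return {c["label"]: c["value"] for c in components}


def diff_parses(a, b):
    """Return {label: (value_in_a, value_in_b)} for every label whose value differs."""
    left, right = to_mapping(a), to_mapping(b)
    return {
        label: (left.get(label), right.get(label))
        for label in sorted(left.keys() | right.keys())
        if left.get(label) != right.get(label)
    }


# The two parses quoted above.
docker_parse = [
    {"label": "house_number", "value": "1111"},
    {"label": "road", "value": "main street"},
    {"label": "city", "value": "chapel hill north"},
    {"label": "state", "value": "carolina"},
    {"label": "postcode", "value": "27516"},
]
local_parse = [
    {"label": "house_number", "value": "1111"},
    {"label": "road", "value": "main street"},
    {"label": "city", "value": "chapel hill"},
    {"label": "state", "value": "north carolina"},
    {"label": "postcode", "value": "27516"},
]

for label, (docker_value, local_value) in diff_parses(docker_parse, local_parse).items():
    print(f"{label}: docker={docker_value!r} local={local_value!r}")
```

Only the `city` and `state` components differ; the house number, road, and postcode agree in both environments.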

@albarrentine
Contributor

Maybe try it using the command-line parser that comes with the C library: clone the C library (https://github.com/openvenues/libpostal), build it with make as usual, then run ./src/address_parser. This is a command-line interface to the C library, and it has a special command, .print_features, which, for any address input after that, prints the set of input features extracted by the model for each token in the input. Check the differences in the feature output between the environments to see if anything doesn't match for some reason.

Sometimes Docker issues are related to the resource requirements being somewhat larger than the default specs (4GB of RAM usually works). There needs to be enough disk space to hold the models, so I would check the byte size of the files in the data dir between the working env and the Docker one and make sure everything's downloaded/decompressed properly.
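One way to do the byte-size check is to record a per-file size manifest in each environment and compare the two. This is only a sketch under assumptions: `size_manifest` and `compare_manifests` are hypothetical helper names, and the data directory path has to be supplied for each environment.

```python
import os


def size_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its size in bytes."""
    manifest = {}
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            manifest[os.path.relpath(path, data_dir)] = os.path.getsize(path)
    return manifest


def compare_manifests(a, b):
    """Return (only_in_a, only_in_b, size_mismatches) for two manifests.

    size_mismatches maps a relative path to (size_in_a, size_in_b).
    """
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    mismatched = {
        path: (a[path], b[path])
        for path in sorted(a.keys() & b.keys())
        if a[path] != b[path]
    }
    return only_a, only_b, mismatched
```

Running `size_manifest` on the data dir in each environment, saving the result (for example as JSON), and passing both to `compare_manifests` shows which files are missing or differ in size. Matching sizes do not prove the files are identical, so this is a first check rather than a guarantee.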

@TerranceNHanlon
Author

Thanks for the response. The .print_features output for my local instance looks as I'd expect.

{
  "house_number": "1111",
  "road": "main street",
  "city": "chapel hill",
  "state": "north carolina",
  "postcode": "27516"
}

and the output of du -h libpostal/ for my local instance:

1.7G    libpostal/address_parser
8.3M    libpostal/address_expansions
19M    libpostal/transliteration
388K    libpostal/numex
74M    libpostal/language_classifier
1.8G    libpostal

And the following is the output from my Docker instance:

2.7G    libpostal/address_parser
8.4M    libpostal/address_expansions
19M     libpostal/transliteration
388K    libpostal/numex
75M     libpostal/language_classifier
2.8G    libpostal/
{
  "house_number": "1111",
  "road": "main street",
  "city": "chapel hill north",
  "state": "carolina",
  "postcode": "27516"
}

In my Docker container I also tried MODEL=senzing and got the same results. My local data directory is about 1 GB smaller than the one in Docker; could that be indicative of anything?

@TerranceNHanlon
Author

Sorry, I realized I didn't include the entire response from .print_features.

Here is what I get from the Docker instance:

{ postcode no context|DDDD, bias, word|DDDD, first word|DDDD, next word|main street, word+next word|DDDD|main street }
{ phrase|main street, phrase type+phrase|suburb|main street, phrase type+phrase|city|main street, commonly city|main street, prev word|DDDD, prev word+word|DDDD|main street, next word|chapel hill north, word+next word|main street|chapel hill north, prev tag+word|main street, prev tag+prev word|DDDD }
{ phrase|main street, phrase type+phrase|suburb|main street, phrase type+phrase|city|main street, commonly city|main street, prev word|DDDD, prev word+word|DDDD|main street, next word|chapel hill north, word+next word|main street|chapel hill north }
{ phrase|chapel hill north, unambiguous phrase type|suburb, unambiguous phrase type+phrase|suburb|chapel hill north, commonly suburb|chapel hill north, prev word|main street, prev word+word|main street|chapel hill north, next word|carolina, word+next word|chapel hill north|carolina, prev tag+word|chapel hill north, prev tag+prev word|main street }
{ phrase|chapel hill north, unambiguous phrase type|suburb, unambiguous phrase type+phrase|suburb|chapel hill north, commonly suburb|chapel hill north, prev word|main street, prev word+word|main street|chapel hill north, next word|carolina, word+next word|chapel hill north|carolina }
{ phrase|chapel hill north, unambiguous phrase type|suburb, unambiguous phrase type+phrase|suburb|chapel hill north, commonly suburb|chapel hill north, prev word|main street, prev word+word|main street|chapel hill north, next word|carolina, word+next word|chapel hill north|carolina }
{ bias, word|carolina, prev word|chapel hill north, prev word+word|chapel hill north|carolina, next word|DDDDD, word+next word|carolina|DDDDD }
{ postcode no context|DDDDD, bias, word|DDDDD, prev word|carolina, prev word+word|carolina|DDDDD, prev tag+word|DDDDD, prev tag+prev word|carolina }

And from my local instance:

{ postcode no context|DDDD, bias, word|DDDD, first word|DDDD, next word|main street, word+next word|DDDD|main street }
{ phrase|main street, phrase type+phrase|suburb|main street, phrase type+phrase|city|main street, commonly suburb|main street, prev word|DDDD, prev word+word|DDDD|main street, next word|chapel hill, word+next word|main street|chapel hill, prev tag+word|main street, prev tag+prev word|DDDD }
{ phrase|main street, phrase type+phrase|suburb|main street, phrase type+phrase|city|main street, commonly suburb|main street, prev word|DDDD, prev word+word|DDDD|main street, next word|chapel hill, word+next word|main street|chapel hill }
{ phrase|chapel hill, phrase type+phrase|suburb|chapel hill, phrase type+phrase|city|chapel hill, commonly city|chapel hill, prev word|main street, prev word+word|main street|chapel hill, next word|north carolina, word+next word|chapel hill|north carolina, prev tag+word|chapel hill, prev tag+prev word|main street }
{ phrase|chapel hill, phrase type+phrase|suburb|chapel hill, phrase type+phrase|city|chapel hill, commonly city|chapel hill, prev word|main street, prev word+word|main street|chapel hill, next word|north carolina, word+next word|chapel hill|north carolina }
{ phrase|north carolina, unambiguous phrase type|state, unambiguous phrase type+phrase|state|north carolina, commonly state|north carolina, prev word|chapel hill, prev word+word|chapel hill|north carolina, next word|DDDDD, word+next word|north carolina|DDDDD, prev tag+word|north carolina, prev tag+prev word|chapel hill }
{ phrase|north carolina, unambiguous phrase type|state, unambiguous phrase type+phrase|state|north carolina, commonly state|north carolina, prev word|chapel hill, prev word+word|chapel hill|north carolina, next word|DDDDD, word+next word|north carolina|DDDDD }
{ postcode have context, postcode have context|DDDDD, bias, word|DDDDD, prev word|north carolina, prev word+word|north carolina|DDDDD, prev tag+word|DDDDD, prev tag+prev word|north carolina }

I'm not entirely sure how to decipher that, but it does look confused around the city component. The Docker build tags the city phrase with unambiguous phrase type|suburb, whereas my local instance has phrase type+phrase|suburb|chapel hill. On the preceding lines, the Docker build has phrase type+phrase|city|main street, commonly city|main street, where my local instance has phrase type+phrase|city|main street, commonly suburb|main street.
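Spotting these differences by eye is error-prone, so the two dumps can also be compared mechanically. The sketch below assumes the format shown above: one `{ ... }` line per token, with features separated by a comma and a space, and no feature containing that separator. The function names are hypothetical. Lines are paired by position, so if the two dumps have different line counts, the extra lines are ignored.

```python
def parse_feature_line(line):
    """Parse one '{ feat, feat, ... }' line into a set of feature strings."""
    inner = line.strip().strip("{}").strip()
    return {feat.strip() for feat in inner.split(", ") if feat.strip()}


def diff_feature_dumps(dump_a, dump_b):
    """Compare two dumps line by line.

    Yields (line_index, only_in_a, only_in_b) for each pair of lines whose
    feature sets differ. Blank lines are skipped before pairing.
    """
    lines_a = [line for line in dump_a.splitlines() if line.strip()]
    lines_b = [line for line in dump_b.splitlines() if line.strip()]
    for index, (line_a, line_b) in enumerate(zip(lines_a, lines_b)):
        feats_a = parse_feature_line(line_a)
        feats_b = parse_feature_line(line_b)
        if feats_a != feats_b:
            yield index, sorted(feats_a - feats_b), sorted(feats_b - feats_a)
```

Feeding the Docker dump as `dump_a` and the local dump as `dump_b` lists, per line, the features that only one environment produced, such as `commonly city|main street` versus `commonly suburb|main street`.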

@albarrentine
Contributor

Yeah, it looks like you have an old version of the C library and model (pre-1.0) on Docker and the latest from-source version on your machine. Check how it's being installed. If it's through apt-get or something similar, you might need to check that those packages are up to date, or just install it from source. It takes a little longer on Ubuntu to compile the scanner, but not excessively long, and you can always put it into a base image if needed.
