Question: Abbreviations of country names #674

Open
Corvan opened this issue Oct 20, 2024 · 1 comment
Corvan commented Oct 20, 2024

Hi!

I played around a bit with libpostal and pypostal, and I am quite impressed. Kudos!


My country is

Germany


Here's how I'm using libpostal

I'm thinking about including it in the Odoo instance at my workplace, but I'm still in the (private) exploratory stage. The use case would be deduplicating purchased leads.


Here's what I did

I tried using expand_address (from pypostal, but this is not a pypostal issue).

This is my unit test:

    def test_normalize_address__abbreviations(self):
        address = Address(
            id=0,
            name=Keeper(value="H. P. Lovecraft", keep=True),
            email=Keeper(value="[email protected]", keep=True),
            company=Keeper(value="Arkham House", keep=True),
            street="Town Sqr.",
            building_number="5",
            postcode="01938",
            city="Innsmouth",
            state="MA",
            country="U.S.A.",
        )
        normalized_address = normalize_address(address, languages=["en"])
        self.assertEqual("town square", normalized_address.normalized_street)
        self.assertEqual("massachusetts", normalized_address.normalized_state)
        self.assertEqual("united states of america", normalized_address.normalized_country)

This is the normalize_address function:

def normalize_address(address: Address, languages: list[str]) -> NormalizedAddress:
    """Normalizes the fields of an address that make sense to be normalized,
    adds fields to the dict of the address with normalized values"""
    normalized_address = NormalizedAddress(
        id=address.id,
        name=address.name,
        email=address.email,
        company=address.company,
        street=address.street,
        building_number=address.building_number,
        postcode=address.postcode,
        city=address.city,
        state=address.state,
        country=address.country,
    )

    normalized_address.normalized_name = normalize_string(address.name.value)
    normalized_address.normalized_email = normalize_email_address(address.email.value)
    normalized_address.normalized_company = normalize_string(address.company.value)
    normalized_address.normalized_street = normalize_address_string(
        address.street,
        languages=languages
    )
    normalized_address.normalized_building_number = normalize_string(address.building_number)
    normalized_address.normalized_postcode = normalize_string(address.postcode)
    normalized_address.normalized_city = normalize_string(address.city)
    normalized_address.normalized_state = normalize_state_string(address.state, languages=languages)
    normalized_address.normalized_country = normalize_country_string(
        address.country,
        languages=languages
    )

    return normalized_address

And this is the normalize_country_string function:

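import postal.expand
import postal.parser


def normalize_country_string(country: str, languages: list[str]) -> str:
    """Normalize a country string, e.g. "U.S.A.", by parsing and expanding it with pypostal"""
    # parse_address returns (value, label) tuples; take the first component's value
    parsed_country = postal.parser.parse_address(
        country,
        language=languages[0],
    )
    expanded_country = postal.expand.expand_address(parsed_country[0][0], languages=languages)
    # expand_address returns a list of candidate normalizations; use the first one
    return expanded_country[0]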

Here's what I got

It worked well for street names, e.g. Sqr. was expanded to square.
But for U.S.A. I only got back usa.


Here's what I was expecting

I expected the abbreviation U.S.A. to be expanded to e.g. United States of America or another representation, but I was not able to get that. Maybe the library is not intended to do so, which would be completely fine with me. I was just wondering whether I made an error or whether that is intentional.


Here's what I think could be improved

More documentation, and maybe reStructuredText docstrings instead of something Doxygen-like in the Python parts, because Python tools (e.g. PyCharm) can parse them better.

@albarrentine
Contributor

Geographic expansions are mostly not included, since the implementation is simplistic: it just takes the Cartesian product of the expansions to create keys for matching, which can lead to a lot of irrelevant results across languages. The library's not really designed for standardizing addresses for display, though it can be kludged into doing so. You can always just include your own dictionary of expansions and look up the parsed result in that.
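
For illustration, the dictionary approach could look roughly like the sketch below. It is not part of libpostal or pypostal: the table COUNTRY_EXPANSIONS, its entries, and the helper names are all hypothetical, and the input is assumed to be either the raw country field or the string that expand_address returned for it.

```python
import re

# Hypothetical lookup table (not shipped with libpostal); extend it with
# whatever abbreviations actually occur in your data.
COUNTRY_EXPANSIONS = {
    "usa": "united states of america",
    "us": "united states of america",
    "uk": "united kingdom",
    "de": "germany",
}


def _lookup_key(text: str) -> str:
    """Casefold and drop punctuation/whitespace, so "U.S.A." becomes "usa"."""
    return re.sub(r"[\W_]+", "", text.casefold())


def expand_country(value: str) -> str:
    """Return the long form from the table, or the cleaned input if unknown."""
    fallback = value.strip().casefold()
    return COUNTRY_EXPANSIONS.get(_lookup_key(value), fallback)


print(expand_country("U.S.A."))
print(expand_country("usa"))
print(expand_country(" Germany "))
```

Since the lookup key ignores dots and case, it should not matter whether the value is looked up before or after libpostal's own normalization. Values missing from the table fall through unchanged apart from trimming and case folding.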
