
feat: consolidate remote ids and wikisource identifiers #10092

Draft · wants to merge 254 commits into master

Conversation

@pidgezero-one (Contributor) commented Nov 27, 2024

This should be squash-merged to avoid conflicts with #9674, which split off from this PR.

===This is a WIP===

Closes #10029
Closes #9927

Technical

  • fetching live Wikidata will also save Wikidata's author identifiers to the author's remote_ids (at least any that don't conflict with identifiers already recorded for that author, and only if there aren't too many conflicts)
  • backfill this operation for all existing authors with backfill_author_identifiers.py
  • the import pipeline matches by OL ID first, then by remote_ids (which should have been filled out with Wikidata identifiers via the backfill, and only as long as there aren't too many conflicts), then falls back to the existing logic of matching by name
  • import records can use an "identifiers" field in the author dict, which is a dict in the form { "viaf": "...", ... }. Related schema update PR: Update author/import models openlibrary-client#419
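The matching priority in the list above can be sketched roughly as follows. This is an illustrative sketch, not the actual PR code: the helper names (find_by_key, find_by_remote_ids, find_by_name) are hypothetical, and the conflict threshold is assumed to live inside the remote-ID lookup.

```python
def match_author(author, find_by_key, find_by_remote_ids, find_by_name):
    """Return an existing author record, or None if no match is found.

    A rough sketch of the matching priority described above; the three
    lookup callables are hypothetical stand-ins for the real queries.
    """
    # 1. Highest priority: exact OL ID match, if the record carries one.
    if key := author.get("key"):
        if existing := find_by_key(key):
            return existing

    # 2. Next: remote identifiers (VIAF, ISNI, Wikidata, ...), assuming
    #    the "not too many conflicts" rule is enforced inside the lookup.
    if remote_ids := author.get("remote_ids"):
        if existing := find_by_remote_ids(remote_ids):
            return existing

    # 3. Fall back to the pre-existing name-based matching.
    return find_by_name(author.get("name", ""))
```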

TODO: unit tests
TODO: how do we flag conflicts to librarians?
TODO: maybe move the consolidation logic to the author object instead of the wikidata object
TODO: fix pre-commit linting problems (2nd TODO will take care of most of this)

The import model is expanded with additional logic that writes incoming identifiers to the author's remote_ids when they are detected in the incoming JSON object, and searches Infogami against those remote IDs to detect whether the author already exists. The incoming author dict can carry these in an optional remote_ids field (e.g. VIAF stored in author["remote_ids"]["viaf"]). The exception is the OL ID, which is not a remote identifier and so is expected at author["key"].
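A minimal sketch of the "not too many conflicts" rule: the threshold value and the function name here are assumptions for illustration, not the PR's actual code.

```python
MAX_CONFLICTS = 2  # hypothetical tolerance, not the PR's real threshold

def merge_remote_ids(existing, incoming):
    """Merge incoming identifiers into existing ones, counting conflicts.

    A conflict is the same identifier type carrying two different values;
    the existing value wins. Returns the merged dict and whether the merge
    stayed under the conflict tolerance (if not, it would be flagged for
    librarian review rather than saved silently).
    """
    merged = dict(existing)
    conflicts = 0
    for id_type, value in incoming.items():
        if merged.get(id_type, value) != value:
            conflicts += 1  # disagreement: keep the existing value
        else:
            merged[id_type] = value
    return merged, conflicts <= MAX_CONFLICTS
```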

Issues:

  • Importing books returns a 200 success, but the author's page still says 0 works

Testing

Tested using the output from #9674.

To test the import, I wasn't sure how to hit /api/import with user credentials, so I disabled the if not can_write(): condition in openlibrary/plugins/importapi/code.py as well as the if not account_key: condition in openlibrary/catalog/add_book/__init__.py, and copy-pasted the printed JSON records into a Postman request body.

I copied a wikidata JSON (which included an OL ID) into the wikidata postgres table, and then added the author with ./copydocs.py. I then ran backfill_author_identifiers.py, and identifiers that existed in the Wikidata json but not the author's remote_ids began to show up on the author's page.
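Roughly, the backfill walks existing authors and fills identifier gaps from the stored Wikidata JSON. The helper names below are illustrative stand-ins; the real script's internals may differ.

```python
def backfill(authors, get_stored_wikidata, extract_identifiers, save):
    """For each author, copy Wikidata identifiers into remote_ids,
    filling gaps only and never overwriting an existing identifier."""
    for author in authors:
        wd_json = get_stored_wikidata(author)  # row from the wikidata table
        if not wd_json:
            continue
        remote_ids = author.setdefault("remote_ids", {})
        for id_type, value in extract_identifiers(wd_json).items():
            # setdefault leaves any identifier the author already has intact
            remote_ids.setdefault(id_type, value)
        save(author)
```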

Example:

{
    "title": "Equitation",
    "source_records": [
        "wikisource:en:Equitation"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation",
    "publish_date": "1922",
    "authors": [
        {
            "name": "Henry L. de Bussigny",
            "remote_ids": {
                "wikidata": "Q16862522",
                "viaf": "305913238",
                "isni": "0000000424758764",
                "project_gutenberg": "40106"
            }
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}

This responds successfully with:

{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "created"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL18M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL9W",
        "status": "created"
    }
}

Viewing this author key at http://localhost:8080/authors/OL15A shows that the strong identifiers were imported correctly:

[screenshot: author page with imported identifiers]

Editing the author verifies this as well:
[screenshot: author edit form showing the same identifiers]

I then created a test book record whose author uses the same VIAF but has a missing name:

{
    "title": "Equitation test",
    "source_records": [
        "wikisource:en:Equitation_test"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation_test"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation_test",
    "publish_date": "1922",
    "authors": [
        {
            "name": "viaf test",
            "remote_ids": {
                "viaf": "305913238"
            }
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}

The response shows that the author was successfully matched to an existing one by VIAF:

{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "matched"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL19M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL10W",
        "status": "created"
    }
}

This also works for OL IDs, which use a slightly different fetch query than the other strong identifiers do:

{
    "title": "Equitation test 2",
    "source_records": [
        "wikisource:en:Equitation_test_2"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation_test_2"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation_test_2",
    "publish_date": "1922",
    "authors": [
        {
            "name": "author ol_id test",
            "key": "/authors/OL15A"
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}
{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "matched"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL23M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL14W",
        "status": "created"
    }
}

Importing optional cover images also works:
[screenshot: imported cover image]

I added support for all strong identifiers found in identifiers.yml, except for Inventaire, which I couldn't find in Wikidata:
[screenshot: supported identifiers]

Screenshot

Stakeholders

@codecov-commenter commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 17.12%. Comparing base (347bff9) to head (41cc424).
Report is 91 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #10092   +/-   ##
=======================================
  Coverage   17.12%   17.12%           
=======================================
  Files          89       89           
  Lines        4752     4752           
  Branches      831      831           
=======================================
  Hits          814      814           
  Misses       3428     3428           
  Partials      510      510           


@Freso (Contributor) left a comment:

Seems (mostly) good to me, but I did skim over some parts. Maybe I’ll look again later. 🙈

Comment on lines +174 to +176
def test_second_match_strong_identifier(self, mock_site):
"""
Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
Contributor:

Tom Morris says that Goodreads and Amazon IDs aren’t “strong” identifiers 🙃 Maybe just use “identifier”? Or “remote identifier”? (Paralleling the remote_ids property.)

Suggested change:
-    def test_second_match_strong_identifier(self, mock_site):
-        """
-        Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
+    def test_second_match_remote_identifier(self, mock_site):
+        """
+        Next highest priority match is any other remote identifier, such as VIAF, Goodreads ID, Amazon ID, etc.

@@ -217,6 +218,10 @@ def _get_d(self):
"l": self._get_lists_cached(),
}

def get_key_numeric(self):
"""Returns just the numeric part of the key."""
return int(re.search(r'\d+', self.key))
@Freso (Contributor) commented Nov 27, 2024:

Given that we know our own identifiers, we also know that they always start with OL and end with a single letter (currently A, W, or E). This means that, instead of doing an expensive regular expression, we could probably just do some .strip()’ing – or even just slice the string:

Suggested change:
-    return int(re.search(r'\d+', self.key))
+    return int(self.key[2:-1])

But also, what does extract_numeric_id_from_olid from openlibrary.utils do? Just based on the name, it seems eerily similar to this method, but I didn’t look it up. :)
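For what it's worth, both variants depend on the exact shape of self.key. A minimal sketch assuming keys look like "/authors/OL15A": the regex form needs .group() before int(), and the slice form needs the type prefix stripped first.

```python
import re

key = "/authors/OL15A"

# Regex approach: re.search() returns a Match object, so .group() is
# needed before converting to int.
numeric = int(re.search(r"\d+", key).group())

# Slice approach: works on the bare OLID once the "/authors/" prefix is
# removed; "OL15A"[2:-1] == "15".
olid = key.split("/")[-1]
numeric_sliced = int(olid[2:-1])
```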

Contributor Author:

It's used here to choose the lower of two OL IDs when identifiers are otherwise the same: https://github.com/internetarchive/openlibrary/pull/10092/files#diff-8a66754753640315d80bf708c3483a13439e24736f081161b22e2d59cee76314R183

I haven't yet tested whether this works with just a string; that was going to be part of the unit tests I add when I mark this PR as ready for review.

Contributor:

But why this method instead of openlibrary.utils.extract_numeric_id_from_olid() for this? What does this method do that the existing function doesn’t do?

Contributor Author:

probably nothing, I've just never seen that function before

@@ -29,6 +31,25 @@
}
]

# The keys in this dict need to match their corresponding names in openlibrary/plugins/openlibrary/config/author/identifiers.yml
# Ordered by what I assume is most (viaf) to least (amazon/youtube) reliable for author matching
Contributor:

Unless we plan to do anything with the ordering, it’d probably be better to just do something like alphabetic ordering so it’s easy to add new identifiers to this.

Also, an alternative approach could maybe be to add a wikidata item to identifiers.yml which could be read here? Otherwise this approach means that there are more places to edit when adding/editing identifiers (e.g., #9982 (pending) and #10052 (merged and live on prod, but the identifier is not included here)). This would also mean that we wouldn’t need to maintain and handle separate REMOTE_IDS lists for authors, editions, and works (e.g., musicbrainz and bookbrainz have different Wikidata properties depending on whether it’s an Author, Edition, or Work, which can’t be handled with this current structure).

Comment on lines +30 to +32
password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
if password.endswith('\n'):
password = password[:-1]
Contributor:

Suggested change:
-    password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
-    if password.endswith('\n'):
-        password = password[:-1]
+    with open(os.path.expanduser('~/.openlibrary_db_password')) as pwfile:
+        password = pwfile.read().strip('\n')

password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
if password.endswith('\n'):
password = password[:-1]
except:
Contributor:

A bare except is trouble. What kind of exceptions do you expect to run into here? E.g., if a user hits Ctrl-C at this specific point, it probably shouldn’t be ignored.

@pidgezero-one (Contributor Author) commented Nov 27, 2024:

no idea, I stole this part from another script for the sake of seeing if the thing would work

Comment on lines +142 to +144
res = self._get_statement_values("P648")
if len(res) > 0:
return res[0]
Contributor:

You’re using re.fullmatch(r"^OL\d+A$", ol_id) on the return value of this method in both of the places it is used. Maybe it would make sense to add that check in here directly?

Also, if more than one OL ID is returned, it should probably return the lowest (get_key_numeric()). (Or the lowest of the highest available WD ranking (i.e., preferred > normal > deprecated), but this is not possible currently. It might also be worth looking into making a separate PR to make _get_statement_values() not include deprecated values – or maybe even only “preferred” values (if available). But that’s out of the scope of this PR.)
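The "return the lowest OL ID" idea could look something like this. A sketch under the assumption that P648 statement values are strings like "OL123A"; this is not the PR's actual code.

```python
import re

def pick_lowest_olid(values):
    """From several candidate P648 values, keep only well-formed author
    OLIDs and return the one with the lowest numeric part."""
    olids = [v for v in values if re.fullmatch(r"OL\d+A", v)]
    return min(olids, key=lambda v: int(v[2:-1])) if olids else None
```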

Comment on lines +38 to +39
imdb: /^\w{2}\d+$/,
opac_sbn: /^\D{2}[A-Z0-3]V\d{6}$/,
Contributor:

Why is this part of this PR?

things = find_author(author)
if author.get('entity_type', 'person') != 'person':
return things[0] if things else None
things = get_redirected_authors([web.ctx.site.get(k) for k in reply])
Contributor Author:

note to self: break here

name = author["name"].replace("*", r"\*")
queries = [
{"type": "/type/author", "name~": name},
Contributor Author:

note to self: remove this

requirements.txt Outdated
@@ -30,3 +32,4 @@ sentry-sdk==2.8.0
simplejson==3.19.1
statsd==4.0.1
validate_email==1.3
wikitextparser==0.56.1
Contributor Author:

note to self: put these back in the other PR

@@ -0,0 +1,54 @@
"""
Copies all author identifiers from the author's stored Wikidata info into their remote_ids.
Contributor Author:

question for later: should this also scrape wikidata for authors that have an OL ID on their side but we don't have their wikidata json on our side? not sure if any of these actually exist

Contributor:

Pretty sure they exist, but I’d suggest leaving that out of this one, and consider it for a later PR. Better to keep commits/PRs as atomic as possible. :)

return authors

# Look for OL ID first.
if key := author.get("key"):
Collaborator:

This seems like a new feature to use author identifiers for author matching that is being mixed in with other work rather than being put forward as a clear stand alone pre-requisite.

#10029 feels like a planning step, without actually making the feature request because it is presented as a question. #9927 covers the first step, but it focuses on importing identifiers in the first place, which we mostly do.

The desired feature appears to be "Use author identifiers on import to determine existing author record matches, (not just Name and date, which is the current method)"

This would involve extending the import schema to support populating the Author.remote_ids values, which does seem to be missing at the moment, although the UI allows them to be edited after import.

I'm flagging this line in particular because I don't think the author OLID value should be called key in the import schema. At this point it is just another identifier sourced from somewhere other than Open Library. It should be something like identifiers: {openlibrary: OL1234A}

Being clear about which identifiers are suitable for this purpose up front would be good too. I agree that Amazon ids aren't very 'strong' for example. VIAF, ISNI, and Wikidata are the core ones I believe have decent coverage in OL and are likely to be useful right now.

Obviously, when a general mechanism is in place, the list can be extended as new usecases are brought forward.

There was quite a bit of reading between the lines to figure out what the core new feature is here. Changing the import schema significantly in code without a clear driving feature or design had me worried.

Contributor Author:

I can move the importing piece out into a separate PR. That code is here because I'd already done a proof of concept for it, but I was wary about trying to push it through when the remote_ids we'd be matching to aren't filled out as much as they could be. So backfilling that info for existing authors and proactively filling it out for future WD fetches going forward would ideally be a prerequisite to incorporating that import change.

I should probably hold off on further work on this at least until an agreement is reached about which identifiers should be used, @Freso mentioned in the issue thread that "library identifiers are, in my experience, often conflated and/or lacking a lot of entries, like OCLC/VIAF/ISNI are ripe with both duplicates and conflated entities and also don’t have information on a lot of items (either reliable/useful information, or just straight no information at all). In my experience, identifiers that are community maintained/curated (like MBIDs, BBIDs, WD ids) are far more reliable, but all datasets—community or institution managed—has its holes/gaps."

Contributor:

@hornc please see also my mid-October comments on the related PR #9674

The whole issue of author matching and strong identifier usage is much too important to be hidden in a PR about WikiSource importing.

@hornc should be involved and the main use case of MARC import of records containing VIAF, LCCN, etc identifiers should, in my opinion, be implemented and debugged first before addressing obscure use cases like WikiSource.

Contributor:

The desired feature appears to be "Use author identifiers on import to determine existing author record matches, (not just Name and date, which is the current method)"

#9411
#9448

Collaborator:

Thank you for those issue links @Freso , individual PRs that address #9411 and #9448 would be much easier to review and understand. I don't think either are really a pre-requisite to the Wikisource import feature, but #9411 and #9448 are clear and self contained, and seem reasonable to do.

Contributor:

#7724 from a year and a half ago is also relevant. Since MARC records have the highest volume of identifiers, I'd suggest starting with that.

@hornc (Collaborator) commented Nov 28, 2024

@pidgezero-one general comments based on the examples in the description:

  1. I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that is because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way.

  2. I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata is pointing to e.g. a book printed in 1922.

@pidgezero-one (Contributor Author):

@hornc

I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that is because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way.

If I'm understanding correctly, this means that I just don't need to include that in the import record if it's coming from somewhere that isn't IA?

I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata is pointing to e,g, a book printed in 1922.

I added this on a recommendation for books that have no publisher info returned from WD or WS, since publisher is a required field. Is there a better default that could be used instead?
