feat: consolidate remote ids and wikisource identifiers #10092
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master   #10092   +/-  ##
=======================================
  Coverage   17.12%   17.12%
=======================================
  Files          89       89
  Lines        4752     4752
  Branches      831      831
=======================================
  Hits          814      814
  Misses       3428     3428
  Partials      510      510
```

☔ View full report in Codecov by Sentry.
Seems (mostly) good to me, but I did skim over some parts. Maybe I’ll look again later. 🙈
```python
def test_second_match_strong_identifier(self, mock_site):
    """
    Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
```
Tom Morris says that Goodreads and Amazon IDs aren’t “strong” identifiers 🙃 Maybe just use “identifier”? Or “remote identifier”? (Paralleling the `remote_ids` property.)
```diff
-def test_second_match_strong_identifier(self, mock_site):
-    """
-    Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
+def test_second_match_remote_identifier(self, mock_site):
+    """
+    Next highest priority match is any other remote identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
```
```diff
@@ -217,6 +218,10 @@ def _get_d(self):
             "l": self._get_lists_cached(),
         }

+    def get_key_numeric(self):
+        """Returns just the numeric part of the key."""
+        return int(re.search(r'\d+', self.key).group())
```
Given that we know our own identifiers, we also know that they always start with `OL` and end with a single letter (currently `A`, `W`, or `E`). This means that, instead of doing an expensive regular expression, we could probably just do some `.strip()`’ing – or even just slice the string:
```diff
-        return int(re.search(r'\d+', self.key).group())
+        return int(self.key[2:-1])
```
But also, what does `extract_numeric_id_from_olid` from `openlibrary.utils` do? Just based on the name, it seems eerily similar to this method, but I didn’t look it up. :)
It's used here to choose the lower of two OL IDs when identifiers are otherwise the same: https://github.com/internetarchive/openlibrary/pull/10092/files#diff-8a66754753640315d80bf708c3483a13439e24736f081161b22e2d59cee76314R183

I haven't yet tested whether this will work with just a string; that was going to be part of the unit tests I add when I mark this PR as ready for review.
But why this method instead of `openlibrary.utils.extract_numeric_id_from_olid()` for this? What does this method do that the existing function doesn’t do?
probably nothing, I've just never seen that function before
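For reference, a minimal sketch of swapping in the existing helper (hedged: the helper is assumed here to accept a full key and return the numeric portion, possibly as a string):

```python
from openlibrary.utils import extract_numeric_id_from_olid

def get_key_numeric(self):
    """Returns just the numeric part of the key, e.g. 15 for '/authors/OL15A'."""
    # Assumes extract_numeric_id_from_olid() accepts keys like '/authors/OL15A';
    # int() normalizes in case it returns the digits as a string.
    return int(extract_numeric_id_from_olid(self.key))
```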
```diff
@@ -29,6 +31,25 @@
     }
 ]

+# The keys in this dict need to match their corresponding names in openlibrary/plugins/openlibrary/config/author/identifiers.yml
+# Ordered by what I assume is most (viaf) to least (amazon/youtube) reliable for author matching
```
Unless we plan to do anything with the ordering, it’d probably be better to just do something like alphabetic ordering so it’s easy to add new identifiers to this.

Also, an alternative approach could maybe be to add a `wikidata` item to `identifiers.yml` which could be read here? Otherwise this approach means that there are more places to edit when adding/editing identifiers (e.g., #9982 (pending) and #10052 (merged and live on prod, but the identifier is not included here)). This would also mean that we wouldn’t need to maintain and handle separate `REMOTE_IDS` lists for authors, editions, and works (e.g., `musicbrainz` and `bookbrainz` have different Wikidata properties depending on whether it’s an Author, Edition, or Work, which can’t be handled with this current structure).
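A rough sketch of that alternative, assuming a hypothetical `wikidata` field were added to each entry in `identifiers.yml` (the YAML layout implied below is an assumption, not the file's current structure):

```python
import yaml

# Hypothetical: each identifiers.yml entry gains a `wikidata` field holding its
# Wikidata property, e.g. {name: viaf, label: VIAF, wikidata: P214}.
IDENTIFIERS_YML = 'openlibrary/plugins/openlibrary/config/author/identifiers.yml'

with open(IDENTIFIERS_YML) as f:
    _entries = yaml.safe_load(f)

# Derive the mapping from config so new identifiers are added in one place only.
REMOTE_IDS = {e['name']: e['wikidata'] for e in _entries if 'wikidata' in e}
```

(P214 really is the Wikidata property for VIAF, but the YAML field itself is invented.)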
```python
password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
if password.endswith('\n'):
    password = password[:-1]
```
```diff
-password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
-if password.endswith('\n'):
-    password = password[:-1]
+with open(os.path.expanduser('~/.openlibrary_db_password')) as pwfile:
+    password = pwfile.read().strip('\n')
```
```python
    password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
    if password.endswith('\n'):
        password = password[:-1]
except:
```
A bare `except` is trouble. What kind of exceptions do you expect to run into here? E.g., if a user hits Ctrl-C at this specific point, it probably shouldn’t be ignored.
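A minimal sketch of a narrower handler, assuming the only failure worth ignoring here is a missing password file:

```python
import os

password = None
try:
    with open(os.path.expanduser('~/.openlibrary_db_password')) as pwfile:
        password = pwfile.read().strip('\n')
except FileNotFoundError:
    # No password file is fine; fall back to other configuration.
    # KeyboardInterrupt and unexpected errors still propagate.
    pass
```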
no idea, I stole this part from another script for the sake of seeing if the thing would work
```python
res = self._get_statement_values("P648")
if len(res) > 0:
    return res[0]
```
You’re using `re.fullmatch(r"^OL\d+A$", ol_id)` on the return value of this method in both of the places it is used. Maybe it would make sense to add that check in here directly?

Also, if more than one OL ID is returned, it should probably return the lowest (`get_key_numeric()`). (Or the lowest of the highest available WD ranking (i.e., preferred > normal > deprecated), but this is not possible currently. It might also be worth looking into making a separate PR to make `_get_statement_values()` not include deprecated values – or maybe even only “preferred” values (if available). But that’s out of the scope of this PR.)
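A sketch combining both suggestions (the method name and the assumption that `_get_statement_values()` returns a list of strings are mine):

```python
import re

def _get_olid(self):
    """Return the lowest-numbered valid author OLID from Wikidata's P648 values, or None."""
    # P648 is Wikidata's "Open Library ID" property; valid author values look like 'OL15A'.
    candidates = [
        v for v in self._get_statement_values("P648") if re.fullmatch(r"OL\d+A", v)
    ]
    # Prefer the lowest numeric OLID when Wikidata lists more than one.
    return min(candidates, key=lambda olid: int(olid[2:-1]), default=None)
```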
```js
imdb: /^\w{2}\d+$/,
opac_sbn: /^\D{2}[A-Z0-3]V\d{6}$/,
```
Why is this part of this PR?
```python
things = find_author(author)
if author.get('entity_type', 'person') != 'person':
    return things[0] if things else None
things = get_redirected_authors([web.ctx.site.get(k) for k in reply])
```
note to self: break here
```python
name = author["name"].replace("*", r"\*")
queries = [
    {"type": "/type/author", "name~": name},
```
note to self: remove this
requirements.txt (Outdated)

```diff
@@ -30,3 +32,4 @@ sentry-sdk==2.8.0
 simplejson==3.19.1
 statsd==4.0.1
 validate_email==1.3
+wikitextparser==0.56.1
```
note to self: put these back in the other PR
```diff
@@ -0,0 +1,54 @@
+"""
+Copies all author identifiers from the author's stored Wikidata info into their remote_ids.
```
question for later: should this also scrape wikidata for authors that have an OL ID on their side but we don't have their wikidata json on our side? not sure if any of these actually exist
Pretty sure they exist, but I’d suggest leaving that out of this one, and consider it for a later PR. Better to keep commits/PRs as atomic as possible. :)
```python
return authors

# Look for OL ID first.
if key := author.get("key"):
```
This seems like a new feature to use author identifiers for author matching that is being mixed in with other work rather than being put forward as a clear standalone prerequisite.

#10029 feels like a planning step, without actually making the feature request, because it is presented as a question. #9927 covers the first step, but it focuses on importing identifiers in the first place, which we mostly do.

The desired feature appears to be "Use author identifiers on import to determine existing author record matches (not just name and date, which is the current method)".

This would involve extending the import schema to support populating the `Author.remote_ids` values, which does seem to be missing at the moment, although the UI allows them to be edited after import.

I'm flagging this line in particular because I don't think the author OLID value should be called `key` in the import schema. At this point it is just another identifier sourced from somewhere other than Open Library. It should be something like `identifiers: {openlibrary: OL1234A}`.

Being clear about which identifiers are suitable for this purpose up front would be good too. I agree that Amazon IDs aren't very 'strong', for example. VIAF, ISNI, and Wikidata are the core ones I believe have decent coverage in OL and are likely to be useful right now. Obviously, when a general mechanism is in place, the list can be extended as new use cases are brought forward.

There was quite a bit of reading between the lines to figure out what the core new feature is here. Changing the import schema significantly in code without a clear driving feature or design had me worried.
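For illustration, an import-schema author entry of the shape this suggests might look like the following hedged sketch (every field except the `identifiers`/`openlibrary` naming is invented):

```python
# Hypothetical import record fragment; values are invented for illustration.
author = {
    "name": "Mark Twain",
    "identifiers": {
        "openlibrary": "OL1234A",  # the OLID as just another sourced identifier
        "viaf": "50566653",
        "wikidata": "Q7245",
    },
}
```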
I can move the importing piece out into a separate PR. That code is here because I'd already done a proof of concept for it, but I was wary about trying to push it through when the remote_ids we'd be matching to aren't filled out as much as they could be. So backfilling that info for existing authors and proactively filling it out for future WD fetches going forward would ideally be a prerequisite to incorporating that import change.
I should probably hold off on further work on this at least until an agreement is reached about which identifiers should be used. @Freso mentioned in the issue thread that "library identifiers are, in my experience, often conflated and/or lacking a lot of entries, like OCLC/VIAF/ISNI are ripe with both duplicates and conflated entities and also don’t have information on a lot of items (either reliable/useful information, or just straight no information at all). In my experience, identifiers that are community maintained/curated (like MBIDs, BBIDs, WD ids) are far more reliable, but all datasets—community or institution managed—has its holes/gaps."
@hornc please see also my mid-October comments on the related PR #9674
The whole issue of author matching and strong identifier usage is much too important to be hidden in a PR about WikiSource importing.
@hornc should be involved, and the main use case of MARC import of records containing VIAF, LCCN, etc. identifiers should, in my opinion, be implemented and debugged first before addressing obscure use cases like WikiSource.
#7724 from a year and a half ago is also relevant. Since MARC records have the highest volume of identifiers, I'd suggest starting with that.
@pidgezero-one general comments based on the examples in the description:
If I'm understanding correctly, this means that I just don't need to include that in the import record if it's coming from somewhere that isn't IA?

I added this on a recommendation for books that have no publisher info returned from WD or WS, since publisher is a required field. Is there a better default that could be used instead?
this should be squash merged to avoid conflicts with #9674, which split off from this PR
===This is a WIP===
Closes #10029
Closes #9927
Technical
TODO: unit tests
TODO: how do we flag conflicts to librarians?
TODO: maybe move the consolidation logic to the author object instead of the wikidata object
TODO: fix pre-commit linting problems (2nd TODO will take care of most of this)
The import model is expanded by adding some additional logic to write to the author's `remote_ids` when detected in the incoming JSON object, and to search Infogami against those remote IDs to detect whether the author already exists. The incoming `author` dict can store an optional `remote_ids` field to contain these (i.e. VIAF stored in `author["remote_ids"]["viaf"]`), except for the OL ID, which is not a remote identifier, so it is expected at `author["key"]`.

Issues:
Testing

Tested using the output from #9674.

To test the import, I wasn't sure how to hit /api/import with user credentials, so I disabled the `if not can_write():` condition in openlibrary/plugins/importapi/code.py as well as the `if not account_key:` condition in openlibrary/catalog/add_book/__init__.py, and copy-pasted the printed JSON records into a Postman request body.

I copied a wikidata JSON (which included an OL ID) into the wikidata postgres table, and then added the author with ./copydocs.py. I then ran backfill_author_identifiers.py, and identifiers that existed in the Wikidata JSON but not the author's remote_ids began to show up on the author's page.
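As an aside, a possible way to hit the endpoint without disabling those checks, sketched with `requests` (the dev admin credentials and the login form fields are assumptions based on the standard Docker setup, not verified here):

```python
import requests

session = requests.Session()
# Log in first so the session carries an authenticated cookie.
session.post(
    'http://localhost:8080/account/login',
    data={'username': 'openlibrary@example.com', 'password': 'admin123'},
)
# POST the printed JSON record to the import API with that session.
with open('record.json', 'rb') as f:
    resp = session.post('http://localhost:8080/api/import', data=f.read())
print(resp.status_code, resp.text)
```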
Example:
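A hedged reconstruction of the author portion of such a record, following the schema described under Technical (values are illustrative only):

```python
# Author portion of an import record per this PR's schema; values illustrative.
author = {
    "name": "Mark Twain",
    "key": "/authors/OL15A",  # OL ID is expected at author["key"], not in remote_ids
    "remote_ids": {
        "viaf": "50566653",  # stored as author["remote_ids"]["viaf"]
        "wikidata": "Q7245",
    },
}
```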
This responds successfully with:
Viewing this author key at http://localhost:8080/authors/OL15A shows that the strong identifiers were imported correctly. Editing the author verifies this as well.
I then created a test book record whose author uses the same VIAF but has a missing name:
The response shows that the author was successfully matched to an existing one by VIAF:
This also works for OL IDs, which use a slightly different fetch query than the other strong identifiers do:
Importing optional cover images also works:
I added support for all strong identifiers found in identifiers.yml, except for Inventaire, which I couldn't find in Wikidata:
Stakeholders