feat: consolidate remote ids and wikisource identifiers #10092
base: master
@@ -2,6 +2,7 @@
import web
from openlibrary.catalog.utils import flip_name, author_dates_match, key_int
from openlibrary.core.helpers import extract_year
from openlibrary.utils import uniq

if TYPE_CHECKING:
    from openlibrary.plugins.upstream.models import Author

@@ -142,10 +143,59 @@ def walk_redirects(obj, seen):
    seen.add(obj['key'])
    return obj


# Try for an 'exact' (case-insensitive) name match, but fall back to alternate_names,
# then last name with identical birth and death dates (that are not themselves `None` or '').
def get_redirected_authors(authors: list["Author"]):
    if any(a.type.key != '/type/author' for a in authors):
        seen: set[str] = set()
        all_authors = [
            walk_redirects(a, seen) for a in authors if a['key'] not in seen
        ]
        return all_authors
    return authors


    # Look for OL ID first.
    if key := author.get("key"):
        if reply := list(web.ctx.site.things({"type": "/type/author", "key~": key})):
            # Always match on OL ID, even if remote identifiers don't match.
            return get_redirected_authors([web.ctx.site.get(k) for k in reply])
    # Try other identifiers next.
    if identifiers := author.get("identifiers"):
        queries = []
        matched_authors = []
        # Get all the authors that match any incoming identifier.
        for identifier, val in identifiers.items():
            queries.append({"type": "/type/author", f"remote_ids.{identifier}~": val})
        for query in queries:
            if reply := list(web.ctx.site.things(query)):
                matched_authors.extend(
                    get_redirected_authors([web.ctx.site.get(k) for k in reply])
                )
        matched_authors = uniq(matched_authors)
        # The match is whichever author has the most identifiers in common AND
        # does not have more conflicts than matched identifiers.
        highest_matches = 0
        selected_match = None
        for a in matched_authors:
            try:
                _, matches = a.merge_remote_ids(identifiers)
                if matches > highest_matches:
                    selected_match = a
                    highest_matches = matches
                elif matches == highest_matches and matches > 0:
                    # Prioritize the lower OL ID when matched identifiers are equal.
                    selected_match = (
                        a
                        if a.get_key_numeric() < selected_match.get_key_numeric()
                        else selected_match
                    )
            except Exception:
                # Reject if too many conflicts.
                # TODO: raise a flag to librarians here?
                pass
        if highest_matches > 0 and selected_match is not None:
            return [selected_match]
    # Fall back to name/date matching, which we did before introducing identifiers.
    name = author["name"].replace("*", r"\*")
    queries = [
        {"type": "/type/author", "name~": name},

Review comment: note to self: remove this

        {"type": "/type/author", "alternate_names~": name},
        {

@@ -155,37 +205,17 @@ def walk_redirects(obj, seen):
            "death_date~": f"*{extract_year(author.get('death_date', '')) or -1}*",
        },  # Use `-1` to ensure an empty string from extract_year doesn't match empty dates.
    ]
    things = []
    for query in queries:
        if reply := list(web.ctx.site.things(query)):
            break

    authors = [web.ctx.site.get(k) for k in reply]
    if any(a.type.key != '/type/author' for a in authors):
        seen: set[str] = set()
        authors = [walk_redirects(a, seen) for a in authors if a['key'] not in seen]
    return authors
def find_entity(author: dict[str, Any]) -> "Author | None":
    """
    Looks for an existing Author record in OL
    and returns it if found.

    :param dict author: Author import dict {"name": "Some One"}
    :return: Existing Author record if found, or None.
    """
    assert isinstance(author, dict)
    things = find_author(author)
    if author.get('entity_type', 'person') != 'person':
        return things[0] if things else None
    things = get_redirected_authors([web.ctx.site.get(k) for k in reply])

Review comment: note to self: break here

    match = []
    seen = set()
    for a in things:
        key = a['key']
        if key in seen:
            continue
        seen.add(key)
        orig_key = key
        assert a.type.key == '/type/author'
        if 'birth_date' in author and 'birth_date' not in a:
            continue

@@ -195,10 +225,27 @@ def find_entity(author: dict[str, Any]) -> "Author | None":
            continue
        match.append(a)
    if not match:
        return None
        return []
    if len(match) == 1:
        return match[0]
        return [match[0]]
    return pick_from_matches(author, match)
    return [pick_from_matches(author, match)]
def find_entity(author: dict[str, Any]) -> "Author | None":
    """
    Looks for an existing Author record in OL
    and returns it if found.

    :param dict author: Author import dict {"name": "Some One"}
    :return: Existing Author record if found, or None.
    """
    assert isinstance(author, dict)
    things = find_author(author)
    if "identifiers" in author:
        for index, t in enumerate(things):
            t.remote_ids, _ = t.merge_remote_ids(author["identifiers"])
            things[index] = t
    return things[0] if things else None


def remove_author_honorifics(name: str) -> str:
@@ -246,9 +293,20 @@ def import_author(author: dict[str, Any], eastern=False) -> "Author | dict[str,
        new['death_date'] = author['death_date']
        return new
    a = {'type': {'key': '/type/author'}}
    for f in 'name', 'title', 'personal_name', 'birth_date', 'death_date', 'date':
    for f in (
        'name',
        'title',
        'personal_name',
        'birth_date',
        'death_date',
        'date',
        'remote_ids',
    ):
        if f in author:
            a[f] = author[f]
    # Import records hitting the endpoint should list external IDs under "identifiers",
    # but they need to be "remote_ids" when going into the DB.
    if "identifiers" in author:
        a["remote_ids"] = author["identifiers"]
    return a
@@ -136,9 +136,77 @@ def test_author_wildcard_match_with_no_matches_creates_author_with_wildcard(
        new_author_name = import_author(author)
        assert author["name"] == new_author_name["name"]

    def test_first_match_priority_name_and_dates(self, mock_site):
    def test_first_match_ol_key(self, mock_site):
        """
        Highest priority match is name, birth date, and death date.
        Highest priority match is OL key.
        """
        self.add_three_existing_authors(mock_site)

        # Author with VIAF
        author = {
            "name": "William H. Brewer",
            "key": "/authors/OL3A",
            "type": {"key": "/type/author"},
            "remote_ids": {"viaf": "12345678"},
        }

        # Another author with VIAF
        author_different_key = {
            "name": "William Brewer",
            "key": "/authors/OL4A",
            "type": {"key": "/type/author"},
            "remote_ids": {"viaf": "87654321"},
        }

        mock_site.save(author)
        mock_site.save(author_different_key)

        # Look for an exact match on OL ID, regardless of other fields.
        # We ideally shouldn't ever have a case where different authors have the
        # same VIAF, but this demonstrates priority.
        searched_author = {
            "name": "William H. Brewer",
            "key": "/authors/OL4A",
            "identifiers": {"viaf": "12345678"},
        }
        found = import_author(searched_author)
        assert found.key == author_different_key["key"]

    def test_second_match_strong_identifier(self, mock_site):
        """
        Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
        """

Review comment (on lines +174 to +176): Tom Morris says that Goodreads and Amazon IDs aren't "strong" identifiers 🙃 Maybe just use "identifier"? Or "remote identifier"? (Paralleling the

        self.add_three_existing_authors(mock_site)

        # Author with VIAF
        author = {
            "name": "William H. Brewer",
            "key": "/authors/OL3A",
            "type": {"key": "/type/author"},
            "remote_ids": {"viaf": "12345678"},
        }

        # Another author with a different VIAF
        author_different_viaf = {
            "name": "William Brewer",
            "key": "/authors/OL4A",
            "type": {"key": "/type/author"},
            "remote_ids": {"viaf": "87654321"},
        }

        mock_site.save(author)
        mock_site.save(author_different_viaf)

        # Look for an exact match on VIAF, regardless of the name field.
        searched_author = {
            "name": "William Brewer",
            "identifiers": {"viaf": "12345678"},
        }
        found = import_author(searched_author)
        assert found.key == author["key"]

    def test_third_match_priority_name_and_dates(self, mock_site):
        """
        Next highest priority match is name, birth date, and death date.
        """
        self.add_three_existing_authors(mock_site)


@@ -201,7 +269,7 @@ def test_non_matching_birth_death_creates_new_author(self, mock_site):
        assert isinstance(found, dict)
        assert found["death_date"] == searched_and_not_found_author["death_date"]

    def test_second_match_priority_alternate_names_and_dates(self, mock_site):
    def test_match_priority_alternate_names_and_dates(self, mock_site):
        """
        Matching, as a unit, alternate name, birth date, and death date, get
        second match priority.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -35,6 +35,8 @@ const identifierPatterns = {
    lc_naf: /^n[bors]?[0-9]+$/,
    amazon: /^B[0-9A-Za-z]{9}$/,
    youtube: /^@[A-Za-z0-9_\-.]{3,30}/,
    imdb: /^\w{2}\d+$/,
    opac_sbn: /^\D{2}[A-Z0-3]V\d{6}$/,

Review comment (on lines +38 to +39): Why is this part of this PR?

}
export default {
    // Props are for external options; if a subelement of this is modified,
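The two new patterns can be sanity-checked outside the component. They are transcribed to Python below (the regex syntax is identical for these cases); the sample IDs are illustrative guesses, not values from the PR.

```python
import re

# Patterns transcribed from the Vue component's identifierPatterns map.
imdb = re.compile(r"^\w{2}\d+$")
opac_sbn = re.compile(r"^\D{2}[A-Z0-3]V\d{6}$")

print(bool(imdb.match("nm0000001")))       # True: two word chars, then digits
print(bool(opac_sbn.match("CFIV000652")))  # True: two non-digits, one of [A-Z0-3], "V", six digits
print(bool(imdb.match("nm")))              # False: no trailing digits
```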
@@ -8,6 +8,7 @@
import web
import json
import requests
import re
from typing import Any, TypedDict
from collections import defaultdict
from dataclasses import dataclass, field

@@ -29,7 +30,7 @@
from openlibrary.core.ratings import Ratings
from openlibrary.utils import extract_numeric_id_from_olid
from openlibrary.utils.isbn import to_isbn_13, isbn_13_to_isbn_10, canonical
from openlibrary.core.wikidata import WikidataEntity, get_wikidata_entity, REMOTE_IDS

from . import cache, waitinglist


@@ -217,6 +218,10 @@ def _get_d(self):
            "l": self._get_lists_cached(),
        }

    def get_key_numeric(self):
        """Returns just the numeric part of the key."""
        return int(re.search(r'\d+', self.key).group())

Review comment: Given that we know our own identifiers, we also know that they always start with

But also, what does

Reply: It's used here to choose the lower of two OL IDs when identifiers are otherwise the same: https://github.com/internetarchive/openlibrary/pull/10092/files#diff-8a66754753640315d80bf708c3483a13439e24736f081161b22e2d59cee76314R183

I haven't tried yet if this will work with just a string, which was going to be part of the unit tests I'll add when I mark this PR as ready for review.

Reply: But why this method instead of

Reply: Probably nothing, I've just never seen that function before.
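One detail worth pinning down in the unit tests the thread above mentions: `re.search` returns a `Match` object rather than a string, so `.group()` is required before `int()` (or the existing `extract_numeric_id_from_olid` helper the reviewer points to could be used instead). A minimal check with a hypothetical key:

```python
import re

key = "/authors/OL12345A"       # hypothetical OL author key
match = re.search(r"\d+", key)  # a Match object, not a string
print(int(match.group()))       # 12345
```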
class ThingReferenceDict(TypedDict):
    key: ThingKey

@@ -806,6 +811,25 @@ def get_edition_count(self):
    def get_lists(self, limit=50, offset=0, sort=True):
        return self._get_lists(limit=limit, offset=offset, sort=sort)

    def merge_remote_ids(
        self, incoming_ids: dict[str, str]
    ) -> tuple[dict[str, str], int]:
        output = {**self.remote_ids}
        if not incoming_ids:
            return output, 0
        matches = 0
        conflicts = 0
        for identifier in REMOTE_IDS:
            if identifier in output and identifier in incoming_ids:
                if output[identifier] != incoming_ids[identifier]:
                    conflicts += 1
                else:
                    output[identifier] = incoming_ids[identifier]
                    matches += 1
        if conflicts > matches:
            raise Exception("wikidata json conflicts with existing remote ids")
        return output, matches


class User(Thing):
    DEFAULT_PREFERENCES = {
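The merge semantics above can be sketched as a standalone function for experimentation. This is a simplified rendering, not the PR's method: `REMOTE_IDS` is an assumed subset of the list imported from `openlibrary.core.wikidata`, the existing ids are passed in as a parameter instead of read from `self.remote_ids`, and `ValueError` stands in for the generic `Exception`.

```python
REMOTE_IDS = ["viaf", "isni", "wikidata"]  # assumed subset of the real list

def merge_remote_ids(
    existing: dict[str, str], incoming: dict[str, str]
) -> tuple[dict[str, str], int]:
    """Count agreements and disagreements on identifiers present on both
    sides; reject the merge when conflicts outnumber matches."""
    output = dict(existing)
    if not incoming:
        return output, 0
    matches = conflicts = 0
    for identifier in REMOTE_IDS:
        if identifier in output and identifier in incoming:
            if output[identifier] != incoming[identifier]:
                conflicts += 1
            else:
                matches += 1
    if conflicts > matches:
        raise ValueError("incoming ids conflict with existing remote ids")
    return output, matches

merged, n = merge_remote_ids(
    {"viaf": "123", "isni": "0000"}, {"viaf": "123", "wikidata": "Q1"}
)
print(n)  # 1: only the shared, agreeing "viaf" counts as a match
```

Note that a single disagreeing shared identifier with no agreeing ones is enough to reject the merge.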
Review comment: This seems like a new feature to use author identifiers for author matching that is being mixed in with other work rather than being put forward as a clear stand-alone prerequisite.

#10029 feels like a planning step, without actually making the feature request, because it is presented as a question. #9927 covers the first step, but it focuses on importing identifiers in the first place, which we mostly do.

The desired feature appears to be "Use author identifiers on import to determine existing author record matches (not just name and date, which is the current method)."

This would involve extending the import schema to support populating the Author.remote_ids values, which does seem to be missing at the moment, although the UI allows them to be edited after import.

I'm flagging this line in particular because I don't think the author OLID value should be called `key` in the import schema. At this point it is just another identifier sourced from somewhere other than Open Library. It should be something like identifiers: {openlibrary: OL1234A}.

Being clear about which identifiers are suitable for this purpose up front would be good too. I agree that Amazon ids aren't very 'strong', for example. VIAF, ISNI, and Wikidata are the core ones I believe have decent coverage in OL and are likely to be useful right now.

Obviously, when a general mechanism is in place, the list can be extended as new use cases are brought forward.

There was quite a bit of reading between the lines to figure out what the core new feature is here. Changing the import schema significantly in code without a clear driving feature or design had me worried.

Reply: I can move the importing piece out into a separate PR. That code is here because I'd already done a proof of concept for it, but I was wary about trying to push it through when the remote_ids we'd be matching to aren't filled out as much as they could be. So backfilling that info for existing authors and proactively filling it out for future WD fetches going forward would ideally be a prerequisite to incorporating that import change.

I should probably hold off on further work on this at least until an agreement is reached about which identifiers should be used. @Freso mentioned in the issue thread that "library identifiers are, in my experience, often conflated and/or lacking a lot of entries, like OCLC/VIAF/ISNI are ripe with both duplicates and conflated entities and also don't have information on a lot of items (either reliable/useful information, or just straight no information at all). In my experience, identifiers that are community maintained/curated (like MBIDs, BBIDs, WD ids) are far more reliable, but all datasets—community or institution managed—has its holes/gaps."

Reply: @hornc please see also my mid-October comments on the related PR #9674

Reply: #9411 #9448

Reply: Thank you for those issue links @Freso, individual PRs that address #9411 and #9448 would be much easier to review and understand. I don't think either is really a prerequisite to the Wikisource import feature, but #9411 and #9448 are clear and self-contained, and seem reasonable to do.

Reply: #7724 from a year and a half ago is also relevant. Since MARC records have the highest volume of identifiers, I'd suggest starting with that.
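As a concrete illustration of the schema suggestion in the thread above (the field names here follow the review comment, not any agreed specification), an import record would carry the OLID under "identifiers" alongside other remote ids, rather than as a top-level key:

```python
# Hypothetical import payload shape per the review suggestion above.
suggested_author = {
    "name": "William H. Brewer",
    "identifiers": {
        "openlibrary": "OL1234A",  # OLID as just another identifier
        "viaf": "12345678",
    },
}
print("key" in suggested_author)  # False: no top-level OL key field
```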