
feat: consolidate remote ids and wikisource identifiers #10092

Draft · wants to merge 254 commits into master

Conversation

@pidgezero-one (Contributor) commented Nov 27, 2024

This should be squash-merged to avoid conflicts with #9674, which split off from this PR.

===This is a WIP===

Closes #10029
Closes #9927

Technical

  • fetching live Wikidata will also save Wikidata's author identifiers to the author's remote_ids (at least any that don't conflict with identifiers already recorded for that author, and only if there aren't too many conflicts)
  • backfill this operation for all existing authors with backfill_author_identifiers.py
  • the import pipeline matches by OL ID first, then by remote_ids (which should have been filled out with Wikidata identifiers via the backfill, and only as long as there aren't too many conflicts), then falls back to the existing logic of matching by name
  • import records can use an "identifiers" field in the author dict, which is a dict in the form { "viaf": "...", ... }. Related schema update PR: Update author/import models openlibrary-client#419
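The matching priority in the list above can be sketched roughly as follows. This is an illustrative sketch, not the actual PR code: the helper names (find_by_key, find_by_remote_ids, find_by_name) are hypothetical, and the conflict threshold is assumed to live inside the remote-ID lookup.

```python
def match_author(author, find_by_key, find_by_remote_ids, find_by_name):
    """Return an existing author record, or None if no match is found.

    A rough sketch of the matching priority described above; the three
    lookup callables are hypothetical stand-ins for the real queries.
    """
    # 1. Highest priority: exact OL ID match, if the record carries one.
    if key := author.get("key"):
        if existing := find_by_key(key):
            return existing

    # 2. Next: remote identifiers (VIAF, ISNI, Wikidata, ...), assuming
    #    the "not too many conflicts" rule is enforced inside the lookup.
    if remote_ids := author.get("remote_ids"):
        if existing := find_by_remote_ids(remote_ids):
            return existing

    # 3. Fall back to the pre-existing name-based matching.
    return find_by_name(author.get("name", ""))
```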

TODO: unit tests
TODO: how do we flag conflicts to librarians?
TODO: maybe move the consolidation logic to the author object instead of the wikidata object
TODO: fix pre-commit linting problems (2nd TODO will take care of most of this)

The import model is expanded with additional logic that writes incoming identifiers to the author's remote_ids when they are detected in the incoming JSON object, and searches Infogami against those remote IDs to detect whether the author already exists. The incoming author dict can carry these in an optional remote_ids field (e.g. VIAF stored in author["remote_ids"]["viaf"]). The exception is the OL ID, which is not a remote identifier and so is expected at author["key"].
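A minimal sketch of the "not too many conflicts" rule: the threshold value and the function name here are assumptions for illustration, not the PR's actual code.

```python
MAX_CONFLICTS = 2  # hypothetical tolerance, not the PR's real threshold

def merge_remote_ids(existing, incoming):
    """Merge incoming identifiers into existing ones, counting conflicts.

    A conflict is the same identifier type carrying two different values;
    the existing value wins. Returns the merged dict and whether the merge
    stayed under the conflict tolerance (if not, it would be flagged for
    librarian review rather than saved silently).
    """
    merged = dict(existing)
    conflicts = 0
    for id_type, value in incoming.items():
        if merged.get(id_type, value) != value:
            conflicts += 1  # disagreement: keep the existing value
        else:
            merged[id_type] = value
    return merged, conflicts <= MAX_CONFLICTS
```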

Issues:

  • Importing books returns a 200 success, but the author's page still says 0 works

Testing

Tested using the output from #9674.

To test the import, I wasn't sure how to hit /api/import with user credentials, so I disabled the if not can_write(): condition in openlibrary/plugins/importapi/code.py as well as the if not account_key: condition in openlibrary/catalog/add_book/__init__.py, and copy-pasted the printed JSON records into a Postman request body.

I copied a wikidata JSON (which included an OL ID) into the wikidata postgres table, and then added the author with ./copydocs.py. I then ran backfill_author_identifiers.py, and identifiers that existed in the Wikidata json but not the author's remote_ids began to show up on the author's page.
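Roughly, the backfill walks existing authors and fills identifier gaps from the stored Wikidata JSON. The helper names below are illustrative stand-ins; the real script's internals may differ.

```python
def backfill(authors, get_stored_wikidata, extract_identifiers, save):
    """For each author, copy Wikidata identifiers into remote_ids,
    filling gaps only and never overwriting an existing identifier."""
    for author in authors:
        wd_json = get_stored_wikidata(author)  # row from the wikidata table
        if not wd_json:
            continue
        remote_ids = author.setdefault("remote_ids", {})
        for id_type, value in extract_identifiers(wd_json).items():
            # setdefault leaves any identifier the author already has intact
            remote_ids.setdefault(id_type, value)
        save(author)
```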

Example:

{
    "title": "Equitation",
    "source_records": [
        "wikisource:en:Equitation"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation",
    "publish_date": "1922",
    "authors": [
        {
            "name": "Henry L. de Bussigny",
            "remote_ids": {
                "wikidata": "Q16862522",
                "viaf": "305913238",
                "isni": "0000000424758764",
                "project_gutenberg": "40106"
            }
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}

This responds successfully with:

{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "created"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL18M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL9W",
        "status": "created"
    }
}

Viewing this author key at http://localhost:8080/authors/OL15A shows that the strong identifiers were imported correctly:

[screenshot: author page with imported identifiers]

Editing the author verifies this as well:
[screenshot: author edit form showing the same identifiers]

I then created a test book record whose author uses the same VIAF but has a missing name:

{
    "title": "Equitation test",
    "source_records": [
        "wikisource:en:Equitation_test"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation_test"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation_test",
    "publish_date": "1922",
    "authors": [
        {
            "name": "viaf test",
            "remote_ids": {
                "viaf": "305913238"
            }
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}

The response shows that the author was successfully matched to an existing one by VIAF:

{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "matched"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL19M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL10W",
        "status": "created"
    }
}

This also works for OL IDs, which use a slightly different fetch query than the other strong identifiers do:

{
    "title": "Equitation test 2",
    "source_records": [
        "wikisource:en:Equitation_test_2"
    ],
    "identifiers": {
        "wikisource": [
            "en:Equitation_test_2"
        ]
    },
    "languages": [
        "eng"
    ],
    "ia_id": "wikisource:en:Equitation_test_2",
    "publish_date": "1922",
    "authors": [
        {
            "name": "author ol_id test",
            "key": "/authors/OL15A"
        }
    ],
    "publishers": [
        "Wikisource"
    ]
}
{
    "authors": [
        {
            "key": "/authors/OL15A",
            "name": "Henry L. de Bussigny",
            "status": "matched"
        }
    ],
    "success": true,
    "edition": {
        "key": "/books/OL23M",
        "status": "created"
    },
    "work": {
        "key": "/works/OL14W",
        "status": "created"
    }
}

Importing optional cover images also works:
[screenshot: imported cover image]

I added support for all strong identifiers found in identifiers.yml, except for Inventaire, which I couldn't find in Wikidata:
[screenshot: supported identifiers]

Screenshot

Stakeholders

@codecov-commenter commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 17.12%. Comparing base (347bff9) to head (41cc424).
Report is 91 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #10092   +/-   ##
=======================================
  Coverage   17.12%   17.12%           
=======================================
  Files          89       89           
  Lines        4752     4752           
  Branches      831      831           
=======================================
  Hits          814      814           
  Misses       3428     3428           
  Partials      510      510           


@Freso (Contributor) left a comment:

Seems (mostly) good to me, but I did skim over some parts. Maybe I’ll look again later. 🙈

Comment on lines +174 to +176
def test_second_match_strong_identifier(self, mock_site):
"""
Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
Contributor:

Tom Morris says that Goodreads and Amazon IDs aren’t “strong” identifiers 🙃 Maybe just use “identifier”? Or “remote identifier”? (Paralleling the remote_ids property.)

Suggested change:
-    def test_second_match_strong_identifier(self, mock_site):
-        """
-        Next highest priority match is any other strong identifier, such as VIAF, Goodreads ID, Amazon ID, etc.
+    def test_second_match_remote_identifier(self, mock_site):
+        """
+        Next highest priority match is any other remote identifier, such as VIAF, Goodreads ID, Amazon ID, etc.

@@ -217,6 +218,10 @@ def _get_d(self):
"l": self._get_lists_cached(),
}

def get_key_numeric(self):
"""Returns just the numeric part of the key."""
return int(re.search(r'\d+', self.key))
@Freso (Contributor) commented Nov 27, 2024:

Given that we know our own identifiers, we also know that they always start with OL and end with a single letter (currently A, W, or E). This means that, instead of doing an expensive regular expression, we could probably just do some .strip()’ing – or even just slice the string:

Suggested change:
-    return int(re.search(r'\d+', self.key))
+    return int(self.key[2:-1])

But also, what does extract_numeric_id_from_olid from openlibrary.utils do? Just based on the name, it seems eerily similar to this method, but I didn’t look it up. :)
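For what it's worth, both variants depend on the exact shape of self.key. A minimal sketch assuming keys look like "/authors/OL15A": the regex form needs .group() before int(), and the slice form needs the type prefix stripped first.

```python
import re

key = "/authors/OL15A"

# Regex approach: re.search() returns a Match object, so .group() is
# needed before converting to int.
numeric = int(re.search(r"\d+", key).group())

# Slice approach: works on the bare OLID once the "/authors/" prefix is
# removed; "OL15A"[2:-1] == "15".
olid = key.split("/")[-1]
numeric_sliced = int(olid[2:-1])
```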

Contributor Author:

It's used here to choose the lower of two OL IDs when identifiers are otherwise the same: https://github.com/internetarchive/openlibrary/pull/10092/files#diff-8a66754753640315d80bf708c3483a13439e24736f081161b22e2d59cee76314R183

I haven't yet tested whether this works with just a string; that was going to be part of the unit tests I add when I mark this PR as ready for review.

Contributor:

But why this method instead of openlibrary.utils.extract_numeric_id_from_olid() for this? What does this method do that the existing function doesn’t do?

Contributor Author:

probably nothing, I've just never seen that function before

@@ -29,6 +31,25 @@
}
]

# The keys in this dict need to match their corresponding names in openlibrary/plugins/openlibrary/config/author/identifiers.yml
# Ordered by what I assume is most (viaf) to least (amazon/youtube) reliable for author matching
Contributor:

Unless we plan to do anything with the ordering, it’d probably be better to just do something like alphabetic ordering so it’s easy to add new identifiers to this.

Also, an alternative approach could maybe be to add a wikidata item to identifiers.yml which could be read here? Otherwise this approach means that there are more places to edit when adding/editing identifiers (e.g., #9982 (pending) and #10052 (merged and live on prod, but the identifier is not included here)). This would also mean that we wouldn’t need to maintain and handle separate REMOTE_IDS lists for authors, editions, and works (e.g., musicbrainz and bookbrainz have different Wikidata properties depending on whether it’s an Author, Edition, or Work, which can’t be handled with this current structure).

Comment on lines +30 to +32
password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
if password.endswith('\n'):
password = password[:-1]
Contributor:

Suggested change:
-    password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
-    if password.endswith('\n'):
-        password = password[:-1]
+    with open(os.path.expanduser('~/.openlibrary_db_password')) as pwfile:
+        password = pwfile.read().strip('\n')

password = open(os.path.expanduser('~/.openlibrary_db_password')).read()
if password.endswith('\n'):
password = password[:-1]
except:
Contributor:

A bare except is trouble. What kind of exceptions do you expect to run into here? E.g., if a user hits Ctrl-C at this specific point, it probably shouldn’t be ignored.

@pidgezero-one (Contributor Author) commented Nov 27, 2024:

no idea, I stole this part from another script for the sake of seeing if the thing would work

Comment on lines +142 to +144
res = self._get_statement_values("P648")
if len(res) > 0:
return res[0]
Contributor:

You’re using re.fullmatch(r"^OL\d+A$", ol_id) on the return value of this method in both of the places it is used. Maybe it would make sense to add that check in here directly?

Also, if more than one OL ID is returned, it should probably return the lowest (get_key_numeric()). (Or the lowest of the highest available WD ranking (i.e., preferred > normal > deprecated), but this is not possible currently. It might also be worth looking into making a separate PR to make _get_statement_values() not include deprecated values – or maybe even only “preferred” values (if available). But that’s out of the scope of this PR.)
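The "return the lowest OL ID" idea could look something like this. A sketch under the assumption that P648 statement values are strings like "OL123A"; this is not the PR's actual code.

```python
import re

def pick_lowest_olid(values):
    """From several candidate P648 values, keep only well-formed author
    OLIDs and return the one with the lowest numeric part."""
    olids = [v for v in values if re.fullmatch(r"OL\d+A", v)]
    return min(olids, key=lambda v: int(v[2:-1])) if olids else None
```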

Comment on lines +38 to +39
imdb: /^\w{2}\d+$/,
opac_sbn: /^\D{2}[A-Z0-3]V\d{6}$/,
Contributor:

Why is this part of this PR?

things = find_author(author)
if author.get('entity_type', 'person') != 'person':
return things[0] if things else None
things = get_redirected_authors([web.ctx.site.get(k) for k in reply])
Contributor Author:

note to self: break here

name = author["name"].replace("*", r"\*")
queries = [
{"type": "/type/author", "name~": name},
Contributor Author:

note to self: remove this

requirements.txt Outdated
@@ -30,3 +32,4 @@ sentry-sdk==2.8.0
simplejson==3.19.1
statsd==4.0.1
validate_email==1.3
wikitextparser==0.56.1
Contributor Author:

note to self: put these back in the other PR

@@ -0,0 +1,54 @@
"""
Copies all author identifiers from the author's stored Wikidata info into their remote_ids.
Contributor Author:

question for later: should this also scrape wikidata for authors that have an OL ID on their side but we don't have their wikidata json on our side? not sure if any of these actually exist

Contributor:

Pretty sure they exist, but I’d suggest leaving that out of this one, and consider it for a later PR. Better to keep commits/PRs as atomic as possible. :)

return authors

# Look for OL ID first.
if key := author.get("key"):
Collaborator:

This seems like a new feature to use author identifiers for author matching that is being mixed in with other work rather than being put forward as a clear stand alone pre-requisite.

#10029 feels like a planning step, without actually making the feature request because it is presented as a question. #9927 covers the first step, but it focuses on importing identifiers in the first place, which we mostly do.

The desired feature appears to be "Use author identifiers on import to determine existing author record matches, (not just Name and date, which is the current method)"

This would involve extending the import schema to support populating the Author.remote_ids values, which does seem to be missing at the moment, although the UI allows them to be edited after import.

I'm flagging this line in particular because I don't think the author OLID value should be called key in the import schema. At this point it is just another identifier sourced from somewhere other than Open Library. It should be something like identifiers: {openlibrary: OL1234A}

Being clear about which identifiers are suitable for this purpose up front would be good too. I agree that Amazon ids aren't very 'strong' for example. VIAF, ISNI, and Wikidata are the core ones I believe have decent coverage in OL and are likely to be useful right now.

Obviously, when a general mechanism is in place, the list can be extended as new usecases are brought forward.

There was quite a bit of reading between the lines to figure out what the core new feature is here. Changing the import schema significantly in code without a clear driving feature or design had me worried.

Contributor Author:

I can move the importing piece out into a separate PR. That code is here because I'd already done a proof of concept for it, but I was wary about trying to push it through when the remote_ids we'd be matching to aren't filled out as much as they could be. So backfilling that info for existing authors and proactively filling it out for future WD fetches going forward would ideally be a prerequisite to incorporating that import change.

I should probably hold off on further work on this at least until an agreement is reached about which identifiers should be used, @Freso mentioned in the issue thread that "library identifiers are, in my experience, often conflated and/or lacking a lot of entries, like OCLC/VIAF/ISNI are ripe with both duplicates and conflated entities and also don’t have information on a lot of items (either reliable/useful information, or just straight no information at all). In my experience, identifiers that are community maintained/curated (like MBIDs, BBIDs, WD ids) are far more reliable, but all datasets—community or institution managed—has its holes/gaps."

Contributor:

@hornc please see also my mid-October comments on the related PR #9674

The whole issue of author matching and strong identifier usage is much too important to be hidden in a PR about WikiSource importing.

@hornc should be involved and the main use case of MARC import of records containing VIAF, LCCN, etc identifiers should, in my opinion, be implemented and debugged first before addressing obscure use cases like WikiSource.

Contributor:

The desired feature appears to be "Use author identifiers on import to determine existing author record matches, (not just Name and date, which is the current method)"

#9411
#9448

Collaborator:

Thank you for those issue links @Freso , individual PRs that address #9411 and #9448 would be much easier to review and understand. I don't think either are really a pre-requisite to the Wikisource import feature, but #9411 and #9448 are clear and self contained, and seem reasonable to do.

Contributor:

#7724 from a year and a half ago is also relevant. Since MARC records have the highest volume of identifiers, I'd suggest starting with that.

@hornc (Collaborator) commented Nov 28, 2024

@pidgezero-one general comments based on the examples in the description:

  1. I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that is because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way.

  2. I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata is pointing to e.g. a book printed in 1922.

@pidgezero-one (Contributor Author):

@hornc

I don't recognise the ia_id field or value format; it's not a field in either the import or edition schema. The archive.org identifier should be included; it is stored in ocaid in the edition schema. It's not actually listed in the import schema, and I think that is because archive.org items are generally imported directly from the item, so ocaid is only ever populated that way.

If I'm understanding correctly, this means that I just don't need to include that in the import record if it's coming from somewhere that isn't IA?

I don't think Wikisource should be listed as the publisher (I think the code is adding this by default), especially when the rest of the metadata is pointing to e,g, a book printed in 1922.

I added this on a recommendation for books that have no publisher info returned from WD or WS, since publisher is a required field. Is there a better default that could be used instead?
