Add language to ia metadata #161

Merged
merged 2 commits into from
Apr 19, 2023


hornc
Contributor

@hornc hornc commented Apr 3, 2023

closes #8
and partially #132

I'd like this to use a 3-letter language code rather than the full language name. (Hopefully done now with 42f73de)

Some of the existing code suggests there is a three_letter_code field (https://github.com/LibriVox/librivox-catalog/search?q=three_letter_code), but I'm not completely sure whether it is guaranteed to exist, or where it comes from.

The archive.org fields to target are:

If I can determine a good source -> destination mapping, I can very likely make retrospective updates to the existing archive.org items happen.

I have not tested this code yet (I'm not sure of the best way to do so).

@hornc hornc marked this pull request as ready for review April 3, 2023 07:17
@hornc hornc changed the title Attempt to add language to ia metadata Add language to ia metadata Apr 3, 2023
@notartom
Member

notartom commented Apr 15, 2023

So this looks correct, and I could deploy this on the staging server. @twinkietoes-on, if I deploy this on the staging server, would you be able to try uploading a few test files, to see if the language correctly appears in the metadata of the uploaded items?

Edit: So actually I've just gone ahead and deployed it, you can test at your own time :)

@twinkietoes-on
Collaborator

twinkietoes-on commented Apr 15, 2023 via email

@twinkietoes-on
Collaborator

I got into the test server and uploaded this project (saying it was Spanish): https://archive.org/details/joy_other_poems_2304_librivox
I don't see the language info having been imported.

@hornc
Contributor Author

hornc commented Apr 18, 2023

@twinkietoes-on thank you for testing this! I don't see the language field in the item either, or on the task that was submitted. I'll have a look to see whether there is a problem with the archive.org API payload, specifically whether an x-archive-meta-language: header is expected to be recognized by the API. That's something I can test independently.

@hornc
Contributor Author

hornc commented Apr 18, 2023

https://archive.org/download/joy_other_poems_2304_librivox/joy_other_poems_2304_librivox_meta.xml is where the metadata should appear, and it shows a recent update time: 2023-04-18 16:28:27. Sending x-archive-meta-language: is the correct way to set this metadata through the ias3 API, as documented here: https://archive.org/developers/ias3.html ... so I think the code is correct for this part.

For Spanish I would expect the 3-letter code spa to be sent. If something were wrong, it might get set to the default eng. I'm surprised it's not being set at all, though.
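To make the expected payload concrete, here is a minimal sketch of how the language could end up in an ias3 header. It is not the code from this PR: the function name `build_ia_metadata_headers`, the metadata dict, and the fallback to eng are assumptions for illustration. Only the x-archive-meta-<field> header naming comes from the ias3 documentation linked above.

```python
def build_ia_metadata_headers(metadata, default_language="eng"):
    """Turn a metadata dict into ias3-style x-archive-meta-* headers.

    Hypothetical helper, not taken from librivox-catalog.
    """
    fields = dict(metadata)
    # Assumption: fall back to a default code when no language is available.
    if not fields.get("language"):
        fields["language"] = default_language
    headers = {}
    for key, value in fields.items():
        # ias3 reads item metadata from headers named x-archive-meta-<field>.
        headers[f"x-archive-meta-{key}"] = str(value)
    return headers


# Placeholder metadata for a Spanish-language test upload.
headers = build_ia_metadata_headers({"mediatype": "audio", "language": "spa"})
print(headers["x-archive-meta-language"])
```

If the language lookup failed in a setup like this, the header would still be sent with the fallback value, which is why a completely missing field points elsewhere.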

@notartom
Member

notartom commented Apr 18, 2023

Lemme play around with the code on the staging server, now that I have a project to use for testing, and have figured out where the "upload to archive.org" button is :)

@twinkietoes-on
Collaborator

twinkietoes-on commented Apr 18, 2023 via email

@notartom
Member

OK, that was my bad: when @twinkietoes-on first tested in the staging environment, the wrong branch was checked out. When I attempted it again, I used the same project (https://archive.org/details/joy_other_poems_2304_librivox), and it looks like once an item has been uploaded, further upload attempts do not change its metadata. I then tried again with https://archive.org/details/joy_other_poems_2304_librivox_second and https://archive.org/details/joy_other_poems_2304_librivox_third, and in both cases the language shows up correctly, so I think we're good.
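For anyone repeating this check, the language can also be read from an item's _meta.xml file (the file mentioned earlier in this thread) instead of the details page. The following is only a sketch: `language_values` is a hypothetical helper, and the sample XML is a made-up stand-in that assumes the usual flat layout of fields under a metadata root element, not a copy of a real item.

```python
import xml.etree.ElementTree as ET


def language_values(meta_xml):
    """Return every <language> value found directly under the root element."""
    root = ET.fromstring(meta_xml)
    return [element.text for element in root.findall("language")]


# Made-up sample in the assumed shape of an item's _meta.xml file.
SAMPLE_META_XML = """<metadata>
  <identifier>example_item_librivox</identifier>
  <mediatype>audio</mediatype>
  <language>spa</language>
</metadata>"""

print(language_values(SAMPLE_META_XML))
```

An empty list would correspond to the first test upload, where no language field was set at all.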

@twinkietoes-on can you please delete all those testing projects from archive.org?

@notartom notartom merged commit 641d929 into LibriVox:master Apr 19, 2023
@twinkietoes-on
Collaborator

twinkietoes-on commented Apr 19, 2023 via email

@twinkietoes-on
Collaborator

The German ones are showing up as the 3-letter code, deu.
Example:
https://archive.org/details/reise_um_die_welt_erste_abt_2304_librivox

@hornc
Contributor Author

hornc commented Apr 20, 2023

@twinkietoes-on archive.org has flexible but unfortunately somewhat inconsistent handling of language codes. The preferred 3-letter code is from the MARC language code list, but ISO 639-3 and ISO 639-2 are also generally supported.

German, ger and deu are all valid in the metadata, and the LibriVox audio collection's language filter now finds all of them: https://archive.org/details/librivoxaudio?and[]=languageSorter%3A%22German%22 so that's one use case that has definitely improved.

deu does appear to be left out from the item details page display logic though, which isn't great.

My feeling is that archive.org should handle ISO 639-3 better, rather than LibriVox having to translate the codes. LibriVox seems to be consistent with its ISO 3-letter language codes, which work well. Many other archive.org items have ISO 639-3 codes already.
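For reference, if a translation step on the LibriVox side were ever wanted, it could be as small as the sketch below. This is not part of the merged change: the table `ISO_TO_MARC` and the function `to_marc_code` are hypothetical names, and the table lists only a few of the languages whose MARC code differs from the ISO 639-3 code.

```python
# A few ISO 639-3 codes whose MARC language code differs.
# Illustrative and incomplete; most codes (spa, eng, ...) are identical.
ISO_TO_MARC = {
    "deu": "ger",  # German
    "fra": "fre",  # French
    "nld": "dut",  # Dutch
}


def to_marc_code(code):
    """Map an ISO 639-3 code to its MARC equivalent, leaving others unchanged."""
    normalized = code.strip().lower()
    return ISO_TO_MARC.get(normalized, normalized)


print(to_marc_code("deu"))
print(to_marc_code("spa"))
```

The pass-through default means unknown or already-matching codes are sent unchanged, so a table like this would only need entries for the exceptions.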

Successfully merging this pull request may close these issues.

Include language metadata in Internet Archive uploads
3 participants