Some gutenberg books have not been updated for a while and have bad names #841

Open
B-root74 opened this issue Feb 24, 2024 · 8 comments
Labels
Upstream For tickets which are waiting for an upstream modification (typically scrapper or target website)

Comments

@B-root74

Some Gutenberg books have not been updated for a while (since sometime in 2022) and have bad names (with a period at the end).

gutenberg_ale_all_2022-08
gutenberg_ang_all_2022-08
gutenberg_bgs_all_2022-08
gutenberg_brx_all_2022-07
gutenberg_csb_all_2022-07
gutenberg_grc_all_2022-07
gutenberg_kha_all_2022-05
gutenberg_kld_all_2022-08
gutenberg_ko_all_2022-08
gutenberg_nai_all_2022-08
gutenberg_nav_all_2022-05

It looks like these languages are no longer available at Project Gutenberg, so I think these files can safely be deleted.
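As an illustration only, stale files like the ones listed above could be flagged by parsing the date suffix of each name. This is a minimal sketch, assuming names follow the `gutenberg_<lang>_all_<YYYY-MM>` pattern seen in the list; the function name `stale_zims`, the cutoff date, and the `gutenberg_fr_all_2024-02` entry are hypothetical and not part of any openZIM tool.

```python
import re
from datetime import date

# Assumed naming pattern, inferred from the list above:
# gutenberg_<lang>_all_<YYYY-MM>
NAME_RE = re.compile(
    r"^gutenberg_(?P<lang>[a-z]+)_all_(?P<year>\d{4})-(?P<month>\d{2})$"
)


def stale_zims(names, cutoff):
    """Return (lang, month) pairs for names dated before `cutoff`."""
    stale = []
    for name in names:
        m = NAME_RE.match(name.strip())
        if m is None:
            continue  # skip anything that does not match the assumed pattern
        month = date(int(m["year"]), int(m["month"]), 1)
        if month < cutoff:
            stale.append((m["lang"], month))
    return stale


names = [
    "gutenberg_kha_all_2022-05",
    "gutenberg_ko_all_2022-08",
    "gutenberg_fr_all_2024-02",  # hypothetical recent file, for contrast
]
for lang, month in stale_zims(names, cutoff=date(2023, 1, 1)):
    print(lang, month.strftime("%Y-%m"))
```

A check like this only shows which files are old; it does not say whether the language still exists upstream, which has to be verified separately.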

@eshellman
Collaborator

These are the single book languages in PG. They're still there.
Just checking two of these:
https://www.gutenberg.org/browse/languages/kha
https://www.gutenberg.org/browse/languages/ko

@B-root74
Author

Oh great, thank you!

Then it is just a matter of changing the ZIM metadata? Why are these ZIMs not updated by openZIM while all the others are?

@Popolechien added the "Remove" label (Asking for the removal of zim files from the download library) on Feb 25, 2024
@eshellman
Collaborator

maybe a zero-index bug?

@benoit74
Contributor

@Popolechien why do you want to remove these files? Did we make a decision in the past to not publish them anymore?

If @eshellman is right (and I know he probably is right), I see no reason to not publish these ZIMs, we "just" have to fix the scraper

Is it correct to say that all these books are referenced as having multiple languages? I'm quite inclined to believe this might be the issue; I don't think Gutenberg is capable of supporting multiple languages per book nicely.

@Popolechien
Collaborator

@benoit74 I just saw the word "delete" and mindlessly hit the 'assign' button 😁

More seriously, the question stands as to why these are not being updated: the last run occurred 11 hours ago and the latest ko / kha files are still from May 2022.

@benoit74
Contributor

If every time a random contributor (no offense @B-root74) states that something has to be deleted we delete it, we might run into trouble 🤪

Anyway, we are all aligned, there is probably an issue in the scraper. openzim/gutenberg#218

@benoit74 added the "Upstream" label (For tickets which are waiting for an upstream modification (typically scrapper or target website)) and removed the "Remove" label (Asking for the removal of zim files from the download library) on Feb 26, 2024
@benoit74
Contributor

Oh, but the Remove label only means that someone is requesting to delete a file, not that we have agreed to delete it? I find it a bit misleading (at least I was misled); I would rename it "Removal request".

@Popolechien
Collaborator

Never mind, it's on me. That should teach me not to check new issues on a Sunday ^^
