regression: Missing HTML content #219
Who ... If you search for Same for Moby Dick, 2 versions. Probably the same for others as well. I looked at the ePubs (which are OK) and the content is slightly different, so there is clearly a difference between these books. I looked at https://aleph.pglaf.org/cache/epub/84/ and we see there is an HTML version of book 84, with illustrations. I will have to reproduce locally, but as usual it will take some time to rebuild the local database from the rsync result. |
Ok I’ve tried the first two pages and about 2/3 of the books are missing. It gets better as one goes deeper, but it is the first impression that matters. I’ve tested a half dozen other languages, no problem there but there weren't many (or any) that had several versions of the same book. I have put the recipe on hold on Zimfarm @benoit74 can you please delete |
@Popolechien please make a removal request on zim-requests with the appropriate flag, otherwise we'll lose it and you'll open a ticket in 6 months asking where the gutenberg files are 😉 |
I'm concerned about a general problem here which might lead to pausing all recipes... or do we have a chance to know exactly which ZIMs are impacted? |
You mean, all gutenberg recipes, right? (there is only one btw) @Popolechien tests have shown that it looks like only @Popolechien: you wanna remove both ZIMs we have because you tested both and they both have the issue? Are we sure we want to do this (not provide the Should we run a new (temporary) recipe for |
This seems a pretty serious issue IMHO. If I get everything right, the wise thing would be to deactivate the gutenberg recipe until this issue is closed and a new release is done. |
@rgaudin yeah, my bad, for some reason I thought the ticket was there already. I was not aware of the possibility of running the recipe with PDFs and ePubs only, but that seems acceptable, yes. |
You tested both |
Recipe for only epub+pdf is here: https://farm.openzim.org/recipes/gutenberg_mul_epub-pdf Can you confirm we want this and that you did not spot anything stupid? I've activated the "multiple ZIM" mode: should we discover we have the issue in other languages as well, we will be happy to have ZIMs in all languages. It should take about 1 day to produce, if I trust the last run duration. |
This was done more than an hour ago |
OK, so regarding the "real issue", I have debugged the scraper logic for book 84. Foreword: this scraper logic is a nightmare, I won't dive into details. As you've probably already guessed, there are basically two issues:
The scraper does not care that the HTML version has not been found when it renders the UI
I suspected that this first part could be a regression induced by #163, but I don't think so; at least it seems that the situation was improved by this PR but not fully fixed: before this PR, buttons were always displayed when the book was supposed to have a given format available according to the RDF; with the PR (now), the buttons are hidden if a given format is not requested; we should go further and also hide the button if we do not manage to download the requested format.
The scraper does not manage to find the HTML version of the book
For book 84, the various versions present at https://www.gutenberg.org/files/84/84-h/84-h.htm or at https://www.gutenberg.org/cache/epub/84/pg84-images.html (also reached by redirect from the "magic logic" from @eshellman, which gives https://www.gutenberg.org/ebooks/84.html.images for this book's HTML) are not among the tens of potential URLs considered by the scraper (see code block below).
For book 41445 (which works), the HTML version is found at Full list of potential URLs for 41445 below:
What next
We cannot add the
I'm not very inclined to fix only the fact that the scraper does not care that the HTML version has not been found when it renders the UI, because as far as I've understood, the HTML version is very important for our users (see comments on #161). Fixing only this could help as an interim solution to "at least build a relevant ZIM without buttons leading to nowhere", but I do not recommend this approach, which only puts lipstick on a pig. I think that at this point we need to invest time in seriously simplifying the scraper code to get rid of all the "fallback" mechanisms we have, which are only biting us now. In other words, finally implement what has been imagined and more or less prepared in #97 (I just renamed it, we won't move to the OPDS catalog according to the latest discussions in the issue):
WDYT? |
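A minimal sketch of the button rule proposed in the comment above: a format's button is shown only when the format is declared in the RDF, was requested, and was actually downloaded. The `Book` class, the `formats_with_button` helper and the sample data are hypothetical illustrations, not the scraper's real model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    """Hypothetical stand-in for a book record; not the scraper's real model."""

    book_id: int
    declared_formats: set = field(default_factory=set)    # formats listed in the RDF
    downloaded_formats: set = field(default_factory=set)  # formats actually fetched


def formats_with_button(book, requested_formats):
    """Return the formats whose button should be shown for this book.

    A button is kept only if the format is declared, requested AND downloaded,
    so a failed HTML download no longer leaves a button leading nowhere.
    """
    return sorted(
        fmt
        for fmt in requested_formats
        if fmt in book.declared_formats and fmt in book.downloaded_formats
    )


# Made-up data: HTML is declared for book 84 but its download failed.
book_84 = Book(84, {"html", "epub", "pdf"}, {"epub", "pdf"})
print(formats_with_button(book_84, {"html", "epub"}))  # ['epub']
```

As noted in the comment, this would only hide dead buttons; it would not bring the missing HTML content back.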
LGTM, thanks a lot. Regarding the interim recipe, I've disabled the multiple languages output (we would have duplicate files with almost the same name and for very limited added value, I find this confusing rather than helpful) - let's see this as an English problem and an English fix. I have changed the language settings (and recipe name) accordingly, please double check before launching the recipe. I have also disabled the bookshelves feature; apparently, according to #184, the feature is not maintained by the Gutenberg folks. |
Interim recipe started. Be aware that doing it only for English also means we will not provide the
I don't get what the problem is with the mostly similar name, we already have this situation for Wikipedia with its flavors. Mostly the same name, same title, same description, only the size differs. It is only a UI issue. |
I'm fine with that, its use case always seemed dubious to me in the first place. Regarding the Wikipedia example, that's exactly the problem I had in mind (the question comes up regularly as to why these three and what the difference is, despite all the FAQs, messages, etc.) |
Over the past 2-3 years, a lot of effort has been put into upgrading all 70,000 PG books to validated HTML5 and EPUB3. There are two trees in the file system, the "1/2/3/4/5" tree and the "cache/epub" tree. The generated EPUB3 and HTML5 files are in the "cache/epub" tree. Both of these are in the aleph mirror. I don't remember how we were handling epub, but the generated HTML5 was not yet implemented when this was last implemented. As you might expect, the generated HTML5 is much more uniform in quality compared to the source files, which come in all sorts of htm and txt flavors! |
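A rough sketch of what targeting the generated HTML5 in the "cache/epub" tree could look like, using only the two URL shapes quoted earlier in this thread for book 84. That these shapes hold for every book ID is an assumption, and `generated_html_candidates` is a hypothetical helper, not part of the scraper.

```python
def generated_html_candidates(book_id, base="https://www.gutenberg.org"):
    """Build candidate URLs for the generated HTML5 version of a book.

    Both patterns are generalized from the book 84 URLs quoted in this
    thread; whether they apply to all books is an assumption.
    """
    return [
        # File in the "cache/epub" tree
        f"{base}/cache/epub/{book_id}/pg{book_id}-images.html",
        # "Magic" URL, reported in the thread to redirect to the file above
        f"{base}/ebooks/{book_id}.html.images",
    ]


for url in generated_html_candidates(84):
    print(url)
```

With a single predictable location per book, the list of fallback URLs the scraper currently walks through would no longer be needed, which is the simplification discussed in #97.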
https://farm.openzim.org/recipes/gutenberg_en_epub-pdf did not produce the expected outcome. I forgot again that the HTML format is mandatory (see #161); we can only request to not put epub or pdf in the ZIM ... I've disabled the recipe (we can probably delete it, it is only misleading) and the ZIM (still suffering from the same HTML issue). @kelson42 do you consider this a fast-track issue which needs to be fixed asap (i.e. with more priority than the other projects I have)? |
I now really consider it mandatory to make the changes needed to fix #97 and have a scraper which is faster, easier to maintain, and produces a ZIM with more uniform quality |
@benoit74 How much work do you estimate is needed to bring things back to normal in good and sustainable conditions? |
@kelson42 In man days, 5 to 10 days probably (including PoC, reviews, ...). In elapse ... |
Anything I can help with, let me know.
|
I've added an update to #97 that I hope will help |
Thank you! |
In gutenberg_en_all_2024-02, out of the 10 books listed (all declaring that they offer an HTML version), only three actually have an HTML version.
Either the HTML version is not present in the ZIM or the link is incorrect (it's the same link in the listing and in the preview page). This is not limited to those 10 entries, but it makes this 75GB ZIM look like garbage.
Initially reported by an Offspot user.