Use libzim IndexData::getContent to provide curated content to index.
#1810
Comments
@mgautierfr Can you please explain:
See also #1725, which looks similar. Note comments there.
The idea is that we want to index content that differs from what we store. The problem is less visible than I expected on mwoffliner, as we use the mobile version and it doesn't include all menus and sidebars. But I have found this one: https://library.kiwix.org/viewer#search?books.name=wikipedia_en_physics_maxi_2023-02&pattern=gazette The results are not related to gazette. But as the references come from gazette, the articles seem relevant to Xapian.
@mgautierfr perfectly agree, just that I see no direct relation to openzim/libzim#653. Depends on #1576
libzim provides a way for scrapers to supply content for indexing that differs from the content stored in the ZIM file.
This allows better indexing when much of the stored content is not relevant to the subject of the entry itself.
mwoffliner should parse the HTML content and extract only the relevant information (i.e. remove things such as menus, footers, examples, ...).
See comments in openzim/libzim#653
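To make the request above more concrete, here is a minimal, dependency-free sketch of the "extract only the relevant information" step. Everything in it is an assumption for illustration: `extractIndexableText` and `NOISE_TAGS` are hypothetical names, not part of mwoffliner or libzim, and the choice of tags to drop is a guess. The string it returns is what a scraper could hand back as the index content (the value libzim reads through `IndexData::getContent`); how that is wired through the Node bindings is not shown here.

```typescript
// Hypothetical helper, not part of mwoffliner or libzim.
// Elements whose whole subtree is treated as noise for indexing (an assumption).
const NOISE_TAGS: string[] = ["script", "style", "nav", "header", "footer", "aside"];

function extractIndexableText(html: string): string {
  // Drop HTML comments first.
  let out = html.replace(/<!--[\s\S]*?-->/g, " ");

  // Drop each noisy element together with its content.
  // Limitation: the non-greedy match stops at the first closing tag,
  // so nested elements of the same name are not handled correctly.
  for (const tag of NOISE_TAGS) {
    const re = new RegExp(`<${tag}(\\s[^>]*)?>[\\s\\S]*?</${tag}\\s*>`, "gi");
    out = out.replace(re, " ");
  }

  // Strip the remaining tags, keeping their text content.
  out = out.replace(/<[^>]+>/g, " ");

  // Decode a handful of common entities ("&amp;" last to avoid double decoding).
  out = out
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");

  // Collapse whitespace so the indexer receives clean plain text.
  return out.replace(/\s+/g, " ").trim();
}

const sampleHtml =
  '<nav><a href="/">Home</a></nav>' +
  "<h1>Gazette</h1><p>A gazette is an official journal &amp; newspaper.</p>" +
  "<footer>Privacy policy</footer>";

const indexText = extractIndexableText(sampleHtml);
console.log(indexText); // prints "Gazette A gazette is an official journal & newspaper."
```

A regex pass like this only works for tag-level noise. Removing things identified by class or position (reference lists, infobox chrome, examples) would need a real DOM parser, which is probably the more realistic route for mwoffliner.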