Use libzim `IndexData::getContent` to provide curated content to index. #1810

mgautierfr · 2023-03-14T10:49:16Z

libzim lets scrapers provide, for indexing, content that differs from the content actually stored.

This allows better indexing when much of a page's content is not relevant to the subject of the page itself.

mwoffliner should parse the HTML content and extract only the relevant information (that is, remove things such as menus, footers, examples, ...).

See comments in openzim/libzim#653

kelson42 · 2023-03-14T17:52:06Z

@mgautierfr Can you please explain:

- the problem to fix, with at least one concrete example
- how your proposal would fix it

... because I don't get it.

Jaifroid · 2023-03-14T18:43:50Z

See also #1725, which looks similar. Note comments there.

mgautierfr · 2023-03-15T09:29:49Z

The idea is that we want to index content that differs from what we are storing.
Some content doesn't need to be indexed. Other content cannot be indexed as is (a video, for example), and we want to provide a textual description (from subtitles?) so it can be indexed anyway.

The problem is less visible than I expected in mwoffliner, since we use the mobile version, which doesn't include all the menus and sidebars.

But I have found this one: https://library.kiwix.org/viewer#search?books.name=wikipedia_en_physics_maxi_2023-02&pattern=gazette

The results are not related to "gazette". But as their references come from gazettes, the articles seem relevant to Xapian.

mwoffliner may decide to remove all references from the indexed content (while keeping them in the content itself).
We could also decide to index only the beginning of the article (equivalent to indexing the nodet flavor), as the beginning is more likely to describe the subject of the article, while the rest may add "false positives" by going into more detail and drawing parallels with other subjects.
Or we could get the source of the article and index that while storing the rendered (HTML) content.
Or mwoffliner may make a specific request (if one exists) to MediaWiki to get keywords/curated content from the search engine itself and index that.
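
To make the first two options concrete, here is a minimal, language-agnostic sketch in Python (not mwoffliner code). It extracts the text to index from well-formed HTML, skips menus, footers and reference lists, and can optionally keep only the first words of the article. The names `IndexTextExtractor` and `text_for_index`, the default `max_words` behaviour, and the skipped tag and class names (`reflist`, `references`) are assumptions for illustration only. The real markup produced by MediaWiki would have to be checked.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndexTextExtractor(HTMLParser):
    """Collect visible text, skipping subtrees that should not be indexed.

    Assumes well-formed HTML (every non-void element is explicitly closed).
    """

    def __init__(self,
                 skip_tags=("nav", "footer", "script", "style"),
                 skip_classes=("reflist", "references")):  # hypothetical class names
        super().__init__(convert_charrefs=True)
        self.skip_tags = set(skip_tags)
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Already inside a skipped subtree: just track nesting.
            self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if tag in self.skip_tags or classes & self.skip_classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def text_for_index(html, max_words=None):
    """Return the text to index; the stored HTML itself is left untouched."""
    parser = IndexTextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).split()
    if max_words is not None:
        words = words[:max_words]  # keep only the beginning of the article
    return " ".join(words)


page = (
    "<nav>Main menu</nav>"
    "<p>Quantum tunnelling is a phenomenon in physics.</p>"
    '<ol class="references"><li>London <i>Gazette</i>, 1901</li></ol>'
    "<footer>License</footer>"
)
print(text_for_index(page))               # Quantum tunnelling is a phenomenon in physics.
print(text_for_index(page, max_words=3))  # Quantum tunnelling is
```

With something like this, "gazette" no longer appears in the indexed text, while the stored article keeps its references. The resulting string is what the scraper would hand to libzim as the content to index.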

kelson42 · 2023-03-17T05:43:46Z

@mgautierfr I perfectly agree; I just see no direct relation to openzim/libzim#653. Depends on #1576

kelson42 · 2023-03-17T05:45:59Z

@Jaifroid Thank you for the reminder about #1725, it had actually dropped off my radar. This issue is indeed a duplicate of that one. We agree on the improvement potential and on the approach.

kelson42 closed this as not planned Mar 17, 2023

kelson42 added duplicate enhancement labels Mar 17, 2023

kelson42 self-assigned this Mar 17, 2023
