Use libzim IndexData::getContent to provide curated content to index. #282

mgautierfr · 2023-03-14T10:50:52Z

libzim provides a way for scrapers to supply content for indexing that differs from the content actually stored.

This allows better indexing when much of the stored content is not relevant to the subject of the entry itself.

mwoffliner should parse the HTML content and extract only the relevant information (that is, remove things such as menus, footers, user information, links to other questions...)

See comments in openzim/libzim#653
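
As a rough illustration of the extraction described above, here is a minimal sketch using only the Python standard library. The set of skipped elements and the names `SKIPPED_TAGS`, `IndexTextExtractor` and `extract_index_text` are assumptions made for this example, not part of libzim or of any scraper; the returned string is the kind of value a scraper could hand back from its `IndexData::getContent` implementation.

```python
from html.parser import HTMLParser

# Hypothetical list of elements considered irrelevant for indexing;
# a real scraper would tune this to its own templates.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class IndexTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_index_text(html):
    """Return the text of `html` with menus, footers, etc. removed."""
    parser = IndexTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_index_text` on a page whose menu and footer are wrapped in `<nav>` and `<footer>` would return only the text of the remaining elements. Filtering by tag name is the simplest possible rule; a scraper whose templates mark irrelevant blocks with classes or ids would need a different selection strategy.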

stale · 2023-05-26T17:26:04Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 · 2023-07-16T07:55:22Z

@rgaudin @benoit74 Would this approach bring a real improvement here? If yes, which one?

rgaudin · 2023-07-21T14:53:18Z

The improvement would be marginal, I think, because we don't include much non-content text in the HTML.
A side effect would be parsing all our output using an in-scraper HTML parser versus letting libzim do it.

stale bot added the stale label May 26, 2023

kelson42 added the question label Jul 16, 2023

stale bot removed the stale label Jul 16, 2023

kelson42 added the enhancement label Jul 16, 2023
