Introducing popularity/importance factor at indexing time #653
Comments
Thanks for this exploratory thread. A Short-Term Suggestion for "2022" : If the child searches for "apple", how about showing them the article they actually searched for? https://en.wikipedia.org/wiki/apple Or...the (identical after redirect) article: https://en.wikipedia.org/wiki/Apple Instead of accidentally/prominently advertising ~10 different Apple(TM) products to the young child! RECAP: Consider using the search string itself — to help populate the search dropdown — when an article exists with that very same title? |
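The exact-title idea above can be sketched in a few lines. This is only an illustration, not how libzim or any reader builds its suggestion list: `order_suggestions` and the sample titles are hypothetical, and the comparison is assumed to be case-insensitive because the two Wikipedia URLs resolve to the same article.

```python
def order_suggestions(query, candidate_titles):
    """Return candidates with exact (case-insensitive) title matches first.

    Hypothetical helper: the candidate list is assumed to come from an
    existing suggestion engine, already ordered by that engine.
    """
    wanted = query.strip().casefold()
    # sorted() is stable, so the engine's order is kept inside each group:
    # exact matches (key False) come before everything else (key True).
    return sorted(candidate_titles,
                  key=lambda title: title.strip().casefold() != wanted)


suggestions = order_suggestions(
    "apple", ["Apple Inc.", "Apple TV", "Apple", "Apple Watch"]
)
print(suggestions)
```

Because the sort is stable, this only promotes the exact match and leaves the rest of the ranking untouched.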
I propose an additional index-time parameter called popularity, which would be a value between 0 and 100 (100 being extremely popular). Open questions:
@rgaudin @mgautierfr @veloman-yunkan It’s architecture time, your feedback is required. |
@kelson42 My feeling is that introducing this feature may create some trouble (at least, initially).
If the popularity value is based on page visit counts, then the popularity of some pages may be proportional to their age rather than reflect the actual interest in the information on that page. For example, some old questions on Stack Overflow are very highly rated, but current interest in them is much lower because those technologies have become outdated. |
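To make the age concern concrete, here is a minimal sketch comparing raw visit counts with visits per day. The function name and all figures are invented for illustration. Dividing by age is just one possible normalisation and is not something proposed in this thread.

```python
def views_per_day(total_views, age_days):
    """Average daily views over the lifetime of a page (hypothetical metric)."""
    if age_days <= 0:
        raise ValueError("age_days must be positive")
    return total_views / age_days


# Invented numbers: an old, heavily visited question and a recent one.
old_question = views_per_day(total_views=1_000_000, age_days=5_000)
recent_question = views_per_day(total_views=90_000, age_days=300)

# The old page wins on the raw count but loses once age is accounted for.
print(old_question, recent_question)
```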
@veloman-yunkan those are valid questions but those are scraper level ones… probably for each scraper. @kelson42 mentioned the WP1 data for mwoffliner. Should this popularity information only feed the indexer or should it create a sorted entry listing as well? |
First of all, I disagree with the
However, adding an "importance" field is OK for me, and it is probably not technically difficult. But the questions raised by @kelson42 are important ones and not easy to answer. If we take the example of |
@rgaudin @mgautierfr I am opposed to the idea of embedding in ZIM files transient/dynamic/target-audience-dependent information like popularity, importance or whatever else one may call it. A first idea is to package that data as an addition/overlay to a ZIM file, so that the same ZIM file can be fine-tuned for different applications or user-bases (e.g. children, teachers, hikers, scientists in Antarctica or on the International Space Station, etc). |
My idea is to add a "usage neutral" importance in the xapian database which would help xapian to "correctly" sort the results. |
I didn't mean changing the ZIM format. My proposal was to have one or more separate files (similar to external subtitles) augmenting the ZIM content with popularity information. |
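As a rough sketch of the overlay idea, popularity could live in a small sidecar document keyed by entry path, with one overlay per audience. Everything here is an assumption made for illustration: the JSON layout, the entry paths, the neutral default of 50 and the function names. None of it is an existing ZIM or libzim convention.

```python
import json


def load_overlay(text):
    """Parse a sidecar document mapping entry paths to a 0-100 popularity.

    Hypothetical format: a flat JSON object. Values are clamped to [0, 100].
    """
    raw = json.loads(text)
    return {path: max(0, min(100, int(value))) for path, value in raw.items()}


def popularity_for(path, overlay, default=50):
    """Look up a path; entries missing from the overlay get a neutral default."""
    return overlay.get(path, default)


# One overlay per audience could sit next to the same, unchanged ZIM file.
children_overlay = load_overlay(
    '{"A/Apple": 95, "A/Apple_Inc.": 20, "A/Apple_TV": 250}'
)
print(popularity_for("A/Apple", children_overlay))
print(popularity_for("A/Pear", children_overlay))
```

Keeping the data outside the archive means the same file can be paired with different overlays, at the cost of having to distribute and match the extra files.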
I would like to avoid discussing whether the feature request is pertinent, because there is no other open ticket proposing a solution to improve suggestion/search pertinence. If someone has a better idea, please open a ticket. I would also like to avoid discussing whether the coefficient we use makes sense. It is the role of the publisher, together with the scraper dev, to decide this. |
There is no other ticket open to propose a solution to improve suggestion/search because no issue has been opened that discusses the problem without assuming a solution. The problem can be described as: some articles may match a search query, but the sorting order is debatable. Proposing a solution (add a popularity value) instead of describing the issue will de facto create a situation where no other ticket proposes something else. But it doesn't mean we have to implement the first solution proposed (and it doesn't mean we must not implement it either).
We can be sure that we will be excluded from this discussion, but we will still have to fix future indexing problems :) In fact, your comment made me take a step back and look into how Wikipedia searches and how the popularity concept works (yes, your comment had the opposite effect to the one intended).
About the search on Apple itself : But if you look at the popularity of the first pages with https://pageviews.wmcloud.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=Apple|Apple_Inc.|Apple_(disambiguation)|Apples_to_Apples|Apple_Mac_OS_X|MacOS
And popularity really depends on the current context. Today Putin has gained a lot of views with the invasion of Ukraine (these are not my words, but the ones in https://en.wikipedia.org/wiki/Wikipedia:Popular_pages#Political_leaders). I'm not sure we want to sort long-lived content using contextual information.
But there is another way we can improve this without adding a new feature. We could also put the article with the exact title as the first result. It would put easily
Another possibility would be to count how many links point to an article (which is closer to how ranking has historically worked).
Today (and tomorrow also), only Wikipedia has the popularity information. It means that the popularity factor will be used for only one website. It will not help the global indexing/search engine and, in fact, it will interfere with it, making debugging really difficult ("Ranking in wikipedia zim not good" -> scraper problem). I'm not saying that we should not do it, but I WANT to discuss it first. |
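The link-counting alternative mentioned above could look roughly like this. It is a plain inbound-link count, not PageRank, and the `outgoing_links` mapping is a hypothetical structure a scraper would have to provide; the page names and links are made up.

```python
from collections import Counter


def inbound_link_counts(outgoing_links):
    """Count, for each target, how many distinct pages link to it.

    `outgoing_links` maps a page title to the titles it links to
    (hypothetical input, e.g. collected by a scraper).
    """
    counts = Counter()
    for source, targets in outgoing_links.items():
        for target in set(targets):  # a page counts once per target
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


links = {
    "Apple Inc.": ["Apple", "MacOS"],
    "MacOS": ["Apple Inc."],
    "Fruit": ["Apple", "Apple"],
    "Apple": ["Fruit", "Apple"],
}
counts = inbound_link_counts(links)
print(counts.most_common())
```

Unlike page views, this signal can be computed from the content itself, so it would be available for every scraper rather than only for Wikipedia.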
This is very very wrong, false for:
... and nobody knows the future |
Some educators (and librarians, and medical professionals) believe viral popularity is the problem — rather than the solution 😄 Either way, these criticisms of "viral popularity" (as demagogic if not mindless mob rule) are just 1 more reminder that Quality/Relevance indicators of {offline search results, offline content, etc} are so urgently needed by all "offline" communities (perhaps our most painfully hard design challenge!?) ✅ |
The problem is that the current indexing (both full-text and suggestion) is based only on word occurrence. Even if this works well, there is no easy way for the index to know that the Wikipedia "Apple" is more important than ".apple".
The idea would be to be able to slightly tweak the current algorithm by supplying an external numerical factor at indexing time. That way we could apply a weighting and effectively give better search results.
In the case of Wikipedia such a number could be for example computed (and is actually already computed) within the WP1 project.
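A minimal sketch of what "slightly tweaking" could mean: the word-occurrence score stays the main signal and the external factor only nudges it. The formula, the `strength` parameter and the sample scores are assumptions made for illustration. This is not the Xapian weighting scheme and not an agreed design.

```python
def boosted_score(text_score, popularity, strength=0.2):
    """Scale a text relevance score by a 0-100 popularity factor.

    With strength=0.2 the most popular entry gains at most 20%, so the
    factor mainly separates near-equal matches (hypothetical formula).
    """
    if not 0 <= popularity <= 100:
        raise ValueError("popularity must be between 0 and 100")
    return text_score * (1.0 + strength * popularity / 100.0)


# (title, text_score, popularity) with invented values.
results = [(".apple", 1.00, 5), ("Apple", 0.95, 90)]
ranked = sorted(results, key=lambda r: boosted_score(r[1], r[2]), reverse=True)
print([title for title, _, _ in ranked])
```

A small `strength` keeps a popular but poorly matching page from overtaking a clearly better textual match.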