Howdy,

I've been testing both your library and the original Node library, and I've noticed that for some websites (e.g. workplace.stackexchange.com) the results are incorrect:

https://workplace.stackexchange.com/questions/102524/one-of-my-subordinates-child-passed-away-how-can-i-inform-my-team => skips the original post
https://workplace.stackexchange.com/questions/102692/how-to-deal-with-flaws-in-tests-of-potential-employers => skips everything

Is there a way to ask the library, "how well do you think you did at extracting the content for this URL?" or, more plainly, "how confident are you that the content you extracted is relevant?" Something like an overall score?

That way, I could fall back to my internal text extraction algorithm.
Well, technically Readability checks for content before returning the output, but to do that we set a word threshold and count the number of characters in the result text. The threshold is currently set at 500 characters, so every result above that number "technically" has content.
The JS version has an isProbablyReaderable function that determines whether a document is worth parsing before running the full algorithm, but I haven't ported (or analyzed) that section yet.
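For reference, here is a minimal sketch of how that pre-check is used in the JS version, assuming the @mozilla/readability npm package (which exposes isProbablyReaderable alongside Readability) together with jsdom; the function name is real, the wrapper around it is illustrative:

```ts
import { isProbablyReaderable, Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

// Parse only when the cheap pre-check says the page is likely readerable.
function extractIfReaderable(html: string, url: string) {
  const doc = new JSDOM(html, { url }).window.document;

  // isProbablyReaderable inspects candidate nodes and their text length
  // without mutating the document, so it is safe to run before parsing.
  if (!isProbablyReaderable(doc)) return null;

  return new Readability(doc).parse();
}
```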
The first link you sent me is clearly a bug, but the second one works fine for me (using the Reader View button in Firefox 57).
You could extract the score of the top node, or maybe hijack the topCandidates array and inspect the text there to do your own analysis and decide whether it's relevant to you.
Currently there's no way to get that information from outside the HTMLParser class, so you would have to hack around it in your own code.
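Until that information is exposed, one stopgap on the caller's side is to treat the length of the extracted text as a crude confidence signal, mirroring the library's own 500-character check, and fall back below the threshold. A sketch of that pattern follows; extractWithLibrary and myOwnExtractor are hypothetical placeholders, not part of this library's API:

```ts
// Hypothetical entry points: substitute your real library call and your
// internal extraction algorithm here.
declare function extractWithLibrary(html: string, url: string): { textContent: string } | null;
declare function myOwnExtractor(html: string, url: string): string;

// Mirrors the library's internal 500-character content check.
const MIN_CONFIDENT_LENGTH = 500;

function extractArticleText(html: string, url: string): string {
  const result = extractWithLibrary(html, url);
  const text = result?.textContent?.trim() ?? "";

  // Treat short or empty output as low confidence and fall back.
  return text.length >= MIN_CONFIDENT_LENGTH ? text : myOwnExtractor(html, url);
}
```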