This repository has been archived by the owner on Aug 14, 2021. It is now read-only.

Question: can the library tell me if the website is a good candidate for extraction? #28

Closed
NinoSkopac opened this issue Nov 16, 2017 · 2 comments

Comments

@NinoSkopac
Contributor

Howdy,

I've been testing both your library and the original Node one, and I've noticed that for some websites (e.g. workplace.stackexchange.com) the results are incorrect.

https://workplace.stackexchange.com/questions/102524/one-of-my-subordinates-child-passed-away-how-can-i-inform-my-team => skips the original post
https://workplace.stackexchange.com/questions/102692/how-to-deal-with-flaws-in-tests-of-potential-employers => skips everything

Is there a way to ask the library "hey, how well do you think you did at extracting the content for this URL?" or, more plainly, "how confident are you that the content you extracted is relevant?" Something like an overall score?

That way, I could fall back to my internal text extraction algorithm.

@andreskrey
Owner

Well, technically Readability checks for content before returning the output, but to do that we set a "word threshold" and then count the number of characters in the result text. This is currently set at 500 characters, so every result above that number "technically" has content.
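
To illustrate the idea only (in Python for brevity, not the library's actual code), a length check like the one described could be sketched as follows. The names `has_enough_content` and `extract_with_fallback` are hypothetical, and the whitespace normalization is an assumption; only the 500-character figure comes from the description above.

```python
import re

CHAR_THRESHOLD = 500  # figure quoted above; the constant name is hypothetical


def has_enough_content(text: str, threshold: int = CHAR_THRESHOLD) -> bool:
    """Return True when the extracted text is longer than the threshold."""
    # Assumption: collapse whitespace runs so padding does not inflate the count.
    normalized = re.sub(r"\s+", " ", text).strip()
    return len(normalized) > threshold


def extract_with_fallback(html, primary, fallback, threshold=CHAR_THRESHOLD):
    """Use `primary` unless its output looks too short, then use `fallback`.

    `primary` and `fallback` are caller-supplied functions (html -> text).
    """
    text = primary(html)
    if text and has_enough_content(text, threshold):
        return text
    return fallback(html)
```

A check like this only says that something long enough was extracted. It says nothing about whether that text is the right part of the page.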

The JS version has an isProbablyReaderable function that determines whether the text is parseable before going through the full algorithm, but I haven't ported (or analyzed) that section yet.

The first link you sent me is clearly a bug, but the second one works OK for me (using the Readability button in Firefox 57).

You could extract the scoring of the top node or maybe hijack the topCandidates array and inspect the text there to do your own analysis and decide if it's relevant to you.

Currently there's no way to get that information outside the HTMLParser class, so you will have to hack your code around it.
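
As a rough sketch of what "your own analysis" might look like once you have pulled the candidate scores and text out somehow (Python for brevity; none of this is part of the library): the `Candidate` type, both function names, the "share of total score" heuristic and the default cutoffs are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical container for data pulled out of the top candidates.
    score: float  # content score assigned by the extractor
    text: str     # text content of the candidate node


def top_candidate_share(candidates):
    """Return the best score's share of all positive scores (0.0 to 1.0)."""
    positive = sorted((c.score for c in candidates if c.score > 0), reverse=True)
    if not positive:
        return 0.0
    return positive[0] / sum(positive)


def looks_relevant(candidates, min_share=0.5, min_chars=500):
    """Assumed heuristic: one candidate clearly dominates and has enough text."""
    if not candidates:
        return False
    best = max(candidates, key=lambda c: c.score)
    return (top_candidate_share(candidates) >= min_share
            and len(best.text.strip()) >= min_chars)
```

If `looks_relevant` returns False, that would be the point to fall back to your own extraction. The cutoffs would need tuning against pages you care about.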

@NinoSkopac
Contributor Author

I found some good info here: luin/readability#78
