Question: can the library tell me if the website is a good candidate for extraction? #28

NinoSkopac · 2017-11-16T09:03:18Z

Howdy,

I've been testing both your library and the original Node library, and I've noticed that for some websites (e.g. workplace.stackexchange.com) the results are incorrect.

https://workplace.stackexchange.com/questions/102524/one-of-my-subordinates-child-passed-away-how-can-i-inform-my-team => skips the original post
https://workplace.stackexchange.com/questions/102692/how-to-deal-with-flaws-in-tests-of-potential-employers => skips everything

Is there a way to ask the library "hey, how well do you think you did at extracting the content for this URL?" or, more plainly, "how confident are you that the content you extracted is relevant?" Something like an overall score?

That way, I could fall back to my internal text extraction algorithm.

andreskrey · 2017-11-16T10:47:39Z

Well, technically Readability checks for content before returning the output: we set a threshold and then count the number of characters in the result text. The threshold is currently set at 500 characters, so every result above that number "technically" has content.
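To make the threshold idea concrete, here is a minimal sketch of the same kind of check done on the caller's side, with a fallback to another extractor when the result is too short. It is not the library's code: `visible_text`, `has_enough_content`, `extract_with_fallback`, and the `primary`/`fallback` callables are all hypothetical names, and only the 500-character figure comes from the comment above.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse whitespace runs so indentation does not inflate the count.
    return " ".join("".join(collector.chunks).split())


def has_enough_content(extracted_html, threshold=500):
    """True when the extracted text is longer than `threshold` characters."""
    return len(visible_text(extracted_html)) > threshold


def extract_with_fallback(html, primary, fallback, threshold=500):
    """Try `primary` (hypothetical extractor); use `fallback` if it falls short."""
    result = primary(html)
    if result and has_enough_content(result, threshold):
        return result
    return fallback(html)
```

Note that a length check only says the result is non-trivial, not that it is the right part of the page, so it would not catch a case like the first link, where the answers are extracted but the original post is skipped.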

The JS version has an isProbablyReaderable function that determines whether the text is parseable before going through the full algorithm, but I haven't ported (or analyzed) that section yet.

The first link you sent me is clearly a bug, but the second one works OK for me (using the Reader View button in Firefox 57).

You could extract the scoring of the top node or maybe hijack the topCandidates array and inspect the text there to do your own analysis and decide if it's relevant to you.

Currently there's no way to get that information outside the HTMLParser class so you will have to hack your code around it.
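If you do manage to pull the candidate scores and text out, one possible shape for "your own analysis" is sketched below. Everything here is an assumption for illustration: the `Candidate` type, the `confidence` function, and the heuristic itself (how much the top candidate dominates the others, scaled by how much text it holds) are made up and are not part of either library.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for one entry of the topCandidates array."""
    score: float
    text: str


def confidence(candidates, min_chars=500):
    """Return a rough 0.0-1.0 confidence for the best candidate.

    Assumed heuristic: the share of the positive score mass held by the
    top candidate, multiplied by how close its text is to `min_chars`.
    """
    if not candidates:
        return 0.0
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    top = ranked[0]
    if top.score <= 0:
        return 0.0
    total = sum(c.score for c in ranked if c.score > 0)
    dominance = top.score / total
    length_factor = min(len(top.text) / min_chars, 1.0)
    return dominance * length_factor
```

A caller could then fall back to its own extractor whenever the value drops below some cutoff chosen by experiment on known-good and known-bad pages.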

NinoSkopac · 2017-11-16T11:00:57Z

I found some good info here: luin/readability#78

andreskrey closed this as completed Nov 16, 2017
