# Filtering Web Pages to Get the Good Stuff

## Background & Motivation

I began this recipe with the idea that I could cobble a couple of utilities together to avoid the dreadful shit that too many media outlets dump into their online "news articles". The idea began when I clicked a link in a Tweet that loaded a page from The Telegraph in my browser, and later one from The Daily Mail Online. I was repulsed by the volume of junk and inappropriate advertisements - and I couldn't even see the article, because it was covered with an "overlay" demanding that I "register" before being allowed to view the content.

## My Search for Relief

I copied the URL of one of these blocked stories and fed it to the `curl` utility on my MacBook. curl (or cURL) is an open-source program invoked from the command line: entering the command with a URL causes that URL (a web page in this case) to be downloaded to your computer. It has an option to save the page to a file, so that you can load and read it in a web browser (Firefox, Chrome, Explorer, Safari, etc). I reckoned this might get around the block, and it did! I've learned since then that it doesn't work on all media websites - but on enough of them to make it worth a few seconds of effort. In case you want to try it:

```
curl -o ~/CurldWebPage.html http://www.someplace.com/SomeContent.html
```

The `-o` option specifies the output file in which curl will save the content it downloads from the specified URL.

To lose all the ads and useless junk in the page, I thought I could pipe the curl output through an HTML filter (perhaps Beautiful Soup) to get the content I wanted without all the annoying, revenue-generating (Note 1) shit the publishers load into their pages.
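To give a feel for what such a filter would do, here's a minimal sketch of the idea using only Python's standard-library `html.parser` (standing in for Beautiful Soup, which would do this more robustly). The class name `ArticleTextExtractor`, the function `extract_text`, the set of tags to skip, and the sample HTML are all my own illustrative choices, not anything from a real filtering tool:

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects visible text while skipping junk-bearing tags.

    The SKIP_TAGS set is an illustrative guess at where the
    "revenue-generating shit" tends to live; a real filter would
    use smarter heuristics (as Beautiful Soup or Readability do).
    """

    SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a tag we want to drop
        self.chunks = []      # accumulated text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we're not inside a skipped tag.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page, minus skipped tags."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


# A toy page: one real paragraph buried among style and tracking junk.
page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>The good stuff.</p>"
        "<script>track();</script></body></html>")
print(extract_text(page))   # -> The good stuff.
```

In the envisioned pipeline, the HTML saved by curl would simply be read from the file and handed to a filter like this one.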

## A Discovery

And finally, to the point of this recipe: I learned it's not necessary to create a DIY project for this, because some excellent free, open-source tools are already doing it. I use Firefox as my primary browser (it seems to have a higher level of default privacy protection). I discovered an add-on extension for Firefox called Tranquility Reader that does a wonderful job of shit-filtering. For Chrome, there is a similar free, open-source extension called Just Read that also does a very good job.

And so, if you're like me - weary of the constant stream of crap the media outlets and website operators would hose you with - you should know that considerable relief is available with almost no effort on your part: Tranquility Reader for Firefox, and Just Read for Chrome.


Note 1: Contrary to the impression one might get from reading this, I have nothing at all against businesses that seek to generate revenue. It is, after all, the sole motivation of most for-profit business formation, and the backbone of a healthy free-market economy. But note that we have not contracted with these swine to hoover up our personal data, deposit "tracking cookies" on the computers we own, and sell the data they collect on us to any third party who will pay the price. Consequently, I have no compunction about avoiding their shitty tricks with some of my own!


## References

1. An overview of web page content extraction
2. A Q&A at ResearchGate on extracting useful content from websites
3. Tools to extract data from websites
4. Some Firefox add-ons for improving readability
5. I've just installed the Tranquility add-on in Firefox, and it seems to work well
6. A script or source code would support a customized solution (e.g. `curl someURL | readability-script`), and apparently source code may be extracted from add-ons
7. Other "filters", such as Readability and Boilerpipe, may be deprecated
8. A search for libraries or utilities for HTML scraping on Linux
9. Some web-scraping tools
10. A Stack Exchange Q&A that is a must-read, as the author of Boilerpipe provides an answer on "What is readability?" Note: I recommend you read all of the answers.
11. A brief overview of tools for extracting text from HTML pages
12. I'm now using Tranquility Reader as an extension (or add-on) in Firefox. I like what it does on "dirty" web pages - mostly news & weather outlets!
13. I'm using the Just Read extension in Chrome.