webscraper-folha

Python web scraper built as part of a Natural Language Processing (NLP) project: a linguistic study comparing two corpora. Both corpora were built from articles scraped from "Folha de São Paulo", each covering a one-month window around a World Cup opening date, with 20 years between the two corpora.
The project was developed as part of the assessment for "ILC - Introdução à Linguagem Computacional" (Introduction to Computational Linguistics) at UFABC - Universidade Federal do ABC.
The scraper was built to produce a robust corpus for further textual analysis. It was built and tested for a very specific set of conditions and works only on folha.com.br, as the project required.
The corpora consist of texts about the Soccer World Cup from two distinct years, within a one-month range around the opening dates of the '94 and '14 World Cups. All texts contain the words "Copa do Mundo"; for the '94 corpus, the search options also included "printed version".
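
The selection criteria above can be sketched in Python. This is only an illustration, not the project's code: the function names are hypothetical, and the window of 15 days on each side of the opening date is an assumption, since the exact boundaries of the one-month range are not stated here. The opening dates used are those of the 1994 (17 June) and 2014 (12 June) World Cups.

```python
from datetime import date, timedelta

# Opening dates of the 1994 and 2014 World Cups.
OPENING_DATES = {1994: date(1994, 6, 17), 2014: date(2014, 6, 12)}


def search_window(opening, days_each_side=15):
    """Return (start, end) of a window centred on the opening date.

    days_each_side=15 is an assumed reading of "one month around the opening".
    """
    delta = timedelta(days=days_each_side)
    return opening - delta, opening + delta


def in_corpus(text, published, opening, days_each_side=15):
    """True if the text mentions "Copa do Mundo" and falls inside the window."""
    start, end = search_window(opening, days_each_side)
    return start <= published <= end and "copa do mundo" in text.lower()
```

In practice the date range and search term are set in the site's own search form; a check like `in_corpus` would only be useful as a sanity filter on the downloaded texts.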

IMPORTANT NOTE:

This script is not meant to be used to bypass the subscription system.

During development, the group had an account with a recurring payment, which gave it full access to the website.

1st cell: imports;
2nd cell: first search page (must be in "Edição Impressa", the printed edition) and search result count (used to calculate the number of pages to crawl);
3rd cell: save all page content and all page titles in separate lists;
4th cell: create a file for every saved text in the input directory (both single files and a stacked version).
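
The bookkeeping around cells 2 and 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the function names, the file naming scheme, the `stacked.txt` file name, and the default of 25 results per page are all hypothetical and should be adjusted to what the search page really returns. The network requests and HTML parsing of cells 2 and 3 are left out.

```python
import math
import re
from pathlib import Path


def page_count(result_count, results_per_page=25):
    """Number of result pages to crawl (results_per_page is an assumed value)."""
    return math.ceil(result_count / results_per_page)


def safe_filename(title, max_length=80):
    """Turn an article title into a filesystem-friendly name."""
    # Keep letters (including accented ones), digits, underscores, hyphens, spaces.
    cleaned = re.sub(r"[^\w\- ]", "", title).strip().replace(" ", "_")
    return cleaned[:max_length] or "untitled"


def save_corpus(titles, texts, out_dir):
    """Write one file per text plus a single stacked file with all texts."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for index, (title, text) in enumerate(zip(titles, texts), start=1):
        # The numeric prefix keeps articles with identical titles apart.
        file_path = out_path / f"{index:04d}_{safe_filename(title)}.txt"
        file_path.write_text(text, encoding="utf-8")
    stacked = out_path / "stacked.txt"
    stacked.write_text("\n\n".join(texts), encoding="utf-8")
    return stacked
```

For example, `save_corpus(titles, texts, "corpus_1994")` would take the two lists filled in the 3rd cell and write the single files and the stacked version into a `corpus_1994` directory (a hypothetical name).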