This repo contains the scripts for scraping the IITBBS website. The workflow is as follows:
- Run `link_scraper_dfs.py` to extract the links from the website. This stores the links in `crawled_urls.csv`.
- Now run `scraping script.py` to extract content from the links collected in `crawled_urls.csv` and save it to `scraped_data_dfs.json`.
- Run `preprocessed_data.py` to remove unwanted characters from the collected data and store the result in `cleaned_json.py`.
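The three steps above have to run in order, since each script reads the file written by the previous one. As a rough sketch, they could be chained from a small driver like the one below. The driver itself is not part of the repo: `build_commands`, `run_pipeline`, and the `runner` parameter are hypothetical names introduced here for illustration, and the sketch assumes the scripts take no command-line arguments and are run from the repo root.

```python
import subprocess
import sys

# Script names as listed in the workflow above. Order matters because
# each step consumes the output file of the previous step.
PIPELINE = [
    "link_scraper_dfs.py",   # writes crawled_urls.csv
    "scraping script.py",    # reads crawled_urls.csv, writes scraped_data_dfs.json
    "preprocessed_data.py",  # cleans the scraped data
]


def build_commands():
    """Return one argv list per step, using the current Python interpreter.

    Passing argv as a list (not a shell string) means the space in
    "scraping script.py" needs no quoting.
    """
    return [[sys.executable, name] for name in PIPELINE]


def run_pipeline(repo_dir=".", runner=subprocess.run):
    """Run each step in order from repo_dir, stopping at the first failure.

    `runner` defaults to subprocess.run; it is injectable (an assumption of
    this sketch) so the sequence can be dry-run without executing anything.
    """
    for cmd in build_commands():
        print("Running:", cmd[1])
        # check=True makes subprocess.run raise CalledProcessError
        # when a script exits with a non-zero status.
        runner(cmd, check=True, cwd=repo_dir)
```

Calling `run_pipeline()` from the repo root would then execute all three scripts in sequence. Stopping at the first failure is a deliberate choice here: if link extraction fails, the later steps would otherwise work on a missing or stale `crawled_urls.csv`.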