disorientations scraper for Issuu #35

caseyg · 2020-05-25T03:15:47Z

@elazar this issue might be the perfect match for your expertise!

the way I pulled down everything that's on the site so far from Issuu was by crawling their search page for "disorientation" with https://www.parsehub.com, importing the CSV data from Parsehub into Google Sheets, deleting all the rows that weren't relevant, and reshaping the data until I could output it in a format that worked with the Issuu Downloader script.

basically, Disorientations come out every September (at the beginning of the school year), and though I hope to do some outreach so groups start contributing directly to this site, they will likely continue to be uploaded to PDF hosting sites like Issuu and Scribd for some time to come.

I'd love to move towards something that:

- can scrape Issuu's search pages for a modifiable list of keywords ("disorientation", "diso", "disguide", "2020", "2021", etc.) without relying on Parsehub
- pulls down data including:
  - title
  - id
  - author
  - url
  - date
  - page count
  - description
- [dream feature] has some kind of way to review and discard irrelevant results (I did this manually by reviewing the Google Sheet with a checkbox)
- [dream feature] can remember which URLs were flagged for downloading or discarded/irrelevant, so that the script can be run every so often and only return new/changed results
- [dream feature] could do a thorough enough job that I could say for certain there aren't more disos hiding on Issuu. I think I scraped the first few pages of search results using Parsehub, but it wasn't exhaustive.
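
To make the "remember what was already reviewed" idea a bit more concrete, here is a rough Python sketch of how it might be structured. It is not a working Issuu scraper: the `search` argument is a hypothetical callable standing in for whatever ends up fetching Issuu search results, and the `Result` fields, the `KEYWORDS` list, and the JSON state file layout are all assumptions based only on the wishlist above.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Assumed keyword list, taken from the examples in this issue; edit freely.
KEYWORDS = ["disorientation", "diso", "disguide", "2020", "2021"]


@dataclass(frozen=True)
class Result:
    """One search hit, with the fields listed in the wishlist above."""
    title: str
    id: str
    author: str
    url: str
    date: str
    page_count: int
    description: str


def load_state(path: Path) -> dict:
    """Load earlier review decisions, or start empty on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_state(path: Path, state: dict) -> None:
    """Write review decisions back to disk as readable JSON."""
    path.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")


def collect(search: Callable[[str], Iterable[Result]],
            keywords: Iterable[str]) -> Dict[str, Result]:
    """Run every keyword through `search` and de-duplicate hits by URL."""
    found: Dict[str, Result] = {}
    for keyword in keywords:
        for result in search(keyword):
            found.setdefault(result.url, result)
    return found


def pending_review(found: Dict[str, Result], state: dict) -> List[Result]:
    """Return hits that are new, or whose metadata changed since the last decision."""
    pending = []
    for url, result in found.items():
        seen = state.get(url)
        if seen is None or seen["record"] != asdict(result):
            pending.append(result)
    return pending


def record_decision(state: dict, result: Result, keep: bool) -> None:
    """Remember whether a hit was flagged for download or discarded."""
    state[result.url] = {
        "decision": "download" if keep else "discard",
        "record": asdict(result),
    }
```

A run would then look something like: `state = load_state(Path("reviewed.json"))`, `found = collect(search, KEYWORDS)`, loop over `pending_review(found, state)` calling `record_decision(...)` for each hit, and finish with `save_state(...)`. Keying the state by URL is just one option; the Issuu document id might turn out to be a more stable key.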

I looked into Scrapy and other tools but they're a bit over my head...

caseyg added collection tools labels May 25, 2020

caseyg assigned elazar May 25, 2020
