disorientations scraper for Issuu #35

Open
caseyg opened this issue May 25, 2020 · 0 comments

caseyg commented May 25, 2020

@elazar this issue might be the perfect match for your expertise!

the way I pulled down everything that's on the site so far from Issuu was by crawling their search page for "disorientation" with https://www.parsehub.com, importing the CSV data from Parsehub into Google Sheets, deleting all the rows that weren't relevant, and rearranging the data until I could output it in a format that worked with the Issuu Downloader script.

basically, Disorientations come out every September (at the beginning of the school year), and though I hope to do some outreach so groups start contributing directly to this site, they will likely continue to be uploaded to PDF hosting sites like Issuu and Scribd for some time to come.

I'd love to move towards something that:

  • can scrape Issuu's search pages for a modifiable list of keywords ("disorientation", "diso", "disguide", "2020", "2021", etc.) without relying on Parsehub
  • pulls down data including:
    • title
    • id
    • author
    • url
    • date
    • page count
    • description
  • [dream feature] has some kind of way to review and discard irrelevant results (I did this manually by reviewing the Google Sheet with a checkbox)
  • [dream feature] can remember which URLs were flagged for downloading or discarded/irrelevant, so that the script can be run every so often and only return new/changed results
  • [dream feature] could do a thorough enough job that I could say for certain there aren't more disos hiding on Issuu. I think I scraped the first few pages of search results using Parsehub, but it wasn't exhaustive.
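
To make the wishlist above a bit more concrete, here is a rough Python sketch of the record shape and the "remember what was reviewed" part. It is only an illustration under assumptions: the `Result` class, the JSON state file, and the `search` callable are all hypothetical names, and `search` stands in for whatever actually fetches Issuu results for a keyword (nothing here talks to Issuu itself).

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Modifiable keyword list, taken from the examples in this issue.
KEYWORDS = ["disorientation", "diso", "disguide", "2020", "2021"]


@dataclass
class Result:
    """One search hit, with the fields listed in this issue (hypothetical shape)."""
    title: str
    id: str
    author: str
    url: str
    date: str
    page_count: int
    description: str


def load_state(path: str) -> Dict[str, dict]:
    """Load previously reviewed URLs from a JSON file, or start empty."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_state(path: str, state: Dict[str, dict]) -> None:
    """Persist review decisions so the next run can skip them."""
    Path(path).write_text(json.dumps(state, indent=2, sort_keys=True))


def mark(state: Dict[str, dict], result: Result, status: str) -> None:
    """Record a review decision ("download" or "discard") for one result."""
    if status not in ("download", "discard"):
        raise ValueError(f"unknown status: {status}")
    state[result.url] = {"status": status, "record": asdict(result)}


def new_or_changed(results: Iterable[Result], state: Dict[str, dict]) -> List[Result]:
    """Keep results never reviewed, or whose metadata changed since review."""
    fresh: Dict[str, Result] = {}
    for r in results:
        seen = state.get(r.url)
        if seen is None or seen["record"] != asdict(r):
            fresh[r.url] = r  # keyed by URL, so overlapping keywords dedupe
    return list(fresh.values())


def run(search: Callable[[str], Iterable[Result]],
        state: Dict[str, dict],
        keywords: Iterable[str] = KEYWORDS) -> List[Result]:
    """Run every keyword through `search` (assumed, supplied by the caller)."""
    hits = [r for kw in keywords for r in search(kw)]
    return new_or_changed(hits, state)
```

The idea would be to call `run(...)` every so often, review what comes back, call `mark(...)` on each item, and then `save_state(...)`. Keying the state by URL is a design guess; the Issuu document id might turn out to be a more stable key.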

I looked into Scrapy and other tools but they're a bit over my head...
