@elazar this issue might be the perfect match for your expertise!
The way I pulled down everything currently on the site from Issuu was by crawling their search page for "disorientation" with https://www.parsehub.com, importing the CSV data from ParseHub into Google Sheets, deleting all the rows that weren't relevant, and rearranging the data until I could output it in a format that worked with the Issuu Downloader script.
Basically, Disorientations come out every September (at the beginning of the school year), and though I hope to do some outreach so groups start contributing directly to this site, they will likely continue to be uploaded to PDF hosting sites like Issuu and Scribd for some time to come.
I'd love to move towards something that:
- can scrape Issuu's search pages for a modifiable list of keywords ("disorientation", "diso", "disguide", "2020", "2021", etc.) without relying on ParseHub
- pulls down data including:
  - title
  - id
  - author
  - url
  - date
  - page count
  - description
- [dream feature] has some way to review and discard irrelevant results (I did this manually by reviewing the Google Sheet with a checkbox)
- [dream feature] can remember which URLs were flagged for downloading or discarded as irrelevant, so that the script can be run every so often and only return new/changed results
- [dream feature] could do a thorough enough job that I could say for certain there aren't more disos hiding on Issuu. I think I scraped the first few pages of search results using ParseHub, but it wasn't exhaustive.
I looked into Scrapy and other tools but they're a bit over my head...
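To make the "remember which URLs were flagged or discarded" idea concrete, here is a minimal sketch that uses only the Python standard library. It deliberately leaves out the scraping step. It assumes some other code has already produced one record per search result with the fields listed above. The names `Result`, `open_store`, `filter_new`, `mark`, and `with_status` are hypothetical, as are the table layout and the status values. It only detects new URLs; spotting changed results would need an extra comparison of stored fields.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Result:
    """One search result; the field names are illustrative assumptions."""
    title: str
    doc_id: str
    author: str
    url: str
    date: str
    page_count: int
    description: str


def open_store(path=":memory:"):
    """Open (or create) the store that remembers every URL seen so far."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        " url TEXT PRIMARY KEY,"
        " title TEXT,"
        " status TEXT NOT NULL DEFAULT 'new')"
    )
    return conn


def filter_new(conn, results):
    """Return only results whose URL is not stored yet, and store them as 'new'."""
    fresh = []
    for r in results:
        known = conn.execute(
            "SELECT 1 FROM seen WHERE url = ?", (r.url,)
        ).fetchone()
        if known is None:
            conn.execute(
                "INSERT INTO seen (url, title) VALUES (?, ?)", (r.url, r.title)
            )
            fresh.append(r)
    conn.commit()
    return fresh


def mark(conn, url, status):
    """Record a review decision for a URL: 'download' or 'discard'."""
    if status not in ("download", "discard"):
        raise ValueError(f"unknown status: {status!r}")
    conn.execute("UPDATE seen SET status = ? WHERE url = ?", (status, url))
    conn.commit()


def with_status(conn, status):
    """List the stored URLs that currently have the given status."""
    rows = conn.execute(
        "SELECT url FROM seen WHERE status = ? ORDER BY url", (status,)
    )
    return [row[0] for row in rows]
```

Passing a file path such as `open_store("seen.db")` instead of the in-memory default would keep the decisions between runs, so each later run of `filter_new` returns only results that have not been reviewed yet, and `with_status(conn, "download")` gives the list to hand to a downloader.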