Download whole site #8
Thanks, @aolko! I worry slightly that a feature like this would lead to unintentionally massive downloads. On the other hand, I totally understand the impetus/desire to make archiving the assets easier. I'm going to give this a bit more thought, and then perhaps take a shot at prototyping it.
Wayback has copies of my old websites that I'd love to pull down in full. There are apps elsewhere to do this, so it makes sense to add the feature as an option for cases like these.
Hi, @MechMykl. Makes sense. A couple of questions:
I've used SiteSucker in the past to pull down copies of sites (http://www.macupdate.com/app/mac/11634/sitesucker). In one command, I'd love to pull down the first version of a website (complete with all HTML, images, etc.) and then only grab the differences moving toward the most recent capture date. Ex: My Website 2010 [HTML / Images / JS]. This would help me better organize and review the years and years of backups Wayback has without having to navigate through broken records so often!
Indeed, it would be great to download a whole website, as some people did here (https://raw.githubusercontent.com/typophile/typophile.github.io/master/wayback-typophile.pl), to replicate a dead website.
Agreed; doesn't seem technologically very difficult. Because
@jsvine Sure, sequential download is better for the servers 👍 By the way, here is a copy of Typophile's README and how they extract the unique URLs:

Obtaining the HTML files

The Wayback Machine has a number of APIs, one of which is the CDX Server API. Querying it for the site will take a couple of minutes, and will return the complete list of all 471,100 URLs that it knows about.
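For reference, a CDX query of that shape looks roughly like the following. The endpoint is the real CDX Server API; the exact parameters the Typophile dump used are an assumption on my part:

```bash
# Ask the CDX Server API for every capture of typophile.com that the
# Wayback Machine knows about; collapse=urlkey keeps one row per unique URL.
curl "https://web.archive.org/cdx/search/cdx?url=typophile.com&matchType=domain&collapse=urlkey" \
  > typophile-urls.txt
```

Each output line lists the URL key, capture timestamp, original URL, MIME type, HTTP status, digest, and length.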
We prefer one-at-a-time downloads. Even if you're downloading something big via our 60 gigabit link, it's reasonably fair to our other users thanks to the one-at-a-time-ness.
Saw that this got bumped to v0.3.4. Can't tell from the current readme: does Waybackpack pull down the site assets or just the linked HTML files? I'm excited to start grabbing my old sites but don't want to jump in yet if it's just the page code :)
Thanks for checking in, @MechMykl. There's a nice pull request from @fgregg — #17 — that should provide a handy way to get all assets that match a wildcard search. One hitch: Unix filesystems are case-insensitive, while URL paths are not. (Try
Um, they are definitely case-sensitive:

```
$ touch foo
$ ls foo
foo
$ ls FOO
/bin/ls: cannot access 'FOO': No such file or directory
```
This is true, except for the percent-encoding triplets (the hex digits in %-escapes are case-insensitive, so %2F and %2f are equivalent).
Re. the case-sensitivity, I could have been clearer. Here's the issue:

```
$ mkdir mypath
$ touch mypath/index.html
$ mkdir MyPath
mkdir: MyPath: File exists
```
@jsvine Are you using Cygwin? Because Unix is certainly not case-insensitive.
Nope, I'm on OS X El Capitan. What machine are you on? And what happens when you run the commands in my previous comment?
OK, so @jsvine's filesystem is most likely HFS+, which indeed is case-insensitive.

```
$ mkdir mypath
$ mkdir MyPath
$ ls
MyPath mypath
```
Many thanks, Jakub. That helps clarify the confusion on this thread; my apologies for my hand in it! My gut feeling is that the best approach to waybackpacking a full site (and, perhaps, even for single pages) is to dump all resources into a database (SQLite by default, but anything that plays well with
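For what it's worth, here is a minimal sketch of what such a store could look like; the database name, table, and columns are illustrative assumptions, not an existing waybackpack schema:

```bash
# Hypothetical layout, for illustration only.
sqlite3 waybackpack.db <<'SQL'
CREATE TABLE IF NOT EXISTS snapshots (
    original_url TEXT NOT NULL,  -- archived URL, stored verbatim
    timestamp    TEXT NOT NULL,  -- Wayback capture timestamp, e.g. 20100615120000
    mime_type    TEXT,           -- Content-Type reported by the archive
    body         BLOB,           -- raw response body
    PRIMARY KEY (original_url, timestamp)
);
SQL
```

Keying records on the URL text rather than on filesystem paths would also sidestep the case-collision problem discussed above.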
@jsvine The possibility of downloading PDFs, .doc, .xls, etc. is really important for researchers. Maybe just ban photos and video, if anything?
@wumpus Could you please let us know more about Internet Archive policies on automated downloads?
I no longer work at IA.
Any progress on this?
Unfortunately, I haven't been able to carve out time for this yet. But it's certainly still on my todo list.
Hi jsvine, any progress on this yet?
The website I want to download has images hosted on a separate CDN URL; is it possible to get those too?
Not currently.
Not currently.
Any progress on this? It would be great to follow the links recursively and download assets.
Not currently, but you might have luck with the code in this PR: #17
Add a param to download the whole site with assets, not just pages
(as of right now it only captures the HTML pages of the site).
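For illustration, the requested behavior might look something like the following; `--follow-assets` is a hypothetical flag name, not an existing waybackpack option:

```bash
# Hypothetical flag, shown only to sketch the request: besides the HTML,
# also fetch the images, CSS, and JS that each snapshot references.
waybackpack example.com -d ~/example-archive --follow-assets
```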