Download whole site #8

Open
aolko opened this issue May 4, 2016 · 27 comments

aolko commented May 4, 2016

Add a parameter to download the whole site with assets, not just pages
(as of right now it only captures the HTML pages of the site).

jsvine (Owner) commented May 5, 2016

Thanks, @aolko! I worry slightly that a feature like this would lead to unintentionally massive downloads. On the other hand, I totally understand the impetus/desire to make archiving the assets easier. I'm going to give this a bit more thought, and then perhaps take a shot at prototyping it.

MechMykl commented May 5, 2016

Wayback has copies of my old websites that I'd love to pull down in full. There are apps elsewhere to do this, so it makes sense to add the feature as an option for cases like these.

jsvine (Owner) commented May 5, 2016

Hi, @MechMykl. Makes sense. A couple of questions:

  • Are you looking to, in one command, pull down every single version of your old website? Or are you looking to download specific snapshots? E.g., your entire website as it existed on Jan. 1, 2010?
  • What are the other apps you mention? Would love to see how they handle this sort of thing.

MechMykl commented May 6, 2016

I've used SiteSucker in the past to pull down copies of sites - http://www.macupdate.com/app/mac/11634/sitesucker

In one command, I'd love to pull down the first version of a website (complete with all HTML / images / etc.) and then only grab differences moving towards the most recent capture date. Ex:

My Website 2010 [HTML / Images / JS]
My Website 2011 [HTML]
My Website 2012 [Images]

This would help me better organize and review the years and years of backups Wayback has without having to navigate through broken records so often!

Jolg42 commented May 10, 2016

Indeed, it would be great to download a whole website, as some people did here https://raw.githubusercontent.com/typophile/typophile.github.io/master/wayback-typophile.pl to replicate a dead website.
In fact, it doesn't look hard to do, but I agree with you @jsvine that it could have the same effect as a DDoS…

jsvine (Owner) commented May 10, 2016

Agreed; it doesn't seem technologically very difficult. Because waybackpack downloads files sequentially (rather than in parallel), DDoS isn't so much a concern (unless I'm overlooking something) as massive bandwidth consumption. @wumpus: any thoughts on this?
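
For illustration only, here is a minimal sketch of the sequential approach under discussion, with a short pause between requests to keep load on the Archive low. The URL list, pause length, and file naming are assumptions made up for the example, not anything waybackpack actually does:

import os
import time
import requests

def download_sequentially(urls, dest_dir, pause=1.0):
    # Fetch each archived URL one at a time (no parallelism), sleeping briefly
    # between requests so the load on web.archive.org stays modest.
    os.makedirs(dest_dir, exist_ok=True)
    for i, url in enumerate(urls):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        with open(os.path.join(dest_dir, "asset_%05d" % i), "wb") as f:
            f.write(resp.content)
        time.sleep(pause)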

Jolg42 commented May 10, 2016

@jsvine Sure sequential download is better for the servers 👍

By the way, here is a copy of Typophile's README and how they extract the unique URLs:

Obtaining the HTML files

The Wayback Machine has a number of APIs, one of which is the CDX Server API.
This lists all URIs archived from a given site, and can be called on typophile like this:

curl 'http://web.archive.org/cdx/search/cdx?url=*.typophile.com&fl=urlkey,timestamp' > urls.txt ;

This will take a couple of minutes, and will return the complete list of all 471,100 URLs that it knows about.
Since there are 1,796 duplicates, we can remove them:

awk '{print $1}' urls.txt | sort | uniq > urls-unique.txt ;
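
For anyone who prefers Python, a rough sketch of an equivalent of the curl + awk pipeline above might look like this (the domain is just the example from the README; swap in your own site):

import requests

# Ask the Wayback Machine's CDX Server API for every archived URL of the site.
params = {"url": "*.typophile.com", "fl": "urlkey,timestamp"}
resp = requests.get("http://web.archive.org/cdx/search/cdx", params=params, timeout=300)
resp.raise_for_status()

# Keep the first field (urlkey) of each line and drop duplicates.
unique_keys = sorted({line.split()[0] for line in resp.text.splitlines() if line.strip()})
with open("urls-unique.txt", "w") as f:
    f.write("\n".join(unique_keys) + "\n")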

wumpus commented May 11, 2016

We prefer one-at-a-time downloads. Even if you're downloading something big via our 60 gigabit link, it's reasonably fair to our other users thanks to the one-at-a-time-ness.

@MechMykl

Saw that this got bumped to v0.3.4. I can't tell from the current readme: does Waybackpack pull down the site assets, or just the linked HTML files? I'm excited to start grabbing my old sites but don't want to jump in yet if it's just the page code :)

jsvine (Owner) commented Sep 17, 2016

Thanks for checking in, @MechMykl. There's a nice pull request from @fgregg — #17 — that should provide a handy way to get all assets that match a wildcard search. One hitch: Unix filesystems are case-insensitive, while URL paths are not. (Try mkdir TMP; ls -la TMP; ls -la tmp.) So the current approach — storing each asset at the path derived from its URL — breaks on wildcard searches that return URLs with differently-cased subpaths (e.g., example.com/BIG_KAHUNA/photo.jpg and example.com/big_kahuna/photo.jpg). One possible solution would be to store assets in a SQLite database (or somesuch), but that'd take a little re-engineering.
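
For illustration, here is one way the collision could be sidestepped without a database: derive the local path from the URL, but append a short hash of the original case-preserving path, so two URLs that differ only in case never map to the same file. This is just a sketch of a possible workaround, not how waybackpack currently stores assets:

import hashlib
from urllib.parse import urlparse

def collision_safe_path(url):
    # Keep the human-readable path, but suffix it with a short hash of the
    # original (case-preserving) URL path so that URLs differing only in case
    # can't overwrite each other on a case-insensitive filesystem.
    parsed = urlparse(url)
    digest = hashlib.sha1(parsed.path.encode("utf-8")).hexdigest()[:8]
    return parsed.netloc + parsed.path + "." + digest

# The two example URLs above now map to distinct local paths:
assert (collision_safe_path("http://example.com/BIG_KAHUNA/photo.jpg")
        != collision_safe_path("http://example.com/big_kahuna/photo.jpg"))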

jwilk (Contributor) commented Dec 13, 2016

"Unix filesystems are case-insensitive"

Um, they are definitely case-sensitive:

$ touch foo
$ ls foo
foo
$ ls FOO
/bin/ls: cannot access 'FOO': No such file or directory

"while URL paths are not"

This is true, except for the percent-encoding triplets.
Source:
RFC 3986 §6.2.2.1
RFC 2616 §3.2.3

jsvine (Owner) commented Dec 14, 2016

Re. the case-sensitivity, I could have been clearer. Here's the issue:

  • Let's say your website has pages at /mypath/index.html and /MyPath/index.html. These paths are case-sensitive, and point to different resources.

  • Here's the problem you'd encounter on Unix:

$ mkdir mypath
$ touch mypath/index.html
$ mkdir MyPath
mkdir: MyPath: File exists

n3storm commented Feb 15, 2017

@jsvine Are you using Cygwin? Because Unix is definitely not case-insensitive.

jsvine (Owner) commented Feb 15, 2017

Nope, I'm on OS X El Capitan. What machine are you on? And what happens when you run the commands in my previous comment?

jwilk (Contributor) commented Feb 16, 2017

OK, so @jsvine's filesystem is most likely HFS+, which indeed is case-insensitive.
But this is not traditional Unix filesystem semantics.
Here, on Linux with an ext4 filesystem, it is case-sensitive:

$ mkdir mypath
$ mkdir MyPath
$ ls
MyPath  mypath

jsvine (Owner) commented Feb 16, 2017

Many thanks, Jakub. That helps clarify the confusion on this thread; my apologies for my hand in it!

My gut feeling is that the best approach to waybackpacking a full site (and, perhaps, even for single pages) is to dump all resources into a database (SQLite by default, but anything that plays well with sqlalchemy). That'd make the filesystem a moot point, and also hypothetically enable some neat features — e.g., merging "packs" — and querying. What do you all think?
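
Purely as a sketch of what that could look like (the table name, columns, and database filename below are made up for illustration; nothing has been decided):

from sqlalchemy import (Column, Integer, LargeBinary, String,
                        UniqueConstraint, create_engine)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Asset(Base):
    # One archived resource: its original URL, the Wayback timestamp of the
    # snapshot, and the raw bytes. Storing the bytes in the database makes
    # filesystem case-sensitivity a non-issue.
    __tablename__ = "assets"
    id = Column(Integer, primary_key=True)
    url = Column(String, nullable=False)
    timestamp = Column(String, nullable=False)  # e.g. "20100101000000"
    content = Column(LargeBinary, nullable=False)
    __table_args__ = (UniqueConstraint("url", "timestamp"),)

engine = create_engine("sqlite:///waybackpack.db")  # SQLite by default
Base.metadata.create_all(engine)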

@Ninoninoninonino

@jsvine Being able to download PDFs, .doc, .xls, etc. is really important for researchers. Maybe just exclude photos and videos, if anything?

Ninoninoninonino commented Mar 17, 2017

@wumpus Could you please let us know more about Internet Archive policies on automated downloads?

wumpus commented Mar 18, 2017

I no longer work at IA.

JacobDB commented Sep 6, 2017

Any progress on this?

jsvine (Owner) commented Sep 12, 2017

Unfortunately, I haven't been able to carve out time for this yet. But it's certainly still on my todo list.

EbuXa commented Aug 15, 2020

Hi @jsvine,
when I want to download a specific snapshot of all the pages of my site, it only downloads the index.html file. The command is: waybackpack example.com -d downfolder/ --follow-redirects --from-date 201901

@lucky1804

Any progress on this yet?

@lucky1804

The website I want to download has images hosted on a separate CDN URL; is it possible to get those too?

jsvine (Owner) commented Jan 18, 2021

Any progress on this yet?

Not currently.

The website I want to download has images hosted on a separate CDN URL; is it possible to get those too?

Not currently.

@belisards

Any progress on this? It would be great to follow the links recursively and download assets.

jsvine (Owner) commented Mar 2, 2024

Any progress on this?

Not currently, but you might have luck with the code in this PR: #17
