Releases · TaylorJadin/site-archiving-toolkit

07 Jun 20:38

TaylorJadin

2024-06-07

b1300dd

2024-06-07 Latest


Turn off the creation of .htaccess and redirect.php files for webrecorder archives by default. The way they are implemented can cause problems if the archives aren't at the webroot.
Fix permissions before zipping up files to make sure they are ready for cPanel.
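
As a rough illustration of what fixing permissions before zipping can look like, here is a minimal Python sketch. The function name `fix_permissions` and the modes used (644 for files, 755 for directories, a common convention on shared hosting) are assumptions for illustration; the release notes do not say which values the toolkit actually applies.

```python
import os


def fix_permissions(root: str, file_mode: int = 0o644, dir_mode: int = 0o755) -> None:
    """Recursively set directory and file modes under root.

    The default modes are an assumption (a common shared-hosting convention),
    not values taken from the toolkit itself.
    """
    os.chmod(root, dir_mode)
    for dirpath, dirnames, filenames in os.walk(root):
        # Directories are fixed before os.walk descends into them.
        for name in dirnames:
            os.chmod(os.path.join(dirpath, name), dir_mode)
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), file_mode)
```

Running something like this on the crawl output folder before creating the zip means the extracted files should already be readable by the web server.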


12 Mar 20:16

TaylorJadin

2024-03-12

bceea3b

2024-03-12

Automatically redirect out for pages that aren't found in the archive.

Thanks to Shannon for finding this!

If you don't want this behavior, you can change true to false on line 8 of the index.html that gets created for each crawl. I do plan on making this an option in archive.ini in a future release.


23 Feb 18:35

TaylorJadin

2024-02-23

a5fcd75

2024-02-23

Two new features!

1. A basic archive.ini to set preferences!

On run, an archive.ini file gets created, which will let you set different options and crawl parameters. Simply delete the file if you want to return to the default settings, and a fresh file will get created on next run.

This file allows you to define your own settings for httrack and browsertrix crawler, or turn either of those tools off if you only want to use one or the other. It also enables or disables the other new feature!

2. Automatic redirects for archives made in browsertrix crawler!

The site archiving toolkit will now make a .htaccess file and a redirect.php file so that your webrecorder archives can handle other URLs that exist in the archive. This is super handy for archiving sites in place.

For example:
https://reclaimopen.com has been archived using browsertrix crawler, but what if you want to visit https://reclaimopen.com/schedule-day-2? Normally, that URL would not work anymore, but now the redirect.php and .htaccess files will redirect your browser to the proper URL inside of the archive, in this example https://reclaimopen.com/#url=https://reclaimopen.com/schedule-day-2

Note that these automatic redirects will only work out-of-the-box on Apache-compatible servers with htaccess rules enabled. The archive also needs to be at the webroot. If either of these isn't true of your setup, you may be able to modify the template to fit your needs!
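
To make the URL mapping in the example concrete, here is a small Python sketch of the same rewrite. It is only an illustration of the pattern shown above, not the toolkit's redirect.php; the function name `archive_redirect_url` is hypothetical, and it assumes the archive sits at the webroot of the same host.

```python
from urllib.parse import urlsplit


def archive_redirect_url(requested_url: str) -> str:
    """Map a URL from the original site to its location inside the archive."""
    parts = urlsplit(requested_url)
    # Assumption: the archive's replay page lives at the webroot of the same host.
    root = f"{parts.scheme}://{parts.netloc}/"
    # The original URL is passed along in the fragment, as in the example above.
    return f"{root}#url={requested_url}"


print(archive_redirect_url("https://reclaimopen.com/schedule-day-2"))
# → https://reclaimopen.com/#url=https://reclaimopen.com/schedule-day-2
```

If your archive lives in a subfolder instead, `root` is the part you would need to change, which is the same adjustment the template would need.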


23 Oct 16:55

TaylorJadin

2023-10-23

b096389

2023-10-23

The webrecorder script will now rename archive.wacz to a unique name based on the domain name and current time. This clears up issues with replayweb.page and browser caching.
On macOS and Linux, the scripts now automatically check whether docker is running and whether a crawl is already in progress, and exit if docker is not running or a crawl is underway.
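
For a sense of how a unique name can be built from the domain and the current time, here is a minimal Python sketch. The function name `unique_wacz_name` and the exact filename format are assumptions for illustration; the release notes do not specify the format the script uses.

```python
from datetime import datetime
from urllib.parse import urlsplit


def unique_wacz_name(site_url: str, now: datetime) -> str:
    """Build a .wacz filename from the site's domain and a timestamp."""
    # Fall back to a generic prefix if the URL has no hostname.
    domain = urlsplit(site_url).hostname or "archive"
    # Assumed format: date and time down to the second, so repeat crawls differ.
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return f"{domain}-{stamp}.wacz"


print(unique_wacz_name("https://reclaimopen.com", datetime(2023, 10, 23, 16, 55, 0)))
# → reclaimopen.com-20231023-165500.wacz
```

Because each crawl produces a different filename, the browser and replayweb.page see a new file rather than reusing a cached copy of an older archive.wacz.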


04 Aug 18:12

TaylorJadin

2023-08-04

373be1a

2023-08-04

Big update!

The toolkit is now using the latest version of browsertrix-crawler, which seems to fix some bugs where it would get stuck on certain sites. The toolkit is actually using browsertrix-crawler's latest tag, so it will always be the latest version. In addition to that, httrack now runs in its own separate Debian container, which really just means we are now running the latest version of httrack as well. Oh, and also I'm tagging releases just by date instead of version numbers.
