Releases: TaylorJadin/site-archiving-toolkit

2024-06-07

07 Jun 20:38
  • Turn off the creation of .htaccess and redirect.php files for webrecorder archives by default. The way they are implemented can cause problems if the archives aren't at the webroot.
  • Fix permissions before zipping up files to make sure they are ready for cPanel

2024-03-12

12 Mar 20:16

Automatically redirect out for pages that aren't found in the archive.

Thanks to Shannon for finding this!

If you don't want this behavior, you can change true to false on line 8 of the index.html that gets created for each crawl. I do plan on making this an option in archive.ini in a future release.

2024-02-23

23 Feb 18:35

Two new features!

1. A basic archive.ini to set preferences!

On run, an archive.ini file gets created, which will let you set different options and crawl parameters. Simply delete the file if you want to return to the default settings, and a fresh file will get created on next run.

This file allows you to define your own settings for httrack and browsertrix crawler, or turn either of those tools off if you only want to use one or the other. It also enables or disables the other new feature!
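The create-if-missing behavior described above can be sketched in Python. This is only an illustration, not the toolkit's own code: the section names, the key names (`enabled`, `redirects`), and the `load_settings` helper are all hypothetical, and the real archive.ini may be laid out differently.

```python
import configparser
from pathlib import Path

# Hypothetical defaults: the real archive.ini sections and keys may differ.
DEFAULTS = {
    "httrack": {"enabled": "true"},
    "browsertrix": {"enabled": "true", "redirects": "true"},
}


def load_settings(path: Path) -> configparser.ConfigParser:
    """Read settings from path, writing a fresh default file if none exists."""
    config = configparser.ConfigParser()
    if not path.exists():
        # No file yet (or the user deleted it): recreate it from the defaults.
        config.read_dict(DEFAULTS)
        with path.open("w") as handle:
            config.write(handle)
    else:
        # An existing file wins, so edits made by the user are respected.
        config.read(path)
    return config
```

Calling `load_settings(Path("archive.ini"))` on a first run would write the defaults to disk. Deleting the file and calling it again would restore them, which mirrors the reset behavior described above.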

2. Automatic redirects for archives made in browsertrix crawler!

The site archiving toolkit will now make a .htaccess file and a redirect.php file so that your webrecorder archives can handle other URLs that exist in the archive. This is super handy for archiving sites in place.

For example:
https://reclaimopen.com has been archived using browsertrix crawler, but what if you want to visit https://reclaimopen.com/schedule-day-2? Normally, that URL would not work anymore, but now the redirect.php and .htaccess files will redirect your browser to the proper URL inside of the archive, in this example https://reclaimopen.com/#url=https://reclaimopen.com/schedule-day-2

Note, these automatic redirects will only work out-of-the-box on Apache-compatible servers with htaccess rules enabled. The archive also needs to be at the webroot. If these aren't the case in your setup, you may be able to modify the template to fit your needs!
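The URL mapping in the example above can be sketched as a small Python function. This only illustrates how the target URL is built, assuming the archive sits at the webroot of the same host. The actual redirect is done by the generated .htaccess and redirect.php files, and `archive_redirect_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def archive_redirect_url(requested_url: str) -> str:
    """Map a page URL to its replay location in an archive at the webroot."""
    parts = urlsplit(requested_url)
    # Assumption: the archive's index page is served from the host's webroot.
    webroot = f"{parts.scheme}://{parts.netloc}/"
    # The original URL is passed to the replay page in the #url= fragment.
    return f"{webroot}#url={requested_url}"


print(archive_redirect_url("https://reclaimopen.com/schedule-day-2"))
# prints https://reclaimopen.com/#url=https://reclaimopen.com/schedule-day-2
```

Because the webroot is derived from the requested host, this sketch would build the wrong target for an archive kept in a subdirectory, which is the limitation the note above describes.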

2023-10-23

23 Oct 16:55
  • The webrecorder script will now rename archive.wacz to unique names based on the domain name and current time. This clears up issues with replayweb.page and browser caching.
  • On macOS and Linux, the toolkit now checks automatically whether Docker is running and whether a crawl is already in progress, and exits if Docker is not running or a crawl is underway.
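The renaming in the first item above could look roughly like the following Python sketch. The filename pattern (domain, then a `YYYYMMDD-HHMMSS` timestamp) and the `unique_wacz_name` helper are assumptions for illustration; the release notes do not state the exact format the webrecorder script uses.

```python
from datetime import datetime
from urllib.parse import urlsplit


def unique_wacz_name(site_url: str, now: datetime) -> str:
    """Build a per-crawl archive filename from the domain and a timestamp."""
    # Fall back to a generic name if the URL has no recognizable hostname.
    domain = urlsplit(site_url).hostname or "archive"
    # Hypothetical timestamp layout; the real script may format it differently.
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return f"{domain}-{stamp}.wacz"


print(unique_wacz_name("https://reclaimopen.com", datetime(2023, 10, 23, 16, 55, 0)))
# prints reclaimopen.com-20231023-165500.wacz
```

Since every crawl gets a different filename, a browser or replayweb.page cannot mistake a cached copy of an older archive for the new one.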

2023-08-04

04 Aug 18:12

Big update!

The toolkit is now using the latest version of browsertrix-crawler which seems to fix some bugs where it would get stuck on certain sites. The toolkit is actually using browsertrix-crawler's latest tag, so it will always be the latest version. In addition to that, httrack is now split out into running in a separate Debian container, which really just means we are now running the latest version of httrack as well. Oh, and also I'm tagging releases just by date instead of version numbers.

v0.4

04 May 18:41
  • fixed the replay template
  • adjusted httrack settings further
  • switched web server to restart: unless-stopped

v0.3

05 Apr 13:22

Pin browsertrix crawler to v0.8.1 and don't delete log files after crawl completes.

v0.2

15 Mar 22:00
1455bd0

Have httrack use the -n option when crawling sites.

v0.1

01 Mar 02:20
script