Releases: TaylorJadin/site-archiving-toolkit
2024-06-07
- Turn off the creation of `.htaccess` and `redirect.php` files for webrecorder archives by default. The way they are implemented can cause problems if the archives aren't at the webroot.
- Fix permissions before zipping up files to make sure they are ready for cPanel.
2024-03-12
Automatically redirect out for pages that aren't found in the archive.
Thanks to Shannon for finding this!
If you don't want this behavior, you can change `true` to `false` on line 8 of the `index.html` that gets created for each crawl. I do plan on making this an option in `archive.ini` in a future release.
2024-02-23
Two new features!
1. A basic archive.ini to set preferences!
On run, an `archive.ini` file gets created, which lets you set different options and crawl parameters. Simply delete the file if you want to return to the default settings, and a fresh file will get created on the next run.
This file allows you to define your own settings for httrack and browsertrix crawler, or turn either of those tools off if you only want to use one or the other. It also enables or disables the other new feature!
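The "delete it and a fresh one gets created" behavior can be sketched roughly like this in Python. This is only an illustration of the pattern, not the toolkit's actual code: the section names, key names, and the `load_settings` helper are all made up for the example, and the real `archive.ini` layout may look quite different.

```python
import configparser
from pathlib import Path

# Hypothetical defaults: these section and key names are illustrative only,
# not the toolkit's real archive.ini layout.
DEFAULTS = {
    "httrack": {"enabled": "true"},
    "browsertrix": {"enabled": "true", "redirects": "true"},
}


def load_settings(path: Path) -> configparser.ConfigParser:
    """Read settings from path, writing a fresh default file first if it is missing."""
    config = configparser.ConfigParser()
    if not path.exists():
        # No file yet (or the user deleted it): recreate it with the defaults.
        config.read_dict(DEFAULTS)
        with path.open("w") as handle:
            config.write(handle)
        return config
    # File exists: use whatever the user has set.
    config.read(path)
    return config
```

Calling `load_settings(Path("archive.ini"))` and then `settings.getboolean("httrack", "enabled")` would tell a script whether to run that tool, under the assumed key names above.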
2. Automatic redirects for archives made in browsertrix crawler!
The site archiving toolkit will now make a `.htaccess` file and a `redirect.php` file so that your webrecorder archives can handle other URLs that exist in the archive. This is super handy for archiving sites in place.
For example:
https://reclaimopen.com has been archived using browsertrix crawler, but what if you want to visit https://reclaimopen.com/schedule-day-2? Normally, that URL would not work anymore, but now the `redirect.php` and `.htaccess` files will redirect your browser to the proper URL inside of the archive, in this example https://reclaimopen.com/#url=https://reclaimopen.com/schedule-day-2
Note, these automatic redirects will only work out-of-the-box on Apache-compatible servers with htaccess rules enabled. The archive also needs to be at the webroot. If these aren't the case in your setup, you may be able to modify the template to fit your needs!
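To make the mapping in the example above concrete, here is a small Python sketch of how a requested URL turns into the replay URL at the webroot. The real redirect is done by the `.htaccess` and `redirect.php` files, so this is not the toolkit's implementation; the function name `archive_redirect_target` is hypothetical, and it assumes the archive sits at the webroot as noted above.

```python
from urllib.parse import urlsplit


def archive_redirect_target(requested_url: str) -> str:
    """Map a requested URL to the replay URL served from the site's webroot."""
    parts = urlsplit(requested_url)
    # Assumption: the archive's replay page lives at the webroot of the same host.
    webroot = f"{parts.scheme}://{parts.netloc}/"
    # The original URL is passed along in the fragment after "#url=".
    return f"{webroot}#url={requested_url}"


print(archive_redirect_target("https://reclaimopen.com/schedule-day-2"))
# prints https://reclaimopen.com/#url=https://reclaimopen.com/schedule-day-2
```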
2023-10-23
- The webrecorder script will now rename archive.wacz to unique names based on the domain name and current time. This clears up issues with replayweb.page and browser caching.
- On macOS and Linux, automatically exit if docker is not running or if a crawl is already in progress.
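The renaming in the first item above could look something like the following Python sketch. The exact naming pattern the toolkit uses isn't spelled out here, so the `domain-timestamp.wacz` format and the `unique_wacz_name` helper are assumptions for illustration only.

```python
from datetime import datetime
from urllib.parse import urlsplit


def unique_wacz_name(site_url: str, now: datetime) -> str:
    """Build a per-crawl file name from the site's domain and a timestamp."""
    # Fall back to a generic label if the URL has no hostname.
    domain = urlsplit(site_url).hostname or "archive"
    # Assumed timestamp format; any format that changes per crawl would do.
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return f"{domain}-{stamp}.wacz"


print(unique_wacz_name("https://reclaimopen.com", datetime(2023, 10, 23, 14, 30, 0)))
# prints reclaimopen.com-20231023-143000.wacz
```

Because every crawl gets a different file name, a browser or replayweb.page has no previously cached file under that name to serve by mistake.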
2023-08-04
Big update!
The toolkit is now using the latest version of browsertrix-crawler, which seems to fix some bugs where it would get stuck on certain sites. The toolkit is actually using browsertrix-crawler's `latest` tag, so it will always be the latest version. In addition to that, httrack is now split out into running in a separate Debian container, which really just means we are now running the latest version of httrack as well. Oh, and also I'm tagging releases just by date instead of version numbers.
v0.4
- fixed the replay template
- adjusted httrack settings further
- switched web server to restart: unless-stopped
v0.3
Pin browsertrix crawler to v0.8.1 and don't delete log files after crawl completes.
v0.2
Have httrack use the `-n` option when crawling sites.
v0.1
script