Hoardy-Web
is a suite of tools that helps you to passively capture, archive, and hoard your web browsing history.
Not just the URLs, but also the contents and the requisite resources (images, media, CSS
, fonts, etc) of the pages you visit.
Not just the last 3 months, but from the beginning of time you start using it.
Practically speaking, you install the Hoardy-Web
browser extension/add-on into your web browser and just browse the web normally while Hoardy-Web
passively, in background, captures and archives web pages you visit for later offline viewing, mirroring, and/or indexing.
Hoardy-Web
has a lot of configuration options to help you tweak what should or should not be archived and a very low memory footprint, keeping you browsing experience snappy even on ancient hardware (unless explicitly configured otherwise to, e.g., minimize writes to disk instead).
To start saving your browsing history, you can start using the Hoardy-Web
extension independently of other tools that are being developed in this repository.
To view/replay your archived data over HTTP
(a-la a local Wayback Machine), generate static website mirrors from it (a-la wget -mpk
), search, inspect, organize, manipulate, programmatically extract values from, and run scripts over your collected data you will need to install at least the accompanying hoardy-web
tool.
See the next section for a little storyboard explaining the above with pictures.
To learn more:
- See the "Supported use cases" section below for the setups
Hoardy-Web
is designed for. - See the "Why does
Hoardy-Web
exists?" section below for why you might want to do this. - See the "Highlights" section below for a longer description of what
Hoardy-Web
does and does not do. - See the "Alternatives" below for in-depth comparisons to alternatives.
- See the "Frequently Asked Questions" section of the extension's
Help
page for the answers to those. Additionally, see the "Quirks and Bugs" section there for a full list of currently known quirks and issues/bugs. These two sections of theHelp
page answer questions about most common issues you can encounter while usingHoardy-Web
extension/add-on. - See the "Quickstart" section for setup instructions.
Hoardy-Web
was previously known as "Personal Private Passive Web Archive" aka "pwebarc".
If you are reading this on GitHub, be aware that this repository is a mirror of a repository on the author's web site.
In author's humble opinion, the rendering of the documentation pages there is superior to what can be seen on GitHub (its implemented via pandoc
there).
So, for illustrative purposes, I installed the Hoardy-Web
extension and visited a Wikipedia page:
Also note that I had enabled limbo mode before visiting it so that I could first look at the page before deciding if I want to archive it.
(This is most useful for dynamically generated pages that update all the time with only some versions deserving being archived.)
So, then, I decided I do want to save it, so I pressed the lower of "In limbo" check-mark buttons there to collect and archive everything from that tab to my hoardy-web serve
archiving server instance.
Then, I pressed the "Replay" button to switch to the replay page generated by hoardy-web serve
:
... and closed the browser.
Then, later, I reopened it, restored the last session, and that tab was restored back with zero requests to the Internet.
See there for more screenshots, including those with Hoardy-Web
running on Chromium.
Regular users are expected to capture data using the Hoardy-Web
browser extension/add-on, producing HTTP
request+response dumps in WRR
file format, archive those to disk using one of the supported methods, and then use various sub-command of the hoardy-web
tool to
-
replay (a subset of) your archives over
HTTP
by feeding them tohoardy-web serve
and then using a normal web browser and navigating to links like http://127.0.0.1:3210/web/2/https://archiveofourown.org/works/3733123;this is similar to what Wayback Machine, heritrix, and pywb do;
-
generate a static offline website mirror from (a subset of) your archives by feeding them to
hoardy-web mirror
to produce a bunch of interlinkedHTML
,CSS
, images, and other files, which you can then view using a normal web browser, or with a e-book reader, or you can instead feed them to recoll or some other desktop search engine, etc;this is similar to what
wget -mpk
(wget --mirror --page-requisites --convert-links
) does, excepthoardy-web mirror
has a ton of cool optionswget
does not (e.g. it canscrub
generated pages in various ways, de-duplicate the files it generates, including between different websites and different generated mirrors, etc), and should you discover you dislike the generated result for some reason, you can change some or all of those options and re-generate the mirror without re-downloading anything.
See hoardy-web
's "Quickstart" section for more info.
Users that want to get more out of their archives and are not afraid of the command-line interface, can use
hoardy-web find
to find a subset of archives matching a specified criteria orhoardy-web organize
to instead maintain a tree of symlinks pointing to each or latest archive matching a specified criteria (e.g., maintain a tree of per-domain pointers to the latest visit to each URL, or links to all ever archived images, orPDF
files, etc)
and then use one of the ready-made scripts on the results to, e.g.,
- view
HTML
documents viapandoc
piped intoless
in your favorite tty emulator, - listen their contents with a TTS engine via
spd-say
, - open files stored inside
WRR
dumps viaxdg-open
(so, e.g., you can view images stored inside without first runninghoardy-web mirror
), - etc.
Alternatively, you can use the hoardy-web get
sub-command of the hoardy-web
tool to make your own scripts for processing archived web pages and files in arbitrary ways, similarly to the ready-made scripts.
Or, you can programmatically extract values from large subsets of your data by using the hoardy-web stream
sub-command to filter
and map
your captures with pre-defined filters and functions, or with your own custom code, and thus produce streams of interesting values in a raw textual format, and/or as structured JSON
and/or CBOR
(RFC8949) values, which you can then pipe somewhere else.
This, for instance, allows you to easily query things like "get me a list of domains that ever used CloudFlare when I visited them", or anything else you can imagine.
Or, you can use hoardy-web find
to find paths of files matching a specified criteria and then just dump those into JSON
s, or parse the original CBOR
-formatted WRR
files yourself with readily-available libraries.
Since the whole of the Hoardy-Web
project adheres to the philosophy described below, the simultaneous use of the Hoardy-Web
extension and the hoardy-web
tool helps immensely when developing scrapers for uncooperative websites: you just visit them via your web browser as normal, then, possibly years later, use the hoardy-web
tool to organize your archives and conveniently programmatically feed the archived data into your scraper without the need to re-fetch anything.
Given how simple the WRR
file format is, you can modify any HTTP
library to generate WRR
files, thus allowing you to use the hoardy-web
tool with data captured by other software, and use data produced by the Hoardy-Web
extension as inputs to your own tools.
Which is why, personally, I patch some of the commonly available FLOSS website scrapers to dump the data they fetch as WRR
files so that in the future I could write my own better scrapers and indexers and test them on a huge collected database of already collected inputs immediately.
Also, as far as I'm aware, hoardy-web
is a tool that can do more useful stuff to your WRR
archives than any other tool can do to any other file format for HTTP
dumps with the sole exception of WARC
.
So, you wake up remembering something interesting you saw a long time ago. Knowing you won't find it in your normal browsing history, which only contains the URLs and the titles of the pages you visited in the last 3 months, you try looking it up on Google. You fail. Eventually, you remember the website you seen it at, or maybe you re-discovered the link in question in an old message to/from a friend, or maybe a tool like recoll or Promnesia helped you. You open the link… and discover it offline/gone/a parked domain. Not a problem! Have no fear! You go to Wayback Machine and look it up there… and discover they only archived an ancient version of it and the thing you wanted is missing there.
Or, say, you read a cool fanfiction on AO3 years ago, you even wrote down the URL, you go back to it wanting to experience it again… and discover the author made it private... and Wayback Machine saved only the very first chapter.
"If it is on the Internet, it is on Internet forever!" they said. They lied!
Things vanish from the Internet all the time, Wayback Machine of the Internet Archive is awesome, but
- you need to be online to use it,
- it has no full-text search (some of their database can be searched, but not all, this is probably a privacy feature by this point),
- they only archive a subset of the public web, and only what can be reached with
HTTP
GET
requests, - a website you might want archived might be banned by the Wayback Machine,
- or the website might ban Wayback Machine's crawler instead,
- and they appear to remove/hide already archived data when threatened by policy makers and/or lawsuits.
(On that latter point, you should probably go and donate to the Internet Archive to help them fight it.)
Meanwhile, Hoardy-Web
solves all of the above issues for the pages you visited before.
Say, there is a web page that can not be easily reached via curl
/wget
(because it is behind a paywall or complex authentication method that is hard to reproduce outside of a browser) but for accessibility or just simple reading comfort reasons each time you visit that page you want to automatically feed its source to a script that strips and/or modifies its HTML
markup in a website-specific way and feeds it into a TTS engine, a Braille display, or a book reader app.
With most modern web browsers you can do TTS either out-of-the-box or by installing an add-on (though, be aware of privacy issues when using most of these), but tools that can do website-specific accessibility without also being website-specific UI apps are very few.
Meanwhile, Hoardy-Web
with some scripts can do it.
Say, there's a web page/app you use (like a banking app), but it lacks some features you want, and in your browser's Network Monitor you can see it uses JSON RPC
or some such to fetch its data, and you want those JSON
s for yourself (e.g., to compute statistics and supplement the app output with them), but the app in question has no public API and scraping it with a script is non-trivial (e.g., the site does complicated JavaScript
+multifactor-based auth, tries to detect you are actually using a browser, and bans you immediately if not).
Or, maybe, you want to parse those behind-auth pages with a script, save the results to a database, and then do interesting things with them (e.g., track price changes, manually classify, annotate, and merge pages representing the same product by different sellers, do complex queries, like sorting by price/unit or price/weight, limit results by geographical locations extracted from text labels, etc).
Or, say, you want to fetch a bunch of pages belonging to two recommendation lists on AO3 or GoodReads, get all outgoing links for each fetched page, union sets for the pages belonging to the same recommendation list, and then intersect the results of the two lists to get a shorter list of things you might want to read with higher probability.
Or, more generally, say, you want to tag web pages referenced from a certain set of other web pages with some tag in your indexing software, and update it automatically each time you visit any of the source pages.
Or, say, you want to combine a full-text indexing engine, your browsing and derived web link graph data, your states/ratings/notes from org-mode, messages from your friends, and other archives, so that you could do arbitrarily complex queries over it all, like "show me all GoodReads pages for all books not marked as DONE
or CANCELED
in my org-mode
files, ever mentioned by any of my friends, ordered by undirected-graph Pagerank algorithm biased with my own book ratings (so that books sharing GoodReads lists with the books I finished and liked will get higher scores)".
So, basically, you want a private personalized Bayesian recommendation system.
"Everything will have a RESTful API!" they said.
They lied!
A lot of useful stuff never got RESTful APIs, those RESTful APIs that exists are frequently buggy, you'll probably have to scrape data from HTML
s anyway.
"Semantic Web will allow arbitrarily complex queries spanning multiple data sources!" they said. Well, 25 years later ("RDF Model and Syntax Specification" was published in 1999), almost no progress there, the most commonly used subset of RDF does what indexing systems in 1970s did, but less efficiently and with a worse UI.
Meanwhile, Hoardy-Web
provides some of the tools to help you build your own little local data paradise.
The Hoardy-Web
browser extension runs under desktop versions of both Firefox- and Chromium-based browsers as well as under Firefox-for-Android-based browsers.
Hoardy-Web
's main workflow is to passively collect and archive HTTP
requests and responses (and, if you ask, also DOM
snapshots, i.e. the contents of the page after all JavaScript
was run) directly from your browser as you browse the web.
Therefore, the extension allows you to
-
trivially archive web pages hidden behind CAPTCHAs, requiring special cookies, multi-factor logins, paywalls, anti-scraping/
curl
/wget
measures, and etc (after all, the website in question only interacts with your normal web browser, not with a custom web crawler); -
archive most
HTTP
-level data, not just web pages, and not just things available viaHTTP GET
requests (e.g., it can archive answer pages of web search engines fetched viaHTTP POST
,AJAX
data,JSON RPC
calls, etc; though, at the moment, it can not archiveWebSockets
data);
all the while
- being invisible to websites you are browsing;
- downloading everything only once, not once with your browser and then the second time with a separate tool like ArchiveBox (or with an extension like SingleFile, which can re-download some invalidated cached data when you ask it to save a page);
- freeing you from worries of forgetting to archive something because you forgot to press a button somewhere.
The extension can archive collected data
- into browser's local storage (the default),
- into files saved to your local file system (by generating fake-Downloads containing bundles of
WRR
-formatted dumps), - to a self-hosted archiving server (like the trivial archival-only
hoardy-web-sas
or the more advanced archival+replayhoardy-web serve
), - any combination of the above.
Also, unless configured otherwise, the extension will dump and archive collected data immediately, to both prevent data loss and to free the used RAM as soon as possible, keeping your browsing experience snappy even on ancient hardware.
You can then view your archived data with some help from the hoardy-web
tool by
- replaying your archives over
HTTP
, similar to Wayback Machine, heritrix, and pywb; - generating local offline static website mirrors, similar to what
wget
does; - using one of the ready-made scripts; or
- making you own scripts built on top of
hoardy-web
.
Hoardy-Web
allows you to search your archives
- directly from
hoardy-web serve
by using glob-URL links like http://127.0.0.1:3210/web/*/https://archiveofourown.org/works/[0-9]*, a-la Wayback Machine; - via
hoardy-web find
.
(The latter option can also do full-text search via the --*grep*
options, but, at the moment, it's rather slow since there is no built-in full-text indexing.
You can, however, full-text index you data by hoardy-web mirror
ing it first and then feeding the result to an arbitrary desktop search engine, or by using hoardy-web get
as a filter for recoll.)
In other words, Hoardy-Web
is your own personal private Wayback Machine which passively archives everything you see.
However, unlike the original Wayback Machine, it also
- archives
HTTP POST
requests and responses, and most otherHTTP
-level data, - makes other uses other than the conventional browser-only reading-only workflow pretty easy.
In short, compared to most of its alternatives, Hoardy-Web
DOES NOT:
-
force you to use a Chromium-based browser (you can use
Hoardy-Web
with Firefox, Tor Browser, LibreWolf, Fenix aka Firefox for Android, Fennec, Mull, etc, which is not a small thing, since if you tried using any of the close alternatives running under Chromium-based browsers, you might have noticed that the experience there is pretty awful: the browser becomes even slower than usual, large files don't get captured, random stuff fails to be captured at random times because Chromium randomly detaches its debugger from its tabs... none of these problems exist on Firefox-based browsers because Firefox does not fight ad-blocking and hardcore ad-blocking extensions andHoardy-Web
use the same browser APIs); -
require you to capture, collect, and archive recorded data one page/browsing session at a time (the default behaviour is to archive everything completely automatically, though it implements optional limbo mode which delays archival of collected data and provides optional manual/semi-automatic control if you want it);
-
require you to download the data you want to archive twice or more (you'd be surprised how commonly other tools will either ask you to do that explicitly, or just do that silently when you ask them to save something);
-
send any of your data anywhere (unless you explicitly configure it to do so);
-
send any telemetry anywhere;
-
require you to store all the things in browser's local storage where they can vanish at any moment (though, saving to local storage is the default because it simplifies on-boarding, but switching to another archival method takes a couple of clicks and re-archival of old data from browser's local storage to elsewhere is easy);
-
require you to run a database server;
-
require you to run a web browser to view the data you've already archived.
Technically, the Hoardy-Web
project is most similar to
- DownloadNet project, but with collection, archival, and replay all (optionally) independent from each other, with an advanced command-line interface, and not limited to Chromium;
- mitmproxy project, but
Hoardy-Web
leaves SSL/TLS layer alone and hooks into browser's runtime instead, and its tooling is designed primarily for web archival purposes, not traffic inspection and protocol reverse-engineering; - archiveweb.page project, but following "capture and archive everything with as little user input as needed now, figure out what to do with it later" philosophy, and also not limited to Chromium;
- pywb project, but with collection, archival, and replay all (optionally) independent from each other, with a simpler web interface, and more advanced command-line interface.
In fact, an unpublished and now irrelevant ancestor project of Hoardy-Web
was a tool to generate website mirrors from mitmproxy
stream captures.
(If you want that, hoardy-web
tool can do that for you. It can take mitmproxy
dumps as inputs.)
But then I got annoyed by all the sites that don't work under mitmproxy
, did some research into the alternatives, decided there were none I wanted to use, and so I started adding stuff to my tool until it became Hoardy-Web
.
For more info see the list of comparisons to alternatives.
-
The
Hoardy-Web
browser extension that captures allHTTP
requests and responses (andDOM
snapshots) your browser fetches, dumps them intoWRR
format, and then exports them by generating fake-Downloads containing bundles of those dumps, submits them to the specified archiving server (byPOST
ing them to the specified URL), or saves the to browser's local storage.The extension is
-
stable while running under desktop versions of both Firefox- and Chromium-based browsers;
-
beta while running under Fenix-based (Firefox-for-Android-based) browsers.
See the "Quirks and Bugs" section of extension's
Help
page for known issues. Also,Hoardy-Web
is tested much less on Chromium than on Firefox. -
-
The
hoardy-web-sas
simple archiving server that simply dumps everything theHoardy-Web
extension submits to it to disk, one file perHTTP
request+response.This is only useful if you did not yet install the full-featured
hoardy-web
tool. Or if you are feeling paranoid and you want archival and replay to be done by separate processes.The simple archiving server is stable (it's so simple there hardly could be any bugs there).
-
The
hoardy-web
tool, to quote from there:hoardy-web
is a tool to inspect, search, organize, programmatically extract values and generate static website mirrors from, archive, view, and replayHTTP
archives/dumps inWRR
("Web Request+Response", produced by theHoardy-Web
Web Extension browser add-on, also there) andmitmproxy
(mitmdump
) file formats.That is, yes, it can also play the role of an advanced archiving server with replay support for the
Hoardy-Web
browser extension instead of the basic archiving server thehoardy-web-sas
script is.hoardy-web
tool is deep in its beta stage. At the moment, it does about 85% of the stuff I want it to do, and the things it does it does not do as well as I'd like. See the TODO list for more info.
-
A patch for Firefox to allow
Hoardy-Web
extension to collect requestPOST
data as-is. This is not required and even without that patchHoardy-Web
will collect everything in most cases, but it could be useful if you want to correctly capturePOST
requests that upload files.See the "Quirks and Bugs" section of extension's
Help
page for more info.
Hoardy-Web
is designed to
- be simple (as in adhering to the Keep It Stupid Simple principle),
- be efficient (as in running well on ancient hardware),
- capture data from the browser as raw as possible (i.e., not try to fix any web browser quirks before archival, just capture everything as-is),
- ensure that all captured and collected data gets actually archived to disk,
- treat the resulting archives as read-only files,
- view, convert to other formats, extract useful values, and perform any expensive computations lazily and on-demand,
- make it easy to use tools other than a web browser to do interesting things with your archived data.
To conform to the above design principles
-
the
Hoardy-Web
Web Extension browser add-on does almost no actual work, simply generatingHTTP
request+response dumps, archiving them, and then freeing the memory as soon as possible (unless you enable limbo mode, but then you asked for it), thus keeping your browsing experience snappy even on ancient hardware; -
also the
Hoardy-Web
extension collects data as browser gives it, without any data normalization and conversion, when possible; -
the dumps are generated using the simplest, trivially parsable with many third-party libraries, yet most space-efficient on-disk file format representing separate
HTTP
requests+responses there currently is (akaWeb Request+Response
,WRR
), which is a file format that is both more general and more simple thanWARC
, much simpler than thatmitmproxy
uses, and much more efficient thanHAR
; -
the
Hoardy-Web
extension can write the dumps it produces to disk by itself by generating fake-Dowloads containing bundles ofWRR
dumps, but because of limitations of browser APIs,Hoardy-Web
can't tell if a file generated this way succeeds at being written to disk; -
which is why, for users who want write guarantees and error reporting, the extension has other archival methods, which includes archival by submission via
HTTP
;server-side part of submission via
HTTP
can be done either-
via the
hoardy-web-sas
simple archiving server, which is tiny (less than 250 lines of code) pure-Python script that provides anHTTP
interface for archival of dumps given viaHTTP POST
requests; -
or via the
hoardy-web serve
, which is not tiny at all, but it can combine both archival and replay;
-
-
all of the
Hoardy-Web
extension,hoardy-web-sas
, andhoardy-web serve
write those dumps to disk as-is, with optional compression for data storage efficiency; -
meanwhile, viewing/replay of, generation of website mirrors from, organization and management, data normalization (massaging), post-processing, other ways of extraction of useful values from archived
WRR
files --- i.e. basically everything that is complex and/or computationally expensive --- is delegated tohoardy-web
tool; -
the
hoardy-web
tool is very easy to use in your own scripts; -
by default, none these tools ever overwrite any files on disk (to prevent accidental data loss);
this way, if something breaks, you can always trivially return to a known-good state by simply copying some old files from a backup as there's no need to track versions or anything like that.
-
See extension's
Help
page (or theHelp
button in the extension's UI, which will make it interactive) for a long detailed description of what the extension does step-by-step.It is a must-read, though instead of reading that file raw I highly recommend you read it via the
Help
button of the extension's UI, since doing that will make the whole thing pretty interactive.-
See the "Frequently Asked Questions" section of extension's
Help
page for the answers to the frequently asked questions, including those about common quirks you can encounter while using it. -
See the "Quirks and Bugs" section of extension's
Help
page for more info on quirks and limitations ofHoardy-Web
when used on different browsers.
-
-
See below for a long list of comparisons to its alternatives.
-
Then notice that
Hoardy-Web
is the best among them, and go follow "Quickstart" section for setup instructions. ( •̀ ω •́ )✧ -
To follow the development:
-
If you want to learn to use the
hoardy-web
tool, see its README, which has a bunch of extended and explained usage examples.-
Also, a lot of info from that page can be seen by running
hoardy-web --help
. -
See example scripts to learn how to do various interesting things with your archived data.
-
-
In the unlikely case you have problems with the
hoardy-web-sas
simple archiving server, see its README. Or you can readhoardy-web-sas --help
instead. -
If you want to build the extension from source, see its README.
-
If you are a developer, see all the
hoardy-web
-related links above, and also see the description of the on-disk file format used by all these tools. -
If your questions are not unanswered by these, then open an issue on GitHub or get in touch otherwise.
Yes, as of December 2024, I archive all of my web traffic using Hoardy-Web
, without any interruptions, since October 2023.
Before that my preferred tool was mitmproxy.
After adding each new feature to the hoardy-web
tool, as a rule, I feed at least the last 5 years of my web browsing into it (at the moment, most of it converted from other formats to .wrr
, obviously) to see if everything works as expected.
-
On Firefox, Tor Browser, LibreWolf, Fenix aka Firefox for Android, Fennec, Mull, etc: Install the extension from addons.mozilla.org.
-
On Chromium, Google Chrome, Ungoogled Chromium, Brave, etc: see Installing on Chromium-based browser.
Unfortunately, this requires a bit more work than clicking
Install
button on Chrome Web Store, yes. "Why isn'tHoardy-Web
on Chrome Web Store?" I'm not a lawyer, but to me it looks likeHoardy-Web
violates Chrome Web Store's "Terms of Use". Specifically, the "enables the unauthorized download of streaming content or media" clause.In my personal opinion, any content you web browser fetches while you are browsing the web normally you are "authorized" to download. This is especially true for
Hoardy-Web
since, unlike most of its alternatives, it does not generate any requests itself, it only captures the data that a web page generates while you browse it. But given the context of that clause in that document, I feel like Google would disagree with my the interpretation above. Even though, technically speaking, separation between "streamed" and "downloaded" content or media is a complete delusion.Meanwhile,
Hoardy-Web
tries its best to collect all web traffic you browser generates, which, obviously, includes streaming content.On Chromium (but not on Firefox), technically, at the moment,
Hoardy-Web
does fail to collect streaming media properly because Chromium has a bug that prevents collection of the first few KiB of all audio and video files, and its API design prevents collection of large files in general, but if we are talking about YouTube, then most of the data of those streaming media files Chromium will fetch while you watch a video there will, in fact, get collected even on Chromium.So, in principle, to conform to those terms,
Hoardy-Web
would need to either disallow archival of all embedded media (which would defeat the point of it existing), or come bundled with a blacklist of URLs Google would claim you would be "unauthorized" to save to disk, and forcefully disallow archival for those. Even if such a blacklist could be made somehow (How do you expect somebody make such a thing, Google-san?), it would be very annoying and time-consuming to maintain.(Meanwhile, on Firefox,
Hoardy-Web
will just silently collect everything you browser fetches. And addons.mozilla.org's policies do not restrict this.)(Also, Chrome Web Store actually requires developers to pay Google to host their add-ons there while Mozilla's service is free. Meaning, you should go and donate to any free add-ons that do not violate your privacy and sell your data you installed from there. Their authors paid Google so that you could conveniently install their add-ons with a single click. Who else will pay the authors? Also, consider donating to Mozilla, things would be much, much worse without them.)
-
Alternatively, build it from source.
Now load any web page in your browser. The extension will report if everything works okay, or tell you where the problem is if something is broken.
Assuming the extension reported success: Congratulations! You are now collecting and archiving all your web browsing traffic originating from that browser. Repeat extension installation for all browsers/browser profiles as needed.
Technically speaking, if you just want to collect everything and don't have time to figure out how to use the rest of this suite of tools right this moment, you can stop here and figure out how to use the rest of this suite later.
It took me about 6 months before I had to refer back to previously archived data for the first time when I started using mitmproxy to sporadically collect my HTTP
traffic in 2017.
So, I recommend you start collecting immediately and be lazy about the rest.
(Also, I learned a lot about nefarious things some of the websites I visit do in background by inspecting the logs Hoardy-Web
produces.
You'd be surprised how many big websites generate HTTP
requests with evil tracking data at the moment you close the containing tab.
They do this because such requests can't be captured and inspected with browser's own Network Monitor, so most people are completely unaware.)
In practice, though, your will probably want to install at least the hoardy-web-sas
simple archiving server (see below for instructions) and switch Hoardy-Web
to Submit dumps via 'HTTP'
mode pretty soon because it is very easy to accidentally loose data using other archival methods and, assuming you have Python installed on your computer, it is also the most convenient archival method there is.
Or, alternatively, you can use the combination of archiving by saving of data to browser's local storage (the default) followed by semi-manual export into WRR
bundles.
Or, alternatively, you can switch to Export dumps via 'saveAs'
mode by default and simply accept the resulting slightly more annoying UI (on Firefox, it can be fixed with a small about:config
change) and the facts that you can now lose some data if your disk ever gets out of space or if you accidentally mis-click a button in your browser's Downloads
UI.
Or you can just install the hoardy-web
tool and run hoardy-web serve
, which can also do replay and search your archives.
Next, you should read extension's Help
page.
It has lots of useful details about how it works and quirks of different browsers.
If you open it by clicking the Help
button in the extension's UI, then hovering over or clicking on links in there will highlight relevant settings.
See "Setup recommendations" section for best practices for configuring your system and browsers to be used with Hoardy-Web
.
-
Install
Python 3
:- On Windows: Download Python from the official website.
- On a conventional POSIX system like most GNU/Linux distros and MacOS X: Install
python3
via your package manager. Realistically, it probably is installed already.
-
Install the Pythonic parts:
-
Test the results actually work:
-
On Windows with unconfigured
PATH
environment variable:python3 -m hoardy_web_sas --help python3 -m hoardy_web --help
-
On configured Windows and POSIX systems:
hoardy-web-sas --help hoardy-web --help
-
Alternatively, on a system with Nix package manager
-
Install everything by running
nix-env -i -f ./default.nix
-
Test the results work:
hoardy-web-sas --help hoardy-web --help
-
Also, instead of installing the add-on from
addons.mozilla.org
or from Releases on GitHub you can take freshly built XPI and Chromium ZIPs fromls ~/.nix-profile/Hoardy-Web*
instead. See the extension's README for more info on how to install them manually.
-
You can add
hoardy_web_sas.py
to Autorun or start it from your~/.xsession
,systemd --user
, etc. -
You can also make a new browser profile specifically for archived browsing, run Firefox as
firefox -ProfileManager
to get to the appropriate UI. On Windows you can just edit your desktop or toolbar shortcut to target"C:\Program Files\Mozilla Firefox\firefox.exe" -ProfileManager
or similar by default to switch between profiles on browser startup.
-
In case you plan to ever share some of your captures, it is highly recommended you make separate browser profiles for anonymous and logged-in browsing with separate extension instances either
-
pointing to separate archiving server instances dumping data to different directories on disk;
-
or, alternatively, using the same archiving server, but using different default
Bucket
values in theHoardy-Web
extension.
Set the "anonymous" browser profile to always run in
Private Browsing
mode to prevent login persistence there. If you do accidentally login in "anonymous" profile, move those dumps out of the "anonymous" directory immediately.This way you can easily share dumps from the "anonymous" instance without worrying about leaking your private data or login credentials.
-
When using Hoardy-Web
with Tor Browser, you probably want to configure it all in such a way so that all of the machinery of Hoardy-Web
is completely invisible to web pages running under your Tor Browser, to prevent fingerprinting.
So, in the mostly convenient yet sufficiently paranoid setup, you would only ever use Hoardy-Web
extension configured to archive captured data to browser's local storage (which is the default) and then export your dumps manually at the end of a browsing session, see re-archival intructions.
Yes, this is slightly annoying, but this is the only absolutely safe way to export data out of Hoardy-Web
without using submission via HTTP
, and you don't need to do this at the end of each and every browsing session.
You can also simply switch to using Export dumps via 'saveAs'
by default instead.
I expect this to work fine for 99.99% of the users 99.99% of the time, but, technically speaking, this is unsafe.
Also, by default, browser's UI will be slightly annoying, since Hoardy-Web
will be generating new "Downloads" all the time, but that issue can be fixed with a small about:config
change.
In theory, running ./hoardy_web_sas.py
listening on a loopback IP address should prevent any web pages from accessing it, since the browsers disallow such cross-origin requests, thus making the normal Submit dumps via 'HTTP'
mode setup quite viable.
However, Tor Browser is configured to proxy everything via the TOR network by default, so you need to configure it to exclude the requests to ./hoardy_web_sas.py
from being proxied.
A slightly more paranoid than normal way to do this is:
- Run the server as
./hoardy_web_sas.py --host 127.0.99.1
or similar. - Go to
about:config
in your Tor Browser and add127.0.99.1
tonetwork.proxy.no_proxies_on
. - Set the
Server URL
setting in the extension tohttp://127.0.99.1:3210/pwebarc/dump
.
Why?
When using Tor Browser, you probably don't want to use 127.0.0.1
and 127.0.1.1
as those are normal loopback IP addresses used by most things, and you probably don't want to allow any JavaScript
code running in Tor Browser to (potentially, if there are any bugs) access to those.
Yes, if there are any bugs in the cross-domain check code, with this setup JavaScript
could discover you are using Hoardy-Web
(and then, in the worst case, DOS your system by flooding your disk with garbage dumps), but it won't be able to touch the rest of your stuff listening on your other loopback addresses.
So, while this setup is not super-secure if your Tor Browser allows web pages to run arbitrary JavaScript
(in which case, let's be honest, no setup is secure), with JavaScript
always disabled, to me, it looks like a completely reasonable thing to do.
In theory, you can have the benefits of both invisibility of archival to local storage and convenience, guarantees, and error reporting of archival to an archiving server at the same time:
- Run the server as
./hoardy_web_sas.py --host 127.0.99.1
or similar. - But archive to browser's local storage while browsing.
- Then, at the end of the session, after you closed all the tabs, set
network.proxy.no_proxies_on
, enable submission viaHTTP
while disabling saving to local storage, re-archive, your local storage should now be empty, unsetnetwork.proxy.no_proxies_on
again.
In practice, doing this manually all the time is prone to errors. Automating this away is on the TODO list.
Then, you can improve on this setup even more by running both the Tor Browser and ./hoardy_web_sas.py
in separate containers/VMs.
Sorted by similarity to Hoardy-Web
, most similar projects first.
"Cons" and "Pros" are in comparison to the main workflow of Hoardy-Web
.
A self-hosted web crawler and web replay system written in Node.js
.
Of all the tools known to me, DownloadNet
is most similar to the intended workflow of the Hoardy-Web
.
Similarly to the combination of Hoardy-Web
extension and hoardy-web serve
and unlike pywb
, heritrix
, and other similar tools discussed below, DownloadNet
captures web data directly from browser's runtime.
The difference is that Hoardy-Web
does this using webRequest
WebExtensions
API and Chromium's debugger
API while DownloadNet
is actually a web crawler that crawls the web by spawning a Chromium browser instance and attaching to it via its debug protocol (which are not the same thing).
This is a bit weird, but it does work, and it allows you to use DownloadNet
to archive everything passively as you browse, similarly to Hoardy-Web
, since you can just browse in that debugged Chromium window and it will archive the data it fetches.
Pros:
- it's very similar to what
Hoardy-Web
aims to do, except
Cons:
- it's Chromium-only;
- it uses a custom archive format but gives no tools to inspect or manage those archives;
- you are expected to do everything from the web UI.
Same issues:
-
A bunch of Chromium's bugs make many things pretty annoying and somewhat flaky.
Those issues have no workarounds known to me except for "switch to Firefox-based browser", which you can do with
Hoardy-Web
.
A Man-in-the-middle SSL proxy.
Hoardy-Web
was heavily inspired by mitmproxy
and, essentially, aims to be to an in-browser alternative to it.
I.e., unlike other alternatives discussed here, both Hoardy-Web
and mitmproxy
capture mostly-raw HTTP
traffic, not just web pages.
Unlike mitmproxy
, however, Hoardy-Web
is designed primarily for web archival purposes, not traffic inspection and protocol reverse-engineering, even though you can do some of that with Hoardy-Web
too.
Pros:
- after you set it up, it will capture absolutely everything completely automatically;
- including WebSockets data, which
Hoardy-Web
add-on currently does not capture.
Cons:
- it is rather painful to setup, requiring you to install a custom SSL root certificate; and
- websites using certificate pinning will stop working; and
- some websites detect when you use it and fingerprint you for it or force you to solve CAPTCHAs; and
mitmproxy
dump files are flat streams ofHTTP
requests and responses that use custom frequently changing between versions data format, so you'll have to re-parse them repeatedly usingmitmproxy
's own parsers to get to the requests you want;- it provides no tools to use those dumped
HTTP
request+response streams for Wayback Machine-like replay and generation of website mirrors.
Though, the latter issue can be solved via this project's hoardy-web
tool as it can take mitmproxy
dumps as inputs.
But you could just enable request logging in your browser's Network Monitor and manually save your data as HAR
archives from time to time.
Cons:
- to do what
Hoardy-Web
does, you will have to manually enable it for each browser tab; - opening a link in a new tab will fail to archive the first page as you will not have Network Monitor open there yet; and then
- you will have to check all your tabs for new data all the time and do ~5 clicks per tab to save it; and then
HAR
s areJSON
, meaning all that binary data gets encoded indirectly, thus making resultingHAR
archives very inefficient for long-term storage, as they take a lot of disk space, even when compressed.
And then you still need something like this suite to look into the generated archives.
Pros:
- after you set it up, it will capture absolutely everything completely automatically;
- it captures WebSockets data, which
Hoardy-Web
add-on currently does not.
Cons:
- it is really painful to setup; and then
- you are very likely to screw it up, loose/mismatch encryption keys, and make your captured data unusable; and even if you don't,
- it takes a lot of effort to recover
HTTP
data from thePCAP
dumps; and PCAP
dumps are IP packet-level, thus also inefficient for this use case; andPCAP
dumps of SSL traffic can not be compressed much, thus storing the raw captures will take a lot of disk space.
And then you still need something like this suite to look into the generated archives.
Browser extensions similar to the Hoardy-Web
extension in their implementation, though not in their philosophy and intended use.
Overall, Hoardy-Web
and archiveweb.page
extensions have a similar vibe, but the main difference is that archiveweb.page
and related tools are designed for capturing web pages with the explicit aim to share the resulting archives with the public, while Hoardy-Web
is designed for private capture of personally visited pages first.
In practical terms, archiveweb.page
has a "Record" button, which you need to press to start recording a browsing session in a separate tab into a separate WARC
file.
In contrast, Hoardy-Web
, by default, in background, captures and archives all successful HTTP
requests and their responses from all your open browser tabs.
Pros:
- they produce and consume archives in
WARC
format, which is a de-facto standard; - their replay is more mature than what
Hoardy-Web
currently has.
Cons:
- they produce and consume archives in
WARC
format, which is rather limited in what it can capture (compared toWRR
,HAR
,PCAP
, andmitmproxy
); - they are Chromium-only;
- to make
archiveweb.page
archive all of your web browsing likeHoardy-Web
does:- you will have to manually enable
archiveweb.page
for each browser tab; and then - opening a link in a new tab will fail to archive the first page, as the archival is per-tab;
- you will have to manually enable
archiveweb.page
also requires constant manual effort to export the data out.
Differences in design:
archiveweb.page
captures whole browsing sessions, whileHoardy-Web
captures separateHTTP
requests and responses;archiveweb.page
implements "Autopilot", whichHoardy-Web
will never get (if you want that,Hoardy-Web
expects you to use UserScripts instead).
Same issues:
-
Both
Hoardy-Web
andarchiveweb.page
store captured data internally in the browser's local storage/IndexedDB by default.This is both convenient for on-boarding new users and helps in preserving the captured data when your computer looses power unexpectedly, your browser crashes, you quit from it before everything gets archived, or the extension crashes or gets reloaded unexpectedly.
On the other hand, this is both inefficient and dangerous for long-term preservation of said data, since it is very easy to accidentally loose data archived to browser's local storage (e.g., by uninstalling the extension).
Which is why
Hoardy-Web
hasSubmit dumps via 'HTTP'
mode which will automatically submit your dumps to an archiving server instead. -
When running under Chromium, a bunch of Chromium's bugs make many things pretty annoying and flaky, which --- if you know what to look for --- you can notice straight in the advertisement animation on their "Usage" page.
Those issues have no workarounds known to me except for "switch to Firefox-based browser", which you can do with
Hoardy-Web
.
Browser add-ons that capture whole web pages by taking their DOM
snapshots and saving all requisite resources the captured page references.
Capturing a page with SingleFile
generates a single (usually, quite large) HTML
file with all the resources embedded into it.
WebScrapBook
saves its captures to browser's local storage or to a remote server instead.
Pros:
- very simple to use;
- they implement annotations, which
Hoardy-Web
currently does not.
Cons:
- to make them archive all of your web browsing like
Hoardy-Web
does, you will have to manually capture each page you want to save; - they only captures web pages, you won't be able to save
POST
request data or JSONs fetched by web apps; - since they do not track and save
HTTP
requests and responses, capturing a page will make the browser re-download non-cached page resources a second time; - the resulting archives take a lot of disk space, since they duplicate requisite resources (images, media,
CSS
, fonts, etc) for each web page to make each saved page self-contained.
Differences in design:
- they capture
DOM
snapshots, whileHoardy-Web
capturesHTTP
requests and responses (though, it can captureDOM
snapshots too).
A browser extension that implements an alternative mechanism to browser bookmarks.
Saving a web page into Memex saves a DOM
snapshot of the tab in question into an in-browser database.
Memex then implements full-text search engine for saved snapshots and PDF
s.
Pros:
- pretty, both in UI and in documentation;
- it implements annotations, which
Hoardy-Web
currently does not; - it has a builtin full-text search engine with indexing;
meanwhile, at the moment,
Hoardy-Web
only has the non-indexedhoardy-web * --*grep*
options; though, you can use recoll withhoardy-web
as an input filter; - lots of other features.
Cons:
- to make it archive all of your web browsing like
Hoardy-Web
does, you will have to manually save each page you visit; - it only captures web pages and
PDFs
, you won't be able to savePOST
request data or JSONs fetched by web apps; - compared to
Hoardy-Web
, it is very fat --- it's.xpi
is more than 40 times larger; - it takes about 7 times more RAM to do comparable things (measured via
about:performance
); - it is slow enough to be hard to use on an older or a very busy system;
- it injects content scripts to every page you visit, making your whole browsing experience much less snappy;
- it performs a lot of
HTTP
requests to third-party services in background (Hoardy-Web
does none of that); - you are expected to do everything from the web UI;
- the resulting archives take a lot of disk space.
Differences in design:
- it captures
DOM
snapshots andPDF
s, whileHoardy-Web
capturesHTTP
requests and responses (though, it can captureDOM
snapshots too); - it has a builtin synchronization between instances, while
Hoardy-Web
expects you to use normal file backup tools for that.
A web archive replay system with a builtin web crawler and HTTP
proxy.
Brought to you by the people behind the Wayback Machine and then adopted by the people behind archiveweb.page
.
A tool similar to hoardy-web serve
.
Pros:
- it produces and consumes archives in
WARC
format, which is a de-facto standard; - its replay capabilities are more mature than what
hoardy-web serve
currently has; - it can update its configuration without a restart and re-index of given inputs.
Cons:
-
it produces and consumes archives in
WARC
format, which is rather limited in what it can capture (compared toWRR
,HAR
,PCAP
, andmitmproxy
); -
it has no equivalents to most other sub-commands of
hoardy-web
tool; -
compared to
hoardy-web serve
, it's much more complex, it has a builtin web crawler (aka "pywb
Recorder", which does not work for uncooperative websites anyway), and can also do capture by trying to be anHTTP
proxy (which also does not work for many websites);I assume it has all these features because
archiveweb.page
is Chromium-only, which forces it to be rather unreliable (see a list of relevant Chromium's bugs) and annoying to use; also, it has no equivalent ofproblematic
reqres status ofHoardy-Web
, making its dumps slightly flaky;meanwhile, in
Hoardy-Web
, the extension work perfectly well under Firefox-based browsers, and captures are pretty reliable there;also, under Chromium-based browsers,
problematic
reqres tracking ofHoardy-Web
helps immensely; -
since
hoardy-web serve
uses a much simpler and faster to parseWRR
file format, it is able to add new dumps to its index synchronously with their archival, allowing for their immediate replay.
The crawler behind the Wayback Machine. It's a self-hosted web app into which you can feed the URLs for them to be archived, so to make it archive all of your web browsing:
A tool similar to hoardy-web serve
.
Pros:
- it produces and consumes archives in
WARC
format, which is a de-facto standard; - stable, well-tested, and well-supported.
Cons:
- it produces and consumes archives in
WARC
format, which is rather limited in what it can capture (compared toWRR
,HAR
,PCAP
, andmitmproxy
); - it has no equivalents to most other sub-commands of
hoardy-web
tool; - you have to run it, and it's a rather heavy Java app;
- to make it archive all of your web browsing like
Hoardy-Web
does, you'll need to write a separate browser plugin to redirect all links you click to your local instance's/save/
REST
API URLs (which is not hard, but I'm unaware if any such add-on exists); - and you won't be able to archive your
HTTP
POST
requests with it; - as with other similar tools, an
HTTP
server of a web page that is being archived can tell it is being crawled.
A web crawler and self-hosted web app into which you can feed the URLs for them to be archived.
Pros:
- it produces and consumes archives in
WARC
format, which is a de-facto standard; - it has a very nice web UI;
- it it's an all-in-one archiving solution, also archiving YouTube videos with yt-dlp,
git
repos, etc; - stable, well-tested, and well-supported.
Cons:
- it produces and consumes archives in
WARC
format, which is rather limited in what it can capture (compared toWRR
,HAR
,PCAP
, andmitmproxy
); - to make it archive all of your web browsing like
Hoardy-Web
does,- it requires you to setup
mitmproxy
with archivebox-proxy plugin; - alternatively, you can run archivefox add-on and explicitly archive pages one-by-one via a button there;
- it requires you to setup
- in both cases, to archive a URL, ArchiveBox will have to download it by itself in parallel with your browser, thus making you download everything twice, which is hacky and inefficient; and
- websites can easily see, fingerprint, and then ban you for doing that;
- and you won't be able to archive your
HTTP POST
requests with it.
Still, probably the best of the self-hosted web-app-server kind of tools for this ATM.
A system similar to ArchiveBox
, but has a bulit-in tagging system and archives pages as raw HTML
+ whole-page PNG
rendering/screenshot --- which is a bit weird, but it has the advantage of not needing any replay machinery at all for re-viewing simple web pages, you only need a plain simple image viewer, though it will take a lot of disk space to store those huge whole-page "screenshot" images.
Pros and Cons are almost identical to those of ArchiveBox
above, except it has less third-party tools around it so less stuff can be automated easily.
Pros:
- both are probably already installed on your POSIX-compliant OS,
wget
can produce archives inWARC
format, which is a de-facto standard.
Cons:
- to do what
Hoardy-Web
does, you will have to manually capture each page you want to save; - many websites will refuse to be archived with
wget
and makingwget
play pretend at being a normal web browser is basically impossible; - similarly with
curl
,curl
also doesn't have the equivalent towget
's-mpk
options; - can't archive dynamic websites;
- changing archival options will force you to re-download a lot.
wget -mpk
done right.
Pros:
- it can pause and resume fetching;
- it can archive many dynamic websites via PhantomJS;
- it produces archives in
WARC
format, which is a de-facto standard and has a lot of tooling around it; - stable, well-tested, and well-supported.
Cons:
- to do what
Hoardy-Web
does, you will have to manually capture each page you want to save; - you won't be able to archive your
HTTP POST
requests with it; - does not have replay capabilities, just generates
WARC
files.
A simple web crawler built on top of wpull
, presented to you by the ArchiveTeam, a group associated with the Wayback Machine which appears to be the source of archives for the most of the interesting pages I find there.
Pros:
- it produces archives in
WARC
format, which is a de-facto standard and has a lot of tooling around it; - stable, well-tested, and well-supported.
Cons:
- to do what
Hoardy-Web
does, you will have to manually capture each page you want to save; - it can't really archive dynamic websites;
- you won't be able to archive your
HTTP POST
requests with it; - it does not have replay capabilities, just generates
WARC
files.
Stand-alone tools doing the same thing SingleFile add-on does: generate single-file HTML
s with bundled resources viewable directly in the browser.
Pros:
- simple to use.
Cons:
- to make them archive all of your web browsing like
Hoardy-Web
does, you will have to manually capture each page you want to save; - they can't really archive dynamic websites;
- you won't be able to archive your
HTTP POST
requests using them; - changing archival options will force you to re-download everything again.
Stand-alone tool based on SingleFile
, using a headless browser to capture pages.
A more robust solution to do what monolith
and obelisk
do, if you don't mind Node.js
and the need to run a headless browser.
A self-hosted wiki that archives pages you link to in background.
ArchiveBox wiki has a long list or related things.
It's an awesome personal private archival system adhering to the same philosophy as Hoardy-Web
, but it's basically an abstraction replacing your file system with a content-addressed store that can be rendered into different "views", including a POSIXy file system.
It can do very little in helping you actually archive a web page, but you can start dumping new Hoardy-Web
.wrr
files with compression disabled, decompress you existing .wrr
files, and then feed them all into Perkeep to be stored and automatically replicated to your backup copies forever.
(Perkeep already has a better compression than what Hoardy-Web
currently does and provides a FUSE FS interface for transparent operation, so compressing things twice would be rather counterproductive.)
See CHANGELOG.md
.
See the bottom of CHANGELOG.md
.
GPLv3+, some small library parts are MIT.
Contributions are accepted both via GitHub issues and PRs, and via pure email.
In the latter case I expect to see patches formatted with git-format-patch
.
If you want to perform a major change and you want it to be accepted upstream here, you should probably write me an email or open an issue on GitHub first. In the cover letter, describe what you want to change and why. I might also have a bunch of code doing most of what you want in my stash of unpublished patches already.