It is highly recommended you view this page by clicking the Help
button in the extension’s own UI.
Doing that will make this page interactive: the settings popup will be displayed on the right of this page, and hovering over or clicking any links pointing to popup.html
will highlight those elements in the popup.
See screenshots if you want to see how it will look.
You can still read this page outside of the extension’s UI, but be prepared for all links pointing to popup.html
to be useless.
Also, the version hosted on the author’s web site is superior to what GitHub’s web UI renders (this page is written in the org-mode markup language; converting it to GitHub Markdown would make things much harder, since it uses a lot of advanced org-mode markup features to simplify things, and GitHub does not render org-mode files very well at the moment).
Hoardy-Web
is a browser extension (add-on) that passively captures and collects dumps of HTTP
requests and responses as you browse the web, and then archives them using one or more of the following methods:
- by generating fake-Downloads containing either
  - separate dumps (one dump of an HTTP request+response per file, also there), or
  - bundles of them (many dumps in a single file),
- by archiving separate dumps to your own private archiving server (like the =hoardy-web-sas= simple archiving server, also there, or the more advanced =hoardy-web serve=, also there), or
- by archiving separate dumps to your browser’s local storage.
To view your archived data, see the accompanying =hoardy-web= tool (also there).
When you open this document by clicking the Help
button in the extension’s UI, this page has two parts: this help text, and an iframe
with a completely unrolled popup UI in it.
The whole page will switch between single- and two-column layouts depending on available viewport width (which depends on device width and zoom level). In single-column layout the popup UI is placed after the end of the help text. In two-column layout they are placed side-by-side.
In both layouts:
- links that look like this are references to other parts of this page; clicking these links will scroll this page to the place they point to and then highlight the relevant part;
- links that look like this are references to elements of the popup UI:
  - in single-column layout, clicking such a link will scroll the whole page to the corresponding element in the popup UI iframe and then highlight it;
  - in two-column layout, clicking or hovering over such a link will scroll only the popup UI iframe around to put the corresponding referenced element into view and then highlight it;
  - the highlighted element will stop being highlighted if you click anywhere else on the page;
- links that look like this are references to other internal pages of Hoardy-Web; clicking them will navigate this tab there;
- finally, links like this are references to external URLs.
In cases when clicking on a link scrolls this page around or navigates to another page, pressing the “Back” button of your browser will get you back to the exact link you clicked and then highlight it, making it easy to get back to reading from the exact place you left off.
**Go forth and try it by clicking one or more of the above links.**
The above rules also apply on all other internal pages of Hoardy-Web, e.g. the =Changelog= page.
- A reqres (REQuest + RESponse) is an internal object containing captured information about an HTTP request and its response, including their headers and data, and some meta-information (whether it originates from an extension, the tabId it originates from, its state, etc).
Reqres change their internal states according to the following state diagram, shown here in simplified form and explained in detail below:
#+begin_example
(start) -> (request sent) -> (nIO) -> (headers received) -> (nIO) -> (body received)

final networking states:
  (canceled), (no_response), (incomplete), (incomplete_fc),
  (complete), (complete_fc), (snapshot)
each of which leads to (finished)

(finished)  -> (picked) or (dropped)
(picked)    -> (collected),  or -> (stashIO?) -> (in_limbo)
(dropped)   -> (discarded),  or -> (stashIO?) -> (in_limbo)
(in_limbo)  -> (collected) or (discarded)
(collected) -> (queued)
(queued)    -> (exported), and/or (srvIO) -> (submitted), and/or (saveIO) -> (saved)
(queued)    -> on failure: (stashIO?) -> (unarchived) -> back to (queued)
#+end_example
Hoardy-Web
attaches to your browser’s runtime and tracks progress of HTTP
requests and their responses, capturing both their request and response headers and data at appropriate times in the browser’s request and response processing pipeline.
Whether Hoardy-Web will track a given request depends on the Track new requests toggles in the settings popup, e.g.:
- this toggle allows you to disable tracking of newly spawned HTTP requests globally, thus essentially disabling Hoardy-Web,
- this one controls whether Hoardy-Web will track new requests originating from the currently active tab,
- this one controls whether it will track new requests originating from new tabs opened from the currently active tab (aka “children tabs”, e.g. via middle mouse click, context menu, etc),
- while this one controls whether it will track new requests originating from new tabs opened via the browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
- and so forth for the others (press the ? symbols to see a tooltip explaining what each of them does).
Disabling any of these toggles does not stop tracking of already initiated requests, it only stops new requests controlled by that toggle from being tracked.
As shown on the above diagram, a new reqres, i.e. a new HTTP
request and response pair, proceeds through the following networking states:
- start: the starting state;
- request sent, (response) headers received, (response) body received: these are the normal stages of HTTP request and response tracking via the =webRequest= sub-API of the WebExtensions API;
- nIO: normal network IO performed by the browser in between HTTP request stages;
- canceled: the request was canceled before it was sent because
  - you canceled it manually using the browser’s Stop button;
  - an ad-blocking extension like uBlock Origin blocked it;
  - the browser canceled it by itself, e.g. when redirecting an http:// URL to an https:// URL in HTTPS-only mode;
  - etc;
  unsent would have probably been a better name for this, but all browsers call it canceled internally, so Hoardy-Web follows that convention;
- no_response: the request was sent, but no response was received because
  - you pressed the Stop button before it got a response;
  - a connection to the target server was rejected;
  - the server decided to ignore the request completely;
  - a network timeout was reached;
  - etc;
- incomplete: the request was sent, its response headers were received, but then the loading was interrupted before all of the response body was received;
- incomplete_fc: only on Firefox-based browsers: the browser loaded the response data of this reqres directly from its cache, but did not give it to Hoardy-Web; this is just how Firefox handles things sometimes; usually, this only happens for images; this is a separate state, because usually this means this URL was successfully archived before; if it was not, reload the page with Control+F5;
- complete: the reqres was completed successfully;
- complete_fc: the reqres was completed successfully from the browser’s cache;
- snapshot: this reqres was produced by taking a DOM (Document Object Model) snapshot (using one of the appropriate buttons in the popup), i.e. it was produced by capturing the raw HTML or XML of the current state of the tab/frame, not by capturing a network request;
- finished: the terminal state of this step; no new events for this reqres will come from the browser.
In principle, upon reaching the finished state, the primary objective of Hoardy-Web with respect to that reqres is complete, so it could be written to disk and forgotten about.
Unfortunately for Hoardy-Web, browsers do not allow web apps and extensions to simply write files to the user’s file system, and the existing browser APIs that do allow for some form of persistence to disk all have different limitations.
Also, it is quite useful to have more states after finished
to improve the UI and allow for various conditional workflows.
Which is why Hoardy-Web
has more states after finished
and more steps after this one.
- An /in-flight reqres/ (current tab) is a reqres that did not reach the finished state yet; in the history-log such reqres will be shown to be in the in_flight state.
  These two stats are represented as sums of two numbers:
  - the number of reqres that are still being tracked via the webRequest or debugger API; and
  - the number of reqres that have finished being tracked and are now waiting for all their events to finish processing.
  On Firefox, nothing should ever get stuck; if something seems to be stuck in the in_flight state, it’s probably still loading (or it is a bug in the browser, which does happen, very rarely).
  On Chromium, limitations of Chromium’s debugging interface mean a request can get stuck among the reqres represented by the first number above. If the first number is zero, however, then the second should also rapidly become zero, at most after two times this many seconds.
  If some reqres got stuck in one of the in_flight states, you can forcefully move them out of that state using this and/or that popup buttons.
- A finished reqres is a reqres that reached the finished state.
- Final networking state is the last state a reqres had before it finished: i.e. complete, incomplete, canceled, etc.
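To make the above concrete, here is a minimal illustrative sketch in Python (hypothetical names; the extension itself is not written in Python) of how a reqres could remember its final networking state on entering finished:

#+begin_src python
from dataclasses import dataclass
from typing import Optional

# Final networking states named in the list above.
FINAL_NETWORKING_STATES = {
    "canceled", "no_response", "incomplete", "incomplete_fc",
    "complete", "complete_fc", "snapshot",
}

@dataclass
class Reqres:
    """A hypothetical, heavily simplified stand-in for the internal reqres object."""
    state: str = "start"
    final_networking_state: Optional[str] = None

def finish(reqres: Reqres) -> None:
    """Record the last networking state, then move the reqres to `finished`."""
    assert reqres.state in FINAL_NETWORKING_STATES
    reqres.final_networking_state = reqres.state
    reqres.state = "finished"

r = Reqres(state="complete")
finish(r)
print(r.state, r.final_networking_state)  # finished complete
#+end_src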
When a reqres reaches the finished
state it gets classified using algorithms described below.
The results of these computations influence which of the next reqres processing steps get taken for that reqres and what gets displayed to the user.
Conventional web browsers provide no explicit indication when a part of a web page fails to load properly.
Apparently, you are expected to actually look at the page with your eyes, notice something looking broken, and reload it manually if so.
Obviously, this can be quite inconvenient when you want to be sure that the whole page with all of its resources was archived.
Especially when parts of a dynamically loaded page might simply silently fail to be rendered by the associated JavaScript because some of the HTTP requests that JavaScript made in the background failed, or when, on a static web page, layout and CSS might have made some of the incompletely loaded parts of the page invisible (by design or by accident).
So, to provide such an indicator, Hoardy-Web keeps track of reqres that fail to load properly and marks them with a problematic flag (NOT a state), which influences:
- the toolbar button’s icon, badge, and title, all of which depend on the number of currently problematic reqres;
- the history-log page, which shows problematic reqres in a separate section;
- notifications, when this option is enabled: a new one is generated each time a new problematic reqres appears in a tab for which this option is set.
What gets marked as problematic is controlled by the =Mark reqres as ‘problematic’ when they finish= options.
By default, HTTP requests that failed to get a response, those that have incomplete response bodies, and those for which the browser reported potentially problematic errors but then Hoardy-Web picked them anyway, will be marked as problematic.
Problematic errors are errors like:
- “this request failed because of a networking issue”,
- “this request was aborted because the JavaScript function making it decided to cancel it when you moved your mouse cursor away from a video thumbnail it was needed for”,
- and similar things that probably imply some part of the page was left unfetched;
but NOT errors like:
- “fetching of this request was aborted because the server redirected it to a URL blocked by uBlock Origin”,
- “the browser decided against rendering of this data”,
- “the browser failed to render this data because this image file is broken”,
- and similar errors where the data was properly fetched.
(In principle, Hoardy-Web could have been designed to never record the errors of the latter category in the first place, thus simplifying the above, but Hoardy-Web is designed to follow the philosophy of “collect everything as the browser gives it, as raw as possible, do all the post-processing logic separately, allow for no logic at all, if the user asks for it”.)
The raw error strings reported by the browser for each reqres can be seen in the history-log.
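As a rough illustration of the default policy described above, a hedged Python sketch (hypothetical names, not the extension’s actual code):

#+begin_src python
from typing import List

def is_problematic_by_default(final_state: str, errors: List[str], picked: bool) -> bool:
    """Approximate the default `Mark reqres as 'problematic' when they finish` policy:
    no response at all, an incomplete response body, or browser-reported (potentially
    problematic) errors on a reqres that got picked anyway.  Telling problematic
    errors apart from benign ones is elided here."""
    if final_state in ("no_response", "incomplete"):
        return True
    if errors and picked:
        return True
    return False

print(is_problematic_by_default("complete", [], True))                      # False
print(is_problematic_by_default("complete", ["a networking error"], True))  # True
#+end_src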
If you don’t care about the problematic flag in a select tab and those notifications annoy you, you should disable this option.
If they annoy you in general, you can disable the global one instead.
You should probably not, however, disable too many of the options under =Mark reqres as ‘problematic’ when they finish= settings.
This way, even with notifications disabled, you can still see the number of problematic reqres in the extension’s toolbar button’s badge.
Note, however, that the problematic flag is purely a UI thing, **it does not influence archival or any of the other steps described below in any way**.
In contrast to the above, each new finished reqres advances either to the picked or the dropped state, which does influence the actions Hoardy-Web performs in the next steps.
Which of those two states gets selected is decided based on the =Pick reqres for archival when they finish= options.
By default, all complete and complete_fc reqres get picked, regardless of their HTTP response status codes, while the rest get dropped.
- A /problematic reqres/ (current tab) is a finished reqres that satisfies the conditions set by =Mark reqres as ‘problematic’ when they finish= settings.
- A /picked reqres/ (current tab) is a finished reqres that satisfied the conditions controlled by =Pick reqres for archival when they finish= settings on entering the finished state.
- A /dropped reqres/ (current tab) is a finished reqres that did NOT satisfy the conditions controlled by =Pick reqres for archival when they finish= settings on entering the finished state.
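For illustration only, the default pick/drop decision described above could be sketched like this in Python (hypothetical names, not the extension’s actual code):

#+begin_src python
def picked_by_default(final_networking_state: str) -> bool:
    """Default `Pick reqres for archival when they finish` behaviour: pick reqres
    that completed (possibly from the browser's cache), regardless of HTTP status
    code; drop everything else.  The real options let you tweak this per state."""
    return final_networking_state in ("complete", "complete_fc")

for state in ("complete", "complete_fc", "no_response", "incomplete"):
    print(state, "->", "picked" if picked_by_default(state) else "dropped")
#+end_src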
On exit from the finished state each reqres gets split into:
- a loggable, which is a hollow reqres structure without any request or response data, i.e. it only keeps the metadata used by the history-log, and
- a dump, which is a =WRR=-formatted dump (also there) of the original reqres structure.
Since those tuples can be reconstructed back into the original reqres structures, the following will continue to refer to them as if nothing changed when the fact that they are now internally represented by those tuples is not relevant.
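A minimal sketch of this split, with made-up field names and JSON standing in for the real =WRR= format (which is defined by the =hoardy-web= tooling, not by this sketch):

#+begin_src python
import json
from typing import Any, Dict, Tuple

# Hypothetical field names; JSON is only a stand-in for the real WRR serialization.
BODY_FIELDS = ("request_body", "response_body")

def split_reqres(reqres: Dict[str, Any]) -> Tuple[Dict[str, Any], bytes]:
    """Split a captured reqres into a metadata-only `loggable` and a serialized `dump`;
    the pair can be recombined, so the text keeps calling it a reqres."""
    dump = json.dumps(reqres).encode("utf-8")
    loggable = {k: v for k, v in reqres.items() if k not in BODY_FIELDS}
    return loggable, dump
#+end_src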
Normally, picked reqres proceed to the collected state and get queued for archival, while dropped reqres proceed to being discarded from memory.
When the =Archive ‘collected’ reqres= toggle is enabled, those queued reqres proceed directly to the next step.
However, sometimes you might want to actually look at a web page before deciding if you want to archive it or not.
The naive way to do it would be to load a page with capture disabled first, look at it, and then, if you want to save it, enable capture and reload the page again with the browser’s cache disabled via Control+F5 (and it has to be Control+F5, not just F5, because otherwise some URLs, on Firefox, might produce reqres in the incomplete_fc state, and on Chromium, their re-fetching could be silently skipped).
Obviously, this is both annoying and will force you to fetch everything twice.
Which is why Hoardy-Web
implements “limbo mode”.
With one of the limbo mode options enabled, Hoardy-Web will instead capture everything as normal, but then, instead of sending the newly captured reqres to the collected or discarded states immediately, it will put them into the in_limbo state, where they will linger until you collect or discard them manually by pressing the appropriate buttons, or until the =Closed tabs= options make a decision semi-automatically for you.
A picked reqres will be put into in_limbo when the =Pick into limbo= setting is enabled in the currently active tab or when one of the other settings is enabled for other reqres sources.
Similarly, a dropped reqres will be put into in_limbo when the =Drop into limbo= setting is enabled in the currently active tab or when one of the other settings is enabled for other reqres sources.
(This latter option mainly exists for debugging.)
If this option is enabled and there are more than this number of reqres in_limbo or the total size of all dumps in_limbo is more than this size (in MiB), Hoardy-Web will complain to remind you to collect or discard some of them so that your browser does not waste too much memory (and so that you won’t lose too much data if something crashes while the =Stash ‘collected’ reqres into local storage= option discussed below is disabled).
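The limbo bookkeeping described above boils down to a simple threshold check, sketched here in Python with hypothetical names and made-up limits (the actual limits are the two popup settings mentioned above):

#+begin_src python
def limbo_over_limits(in_limbo_count: int, in_limbo_total_bytes: int,
                      max_count: int, max_size_mib: int) -> bool:
    """Return True when the user should be reminded to collect or discard some
    in_limbo reqres: too many of them, or their dumps take up too much memory."""
    return (in_limbo_count > max_count
            or in_limbo_total_bytes > max_size_mib * 1024 * 1024)

# E.g., 1200 reqres totalling ~150 MiB against made-up limits of 1024 reqres / 128 MiB:
print(limbo_over_limits(1200, 150 * 1024 * 1024, 1024, 128))  # True
#+end_src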
- A /collected reqres/ (current tab) is a reqres that was (either automatically or manually) sent to the collected state.
- A /discarded reqres/ (current tab) is a reqres that was (either automatically or manually) sent to the discarded state.
- An /in-limbo reqres/ (current tab) is a reqres that is being held in_limbo until you manually collect or discard it.
- A /queued reqres/ (displayed on the Queued/Failed line) is a collected reqres that is still queued for archival.
The stashed reqres status is, essentially, a flag that says this reqres was temporarily backed up to the browser’s local storage.
In other words, stashing exists to prevent loss of successfully captured but not yet archived data in situations where:
- you quit or restart your browser,
- your computer unexpectedly loses power, or
- Hoardy-Web gets reloaded (e.g., on updates) or crashes
before you collected or discarded everything from in_limbo or Hoardy-Web has successfully archived everything from its archiving queue.
In particular:
- When the =Archive ‘collected’ reqres= option is disabled but the =Stash ‘collected’ reqres into local storage= option is enabled, instead of archiving newly queued reqres, Hoardy-Web will stash their (loggable, dump) tuples into the browser’s local storage.
- Similarly, when both the =Stash ‘collected’ reqres into local storage= option and one of the per-source settings are enabled for a reqres source, then all newly generated in_limbo reqres from that source will also get immediately stashed into the browser’s local storage.
Moreover, the following section will discuss how Hoardy-Web will try stashing unarchived reqres into the browser’s local storage too.
Note, however, that even with stashing enabled Hoardy-Web will skip disk IO whenever possible: e.g., if both the =Archive ‘collected’ reqres= and =Submit dumps via ‘HTTP’= options discussed below are enabled, Hoardy-Web will first try to archive each newly collected reqres straight from memory to the archiving server, and only if that process fails will it attempt stashing them to local storage instead.
Meaning that:
- stashing of non-=in_limbo= reqres is usually completely free, so you should probably keep that option always enabled;
- stashing of in_limbo reqres via one of those options is not free, so if you almost never archive from limbo then keeping those options enabled will waste disk IO; you might want to disable at least some of them in that case.
The above also implies that, technically, stashing is not a silver bullet against data loss.
To try and make it such would mean unconditional immediate stashing of all captured data, which would waste a lot of disk IO on most Hoardy-Web configurations.
When both the =Archive ‘collected’ reqres= option and the =Stash ‘collected’ reqres into local storage= option are disabled, then, after a new reqres gets queued, Hoardy-Web will generate a new notification complaining about it, unless that option is disabled too.
You can also forcefully stash all currently queued, in_limbo, and unarchived reqres by pressing this button.
It stashes everything immediately and unconditionally, ignoring all other stashing settings.
When reloading the extension via the =Reload= button or via =Auto-reload on updates= option, this action will be run automatically.
- A stuck queued reqres is a queued reqres that got stuck in the archival queue, e.g. because it got queued while the =Archive ‘collected’ reqres= option was disabled.
- A /stashed reqres/ is a reqres that was temporarily stashed (backed-up) into the browser’s local storage while it is still being kept in Hoardy-Web’s memory. I.e., the stash is a persistent on-disk backup for in-memory reqres.
- A /failed to stash reqres/ is a reqres that is currently unstashed, i.e. a reqres that failed to be stashed into the browser’s local storage. Note that reqres for which stashing was not even attempted are not included in this set. It is also a part of the sum of the “Failed” part of the Queued/Failed line. You can retry stashing these by pressing this button.
On entering the collected or discarded state, the loggable metadata of each reqres is copied into the recent reqres history-log and is kept there until the size of the log reaches this many elements, at which point the older elements of the log start being elided automatically.
You can also ask Hoardy-Web to forget all history manually by pressing this button, or to forget the history of reqres generated by the currently active tab by pressing that button instead, or do the same by using similar buttons in the log.
Using the log also allows the use of the reqres filtering options available there for doing this, allowing you to selectively forget parts of history.
Note, however, that problematic reqres will not get automatically elided from the log, nor forgotten by using the above buttons.
To forget about them, you will have to first unset the problematic flag on the respective reqres via this button, or that button, or use similar buttons in the log.
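To illustrate the eviction rule above, a hedged Python sketch (hypothetical structures, not the extension’s actual code): the log grows up to a limit, and only non-problematic entries are elided automatically:

#+begin_src python
from collections import deque
from typing import Deque, Dict

def elide_history(log: Deque[Dict], max_elements: int) -> None:
    """Drop the oldest non-problematic entries until the log fits the limit;
    problematic entries are never elided automatically."""
    overflow = len(log) - max_elements
    if overflow <= 0:
        return
    kept: Deque[Dict] = deque()
    for entry in log:
        if overflow > 0 and not entry.get("problematic", False):
            overflow -= 1  # elide this old, non-problematic entry
        else:
            kept.append(entry)
    log.clear()
    log.extend(kept)

log = deque([{"url": "a", "problematic": True}, {"url": "b"}, {"url": "c"}])
elide_history(log, 2)
print([e["url"] for e in log])  # ['a', 'c']
#+end_src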
When the =Archive ‘collected’ reqres= toggle is enabled, Hoardy-Web will pop queued reqres from the archival queue one by one and then perform one or more of the following (in the order they are listed):
- if the =Export dumps via ‘saveAs’= option is enabled, Hoardy-Web will export the dump by generating a fake-Download containing it (which is denoted by the exported state on the diagram above);
- if the =Submit dumps via ‘HTTP’= option is enabled, Hoardy-Web will submit the dump to the archiving server at the =Server URL= setting by making an HTTP POST request with the dump as the request body (which is denoted by the srvIO states on the diagram above);
- if any of the above fails, Hoardy-Web will
  - move the reqres into the unarchived state,
  - if the =Stash ‘collected’ reqres into local storage= option is enabled, try stashing the (loggable, dump) tuple into the browser’s local storage (which is denoted by the stashIO states on the diagram above), recording but ignoring any errors produced while doing that, and
  - stop processing this reqres;
- otherwise, if the =Save reqres into local storage= option is enabled, Hoardy-Web will
  - try to save the (loggable, dump) tuple into the browser’s local storage (which is denoted by the saveIO states on the diagram above), and,
  - if saving fails, move the reqres into the unarchived state instead, and stop processing this reqres;
- finally, if the =Save reqres into local storage= option is disabled or if saving to local storage succeeds, Hoardy-Web will discard the reqres from memory.
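Summarizing the control flow above, here is a hedged Python sketch of the per-reqres archival step (function and config names are hypothetical stand-ins for the extension’s internals; the bookkeeping that remembers and skips already-successful methods, discussed next, is elided):

#+begin_src python
def archive_one(loggable, dump, config,
                export_dump, submit_dump, stash_tuple, save_tuple):
    """One pass over a queued reqres, mirroring the ordered list above; the callables
    stand in for saveAs export, HTTP submission, stashing, and local-storage saving."""
    try:
        if config["export_via_saveAs"]:
            export_dump(dump)                # fake-Download / .wrrb bundle
        if config["submit_via_http"]:
            submit_dump(dump)                # HTTP POST to the archiving server (srvIO)
    except Exception:
        loggable["state"] = "unarchived"
        if config["stash_to_local_storage"]:
            try:
                stash_tuple(loggable, dump)  # stashIO; errors recorded but ignored
            except Exception:
                pass
        return                               # stop processing this reqres

    if config["save_to_local_storage"]:
        try:
            save_tuple(loggable, dump)       # saveIO
        except Exception:
            loggable["state"] = "unarchived"
            return

    loggable["state"] = "discarded"          # saving disabled or succeeded: drop from memory
#+end_src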
You can enable more than one archival method at the same time.
For a given loggable, Hoardy-Web will remember and skip previously successful archival methods if the loggable ever returns to the archival queue again (e.g., when one of the archival methods fails and you later ask Hoardy-Web to retry the archival, or when you re-queue a reqres from local storage from the =Saved in Local Storage= page).
Note the difference between stashed and saved reqres:
- stashed reqres are kept in memory until they get successfully archived by all configured archival methods (or until you manually discard them, in case they were stashed in_limbo);
- saved reqres get dumped into the browser’s local storage and, if that succeeds, discarded from memory (until you manually load them back from there).
Sometimes you might want to semi-automatically split your collected archives into separate disjoint sets.
Say, for instance, you want to split out archives generated by a select tab into a separate set you plan to share with somebody else.
In Hoardy-Web such sets are called buckets.
WARC-based tools sometimes call these “collections” instead.
To implement this, for each reqres in the archival queue, Hoardy-Web takes a bucket value from a corresponding “Bucket” setting:
- this one will be used for requests originating from the currently active tab,
- this one will be used for requests originating from new child tabs opened from the currently active tab (e.g. via middle mouse click, context menu, etc),
- while this one will be used for new tabs opened via the browser’s “New Tab” browser action (i.e. the plus sign in the tab bar, Control+T, menu item, etc),
- and so forth for the others (press the ? symbols to see a tooltip explaining what each of them does).
Evaluation of bucket is done just before each archival attempt, so if the queue is not yet empty, and you disable =Archive ‘collected’ reqres=, edit some of the “Bucket” settings, and enable it again, Hoardy-Web will start using the new setting immediately.
When exporting via saveAs, the bucket value will be used in the file name of the generated fake-Download .wrrb file, and the dumps will be split into separate fake-Download files by said bucket.
I.e., internally, the bundle discussed above is actually a set of per-=bucket= bundles.
When submitting to an HTTP server, Hoardy-Web will specify bucket as a query parameter (named “profile”, for historical reasons) to each HTTP POST request, which will cause the configured archiving server to put those WRR files into a directory with the same name.
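For example, a minimal Python sketch of what such a submission looks like on the wire (the server URL and bucket name in the usage line are made up; the real request is made by the extension itself):

#+begin_src python
import urllib.parse
import urllib.request

def submit_dump(dump: bytes, server_url: str, bucket: str = "default") -> None:
    """POST a single WRR dump to an archiving server, passing the bucket as the
    'profile' query parameter, as described above."""
    url = server_url + "?" + urllib.parse.urlencode({"profile": bucket})
    req = urllib.request.Request(url, data=dump, method="POST")
    with urllib.request.urlopen(req) as resp:
        resp.read()  # reply body is not needed; HTTP errors raise urllib.error.HTTPError

# submit_dump(b"...a WRR-formatted dump...", "http://127.0.0.1:3210/", bucket="share")
#+end_src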
When stashing or saving to local storage, Hoardy-Web
will record the value of bucket
into each loggable
before saving data to disk.
If you restart your browser, thus starting a new Hoardy-Web
session, Hoardy-Web
will use the old stashed/saved bucket
values for all new attempted archivals of old reqres generated by previous sessions.
So, for example, if you want to share a subset of your captures, you can
- set this option in a tab to, say, “share”,
- clear your browser’s cache,
- then and only then navigate to a web page archives of which you plan to share,
- thus separating out those reqres into separate saveAs bundles with “share” in their name and/or putting them into a separate archiving server directory named “share”, depending on archival methods.
As noted above, if any of the archival methods fail, the reqres in question will be moved into the unarchived state.
Submissions of reqres that became unarchived because of networking issues will be retried automatically every 60 seconds.
Archivals of reqres rejected by the archiving server or those that failed to be saved to the browser’s local storage will not be retried automatically, as those failures usually happen when there is no space left on the device you are archiving to.
You can retry all archiving failures by pressing one of this or that buttons. You can also use them to nudge the archiving sub-process awake if some things got stuck in the queue by accident. E.g., after the extension got reloaded with a non-empty queue, or if you previously quit your browser before everything was archived.
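The retry policy just described can be summarized with a small hedged Python sketch (hypothetical failure categories; the real code derives these from the browser- and server-reported errors):

#+begin_src python
import time

def should_retry_automatically(failure_kind: str) -> bool:
    """Network-level failures get retried on a timer; server rejections and failed
    local-storage saves wait for you to press a retry button instead."""
    return failure_kind == "network"

def retry_networking_failures(unarchived, resubmit, interval_seconds: int = 60) -> None:
    """Every `interval_seconds`, retry only the automatically retryable failures."""
    while any(should_retry_automatically(kind) for _, kind in unarchived):
        time.sleep(interval_seconds)
        still_failing = []
        for item, kind in unarchived:
            if should_retry_automatically(kind):
                try:
                    resubmit(item)
                    continue          # success: drop it from the failed list
                except Exception:
                    pass
            still_failing.append((item, kind))
        unarchived[:] = still_failing
#+end_src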
If this option is enabled and a new reqres recently moved to the unarchived
state, a new notification will be generated.
If this option is enabled, a new notification will be generated when the archival queue gets empty the very first time or after any failures.
- A /failed to archive reqres/ is a reqres that is currently unarchived, i.e. a reqres that failed to be archived by one of the enabled archival methods. It is also a part of the sum of the “Failed” part of the Queued/Failed line. You can retry archiving these by pressing this button.
- An /exported reqres/ is a reqres that was successfully exported by generating a fake-Download containing its dump.
- A /submitted reqres/ is a reqres that was successfully submitted to the archiving server and thus was discarded from memory.
- A /saved reqres/ is a reqres that was successfully saved by being archived into the browser’s local storage.
- An archived reqres is either an exported, submitted, or saved reqres.
If you archived some data by saving it into local storage and you now want to re-archive the same data using another method, do the following:
- enable the option for your desired archival method (e.g., =Export dumps via ‘saveAs’=),
- but keep the =Save reqres into local storage= option enabled,
- if you are re-archiving via the =exporting them via ‘saveAs’= option, you should probably temporarily set this timeout to 0 to prevent idle waiting,
- press the =Show= button on the Saved in LS line in the popup to open the Saved in Local Storage page;
- set the filter of your desired archival method there to false (red) to make it only display reqres that were not yet archived using that archival method (e.g., set Exported via 'saveAs' to false);
- then re-queue the data saved in local storage:
  - press the Re-queue button;
  - wait for Hoardy-Web to finish archival of newly re-queued reqres;
  - (if you are re-archiving via the =exporting them via ‘saveAs’= option while running on truly ancient hardware and the above process is slow, you can disable the =GZip outputs= option; though, the resulting WRR bundles will take a lot of disk space in this case);
  - (also, if you are re-archiving via the =exporting them via ‘saveAs’= option, then after each Re-queue you should wait for the browser to save the resulting generated WRR bundles to disk and then confirm that each generated fake-Download did not fail; why? because if you re-archive a lot of data, thus generating many WRR bundles at once, and you run out of disk space in the process, the browser might fail a random subset of the generated fake-Downloads without telling Hoardy-Web anything about it; this is not an issue when archiving by =… submitting them via ‘HTTP’=, because archiving servers report their errors properly);
  - Re-queue more data, and repeat until everything is re-archived.
- set this timeout to its previous value, if you changed it.
If, after confirming everything was properly re-archived, you now want to wipe that re-archived data from local storage, do the following:
- press the =Show= button on the Saved in LS line in the popup to open the Saved in Local Storage page;
- set the filter of your desired archival method there to true (green) to make it only display reqres that were already archived using that archival method;
- press the Delete button there repeatedly, until everything is deleted.
Sometimes, you might want to block a select tab from performing new HTTP
requests.
Say, for instance, you opened a URL in a new tab, then you forgot about that tab for a while, but then you returned to it again, and you now want to read that page.
But then you discover that the font size is too small for you, and so you want to change that tab’s zoom level.
Changing the zoom level will change the tab’s viewport size, which, if the page uses responsive CSS, will likely force your browser to generate new HTTP requests to fetch data used by previously inactive parts of the layout.
Essentially, this will notify the page’s origin server that you are now interacting with that page.
Some websites do this on purpose to track users that run with JavaScript
disabled.
Meanwhile, normally, when using the =hoardy-web= tool (also there), pages of static website mirrors generated by its mirror sub-command and HTTP replay pages generated by its serve sub-command remap all URLs of page requisites to point to local files and replay URLs.
(Though, this is configurable.)
But the HTML5 specification is quite large and gets updated all the time, interactions between remapped pages and some browser extensions can sometimes break things, and hoardy-web can have bugs in its remapping code.
So, remapping of some of those URLs can fail sometimes.
Say, however, you want to ensure that:
- your browser won’t notify a page’s origin server when you start interacting with it long after you loaded it, and
- your browser won’t try to access the Internet when you open one of hoardy-web mirror'ed or hoardy-web serve'd pages.
In some cases you might even feel paranoid enough to want to prevent your browser from opening non-remapped jump-links (a href), even when you click them (by accident).
Desktop versions of Firefox-based browsers have a File > Work Offline option that can solve most of this, but it disables all new requests browser-wide, which is quite inconvenient and error-prone if you want to keep some of your tabs offline while not restricting others, and it will break replay over HTTP with hoardy-web serve.
Chromium-based browsers do not appear to have such a feature at all.
To solve this issue — and to add an equivalent of File > Work Offline to Chromium-based browsers — Hoardy-Web implements its own Work offline mode, controlled via the following toggles:
- the global toggle is pretty much equivalent to Firefox’s own option and enables canceling of all new requests browser-wide;
- this toggle enables “Work offline” mode in the currently active tab, thus also preventing you from navigating to any Internet URLs by clicking any links that open in the same tab;
- this toggle enables it for the currently active tab’s new children, thus also preventing you from opening any Internet URLs by spawning new tabs from it;
- there is also a toggle for controlling the default value of the above two options in newly spawned root tabs,
- as well as toggles controlling “Work offline” mode for background requests and requests generated by extensions.
Unlike the File > Work Offline option of Firefox, enabling any of these toggles:
- does not break hoardy-web serve replay URLs;
- does not break any requests that are already in-flight;
- does not prevent generation of new canceled reqres when a corresponding Track new requests toggle is also enabled, and those can be seen in the history-log.
In the latter case, those newly generated canceled reqres will also be marked as problematic if that option is enabled.
So, for convenience, there is also a toggle that controls whether toggling Work offline options (from the popup or with keyboard shortcuts) should also automatically set the corresponding Track new requests option to the opposite value.
Finally, there is also a bunch of options that automatically enable “Work offline” mode in tabs with various classes of URLs.
By default, “Work offline” mode is enabled for file: URLs to stop any pages generated by hoardy-web mirror from accessing the Internet.
Hoardy-Web provides a bunch of keyboard and context menu shortcuts to allow using it in more efficient ways.
- On Firefox-based browsers, you can see and edit all keyboard shortcuts via Add-ons and themes (about:addons) -> the gear icon -> Manage Extension Shortcuts.
- On Chromium-based browsers, you can see and edit all keyboard shortcuts via the menu -> Extensions -> Manage Extensions (chrome://extensions/) -> Keyboard shortcuts (on the left).
When your archiving server supports it and this option is enabled, Hoardy-Web enables its integration with replay over HTTP.
At the moment, this includes two buttons which re-navigate all tabs or the currently active tab (respectively) to their replay pages, as well as the keyboard shortcuts and context menu actions described below.
Hoardy-Web provides shortcuts to:
- open the =Internal State and Logs= page, {{{shortcut(showState)}}};
- open the Internal State and Logs page, scrolled to the end of the log, {{{shortcut(showLog)}}};
- open the =Internal State and Logs= page narrowed to the currently active tab’s data, {{{shortcut(showTabState)}}};
- open the Internal State and Logs page narrowed to the currently active tab’s data, scrolled to the end of the log, {{{shortcut(showTabLog)}}};
- toggle inclusion of the currently active tab in global snapshots, then set inclusion of its children to the same value, {{{shortcut(toggleTabConfigSnapshottable)}}};
- toggle inclusion of the currently active tab’s children in global snapshots, {{{shortcut(toggleTabConfigChildrenSnapshottable)}}};
- toggle work offline mode in the currently active tab, then, if impure, set tracking to the opposite value, then set the same options in its children to the same values, {{{shortcut(toggleTabConfigWorkOffline)}}};
- toggle work offline mode in the currently active tab’s new children, then, if impure, set tracking to the opposite value, {{{shortcut(toggleTabConfigChildrenWorkOffline)}}};
- toggle tracking of newly spawned HTTP requests in the currently active tab, then set tracking in its children to the same value, {{{shortcut(toggleTabConfigTracking)}}};
- toggle tracking of newly spawned HTTP requests in the currently active tab’s children, {{{shortcut(toggleTabConfigChildrenTracking)}}};
- toggle limbo mode in the currently active tab, then set limbo mode in its children to the same value, {{{shortcut(toggleTabConfigLimbo)}}};
- toggle limbo mode in currently active tab’s children, {{{shortcut(toggleTabConfigChildrenLimbo)}}};
- unmark all problematic reqres, {{{shortcut(unmarkAllProblematic)}}};
- unmark all current tab’s problematic reqres, {{{shortcut(unmarkAllTabProblematic)}}};
- collect all reqres from limbo, {{{shortcut(collectAllInLimbo)}}};
- collect all reqres from limbo for the currently active tab, {{{shortcut(collectAllTabInLimbo)}}};
- discard all reqres from limbo, {{{shortcut(discardAllInLimbo)}}};
- discard all reqres from limbo for the currently active tab, {{{shortcut(discardAllTabInLimbo)}}};
- take a DOM snapshot of all tabs for which the =Include in global snapshots= setting is enabled, {{{shortcut(snapshotAll)}}};
- take a DOM snapshot of the currently active tab, {{{shortcut(snapshotTab)}}};
- replay all tabs for which the =Include in global replays= setting is enabled, {{{shortcut(replayAll)}}};
- replay the currently active tab, {{{shortcut(replayTabBack)}}}.
Hoardy-Web provides context menu actions to:
- open a link in a new tab with the currently active tab’s tracking-in-children-tabs setting negated: right-mouse click while pointing at a link and select the Hoardy-Web > Open Link in New Tracked/Untracked Tab menu item;
- do the same thing, but opening it in a new window;
- open a replay of a link in a new tab.
`Hoardy-Web` can’t establish a connection to the archiving server at `<URL>`, The archiving server at `<URL>` appears to be defunct, The archiving server at `<URL>` does not allow archiving, it appears to be a replay-only instance, Failed to archive <N> items because `Hoardy-Web` can’t establish a connection to the archiving server, and Failed to archive <N> items because this archiving server is defunct
Are you running the archiving server script or a =hoardy-web serve= instance?
In the case of hoardy-web serve, does that server instance allow archiving? Is it replay-only, maybe?
If you fixed it and the error persists, press this button.
Replay is forbidden by the "Replay from the archiving server" option.
Enable this option.
Replay is impossible because the archiving server at `<URL>` is unavailable or defunct. and The archiving server at `<URL>` does not support replay.
Are you running =hoardy-web serve=?
At the moment, that’s the only archiving server that supports this.
If you fixed it and the error persists, press this button.
Failed to archive <N> items because requests to the archiving server failed with: <STATUS> <REASON>: <RESPONSE>
Your archiving server is returning HTTP errors when Hoardy-Web is trying to archive data to it. See your archiving server’s console for more information.
Some common reasons it could be failing:
- No space left on the device you are archiving to.
- It’s a bug, {{{reportit()}}}.
Failed to stash <N> items because <reason> or Failed to archive <N> items because <reason>
Stashing or archiving failed for some other reason.
Some common reasons it could be failing:
- No space left on the device your browser saves its local storage to.
- It’s a bug, {{{reportit()}}}.
Failed to open/create a database via `IndexedDB` API, all data persistence will be done via `storage.local` API instead. This is not ideal, but not particularly bad. However, the critical issue is that it appears Hoardy-Web previously used `IndexedDB` for archiving and/or stashing reqres.
So, it worked before, but why doesn’t it work now? The most likely reason is: you are running Hoardy-Web under a browser based on an older version of Firefox and you have recently enabled the Always use private browsing mode setting in your browser’s config. Older versions of Firefox forbid the use of the IndexedDB API when that setting is set.
To make archives currently saved in IndexedDB accessible to Hoardy-Web under Always use private browsing mode you need to:
- Disable the Always use private browsing mode browser setting and restart the browser, thus allowing Hoardy-Web access to IndexedDB again.
- Ensure the =Prefer ‘IndexedDB’ API= setting is disabled.
- Ensure the =Save reqres into local storage= option is enabled.
- Ensure =Archive ‘collected’ reqres= is enabled.
- Open the =Saved in Local Storage= page.
- Set the In 'storage.local' filter there to false (red).
- Press the Re-queue button there to re-archive all those saved reqres from IndexedDB to storage.local.
- Now, you can re-enable the Always use private browsing mode browser setting and restart your browser again.
All old data should be available from the =Saved in Local Storage= page now.
Failed to process <N> items because <reason>
It’s a bug, {{{reportit()}}}.
- Other error notifications should be completely self-descriptive. If they are not, {{{reportit()}}}.
Most error codes are produced by attaching one of the following prefixes to the raw error code given by the browser:
- the webRequest:: prefix is prepended to errors produced by the code working with the webRequest API;
- the debugger:: prefix is prepended to errors produced by the code working with Chromium’s Debugger API;
- the filterResponseData:: prefix is prepended to errors produced by the webRequest.filterResponseData API (these can usually be ignored, since Firefox generates normal webRequest:: codes for those reqres too, when it was an actual error; but Hoardy-Web still collects them, adhering to the “collect everything as the browser gives it, when possible” philosophy).
In particular, the webRequest::NS_ prefix on Firefox, and the webRequest::net:: and debugger::net:: prefixes on Chromium signify various issues produced by the networking stacks of those browsers.
For instance:
- webRequest::NS_ERROR_ABORT on Firefox and webRequest::net::ERR_ABORTED on Chromium signify that this request was aborted before it finished, e.g. because the originator tab was closed before it was fully loaded; Firefox also uses this code to mean what Chromium signifies with various BLOCKED codes;
- webRequest::net::ERR_BLOCKED_BY_CLIENT on Chromium signifies that an extension blocked it;
- debugger::net::ERR_BLOCKED:: is a prefix for other errors when the request was blocked, e.g. by CSP;
- the webRequest::NS_ERROR_NET prefix on Firefox and the webRequest::net::ERR_FAILED error on Chromium signify various networking issues.
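Since these codes are just a prefix glued onto the browser’s raw error string with ::, post-processing them is straightforward; a small illustrative Python sketch (not part of any Hoardy-Web tool):

#+begin_src python
def split_error_code(code: str):
    """Split an error code like 'webRequest::NS_ERROR_ABORT' into its leading
    source prefix and the raw remainder."""
    source, _, rest = code.partition("::")
    return source, rest

print(split_error_code("webRequest::NS_ERROR_ABORT"))  # ('webRequest', 'NS_ERROR_ABORT')
print(split_error_code("debugger::net::ERR_ABORTED"))  # ('debugger', 'net::ERR_ABORTED')
#+end_src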
The exception to the above rule of keeping everything as raw as possible are the webRequest::capture:: and debugger::capture:: prefixes, which signify various errors produced by Hoardy-Web itself in its webRequest- or debugger-handling code, respectively.
In particular:
- webRequest::capture::EMIT_FORCED::BY_USER and debugger::capture::EMIT_FORCED::BY_USER are produced when you forcefully advance a reqres from an in-flight state by pressing this or that button;
- debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER is produced when the Chromium debugger gets detached from its tab while a reqres inside that tab is still in flight;
- debugger::capture::EMIT_FORCED::BY_CLOSED_TAB is produced when a tab gets closed while a reqres inside of it is still in flight;
- debugger::capture::NO_RESPONSE_BODY:: is a prefix for errors produced when getting a request’s response body from Chromium’s debugger fails for various reasons;
- webRequest::capture::CANCELED::NO_DEBUGGER is produced when a non-main-frame request is canceled by Hoardy-Web because no debugger is available to capture it; in the case of a main frame request, Hoardy-Web will cancel the request and reload the tab, as discussed there, so this error will not be produced; but it can happen if a page tries to load a sub-frame (like an iframe) while the debugger for the tab (and, thus, the main frame) did not attach yet (which only happens for pages where Chromium disallows debugging, or when Hoardy-Web gets enabled after the page in question already started loading, e.g. the very first page after the browser starts); also, this can happen when the debugger gets detached after the main frame was captured but its resources are still loading;
- webRequest::capture::CANCELED::BY_WORK_OFFLINE is produced when the reqres was canceled by one of the “Work offline” options, i.e. as a result of one or more of this-this-this-this-or-that options being set;
- webRequest::capture::RESPONSE::BROKEN is produced when some response metadata is unavailable.
  At the moment, this only appears to happen on Firefox when a request gets fulfilled by a service or shared worker after Firefox had already sent it to the server. Firefox then interrupts the networking code and generates an NS_ERROR_NET_ON_* error about the event failing to supply the response metadata generated by the service/shared worker.
If you are reading this page outside of the extension’s UI be sure to read the very top of this page first.
- Hoardy-Web does not implement collection of WebSockets data on any of the supported browsers. (Firefox does not support it. Chromium does support it, in theory, but I have not tried using that API, so I have no idea how well it works.)
  This is a low-priority issue since you can simply take a DOM snapshot instead of capturing and later replaying WebSocket messages to in-page JavaScript. Also, capturing and archiving a DOM snapshot will free you from needing to run any JavaScript at all when you decide to return to view the archived page later, which is nice.
- On Chromium, response data of background requests and requests made by other extensions does not get collected, since there’s no tab to attach a debugger to, and I have not figured out how to attach the debugger to other things yet.
- On Firefox, fetches that spawn new downloads will be marked as problematic by default, since Firefox’s implementation of the webRequest.filterResponseData API does not provide their contents to the extension and I have not figured out how to distinguish them from other fetches yet.
- When Hoardy-Web is reloaded without using the =Reload= button or the =Auto-reload on updates= option, e.g. when Hoardy-Web is reloaded by clicking the “Reload” button in the browser’s extension list, then all per-tab settings of all tabs will be reset to the values used by the newly spawned root tabs.
  This issue is not applicable in the case when the reload happens because the extension was updated; in that case the browser will notify Hoardy-Web about it and Hoardy-Web will handle it properly, see the help string of the =Reload= button for more info.
  But in the case of Reload buttons, the browser does not ask the extension nicely, so all unsaved internal state will be lost.
- If an HTTP server supplies the same header multiple times — which happens sometimes, most commonly with Set-Cookie headers — then the archived response headers will usually become weird, with multiple headers squished into a single value, separated by newline symbols.
  This is just the way both Firefox (usually) and Chromium (always) supply those headers to extensions, and Hoardy-Web does not try to undo it.
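If you post-process such archived headers yourself, splitting the squished values back apart is a one-liner; a hedged Python sketch (the header representation here is made up for illustration):

#+begin_src python
from typing import Dict, List, Tuple

def unsquish_headers(headers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Expand header values with embedded newlines back into separate (name, value)
    pairs, e.g. a squished Set-Cookie value into several cookies."""
    result = []
    for name, value in headers.items():
        for part in value.split("\n"):
            result.append((name, part))
    return result

print(unsquish_headers({"Set-Cookie": "a=1\nb=2"}))
# [('Set-Cookie', 'a=1'), ('Set-Cookie', 'b=2')]
#+end_src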
Known issues that are consequences of issues of Firefox-based desktop browsers: Firefox, Tor Browser, LibreWolf, etc
- On Firefox-based browsers, without the patch (also there), the browser only supplies formData to webRequest.onBeforeRequest handlers, thus making it impossible to recover the actual request body of a POST request.
  Hoardy-Web will mark such requests as having a “partial request body” and try its best to recover the data from the formData structure, but if a POST request was uploading files, they won’t be recoverable from formData (in fact, it is not even possible to tell if there were any files attached there), and so your archived request data will be incomplete even after Hoardy-Web did its best.
  Disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.
  With the above patch applied, small POST requests will be archived completely and correctly. POST requests that upload large files, and only those, will be marked as having a “partial request body”.
- If-Modified-Since and If-None-Match headers never get archived, because the browser never supplies them to the extensions. Thus, you can get a 304 Not Modified reqres response to a seemingly normal GET request.
- Reqres of already cached media files (images, audio, video, except for svg and favicons) will end up in the incomplete_fc state because the webRequest.filterResponseData API does not provide response bodies for such requests. This toggle controls whether such reqres should be picked.
  By default, Hoardy-Web will drop them. Usually this is not a problem since such media will be archived on first (non-cached) access. But if you want to force everything on the page to be archived, you can reload the page without the cache with Control+F5.
- Firefox fails to run the onstop method of a webRequest.filterResponseData filter for the very first HTTP/2 request the browser makes after you start it, thus making the reqres of that request incomplete. If this option is enabled, Hoardy-Web will transparently work around this bug by redirecting the very first navigation request to about:blank and then reloading the tab with its original URL.
- Firefox-based browsers provide no API for archiving WebSockets data at the moment, unfortunately.
Known issues that are consequences of issues of Firefox-based mobile browsers: Fenix aka Firefox for Android, Fennec, Mull, etc
All of the above apply, moreover:
- Archival by exporting using =saveAs= is not supported at the moment because of this bug.
Known issues that are consequences of issues of Chromium-based desktop browsers: Chromium, Chrome, etc
On Chromium-based browsers, there is no way to get HTTP response data without attaching Chromium’s debugger to the tab from which a request originates.
This makes things a bit tricky, for instance:
- With this and this option enabled, new tabs will be reset to this value (about:blank by default) because the default of chrome://newtab/ does not allow attaching the debugger to tabs with chrome: URLs.
- Requests made before the debugger is attached will get canceled by Hoardy-Web. So, for instance, when you middle-click a link, Chromium will open a new tab, but Hoardy-Web will block the requests from there until the debugger gets attached, and then automatically reload the tab after. As a side-effect of this, Chromium will show a Request blocked page until the debugger is attached and the page is reloaded, meaning it will get visually stuck on the Request blocked page if fetching the request ended up spawning a download instead of showing a page. The download will proceed as normal, though.
- You will get an annoying notification bar constantly displayed in the browser while =Hoardy-Web= is enabled. Closing that notification will detach the debugger. Hoardy-Web will reattach it immediately because it assumes you don’t want to lose data, and closing that notification by accident is, unfortunately, quite easy.
  However, closing the notification will make all in-flight requests lose their response data. All alternatives to Hoardy-Web that work with Chromium suffer from the same issue.
  If you disable this option, the debuggers will get detached only after all requests finish. But even if there are no requests in-flight, the notification will not disappear immediately; Chromium takes its time updating the UI after the debugger is detached.
Moreover, Chromium has the following long-standing issues/bugs making things difficult:
- Chromium will automatically detach a debugger from a tab if it tries to save too much data into its debugger state.
  Which means that a tab that loads too much data too fast will get its debugger detached.
  Chromium does this to try and save memory, but this, among other issues, means that large images will fail to be properly archived, and any page that loads such files is likely to fail to be archived too.
  This is a design limitation of the Chromium debugging interface; there appears to be no workaround for this at the moment.
  Meanwhile, on Firefox, Hoardy-Web uses the webRequest.filterResponseData API (not available on Chromium, because it greatly enhances the browser’s ad-blocking capabilities), which does not suffer from this problem.
API (not available no Chromium, because it greatly enhances browser’s ad-blocking capabilities) which does not suffer from this problem. - Chromium will occasionally detach debuggers from some tabs at random.
It just happens.
Fortunately,
Hoardy-Web
will mark the resulting broken reqres as problematic by default as they match the conditions of at least one of this, this, or that options. - Chromium handling of media files (audio and video) within its debugging interface is very strange.
When Chromium encounters a media file, it immediately loads a first few frames of it, then cancels the rest of the download, generates a networking error debugging event, but forgets to give the already loaded data to it, and then, when the user clicks the play button, continues the download by requesting the rest of the file as normal.
Thus, on Chromium, for media files
Hoardy-Web
will only ever get206 Partial Content
HTTP
responses with the first few kilobytes of file data missing. This bug has no good workaround, all alternatives toHoardy-Web
that work with Chromium work it around by silently re-downloading the file the second time in background. - Similarly to unpatched Firefox, Chromium-based browsers do not supply contents of files in
POST
request data. They do, however, provide a way to see if files were present in the request, soHoardy-Web
will mark such and only such requests as having a “partial request body”. There is no patch for Chromium to fix this, nor do I plan to make one (feel free to contribute one, though).As with Firefox, disabling this toggle will disable archiving of such broken requests. This is not recommended, however, as archiving some data is usually better than archiving none.
- Chromium fails to provide openerTabId to tabs created with the chrome.tabs.create API, so in the unlikely case of opening two or more new tabs/windows in rapid succession via Hoardy-Web context menu actions and not giving them time to initialize, Hoardy-Web could end up mixing up settings between the newly created tabs/windows. This bug is impossible to trigger unless your system is very slow or you are clicking things with automation tools like AutoHotKey or xnee.
- To properly collect all the data about a reqres, Hoardy-Web has to use both the data generated by the webRequest API and Chromium’s own debugging API events; using only one of those is usually insufficient. But Chromium generates different request IDs for events generated by these two different APIs and also generates those events in arbitrary order. Therefore, Hoardy-Web tracks reqres generated by both sets of APIs separately and then matches those two lists against each other heuristically, merging matching reqres together. Which is ugly enough. But then Chromium sometimes generates debugging API events and forgets to produce the corresponding webRequest API events, or vice versa, thus leaving some of those reqres unmatched.
  To work around that, Hoardy-Web waits this many seconds for new events to arrive, and if none do, forcefully finishes all unmatched but network-complete in_flight reqres. Yes, this means that some minor metadata fields (like document_url) of those reqres might be missing, but waiting more time usually won’t fix it, so Hoardy-Web can’t do anything else there.
can’t do anything else there. - However, sometimes Chromium forgets to generate both
loading-complete
andloading-failed
debugging events. This usually happens when a request gets started and then canceled by a page’sJavaScript
, or when you navigate between pages too fast.In that case,
Hoardy-Web
can’t tell if a reqres is just slow at being loaded or if Chromium forgot about it, so those reqres will get stuck in thein_flight
state indefinitely, at least until their originator tab gets closed, or until you press one of this or that buttons.Hoardy-Web
might get another workaround for this bug later.
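For the curious, here is a minimal sketch of the matching idea described above. It is hypothetical and is NOT Hoardy-Web’s actual code; the field names and the timing threshold are assumptions. Reqres captured via the two APIs are kept in separate lists and merged heuristically by URL and timing, since their request IDs never match.
#+begin_src javascript
// A hypothetical illustration, NOT Hoardy-Web’s actual implementation.
// `webRequestList` and `debuggerList` are assumed to be arrays of objects
// with `url` and `startedAt` (millisecond timestamp) fields.
function matchReqres(webRequestList, debuggerList, maxSkewMs = 5000) {
  const merged = [];
  for (const wr of webRequestList) {
    // find a debugger-side reqres for the same URL that started at roughly the same time
    const i = debuggerList.findIndex(
      (dbg) => dbg.url === wr.url && Math.abs(dbg.startedAt - wr.startedAt) < maxSkewMs
    );
    if (i !== -1) {
      // merge the two halves; the debugger side usually carries the response body
      merged.push({ ...wr, ...debuggerList.splice(i, 1)[0] });
    } else {
      // unmatched: after a timeout such reqres get forcefully finished anyway
      merged.push(wr);
    }
  }
  // whatever is left on the debugger side also stays unmatched
  return merged.concat(debuggerList);
}
#+end_src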
If you are reading this page outside of the extension’s UI, be sure to read the very top of this page first.
Hoardy-Web only ever sends your data to the archiving =Server URL= you specify when the =Submit dumps via 'HTTP'= option is enabled.
Nowhere else, ever.
For your convenience, Hoardy-Web
saves some global stats across restarts (e.g., the Collected, Discarded, Picked, and Dropped lines).
However, none of those are ever sent anywhere and you can reset them at any time.
No. I (the author) hate non-consensual data collection.
In fact, as you might have noticed, Hoardy-Web, unlike most other browser extensions, is almost trivial to reproducibly build from source on a POSIX-compliant system with the Nix package manager installed, and it has a privately operated source code mirror.
This is by design: I expect a chunk of Hoardy-Web users to be paranoid enough to only ever build it from source and install the results manually into their LibreWolf or some such, leaving zero telemetry fingerprints anywhere.
- <all_urls> permission is used so that Hoardy-Web can capture all URLs.
- webRequest and webRequestBlocking permissions are used to track and capture HTTP requests and responses; on Chromium the latter also requires the debugger permission, which Hoardy-Web also asks for there.
- tabs permission is used for tracking per-tab state and stats, making Hoardy-Web’s toolbar icon show per-tab state, taking DOM snapshots of all tabs, buttons for switching to a related tab in the log, etc.
- webNavigation permission is used to apply =Tabs with special URLs= options when a tab navigates to a different URL; see above for more info.
- storage permission is used to save the extension’s config and stats.
- unlimitedStorage permission is used for stashing and archival of captured data to the browser’s local storage.
- menus (contextMenus on Chromium) permission is used to add context-menu shortcut actions for links.
- notifications permission is used to send notifications, which are mostly used for reporting various issues.
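If you want to double-check which of these the extension actually holds at runtime, a minimal sketch (hypothetical, not part of Hoardy-Web; run it from the extension’s own debugging console, and on Chromium substitute chrome for browser) looks like this:
#+begin_src javascript
// List the permissions the extension has actually been granted.
browser.permissions.getAll().then(({ origins, permissions }) => {
  console.log("origins:", origins);         // e.g. ["<all_urls>"]
  console.log("permissions:", permissions); // e.g. ["webRequest", "webRequestBlocking", "tabs", ...]
});
#+end_src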
Yes.
This is why =DOM=-snapshot buttons exist; see the following question.
In principle, Hoardy-Web
will capture everything your browser fetches from the network as you browse the web, except for, at the moment, WebSockets data.
So, web pages using only simple UI-related JavaScript
code will work fine when you start replaying them “from scratch” via =hoardy-web mirror= (also there) or some such.
However, in the most general case, “from scratch” replay of pages dynamically generated via JavaScript
is not guaranteed.
For example, consider a web page with a JavaScript
code that generates a random number, then queries a remote server with that number, and then renders the result somehow.
Obviously, such a web page can not be replayed “from scratch” since it will generate a new random number and your archive probably won’t have the corresponding server’s response for it.
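For instance, a hypothetical version of such a page’s script might look like the following; the /api/lookup endpoint is made up for illustration:
#+begin_src javascript
// A toy page script (hypothetical) that cannot be replayed “from scratch”:
// each load picks a fresh random number, so the URL requested during replay
// will almost certainly not exist in your archive.
const n = Math.floor(Math.random() * 1e9);
const resp = await fetch(`/api/lookup?n=${n}`); // endpoint name is made up
document.body.textContent = await resp.text();
#+end_src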
Can I use Hoardy-Web
to capture a web page as it currently is, after all JavaScript
was run, not as it was when it was last fetched from the network?
Yes, you can capture DOM
(Document Object Model) snapshots of all frames of the currently active tab by pressing this button in the popup.
Doing that will generate and capture snapshots of the raw HTML or XML of each frame contained in the currently active tab.
(Reqres-wise they will be 200 OK responses, but with protocol set to SNAPSHOT and method set to DOM.)
You can also do that for all open tabs for which this setting is enabled all at once by pressing that button.
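Conceptually, such a snapshot boils down to serializing the current DOM of every frame of the tab. A minimal sketch of the same idea (hypothetical, not Hoardy-Web’s actual code; it uses the Firefox MV2 tabs API and is meant to be run from an extension console):
#+begin_src javascript
// Serialize the current DOM of every frame of the active tab, one HTML string per frame.
const [tab] = await browser.tabs.query({ active: true, currentWindow: true });
const snapshots = await browser.tabs.executeScript(tab.id, {
  allFrames: true, // one result per frame, like the popup button captures
  code: "document.documentElement.outerHTML",
});
console.log(snapshots); // an array of raw HTML strings
#+end_src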
How can I make Hoardy-Web
capture a web page completely, especially when parts of it are loaded lazily?
In the most general case, you will have to scroll the page around and click random buttons and media elements.
Hoardy-Web has no “autopilot” for doing this, nor will it ever get one, at least not as part of the Hoardy-Web extension, since “autopiloting” is very website-specific.
So, at the moment, the most general semi-automated solution is to run a website-specific UserScript via Tampermonkey or some such, wait until everything finishes loading, and then take a snapshot.
(Hoardy-Web
will get an integration for automating that, eventually.)
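As an illustration of what such a website-specific UserScript might look like, here is a minimal sketch; it is hypothetical, and the @match pattern and timings are assumptions. It just keeps scrolling the page so lazily-loaded content gets fetched before you take a snapshot:
#+begin_src javascript
// ==UserScript==
// @name   Auto-scroll to trigger lazy loading (hypothetical example)
// @match  https://example.org/*
// @grant  none
// ==/UserScript==
// A toy “autopilot”: scroll down one screen per second until the bottom is
// reached, so lazily-loaded content gets fetched; then take a DOM snapshot
// from the Hoardy-Web popup manually.
(function () {
  function step() {
    window.scrollBy(0, window.innerHeight);
    if (window.scrollY + window.innerHeight < document.body.scrollHeight)
      setTimeout(step, 1000); // give the page a moment to load new content
  }
  step();
})();
#+end_src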
On the other hand, if you
- run Hoardy-Web under Firefox,
- just want to load all lazily-loaded images the page already has (NOT load more stuff), and
- the page in question uses modern HTML5 lazy loading attributes instead of using JavaScript to do the same,
then you can simply go to about:config and toggle dom.image-lazy-loading.enabled to false.
All images will start being loaded eagerly after that.
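If you would rather not flip a global preference, a per-page alternative is a snippet like the following (hypothetical; paste it into the browser’s developer console, and note it only affects images that use the standard loading attribute):
#+begin_src javascript
// Force all natively lazy-loaded images on the current page to load eagerly.
for (const img of document.querySelectorAll('img[loading="lazy"]')) {
  img.loading = "eager"; // the standard HTML5 lazy-loading attribute
}
#+end_src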
Can I use Hoardy-Web
to capture a web page without archiving it, look at it, decide if I want to save it, and archive it only if I do, all without reloading the page a second time?
Yes. This is why the =Pick into limbo= setting exists. See above for more info.
In combination with =Closed tabs= options you can implement any of the following workflows:
- archive everything by default, but allow excluding some things by manually discarding them from limbo;
- only archive things that are explicitly manually collected, discard everything else by default.
Why can pages under https://addons.mozilla.org/ and https://chromewebstore.google.com/ not be captured by Hoardy-Web?
Browsers prevent extensions from running on extension store pages to keep them from manipulating ratings, reviews, and other such things.
However, you can archive https://addons.mozilla.org/ pages by running Hoardy-Web
under Chromium and https://chromewebstore.google.com/ pages by running Hoardy-Web
under Firefox.
When running Hoardy-Web
under Chromium, a lot of my captures fail with debugger::capture::EMIT_FORCED::BY_DETACHED_DEBUGGER
, debugger::capture::NO_RESPONSE_BODY::DETACHED_DEBUGGER
, webRequest::capture::CANCELED::NO_DEBUGGER
, and similar errors. What do I do?
You are either
- pressing the Cancel or Close (cross) buttons in Chromium’s popup-toolbar telling you about the debugger being enabled, and so Chromium detaches it, breaking everything (see there);
- pressing the Space or Escape keyboard keys when doing things in Chromium’s UI, but nothing at that particular moment reacts to the key you pressed, except there is that popup-toolbar… and so Chromium decides it must mean you want to press the Cancel button there… and detaches the debugger, breaking everything (again);
  yes, this is really annoying, and it is a common problem for me, since I usually page-down using Space and press Escape a lot (usually to cancel selection, but sometimes also out of the trauma of being a long-time Vim user);
  the only solution to this I know of is to just not touch the keyboard at all, at least while things are still loading; i.e. just click on stuff using the mouse/track-point/touch-pad/touchscreen/etc, wait for the T (“Tracking”) mark to vanish from the extension’s badge, and only then let your (grabby and impatient for exercise via keyboard shortcuts) fingers touch the keyboard;
  even then, Chromium will detach debuggers from time to time seemingly at random, but at least it will be rare enough that you won’t need to reload much;
- trying to capture large or media files; as discussed there, this has no workaround; run Hoardy-Web under Firefox instead.
Also, Chromium will occasionally detach its debugger at random; it just happens.
When running Hoardy-Web
under Firefox, some of my captures fail with webRequest::capture::RESPONSE::BROKEN
. What do I do?
This is a rare error caused by a race condition between a webpage’s service/shared worker and the browser’s networking code.
Usually, you can ignore this error, since loading another related page is likely to fulfill the same URL.
However, if this happens a lot to you, or if it annoys you, you can go to about:config
, toggle dom.serviceWorkers.enabled
to false
, and restart the browser.
Alternatively, you can use NoScript
or some such extension to disable JavaScript
, and thus the offending service/shared workers, on the page in question.
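Another per-site option, instead of disabling service workers globally, is to unregister the offending workers for that origin. This is a hypothetical snippet, not something Hoardy-Web does for you; paste it into the developer console of the affected page:
#+begin_src javascript
// Unregister every service worker registered for the current origin.
// (They may re-register on the next load unless you also block the page’s JavaScript.)
const regs = await navigator.serviceWorker.getRegistrations();
for (const reg of regs) {
  await reg.unregister();
}
console.log(`unregistered ${regs.length} service worker registration(s)`);
#+end_src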
Did you read the notes on the bugs of the browser you are using?
Most notably:
- both Firefox- and Chromium-based browsers in their default builds fail to properly supply POST request data to their extensions; for Firefox-based browsers there exists a patch that fixes it, mostly; Chromium users are out of luck at the moment;
- on a Chromium-based browser, because of limitations of Chromium’s debugging interface, it is impossible to properly capture media files (both audio and video) and large files in general; this issue has no good workaround and, AFAIK, all alternatives to Hoardy-Web running on Chromium-based browsers suffer from it (and work around it by silently re-downloading said files a second time in the background); try using Hoardy-Web under a Firefox-based browser instead.
The documentation claims that all Hoardy-Web
archival methods except for submission via =HTTP= are unsafe. Why?
Archival by exporting using =saveAs= (generation of fake-Downloads) can fail and **lose a bit of your collected data at a time** if you press a wrong button in your browser’s UI, mis-reconfigure your browser a bit, or your disk runs out of space unexpectedly.
Archival to the browser’s local storage (which is what Hoardy-Web does by default) can **lose all your collected data at the same time** if you uninstall the extension by accident.
Meanwhile, archival by submission via =HTTP= has none of these problems (a sketch of the server side of this handshake follows this list):
- Hoardy-Web will keep each reqres in memory until the archiving server responds with 200 OK for that reqres;
- the archiving server will only respond with a 200 OK response to Hoardy-Web after the dump is written and fsync-ed to disk;
- the archiving server never deletes any of your archived data; by using an archiving server, you can only lose your archived data if you go to its directory and delete some of it yourself, or if your disk dies, or if your file system gets corrupted; all of those problems are solved by regular backups.
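To make the ordering guarantee above concrete, here is a minimal sketch of such a server in Node.js. It is hypothetical and is NOT the real =hoardy-web-sas=; the port number and the file naming scheme are assumptions. The point is only the order of operations: write, fsync, and only then reply 200 OK.
#+begin_src javascript
// Hypothetical minimal archiving endpoint: write the dump, fsync it,
// and only then answer 200 OK, so the extension may drop it from memory.
const http = require("http");
const fs = require("fs");
const path = require("path");

fs.mkdirSync("dumps", { recursive: true });

http.createServer((req, res) => {
  const chunks = [];
  req.on("data", (chunk) => chunks.push(chunk));
  req.on("end", () => {
    const file = path.join("dumps", `${Date.now()}.dump`); // naming scheme is made up
    const fd = fs.openSync(file, "w");
    fs.writeSync(fd, Buffer.concat(chunks));
    fs.fsyncSync(fd); // make sure the data actually reached the disk
    fs.closeSync(fd);
    res.writeHead(200); // only now does the extension consider the reqres archived
    res.end();
  });
}).listen(3210); // the port is an assumption, not a real default
#+end_src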
Archival to the browser’s local storage was added because it was very easy to implement after the stash was added. It is the default because it usually works fine, it properly reports errors, has the most consistent behaviour across all browsers, and does not require the user to install any Python code, which helps with on-boarding.
In the ideal world, browsers would provide a better saveAs
API which would have a less annoying UI for the user and would return out-of-disk-space errors to the extension, in which case exporting via =saveAs= would be the default.
As it is now, the only way to be absolutely sure your data is properly forever-saved to disk when the extension reports it as archived is to use submission via =HTTP=.
When running Hoardy-Web
under Firefox, enabling export via =saveAs= makes the browser’s UI quite annoying. Can it be fixed?
Yes, go to about:config
and toggle browser.download.alwaysOpenPanel
to false
.
If the whole content of this page (not just this section; did you try searching for stuff with Control+F? there’s a lot of info here) does not explain your problem, {{{reportit()}}}.