PageGeometryCaptureSolutions
This is an editable place to discuss https://github.com/mozilla/fathom/issues/69.
Partway through implementing https://github.com/mozilla/fathom/issues/18, I noticed that it takes more than simply HTML to reconstitute the geometry of a page. That is, a mere headless browser does not suffice to deliver the additional signal we wanted. Let's choose a solution.
Requirements:
- Don't require the network at test-running or tuning time.
- Have a plan for expanding to arbitrary future pieces of data, not just geometry.
Here are some possibilities:
One solution, Preserve Assets, would be to include the CSS, images, and other assets in our test data, but saving a page that way would necessarily change the attrs and such (img src values, for instance), which a ruleset might want to examine.
Pros:
- Can see the original page in all its visual glory at any time, to orient rule developers and help them intuit new rules to try.
- Can extract any new, unanticipated data from old pages for trying new rules.

Cons:
- Repository bloat (mitigated by breaking the product-detection work off into its own repo, which isn't a bad idea anyway).
- Have to run tuning and at least some tests in a (possibly headless) browser, with the attendant complexity of Karma and sockets and timeouts and leaks and and and....
Safari does a darn good job of this sort of saving, even catching (as Firefox doesn't) CSS-based background images. It stuffs everything into a huge plist file (a webarchive), recording the original URL and base64'd contents of each resource. We could extract from it trivially, but I'm not sure how to get it reconstituted in anything but Safari. As an extra-special bonus, img src attrs appear to still be the remote URLs when examined from JS, but they load just the same.
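To give a sense of how trivial the extraction would be, here is a minimal sketch. It assumes the webarchive has first been converted to an XML plist with `plutil -convert xml1 -o page.xml page.webarchive`, and it uses the npm `plist` package and the standard webarchive key names (WebMainResource, WebSubresources, etc.); treat the details as assumptions to verify, not a settled plan.

```js
// extract-webarchive.js -- sketch: pull the saved resources back out of a
// Safari webarchive that has been converted to an XML plist.
const fs = require('fs');
const path = require('path');
const plist = require('plist');  // npm install plist

const archive = plist.parse(fs.readFileSync('page.xml', 'utf8'));

// The main HTML document:
fs.writeFileSync('index.html', archive.WebMainResource.WebResourceData);

// Every subresource (images, CSS, etc.), named after its original remote URL:
fs.mkdirSync('assets', {recursive: true});
for (const res of archive.WebSubresources || []) {
  const name = encodeURIComponent(res.WebResourceURL);
  fs.writeFileSync(path.join('assets', name), res.WebResourceData);
}
```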
Chrome does a middling job. It saves as a regular HTML page plus a folder full of assets, and it also pulls down background-image resources. However, it is bad at showing them again. For example, an Amazon product page, saved and then loaded without network access, will not show the shopping-cart icon in the navbar, which is made of a CSS background image whose CSS is generated and attached to nodes by JS. Unsurprisingly, the original, non-local address of the image remains in the JS. This shows that even replacing the `<img src>` attrs would be grossly insufficient on real-world pages.
Another solution: add more mocking to our in-JS DOM representation (jsdom or domino) so it claims to know the sizes and positions of things. Or do that mocking off to the side, with special lookup procedures. We'd have to crawl the original page and squirrel away each [interesting] node's geometry information for later. Implicit in this is the need for a way to durably address each node. (Worst case, you could use the path from root (html.1/body.1/div.1/div.2/p.3) or the index of the tag's opener within the file, either in terms of "the nth tag in the file" or "tag at byte offset n". Hopefully there's a better idea. At least the first one you can figure out if all you have is the node, by walking upward to the root and looking around at siblings. Kind of expensive, though. Of course, we could do it once per page and then just keep a big hash in RAM. But I'm getting ahead of myself.)
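A minimal sketch of that worst-case addressing scheme, assuming a standard DOM (the function names are mine, nothing in Fathom): walk up from the node, counting same-tag preceding siblings, and join the segments into a path like html.1/body.1/div.1/div.2/p.3; the reverse lookup walks back down.

```js
// Sketch: durably address a node by its tag-qualified path from the root,
// e.g. "html.1/body.1/div.1/div.2/p.3" (1-based index among same-tag siblings).
function pathFromRoot(node) {
  const segments = [];
  for (let el = node; el && el.nodeType === 1; el = el.parentNode) {
    let index = 1;
    for (let sib = el.previousElementSibling; sib; sib = sib.previousElementSibling) {
      if (sib.tagName === el.tagName) index += 1;
    }
    segments.unshift(`${el.tagName.toLowerCase()}.${index}`);
  }
  return segments.join('/');
}

// Resolve a path back to a node in a (possibly separate) parse of the same HTML.
function nodeAtPath(document, path) {
  let el = document.documentElement;  // corresponds to the leading "html.1"
  for (const segment of path.split('/').slice(1)) {
    const [tag, index] = segment.split('.');
    let seen = 0;
    let next = null;
    for (const child of el.children) {
      if (child.tagName.toLowerCase() === tag && ++seen === Number(index)) {
        next = child;
        break;
      }
    }
    if (!next) return null;
    el = next;
  }
  return el;
}
```

The same sibling-counting walk works in jsdom, domino, or a real browser, so the addresses computed at capture time should resolve against our in-JS DOM later.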
Pros:
- No browser needed during tests and tuning.

Cons:
- If we wanted to capture more about the page later, there's no guarantee we have access to the original. (Mitigation: capture a LOT, like all the attrs and computed CSS of every node, along with their dimensions and positions. Also, save a webarchive as well, just for human reference and, if the effort warrants, future mechanical extraction.)
- More coding to do? Not sure.
- Will storage requirements be even larger than with the Preserve Assets approach? Estimate.
Implementation ideas:
- Use Selenium to crawl the page and write locations to a file? (I have this somewhat working, I think.) See the sketch below.
- Node-to-number map
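A minimal sketch of that Selenium idea, using the `selenium-webdriver` npm package. Keying the records by the path-from-root scheme sketched above, and the JSON output format, are my assumptions, not settled decisions; per the mitigation above, the same in-page script could also capture attrs and computed CSS.

```js
// capture-geometry.js -- sketch: crawl a live page and dump each element's
// geometry to a JSON file, keyed by a path-from-root address.
const fs = require('fs');
const {Builder} = require('selenium-webdriver');  // npm install selenium-webdriver

async function captureGeometry(url, outFile) {
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get(url);
    // Runs inside the page; returns a {path: {x, y, width, height}} map.
    const geometry = await driver.executeScript(() => {
      // Copy of the pathFromRoot sketch above, inlined because this function
      // is serialized and executed in the page's context.
      function pathFromRoot(el) {
        const segments = [];
        for (; el && el.nodeType === 1; el = el.parentNode) {
          let index = 1;
          for (let sib = el.previousElementSibling; sib; sib = sib.previousElementSibling) {
            if (sib.tagName === el.tagName) index += 1;
          }
          segments.unshift(el.tagName.toLowerCase() + '.' + index);
        }
        return segments.join('/');
      }
      const records = {};
      for (const el of document.querySelectorAll('*')) {
        const rect = el.getBoundingClientRect();
        records[pathFromRoot(el)] = {
          x: rect.x, y: rect.y, width: rect.width, height: rect.height
        };
      }
      return records;
    });
    fs.writeFileSync(outFile, JSON.stringify(geometry, null, 2));
  } finally {
    await driver.quit();
  }
}

// Hypothetical usage; substitute a real page from the test corpus.
captureGeometry('https://example.com/some-product-page', 'geometry.json')
  .catch(console.error);
```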