# GSoC 2021 Work Product Submission
Hello there! Over the summer of 2021, I worked on improving the cloning and serving capabilities of Snare.
Specifically, I worked on introducing headless cloning through Pyppeteer, upgrading aiohttp to a version compatible with Tanner and adding support for newer versions of Python. Apart from these, I also worked on a few issues that helped make Snare more complete and provide a better overall user experience.
- Headless cloning
- Architectural redesign
- Retrying URLs
- aiohttp and Python 3.9
- Error handling
- Redirects
- Fingerprinting
- Unittesting
- Bug fixes and minor improvements
- Documentation
- Future
## Headless cloning

In some cases, the classic method of `curl`-ing or `requests.get`-ing might not provide us with the complete webpage. This can be due to a variety of reasons: the User-Agent, the viewport, lazy loading, or AJAX calls fired based on cursor movement. Even though the user agent can be spoofed, a few issues remain that cannot be solved the conventional way. Enter headless browsing.
In a nutshell, headless browsing means using an actual browser instance, without a GUI, whose actions can be programmed and automated. Selenium is one such battle-tested tool for browser automation, and we initially chose it for this very reason. However, it later struck us that the entirety of Snare runs asynchronously while Selenium is meant to be run synchronously. Hence, we shifted to Pyppeteer, the Python port of the JavaScript tool Puppeteer, which works asynchronously and fits in very well.
Headless cloning can now be enabled by adding the `--headless` flag to the cloner call, like `clone --target http://example.com --path example-site --headless`.
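As a rough sketch of what a headless fetch looks like (not Snare's actual code — `headless_fetch` is a hypothetical helper, and the Pyppeteer calls follow its documented API; the import is deferred so the sketch reads without Pyppeteer installed):

```python
import asyncio


async def headless_fetch(url):
    """Fetch the fully rendered HTML of ``url`` via a headless browser."""
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url)          # JS and AJAX actually run here
        return await page.content()   # the rendered DOM, not the raw response
    finally:
        await browser.close()
```

This is what lets the cloner capture content that `requests.get` would miss: the page is rendered by a real browser engine before its HTML is read back.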
Link to PR: #294
To incorporate changes for headless cloning, the data-fetching part had to be split into a separate function, but that was not all. We had one too many functions under a single class that served different purposes; this called for separate classes.
There were two ways to proceed:

- Keep the `Cloner` class as is and introduce a `HeadlessCloner` class that overrides the `fetch_data` method.
- Separate the core functionality of the cloner into `BaseCloner`, an abstract class, and define `fetch_data` for `SimpleCloner` and `HeadlessCloner`. Finally, provide a common interface through `CloneRunner`.
We collectively decided to proceed with the second approach for the sake of cleaner design.
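The chosen design can be sketched like this; the class names mirror the ones described above, but the bodies are simplified stand-ins, not Snare's implementation:

```python
import abc
import asyncio


class BaseCloner(abc.ABC):
    """Core cloning logic shared by all cloner variants."""

    def __init__(self, root):
        self.root = root  # shared state and helpers live here

    @abc.abstractmethod
    async def fetch_data(self, url):
        """Each cloner variant decides how a page is fetched."""


class SimpleCloner(BaseCloner):
    async def fetch_data(self, url):
        # The real class fetches with aiohttp.
        return f"plain fetch of {url}"


class HeadlessCloner(BaseCloner):
    async def fetch_data(self, url):
        # The real class drives a Pyppeteer browser page.
        return f"headless fetch of {url}"


class CloneRunner:
    """Common interface: picks a cloner once, the rest of Snare talks to it."""

    def __init__(self, root, headless=False):
        self.runner = HeadlessCloner(root) if headless else SimpleCloner(root)

    async def fetch_data(self, url):
        return await self.runner.fetch_data(url)
```

The rest of the cloning pipeline only ever sees `CloneRunner`, so switching between simple and headless cloning is a single constructor flag.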
## Retrying URLs

Headless cloning brought along a few challenges, one of them being request failures (from timeouts, for example). Initially, quite a number of requests erred out, resulting in webpages not being scraped. To tackle this, a `try_count` key was added to the URL item (a dictionary) in the URLs queue. If there is an error in fetching the data, the same URL is added back to the queue with its `try_count` increased by 1. A single URL can be tried a maximum of 3 times before it is discarded.

```python
url_item = {"url": "example.com", "level": 0, "try_count": 1}
```
A change, as small and simple as this, resulted in a natural increase in the number of pages cloned and thus, better reliability in the cloning process.
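The retry loop can be sketched as follows; `drain` and `flaky_fetch` are hypothetical stand-ins for the real queue-processing and data-fetching code:

```python
from collections import deque

MAX_TRIES = 3


def drain(queue, fetch):
    """Process url_items, re-queueing failures until MAX_TRIES is reached."""
    cloned, discarded = [], []
    while queue:
        item = queue.popleft()
        try:
            cloned.append(fetch(item["url"]))
        except Exception:
            if item["try_count"] < MAX_TRIES:
                # Same URL, one more attempt recorded.
                queue.append({**item, "try_count": item["try_count"] + 1})
            else:
                discarded.append(item["url"])
    return cloned, discarded


# A fetch that times out twice before succeeding, to exercise the retries.
attempts = {}

def flaky_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if attempts[url] < 3:
        raise TimeoutError(url)
    return f"data for {url}"
```

With the `url_item` above, the first two attempts fail and are re-queued, and the third succeeds — exactly the kind of transient failure that used to cost a page.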
Link to PR: #298
## aiohttp and Python 3.9

Two of the most daunting tasks in my proposal were upgrading the aiohttp library to v3.7.4, the same version used by Tanner, and adding support for newer versions of Python — v3.8 and v3.9. As encountered in #244, there was an issue with Snare serving empty pages or raising connection reset errors, the root cause being the Python and aiohttp versions. To everyone's relief, the task turned out to be very easy, as Snare worked out of the box with Python 3.9 and aiohttp v3.7.4. 😄
## Error handling

While testing, I came across a strange issue where the meta info was not written to `meta.json` when a `KeyboardInterrupt` was raised. After some research and help from my mentors, we identified the issue to be with exception handling in asyncio event loops: `run_until_complete` propagates an exception from the point where it is raised to the point where the loop run is invoked. This meant that the `KeyboardInterrupt` could not be handled within the cloner.
To overcome this, a `close` method was introduced in the `CloneRunner` class to close all open connections and write the meta info.
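A minimal sketch of the fix, using simplified stand-ins for Snare's classes: the `KeyboardInterrupt` surfaces at the `run_until_complete` call site, so the cleanup lives in `close()`, invoked from outside the loop:

```python
import asyncio
import json


class CloneRunner:
    def __init__(self):
        self.meta = {}

    async def run(self):
        self.meta["/"] = {"hash": "abc123"}
        # Simulate the user pressing Ctrl-C mid-clone: the exception
        # propagates out to the run_until_complete() call below.
        raise KeyboardInterrupt

    def close(self):
        # Close open connections (omitted here) and serialize the meta
        # info, which would normally be written to meta.json.
        return json.dumps(self.meta)


def main():
    runner = CloneRunner()
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(runner.run())
    except KeyboardInterrupt:
        pass  # handled here, outside the coroutine, as described above
    finally:
        meta_json = runner.close()
        loop.close()
    return meta_json
```

Because `close()` runs in the `finally` block, the meta info survives an interrupt at any point in the clone.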
Link to PRs:
## Redirects

In cases where sites redirected, the cloner had a tough time fixing and following links. For example, there were a lot of problems with broken relative links when the home URL was shifted. To enable redirects, the returned URL is compared with the requested URL and a key in the meta info is added accordingly. If a "redirect" key is present, a 302 exception is raised and the site redirects to the new URL.
For example, if `/` redirected to `/new/home/`, `meta.json` would look like this:
```json
{
    "/": {
        "redirect": "/new/home/"
    },
    "/new/home/": {
        "hash": "abc123",
        "headers": [
            {
                "Server": "ABC"
            }
        ]
    }
}
```
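The serving-side lookup can be sketched like this (a simplified stand-in for Snare's handler — in Snare the 302 is raised as an aiohttp exception rather than returned; the `meta` dict mirrors the `meta.json` example):

```python
meta = {
    "/": {"redirect": "/new/home/"},
    "/new/home/": {"hash": "abc123", "headers": [{"Server": "ABC"}]},
}


def resolve(path):
    """Return (status, payload) for a requested path."""
    entry = meta.get(path, {})
    if "redirect" in entry:
        # Snare raises a 302 exception here; we just return the target.
        return 302, entry["redirect"]
    return 200, entry.get("hash")
```

A request for `/` yields a 302 pointing at `/new/home/`, and the follow-up request is served from the stored page whose content hash is `abc123`.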
Link to PRs:
## Fingerprinting

Snare claims to be an Nginx web server on the outside, with aiohttp running under the hood. To solidify this claim, it was crucial to make sure the Snare web server did not leak the Server header. Basic fingerprinting methods involved checking the order of response headers and sending malformed requests to trigger various exceptions.
Though a 400 exception cannot currently be fully handled in aiohttp, the 302, 404, and 500 responses now send proper headers. Additionally, Snare has been configured to drop the Server header altogether if it would expose the aiohttp server banner.
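As an illustrative sketch of that header hygiene (not Snare's exact code — the helper name and the nginx banner value are assumptions):

```python
def sanitize_headers(headers):
    """Drop a leaking aiohttp Server banner and present the claimed one."""
    cleaned = {}
    for name, value in headers.items():
        if name.lower() == "server" and "aiohttp" in value.lower():
            continue  # drop any header that exposes the aiohttp banner
        cleaned[name] = value
    cleaned.setdefault("Server", "nginx")  # the server Snare claims to be
    return cleaned
```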
Link to PR: #308
## Unittesting

The architectural changes to accommodate headless cloning required the tests to be partially rewritten. I learnt a lot about writing proper tests from my mentors during this period.
Link to PR: #304
## Bug fixes and minor improvements

CSS validation in the cloner now properly logs errors and warnings to the log file instead of stdout. This was done to reduce visual clutter while running the cloner.
Link to PR: #297
There was an issue with the `Transfer-Encoding` header while serving webpages with Snare. Websites can opt to transfer data in chunks so that data from various sources reaches the viewer reliably. However, when data is sent in chunks, the `Content-Length` header must not be present, as all the relevant info for the transfer is contained in the chunks themselves. Since the cloner aggregates all site data into a single file, the `Transfer-Encoding` header was dropped.
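In sketch form (`fix_headers` is a hypothetical helper, not Snare's code): drop any stale `Transfer-Encoding` value and let `Content-Length` describe the single aggregated file instead:

```python
def fix_headers(headers, body):
    """Replace chunked-transfer headers with a plain Content-Length."""
    cleaned = {k: v for k, v in headers.items()
               if k.lower() != "transfer-encoding"}
    cleaned["Content-Length"] = str(len(body))
    return cleaned
```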
Since newer versions of libraries might contain crucial vulnerability fixes, it is always good to update them, but this is a hassle for developers and maintainers. Creating `requirements.txt` without version specifications can lead to breaking changes from newer major versions, while pinning exact versions prevents minor updates and bug fixes. As a middle ground, the tilde (`~=`) specifier can be used. Refer to PR #306 for a better explanation.
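For instance, a `requirements.txt` entry using pip's compatible-release specifier (the aiohttp version is the one mentioned above; this line is an illustration, not Snare's exact requirements file):

```
# ~=3.7.4 allows 3.7.5, 3.7.6, ... but not 3.8 or 4.0
aiohttp~=3.7.4
```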
Link to PR: #306: Newer libraries, lax requirements and encoding fix
## Documentation

Documentation is the backbone of any software. @mzfr suggested the use of docstrings, similar to what had been done in Tanner. Docstrings in the Sphinx format can be used to autogenerate developer documentation, as Snare's documentation is also built with Sphinx.
Since we have moved past Python 3.5, type hinting has also been used.
Link to PRs:
## Future

In this 10-week period, there were a few ideas that we discussed but could not proceed with. One such idea was framework integration.
Currently, given a website, Snare clones and serves it, working in tandem with Tanner. The idea here is to leverage Snare's capabilities to communicate with Tanner, prepare responses, and integrate them into another website's source.
At the moment, Flask and Django are good candidates for integration since Snare is written in Python.
This idea is in its infancy and thus, an approach can be decided only after a healthy amount of discussion. Please visit Snare's issues section for further discussion on this.