The purpose of this project is to prototype a distributed crawler for the DuckDuckGo search engine.
- A client requests a list of domains to check for spam, and the server answers with such a list
- The server may also include additional data in the response to ask the client to upgrade itself or its page analysis component
- The client analyzes the domains, then sends the results back to the server
- The client then requests another batch of domains to check, and so on
- It's a classic REST API
- To get a domain list the client sends a GET request, and to post the results it sends a POST request (see the sketch below)
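As a rough illustration, one iteration of the client loop could look like the following sketch. It only shows the GET/POST flow using httplib2 (the listed dependency); the server address, the 'name'/'spam' attributes, the 'results' element and the analyze callable are assumptions for the example, not the actual protocol implemented by ddc_client.py and ddc_server.py:

    import httplib2
    import xml.etree.ElementTree as ET

    SERVER_URL = "http://localhost:8080/"  # assumed address; the real port is set by ddc_server.py

    def crawl_once(analyze):
        """One iteration of the client loop: GET a domain list, POST the results.

        'analyze' is the page analysis callable (domain -> spam verdict); the
        result document format below is an illustrative assumption.
        """
        http = httplib2.Http()
        # GET request: ask the server for a batch of domains to check
        response, content = http.request(SERVER_URL, "GET")
        if response.status != 200:
            raise RuntimeError("server returned HTTP %d" % response.status)
        # The response is the XML document described in the next section
        domains = [d.get("name") for d in ET.fromstring(content).iter("domain")]
        # Build a result document and POST it back to the server
        results = ET.Element("results")
        for domain in domains:
            ET.SubElement(results, "domain", name=domain, spam=str(analyze(domain)))
        http.request(SERVER_URL, "POST", body=ET.tostring(results),
                     headers={"Content-Type": "text/xml"})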
- version : the protocol version, which defines the XML response structure; it must be incremented when a change breaks client compatibility. The server must always handle all old protocol versions, at least to tell clients they must upgrade
- pc_version : the version of the page processing binary component
It contains one of these nodes immediately below the root:
- 'upgrades' : can contain nodes telling the client to upgrade its components (with a URL to download the new version)
- 'domainlist' : the list of domains to check ('domain' nodes)
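For illustration, a client-side handler for such a response might look like the sketch below. The root attributes and node names follow the lists above, but the 'url' and 'name' attribute names, the SUPPORTED_PROTOCOL_VERSION value and the upgrade_component callable are assumptions, not the actual implementation:

    import xml.etree.ElementTree as ET

    SUPPORTED_PROTOCOL_VERSION = 1  # assumed protocol version supported by this client

    def handle_response(xml_bytes, upgrade_component):
        """Dispatch on the node found under the root of a server response.

        'upgrade_component' is a hypothetical callable taking a download URL.
        """
        root = ET.fromstring(xml_bytes)
        # Root attributes described above: protocol version (and pc_version)
        if int(root.get("version")) > SUPPORTED_PROTOCOL_VERSION:
            raise RuntimeError("protocol too recent, client must be upgraded")
        upgrades = root.find("upgrades")
        if upgrades is not None:
            # The server asks the client to upgrade one or more components
            for node in upgrades:
                upgrade_component(node.get("url"))
            return []
        # Otherwise the response carries the list of domains to check
        domainlist = root.find("domainlist")
        return [d.get("name") for d in domainlist.findall("domain")]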
- ddc_client.py : Code for a crawling worker
- ddc_process.py : Code that simulates the binary page processing component; it currently returns dummy results for simulation purposes (see the sketch after this list)
- ddc_server.py : Code for the server that distributes the crawling work to the clients and collects the results from them
- tests/single_client.sh : Bash script to do a small simulation by launching the server and connecting a client to it
- tests/client_upgrade.sh : Bash script to simulate a client upgrade initiated by the server
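To give an idea of what the simulated component in ddc_process.py does, a dummy analysis could be as simple as the sketch below; the function name and signature are assumptions, only the idea of returning placeholder results is taken from the description above:

    import random

    def is_spam(domain):
        # Dummy analysis: return a random verdict instead of actually
        # fetching and analyzing the domain's pages
        return random.random() < 0.5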
On recent Ubuntu versions, you can install all dependencies by running the following command line:
sudo apt-get -V install python3 python3-httplib2
The code has only been tested on Linux but is fully OS neutral.