> **Note:** This repository was archived by the owner on Jan 4, 2018. It is now read-only.


# Couch Crawler

A search engine built on top of couchdb-lucene.

## Dependencies

- CouchDB
- couchdb-lucene
- Python

Optionally, for Yammer spidering:

## Installation

Assuming couchdb-lucene was installed to the "_fti" endpoint, you can push Couch Crawler to your CouchDB instance with the command:

```
cd couchapp
couchapp push
```

This will create a new CouchDB database called "crawler" on the CouchDB instance at localhost:5984. To target a different database, edit couchapp/.couchapprc and run `couchapp push` again.
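If you need to retarget the push, a minimal `.couchapprc` looks like the sketch below, assuming couchapp's standard `env` layout; the URL shown is the default described in this README.

```json
{
  "env": {
    "default": {
      "db": "http://localhost:5984/crawler"
    }
  }
}
```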

To configure the crawler, copy python/couchcrawler-sample.cfg to python/couchcrawler.cfg and fill out the appropriate configuration values.
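The sample file's exact keys are not shown in this README, so the snippet below only sketches how a Python tool might read such a file with the standard `configparser` module; the `[couchdb]` section and its keys are hypothetical, and the real names live in python/couchcrawler-sample.cfg.

```python
from configparser import ConfigParser

# Hypothetical contents -- check python/couchcrawler-sample.cfg for the real keys.
SAMPLE_CFG = """
[couchdb]
url = http://localhost:5984/
db = crawler
"""

config = ConfigParser()
config.read_string(SAMPLE_CFG)

# Pull out the values the crawler would need to reach CouchDB.
couch_url = config.get("couchdb", "url")
couch_db = config.get("couchdb", "db")
print(couch_url, couch_db)
```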

To start indexing pages, run the crawler script:

```
cd python
./scrapy-ctl.py crawl domain_to_crawl.com
```

While it's indexing, you can visit the search engine at the following URL:

`http://localhost:5984/crawler/_design/crawler/index.html`

## Spiders

The crawler currently has spiders for:

- MediaWiki
- Twiki
- Yammer

It's pretty easy to create your own. See python/couchcrawler/spiders/wiki.py for an example, or the Scrapy documentation for a more in-depth explanation.