
Commit

Merge branch 'master' of github.com:paulvalla/node-crawler
paulvalla committed Nov 8, 2014
2 parents 23476c0 + 93ee443 commit 8f6e992
Showing 1 changed file with 87 additions and 40 deletions: README.md
node-crawler aims to be the best crawling/scraping package for Node.

It features:
* A clean, simple API
* server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM
* Configurable pool size and retries
* Priority of requests
* forceUTF8 mode to let node-crawler deal for you with charset detection and conversion
Crash course
------------

```javascript
var Crawler = require("crawler");
var url = require('url');

var c = new Crawler({
    maxConnections : 10,

    // This will be called for each crawled page
    callback : function (error, result, $) {

        // $ is Cheerio by default
        // a lean implementation of core jQuery designed specifically for the server
        $('a').each(function(index, a) {
            var toQueueUrl = $(a).attr('href');
            c.queue(toQueueUrl);
        });
    }
});

// Queue just one URL, with default callback
c.queue('http://joshfire.com');

// Queue a list of URLs
c.queue(['http://jamendo.com/', 'http://tedxparis.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: function (error, result) {
        console.log('Grabbed', result.body.length, 'bytes');
    }
}]);

// Queue using a function
var googleSearch = function(search) {
    return 'http://www.google.fr/search?q=' + search;
};

c.queue({
    uri: googleSearch('cheese')
});

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
```
For more examples, look at the [tests](https://github.com/sylvinus/node-crawler/tree/master/test).

Options reference
-----------------
You can pass these options to the Crawler() constructor to make them global, or to each queue() call to make them specific to that item (overriding the global options). Options that node-crawler does not handle itself are passed directly to the request() method.

Basic request options:

* `uri`: String, the URL you want to crawl
* `timeout`: Number, in milliseconds (Default 60000)
* [All mikeal's requests options are accepted](https://github.com/mikeal/request#requestoptions-callback)
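For instance, since unrecognized options are handed straight to request(), a queue item can mix node-crawler options with plain request options. A minimal sketch (the URL and headers below are placeholders, not part of the API):

```javascript
c.queue({
    uri: 'http://example.com/',            // placeholder URL
    timeout: 30000,                        // give up after 30 seconds
    // any other key is passed straight to request(), e.g. custom headers
    headers: { 'Accept-Language': 'en' }
});
```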

Callbacks:

* `callback(error, result, $)`: A request was completed
* `onDrain()`: Called when there are no more queued requests, as in the sketch below
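A rough sketch of how the two callbacks fit together, assuming you simply want to exit once the queue is empty (the URL is a placeholder):

```javascript
var c = new Crawler({
    callback: function (error, result, $) {
        if (error) {
            console.error(error);
            return;
        }
        console.log('Fetched', result.body.length, 'bytes');
    },
    onDrain: function () {
        // no more queued requests, safe to stop
        process.exit(0);
    }
});

c.queue('http://example.com/'); // placeholder URL
```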

Pool options:

* `maxConnections`: Number, Size of the worker pool (Default 10),
* `priorityRange`: Number, Range of acceptable priorities starting from 0 (Default 10),
* `priority`: Number, Priority of this request (Default 5),
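A minimal sketch of sizing the pool and favouring one request over the bulk of the queue (placeholder URLs; priority 0 is used here on the assumption that it is served before the default of 5):

```javascript
var c = new Crawler({
    maxConnections: 2,   // only two requests in flight at a time
    priorityRange: 10    // accept priorities 0..9
});

// bulk pages at the default priority (5)
c.queue(['http://example.com/page1', 'http://example.com/page2']);

// a page we care about more, queued with a non-default priority
c.queue({
    uri: 'http://example.com/important',
    priority: 0
});
```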

Retry options:

* `retries`: Number of retries if the request fails (Default 3),
* `retryTimeout`: Number of milliseconds to wait before retrying (Default 10000),
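For example, to be more patient with a flaky endpoint you might raise both values; a sketch with a placeholder URL and arbitrarily chosen numbers:

```javascript
c.queue({
    uri: 'http://example.com/sometimes-down', // placeholder URL
    retries: 5,           // try up to 5 more times on failure
    retryTimeout: 30000   // wait 30 seconds between attempts
});
```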

Server-side DOM options:

* `jQuery`: true, false or ConfObject (Default true); see [Working with Cheerio or JSDOM](https://github.com/paulvalla/node-crawler/blob/master/README.md#working-with-cheerio-or-jsdom) below

Charset encoding:

* `forceUTF8`: Boolean, if true will try to detect the page charset and convert it to UTF8 if necessary. Never worry about encoding anymore! (Default false),
* `incomingEncoding`: String, to set the encoding manually when `forceUTF8` is true (Default null), e.g. `incomingEncoding: 'windows-1255'`
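A short sketch, assuming you are crawling a site that serves Hebrew pages in windows-1255 (the URL is a placeholder):

```javascript
var c = new Crawler({
    forceUTF8: true,                   // convert the body to UTF-8
    incomingEncoding: 'windows-1255',  // skip detection, we know the charset
    callback: function (error, result, $) {
        if (!error) {
            console.log($('title').text());
        }
    }
});

c.queue('http://example.co.il/');      // placeholder URL
```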

Cache:

* `cache`: Boolean, if true stores requests in memory (Default false)
* `skipDuplicates`: Boolean, if true skips URIs that were already crawled, without even calling callback() (Default false)
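skipDuplicates is handy when you follow links recursively and the same URL shows up more than once; a minimal sketch with a placeholder URL:

```javascript
var c = new Crawler({
    skipDuplicates: true,   // a URI already crawled is silently ignored
    callback: function (error, result, $) {
        if (error) { return; }
        $('a').each(function (index, a) {
            // re-queueing already-seen links is now harmless
            c.queue($(a).attr('href'));
        });
    }
});

c.queue('http://example.com/');  // placeholder URL
```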

Other:

* `userAgent`: String, defaults to "node-crawler/[version]"
* `referer`: String, if truthy sets the HTTP referer header
* `rateLimits`: Number of milliseconds to delay between each request (Default 0). Note that this option will force crawler to use only one connection (for now)
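A polite-crawling sketch combining these options (the user agent string and URLs are placeholders, and rateLimits will keep the crawler on a single connection as noted above):

```javascript
var c = new Crawler({
    userAgent: 'my-crawler/0.1 (contact@example.com)', // placeholder UA string
    referer: 'http://example.com/',                    // placeholder referer
    rateLimits: 2000                                   // wait 2 seconds between requests
});

c.queue('http://example.com/some/page');               // placeholder URL
```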

Working with Cheerio or JSDOM
-----------------------------

Crawler uses [Cheerio](https://github.com/cheeriojs/cheerio) by default instead of [JSDOM](https://github.com/tmpvar/jsdom). JSDOM is more robust but can be hard to install (especially on Windows) because of [contextify](https://github.com/tmpvar/jsdom#contextify).
That is why, if you want to use JSDOM, you have to build it yourself and `require('jsdom')` in your own script before passing it to crawler. This avoids forcing Cheerio users to build JSDOM when installing crawler.

### Working with Cheerio
```javascript
jQuery: true //(default)
//OR
jQuery: 'cheerio'
//OR
jQuery: {
    name: 'cheerio',
    options: {
        normalizeWhitespace: true,
        xmlMode: true
    }
}
```
These parsing options are taken directly from [htmlparser2](https://github.com/fb55/htmlparser2/wiki/Parser-options), therefore any options that can be used in `htmlparser2` are valid in cheerio as well. The default options are:

```js
{
    normalizeWhitespace: false,
    xmlMode: false,
    decodeEntities: true
}
```

For a full list of options and their effects, see [DomHandler](https://github.com/fb55/DomHandler) and
[htmlparser2's options](https://github.com/fb55/htmlparser2/wiki/Parser-options)
([source](https://github.com/cheeriojs/cheerio#loading)).

### Working with JSDOM

In order to work with JSDOM you will have to install it in your project folder (`npm install jsdom`), deal with [compiling C++](https://github.com/tmpvar/jsdom#contextify), and pass it to crawler.
```javascript
var jsdom = require('jsdom');
var Crawler = require('crawler');

var c = new Crawler({
    jQuery: jsdom
});
```

How to test
-----------
node-crawler uses a local httpbin for testing purposes. You can install httpbin as a library from PyPI and run it as a WSGI app; once it is running:
    // Finally
    $ npm install && npm test

Feel free to add more tests!

[![build status](https://secure.travis-ci.org/sylvinus/node-crawler.png)](http://travis-ci.org/sylvinus/node-crawler)

Rough todolist
--------------

* Refactoring the code to be more maintainable, it's spaghetti code in there!
* Have a look at the Cache feature and refactor it
* Same for the Pool
* Make Sizzle tests pass (jsdom bug? https://github.com/tmpvar/jsdom/issues#issue/81)
* More crawling tests
* Document the API more (+ the result object)
* Get feedback on featureset for a 1.0 release (option for autofollowing links?)
* Check how we can support other mimetypes than HTML
* Option to wait for callback to finish before freeing the pool resource (via another callback like next())

