GitHub - tbpalsulich/nutch-auth-example: Example of using Nutch to authenticate and crawl mrs.org

Nutch HTTP Client Authentication

This work-in-progress deployment uses Nutch to log into www.mrs.org automatically and crawl it.

Run build.sh to check out the Nutch trunk, build it, and copy the necessary configuration files. Once it finishes, cd dist to use the newly configured Nutch distribution.
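The build step can be wrapped in a small helper so that a failed build does not leave you in the wrong directory. This is only a sketch: the function name build_and_enter is hypothetical, and invoking the script with bash is an assumption, since the README only says to run build.sh.

```shell
# Hypothetical helper: build Nutch, then move into the new distribution.
build_and_enter() {
    # build.sh is expected in the current directory (the repository root).
    if [ ! -f build.sh ]; then
        echo "build.sh not found; run this from the repository root" >&2
        return 1
    fi
    # Assumption: build.sh is a bash script. Stop here if the build fails.
    bash build.sh || return 1
    # The configured Nutch distribution ends up in dist/.
    cd dist
}
```

Calling build_and_enter from the repository root should leave the shell in dist/ when the build succeeds, and return a non-zero status otherwise.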

Please see conf/nutch-site.xml and conf/httpclient-auth.xml for the updated configuration files.

urls/seed.txt provides the seed URLs when you run the command bin/crawl urls/ CrawlData/ N, where N is the number of rounds of fetching.
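As a minimal sketch, the crawl command can be wrapped so that it fails early when it is not run from inside dist/. The function name run_crawl is hypothetical; the bin/crawl arguments are the ones given above.

```shell
# Hypothetical helper: run the crawl for a given number of rounds.
run_crawl() {
    rounds="$1"
    # bin/crawl only exists inside dist/ after build.sh has finished.
    if [ ! -x bin/crawl ]; then
        echo "bin/crawl not found; run build.sh and cd dist first" >&2
        return 1
    fi
    # Seeds come from urls/, crawl data goes to CrawlData/.
    bin/crawl urls/ CrawlData/ "$rounds"
}
```

For example, run_crawl 2 would run two rounds of fetching, starting from the URLs in urls/seed.txt.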

You can try crawling a single page by running bin/nutch parsechecker http://mrs.org/home/.

After running a crawl or using parsechecker, the logs will be in logs/hadoop.log.
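When a login or fetch does not behave as expected, it can help to pull only the warnings and errors out of the log. The sketch below assumes the log uses the usual log4j level names WARN and ERROR; the function name show_log_problems and the limit of 20 lines are arbitrary choices.

```shell
# Hypothetical helper: print the most recent warnings and errors.
show_log_problems() {
    # Default to the log location named in this README.
    log="${1:-logs/hadoop.log}"
    if [ ! -f "$log" ]; then
        echo "$log not found" >&2
        return 1
    fi
    # Keep only WARN and ERROR lines, then show the last 20 of them.
    grep -E 'ERROR|WARN' "$log" | tail -n 20
}
```

Run show_log_problems from inside dist/ after a crawl or a parsechecker run, or pass a different log path as the first argument.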

Make sure to update the credentials in dist/conf/httpclient-auth.xml!

