# Simple Java Crawler using NIO

This Java crawler uses NIO to fetch web pages, so only a single thread blocks on IO. The worker threads are responsible for parsing the HTTP responses and the HTML bodies.
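
The general shape of that pattern is a single selector loop that owns all the sockets and hands completed reads off to a worker pool. The sketch below is illustrative only, assuming hypothetical names (SelectorLoopSketch, parse, a fixed pool of 4 workers) rather than the classes actually used in this repository, and it elides writing the HTTP request.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch (not this project's code): one thread multiplexes all
// sockets with a Selector, and completed reads are handed to worker threads.
public class SelectorLoopSketch
{
    private final Selector selector;
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    public SelectorLoopSketch() throws IOException
    {
        this.selector = Selector.open();
    }

    public void connect(String host, int port) throws IOException
    {
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        channel.connect(new InetSocketAddress(host, port));
        channel.register(selector, SelectionKey.OP_CONNECT);
    }

    public void loop() throws IOException
    {
        ByteBuffer buffer = ByteBuffer.allocate(8192);
        while (true)
        {
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext())
            {
                SelectionKey key = it.next();
                it.remove();
                SocketChannel channel = (SocketChannel) key.channel();
                if (key.isConnectable() && channel.finishConnect())
                {
                    // In a real crawler the HTTP request would be written
                    // here before switching interest to reads.
                    key.interestOps(SelectionKey.OP_READ);
                }
                else if (key.isReadable())
                {
                    buffer.clear();
                    if (channel.read(buffer) == -1)
                    {
                        channel.close();
                        continue;
                    }
                    buffer.flip();
                    byte[] bytes = new byte[buffer.remaining()];
                    buffer.get(bytes);
                    // Parsing happens off the IO thread.
                    workers.submit(() -> parse(bytes));
                }
            }
        }
    }

    private void parse(byte[] responseBytes)
    {
        // Placeholder: the real crawler parses the HTTP response and HTML here.
    }
}
```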

To manage a crawl job, you have to implement the Job interface:

import java.net.URI;

public class MyJob implements Job
{
    @Override
    public boolean visit(URI url)
    {
        // Only visit URLs on myhost.com
        return url.toString().startsWith("http://www.myhost.com");
    }

    @Override
    public void process(Page page)
    {
        // Do whatever you want with the page!
    }
}
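
Starting a crawl with this job might look like the following; the Crawler class name, its constructor, and the crawl method are assumptions made for illustration, not the actual API of this repository.

```java
public class Main
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical entry point: the actual class and method names in
        // this repository may differ.
        Crawler crawler = new Crawler(new MyJob());
        crawler.crawl("http://www.myhost.com/");
    }
}
```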

This crawler is inspired by Anemone, crawler-4j, em-http-request and Nutch. The HTTP parsing is done with a Ragel-generated parser adapted from the one Zed A. Shaw wrote for Mongrel.

This code is a learning experience; it is far from production ready!
