Sample web crawler program
To build on the command line:
mvn clean install
To execute, locate the directory containing "apjt-core-1.0-SNAPSHOT.jar" and run:
java -jar apjt-core-1.0-SNAPSHOT.jar <URL>
where <URL> is the URL to crawl. The output should eventually look something like:
INTERNAL - http://localhost:54410/index.html
img - http://localhost:54410/image2.jpg
INTERNAL - http://localhost:54410/file1.html
img - http://localhost:54410/image1.jpg
img - http://localhost:54410/image3.jpg
INTERNAL - http://localhost:54410/nested/file2.html
a - http://www.google.co.uk
img - http://localhost:54410/nested/nestedimg.jpg
INTERNAL - ERROR[404] - http://localhost:54410/file3.html
INTERNAL indicates a link internal to the site being crawled, "a" indicates an external link, "img" indicates an image, and so on (other tags carrying a "src" attribute are also reported).
The task broke down into three parts:
- Write a crawler - I believe this is the main functionality being requested, so although third-party solutions are available, I opted to implement a basic version myself
- Write a webpage parser - parsing HTML is surprisingly difficult, so I have used the third-party "jsoup" library for this (a sketch of the parse step follows this list)
- Print out the results - due to time constraints, simple console output has been implemented rather than a graphical view
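For illustration, a minimal version of the parse-and-classify step might look like the following. This is a sketch rather than the project's actual code: the start URL is taken from the sample output above, and the prefix test stands in for whatever internal/external check the crawler really applies.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseExample {
    public static void main(String[] args) throws Exception {
        // Illustrative start URL, taken from the sample output above
        String base = "http://localhost:54410/index.html";
        Document doc = Jsoup.connect(base).get();

        // Anchors: report internal links separately from external ones
        for (Element link : doc.select("a[href]")) {
            String abs = link.attr("abs:href"); // jsoup resolves relative URLs
            boolean internal = abs.startsWith("http://localhost:54410/");
            System.out.println((internal ? "INTERNAL" : "a") + " - " + abs);
        }
        // Any tag carrying a "src" attribute (img, script, ...)
        for (Element el : doc.select("[src]")) {
            System.out.println(el.tagName() + " - " + el.attr("abs:src"));
        }
    }
}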
Time spent:
- approximately 15 minutes on "prep": reading the question, choosing an HTML parser, etc.
- over the following days, a little over two hours on the actual coding; additional time was spent writing this documentation
- I already had a "template" Maven project that runs a build, but needed to add the shade plugin separately (a typical stanza is shown below), so not much time was spent there
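For reference, the shade plugin only needs a small addition to the pom.xml. The stanza below is a typical configuration rather than this project's exact one, and the mainClass value is illustrative:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- sets Main-Class in the manifest so "java -jar" works -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.Launcher</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>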
Possible improvements:
- Needs additional tests around the parsing of the HTML pages
- The launcher needs more functionality that can be tested
- Need to better handle the distinction between internal and external links, as the current support is brittle (one possible approach is sketched after this list)
- Should support a maximum nesting depth rather than a maximum number of pages, as the output would then be a bit more logical
- Should allow maxPages and/or maxDepth to be passed in as parameters
- Handling of pages should be executed in parallel - the code has mostly been written to support parallelisation, but the launcher does not yet use it (a sketch also follows this list)
- Rendering of the "site map" should be in a better form: graphical output, an XML sitemap, etc.
- Input to the program should be validated
- Should act like a better-behaved robot (e.g. honouring robots.txt)
- Documentation should be improved
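On the internal/external point, one less brittle approach would be to compare resolved hosts instead of string prefixes. A minimal sketch, assuming java.net.URI (the class name LinkClassifier is made up for this example):

import java.net.URI;

public final class LinkClassifier {
    private final URI base;

    public LinkClassifier(URI base) {
        this.base = base;
    }

    // A link counts as internal if it resolves to the same host as the start page
    public boolean isInternal(String href) {
        URI resolved = base.resolve(href);
        String host = resolved.getHost();
        return host != null && host.equalsIgnoreCase(base.getHost());
    }
}

For example, new LinkClassifier(URI.create("http://localhost:54410/")).isInternal("/file1.html") would return true, while the same check on http://www.google.co.uk would return false.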
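On the parallelisation point, the launcher could drive the crawl with a thread pool and a concurrent "seen" set. The sketch below shows the general shape only; ParallelCrawler and parse are placeholder names rather than the project's actual classes, and termination and error handling are omitted:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelCrawler {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    public void crawl(String url) {
        // add() is atomic, so each page is submitted (and parsed) at most once
        if (seen.add(url)) {
            pool.submit(() -> {
                for (String link : parse(url)) {
                    crawl(link); // newly discovered pages fan out to the pool
                }
            });
        }
    }

    // Placeholder for the jsoup-based page parse; returns links found on the page
    private Iterable<String> parse(String url) {
        return Collections.emptyList();
    }
}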
Notes:
- Unit tests are a little heavyweight because an embedded Jetty server is wired in as a TestRule (although it doesn't add much overhead). I implemented something similar recently for CXF testing, so I knew how easy it is. It means the app is tested in a much more production-like way, but the main reason for it was to simplify the jsoup code and wiring, removing the need for too many conditional parsing types (a sketch of such a rule follows this list)
- Used jsoup for the HTML parsing. This was the first time I had come across jsoup, so the API usage is based mainly on their examples; for a real app I would prefer to spend more time understanding that area
- The model implemented separates parsing from rendering, so it holds more resources in memory, but it is written so that a page hopefully won't be parsed more than once (see the cache sketch below)
- A simple in-memory cache of the pages was used, rather than a store that could spill to disk, so this would not scale well
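The Jetty TestRule mentioned above might look roughly like this, assuming JUnit 4 and Jetty 9; the class name and resource base path are illustrative:

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.server.handler.ResourceHandler;
import org.junit.rules.ExternalResource;

public class JettyServerRule extends ExternalResource {
    private Server server;
    private int port;

    @Override
    protected void before() throws Throwable {
        server = new Server(0); // port 0 = pick any free port
        ResourceHandler handler = new ResourceHandler();
        handler.setResourceBase("src/test/resources/site"); // static fixture pages
        server.setHandler(handler);
        server.start();
        port = ((ServerConnector) server.getConnectors()[0]).getLocalPort();
    }

    @Override
    protected void after() {
        try {
            server.stop();
        } catch (Exception ignored) {
            // best-effort shutdown in test teardown
        }
    }

    public String baseUrl() {
        return "http://localhost:" + port + "/";
    }
}

A test would then declare the rule with @Rule and crawl baseUrl(), which is why the sample output above points at localhost with an arbitrary port.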
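The parse-once idea from the last two points could be as simple as a ConcurrentHashMap keyed by URL. Again a sketch, with Page standing in for whatever parsed-page type the project actually uses:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PageCache {
    private final Map<String, Page> cache = new ConcurrentHashMap<>();

    // computeIfAbsent guarantees each URL is parsed at most once,
    // even when several threads request the same page concurrently
    public Page get(String url) {
        return cache.computeIfAbsent(url, this::parse);
    }

    // Placeholder for the jsoup-based parse of the given URL
    private Page parse(String url) {
        return new Page(url);
    }

    public static class Page {
        final String url;
        Page(String url) {
            this.url = url;
        }
    }
}

Since everything lives in the map, memory is the scaling limit noted above; swapping it for a disk-backed store would be the natural next step.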