Heritrix in Eclipse
Alex Osborne edited this page Jul 4, 2018 · 3 revisions
These instructions were written for Ubuntu 11.04, but should also work on other releases from the same period (10.10, 11.10, ...).
N.B. There are other ways to do this, for example using additional Eclipse plugins for Maven or Git, but this is one way that is known to work.
sudo apt-get install sun-java6-jdk eclipse git maven2
sudo update-java-alternatives --set java-6-sun
sudo update-java-alternatives --list
cd ~/workspace
git clone https://github.com/internetarchive/heritrix3.git
cd ~/workspace/heritrix3
mvn -Dmaven.test.skip=true install
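If the clone or the Maven build above fails immediately, it can help to confirm that the installed tools are actually on the PATH. The following is a minimal sketch, not part of the original instructions; the helper name `missing_tools` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each given tool
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The commands above rely on java, git and mvn.
missing=$(missing_tools java git mvn)
if [ -n "$missing" ]; then
    echo "Missing from PATH:" $missing
else
    echo "java, git and mvn are all on PATH"
fi
```

Any tool reported as missing probably means the corresponding package from the apt-get step was not installed.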
In Eclipse: File / Import... / Existing Projects into Workspace ... choose ~/workspace/heritrix3
Select Project > Properties > Java Build Path > Libraries tab > Add Variable > Configure Variables > New
Name: M2_REPO
Path: /home/{username}/.m2/repository
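To find the concrete value to enter for the M2_REPO path, something like the sketch below may help. It assumes Maven's default local repository location under the home directory (no custom `localRepository` in `settings.xml`); the function name `m2_repo_path` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the default local Maven repository
# path for the current user (assumes default Maven settings).
m2_repo_path() {
    printf '%s\n' "$HOME/.m2/repository"
}

repo=$(m2_repo_path)
if [ -d "$repo" ]; then
    echo "M2_REPO should point to: $repo"
else
    echo "$repo does not exist yet; it is normally created by the first Maven build"
fi
```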
- Run / Debug Configurations...
- double-click Java Application to create a new configuration
- choose Main class org.archive.crawler.Heritrix
- Arguments tab
  - Program arguments: -a PASSWORD -l dist/src/main/conf/logging.properties
  - VM arguments: -Dheritrix.development