National or regional domain scope
This is a complex topic, possibly requiring deep meditation. However, as a
simple starting point: if you only want to capture "Hungarian" sites under
the "hu" TLD, you could specify "http://(hu," as an allowable-scope SURT
prefix using SurtPrefixedDecideRule in your crawler beans CXML file (a
configuration sketch follows the links below). See:
- Crawl Scope: https://webarchive.jira.com/wiki/display/Heritrix/Crawl+Scope
- SurtPrefixedDecideRule: http://crawler.archive.org/apidocs/org/archive/crawler/deciderules/SurtPrefixedDecideRule.html
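As an illustration, here is a minimal sketch of such a scope as it might
appear in a Heritrix 3 crawler-beans.cxml. It follows the shape of the stock
profile's scope DecideRuleSequence (the class names and the ConfigString
wiring for surtsSource mirror the Heritrix 3 defaults, but check them against
your own profile; other rules from the default scope, such as transclusion
and prerequisite handling, are omitted for brevity):

    <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
     <property name="rules">
      <list>
       <!-- Start by rejecting everything... -->
       <bean class="org.archive.modules.deciderules.RejectDecideRule" />
       <!-- ...then accept only URIs under the .hu SURT prefix. -->
       <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
        <property name="surtsSource">
         <bean class="org.archive.spring.ConfigString">
          <property name="value">
           <value>
            http://(hu,
           </value>
          </property>
         </bean>
        </property>
       </bean>
      </list>
     </property>
    </bean>

A candidate URL is compared in its SURT form (for example,
http://www.example.hu/ becomes roughly http://(hu,example,www,)/), so the
single prefix "http://(hu," matches anything under the hu TLD. Note that by
default the rule also derives accept prefixes from your seed list
(seedsAsSurtPrefixes), so seeds outside .hu would widen the scope.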
However, this is a highly oversimplified notion of what might constitute
"Hungarian sites only". You'll want to think about what that actually means
for a class of URLs. For instance, what about sites that are hosted elsewhere
but are clearly Hungarian? Or hosted Hungarian blogs (LiveJournal, Blogger,
TypePad, etc.)? Or Hungarian accounts on social networking sites (Facebook,
MySpace, Bebo, Flickr, etc.)?
Other ways of addressing this scoping issue include requesting, from a
service like Alexa Internet or a domain registry, a list of sites hosted on
servers geographically located in the country of interest.
It is also conceivable to write custom modules that try to detect the
language or geolocation of potential target sites and decide what to capture
on that basis.
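For instance, such a module could be written as a custom DecideRule and
dropped into the scope sequence. The following Java sketch assumes Heritrix 3
(PredicatedDecideRule, CrawlURI, and UURI are real Heritrix classes); the
HungarianSiteDecideRule name, its package, and the looksHungarian() stub are
hypothetical placeholders for a real language-detection or GeoIP lookup:

    package org.example.modules.deciderules;

    import org.apache.commons.httpclient.URIException;
    import org.archive.modules.CrawlURI;
    import org.archive.modules.deciderules.PredicatedDecideRule;

    /**
     * Hypothetical rule that applies its decision (ACCEPT by default) to
     * URIs whose host looks "Hungarian". Only a cheap TLD check is
     * implemented here; a real module might consult a GeoIP database or
     * run language detection on fetched content.
     */
    public class HungarianSiteDecideRule extends PredicatedDecideRule {
        private static final long serialVersionUID = 1L;

        @Override
        protected boolean evaluate(CrawlURI curi) {
            try {
                String host = curi.getUURI().getHost();
                if (host == null) {
                    return false;
                }
                return host.endsWith(".hu") || looksHungarian(host);
            } catch (URIException e) {
                return false;
            }
        }

        // Placeholder for a smarter geolocation/language heuristic.
        protected boolean looksHungarian(String host) {
            return false;
        }
    }

To use it, the compiled class would go on Heritrix's classpath and be added
as another bean inside the scope's rule list, alongside the
SurtPrefixedDecideRule shown earlier.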