Avoiding False Requests When Processing Certain Types of Content
As of Heritrix 3.1, the crawler is better at determining whether a string is a valid URI. This improves link extraction from content such as unparsed/uninterpreted JavaScript. However, the technique can be error-prone: strings that are mistaken for URIs produce requests for resources that do not exist, which can cause problems or annoyance on target Web sites. This functionality can be disabled in Heritrix 3.1 for a full crawl or on a site-by-site basis. To disable it, remove the "extractorJs" bean reference from the "fetchProcessors" bean and set the "extractorHtml" bean's "extractJavascript" and "extractValueAttributes" properties to false.
- Remove the "fetchProcessors" bean's reference to "extractorJs".

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- re-check scope, if so enabled... -->
      <ref bean="preselector"/>
      <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->
      <ref bean="preconditions"/>
      <!-- ...fetch if DNS URI... -->
      <ref bean="fetchDns"/>
      <!-- <ref bean="fetchWhois"/> -->
      <!-- ...fetch if HTTP URI... -->
      <ref bean="fetchHttp"/>
      <!-- ...extract outlinks from HTTP headers... -->
      <ref bean="extractorHttp"/>
      <!-- ...extract outlinks from HTML content... -->
      <ref bean="extractorHtml"/>
      <!-- ...extract outlinks from CSS content... -->
      <ref bean="extractorCss"/>
      <!-- ...extract outlinks from Javascript content... -->
      <!-- ****** <ref bean="extractorJs"/> ****** -->
      <!-- ...extract outlinks from Flash content... -->
      <ref bean="extractorSwf"/>
    </list>
  </property>
</bean>
- Set the "extractorHtml" bean's "extractJavascript" and "extractValueAttributes" properties to false.

<bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
  <property name="extractJavascript" value="false" />
  <property name="extractValueAttributes" value="false" />
</bean>
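
The two changes above apply to the whole crawl. For the site-by-site case, one possible approach is a Heritrix 3 settings sheet that overrides the same properties only for URIs under a given SURT prefix. The fragment below is a minimal sketch, not a tested configuration: the sheet name "noSpeculativeExtraction" and the example.org prefix are hypothetical, and it assumes that these properties (including the "enabled" flag on "extractorJs") can be overridden through a sheet in your Heritrix version. With this approach the "extractorJs" reference stays in "fetchProcessors", since the sheet switches the extractor off only for matching URIs.

```xml
<!-- Assumption: sheet overlays are available, as in the default crawler-beans.cxml profile. -->
<!-- Associates the hypothetical sheet below with every URI under example.org. -->
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <!-- hypothetical target site, written as a SURT prefix -->
      <value>http://(org,example,</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>noSpeculativeExtraction</value>
    </list>
  </property>
</bean>

<!-- Hypothetical sheet: per-site overrides of the settings described above. -->
<bean id="noSpeculativeExtraction" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <!-- assumed overridable: disable the JavaScript extractor for matching URIs -->
      <entry key="extractorJs.enabled" value="false"/>
      <entry key="extractorHtml.extractJavascript" value="false"/>
      <entry key="extractorHtml.extractValueAttributes" value="false"/>
    </map>
  </property>
</bean>
```

After a crawl with such a sheet in place, checking crawl.log for the affected site is a reasonable way to confirm that speculative URIs are no longer being requested there.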