Release Notes - Heritrix 3.4.0-20210803
Andy Jackson edited this page Aug 3, 2021
Summary of changes since Release Notes - Heritrix 3.4.0-20210803 - see the full changelog for more details.
- ExtractorChrome: reduce request duplication between browser and frontier #416 (ato)
- ExtractorChrome: Capture requests made by the browser #411 (ato)
- Add ExtractorChrome to contrib #403 (ato)
- Add basic syntax highlighting to the crawl.log viewer #408 (ato)
- JDK 16 compatibility #418 (ato)
- Upgrade httpclient to 4.5 #397 (anjackson)
- Don't extract data URIs #423 (ato)
- ToeThread: ensure currentCuri is finished before exiting #421 (ato)
- Switch from Travis CI to GitHub Actions #404 (ato)
- Speed up test suite #405 (ato)
- Fix a couple of boring Maven warnings #407 (ato)
- Fix and document the -r option which runs a named job on startup #406 (ato)
- Upgrade maven-assembly-plugin to 3.3.0 to fix file permissions #414 (ato)
- Warc writer stats fixes #410 (ato)
none
- Fix WARC-IP-Address and use a common server-ip CrawlURI attribute for all protocols #409 (ato)
- Jobs can get stuck STOPPING with "Interrupt leaving unfinished CrawlURI" #420
- Groovy version is incompatible with JDK 16+ #419
- module java.base does not export sun.security.tools.keytool to unnamed module @1ece4432 #417
- Distribution package has broken filesystem permissions #413
- Add WARC-IP-Address header to WARCWriterChainProcessor #396