New Settings Web UI
(from some brainstorming of Paul, Michael Magin, Vinay, and Gordon on April 27th)
The redone UI for operating on the new settings will likely have an area for composing 'sheets' and one for assigning URIs/SURT-prefixes to sheets.
(Gordon wondered: can there be more than one 'global defaults' sheet, either as a bundle or an empty-prefix override?)
Some problems with the current settings web UI include:
- overrides sometimes don't work as expected, either having no effect or (at one point, bug probably fixed) changing global settings
- it's unclear what can be effectively changed mid-crawl, and whether it is necessary to pause to do so. Can we better document or enforce this? Can we make all changes 'safe', for instance by holding settings constant for a thread until a moment when they can be changed safely and atomically?
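One way to picture "holding settings constant for a thread" is an immutable snapshot that is swapped atomically. The sketch below is only an illustration in Python, not Heritrix code: `SettingsHolder` and the setting names `delay_factor` and `max_retries` are hypothetical. Each worker takes one snapshot when it starts a URI, and an operator edit builds a complete new snapshot before replacing the reference.

```python
import threading
from types import MappingProxyType


class SettingsHolder:
    """Holds the current settings as a read-only snapshot (hypothetical helper)."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._current = MappingProxyType(dict(initial))

    def snapshot(self):
        # A worker calls this once per URI and reads only the returned mapping,
        # so its view stays constant even if settings are edited meanwhile.
        return self._current

    def update(self, changes):
        # Build a complete new snapshot, then swap the reference in one step.
        # The lock only serializes concurrent writers; readers never block.
        with self._lock:
            merged = dict(self._current)
            merged.update(changes)
            self._current = MappingProxyType(merged)


holder = SettingsHolder({"delay_factor": 5.0, "max_retries": 30})
in_flight = holder.snapshot()          # taken when a thread starts a URI
holder.update({"delay_factor": 2.0})   # operator edit arrives mid-crawl
```

Here `in_flight` still sees the old `delay_factor`, while the next call to `holder.snapshot()` sees the new one. Under this design an edit takes effect at the next URI boundary without pausing the crawl.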
Though the SURT-prefix-mapped overrides are a superset of the current functionality, the convenience of still being able to enter a plain hostname should be retained.
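As a rough illustration of how a plain hostname could still be accepted, the sketch below converts one into a host-level SURT prefix and passes through anything that already looks like a SURT. The function name `to_surt_prefix` is hypothetical, and the sketch assumes the usual `http://(tld,domain,host,)` SURT form; ports, schemes, and other normalization details are deliberately ignored.

```python
def to_surt_prefix(entry):
    """Return a SURT prefix for a plain hostname; pass SURT-looking input through."""
    entry = entry.strip()
    if "(" in entry:
        # Already looks like a SURT prefix, e.g. "http://(org,archive,"
        return entry
    # Reverse the dot-separated labels: www.archive.org -> org,archive,www
    labels = entry.lower().rstrip(".").split(".")
    # The trailing ",)" closes the host part, so only this exact host matches.
    return "http://(" + ",".join(reversed(labels)) + ",)"


host_prefix = to_surt_prefix("www.archive.org")
domain_prefix = to_surt_prefix("http://(org,archive,")
```

With this assumption, `www.archive.org` becomes `http://(org,archive,www,)`, while an operator-entered prefix such as `http://(org,archive,` is left untouched and continues to cover all subdomains.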
View/edit frontier looks useful but isn't useful/efficient at scale – can that be fixed?
Is there a way to interactively test what settings would apply to a URI in the UI? (Same goes for scopes.)
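An interactive test of this kind could amount to a prefix lookup. The following sketch is hypothetical throughout: the helper names, the sheet names, and the simplified SURT conversion (which ignores ports, query strings, and canonicalization) are assumptions for illustration. It also assumes sheets are applied from the shortest matching prefix to the longest, so that more specific overrides win.

```python
from urllib.parse import urlsplit


def uri_to_surt(uri):
    """Simplified SURT form of a URI: scheme://(reversed,host,labels,)/path."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    labels = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    return f"{parts.scheme}://({labels},){path}"


def applicable_sheets(uri, associations):
    """List sheet names whose SURT prefix matches the URI, least specific first."""
    surt = uri_to_surt(uri)
    matches = [
        (prefix, sheets)
        for prefix, sheets in associations.items()
        if surt.startswith(prefix)
    ]
    # Assumption: longer (more specific) prefixes are applied later.
    matches.sort(key=lambda item: len(item[0]))
    return [name for _, sheets in matches for name in sheets]


# Hypothetical prefix-to-sheet associations.
associations = {
    "http://(org,": ["politeOrg"],
    "http://(org,archive,www,)/movies": ["videoLimits"],
    "http://(com,": ["commercial"],
}
result = applicable_sheets("http://www.archive.org/movies/x.php", associations)
```

For this input, `result` lists `politeOrg` followed by `videoLimits`. A UI form backed by such a lookup could show an operator which sheets would be overlaid for a pasted URI, and in what order.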
The frontier report is used often but is not in an optimal form for common tasks. Sorting by queue size (or other salient characteristics) would help. Long URIs, which are exactly those of most interest for trap evaluation, are often hard to view. Can color or size be used to highlight important data needing attention? The seeds report, with errors at the top and clickable URIs, is a good model.
The cross-links from reports to regex-filtered views of logs have sometimes been broken in the past.
Can the crawl.log get syntax-highlighting or be clickable?
Bugs/defects/annoyances/unexpected-behaviors in old and new UIs should be quickly and liberally reported to bug-tracking. (Err on the side of over-reporting.)