- Change white/black lists to allow/deny lists
- Update phantomjs_options to use 'TLSv1.2'
- Delete 'driver_options' configuration key as it was never used.
- 'cleanup_all_processes' is a class-level (self) method, as intended.
- New configuration key 'on_periodic_restart'.
- The CrawlerManager.cleanup_all_processes method destroys all instances of PhantomJS on this machine.
- Breaking changes (see the configuration sketch after this list):
  - Requires Ruby 2.1 or later.
  - Crawler.start_crawling does not accept options anymore; all options are passed to Crawler.new.
  - Crawler's methods 'restart' and 'quit' have been moved to CrawlerManager.
  - Crawler gets whitelist and blacklist as configuration options instead of being set in specific methods.
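A minimal sketch of the resulting 2.0-style setup. Only the names Crawler.new, start_crawling, whitelist, blacklist, logger, on_periodic_restart, CrawlerManager and cleanup_all_processes come from this changelog; the option values and the exact shape of 'on_periodic_restart' are illustrative assumptions.

```ruby
require 'grell'
require 'logger'

# All options now go to Crawler.new; start_crawling no longer accepts them.
crawler = Grell::Crawler.new(
  logger: Logger.new(STDOUT),                  # external logger ('logger' option)
  whitelist: [/\/products\/.*/],               # illustrative allow patterns
  blacklist: [/\/admin\/.*/],                  # illustrative deny patterns
  on_periodic_restart: { do: proc { puts 'PhantomJS restarted' }, each: 100 } # assumed shape
)

crawler.start_crawling('http://example.com') do |page|
  puts page.current_url
end

# 'restart' and 'quit' now live on CrawlerManager; cleanup_all_processes kills
# every PhantomJS instance left running on this machine.
Grell::CrawlerManager.cleanup_all_processes
```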
- Ensure all links are loaded by waiting for Ajax requests to complete
- Add '@evaluate_in_each_page' option to evaluate JavaScript in each page before extracting links (e.g. $('.dropdown').addClass('open');)
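A hedged sketch of how this might be configured, assuming the option is named 'evaluate_in_each_page' (the entry above only names the instance variable) and is passed alongside the other crawler options:

```ruby
# Open every dropdown so its links are visible before Grell extracts links.
crawler = Grell::Crawler.new(
  evaluate_in_each_page: "$('.dropdown').addClass('open');"
)
```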
- Avoid following JS href links, add missing dependencies to fix Travis build
- Avoid following links when disabled by CSS (1.6.8 worked only for JavaScript)
- Avoid following disabled links
- Increment '@times_visited' first to avoid infinite retries when rescuing errors
- Updated phantomjs_logger not to open '/dev/null'
- Added #quit to Crawler
- Added #quit to Capybara driver
- Only follow visible links
- Reset Capybara driver to Puffing Billy (used to rewrite URL requests in specs)
- Use float timestamp for Poltergeist driver name to support fast test executions
- Use non-static name to support registering Poltergeist crawler multiple times
- More exception handling, store redirected URLs in addition to original URL
- Support custom URL comparison when adding new pages during crawling
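The entry above does not name the option, so the following is only a sketch: both the 'add_match_block' option name and the Page#path accessor are assumptions used for illustration.

```ruby
# Consider two URLs to be the same page when their paths match, ignoring query strings.
same_page = proc do |collection_page, page|
  collection_page.path == page.path
end

crawler = Grell::Crawler.new(add_match_block: same_page)
```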
- Don't rescue Timeout error, so that Delayed Job can properly terminate hanging jobs
- Fail early if Capybara doesn't initialize properly
- Fixed deprecation warning (Thanks scott)
- Updated Poltergeist dependency
- Grell will follow redirects.
- Added #followed_redirects? #error? #current_url methods to the Page class
- Added crawler.restart to restart browser process
- The block of code passed to start_crawling can make Grell retry any given page (see the sketch below).
- Rescue Timeout error and return an empty page when that happens
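A sketch tying these entries together. The #followed_redirects?, #error?, #current_url and crawler.restart names come from this changelog; the 'retries' and 'retry' helpers on the yielded page are an assumption about how the retry block is expressed.

```ruby
crawler = Grell::Crawler.new

crawler.start_crawling('http://example.com') do |page|
  puts "redirected to #{page.current_url}" if page.followed_redirects?

  # On an error page (including the empty page returned after a Timeout),
  # restart the browser process once and ask Grell to retry the page.
  if page.error? && page.retries.zero?
    crawler.restart
    page.retry
  end
end
```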
- Added whitelisting and blacklisting
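At this point in Grell's history the lists were set through dedicated methods (the 2.0 entry above later moved them to configuration options). A sketch assuming whitelist/blacklist accept strings or regexps matched against paths:

```ruby
crawler = Grell::Crawler.new
crawler.whitelist([/\/games\/.*/, '/fun'])  # only follow matching links
crawler.blacklist(/\/archive\/.*/)          # never follow matching links
crawler.start_crawling('http://example.com')
```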
- Better info in gemspec
- The Crawler object allows you to provide an external logger object.
- Clearer semantics when an error happens: special headers are returned so the user can inspect the error (see the sketch after the caveats)
- Caveats:
- The 'debug' option in the crawler does not have any effect anymore. Provide an external logger with 'logger' instead
- The errors provided in the headers by Grell have changed from 'grell_status' to 'grellStatus'.
- The 'visited' property in the page was never supposed to be accessible. Use 'visited?' instead.
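A sketch of the error inspection and logging described above. Only the 'logger' option, the 'grellStatus' header key and the 'visited?' predicate come from this changelog; the Page#headers accessor is an assumption.

```ruby
require 'logger'

crawler = Grell::Crawler.new(logger: Logger.new('grell.log'))

crawler.start_crawling('http://example.com') do |page|
  # Errors are reported through special headers instead of being raised.
  if page.headers['grellStatus']
    puts "Problem on #{page.current_url}: #{page.headers['grellStatus']}"
  end

  puts page.visited?  # use the predicate; the bare 'visited' property is not public API
end
```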
- Solve bug: URLs are case insensitive
- Grell will now consider two links to point to the same page only when the whole URL is exactly the same. Previous versions considered two links to be the same whenever they shared the path.
- Solve bug where we were adding links in heads as if they were normal links in the body
- Solve bug with the new data-href functionality
- Solve problem with randomly failing spec
- Search for elements with 'href' or 'data-href' to find links
- Rescuing JavaScript errors
- Initial implementation
- Basic support to crawling pages.