You can request to follow all links that match the current domain:

```php
// Inside a spider's parse() method: collect every <a> link on the page
// and follow only those that stay on the current domain.
$domain = parse_url($response->getUri(), PHP_URL_HOST);

$links = $response->filter('a')->each(function ($node) {
    return $node->link()->getUri();
});

foreach ($links as $link) {
    // Skip links that point to a different host.
    if (parse_url($link, PHP_URL_HOST) !== $domain) {
        continue;
    }

    yield $this->request('GET', $link);
}
```

My point isn't whether roach-php should support this out of the box; in the end, this is how you get that result. If you do it this way, you need to keep track of the scheduled requests yourself, because duplicates will appear. Deduplication is recommended, and I opened a PR with a spider middleware that helps with that. The current deduplication runs as a downloader middleware, so requests can accumulate much faster than the middleware can drop them before download. The new one is a spider middleware, so it can drop duplicate requests earlier, as soon as the spider yields them. That avoids high memory consumption while still preventing duplicate requests. It also uses a size-limited map, so at worst it will repeat a rarely requested URL rather than crash from excessive memory usage.
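For reference, here is a minimal sketch of the size-limited "seen" map that the PR describes. This is not roach-php's actual middleware interface; the class and method names below are my own illustrative assumptions. It only shows the core idea: cap the map's size and evict the oldest fingerprint, deliberately trading an occasional repeated request for bounded memory.

```php
<?php

// Hypothetical sketch (not roach-php's real API): a bounded set of URI
// fingerprints for dropping duplicate requests without unbounded memory use.
final class BoundedSeenUris
{
    /** @var array<string, true> insertion-ordered set of URI fingerprints */
    private array $seen = [];

    public function __construct(private readonly int $maxSize = 10_000)
    {
    }

    /**
     * Returns true if the URI was already seen (drop the request),
     * false if it is new (let the request proceed).
     */
    public function isDuplicate(string $uri): bool
    {
        $fingerprint = hash('sha256', $uri);

        if (isset($this->seen[$fingerprint])) {
            return true;
        }

        $this->seen[$fingerprint] = true;

        // Evict the oldest entry once the cap is exceeded. PHP arrays
        // preserve insertion order, so array_key_first() is the oldest.
        if (count($this->seen) > $this->maxSize) {
            unset($this->seen[array_key_first($this->seen)]);
        }

        return false;
    }
}
```

Inside a spider middleware, you would call isDuplicate() on each request the spider yields and drop the request when it returns true; an evicted fingerprint means that URI may be crawled again later, which is the intended trade-off.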
Hello,
Does roach-php support following links and crawling an entire site? I read the documentation on Scraping versus Crawling and understand the difference between the two, but throughout the documentation both terms "scraper" and "crawler" are used to describe roach-php. So my question is: does roach-php support crawling? If so, I can't seem to find it anywhere in the documentation.
Thank you.