You can request to follow all links that match the current domain:

```php
// Inside a spider's parse() method: collect every <a> link on the page
// and follow only those that stay on the current domain.
$domain = parse_url($response->getUri(), PHP_URL_HOST);

$links = $response->filter('a')->each(function ($node) {
    return $node->link()->getUri();
});

foreach ($links as $link) {
    // Skip links that point to a different host.
    if (parse_url($link, PHP_URL_HOST) !== $domain) {
        continue;
    }

    yield $this->request('GET', $link);
}
```

My point isn't whether roach-php should support this out of the box; in the end, this is how you get that result. If you do it this way, you need to keep track of the scheduled requests yourself, because duplicates will appear. Deduplication is recommended, and I opened a PR with a spider middleware that helps with that. The current deduplication runs as a downloader middleware, so requests can accumulate much faster than the middleware can drop them before download. The new one is a spider middleware, so it can drop duplicate requests earlier, as soon as the spider yields them. That avoids high memory consumption while still preventing duplicate requests. It also uses a size-limited map, so at worst it will repeat a rarely requested URL rather than crash from excessive memory usage.
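For reference, here is a minimal sketch of the size-limited "seen" map that the PR describes. This is not roach-php's actual middleware interface; the class and method names below are my own illustrative assumptions. It only shows the core idea: cap the map's size and evict the oldest fingerprint, deliberately trading an occasional repeated request for bounded memory.

```php
<?php

// Hypothetical sketch (not roach-php's real API): a bounded set of URI
// fingerprints for dropping duplicate requests without unbounded memory use.
final class BoundedSeenUris
{
    /** @var array<string, true> insertion-ordered set of URI fingerprints */
    private array $seen = [];

    public function __construct(private readonly int $maxSize = 10_000)
    {
    }

    /**
     * Returns true if the URI was already seen (drop the request),
     * false if it is new (let the request proceed).
     */
    public function isDuplicate(string $uri): bool
    {
        $fingerprint = hash('sha256', $uri);

        if (isset($this->seen[$fingerprint])) {
            return true;
        }

        $this->seen[$fingerprint] = true;

        // Evict the oldest entry once the cap is exceeded. PHP arrays
        // preserve insertion order, so array_key_first() is the oldest.
        if (count($this->seen) > $this->maxSize) {
            unset($this->seen[array_key_first($this->seen)]);
        }

        return false;
    }
}
```

Inside a spider middleware, you would call isDuplicate() on each request the spider yields and drop the request when it returns true; an evicted fingerprint means that URI may be crawled again later, which is the intended trade-off.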
Hello,
Does roach-php support following links and crawling an entire site? I read the documentation on Scraping versus Crawling and understand the difference between the two, but throughout the documentation both terms "scraper" and "crawler" are used to describe roach-php. So my question is: does roach-php support crawling? If so, I can't seem to find it anywhere in the documentation.
Thank you.