Replies: 3 comments 1 reply
-
FYI, I converted the issue to a discussion since it’s more of a question than an issue. In general, questions like this are much easier to answer if you provide a link to the actual site you’re trying to scrape, as otherwise it’s difficult to fully understand the scenario. I currently don’t have my laptop with me, so it’s difficult to provide a more comprehensive answer. I’ll try to get back to this in the next couple of days.
-
I just tagged a new release of Roach which contains this PR, which makes it possible to have item processors respond only to certain types of items. In this case, I would suggest you use a single spider and instead dispatch multiple requests for the different pages with different parsing callbacks.

To answer your other question about how to deal with passing data between different requests: you can store arbitrary meta data on a request, which you can then access in your parsing callback via `$response->getRequest()->getMeta()`. One thing you will have to work around is that item processors don't return anything back to the spider, so you won't be able to return the model ID after saving it to the database, for instance. In this case, I would suggest that instead of relying on the database primary key, you generate a UUID in your spider and use that to reference models in the database.

So, putting it all together, you could write something like this. A custom `Country` item:

```php
final class Country extends AbstractItem
{
    public function __construct(
        public readonly UuidInterface $countryID,
        public readonly string $name,
    ) {
    }
}
```

A custom `City` item:

```php
final class City extends AbstractItem
{
    public function __construct(
        public readonly UuidInterface $countryID,
        public readonly string $name,
        public readonly int $population,
    ) {
    }
}
```

An item processor that only handles countries and saves them to the database:

```php
final class SaveCountryToDatabaseProcessor extends CustomItemProcessor
{
    /**
     * @param Country $item
     */
    public function handleItem(ItemInterface $item): ItemInterface
    {
        CountryModel::create([
            'uuid' => $item->countryID,
            'name' => $item->name,
        ]);

        return $item;
    }

    protected function getHandledItemClasses(): array
    {
        return [
            Country::class,
        ];
    }
}
```

A custom city processor that saves the city to the database and relates it to the country:

```php
final class SaveCityToDatabaseProcessor extends CustomItemProcessor
{
    /**
     * @param City $item
     */
    public function handleItem(ItemInterface $item): ItemInterface
    {
        CityModel::create([
            'name' => $item->name,
            'population' => $item->population,
            // Or you could query for the country's database ID
            // based on the UUID and save that if you prefer
            'country_id' => $item->countryID,
        ]);

        return $item;
    }

    protected function getHandledItemClasses(): array
    {
        return [
            City::class,
        ];
    }
}
```

And finally the spider class. From what I understood, you would be starting on the city page, so the first thing we do is extract the URL for the country and dispatch a new request. We also generate a UUID for the country, which we attach to the request's meta data by using `withMeta()`. In `parseCountry`, we read that UUID back from the request's meta data and yield a `Country` item. Since we generated the UUID ourselves, we can also pass it along to the city we scraped, so we can use it in the item processor:

```php
class MySpider extends BasicSpider
{
    public function parse(Response $response): Generator
    {
        $countryURL = /* extract URL from page */;
        $countryRequest = new Request('GET', $countryURL, $this->parseCountry(...));
        $countryUUID = Uuid::uuid4();

        yield ParseResult::fromValue(
            $countryRequest->withMeta('countryID', $countryUUID),
        );

        $city = new City(
            $countryUUID,
            /* extract other data from response */
        );

        yield $this->item($city);
    }

    public function parseCountry(Response $response): Generator
    {
        $country = new Country(
            $response->getRequest()->getMeta('countryID'),
            /* extract other data from response */
        );

        yield $this->item($country);
    }
}
```

I hope this helped.
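For completeness, here is how the pieces might be wired together. This is only a sketch based on Roach's documented spider configuration properties and the `Roach` facade; the start URL is a placeholder, and `MySpider` and the processor classes are the ones defined earlier in this thread.

```php
final class MySpider extends BasicSpider
{
    // Hypothetical start URL: Roach begins by requesting each of these
    // and passing the response to parse().
    public array $startUrls = [
        'https://example.com/cities/some-city',
    ];

    // Register the item processors. Because they extend CustomItemProcessor,
    // each one only reacts to the item classes it declares via
    // getHandledItemClasses() and passes everything else through untouched.
    public array $itemProcessors = [
        SaveCountryToDatabaseProcessor::class,
        SaveCityToDatabaseProcessor::class,
    ];

    // parse() and parseCountry() as shown above...
}

// Kick off the crawl.
Roach::startSpider(MySpider::class);
```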
-
One thing to keep in mind is that requests don't immediately get dispatched when yielding them from a spider; they get scheduled and sent later. This means that the country request from the example above will most likely only be processed after the city item has already gone through the item pipeline, so the city processor can't assume the country row already exists in the database.
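One way to stay robust against that ordering, sketched below under the assumption that `CountryModel` is a Laravel Eloquent model as in the example above, is to upsert by UUID. Since the city stores the country's UUID directly rather than the database primary key, it then doesn't matter which item reaches the pipeline first.

```php
// Sketch, not the package's prescribed approach: an order-independent
// variant of the country processor using Eloquent's updateOrCreate().
final class SaveCountryToDatabaseProcessor extends CustomItemProcessor
{
    public function handleItem(ItemInterface $item): ItemInterface
    {
        // Creates the row if this UUID hasn't been seen yet,
        // otherwise updates the existing row's attributes.
        CountryModel::updateOrCreate(
            ['uuid' => $item->countryID],
            ['name' => $item->name],
        );

        return $item;
    }

    protected function getHandledItemClasses(): array
    {
        return [Country::class];
    }
}
```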
-
Hello,
I am taking a look at this package, using the Laravel integration, actually.
I have a specific need. I would like to scrape a page, but the page has three components: the city itself, its country, and its citizens.
The country and citizen components are provided via links on the main page, and the main page may show a few bits of citizen data that the citizen pages themselves might not provide.
I'd like to parse the city page, but I'm not quite sure how to handle processing everything.
I would like to be able to grab the country link, create or update the country in my database, and then pass its ID along to the city processing, so that the city can be inserted/updated as a part of the country.
Finally, I'd like to be able to process any of the citizens, again inserting/updating them in my database, as needed, all with reference to the city ID to which they belong. (And, ideally, handling some of the "extra" bits of data that might exist on the main city page.)
I can't quite figure out if I should have three different spiders, with the city spider calling the country and citizen spiders...or one city spider with different parser methods...? 🤔 In any case, I can't figure out how to pass the country/city database IDs along...and in the case of a single spider, I can't figure out how to make the item processors process one component versus another.
Any help or suggestions? I took a look at https://github.com/ksassnowski/roach-example-project, which was quite helpful, in general, but I didn't see how it could help with these particular problems.
Thanks in advance for your help. 🤓