Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort crawling #380

Open
Radiergummi opened this issue May 25, 2021 · 3 comments
Open

Abort crawling #380

Radiergummi opened this issue May 25, 2021 · 3 comments

Comments

@Radiergummi
Copy link

I have looked at #293 and #289, but those issues are slightly different. We have a crawler library based on node-crawler that performs computationally intensive crawling tasks and writes to different output sources, depending on the current task.
Due to the nature of our tasks, it is crucial for crawlers to be interruptible and pick up work at the same point later on, and write consistent output files at the same time.
Think of a JSON output file, for example, to which an array of objects is written, one object per URL. If the crawler stops, a final closing bracket must be written to the output file to ensure the file is valid JSON.

We achieved this using the preRequest hook and a flag:

let aborted: boolean = false;

const crawler = new Crawler({
    async preRequest(options, done): Promise<void> {
        if (! aborted) {
            await cleanup() && done();
        }
    },
    preRequest(error, response, done): Promise<void> {
        if (error || aborted) {
            await cleanup() && done();
        }

        // ...
    },
});

This works, but it's not optimal: Requests may still be queued while we're in aborted state, depending on the implementation. Additionally, there's no way to abort in-flight requests.

To tackle this issue, I'd like to suggest implementing support for the AbortController API, which can be used in browsers to abort ongoing fetch requests and has been implemented in recent Node.JS versions (with a poly-fill available, too). Implementation wise, one could steal from node-fetch:

// Wrap http.request into fetch
const send       = ( options.protocol === 'https:' ? https : http ).request;
const { signal } = request;
let response     = null;

const abort = () => {
    const error = new AbortError( 'The operation was aborted.' );
    reject( error );
    if ( request.body && request.body instanceof Stream.Readable ) {
        request.body.destroy( error );
    }

    if ( !response || !response.body ) {
        return;
    }

    response.body.emit( 'error', error );
};

if ( signal && signal.aborted ) {
    abort();
    return;
}

(See full source)

I'm happy to help with this.

@mike442144
Copy link
Collaborator

Thank you. It has been a really headache these years. So I'm happy that you dive into the code and propose a solution. Looking forward to your MR. :)

@Radiergummi
Copy link
Author

Maybe it would make sense to swap request for node-fetch altogether at this point? I'm currently entangled in another project, but I'll set a reminder to get back to this issue and see what I can come up with ✌️

@mike442144
Copy link
Collaborator

Thanks. I had a chance to use node-fetch in a much small project and found out that I couldn't manage the tasks very well but rely on the Promise itself. What do you think about this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants