Some requests (stream-based) never end and block the queue #371
Comments
I have the same issue. The spider must have not only a timeout, but also a limit on the download volume.
Refer to my comment here: request/request#3341
Thanks @mike442144, but in this (crawling) context we can't blacklist a URL before we encounter it. I still don't know how to identify this type of connection in advance and terminate it. I tried sending an OPTIONS request first, but it didn't help to predict the behaviour of the subsequent GET request. The most elegant solution in my opinion would be a timeout, plus the 'response size limit' option that @slienceisgolden mentioned (which would also cover other pitfalls: huge documents, files, other streams, etc.). I'm not currently working on it, but it's still relevant.
@Verhov Limiting the response body size is a good idea and should work well in your case. The body size limit should also be configurable in the options for flexibility. Looking forward to your merge request :)
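For illustration, here is a minimal sketch of how such a response size limit could work, built directly on the `request` package (which node-crawler uses under the hood). The `fetchWithBodyLimit` helper and the `maxBodySize` parameter are hypothetical, not an existing node-crawler option:

```js
// Sketch only: abort the underlying request once the body exceeds a byte limit.
const request = require('request');

function fetchWithBodyLimit(url, maxBodySize, callback) {
  let finished = false;
  let received = 0;
  const chunks = [];

  const req = request({ url, timeout: 15000 });

  req.on('data', (chunk) => {
    if (finished) return;
    received += chunk.length;
    if (received > maxBodySize) {
      finished = true;
      req.abort(); // endless streams (e.g. web radio) never end on their own
      return callback(new Error(`Response exceeded ${maxBodySize} bytes`));
    }
    chunks.push(chunk);
  });

  req.on('end', () => {
    if (finished) return;
    finished = true;
    callback(null, Buffer.concat(chunks));
  });

  req.on('error', (err) => {
    if (finished) return;
    finished = true;
    callback(err);
  });
}

// Usage: the stream from https://goldfm.nu/ would be aborted after ~1 MB.
fetchWithBodyLimit('https://goldfm.nu/', 1024 * 1024, (err, body) => {
  if (err) return console.error(err.message);
  console.log(`Downloaded ${body.length} bytes`);
});
```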
Summary:
I crawled millions of domains and found that a request to some stream-based domain can permanently block the queue.
The timeout does not fire in this case, and RAM usage grows constantly.
I found two such domains: https://goldfm.nu/ and https://rsradio.online/.
It's really nice radio 😄 but it completely blocks my crawler.
Current behavior
I'm using a timeout, but it does not seem to work correctly; the callback is never fired in this case:
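The original snippet was not preserved here; below is a minimal reconstruction of the kind of configuration involved, assuming node-crawler's documented `timeout` option and `(error, res, done)` callback signature:

```js
// Illustrative sketch of the crawler setup; the options shown are standard
// node-crawler options passed through to `request`.
const Crawler = require('crawler');

const crawler = new Crawler({
  timeout: 15000, // connection/response timeout in ms
  retries: 0,
  headers: { Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9' },
  callback: (error, res, done) => {
    if (error) {
      console.error(error); // never reached for the streaming URLs below
    } else {
      console.log(res.statusCode);
    }
    done();
  },
});

crawler.queue('https://goldfm.nu/'); // media stream: the callback never fires
```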
Issue
This is definitely because the request starts a media stream and node-crawler tries to download all of it, so the request stays pending forever.
Side issues
Also, as the stream keeps arriving it increases RAM usage, and it will presumably end in an 'out of memory' exception.
Attempts to fix
I also tried setting the Accept header to HTML only, but it has no effect:
headers: { Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9' },
Currently I just skip this URL as a special case, but I suspect it is not unique.
Expected behavior
The timeout should raise an error when the response does not complete within the allotted time.
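In other words, the deadline should cover the whole response (headers and body), not just the connection. A rough sketch of that behaviour, using a hypothetical `getWithDeadline` wrapper built directly on the `request` package:

```js
// Sketch of the expected behaviour: abort the underlying request stream
// if the full response has not arrived within `deadline` ms.
const request = require('request');

function getWithDeadline(url, deadline, callback) {
  let finished = false;
  const chunks = [];
  const req = request({ url });

  const timer = setTimeout(() => {
    if (finished) return;
    finished = true;
    req.abort(); // stop an endless media stream
    callback(new Error(`No complete response within ${deadline} ms`));
  }, deadline);

  req.on('data', (chunk) => chunks.push(chunk));
  req.on('end', () => {
    if (finished) return;
    finished = true;
    clearTimeout(timer);
    callback(null, Buffer.concat(chunks));
  });
  req.on('error', (err) => {
    if (finished) return;
    finished = true;
    clearTimeout(timer);
    callback(err);
  });
}
```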
Related issues
This issue is definitely related to the request package.
Question
Do you have any ideas on how to resolve this case?