Crawl Limits #22

blaiszik · 2020-08-13T16:24:28Z

It might be helpful to add arguments that let a user specify a max_crawl_depth (maximum folder depth) or a max_crawl_total (maximum total number of files). This is not something we need currently, just a potentially useful addition.
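To make the two proposed limits concrete, here is a minimal sketch of a local breadth-first crawl that honors both. Only the argument names max_crawl_depth and max_crawl_total come from this issue; the `crawl` function, its signature, and the convention that the root folder is depth 0 are assumptions for illustration, not the project's actual crawler.

```python
import os
from collections import deque
from typing import List, Optional


def crawl(root: str,
          max_crawl_depth: Optional[int] = None,
          max_crawl_total: Optional[int] = None) -> List[str]:
    """Hypothetical breadth-first crawl of `root` that returns file paths.

    Assumption: `root` is depth 0, so max_crawl_depth=0 lists only the
    files directly inside `root`. None means "no limit" for either arg.
    """
    if max_crawl_total is not None and max_crawl_total <= 0:
        return []

    files: List[str] = []
    queue = deque([(root, 0)])  # (directory, depth of that directory)
    while queue:
        directory, depth = queue.popleft()
        with os.scandir(directory) as entries:
            # Sort by name so the crawl order is deterministic.
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    # Only descend while the child stays within the depth limit.
                    if max_crawl_depth is None or depth + 1 <= max_crawl_depth:
                        queue.append((entry.path, depth + 1))
                elif entry.is_file(follow_symlinks=False):
                    files.append(entry.path)
                    # Stop as soon as the file budget is used up.
                    if max_crawl_total is not None and len(files) >= max_crawl_total:
                        return files
    return files
```

Leaving both arguments at None keeps today's unbounded behavior, so the limits would be purely opt-in.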

tskluzac · 2020-10-17T22:42:22Z

Currently thinking this:

Have optional max_crawl_depth and max_crawl_total args on the crawler. The state of the crawl (all local queues and in-flight tasks) will be pickled and stored in S3 as a checkpoint of sorts. Once the service stops, the user can access a 'crawlNext' token that picks up where the previous crawl hit its max. The 'crawlNext' token will be deleted after 24 hours to save space, since these queues can get pretty hefty and the state of a repo could change pretty drastically in that time.
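A rough sketch of the checkpoint-and-resume idea, under loud assumptions: an in-memory dict stands in for S3, the repository is modeled as a nested dict (dict values are folders, anything else is a file), and only the pending queue is checkpointed, not in-flight tasks. The helper names (`save_checkpoint`, `load_checkpoint`, `crawl`) and the one-time-use token behavior are hypothetical; only max_crawl_total, the 'crawlNext' token, pickling, and the 24-hour lifetime come from the comment above.

```python
import pickle
import time
import uuid
from collections import deque

TOKEN_TTL_SECONDS = 24 * 60 * 60  # the 24-hour token lifetime from the proposal
_checkpoints = {}  # stand-in for S3: token -> (expiry time, pickled queue)


def save_checkpoint(queue, now=None):
    """Pickle the pending queue and return a new crawlNext token."""
    now = time.time() if now is None else now
    token = uuid.uuid4().hex
    _checkpoints[token] = (now + TOKEN_TTL_SECONDS, pickle.dumps(list(queue)))
    return token


def load_checkpoint(token, now=None):
    """Restore a queue; raises KeyError for unknown or expired tokens."""
    now = time.time() if now is None else now
    expiry, blob = _checkpoints.pop(token)  # assumption: tokens are one-time use
    if now > expiry:
        raise KeyError("crawlNext token expired")
    return deque(pickle.loads(blob))


def crawl(tree=None, max_crawl_total=None, crawl_next=None, now=None):
    """Crawl `tree`, or resume from `crawl_next`.

    Returns (files, token); token is None when the crawl finished.
    """
    if crawl_next is not None:
        queue = load_checkpoint(crawl_next, now)
    else:
        queue = deque([("", tree)])  # (path, node) pairs still to visit

    files = []
    while queue:
        # Limit reached with work remaining: checkpoint and hand back a token.
        if max_crawl_total is not None and len(files) >= max_crawl_total:
            return files, save_checkpoint(queue, now)
        path, node = queue.popleft()
        if isinstance(node, dict):
            for name, child in node.items():
                queue.append((f"{path}/{name}", child))
        else:
            files.append(path)
    return files, None
```

A caller would run `files, token = crawl(tree, max_crawl_total=1000)` and, if `token` is not None, continue later with `crawl(crawl_next=token)`. Because the repo can change between calls, a resumed crawl in this sketch works from the stale queue rather than re-listing folders it already expanded.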
