[Discussion] Data collection and processing design for a distributed crawler #1

Open
thonatos opened this issue Sep 4, 2018 · 1 comment

Comments

@thonatos
Member

thonatos commented Sep 4, 2018

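Egg.js issues are for reporting framework problems; discussions go here.

Background

A distributed crawler system. The workflow: n small Alibaba Cloud ECS crawler instances fetch data and send it back over socket.io to a small-to-medium Alibaba Cloud ECS controller. The controller runs its computations on the data and then writes the results to Alibaba Cloud RDS. The data volume is large, and the Egg application buffers incoming data in the queue from the async/await library.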

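Problem

The crawlers return data to the controller faster than the controller can process it. Within ten minutes or so the queue accumulates several hundred thousand items, and the controller's Node process crashes.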
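
To make the imbalance concrete, here is a minimal simulation, assuming made-up rates of 5 items produced and 2 items processed per tick. `BoundedQueue` and `simulate` are hypothetical helpers for illustration only; they are not part of Egg.js, socket.io, or the queue library mentioned above. With no limit the backlog grows linearly until memory runs out; with a limit the backlog stays flat, and the rejected items are the ones a crawler would have to hold back or resend.

```javascript
// Hypothetical in-memory queue with an optional size limit.
class BoundedQueue {
  constructor(limit) {
    this.limit = limit; // use Infinity for an unbounded queue
    this.items = [];
  }

  // Returns true if the item was accepted, false if the queue is full.
  offer(item) {
    if (this.items.length >= this.limit) return false;
    this.items.push(item);
    return true;
  }

  take() {
    return this.items.shift();
  }

  get size() {
    return this.items.length;
  }
}

// Each tick the crawlers offer `produced` items and the controller
// processes at most `consumed` items. Rates are illustrative assumptions.
function simulate(ticks, produced, consumed, limit) {
  const queue = new BoundedQueue(limit);
  let rejected = 0;
  for (let t = 0; t < ticks; t++) {
    for (let i = 0; i < produced; i++) {
      if (!queue.offer({ tick: t, seq: i })) rejected++;
    }
    for (let i = 0; i < consumed && queue.size > 0; i++) {
      queue.take();
    }
  }
  return { backlog: queue.size, rejected };
}

console.log(simulate(100, 5, 2, Infinity)); // backlog keeps growing
console.log(simulate(100, 5, 2, 10));       // backlog is capped, extra items are rejected
```

The bounded variant does not make the controller faster; it only moves the pressure back to the producers, where it can be handled without crashing the controller process.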

Proposal

TODO

@thonatos
Member Author

thonatos commented Sep 4, 2018

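Flow

Notes

  1. Crawling tasks can produce a large number of data files, so consider cloud storage such as OSS and keep the data files in the cloud.
  2. Use the notification feature of OSS/S3 to push a message to a queue (or send the message to the queue yourself after the upload).
  3. The data-processing module is optional (for example, when the data was already processed during the crawl).
  4. The queue is consumed in order, and the data that is needed is synced to the database or another persistent store.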
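
The four steps above can be sketched roughly as follows. This is a minimal in-memory sketch under stated assumptions, not the author's implementation: `objectStore`, `notifications`, and `database` stand in for OSS/S3, a message queue, and the database, and `uploadAndNotify`, `processRecords`, and `drainQueue` are hypothetical names. The point it illustrates is that only a small key travels through the queue, while the bulky payload stays in storage until the worker is ready for it.

```javascript
// In-memory stand-ins (assumptions): a real system would use OSS/S3,
// a message queue service, and a database instead of these structures.
const objectStore = new Map(); // key -> serialized data file
const notifications = [];      // one small message per uploaded file
const database = [];           // rows persisted by the worker

// Crawler side (steps 1 and 2): store the payload, then enqueue only its key.
async function uploadAndNotify(key, records) {
  objectStore.set(key, JSON.stringify(records));
  notifications.push({ key });
}

// Optional processing (step 3): drop records without a title, trim the rest.
function processRecords(records) {
  return records
    .filter((r) => r.title)
    .map((r) => ({ ...r, title: r.title.trim() }));
}

// Worker side (step 4): handle notifications one by one, at the worker's pace.
async function drainQueue({ transform = true } = {}) {
  let handled = 0;
  while (notifications.length > 0) {
    const { key } = notifications.shift();
    const records = JSON.parse(objectStore.get(key));
    const rows = transform ? processRecords(records) : records;
    database.push(...rows.map((row) => ({ source: key, ...row })));
    handled++;
  }
  return handled;
}
```

A crawler would call `uploadAndNotify('crawl/a.json', records)` after each batch, and the worker would call `drainQueue()` (or `drainQueue({ transform: false })` when the crawler has already cleaned the data). Because the queue holds keys rather than payloads, a backlog costs far less memory than buffering the raw data in the controller process.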
