This project crawls data from bbs.byr.cn only.
I simplified the crawl flow based on the site's authentication mechanism and data flow.
This makes the crawler simpler, smaller, and faster.
Only 3 steps are needed:
1: Create a MySQL database with the tables shown below.
Table sect stores each section listed in the left panel.
+-------+------------------+------+-----+---------+----------------+
| Field | Type             | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| id    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| url   | varchar(60)      | NO   | UNI | NULL    |                |
| name  | varchar(50)      | NO   |     | NULL    |                |
+-------+------------------+------+-----+---------+----------------+
Table auart stores a description of each article.
+--------+---------------------+------+-----+-------------------+----------------+
| Field  | Type                | Null | Key | Default           | Extra          |
+--------+---------------------+------+-----+-------------------+----------------+
| id     | bigint(20) unsigned | NO   | PRI | NULL              | auto_increment |
| uptime | date                | YES  |     | 2016-05-19        |                |
| hot    | int(10) unsigned    | NO   |     | 0                 |                |
| author | varchar(50)         | NO   | MUL | NULL              |                |
| title  | varchar(100)        | NO   |     | NULL              |                |
| url    | varchar(80)         | NO   | UNI | http://bbs.byr.cn |                |
+--------+---------------------+------+-----+-------------------+----------------+
Table art stores the full content of each article.
+-------+---------------------+------+-----+---------+----------------+
| Field | Type                | Null | Key | Default | Extra          |
+-------+---------------------+------+-----+---------+----------------+
| id    | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| url   | varchar(80)         | NO   | UNI | NULL    |                |
| text  | text                | YES  |     | NULL    |                |
+-------+---------------------+------+-----+---------+----------------+
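The three tables above can be created from Python. The sketch below is only a reconstruction from the DESCRIBE output: the column types, keys, and defaults come from the tables, while the engine, character set, and index names are not given in this README and are left at the MySQL defaults. The names SCHEMA and create_tables are made up for illustration and are not part of this project.

```python
# DDL reconstructed from the DESCRIBE output above. Engine, charset, and index
# names are assumptions (MySQL defaults), since the README does not state them.
SCHEMA = {
    "sect": """
        CREATE TABLE IF NOT EXISTS sect (
            id   INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
            url  VARCHAR(60) NOT NULL,
            name VARCHAR(50) NOT NULL,
            PRIMARY KEY (id),
            UNIQUE KEY (url)
        )
    """,
    "auart": """
        CREATE TABLE IF NOT EXISTS auart (
            id     BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
            uptime DATE DEFAULT '2016-05-19',
            hot    INT(10) UNSIGNED NOT NULL DEFAULT 0,
            author VARCHAR(50) NOT NULL,
            title  VARCHAR(100) NOT NULL,
            url    VARCHAR(80) NOT NULL DEFAULT 'http://bbs.byr.cn',
            PRIMARY KEY (id),
            KEY (author),
            UNIQUE KEY (url)
        )
    """,
    "art": """
        CREATE TABLE IF NOT EXISTS art (
            id     BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
            url    VARCHAR(80) NOT NULL,
            `text` TEXT,
            PRIMARY KEY (id),
            UNIQUE KEY (url)
        )
    """,
}


def create_tables(connection):
    """Run every CREATE TABLE statement on an open DB-API connection."""
    cursor = connection.cursor()
    try:
        for ddl in SCHEMA.values():
            cursor.execute(ddl)
        connection.commit()
    finally:
        cursor.close()
```

create_tables accepts any open DB-API connection to the target database, for example one from a driver such as PyMySQL (an assumption, this README does not name a driver). Newer MySQL versions accept the integer display widths such as int(10) but report them as deprecated.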
2: Crawl the BBS section information.
cmd: scrapy crawl bbscat
3: Crawl the BBS content.
cmd: scrapy crawl bbs
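Steps 2 and 3 can also be chained from a small Python script. This is a minimal sketch: the spider names bbscat and bbs come from the commands above, the helper names crawl_commands and run_all are made up, and the order simply follows the step numbers (presumably step 3 relies on the sections stored by step 2, but this README does not say so explicitly).

```python
import subprocess

# Spider names are taken from the README commands, in the order of the steps.
SPIDERS = ["bbscat", "bbs"]


def crawl_commands(spiders=SPIDERS):
    """Build the `scrapy crawl <name>` command line for each spider."""
    return [["scrapy", "crawl", name] for name in spiders]


def run_all(runner=subprocess.run):
    """Run the spiders one after another, stopping at the first failure."""
    for command in crawl_commands():
        # check=True raises CalledProcessError if scrapy exits with an error.
        runner(command, check=True)
```

Call run_all() from inside the Scrapy project directory, because scrapy crawl only finds spiders when it runs inside a project.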
Note:
My crawler is very fast. The BBS has about 1.1 million articles in total, and it took only 6 hours to crawl all of them.
Machine: Aliyun ECS, 1GB memory, 1 CPU core, 1MB bandwidth
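To put the numbers in the note in perspective, the average rate can be worked out directly. Both inputs are the rounded figures from the note, so the result is only an estimate.

```python
ARTICLES = 1_100_000  # "about 1.1 million" articles, from the note
HOURS = 6             # reported total crawl time

per_second = ARTICLES / (HOURS * 3600)
per_minute = per_second * 60
print(f"{per_second:.1f} articles/s, {per_minute:.0f} articles/min")
# prints: 50.9 articles/s, 3056 articles/min
```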
Because the site requires authentication, replace the account and password with your own in your project.
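One way to avoid hard-coding the account and password is to read them from the environment. This README does not say where the project keeps its credentials, so the sketch below is an assumption throughout: the variable names BYR_ACCOUNT and BYR_PASSWORD and the function load_credentials are hypothetical.

```python
import os


def load_credentials(environ=os.environ):
    """Return (account, password) read from hypothetical environment variables."""
    try:
        return environ["BYR_ACCOUNT"], environ["BYR_PASSWORD"]
    except KeyError as missing:
        # Name the missing variable so the error is easy to fix.
        raise RuntimeError(
            f"set {missing.args[0]} before running the crawler"
        ) from None
```

The returned pair can then be passed to whatever login code the spider uses, which keeps the real account out of the repository.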