This crawler was used to build the following dataset from Medium.com, with ground-truth correspondence between sentences and reviews.
You may contact Charlie Wu at [email protected] to obtain the dataset.
Crawler environment requirements:
- Python 3.6
- PostgreSQL on macOS/Linux
Define constant.py, then download the chromedriver build for your environment into the project directory.
To query data:
psql medium
\dt
select $field from $table where [conditions];
-- select numLikes of highlights whose article was published before 2017-12-01
SELECT highlight.numLikes, article.postTime FROM highlight LEFT JOIN article ON highlight.corrArticleID = article.articleID WHERE article.postTime < timestamp '2017-12-01 00:00:00' AND highlight.numLikes >= 0;
-- select all sentences of an article, in order
SELECT * FROM stn WHERE corrArticleID = $articleID ORDER BY stnID;
-- select the number of comments associated with a highlight
SELECT count(*) FROM comment WHERE corrHighlightID = $highlightID;
-- select a highlight that has more than one associated comment
SELECT * FROM highlight WHERE (SELECT count(*) FROM comment WHERE comment.corrHighlightID = highlight.highlightID) > 1 LIMIT 1;
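The last query above can also be issued from Python. The following is a minimal sketch rather than part of the crawler: it uses the standard-library sqlite3 module with an in-memory database as a stand-in for PostgreSQL so that it runs without a server, and the helper names and sample rows are made up for illustration. Against the real medium database you would swap in a PostgreSQL driver, whose placeholder syntax may differ from the `?` used here.

```python
import sqlite3


def highlights_with_more_than(conn, n_comments):
    """Return (highlightID, content, comment_count) for highlights
    that have more than n_comments associated comments."""
    sql = """
        SELECT h.highlightID, h.content, COUNT(c.commentID)
        FROM highlight AS h
        JOIN comment AS c ON c.corrHighlightID = h.highlightID
        GROUP BY h.highlightID, h.content
        HAVING COUNT(c.commentID) > ?
        ORDER BY h.highlightID
    """
    # The threshold is passed as a bound parameter, not formatted into the SQL.
    return conn.execute(sql, (n_comments,)).fetchall()


def demo_connection():
    """Build a tiny in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE highlight (
            highlightID INTEGER PRIMARY KEY, content TEXT, numLikes INT);
        CREATE TABLE comment (
            commentID INTEGER PRIMARY KEY, selfArticleID INT,
            corrArticleID INT, corrHighlightID INT);
        INSERT INTO highlight VALUES (1, 'first highlight', 3);
        INSERT INTO highlight VALUES (2, 'second highlight', 0);
        INSERT INTO comment VALUES (1, 10, 5, 1);
        INSERT INTO comment VALUES (2, 11, 5, 1);
        INSERT INTO comment VALUES (3, 12, 5, 2);
    """)
    return conn
```

Calling `highlights_with_more_than(demo_connection(), 1)` returns only the highlight that has two comments. The JOIN with GROUP BY and HAVING is an alternative formulation of the correlated subquery shown above, chosen here because it also returns the comment count.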
Database table structure:
- article
Field | Type | Info |
---|---|---|
articleID | SERIAL PRIMARY KEY | |
mediumID | varchar(300) | |
title | text | |
recommends | int | |
tags | varchar(300) | list of tags |
postTime | timestamp | |
numLikes | int | |
corrAuthorID | int | link to author |
- author
Field | Type | Info |
---|---|---|
authorID | SERIAL PRIMARY KEY | |
name | varchar(50) | |
mediumID | varchar(20) | |
username | varchar(50) | |
bio | text | |
- topic
Field | Type | Info |
---|---|---|
topicID | SERIAL PRIMARY KEY | |
name | text | |
mediumID | varchar(20) | |
description | text | |
- paragraph
Field | Type | Info |
---|---|---|
paragraphID | SERIAL PRIMARY KEY | |
mediumID | varchar(10) | |
content | text | |
corrArticleID | int | link to article |
A paragraph's position within its article is given by the order of its paragraphID.
- stn
Field | Type | Info |
---|---|---|
stnID | SERIAL PRIMARY KEY | |
paragraphID | int | link to paragraph |
content | text | |
corrArticleID | int | link to article |
A sentence's position within its paragraph is given by the order of its stnID.
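Because sentence order is implied by stnID, a paragraph can be reassembled from its stn rows by sorting on that column. The sketch below illustrates this under stated assumptions: the function name is hypothetical, rows are plain (stnID, paragraphID, content) tuples as a query might return them, and joining sentences with a single space is a guess at the original spacing.

```python
from collections import defaultdict


def rebuild_paragraphs(stn_rows):
    """Group sentence rows by paragraph and join them in stnID order.

    stn_rows: iterable of (stnID, paragraphID, content) tuples.
    Returns a dict mapping paragraphID to the reassembled text.
    """
    by_paragraph = defaultdict(list)
    # Sorting by stnID restores the order of sentences within each paragraph.
    for _stn_id, paragraph_id, content in sorted(stn_rows, key=lambda row: row[0]):
        by_paragraph[paragraph_id].append(content)
    # Assumption: sentences were separated by one space in the source text.
    return {pid: " ".join(parts) for pid, parts in by_paragraph.items()}
```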
- highlight
Field | Type | Info |
---|---|---|
highlightID | SERIAL PRIMARY KEY | |
content | text | |
numLikes | int | |
startOffset | int | |
endOffset | int | |
corrParagraphID | int | link to paragraph |
corrArticleID | int | link to article |
- comment
Field | Type | Info |
---|---|---|
commentID | SERIAL PRIMARY KEY | |
selfArticleID | int | link to article |
corrArticleID | int | link to article |
corrHighlightID | int | link to highlight |
The detailed content of a comment is stored as a row in the article table, referenced by the selfArticleID field, so comments form a tree structure.
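The tree can be reconstructed from comment rows alone. The following is a minimal sketch, assuming that corrArticleID identifies the article being commented on (which may itself be a comment's own article row) and that there are no cycles; the function names and the nested-dict output shape are hypothetical choices for illustration.

```python
from collections import defaultdict


def build_reply_index(comment_rows):
    """Map each article ID to the article IDs of comments written on it.

    comment_rows: iterable of (commentID, selfArticleID, corrArticleID).
    """
    children = defaultdict(list)
    for _comment_id, self_article_id, corr_article_id in comment_rows:
        # Assumption: corrArticleID is the parent, selfArticleID is the comment itself.
        children[corr_article_id].append(self_article_id)
    return children


def comment_tree(root_article_id, children):
    """Return a nested dict describing all comments under root_article_id."""
    return {
        "articleID": root_article_id,
        "replies": [
            comment_tree(child_id, children)
            for child_id in children.get(root_article_id, [])
        ],
    }
```

For example, with rows `[(1, 10, 5), (2, 11, 5), (3, 12, 10)]`, article 5 has two direct comments (10 and 11), and comment 10 has one reply (12).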
Disclaimer: This project was developed for academic use only. The developer is not responsible for any consequences arising from how users use this program. If you use the dataset, an acknowledgement would be appreciated.