Medium Crawler 🐜

This crawler builds a dataset from Medium.com that records the ground-truth correspondence between sentences and reviews.

Dataset Statistics

To obtain the dataset, contact Charlie Wu at [email protected].

Crawler environment requirements

  • Python 3.6
  • PostgreSQL on macOS/Linux

Define constant.py, then download the chromedriver build for your environment here and place it in the project directory.

To query data:

psql medium
\dt
select $field from $table where [conditions];
-- select numLikes of highlights whose article was published on or before 2017 Dec 1
SELECT highlight.numLikes, article.postTime FROM highlight LEFT JOIN article ON highlight.corrArticleID = article.articleID WHERE article.postTime <= timestamp '2017-12-01 00:00:00' AND highlight.numLikes >= 0;
-- select all sentences of an article
SELECT * FROM stn WHERE corrArticleID = $articleID;
-- count the comments associated with a highlight
SELECT count(*) FROM comment WHERE corrHighlightID = $highlightID;
-- select a highlight that has more than one associated comment
SELECT * FROM highlight WHERE (SELECT count(*) FROM comment WHERE comment.corrHighlightID = highlight.highlightID) > 1 LIMIT 1;
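The join query above can also be issued from Python. The following is a minimal sketch and not part of the crawler: it uses the standard-library `sqlite3` module with a tiny in-memory stand-in for the `article` and `highlight` tables so that it runs without a PostgreSQL server. The helper names `likes_before` and `demo_connection` and the sample rows are hypothetical. Against the real `medium` database you would presumably connect through a PostgreSQL driver instead, whose placeholder syntax may differ from the `?` used here.

```python
import sqlite3


def likes_before(conn, cutoff):
    """Return (numLikes, postTime) for highlights whose article was
    posted on or before `cutoff` (an ISO-formatted timestamp string)."""
    sql = (
        "SELECT highlight.numLikes, article.postTime "
        "FROM highlight LEFT JOIN article "
        "ON highlight.corrArticleID = article.articleID "
        "WHERE article.postTime <= ? AND highlight.numLikes >= 0 "
        "ORDER BY highlight.highlightID"
    )
    # The cutoff is passed as a bound parameter, not formatted into the SQL.
    return conn.execute(sql, (cutoff,)).fetchall()


def demo_connection():
    """Build a hypothetical in-memory stand-in with only the columns used above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE article (articleID INTEGER PRIMARY KEY, postTime TEXT);
        CREATE TABLE highlight (
            highlightID INTEGER PRIMARY KEY,
            numLikes INTEGER,
            corrArticleID INTEGER
        );
        INSERT INTO article VALUES
            (1, '2017-06-01 00:00:00'),
            (2, '2018-01-15 00:00:00');
        INSERT INTO highlight VALUES (1, 5, 1), (2, 0, 1), (3, 9, 2);
        """
    )
    return conn


rows = likes_before(demo_connection(), "2017-12-01 00:00:00")
print(rows)
```

Only the two highlights attached to the article posted in June 2017 are returned; the highlight on the 2018 article is filtered out by the cutoff.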

Database Table Structure:

  • article
| Field | Type | Info |
| --- | --- | --- |
| articleID | SERIAL PRIMARY KEY | |
| mediumID | varchar(300) | |
| title | text | |
| recommends | int | |
| tags | varchar(300) | list of tags |
| postTime | timestamp | |
| numLikes | int | |
| corrAuthorID | int | link to author |
  • author
| Field | Type | Info |
| --- | --- | --- |
| authorID | SERIAL PRIMARY KEY | |
| name | varchar(50) | |
| mediumID | varchar(20) | |
| username | varchar(50) | |
| bio | text | |
  • topic
| Field | Type | Info |
| --- | --- | --- |
| topicID | SERIAL PRIMARY KEY | |
| name | text | |
| mediumID | varchar(20) | |
| description | text | |
  • paragraph
| Field | Type | Info |
| --- | --- | --- |
| paragraphID | SERIAL PRIMARY KEY | |
| mediumID | varchar(10) | |
| content | text | |
| corrArticleID | int | link to article |

A paragraph's position in its article is given by the order of its paragraphID.

  • stn
| Field | Type | Info |
| --- | --- | --- |
| stnID | SERIAL PRIMARY KEY | |
| paragraphID | int | link to paragraph |
| content | text | |
| corrArticleID | int | link to article |

A sentence's position in its paragraph is given by the order of its stnID.

  • highlight
| Field | Type | Info |
| --- | --- | --- |
| highlightID | SERIAL PRIMARY KEY | |
| content | text | |
| numLikes | int | |
| startOffset | int | |
| endOffset | int | |
| corrParagraphID | int | link to paragraph |
| corrArticleID | int | link to article |
  • comment
| Field | Type | Info |
| --- | --- | --- |
| commentID | SERIAL PRIMARY KEY | |
| selfArticleID | int | link to article |
| corrArticleID | int | link to article |
| corrHighlightID | int | link to highlight |
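Because the paragraph and stn tables store no explicit position column and rely on ID order instead, the text of an article has to be reassembled by sorting. The following is a minimal sketch of that step, assuming the rows have already been fetched from stn for a single article as `(stnID, paragraphID, content)` tuples. The function name `rebuild_article`, the sample rows, and the choice to join sentences with a single space are illustrative assumptions, not part of the crawler.

```python
from itertools import groupby


def rebuild_article(stn_rows):
    """Group sentences into paragraphs, using ID order as position.

    stn_rows: iterable of (stnID, paragraphID, content) for one article.
    Returns a list of paragraph strings in article order.
    """
    # Sort by paragraphID first (position in the article),
    # then by stnID (position in the paragraph).
    ordered = sorted(stn_rows, key=lambda row: (row[1], row[0]))
    paragraphs = []
    for _paragraph_id, group in groupby(ordered, key=lambda row: row[1]):
        # Joining with a space is an assumption about the original spacing.
        paragraphs.append(" ".join(content for _, _, content in group))
    return paragraphs


# Hypothetical rows, deliberately out of order.
sample = [
    (12, 4, "Second sentence."),
    (20, 5, "A new paragraph."),
    (11, 4, "First sentence."),
]
paragraphs = rebuild_article(sample)
print(paragraphs)
```

The same ordering could be pushed into the query with an `ORDER BY` on the two ID columns; sorting in Python just keeps the sketch independent of the database.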

The details of a comment are stored as a row in the article table, referenced by the comment's selfArticleID field, so articles and comments form a tree structure:

[Figure: tree graph]
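The tree structure described above can be walked with a short recursive traversal. The following is a minimal sketch under one stated assumption: a reply to a comment sets its corrArticleID to the selfArticleID of the comment it replies to, so that every node in the tree is an article row. The names `build_children` and `walk` and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_children(comment_rows):
    """Map each parent articleID to the articleIDs of its direct comments.

    comment_rows: iterable of
        (commentID, selfArticleID, corrArticleID, corrHighlightID).
    """
    children = defaultdict(list)
    for _comment_id, self_article_id, corr_article_id, _highlight_id in comment_rows:
        # The comment's own content lives in article row selfArticleID;
        # corrArticleID is the node it is attached to.
        children[corr_article_id].append(self_article_id)
    return children


def walk(children, article_id, depth=0):
    """Yield (depth, articleID) pairs in depth-first order."""
    yield depth, article_id
    for child_id in children.get(article_id, []):
        yield from walk(children, child_id, depth + 1)


# Hypothetical rows: two comments on article 100, one reply to the first comment.
comments = [
    (1, 101, 100, 7),
    (2, 102, 100, None),
    (3, 103, 101, None),
]
tree = list(walk(build_children(comments), 100))
for depth, article_id in tree:
    print("  " * depth + str(article_id))
```

Each yielded articleID can then be looked up in the article table to retrieve the text of the original post or of the comment.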

Disclaimer: This project is developed for academic use only. The developer is not responsible for any consequences arising from how users use this program. If you use the dataset, acknowledgement would be appreciated.