Korean SNS & Article Analysis

Back-end

Crawling

Todo

Crawl Naver news or Google news based on a query
Connect the crawler to the parsing step so that crawled text is parsed and stored in the database automatically
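The README does not say how crawling will be implemented. As one possible starting point, the sketch below assumes the news source exposes a query-based RSS 2.0 feed: it builds a search URL and extracts title/link pairs from the feed text. `build_search_url`, `parse_rss_items`, the `q` parameter name, and the feed address are all hypothetical; fetching the page and handing the items to the parser are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_search_url(base_url, query):
    """Append the query as a percent-encoded `q` parameter (parameter name is an assumption)."""
    return f"{base_url}?{urlencode({'q': query})}"


def parse_rss_items(xml_text):
    """Return title/link pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


# Hand-written sample feed, only to show the expected shape.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>Article A</title><link>http://example.com/a</link></item>
  <item><title>Article B</title><link>http://example.com/b</link></item>
</channel></rss>"""

items = parse_rss_items(SAMPLE_FEED)
```

Each returned dict could then be passed to the parsing step and written to the posts table.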

Parsing using KoNLPy

Todo

Issue

~~Are all the morphemes really needed? (e.g. 의, 는, 이다)~~
~~Multithreading in the parsing process doesn't work; it only works for short sentences.~~ Doesn't matter: multithreading will not be used.
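For the first issue, one option is to drop function morphemes after tagging. The sketch below assumes the tagger returns `(morpheme, tag)` pairs using Okt-style tag names (`Josa`, `Eomi`, `Punctuation`); other KoNLPy taggers use different tag sets, so `FUNCTION_TAGS` would need adjusting. The input list is hand-written for illustration, not real tagger output, and `content_morphemes` is a hypothetical helper.

```python
# Okt-style tag names for particles, endings, and punctuation (assumed tag set).
FUNCTION_TAGS = {"Josa", "Eomi", "Punctuation"}


def content_morphemes(tagged):
    """Keep only the (morpheme, tag) pairs whose tag is not a function tag."""
    return [(morpheme, tag) for morpheme, tag in tagged if tag not in FUNCTION_TAGS]


# Hand-written example in the shape of a POS tagger's output.
tagged = [("총선", "Noun"), ("의", "Josa"), ("결과", "Noun"), ("는", "Josa")]
kept = content_morphemes(tagged)
```

Whether particles should actually be dropped depends on how rules are matched; this only shows the filtering step.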

Done

2016.05.12.

Finished parsing the input sentence used to create a rule.

2016.04.09.

Used a dummy text file
Parsed each line into sentences using multithreading (2 threads)
~~Parsed each sentence into morphemes using multithreading (2 threads)~~ Doesn't work.
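A minimal sketch of the line-to-sentence step with two threads, using only the standard library. The splitting rule (break after `.`, `!` or `?` followed by whitespace) is a naive assumption, not the rule the project uses, and `split_sentences` / `split_lines` are hypothetical names.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Split on whitespace that follows ., ! or ? (naive rule, for illustration only).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(line):
    """Split one line into sentences, dropping empty pieces."""
    return [s for s in _SENTENCE_END.split(line.strip()) if s]


def split_lines(lines, workers=2):
    """Split every line into sentences using a small thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input lines.
        per_line = pool.map(split_sentences, lines)
        return [sentence for sentences in per_line for sentence in sentences]


sentences = split_lines(["오늘은 선거일이다. 투표했다!", "Hello world. Bye?"])
```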

REST API

Todo

API	Description	CRUD
(rulesets) PUT /rulesets/{topic_id}/{ruleset_seq}/{new_name}	Change the name of the ruleset.	U rulesets

Done

Removed

API	Description	CRUD
(topics) GET /topics/	Get all topics from the database.	R topics
(sources) GET /sources/	Get all sources from the database.	R sources
(rulesets) POST /rulesets/{topic_id}/~~{category_seq}/~~{name}	Create a new ruleset, which is a package of rules.	C rulesets
(rulesets) GET /rulesets/{topic_id}	Get all the rulesets from the database.	R rulesets
(rulesets) DELETE /rulesets/{topic_id}/{category_seq}	Delete the ruleset and its related rules.	D rulesets rules rule_word_relations
(words) POST /words/{fulltext}	Parse the {fulltext} into morphemes. Store any unregistered morphemes in the words table. Return the morphemes of the fulltext after parsing.	C words
(rules) POST /rules/{topic_id}/{category_seq}/{fulltext}/{word_ids}	Create an actual rule, combination of words.	C rules rule_word_relations
(rules) GET /rules/{topic_id}/{ruleset_seq}	Get the rules of the selected ruleset.	R rules (rule_word_relations?)
(rules) GET /rules/{rule_id}	Get specific rule.	R rules rule_word_relations
(rules) PUT /rules/{rule_id}/{word_ids}	Change the rule, combination of words.	U rule_word_relations
(rules) DELETE /rules/{rule_id}	Delete the rule, both its fulltext and its combination of words.	D rules rule_word_relations
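Because `{fulltext}` travels in the URL path, a client has to percent-encode Korean text, spaces, and slashes before calling `POST /words/{fulltext}`. The sketch below shows one way to build such a URL with the standard library; `words_url` and the base address are hypothetical, and the server is assumed to decode the path segment as UTF-8.

```python
from urllib.parse import quote, unquote


def words_url(base_url, fulltext):
    """Build the /words/{fulltext} URL with the text percent-encoded as one path segment."""
    # safe="" also encodes "/" so the text cannot be mistaken for extra path segments.
    return f"{base_url.rstrip('/')}/words/{quote(fulltext, safe='')}"


url = words_url("http://localhost:5000/", "총선 결과/전망")
segment = url.rsplit("/", 1)[1]  # the encoded {fulltext} part
```

Decoding `segment` with `unquote` gives back the original text, which is what the parser on the server side would receive.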

Issue

See Database - Issue 2.

Done

Database

Todo

Issue

Some emojis are not saved properly; some are stored as '?????'.
How about creating a 'querys' table to store the queries used to crawl the posts (e.g. 총선)? Posts could then be categorized, and users could analyze only the posts they are interested in. If we only ever analyze all of the recent posts, this would be redundant data, but it is still a good option in terms of extensibility. Note: there is already a topics table.
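On the emoji issue: a common cause of this symptom is a MySQL column or connection using the legacy 3-byte `utf8` charset, since most emojis need 4 bytes in UTF-8 and only `utf8mb4` can hold them. Whether that is the cause here is an assumption. The hypothetical helper below flags text that would need `utf8mb4`, which can help confirm the diagnosis against the affected posts.

```python
def needs_utf8mb4(text):
    """Return True if any character lies above U+FFFF, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


emoji_bytes = len("😀".encode("utf-8"))   # byte length of one emoji
hangul_bytes = len("총".encode("utf-8"))   # byte length of one Hangul syllable
```

If the affected posts all return `True`, switching the tables and the client connection to `utf8mb4` would be the usual next step.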

Done

2015.05.12.

Crawled posts are stored in a MySQL database.
Rulesets and rules are stored in a MySQL database.
Redis holds the results of the analysis as key-bitarray maps: the key is a rule_id and the value is a bitarray with 1 at the position of each related sentence_id. If a rule has no related sentences, every bit of its bitarray is 0. The rule_id of an unanalyzed rule is not set in Redis.
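The bitarray layout can be sketched without a Redis server. The class below is an in-memory stand-in (a plain dict, not the project's code) that follows the same bit order as Redis `SETBIT`/`GETBIT`, where offset 0 is the most significant bit of the first byte. It assumes the sentence_id is used directly as the bit offset; `RuleBitmaps` and its method names are hypothetical.

```python
class RuleBitmaps:
    """In-memory model of the rule_id -> bitarray map described above."""

    def __init__(self):
        self._store = {}  # rule_id -> bytes

    def save(self, rule_id, num_sentences, related_ids):
        """Store a bitmap with 1 at each related sentence_id, 0 elsewhere."""
        bits = bytearray((num_sentences + 7) // 8)
        for sid in related_ids:
            bits[sid // 8] |= 0x80 >> (sid % 8)  # MSB-first, like Redis bit offsets
        self._store[rule_id] = bytes(bits)

    def is_analyzed(self, rule_id):
        """A rule counts as analyzed only if its key exists."""
        return rule_id in self._store

    def related_sentences(self, rule_id):
        """Return the sentence_ids whose bit is set for this rule."""
        bits = self._store[rule_id]
        return [i for i in range(len(bits) * 8) if bits[i // 8] & (0x80 >> (i % 8))]


store = RuleBitmaps()
store.save(7, num_sentences=10, related_ids=[0, 3, 9])
store.save(8, num_sentences=10, related_ids=[])  # analyzed, but nothing matched
```

Here rule 8 has a key with an all-zero bitmap, while a rule that was never saved has no key at all, which is how the two cases are told apart.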

2016.04.09.

Created the database schema and initialization code.

Front-end

Todo

Issue

Done

YunseokJANG/Sentences-analysis
