-
Notifications
You must be signed in to change notification settings - Fork 0
onidzelskyi/clubnika_chat_grabber
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
--- title: "Clubnika" author: "Oleksii" date: "July 29, 2015" Updated: "October 31, 2015" --- ## Prerequests sudo apt-get install python-pip python-dev python-lxml sudo pip install scrapy w3lib cssselect selenium python-dateutil pyvirtualdisplay chromedriver https://sites.google.com/a/chromium.org/chromedriver/home sudo apt-get install unzip wget http://chromedriver.storage.googleapis.com/2.20/chromedriver_linux64.zip unzip chromedriver_linux64.zip ## How to use * 1. Grab chat once at day: ```{sh} /usr/bin/python grab_chat.py ``` In results will be: - created (appended) *clubnika.db* db with the chat messages; - created (updated) file *timestamp.txt* with the last message's timestamp - created (updated) file *deep.txt* with the last parsed chat's page * 2. Save corpus to file *corpus.txt* from db: ```{sh} sqlite3 clubnika.db sqlite> .output corpus.txt sqlite> select msg from messages; sqlite> .exit ``` * 2. Convert corpus file to lower case: ```{sh} tr "[:upper:]" "[:lower:]"< Corpus.txt > Corpus_lower.txt ``` * 3. Delete duplicates from corpus file: ```{sh} cat Corpus_lower.txt | sort -u > Corpus_lower_uniq.txt ``` * 4. Remove punctuation: ```{sh} tr "[:punct:]" " "< Corpus_lower_uniq.txt | sort -u > Corpus_lower_uniq_no_punct.txt ``` * 5. Calculate word's frequence: ```{sh} cat Corpus_lower_uniq_no_punct.txt | tr ' ' '\n' | sort | uniq -c | sort -nr | head -20 ``` ## Debug: * 1. Check out messages without phone numbers were correctly placed in file *no_phones.txt* ```{sh} for i in $(cat mobile_codes.txt) ; do grep $i -r no_phones.txt ; done > del.txt ``` * 2. To swich grab_chat.py in debug mode change ```{sh} DEBUG = 0 ``` to ```{sh} DEBUG = 1 ``` ## Raw corpus file: *Corpus.txt* ## Convert Corpus's text to lower case: ```{sh} tr "[:upper:]" "[:lower:]"< Corpus_uniq.txt ``` ## Remove punctuation symbols ```{sh} tr "[:punct:]" " "< Corpus_uniq_lower.txt | sort -u ``` ## Tokenize ```{sh} ```
About
Grab messages from online chat on clubnika.com.ua
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published