---
title: "Clubnika"
author: "Oleksii"
date: "July 29, 2015"
Updated: "October 31, 2015"
---

## Prerequisites

    sudo apt-get install python-pip python-dev python-lxml
    sudo pip install scrapy w3lib cssselect selenium python-dateutil pyvirtualdisplay

    # chromedriver (see https://sites.google.com/a/chromium.org/chromedriver/home)
    sudo apt-get install unzip
    wget http://chromedriver.storage.googleapis.com/2.20/chromedriver_linux64.zip
    unzip chromedriver_linux64.zip
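
To confirm that the installation worked, a small check along the following lines may help. It is only a sketch and makes several assumptions: it targets Python 3, the helper name `missing_prerequisites` is made up for illustration, the module list mirrors the packages installed above (`python-dateutil` is imported as `dateutil`), and `chromedriver` is expected to have been moved to a directory on `PATH` after unzipping.

```python
import importlib.util
import shutil

# Import names of the packages installed above (assumed list, adjust as needed).
REQUIRED_MODULES = ["scrapy", "w3lib", "cssselect", "selenium",
                    "dateutil", "pyvirtualdisplay", "lxml"]


def missing_prerequisites(modules=REQUIRED_MODULES, driver="chromedriver"):
    """Return the names of modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    # shutil.which searches PATH only, not the current directory.
    if shutil.which(driver) is None:
        missing.append(driver)
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("All prerequisites found.")
```

An empty result means every module can be imported and the driver executable is reachable; it does not verify that the driver version matches the installed browser.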
    

## How to use
* 1. Grab the chat once a day:
```{sh}
/usr/bin/python grab_chat.py
```
As a result:
  - the *clubnika.db* database is created (or appended to) with the chat messages;
  - the file *timestamp.txt* is created (or updated) with the timestamp of the last message;
  - the file *deep.txt* is created (or updated) with the last parsed chat page.

* 2. Save the corpus from the db to the file *Corpus.txt*:
```{sh}
sqlite3 clubnika.db
sqlite> .output Corpus.txt
sqlite> select msg from messages;
sqlite> .exit
```

* 3. Convert the corpus file to lower case:
```{sh}
tr "[:upper:]" "[:lower:]" < Corpus.txt > Corpus_lower.txt
```

* 4. Delete duplicates from the corpus file:
```{sh}
sort -u Corpus_lower.txt > Corpus_lower_uniq.txt
```

* 5. Remove punctuation:
```{sh}
tr "[:punct:]" " " < Corpus_lower_uniq.txt | sort -u > Corpus_lower_uniq_no_punct.txt
```

* 6. Calculate word frequencies:
```{sh}
# -s squeezes runs of separators so that empty tokens are not counted
tr -s ' ' '\n' < Corpus_lower_uniq_no_punct.txt | sort | uniq -c | sort -nr | head -20
```
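
The corpus steps above can also be sketched in Python, which may be convenient because GNU `tr` is byte-oriented and may leave non-ASCII letters (for example Cyrillic) unchanged, while `str.lower()` handles them. This is a minimal sketch, not the project's own code: the function names are hypothetical, only the `messages` table and its `msg` column come from the query above, and only ASCII punctuation is replaced.

```python
import sqlite3
import string
from collections import Counter

# Maps every ASCII punctuation character to a space, similar to tr "[:punct:]" " ".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def load_messages(db_path="clubnika.db"):
    """Read all non-empty message texts from the messages table."""
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("SELECT msg FROM messages") if row[0]]
    finally:
        conn.close()


def normalize(messages):
    """Lower-case, replace punctuation with spaces, and drop duplicates."""
    cleaned = {
        " ".join(m.lower().translate(_PUNCT_TO_SPACE).split())
        for m in messages
    }
    cleaned.discard("")  # messages that consisted only of punctuation
    return sorted(cleaned)


def word_frequencies(lines, top=20):
    """Count whitespace-separated words and return the most common ones."""
    counts = Counter(word for line in lines for word in line.split())
    return counts.most_common(top)


if __name__ == "__main__":
    for word, count in word_frequencies(normalize(load_messages())):
        print(count, word)
```

Unlike the shell pipeline, this sketch collapses repeated whitespace and deduplicates once, after punctuation has been removed.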


## Debug:
* 1. Check that messages without phone numbers were correctly placed in the file *no_phones.txt*:
```{sh}
# Any line written to del.txt still contains a mobile code and was probably misplaced.
for i in $(cat mobile_codes.txt); do grep -- "$i" no_phones.txt; done > del.txt
```
* 2. To switch grab_chat.py to debug mode, change
```{python}
DEBUG = 0
```
to
```{python}
DEBUG = 1
```
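
The *no_phones.txt* check from the first debug step can be approximated in Python as well. This is a sketch under assumptions: *mobile_codes.txt* is taken to hold whitespace-separated codes (as the shell loop implies), the function names are hypothetical, and codes are matched as plain substrings rather than as `grep` patterns, so each suspicious line is reported once.

```python
def lines_with_codes(codes, lines):
    """Return the lines that contain at least one of the codes as a substring."""
    codes = [code for code in codes if code]
    return [line for line in lines if any(code in line for code in codes)]


def check_no_phones(codes_path="mobile_codes.txt", messages_path="no_phones.txt"):
    """Load both files and return the lines of messages_path that contain a code."""
    with open(codes_path, encoding="utf-8") as f:
        codes = f.read().split()
    with open(messages_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return lines_with_codes(codes, lines)


if __name__ == "__main__":
    for line in check_no_phones():
        print(line)
```

An empty result suggests that no message in *no_phones.txt* contains a known mobile code.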
## Raw corpus file:  
*Corpus.txt*

## Convert the corpus text to lower case:
```{sh}
tr "[:upper:]" "[:lower:]" < Corpus_uniq.txt > Corpus_uniq_lower.txt
```

## Remove punctuation symbols
```{sh}
tr "[:punct:]" " " < Corpus_uniq_lower.txt | sort -u
```

## Tokenize
```{sh}

```
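
One possible way to tokenize the corpus is sketched below in Python. It is an assumption, not the project's method: the function name `tokenize` is hypothetical, and a token is taken to be any run of word characters, which in Python 3 includes Unicode letters, digits and the underscore.

```python
import re

# \w matches Unicode letters and digits in Python 3, so Cyrillic words stay whole.
_TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lower-cased word tokens, dropping punctuation and whitespace."""
    return _TOKEN.findall(text.lower())


if __name__ == "__main__":
    print(tokenize("Привет, мир! Hello-world 123"))
```

Lower-casing and punctuation removal happen in the same pass here, so the function can be applied directly to raw messages.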

About

Grab messages from online chat on clubnika.com.ua
