Scripts for analyzing text

Based on emotion, sentiment, subjectivity, orientation, and color.

Lexicons Used

The lexicons were parsed and compiled by the script compile_lexicons.py into the file lexicons/lexicons_compiled.csv. The compiled file covers the following categories, with these counts:

Total word count: 14,852
Words with emotion: 4,463 (30.0%)
Words with sentiment: 10,916 (73.5%)
Words with subjectivity: 6,886 (46.4%)
Words with orientation: 2,192 (14.8%)
Words with color: 5,404 (36.4%)

How to analyze text

Download text, e.g. texts/moby_dick.txt

Parse text using gutenberg_text.py <text file> <output text json file> <output chapter json file>, e.g. gutenberg_text.py ../texts/moby_dick.txt ../output/moby_dick_normal.json ../output/moby_dick_chapters.json. This generates an output that looks like this:

{
  "title": "Moby-Dick; or, The Whale",
  "author": "Herman Melville",
  "chapters": [
    {
      "title": "Loomings",
      "text": "discrete words in lowercase separated by spaces with punctuation removed"
    },
    {
      "title": "The Carpet-Bag",
      "text": "discrete words in lowercase separated by spaces with punctuation removed"
    },
    ...
  ]
}
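The "text" field above holds lowercase words separated by single spaces, with punctuation removed. A minimal sketch of that kind of normalization follows. It is not taken from gutenberg_text.py: the function names normalize_text and build_chapter are hypothetical, and the handling of apostrophes, hyphens, and non-ASCII letters is an assumption.

```python
import re


def normalize_text(raw):
    """Lowercase the text, remove punctuation, and collapse whitespace."""
    lowered = raw.lower()
    # Assumption: apostrophes are dropped so "don't" becomes "dont".
    no_apostrophes = re.sub(r"['\u2019]", "", lowered)
    # Assumption: every other character that is not an ASCII letter, digit,
    # or whitespace (including hyphens and accented letters) becomes a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", no_apostrophes)
    # Collapse runs of whitespace into single spaces.
    return " ".join(cleaned.split())


def build_chapter(title, raw_text):
    """Build one entry of the "chapters" list shown above."""
    return {"title": title, "text": normalize_text(raw_text)}


if __name__ == "__main__":
    print(build_chapter("Loomings", "Call me Ishmael. Some years ago--never mind how long precisely!"))
```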

Run get_data.py <json file from previous step> <a path to output csv file>, e.g. get_data.py output/moby_dick_normal.json output/moby_dick_data.csv. This outputs a .csv file in the format:
```
emotion,color,orientation,sentiment,subjectivity,chapter
0,2,1,1,-1,0
...
```
Each row represents a word, and each column holds the index of that word's value for the corresponding category listed in data/categories.json.
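To inspect a file in this format, the rows can be read with the standard csv module and the indices tallied per column. The sketch below is only an illustration: count_indices is a hypothetical helper, and the second data row in the sample is invented for the example. It makes no assumption about what each index means; that mapping lives in data/categories.json.

```python
import csv
import io
from collections import Counter


def count_indices(csv_file, column):
    """Count how often each index appears in one column of the word data."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[int(row[column])] += 1
    return counts


if __name__ == "__main__":
    # The first data row is the one shown above; the second is made up.
    sample = io.StringIO(
        "emotion,color,orientation,sentiment,subjectivity,chapter\n"
        "0,2,1,1,-1,0\n"
        "0,-1,1,0,-1,0\n"
    )
    print(count_indices(sample, "emotion"))
```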

Run analyze_data.py <csv file from previous step> <a path to output json file> <word buffer> <word offset>, e.g. python analyze_data.py output/moby_dick_data.csv output/moby_dick_analysis.json 400 200. This outputs a .json file in the format:

[
 {
   "chapter": 0,
   "emotion": [
     0.500, // anger
     0.250, // fear
     ...
   ],
   "subjectivity": [
     0.600, // weak
     0.150 // strong
   ],
   "sentiment": [
     0.750, // positive
     0.050 // negative
   ],
   "orientation": [
     0.850, // active
     0.450 // passive
   ],
   "color": [
     0.950, // white
     0.001, // black
     ...
   ]
 },
 ...
]

Each item represents a group of words whose size is the word buffer passed to analyze_data.py. The numbers are fractions between 0 and 1 that represent the relative weight of each category value within that group.
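One way to picture the grouping is as a sliding window over the per-word indices. The sketch below is not the logic of analyze_data.py. It assumes that the word buffer is the window size, that the word offset is the step between window starts, and that a weight is the share of words in the window carrying a given index. The names sliding_windows and relative_weights are hypothetical.

```python
def sliding_windows(items, size, offset):
    """Yield windows of up to `size` items, starting every `offset` items."""
    if size <= 0 or offset <= 0:
        raise ValueError("size and offset must be positive")
    if not items:
        return
    # The last window starts at or before len(items) - size; an input
    # shorter than `size` yields a single, shorter window.
    last_start = max(len(items) - size, 0)
    for start in range(0, last_start + 1, offset):
        yield items[start:start + size]


def relative_weights(indices, category_count):
    """Return the share of entries in `indices` that carry each index."""
    total = len(indices)
    counts = [0] * category_count
    for index in indices:
        # Assumption: indices outside 0..category_count-1 add to no weight
        # but still count toward the window size.
        if 0 <= index < category_count:
            counts[index] += 1
    if total == 0:
        return [0.0] * category_count
    return [round(count / total, 3) for count in counts]


if __name__ == "__main__":
    emotion_column = [0, 1, 0, -1, 1, 1]
    for window in sliding_windows(emotion_column, 4, 2):
        print(window, relative_weights(window, 2))
```

With a buffer of 400 and an offset of 200, as in the example command, consecutive windows under these assumptions would overlap by half.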

Optionally, run python report_data.py <analysis json file> <output dir> to write individual .csv files for each category, e.g. python report_data.py output/moby_dick_analysis.json output/moby_dick/
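Splitting the analysis into one table per category could look like the sketch below. It works from the JSON format shown above and is not the code of report_data.py. The helper names category_rows and write_category_csv are hypothetical, and the row layout (chapter first, then the weights, with no header row) is an assumption.

```python
import csv
import io


def category_rows(analysis, category):
    """Return one row per group: the chapter index, then that category's weights."""
    return [[item["chapter"]] + list(item[category]) for item in analysis]


def write_category_csv(analysis, category, out_file):
    """Write the rows for one category to an open text file."""
    csv.writer(out_file).writerows(category_rows(analysis, category))


if __name__ == "__main__":
    # Values copied from the sample item above.
    analysis = [{"chapter": 0, "sentiment": [0.750, 0.050], "subjectivity": [0.600, 0.150]}]
    buffer = io.StringIO()
    write_category_csv(analysis, "sentiment", buffer)
    print(buffer.getvalue())
```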

beefoo/text-analysis
