Based on emotion, sentiment, subjectivity, orientation, and color.
- NRC Emotion Lexicon
- Bing Liu's Opinion Lexicon
- MPQA Subjectivity Lexicon
- Harvard General Inquirer
- NRC Word-Colour Association Lexicon
These lexicons were parsed by the script compile_lexicons.py and compiled into the file lexicons/lexicons_compiled.csv. The compiled file covers the categories above with the following counts:
- Total word count: 14,852
- Words with emotion: 4,463 (30.0%)
- Words with sentiment: 10,916 (73.5%)
- Words with subjectivity: 6,886 (46.4%)
- Words with orientation: 2,192 (14.8%)
- Words with color: 5,404 (36.4%)
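The counts above can be reproduced with a short coverage check. This is a minimal sketch, not the logic of compile_lexicons.py: it assumes the compiled file has a word column plus one column per category, and that an empty cell means the word has no value for that category. The column names and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed column names; the real header of lexicons_compiled.csv may differ.
CATEGORIES = ["emotion", "sentiment", "subjectivity", "orientation", "color"]


def coverage(rows, categories=CATEGORIES):
    """Return (total rows, {category: rows with a non-empty value})."""
    rows = list(rows)
    counts = {
        name: sum(1 for row in rows if (row.get(name) or "").strip())
        for name in categories
    }
    return len(rows), counts


def format_report(total, counts):
    """Format the counts in the same style as the list above."""
    lines = [f"Total word count: {total:,}"]
    for name, n in counts.items():
        pct = 100.0 * n / total if total else 0.0
        lines.append(f"Words with {name}: {n:,} ({pct:.1f}%)")
    return lines


# Tiny made-up sample standing in for lexicons/lexicons_compiled.csv.
SAMPLE = """word,emotion,sentiment,subjectivity,orientation,color
abandon,fear,negative,weak,,
ocean,,,,,blue
bright,joy,positive,strong,active,white
table,,,,,
"""

total, counts = coverage(csv.DictReader(io.StringIO(SAMPLE)))
report = format_report(total, counts)
print("\n".join(report))
```

To run it against the real file, replace the io.StringIO(SAMPLE) argument with an opened file handle and adjust CATEGORIES to match the actual header.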
- Download a text, e.g. texts/moby_dick.txt
- Parse the text using

      gutenberg_text.py <text file> <output text json file> <output chapter json file>

  e.g.

      gutenberg_text.py ../texts/moby_dick.txt ../output/moby_dick_normal.json ../output/moby_dick_chapters.json

  This generates output that looks like this:

      {
        "title": "Moby-Dick; or, The Whale",
        "author": "Herman Melville",
        "chapters": [
          {
            "title": "Loomings",
            "text": "discrete words in lowercase separated by spaces with punctuation removed"
          },
          {
            "title": "The Carpet-Bag",
            "text": "discrete words in lowercase separated by spaces with punctuation removed"
          },
          ...
        ]
      }
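The "text" field described above (lowercase words, separated by single spaces, punctuation removed) could be produced along these lines. This is a sketch under assumptions, not the code in gutenberg_text.py: the normalize and make_chapter helpers are hypothetical, and replacing punctuation with a space (so that "ago--never" becomes two words) is one choice among several.

```python
import json
import re


def normalize(raw):
    """Lowercase the text, drop punctuation, and collapse whitespace."""
    # Anything that is not a word character or whitespace becomes a space,
    # as does the underscore (which \w would otherwise keep).
    no_punct = re.sub(r"[^\w\s]|_", " ", raw.lower())
    return " ".join(no_punct.split())


def make_chapter(title, raw):
    """Build one entry of the "chapters" list (hypothetical helper)."""
    return {"title": title, "text": normalize(raw)}


chapter = make_chapter(
    "Loomings",
    "Call me Ishmael. Some years ago--never mind how long precisely...",
)
print(json.dumps(chapter, indent=2))
```

Note that this treatment splits contractions ("don't" becomes "don t"); whether the real parser does the same is not stated here.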
- Run

      get_data.py <json file from previous step> <a path to output csv file>

  e.g.

      get_data.py output/moby_dick_normal.json output/moby_dick_data.csv

  This outputs a .csv file in the format:

      emotion,color,orientation,sentiment,subjectivity,chapter
      0,2,1,1,-1,0
      ...

  Each row represents a word, and each category column holds the index of that word's value within the category, as listed in data/categories.json
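The row format above can be illustrated with a small lookup. This is a sketch only: the CATEGORIES and LEXICON contents are made up, the real layout of data/categories.json is not shown in this document, and using -1 for "no value in this category" is an assumption inferred from the sample row.

```python
# Column order taken from the CSV header shown above.
COLUMNS = ["emotion", "color", "orientation", "sentiment", "subjectivity"]

# Made-up stand-in for data/categories.json: category -> ordered values.
CATEGORIES = {
    "emotion": ["anger", "fear", "joy"],
    "color": ["white", "black", "blue"],
    "orientation": ["active", "passive"],
    "sentiment": ["positive", "negative"],
    "subjectivity": ["weak", "strong"],
}

# Made-up stand-in for the compiled lexicon: word -> {category: value}.
LEXICON = {
    "whale": {"emotion": "fear", "color": "white"},
    "sail": {"orientation": "active", "sentiment": "positive"},
}


def word_to_row(word, chapter, lexicon=LEXICON, categories=CATEGORIES):
    """Return one CSV row: a value index per category, then the chapter."""
    entry = lexicon.get(word, {})
    row = []
    for column in COLUMNS:
        value = entry.get(column)
        values = categories[column]
        # Assumption: -1 marks a word with no value for this category.
        row.append(values.index(value) if value in values else -1)
    row.append(chapter)
    return row


rows = [word_to_row(w, 0) for w in "the whale set sail".split()]
for row in rows:
    print(",".join(str(cell) for cell in row))
```

In this sketch words missing from the lexicon still produce a row of -1 values; the real script may skip them instead.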
- Run

      analyze_data.py <csv file from previous step> <a path to output json file> <word buffer> <word offset>

  e.g.

      python analyze_data.py output/moby_dick_data.csv output/moby_dick_analysis.json 400 200

  This outputs a .json file in the format:

      [
        {
          "chapter": 0,
          "emotion": [
            0.500, // anger
            0.250, // fear
            ...
          ],
          "subjectivity": [
            0.600, // weak
            0.150 // strong
          ],
          "sentiment": [
            0.750, // positive
            0.050 // negative
          ],
          "orientation": [
            0.850, // active
            0.450 // passive
          ],
          "color": [
            0.950, // white
            0.001, // black
            ...
          ]
        },
        ...
      ]

  Each item represents a group of words whose size is set by the <word buffer> argument. The numbers are fractions between 0 and 1 that represent the relative weight of that particular category value.
- Optionally, run

      python report_data.py <analysis json file> <output dir>

  to write individual .csv files for each category, e.g.

      python report_data.py output/moby_dick_analysis.json output/moby_dick/
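The grouping done in the analysis step can be pictured as a sliding window over the per-word rows. The sketch below is one plausible reading, not the code in analyze_data.py: it assumes <word buffer> is the window size, <word offset> is the distance between window starts, -1 means "no value", and a weight is the share of words in the window that carry a given value. The actual normalization may differ.

```python
def windows(items, size, step):
    """Yield consecutive windows of `size` items, starting every `step` items."""
    last_start = max(len(items) - size, 0)
    for start in range(0, last_start + 1, step):
        yield items[start:start + size]


def value_weights(indices, n_values):
    """Share of entries in `indices` that point at each value (0..n_values-1)."""
    counts = [0] * n_values
    for index in indices:
        if index >= 0:  # assumption: -1 marks "no value" and is not counted
            counts[index] += 1
    total = len(indices)
    return [count / total if total else 0.0 for count in counts]


def analyze(indices, n_values, size, step):
    """Compute one weight list per window for a single category column."""
    return [value_weights(w, n_values) for w in windows(indices, size, step)]


# Made-up sentiment column: 0 = positive, 1 = negative, -1 = no value.
sentiment = [0, 0, 1, -1, 1, 1, -1, 0]
result = analyze(sentiment, n_values=2, size=4, step=2)
for weights in result:
    print(weights)
```

With a buffer of 400 and an offset of 200, as in the example command, consecutive windows under this reading would overlap by half.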