Using the website http://chakoteya.net/StarTrek/index.html, which contains formatted scripts from all five Star Trek series, this program downloads every episode page into a text file, sanitizes and preprocesses those scripts to extract character names and dialogue, and models the dialogue of the top 100 characters (ranked by lines spoken) as per-speaker word clouds. Word clouds graphically represent the most frequently spoken words in both size and colour, with larger font sizes indicating higher frequencies and darker colours representing denser allocations of words within the text.
- Call this method first to scrape the website for scripts.
- Uses list comprehensions to generate URLs for each series from episode number ranges.
- Uses BeautifulSoup to get the contents of each webpage and writes it to a plain text file (see the sketch below).
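A minimal sketch of this scraping step, assuming the episode pages live at numeric URLs such as NextGen/101.htm; the series paths, episode ranges, and output folder below are illustrative assumptions rather than the site's confirmed layout.

```python
import os

import requests
from bs4 import BeautifulSoup

# Assumed series paths and episode-number ranges (illustrative only;
# the real ranges should be read off chakoteya.net's episode indexes).
SERIES = {
    "TNG": ("NextGen", range(101, 278)),
    "DS9": ("DS9", range(401, 576)),
}
BASE_URL = "http://www.chakoteya.net"


def scrape_scripts(out_dir="data_scripts"):
    """Download each episode page and save its visible text to a .txt file."""
    os.makedirs(out_dir, exist_ok=True)
    for series, (path, episodes) in SERIES.items():
        # List comprehension builds one URL per episode number.
        urls = [f"{BASE_URL}/{path}/{ep}.htm" for ep in episodes]
        for url in urls:
            resp = requests.get(url)
            if resp.status_code != 200:
                continue  # skip gaps in the episode numbering
            soup = BeautifulSoup(resp.text, "html.parser")
            out_name = f"{series}_{url.rsplit('/', 1)[-1]}.txt"
            with open(os.path.join(out_dir, out_name), "w", encoding="utf-8") as f:
                f.write(soup.get_text("\n"))
```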
- Extracts lines of dialogue from the full script, ignoring extraneous non-dialogue text
- Concatenates multiline character dialogue into single lines
- Saves each character's spoken lines into a dictionary where {key: value} pairs take the form {character_name: lines_of_dialogue}.
- Generates the contents of data_char_lines, with one folder per series holding each character's dialogue as a text file (see the sketch below).
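A sketch of this parsing step, under the assumption that each block of dialogue starts with an uppercase speaker name followed by a colon (e.g. "PICARD: Make it so.") and that continuation lines carry no prefix; the regex and folder layout here are illustrative.

```python
import os
import re
from collections import defaultdict

# Assumed format: an uppercase name (possibly with spaces or apostrophes)
# followed by a colon begins each speaker's block of dialogue.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z' ]+?)\s*:\s*(.*)$")


def extract_dialogue(script_text):
    """Return {character_name: [lines_of_dialogue]} for one script."""
    lines_by_char = defaultdict(list)
    current = None
    for raw in script_text.splitlines():
        line = raw.strip()
        match = SPEAKER_RE.match(line)
        if match:
            current = match.group(1).strip()
            lines_by_char[current].append(match.group(2))
        elif current and line:
            # Multiline dialogue: concatenate onto the speaker's last line.
            lines_by_char[current][-1] += " " + line
        else:
            current = None  # a blank line ends the current speaker's block
    return dict(lines_by_char)


def write_char_files(lines_by_char, series, root="data_char_lines"):
    """Append each character's dialogue to <root>/<series>/<NAME>.txt."""
    folder = os.path.join(root, series)
    os.makedirs(folder, exist_ok=True)
    for name, lines in lines_by_char.items():
        with open(os.path.join(folder, f"{name}.txt"), "a", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
```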
- Uses file size to calculate a cutoff point for which character files to keep (see the sketch below)
- A natural cutoff occurs at roughly the top 100 characters
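A sketch of the cutoff step, assuming the per-character files from the previous step are on disk; ranking by file size as a proxy for lines spoken follows the description above, while the folder names are assumptions.

```python
import os
import shutil


def keep_top_characters(root="data_char_lines", dest="char_lines_top_100", top_n=100):
    """Rank character files by size and copy the largest top_n into dest."""
    ranked = []
    for series in os.listdir(root):
        folder = os.path.join(root, series)
        if not os.path.isdir(folder):
            continue
        for fname in os.listdir(folder):
            path = os.path.join(folder, fname)
            ranked.append((os.path.getsize(path), series, fname, path))
    ranked.sort(reverse=True)  # largest files (most dialogue) first

    os.makedirs(dest, exist_ok=True)
    for _, series, fname, path in ranked[:top_n]:
        # Prefix the series so same-named characters from different shows
        # do not overwrite each other.
        shutil.copy(path, os.path.join(dest, f"{series}_{fname}"))
```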
- Uses files in the resources/ folder to build a stopwords list (words to be removed from the analysis), combining the standard NLTK stopwords with a personal, curated list collected through NLP projects in school.
- Removes stopwords and punctuation from the words dictionary.
- Creates a word-frequency dictionary per character (see the sketch below).
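A sketch of the stopword and frequency steps, assuming NLTK's stopword corpus has been downloaded and that the curated list lives in a plain text file (the resources/custom_stopwords.txt filename is an assumption).

```python
import string
from collections import Counter

from nltk.corpus import stopwords  # requires a prior nltk.download("stopwords")


def build_stopwords(custom_path="resources/custom_stopwords.txt"):
    """Combine NLTK's English stopwords with the personal, curated list."""
    words = set(stopwords.words("english"))
    with open(custom_path, encoding="utf-8") as f:
        words.update(line.strip().lower() for line in f if line.strip())
    return words


def word_frequencies(text, stop_words):
    """Lower-case, strip punctuation, drop stopwords, and count the rest."""
    table = str.maketrans("", "", string.punctuation)
    tokens = (word.translate(table) for word in text.lower().split())
    return Counter(word for word in tokens if word and word not in stop_words)
```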
- Generates a word cloud image of the top words for each file listed in char_lines_top_100
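A sketch of the final rendering step using the wordcloud package's generate_from_frequencies API; the image size, colours, and output folder are assumptions, and the frequency counting is inlined so the example stands alone.

```python
import os
import string
from collections import Counter

from wordcloud import WordCloud


def make_word_clouds(src="char_lines_top_100", out_dir="wordclouds", stop_words=frozenset()):
    """Render one word-cloud PNG per character file in src."""
    os.makedirs(out_dir, exist_ok=True)
    table = str.maketrans("", "", string.punctuation)
    for fname in os.listdir(src):
        with open(os.path.join(src, fname), encoding="utf-8") as f:
            tokens = (w.translate(table) for w in f.read().lower().split())
            freqs = Counter(w for w in tokens if w and w not in stop_words)
        if not freqs:
            continue  # nothing left after stopword removal
        cloud = WordCloud(width=800, height=600, background_color="white")
        cloud.generate_from_frequencies(freqs)
        cloud.to_file(os.path.join(out_dir, fname.replace(".txt", ".png")))
```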