WIP: Integrate Document Categorization to Frequency Analysis #68

Open: wants to merge 72 commits into base branch master.

Commits (72), oldest to newest:
1e5c149  created the spring log for the documentation part of our tasks (solisa986, Mar 30, 2021)
3f5ac7b  finished the spring log for issue#51 (solisa986, Mar 30, 2021)
2a5a4e4  Writing word frequencies to csv (Mar 31, 2021)
3233a42  Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner… (Mar 31, 2021)
225ce2f  Putting different run's results into separate files (Mar 31, 2021)
446215a  Update textmining.py (hadenwIV, Mar 31, 2021)
b638b98  Categorization of words (Mar 31, 2021)
902e704  Additional elaboration on functions of tasks completed (Mar 31, 2021)
a065da8  Fixed name spelling (Mar 31, 2021)
c208849  Added docstrings (Mar 31, 2021)
9f9733b  moving all of our code files to a folder called categorize_words (donizk, Apr 1, 2021)
25abe2e  created interface file, began implementation for interface (donizk, Apr 1, 2021)
b816c40  added notes (as comments) to myself onto the __main__.py file to keep… (donizk, Apr 1, 2021)
60c36e2  added some test cases (solisa986, Apr 5, 2021)
9455433  classifying categories of files inputted (Apr 5, 2021)
1b63ffd  Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner… (Apr 5, 2021)
f994779  Sorting assignment categories (Apr 5, 2021)
48f9fae  finished documenting sprint 2 log and moved the categories_words.py file (solisa986, Apr 5, 2021)
013b9db  formatting (solisa986, Apr 5, 2021)
8b5347d  Merge branch 'issue#51' of https://github.com/Allegheny-Ethical-CS/Ga… (hadenwIV, Apr 6, 2021)
b95ebee  Revert "Merge branch 'issue#51' of https://github.com/Allegheny-Ethic… (enpuyou, Apr 6, 2021)
2e79165  Word categorization program (Apr 7, 2021)
56c8689  Word categorization using training data and Scikit (Apr 7, 2021)
ccd2cba  Start of the interface pipeline (Apr 7, 2021)
85c9ce6  Beginning of interface page to for category frequency analysis (Apr 7, 2021)
5dec4f7  Removed category classification model training data (Apr 7, 2021)
5a9168e  Merge branch 'master' into issue#51 (enpuyou, Apr 10, 2021)
eb71d6e  Development on categorization (Apr 14, 2021)
4594407  Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner… (Apr 14, 2021)
99e2dd1  Removed sample_md_reflections training data (Apr 14, 2021)
c4d689b  Readded existing sample_md_reflections (Apr 15, 2021)
598e3c1  Restored original sample_md_reflections (Apr 15, 2021)
3f47837  fixing (favourojo, Apr 15, 2021)
cee57d9  Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner… (favourojo, Apr 15, 2021)
16bdf5d  fixed (favourojo, Apr 15, 2021)
8ab4693  Category data classification of user input (Apr 19, 2021)
4ba42cc  System for category classification of student responses (Apr 19, 2021)
4f68b49  Replacing Pipfile lock (Apr 19, 2021)
46063bb  Start of visualization interface (Apr 20, 2021)
c9b67b4  Visualization interface with graph of overall categories (Apr 20, 2021)
ba92938  Removed print statements (Apr 20, 2021)
763f72f  finished documentation (favourojo, Apr 21, 2021)
114b815  finished (favourojo, Apr 21, 2021)
7cefb9a  Update test_analyzer.py (hadenwIV, Apr 21, 2021)
19778ba  Merge branch 'issue#51' of https://github.com/Allegheny-Ethical-CS/Ga… (hadenwIV, Apr 21, 2021)
f6f87a7  Delete test_word_cloud.py (hewittk, Apr 21, 2021)
afcd302  Removed irrelevant additions made (Apr 21, 2021)
eccccde  Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner… (Apr 21, 2021)
7d5eb8a  Removed unneccessary line in category_frequency (Apr 21, 2021)
573082f  Fixed test_category_frequency test cases (Apr 21, 2021)
1bcc4a1  Colored barplot broken down by category (Apr 26, 2021)
e71adef  New pipfile.lock copied from master branch with dependencies installed (Apr 26, 2021)
ddba12b  Installed importlib.metadata to pipfile.lock (Apr 26, 2021)
dce826b  Update pipfile to the master branch (enpuyou, Apr 26, 2021)
fca0da9  Added to match master branch (Apr 26, 2021)
16992bd  Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner… (Apr 26, 2021)
f6373a6  Deleted sprint log (Apr 26, 2021)
c337883  Fixed visualization flake8 errors (Apr 26, 2021)
c8886d2  Reset textmining (Apr 26, 2021)
d2afa09  Fixed flake8 issues (Apr 26, 2021)
45efa7f  Deleted no longer used word cloud generator (Apr 26, 2021)
6d4059d  Black reformatting (Apr 26, 2021)
4c8e35e  Specification about bar plot type in docstring (Apr 26, 2021)
3b39e54  Fix flake8 line length errors (Apr 27, 2021)
f36f0e1  Fixed flake8 errors in test_analyzer (Apr 27, 2021)
32cc38b  Removed plots_per_row argument (Apr 27, 2021)
2fa124c  Remove non used dictionaries (Apr 27, 2021)
90949eb  changes made to git-standup (solisa986, Apr 28, 2021)
6332292  Merge branch 'master' into issue#51 (favourojo, Apr 28, 2021)
4748eb8  Merge branch 'master' into issue#51 (corlettim, Apr 29, 2021)
6184a77  Remove git standup folder (May 3, 2021)
f74def8  Fix flake8 spacing errors (May 3, 2021)
1 change: 1 addition & 0 deletions git-standup
Submodule git-standup added at 5a707b
45 changes: 43 additions & 2 deletions src/analyzer.py
@@ -1,14 +1,21 @@
"""Text Preprocessing"""
from collections import Counter

import pickle
from . import markdown as md

from textblob import TextBlob
import pandas as pd

import re
import string
from typing import List, Tuple
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk

from . import markdown as md
nltk.download("wordnet")
nltk.download("stopwords")

PARSER = spacy.load("en_core_web_sm")

@@ -55,7 +62,7 @@ def tokenize(normalized_text: str) -> List[str]:


def compute_frequency(
token_lst: List[str], amount=50
token_lst: List[str], amount=50
) -> List[Tuple[str, int]]: # noqa: E501
"""Compute word frequency from a list of tokens"""
word_freq = Counter(token_lst)
@@ -68,6 +75,40 @@ def word_frequency(text: str, amount=50) -> List[Tuple[str, int]]:
return compute_frequency(tokenize(normalize(text)), amount)


def category_frequency(responses: List[str]) -> dict:
    """A pipeline to normalize, tokenize, and
    find category frequency of raw text"""

    for i in range(len(responses)):
        responses[i] = normalize(responses[i])
    if "" in responses:
        responses.remove("")

    with open("text_classifier", "rb") as training_model:
        model = pickle.load(training_model)

    with open("vectorizer", "rb") as training_vectorizer:
        vectorizer = pickle.load(training_vectorizer)

    category_dict = {
        "Ethics": 0,
        "Professional Skills": 0,
        "Technical Skills": 0
    }

    for element in responses:
        element = vectorizer.transform([element]).toarray()
        label = model.predict(element)[0]
        if label == 0:
            category_dict["Ethics"] += 1
        elif label == 1:
            category_dict["Professional Skills"] += 1
        elif label == 2:
            category_dict["Technical Skills"] += 1

    return category_dict
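The PR ships `text_classifier` and `vectorizer` as opaque pickled binaries, so the training step is not in the diff. Below is a minimal sketch of how such a pair could be produced. The label scheme (0 = Ethics, 1 = Professional Skills, 2 = Technical Skills) and the output file names come from `category_frequency` itself; the choice of estimator (`MultinomialNB`) and the toy training texts are assumptions.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled reflections, one per category in the scheme
# category_frequency expects: 0=Ethics, 1=Professional Skills, 2=Technical Skills.
texts = [
    "we discussed the ethical impact of releasing user data",
    "I practiced communicating with my team and resolving conflicts",
    "I installed Python packages and debugged the test suite",
]
labels = [0, 1, 2]

# Fit the bag-of-words vectorizer and a simple classifier on top of it.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(features, labels)

# Serialize both artifacts under the file names category_frequency loads.
with open("text_classifier", "wb") as f:
    pickle.dump(model, f)
with open("vectorizer", "wb") as f:
    pickle.dump(vectorizer, f)
```

Keeping the vectorizer and the model as a matched pair matters: `category_frequency` must transform new responses with the exact vocabulary the classifier was trained on.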


def dir_frequency(dirname: str, amount=50) -> List[Tuple[str, int]]:
"""A pipeline to normalize, tokenize, and
find word frequency of a directory of raw input file"""
18 changes: 18 additions & 0 deletions src/visualization.py
@@ -62,6 +62,24 @@ def facet_freq_barplot(
return grid


def facet_category_barplot(category_df):
    """facet colored bar plot for category frequencies"""

    base = (
        alt.Chart(category_df)
        .mark_bar()
        .encode(
            x="Student:N",
            y="Frequency:Q",
            color="Category:N",
            order=alt.Order("Category", sort="descending")
        )
        .properties(width=570,)
    ).interactive()

    return base
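`facet_category_barplot` encodes `Student`, `Frequency`, and `Category` columns, so it expects a long-format frame with one row per student-and-category pair. A minimal sketch of such an input (the student names and counts here are made up):

```python
import pandas as pd

# Long-format frame matching the columns facet_category_barplot encodes:
# one row per (student, category) pair with that pair's response count.
simple_df = pd.DataFrame({
    "Student": ["amara", "amara", "amara", "jo", "jo", "jo"],
    "Category": ["Ethics", "Professional Skills", "Technical Skills"] * 2,
    "Frequency": [1, 3, 2, 0, 1, 4],
})
```

Passing a frame shaped like this to `facet_category_barplot` yields one colored bar segment per category, stacked per student.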


def facet_senti_barplot(senti_df, options, column_name, plots_per_row=3):
"""facet bar plot for word frequencies"""
base = (
65 changes: 61 additions & 4 deletions streamlit_web.py
@@ -85,7 +85,12 @@ def main():
interactive()
success_msg.empty()



def readme():

def landing_src():

"""function to load and configure readme source"""

with open("docs/LANDING_PAGE.md") as landing_file:
@@ -99,6 +104,7 @@ def landing_src():

st.markdown(landing_src, unsafe_allow_html=True)


def landing_pg():
"""landing page"""
landing = st.sidebar.selectbox("Welcome", ["Home", "Interactive"])
@@ -169,9 +175,14 @@ def load_model(name):
@st.cache(allow_output_mutation=True, suppress_st_warning=True)
def import_data(data_retreive_method, paths):
"""pipeline to import data from local or aws"""

if data_retreive_method == "Local file system":
json_lst = []

json_lst = []
global main_md_dict
if data_retreive_method == "Path input":

try:
for path in paths:
json_lst.append(md.collect_md(path))
@@ -229,7 +240,7 @@ def df_preprocess(df):
def frequency():
"""main function for frequency analysis"""
freq_type = st.sidebar.selectbox(
"Type of frequency analysis", ["Overall", "Student", "Question"]
"Type of frequency analysis", ["Overall", "Student", "Question", "Category"]
)
if freq_type == "Overall":
freq_range = st.sidebar.slider(
@@ -256,10 +267,15 @@ def frequency():
f"Most frequent words in individual questions in **{assign_text}**"
)
question_freq(freq_range)
elif freq_type == "Category":
st.header(
f"Frequency of responses focused on ethics, technical skills, and professional skills in **{assign_text}**"
)
category_freq()


def overall_freq(freq_range):
"""page fore overall word frequency"""
"""page for overall word frequency"""
plots_range = st.sidebar.slider(
"Select the number of plots per row", 1, 5, value=3
)
@@ -282,7 +298,7 @@ def overall_freq(freq_range):
freq_df, assignments, "assignments", plots_per_row=plots_range
)
)

freq_df.to_csv('frequency_archives/' + str(item) + '.csv')

def student_freq(freq_range):
"""page for individual student's word frequency"""
@@ -326,7 +342,6 @@ def student_freq(freq_range):
)
)


def question_freq(freq_range):
"""page for individual question's word frequency"""
# drop columns with all na
@@ -373,6 +388,48 @@ def question_freq(freq_range):
)
)

def category_freq():
    """page for word category frequency"""

    questions_end = len(main_df.columns) - 3
    question_df = main_df[main_df.columns[1:questions_end]]
    category_df = pd.DataFrame(
        columns=["Ethics", "Professional Skills", "Technical Skills", "Student"]
    )
    simple_df = pd.DataFrame(columns=["Student", "Category"])
    user_responses = []
    categories = {}
    row_number = 0
    id = 0
    ordered_student_ids = []
    ordered_categories = []
    ordered_frequencies = []

    for i, row in question_df.iterrows():
        # add each user's responses to a list to pass in to dataframe
        for col in range(len(question_df.columns)):
            if col == 0:  # append student ID
                id = (str(main_df.iloc[row_number]["reflection by"]))
            else:  # append categories of response
                response = row[col]
                user_responses.append(response)
        row_number += 1
        categories = az.category_frequency(user_responses)
        for element in categories:
            ordered_student_ids.append(id)
            ordered_categories.append(element)
            ordered_frequencies.append(categories[element])
        categories["Student"] = id
        category_df = category_df.append(categories, ignore_index=True)
        user_responses.clear()
    simple_df["Student"] = ordered_student_ids
    simple_df["Category"] = ordered_categories
    simple_df["Frequency"] = ordered_frequencies

    st.altair_chart(
        vis.facet_category_barplot(
            simple_df,
        )
    )
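The bookkeeping in `category_freq` (three parallel `ordered_*` lists plus a manual row counter) can be sketched more compactly as a single long-format construction once the per-student counts exist. The helper name and the sample counts below are hypothetical; only the column names and the shape of `az.category_frequency`'s output come from the code above.

```python
import pandas as pd

def build_category_rows(per_student_counts):
    """Flatten {student_id: {category: count}} into the long-format
    rows that category_freq accumulates by hand."""
    rows = [
        {"Student": student, "Category": category, "Frequency": count}
        for student, counts in per_student_counts.items()
        for category, count in counts.items()
    ]
    return pd.DataFrame(rows, columns=["Student", "Category", "Frequency"])

# Hypothetical per-student output of az.category_frequency:
counts = {
    "amara": {"Ethics": 1, "Professional Skills": 2, "Technical Skills": 0},
    "jo": {"Ethics": 0, "Professional Skills": 1, "Technical Skills": 3},
}
simple_df = build_category_rows(counts)
```

Building the rows in one pass also avoids the risk of the three parallel lists drifting out of sync.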


def sentiment():
"""main function for sentiment analysis"""
20 changes: 20 additions & 0 deletions tests/test_analyzer.py
@@ -161,6 +161,26 @@ def test_tfidf():
assert vector is not None



def test_category_frequency():
    "test that professional skills, technical skills, and ethics are properly \
    classified "
    text = ["One professional skill that I practiced was communicating \
    independently with a team. I did this by atttending all meetings, using \
    Zenhub, and including everyone in the major decision making process. I \
    also practiced the professional skill of resolving conflicts by talking \
    through the conflict with my group members, coming to a resolution, and \
    apologizing for the mishap that I caused."]
    output = az.category_frequency(text)
    print(output)
    assert output["Professional Skills"] == 1

    text = ["One technical skill that I practiced was installing Python \
    packages and integrating these packages with my code."]
    output = az.category_frequency(text)
    print(output)
    assert output["Technical Skills"] == 1

def test_top_polarized_word():
"""Tests if the positive/negative words columns are created"""
df = pd.DataFrame(columns=[cts.TOKEN, cts.POSITIVE, cts.NEGATIVE])
Binary file added text_classifier
Binary file added vectorizer