NLP-Portfolio

This is my portfolio for the CS 4365.001 Human Language Technologies course taught by Dr. Mazidi at UTDallas. These projects were completed for the Spring 2023 semester.

Over the course of these projects I developed my technical skills, including Python programming and the use of libraries such as Keras, Scikit-Learn, and NLTK. I also developed soft skills in reading research articles, presenting my work to an audience, and collaborative coding. Outside of this course, my other work has been in research and Unity app development. You can view my resume here.

The field of NLP is still growing rapidly, as seen in the development of projects such as ChatGPT. I now have an understanding of basic NLP techniques and concepts such as tokenization, lemmatization, parsing, and n-grams, as well as experience with machine learning algorithms for text classification. In the future, I would like to learn more about recent ML models, especially transformers, since they have become the new state of the art. Aside from the latest trends, I would also like to learn more about information extraction from natural language, as that would bridge the gap between the way computers and humans think.

Overview of Natural Language Processing

Here are my responses to some more prompts about natural language processing.

Basic Text Processing

This is a basic text processor that takes a data file of employee information and formats it. The employee data is also saved as a pickle file. The code for the project can be viewed here.

How To Run

Use Python3 to run the homework1_htk180000.py file in the text_processing folder and provide a path to the data file. For example:

python homework1_htk180000.py data/data.csv

The data file should contain entries where each line is

Last,First,Middle Initial,ID,Office phone

with the first line being the headers.

Reflection

One of the strengths of Python for a text processing task like this is its built-in string methods, such as title(), lower(), and upper(), which make edits to capitalization easy. Reading each entry of the data is also fairly simple with the way Python can iterate over all the lines of a file. On the other hand, using classes and objects is not as secure in Python as in other languages due to the lack of access modifiers, which could cause issues for larger projects.
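
For illustration, a minimal sketch of that style of cleanup (the field handling here is simplified and is not the assignment's exact logic):

```python
def format_name(last: str, first: str, middle: str) -> str:
    """Normalize capitalization, e.g. 'sMITH', 'john', 'q' -> 'John Q Smith'."""
    return f"{first.title()} {middle.upper()} {last.title()}"

# A file object can be iterated line by line
with open("data/data.csv") as f:
    next(f)  # skip the header row
    for line in f:
        last, first, middle, emp_id, phone = line.strip().split(",")
        print(format_name(last, first, middle))
```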

While completing this assignment, I learned about saving objects as pickle files and using regular expressions in Python. Specifically, I found the fullmatch() method useful for validating the format of strings. I also reviewed the use of sys.argv for taking command line arguments as input to a Python program.
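
A short sketch of those three pieces together (the phone number format shown is an assumption about the data, not the assignment's actual validation rule):

```python
import pickle
import re
import sys

# sys.argv holds the command line arguments; argv[0] is the script name
if len(sys.argv) < 2:
    sys.exit("Usage: python script.py <path to data file>")
data_path = sys.argv[1]

# fullmatch() only succeeds if the entire string matches the pattern
phone_pattern = re.compile(r"\d{3}-\d{3}-\d{4}")
print(bool(phone_pattern.fullmatch("972-555-1234")))        # True
print(bool(phone_pattern.fullmatch("972-555-1234 ext 5")))  # False

# pickle serializes Python objects to a binary file and reads them back
employees = {"TX123": ("John", "Q", "Smith")}
with open("employees.p", "wb") as f:
    pickle.dump(employees, f)
with open("employees.p", "rb") as f:
    restored = pickle.load(f)
```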

Word Guess Game

This program takes a text file, calculates its lexical diversity, tokenizes it, and plays a word guessing game based on the most common nouns in the document. The code for the project can be viewed here.
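
A minimal sketch of the two core computations, assuming NLTK's default tokenizer and POS tagger (the exact token filtering rules in the assignment may differ):

```python
from collections import Counter
import nltk

with open("data.txt") as f:
    text = f.read()

# Keep only alphabetic tokens, lowercased
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Lexical diversity: unique tokens divided by total tokens
diversity = len(set(tokens)) / len(tokens)
print(f"Lexical diversity: {diversity:.2f}")

# POS-tag the tokens, keep nouns (tags starting with 'NN'), rank by frequency
tagged = nltk.pos_tag(tokens)
nouns = [word for word, tag in tagged if tag.startswith("NN")]
most_common = Counter(nouns).most_common(50)
```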

How To Run

Use Python3 to run the homework2_htk180000.py file in the word_guess_game folder and provide a path to the data file. For example:

python homework2_htk180000.py data.txt

Note: this program uses the NLTK library and requires that nltk.download() have already been used to download corpora data. I used nltk.download('all') to get all available data.
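
The download only needs to happen once per environment; from a Python shell:

```python
import nltk
nltk.download("all")  # or download specific packages, e.g. nltk.download("punkt")
```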

WordNet

This Python notebook contains an exploration of WordNet's capabilities. There is also a PDF version of it.

N-Gram Language Classification

This project was completed in partnership with @Hikaito. This program classifies a text's language as English, French, or Italian based on unigram and bigram frequencies learned from a training corpus for each language. This program is divided into two parts that can be viewed here. A discussion of n-grams can be viewed here.

Part 1

The first part of this program reads from three language training files and creates pickle files of the counts of each unigram and bigram. The program lowercases all tokens before calculations take place and replaces all Arabic-numeral integers with the token "NUM".
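
A sketch of that counting step using NLTK's word_tokenize and ngrams helpers (the training filename and pickle names here are assumptions based on the LangId naming in the run instructions):

```python
import pickle
import re
from collections import Counter
from nltk import word_tokenize, ngrams

def build_counts(path):
    """Count unigrams and bigrams in one language's training file."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()            # lowercase before anything else
    text = re.sub(r"\d+", "NUM", text)     # replace integer numerals with NUM
    tokens = word_tokenize(text)
    return Counter(tokens), Counter(ngrams(tokens, 2))

unigrams, bigrams = build_counts("LangId.train.English")
with open("english_unigrams.p", "wb") as f:
    pickle.dump(dict(unigrams), f)
with open("english_bigrams.p", "wb") as f:
    pickle.dump(dict(bigrams), f)
```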

Part 2

The second part of this program predicts the language of a sample text by calculating and comparing the log probabilities that it was generated by each of the languages given the corpus. The output is stored in predictions.txt.
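
The core of the comparison is a sum of per-bigram log probabilities. Here is a sketch assuming add-one (Laplace) smoothing, which is my assumption rather than the project's documented choice:

```python
import math
from nltk import word_tokenize, ngrams

def log_prob(tokens, unigrams, bigrams, vocab_size):
    """Log probability of a token sequence under a bigram model with add-one smoothing."""
    total = 0.0
    for first, second in ngrams(tokens, 2):
        b = bigrams.get((first, second), 0)
        u = unigrams.get(first, 0)
        total += math.log((b + 1) / (u + vocab_size))
    return total

def predict(line, models, vocab_size):
    """Pick the language whose model assigns the highest log probability.

    models maps a language name to its (unigrams, bigrams) count dicts.
    """
    tokens = word_tokenize(line.lower())
    return max(models, key=lambda lang: log_prob(tokens, *models[lang], vocab_size))
```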

How to Run

Download all the LangId files and save them in the same directory as the Python scripts. Also make sure the NLTK library is installed in the environment. Then run the scripts in order.

python homework3_jef180001_htk180000_part1.py

python homework3_jef180001_htk180000_part2.py

Sentence Parsing

An overview of phrase structure grammar parsing, dependency parsing, and semantic role label parsing can be viewed here. These examples were done with AllenNLP and Stanford's CoreNLP.

Web Crawler

This project was completed in partnership with @Hikaito. This program scrapes the web for articles about the Titanic starting with its Wikipedia page. The paragraph sections of these webpages are scraped for keywords related to the Titanic and a small SQLite database is created to tag sentences by the keywords they contain. The code can be viewed here. A report about the knowledge base and a potential chatbot that could use the knowledge base can be viewed here.
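
A hedged sketch of the core scrape-and-tag loop (the keyword list and table schema are illustrative; see the linked code for the real implementation):

```python
import sqlite3
import requests
from bs4 import BeautifulSoup
from nltk import sent_tokenize

KEYWORDS = ["titanic", "iceberg", "lifeboat"]  # illustrative subset

conn = sqlite3.connect("titanic.db")
conn.execute("CREATE TABLE IF NOT EXISTS facts (sentence TEXT, keyword TEXT)")

resp = requests.get("https://en.wikipedia.org/wiki/Titanic")
soup = BeautifulSoup(resp.text, "html.parser")

# Pull text only from paragraph tags, then tag sentences by the keywords they contain
text = " ".join(p.get_text() for p in soup.find_all("p"))
for sentence in sent_tokenize(text):
    for kw in KEYWORDS:
        if kw in sentence.lower():
            conn.execute("INSERT INTO facts VALUES (?, ?)", (sentence, kw))
conn.commit()
```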

How to Run

Install the NLTK and BeautifulSoup4 libraries and download the stopwords corpus for NLTK. Then run the script:

python web_crawler_jef180001_htk180000.py

Text Classification

This notebook was created on Kaggle and applies Naive Bayes, Logistic Regression, and basic neural network classifiers from scikit-learn to the Physics vs Chemistry vs Biology dataset of Reddit comments, classifying each comment by its subject. A PDF version of the notebook can be viewed here.
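
A minimal sketch of one of those classifiers on TF-IDF features (the file name, column names, and split parameters are assumptions about the Kaggle dataset's layout, not the notebook's actual values):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("train.csv")  # assumed columns: "comment" text, "topic" label
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["topic"], test_size=0.2, random_state=42)

# TF-IDF turns raw comments into weighted term-frequency vectors
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

nb = MultinomialNB()
nb.fit(X_train_vec, y_train)
print(accuracy_score(y_test, nb.predict(X_test_vec)))
```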

Text Classification 2

This notebook was created on Kaggle and applies deep learning approaches such as CNNs and RNNs to classify headlines from the Clickbait Database as clickbait or not clickbait. A PDF version of the notebook can be viewed here.
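
A sketch of the RNN variant in Keras (the vocabulary size and layer sizes here are illustrative defaults, not the notebook's tuned values):

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10000  # illustrative vocabulary size

# Embedding turns token IDs into dense vectors; the LSTM reads them in sequence
model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),  # clickbait vs. not clickbait
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.1, epochs=5)
```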

ACL Anthology Paper Summary

This summary of the paper GPT-D: Inducing Dementia-related Linguistic Anomalies by Deliberate Degradation of Artificial Neural Language Models from the ACL Anthology was completed in partnership with @Hikaito.

ACL Anthology Chatbot

This project was completed in partnership with @Hikaito. This chatbot converses with the user about works in the ACL Anthology. The user can ask about authors or papers published in the anthology, and the bot will scrape the associated page to tell the user information such as the publishing date, coauthors, or paper abstracts. The bot also downloads PDFs of the scraped papers and tracks which papers and authors it has mentioned to the user.

The code can be viewed here. A report about the chatbot can be viewed here.
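
For illustration, a sketch of two of the bot's behaviors, downloading a paper PDF and remembering what it has already mentioned. The URL pattern follows ACL Anthology's public layout, and the helper names are hypothetical, not taken from the actual scripts:

```python
import requests

mentioned_papers = set()  # track what the bot has already told the user about

def download_pdf(anthology_id: str) -> str:
    """Download a paper PDF by its ACL Anthology ID, e.g. '2022.acl-long.1'."""
    url = f"https://aclanthology.org/{anthology_id}.pdf"
    path = f"{anthology_id}.pdf"
    resp = requests.get(url)
    resp.raise_for_status()
    with open(path, "wb") as f:
        f.write(resp.content)
    return path

def mention(anthology_id: str) -> bool:
    """Return True the first time a paper comes up, so the bot can avoid repeats."""
    if anthology_id in mentioned_papers:
        return False
    mentioned_papers.add(anthology_id)
    return True
```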

How to Run

Make sure the following files and folders are in the same directory:

  • chatterbot-corpus-master-edit folder
    • Contains the edited chatterbot corpus files
  • new-corpus folder
    • Contains the task-specific chatterbot files
  • jef180001_htk180000_backend.py script
  • jef180001_htk180000_baxter.py script
  • jef180001_htk180000_baxter_conversation.py script
  • jef180001_htk180000_chatbot.py script
  • jef180001_htk180000_database.py script
  • jef180001_htk180000_webscraping.py script

Download and install all the libraries in the requirements.txt file:

python -m pip install -r requirements.txt --no-deps

Run the jef180001_htk180000_chatbot.py script:

python jef180001_htk180000_chatbot.py
