Author - Gautham Satyanarayana
Email - [email protected]
UIN - 659368048
As part of UIC-CS441 Engineering Distributed Objects for Cloud Computing, this project demonstrates how to train an LLM. For the first homework, we build a MapReduce pipeline to tokenize and encode text, compute word embeddings, and find similar words based on those embeddings. The implementation targets a cluster run on AWS EMR, where a Word2Vec model is trained independently in each mapper and the embeddings for each token are averaged in the reducer.
Video Link : https://youtu.be/s3oJi9uA6e8
- Dataset - Books from Project Gutenberg
- Preprocessing Strategy - Remove punctuation and numbers and force to lowercase (see the sketch below)
- Sharding Strategy - Shard the corpus into lines of 100,000 words each
- Tokenization Strategy - Split each line into words
- Encoding and Training - JTokkit and Deeplearning4j
- Testing Framework - MUnit
- Config - Typesafe Config
- Logging - SLF4J
- MacOSX ARM64
- IntelliJ IDEA 2024.2.1
- Hadoop 3.3.6
- Scala v2.13.12
- SBT v1.10.2
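For reference, a minimal sketch of the cleaning and sharding steps described above. The object and method names (`Preprocess`, `clean`, `shard`) are illustrative, not the project's actual helpers:

```scala
import scala.io.Source
import java.io.PrintWriter

object Preprocess {
  // Lowercase a line and strip punctuation and digits, keeping only letters and spaces.
  def clean(line: String): String =
    line.toLowerCase.replaceAll("[^a-z\\s]", " ").replaceAll("\\s+", " ").trim

  // Re-wrap a cleaned corpus into lines of `wordsPerLine` words each, so every
  // Hadoop input record carries a fixed-size chunk of text.
  def shard(inputPath: String, outputPath: String, wordsPerLine: Int = 100000): Unit = {
    val source = Source.fromFile(inputPath)
    val out    = new PrintWriter(outputPath)
    try {
      val words = source.getLines().map(clean).flatMap(_.split(" ")).filter(_.nonEmpty)
      words.grouped(wordsPerLine).foreach(group => out.println(group.mkString(" ")))
    } finally {
      out.close()
      source.close()
    }
  }
}
```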
- WordCount Job
- Mapper
- Clean the sentence: remove punctuation, numbers, and trailing whitespace, and force it to lowercase
- Split the sentence on spaces and, for each token, emit the word, its byte pair encoding, and a count of 1 as `<Text, IntWritable>`, i.e. `<(word, encoding), 1>` (a sketch of the mapper and reducer follows this job description)
- Reducer
- For each `<key: Text, values: Iterable[IntWritable]>`, sum all the values
- Format the result as a writable Text, `s"$word,$encodingStr,$sum"`
- Output this string; example: `hello,96,54`
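A minimal sketch of what the WordCount mapper and reducer could look like, assuming the `org.apache.hadoop.mapreduce` API and JTokkit's cl100k_base encoding; the class names and exact output formatting are illustrative:

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import com.knuddels.jtokkit.Encodings
import com.knuddels.jtokkit.api.EncodingType
import scala.jdk.CollectionConverters._

// Emits <"word,encoding", 1> for every token of a cleaned line.
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val encoder = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE)
  private val one     = new IntWritable(1)

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val cleaned = value.toString.toLowerCase.replaceAll("[^a-z\\s]", " ").trim
    cleaned.split("\\s+").filter(_.nonEmpty).foreach { word =>
      val ids = encoder.encode(word)                                  // byte pair encoding ids
      val enc = (0 until ids.size).map(i => ids.get(i)).mkString(":") // e.g. "96" or "182:34"
      ctx.write(new Text(s"$word,$enc"), one)
    }
  }
}

// Sums the counts for each (word, encoding) key and writes "word,encoding,count".
class WordCountReducer extends Reducer[Text, IntWritable, Text, NullWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, NullWritable]#Context): Unit = {
    val sum = values.asScala.map(_.get).sum
    ctx.write(new Text(s"${key.toString},$sum"), NullWritable.get)    // e.g. "hello,96,54"
  }
}
```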
- Embedding Job
- Mapper
- On the sharded data mapped as `<key, line>`, clean the line as in the previous step, and tokenise the words into their byte pair encodings
- The encoding IntArray is serialised to a string by joining its integers with `:`
- Example: "melodiously" -> [20, 187, 923] -> "20:187:923", and "hello world" -> [[96], [182,34]] -> "96 182:34"
- Now a Word2Vec model is trained using Deeplearning4j on this tokenised and encoded sentence
- For every word in this local mapper's model, we emit the word and its embedding in the format `<word: Text, vector: Text(v1,v2,v3,...,vn)>`, where n is the embedding dimension and the word is recovered from its token using helper functions that decode a token string such as `1:2:3` back into an IntArrayList [1, 2, 3] (these steps are sketched below, before the Reducer)
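A minimal, self-contained sketch of this mapper's core steps: encoding a sentence into `:`-joined token strings, decoding a token back into a word, and training a per-shard Word2Vec model with Deeplearning4j. The object name, method names, and hyperparameters (layerSize, windowSize, the cl100k_base encoding) are illustrative assumptions, not necessarily the project's actual values:

```scala
import com.knuddels.jtokkit.Encodings
import com.knuddels.jtokkit.api.{EncodingType, IntArrayList}
import org.deeplearning4j.models.word2vec.Word2Vec
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory
import scala.jdk.CollectionConverters._

object EmbeddingSketch {
  private val enc = Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE)

  // "hello world" -> "96 182:34"-style sentence: each word's BPE ids joined with ':'
  // (the actual ids depend on the chosen encoding)
  def encodeSentence(sentence: String): String =
    sentence.split("\\s+").filter(_.nonEmpty).map { word =>
      val ids = enc.encode(word)
      (0 until ids.size).map(i => ids.get(i)).mkString(":")
    }.mkString(" ")

  // "20:187:923" -> IntArrayList [20, 187, 923] -> original word
  def decodeToken(token: String): String = {
    val ids = new IntArrayList()
    token.split(":").foreach(s => ids.add(s.toInt))
    enc.decode(ids)
  }

  // Train a local Word2Vec model on this shard's encoded sentences and return
  // (word, embedding) pairs, ready to be emitted by the mapper.
  def trainAndCollect(encodedSentences: Seq[String], dims: Int = 100): Seq[(String, Array[Double])] = {
    val w2v = new Word2Vec.Builder()
      .minWordFrequency(1)
      .layerSize(dims)   // embedding dimension n
      .windowSize(5)
      .iterate(new CollectionSentenceIterator(encodedSentences.asJava))
      .tokenizerFactory(new DefaultTokenizerFactory())
      .build()
    w2v.fit()
    val tokens = encodedSentences.flatMap(_.split(" ")).distinct
    tokens.filter(t => w2v.hasWord(t)).map(t => decodeToken(t) -> w2v.getWordVector(t))
  }
}
```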
- Reducer
- For every key and its values, a word and the vectors emitted for it as `<word: Text, vectors: Iterable[String]>`, deserialize each vector into an `Array[Double]`
- Sum these vectors element-wise and divide each component by the number of vectors, giving the averaged embedding (sketched below)
- Output this averaged vector as the resulting word embedding for this word, as `<word: Text, avgVector: Text(v1, v2, ..., vn)>`
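A minimal sketch of this averaging reducer, assuming each incoming vector arrives as a comma-separated string; the class name and serialisation format are assumptions:

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer
import scala.jdk.CollectionConverters._

// Averages the per-shard embeddings emitted for the same word: element-wise
// sum over all incoming vectors, divided by the number of vectors.
class EmbeddingReducer extends Reducer[Text, Text, Text, Text] {
  override def reduce(key: Text, values: java.lang.Iterable[Text],
                      ctx: Reducer[Text, Text, Text, Text]#Context): Unit = {
    // Deserialize each "v1,v2,...,vn" string into an Array[Double]
    val vectors = values.asScala.map(_.toString.split(",").map(_.toDouble)).toArray
    if (vectors.nonEmpty) {
      val dims = vectors.head.length
      val avg  = (0 until dims).map(d => vectors.map(_(d)).sum / vectors.length)
      ctx.write(key, new Text(avg.mkString(",")))
    }
  }
}
```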
- Cosine Similarity Job
- Mapper
- The input line comes from the sharded embeddings file where each line contains 5000 vectors
- For every pair of vectors in this list, compute the cosine similarity; if it is greater than a set threshold, emit the pair along with the similarity as `<word1, (word2, similarity)>` (see the sketch after this job description)
- Reducer
- For every `<wordI, Iterable[(wordJ, simIJ)]>`, sort the Iterable and emit the top 5 wordJ with the highest similarity scores for wordI
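The similarity measure is standard cosine similarity; a small sketch of how it could be computed (the object name and the threshold handling are illustrative):

```scala
object Cosine {
  // cos(a, b) = (a . b) / (|a| * |b|)
  def similarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot   = a.indices.map(i => a(i) * b(i)).sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}

// Mapper side (sketch): for each pair of words in a shard,
//   val sim = Cosine.similarity(vec1, vec2)
//   if (sim > threshold) emit word1 -> s"$word2,$sim"
// Reducer side: group by word1, sort the (word2, sim) pairs by sim descending, keep the top 5.
```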
Explained in the YouTube video linked above
Tests are found in /src/test/scala/Suite.scala, or just run them from the project root with
sbt test
A minimal example of an MUnit test in the same style is sketched below.
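For illustration, a minimal MUnit test (it exercises the hypothetical `Preprocess.clean` helper from the sketch near the top of this README, not necessarily the project's actual suite):

```scala
import munit.FunSuite

class CleaningSuite extends FunSuite {
  test("cleaning removes punctuation and digits and lowercases") {
    // Uses the illustrative Preprocess.clean helper sketched earlier
    assertEquals(Preprocess.clean("Hello, World 42!"), "hello world")
  }
}
```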
- WordCount results are found in /src/main/resources/wordcount.txt (the vocabulary size, emitted by the reducer, is on the last line of the file)
- Embedding results are found in /src/main/resources/embeddings.txt
- Cosine similarity results are found in /src/main/resources/similarity.txt
- Clone this repository and cd into the root of the project repository
- sbt update - installs the dependencies in build.sbt
- sbt assembly - the jar file should be created in the /target/scala-2.13 directory
- Upload all the input files to HDFS:
hdfs dfs -put ./src/main/resources/*.txt /input
- Runs the MapReduce job for computing the word count and vocabulary on Hadoop locally (example with the assembled jar):
hadoop jar ./target/scala-2.13/hw1.jar stat /input/ulyss12-sharded.txt /output-stat
- The same job with a generic jar name:
hadoop jar <name of jar file> stat /input/ulyss12-sharded.txt /output-wc
- Runs the MapReduce job for computing word embeddings on Hadoop locally:
hadoop jar <name of jar file> embedding /input/ulyss12-sharded.txt /output-embedding
- Runs the MapReduce job for finding similar words for each word in the vocabulary on Hadoop locally:
hadoop jar <name of jar file> cosSim /input/ulyss12-sharded.txt /output-cosSim