ECE324 Project
Winter 2022
- Data Collection
- Gensim Word Embedding Model Training
- Custom Word2Vec Definition and Training
- Storing the Word Embedding Models
- Bolukbasi et al. Debiasing
- Zhao et al. Debiasing
- Savani et al. Debiasing
- Bias Measurements
Wikipedia:
- wiki/Downloading_wiki.ipynb: Jupyter notebook that downloads 2020 Wikipedia articles from TensorFlow's cloud database. (Database) (Code credit)
Gutenberg Books:
- gutenberg/gutenberg_data.py: Reads the URLs from the Gutenberg URL files and loads the data they point to. Its functions are called in Word2Vec Model and Bias Measurements.ipynb.
- Gutenberg URL files (Source):
- gutenberg/gutenberg-test-urls.csv: test data URLs
- gutenberg/gutenberg-train-urls.csv: train data URLs
- gutenberg/gutenberg-validation-urls.csv: validation data URLs
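As a rough illustration of the URL-file workflow described above, the sketch below collects URLs from a CSV file and downloads the text behind one of them. It is not the code in gutenberg/gutenberg_data.py: the function names `read_urls` and `fetch_text` are hypothetical, the CSV layout (one URL per row, optionally with a header) is an assumption, and the example URLs are placeholders.

```python
import csv
import io
from urllib.request import urlopen


def read_urls(csv_file):
    """Collect every cell that looks like an http(s) URL (hypothetical helper)."""
    urls = []
    for row in csv.reader(csv_file):
        for cell in row:
            cell = cell.strip()
            # Header cells and empty cells are skipped by this check.
            if cell.startswith(("http://", "https://")):
                urls.append(cell)
    return urls


def fetch_text(url, timeout=30):
    """Download one document and decode it as UTF-8, replacing invalid bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# In-memory stand-in for a URL file; the layout and URLs are assumptions.
sample = io.StringIO(
    "url\nhttps://example.org/book1.txt\nhttps://example.org/book2.txt\n"
)
urls = read_urls(sample)
```

With a real file, `read_urls(open("gutenberg/gutenberg-train-urls.csv", newline=""))` would return the list to pass to `fetch_text` one URL at a time, assuming the file matches the layout guessed here.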
- Exploring_Gender_Biases_in_Word2Vec.ipynb: defines the gensim word2vec model (Source) and trains the embeddings on the Gutenberg dataset.
- wiki/Wiki_Word2Vec_Training.ipynb: trains the embeddings on the Wikipedia dataset.
- The custom word2vec model instance, along with the random perturbation algorithm functions, can be found in custom_word2vec.py.
- The custom model was trained in the main notebook, Exploring_Gender_Biases_in_Word2Vec.ipynb.
- /models: For the gensim models, this folder stores the Wikipedia and Gutenberg models produced by the built-in save function in the gensim library. The custom models were saved using pickle, but were not added to this GitHub repository because of their large file size.
- /embeddings/: Stores the word embedding dictionaries (each line holds a word and its embedding) in .txt format.
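To make the text format of the embedding dictionaries concrete, here is a minimal sketch that writes and reads one word per line followed by its vector components. The whitespace separator and the helper names `write_embeddings` and `read_embeddings` are assumptions for illustration; the files in /embeddings/ may use a different delimiter.

```python
import io


def write_embeddings(embeddings, handle):
    """Write each word and its vector on one line, separated by spaces."""
    for word, vector in embeddings.items():
        values = " ".join(repr(float(x)) for x in vector)
        handle.write(f"{word} {values}\n")


def read_embeddings(handle):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float lists."""
    embeddings = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Round trip through an in-memory buffer with toy two-dimensional vectors.
buffer = io.StringIO()
write_embeddings({"nurse": [0.1, -0.2], "engineer": [0.3, 0.4]}, buffer)
buffer.seek(0)
loaded = read_embeddings(buffer)
```

A plain-text format like this is easy to inspect and to pass between the debiasing scripts, at the cost of larger files than a binary format.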
Code related to Bolukbasi et al. debiasing is found in the 2016debais folder. See the README in that folder for more details. (Code credit)
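For readers unfamiliar with the method, the sketch below shows the "neutralize" step of hard debiasing: the component of a word vector along a gender direction is removed and the result is renormalized. It is a simplified illustration, not the code in the 2016debais folder. In particular, the gender direction here is assumed to be a single difference vector between two toy embeddings, whereas the paper derives it from several definitional pairs, and all vectors are made-up values.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def neutralize(vector, direction):
    """Remove the projection of `vector` onto `direction`, then renormalize."""
    g = normalize(direction)
    scale = dot(vector, g)
    residual = [x - scale * gi for x, gi in zip(vector, g)]
    return normalize(residual)


# Toy three-dimensional embeddings (assumed values, for illustration only).
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
gender_direction = [a - b for a, b in zip(he, she)]

debiased = neutralize([0.6, 0.8, 0.0], gender_direction)
```

After this step the vector is orthogonal to the chosen gender direction, which is the property the neutralize step aims for with gender-neutral words.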
Code related to Zhao et al. debiasing is found in the 2018debias folder, along with the debiased .txt word embedding files.
Code related to Savani et al. debiasing is found in the main notebook, Exploring_Gender_Biases_in_Word2Vec.ipynb, right after the custom word2vec model is trained.
The three bias measurements (direct, indirect, WEAT) and their results can be found in Exploring_Gender_Biases_in_Word2Vec.ipynb.
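As a reference point for two of these measurements, the sketch below computes a direct bias score (the mean absolute cosine similarity between gender-neutral words and a gender direction, raised to a strictness exponent c) and a WEAT effect size. It follows the commonly cited definitions rather than the notebook's actual code: the function names are hypothetical, the vectors are toy values, and using the sample standard deviation in the WEAT denominator is an assumption.

```python
import math
import statistics


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def direct_bias(neutral_vectors, gender_direction, c=1.0):
    """Mean of |cos(w, g)|^c over the gender-neutral word vectors."""
    return statistics.mean(
        abs(cosine(w, gender_direction)) ** c for w in neutral_vectors
    )


def association(w, attrs_a, attrs_b):
    """Mean similarity of w to attribute set A minus its mean similarity to B."""
    mean_a = statistics.mean(cosine(w, a) for a in attrs_a)
    mean_b = statistics.mean(cosine(w, b) for b in attrs_b)
    return mean_a - mean_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, scaled by their standard deviation."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Sample standard deviation over all target words (an assumption).
    return (statistics.mean(s_x) - statistics.mean(s_y)) / statistics.stdev(s_x + s_y)


# Toy two-dimensional vectors (assumed values, for illustration only).
bias = direct_bias([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
effect = weat_effect_size(
    targets_x=[[1.0, 0.0], [1.0, 0.1]],
    targets_y=[[-1.0, 0.0], [-1.0, 0.1]],
    attrs_a=[[1.0, 0.0]],
    attrs_b=[[-1.0, 0.0]],
)
```

A direct bias near 0 means the neutral words carry little of the gender direction, and a WEAT effect size near 0 means the two target sets are about equally associated with both attribute sets.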