This directory contains datasets used in the repository, organized as follows:
The imdb directory contains data that has been processed and split into training, validation, and test sets.
- Original Dataset: IMDb Dataset
- This dataset was created by the Stanford AI Lab and contains movie reviews along with sentiment polarity labels.
- train.csv (35,000 samples)
- val.csv (5,000 samples)
- test.csv (10,000 samples)
These files were generated using the script located at utils/preprocess_imdb_dataset.py. The script processes the original IMDb dataset to create three distinct splits: train, val, and test.
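The split sizes listed above add up to the 50,000 reviews in the original IMDb dataset. As a rough illustration only, and not a copy of utils/preprocess_imdb_dataset.py, a fixed-size split could look like the sketch below. The function name `split_rows`, the shuffle step, and the seed value are assumptions made for this example.

```python
import random


def split_rows(rows, n_train=35_000, n_val=5_000, n_test=10_000, seed=0):
    """Shuffle rows and cut them into train/val/test lists of the given sizes."""
    total = n_train + n_val + n_test
    if len(rows) != total:
        raise ValueError(f"expected {total} rows, got {len(rows)}")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # assumed: seeded shuffle for reproducibility
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in data: integers play the role of (review, sentiment) rows.
train, val, test = split_rows(list(range(50_000)))
print(len(train), len(val), len(test))  # prints: 35000 5000 10000
```

Fixing the seed keeps the three files reproducible across runs; whether the actual script shuffles or stratifies by label is not stated here.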
The file debiased_profanity_check_with_keywords.csv contains data related to profanity and bias checks, with specific keywords highlighted for analysis.
- Dataset: Bias-DeBiased
- This dataset is part of efforts to understand and mitigate bias in media texts, hosted on Hugging Face.
- Paper: Exploring the Detection of Media Bias via Machine Learning
- This paper discusses methods for detecting and mitigating media bias using machine learning techniques.
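The column layout of debiased_profanity_check_with_keywords.csv is not documented in this directory, so it can help to look at the header before writing any loading code. The sketch below uses only the standard library; the helper name `peek_csv` and the column names in the in-memory sample are hypothetical and do not describe the real file.

```python
import csv
import io


def peek_csv(handle, n_rows=3):
    """Return the header and up to n_rows data rows from an open CSV file."""
    reader = csv.reader(handle)
    header = next(reader, [])  # empty list if the file has no lines
    rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# In-memory stand-in with hypothetical column names, not the real schema.
sample = io.StringIO("text,label\nfirst example,0\nsecond example,1\n")
header, rows = peek_csv(sample)
print(header)
print(rows)
```

For the real file, pass a handle opened with `open("debiased_profanity_check_with_keywords.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.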
If you use the Bias-DeBiased dataset provided in this directory, please cite the appropriate sources as follows:
@misc{raza2023newsmediabias,
Author = {Shaina Raza},
title = {News Media Bias},
year = {2023},
url = {https://huggingface.co/newsmediabias},
}