Data Directory

This directory contains datasets used in the repository, organized as follows:

IMDb Dataset

The imdb directory contains the IMDb data after preprocessing, split into training, validation, and test sets.

Source

  • Original Dataset: IMDb Dataset
    • This dataset was created by the Stanford AI Lab and contains movie reviews along with sentiment polarity labels.

Files

  • train.csv (35,000 samples)
  • val.csv (5,000 samples)
  • test.csv (10,000 samples)
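
A quick sanity check on the sample counts listed above can be sketched as follows. It assumes each CSV starts with a single header row and that the files sit together in one directory; `count_rows` and `check_split_sizes` are hypothetical helpers written for this illustration, not part of the repository.

```python
import csv
from pathlib import Path

# Sample counts as listed in this README.
EXPECTED_ROWS = {"train.csv": 35_000, "val.csv": 5_000, "test.csv": 10_000}


def count_rows(csv_path):
    """Count data rows in a CSV file, skipping one header row (assumed present)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        # csv.reader handles quoted fields that span several lines,
        # so a multi-line review still counts as one row.
        return sum(1 for _ in reader)


def check_split_sizes(data_dir, expected=EXPECTED_ROWS):
    """Return {filename: (expected, actual)} for every file whose row count differs."""
    mismatches = {}
    for name, want in expected.items():
        got = count_rows(Path(data_dir) / name)
        if got != want:
            mismatches[name] = (want, got)
    return mismatches
```

Calling `check_split_sizes("imdb")` from this directory should return an empty dictionary if the files match the counts above.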

Generation

These files were generated using the script located at utils/preprocess_imdb_dataset.py. The script processes the original IMDb dataset to create three distinct splits: train, val, and test.
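
As a rough illustration of how a 35,000 / 5,000 / 10,000 split can be produced from 50,000 reviews, the sketch below shuffles row indices with a fixed seed and slices them. This is an assumption about the general approach, not the actual logic of utils/preprocess_imdb_dataset.py; `split_indices` and its parameters are hypothetical names.

```python
import random


def split_indices(n_rows, n_val, n_test, seed=0):
    """Shuffle row indices and cut them into train/val/test index lists."""
    if n_val + n_test > n_rows:
        raise ValueError("validation and test sizes exceed the number of rows")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


# Sizes taken from the file list in this README.
train_idx, val_idx, test_idx = split_indices(50_000, n_val=5_000, n_test=10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 35000 5000 10000
```

Fixing the seed means the same rows land in the same split on every run, which keeps results comparable across experiments.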

Bias-DeBiased Dataset

The file debiased_profanity_check_with_keywords.csv contains data related to profanity and bias checks, with specific keywords highlighted for analysis.

Source

  • Dataset: Bias-DeBiased
    • This dataset is part of efforts to understand and mitigate bias in media texts, hosted on Hugging Face.

Reference Paper

Citation

If you use the Bias-DeBiased dataset provided in this directory, please cite it as follows:

@misc{raza2023newsmediabias,
  author = {Shaina Raza},
  title  = {News Media Bias},
  year   = {2023},
  url    = {https://huggingface.co/newsmediabias}
}