This is the official repository of the paper:
Zoom Out and Observe: News Environment Perception for Fake News Detection
Qiang Sheng, Juan Cao, Xueyao Zhang, Rundong Li, Danding Wang, and Yongchun Zhu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)
PDF / Poster / Code / Chinese Video / Chinese Blog / English Blog
The experimental datasets are in the dataset folder, including the Chinese Dataset and the English Dataset. Note that you can download the datasets only after submitting an "Application to Use the Datasets for News Environment Perceived Fake News Detection".
python==3.6.10
torch==1.6.0
transformers==4.0.0
Due to GitHub's space limit, we host SimCSE's training data on Google Drive. Download the dataset file (i.e., [dataset]_train.txt) and move it into the preprocess/SimCSE/train_SimCSE/data folder of this repo. Then run:
cd preprocess/SimCSE/train_SimCSE
# Configure the dataset.
sh train.sh
You can also train the SimCSE model on your own custom dataset.
cd preprocess/SimCSE
# Configure the dataset.
sh run.sh
Get the macro environment and rank its internal items by similarity:
cd preprocess/NewsEnv
# Configure the specific T days of the macro environment.
sh run.sh
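As a rough illustration of this step (not the repository's actual implementation), the sketch below collects the news items published within T days before a post and ranks them by cosine similarity to the post. The function names, the toy 2-dimensional vectors, and the exact window boundaries (start inclusive, strictly before the post) are assumptions made for the example; in practice the vectors would be the SimCSE representations from the previous step.

```python
import math
from datetime import datetime, timedelta


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def macro_environment(post_time, post_vec, news, days):
    """Rank news published in the `days` days before the post by similarity.

    news: list of (news_id, publish_time, vector) tuples (hypothetical layout).
    Returns (news_id, similarity) pairs, most similar first.
    """
    start = post_time - timedelta(days=days)
    window = [
        (news_id, cosine(post_vec, vec))
        for news_id, publish_time, vec in news
        if start <= publish_time < post_time
    ]
    return sorted(window, key=lambda item: item[1], reverse=True)


# Toy data: three items inside a 5-day window, one far too old.
news = [
    ("n1", datetime(2022, 1, 1), [1.0, 0.0]),
    ("n2", datetime(2022, 1, 3), [0.6, 0.8]),
    ("n3", datetime(2022, 1, 4), [0.0, 1.0]),
    ("n4", datetime(2021, 12, 1), [1.0, 0.0]),
]
ranked = macro_environment(datetime(2022, 1, 5), [1.0, 0.0], news, days=5)
print(ranked)
```

Changing `days` here plays the role of the T setting configured in run.sh.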
This step prepares the inputs for the specific detectors. Our paper uses six base models, whose preparation dependencies are as follows:
Group | Model | Input (Tokenization) | Special Preparation |
Post-Only | Bi-LSTM | Word Embeddings | - |
Post-Only | EANN | Word Embeddings | Event Adversarial Training |
Post-Only | BERT | BERT's Tokens | - |
Post-Only | BERT-Emo | BERT's Tokens | Emotion Features |
"Zoom-In" | DeClarE | Word Embeddings | Fact-checking Articles |
"Zoom-In" | MAC | Word Embeddings | Fact-checking Articles |
The table above involves five preprocessing steps in total: (1) Tokenization by Word Embeddings, (2) Tokenization by BERT, (3) Event Adversarial Training, (4) Emotion Features, and (5) Fact-checking Articles. We describe each in turn.
This tokenization depends on external pretrained word embeddings. In our paper, we use sgns.weibo.bigram-char (Downloading URL) for Chinese and glove.840B.300d (Downloading URL) for English.
cd preprocess/WordEmbeddings
# Configure the dataset and your local word-embeddings filepath.
sh run.sh
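To make the idea concrete, here is a minimal sketch of word-embedding tokenization, assuming a text-format embedding file with one `word v1 v2 ...` entry per line and an optional `count dim` header. The helper names `load_embeddings` and `encode`, the `<pad>`/`<unk>` entries, and the tiny in-memory file are all hypothetical; the repository's own script may differ, and lines that do not fit this simple pattern are not handled.

```python
import io


def load_embeddings(handle):
    """Parse 'word v1 v2 ...' lines into a vocabulary and a list of vectors.

    Index 0 is reserved for padding and index 1 for unknown words
    (an assumption of this sketch, both initialised to zero vectors).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    vectors = [None, None]
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2:
            continue  # skip a word2vec-style "count dim" header
        word, values = parts[0], [float(x) for x in parts[1:]]
        if word not in vocab:
            vocab[word] = len(vectors)
            vectors.append(values)
    dim = len(vectors[2]) if len(vectors) > 2 else 0
    vectors[0] = [0.0] * dim
    vectors[1] = [0.0] * dim
    return vocab, vectors


def encode(tokens, vocab, max_len):
    """Map tokens to vocabulary ids, truncating or padding to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# A three-word stand-in for a real embedding file.
fake_file = io.StringIO("3 2\nnews 0.1 0.2\nfake 0.3 0.4\nreal 0.5 0.6\n")
vocab, vectors = load_embeddings(fake_file)
ids = encode(["fake", "news", "spreads"], vocab, max_len=5)
print(ids)
```

With a real file, you would pass `open(path, encoding="utf-8")` instead of the `io.StringIO` stand-in.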
cd preprocess/BERT
# Configure the dataset and the pretrained model
sh run.sh
cd preprocess/EANN
# Configure the dataset and the event number
sh run.sh
cd preprocess/Emotion/code/preprocess
# Configure the dataset
sh run.sh
There are two preparation steps for fact-checking articles:
- Retrieve the most relevant articles for every post. Specifically, we have retrieved each post's top-10 relevant articles published BEFORE the post; the results are saved in the preprocess/BM25/data folder. For more implementation details, refer to preprocess/BM25/[dataset].ipynb.
- Tokenize the fact-checking articles by word embeddings:
cd preprocess/WordEmbeddings
# Configure the dataset and your local word-embeddings filepath. Set the data_type as 'article'.
sh run.sh
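The retrieval step described above can be sketched as follows. This is a plain-Python Okapi BM25 scorer with a publish-time filter, written for illustration only; it is not the code in the notebooks. The function name `bm25_top_k`, the parameters `k1` and `b`, the integer timestamps, and the choice to compute corpus statistics over the filtered candidates only are assumptions of this example.

```python
import math
from collections import Counter


def bm25_top_k(query, articles, post_time, k=10, k1=1.5, b=0.75):
    """Return up to k (article_id, score) pairs, best first.

    query: list of tokens from the post.
    articles: list of (article_id, publish_time, tokens) tuples.
    Only articles published strictly before post_time are candidates.
    """
    candidates = [(aid, toks) for aid, t, toks in articles if t < post_time]
    if not candidates:
        return []
    n = len(candidates)
    avg_len = sum(len(toks) for _, toks in candidates) / n or 1.0
    # Document frequency of each term among the candidates.
    df = Counter(term for _, toks in candidates for term in set(toks))
    scored = []
    for aid, toks in candidates:
        tf = Counter(toks)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scored.append((aid, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy corpus; times are integers so that "before" is a simple comparison.
articles = [
    ("a1", 1, ["vaccine", "rumor", "debunked"]),
    ("a2", 2, ["weather", "report"]),
    ("a3", 5, ["vaccine", "rumor", "rumor"]),  # published after the post
]
top = bm25_top_k(["vaccine", "rumor"], articles, post_time=3, k=10)
print(top)
```

Here `a3` matches the query best lexically but is excluded because it appears after the post, which is the point of the time constraint.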
cd model
# Configure the dataset and the parameters of the model
sh run.sh
After that, the results and classification reports will be saved in ckpts/[dataset]/[model].
If you find our dataset and code helpful, please cite the following ACL 2022 paper:
@inproceedings{NEP,
title = "Zoom Out and Observe: News Environment Perception for Fake News Detection",
author = "Sheng, Qiang and
Cao, Juan and
Zhang, Xueyao and
Li, Rundong and
Wang, Danding and
Zhu, Yongchun",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
month = may,
year = "2022",
publisher = "Association for Computational Linguistics"
}
Since the HuffPost part of the English news environment is based on the News Category Dataset, please also cite the following references, as the Kaggle page requires:
@dataset{misra2018news,
title={News Category Dataset},
author={Misra, Rishabh},
year = {2018},
month = {06},
doi = {10.13140/RG.2.2.20331.18729}
}
@book{misra2021sculpting,
author = {Misra, Rishabh and Grover, Jigyasa},
year = {2021},
month = {01},
pages = {},
title = {Sculpting Data for ML: The first act of Machine Learning},
isbn = {9798585463570}
}