Train a T5 model to generate simple Fake News and use a RoBERTa model to classify what's fake and what's real.

RobinSmits/FakeNews-Generator-And-Detector

FakeNews-Generator-And-Detector

Introduction

Recently I was experimenting with the T5 model and exploring the options it has to offer. Thinking about the summarization capabilities of current state-of-the-art NLP models, I was curious what would happen if I turned that around: input a short text and let the model generate a longer one. Doing this with a news dataset would give me a simple fake news generator. I could then use the real and fake news to train a classifier and see how well yet another NLP model could tell the two apart.

Dataset

As the news dataset I used 'ag_news_subset' from TensorFlow Datasets. It is primarily used for news topic classification and contains 120K news articles for training and 7.6K news articles for testing. Since each row contains a short 'title' and a longer 'description' for a news article, it is also well suited to a fake news generator and detector.
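As a minimal sketch of how the rows could be read: the helper names `decode_rows` and `load_ag_news` are hypothetical and not taken from the notebooks. The sketch assumes the `tensorflow_datasets` package is installed and relies on the 'title' and 'description' features described above.

```python
from typing import Dict, Iterable, List, Tuple


def decode_rows(rows: Iterable[Dict[str, object]]) -> List[Tuple[str, str]]:
    """Turn raw records into (title, description) string pairs."""
    pairs = []
    for row in rows:
        title, description = row["title"], row["description"]
        # tfds.as_numpy yields string features as bytes; decode them to str.
        if isinstance(title, bytes):
            title = title.decode("utf-8")
        if isinstance(description, bytes):
            description = description.decode("utf-8")
        pairs.append((str(title), str(description)))
    return pairs


def load_ag_news(split: str) -> List[Tuple[str, str]]:
    """Download one split ('train' or 'test') of 'ag_news_subset'."""
    # Imported lazily because it is a heavy dependency (hypothetical helper).
    import tensorflow_datasets as tfds

    dataset = tfds.load("ag_news_subset", split=split)
    return decode_rows(tfds.as_numpy(dataset))
```

Calling `load_ag_news("train")` would download the training split on first use and return the title and description pairs as plain strings.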

Summary

To summarize, this repository contains the code for the following 3 steps:

  • Train a T5 model on the first half of a news dataset where we use 'title' as input and 'description' as output. After training we use the T5 model to generate fake news based on the 'title' in the second half of the news dataset. That 'title' is the input and the model will generate a new fake 'description' as output.
  • Train a RoBERTa model to classify the real and fake news.
  • With the unseen data from the test set, first generate fake news with the T5 model and then have the RoBERTa model classify the real and fake news.

Details

Below is a further description of the specific actions in each notebook. Note that the subfolder 'fake_news' contains the generated CSV files with the text generated by the T5 model. If you want to download the model weights for the trained T5 and RoBERTa models, use the following url. Extract the files and place them in the 'fake_news' subfolder as well. This way you can use the notebooks without performing the full training process. Do note that text generation with the T5 model can still take a long time, depending on your available hardware.

FakeNews_Generator_T5 Notebook:

  • Download the 'train' part of the 'ag_news_subset' dataset.
  • Split 'ag_news_subset' into a 'train' set with 60K rows and a 'generate' set with 60K rows.
  • Train the T5 model on the train set. Use the 'title' column as input and use the 'description' column as output.
  • Use the 'title' column from the generate set as input for the T5 model to generate a full set of fake news (stored in column 'generated_description') and save it as file 't5_generated_fake_news.csv'.
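The split and generation steps above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: it assumes the fine-tuned T5 checkpoint is stored in Hugging Face `transformers` format with PyTorch weights, and the names `split_in_half`, `generate_descriptions`, `model_dir`, and `max_new_tokens=64` are hypothetical choices.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (title, description)


def split_in_half(pairs: Sequence[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """First half trains T5; the second half supplies titles for generation."""
    middle = len(pairs) // 2
    return list(pairs[:middle]), list(pairs[middle:])


def generate_descriptions(titles: Sequence[str], model_dir: str,
                          max_new_tokens: int = 64) -> List[str]:
    """Generate one fake description per title with a fine-tuned T5 model."""
    # Assumption: model_dir holds a Hugging Face checkpoint (PyTorch weights).
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = T5ForConditionalGeneration.from_pretrained(model_dir)
    generated = []
    for title in titles:
        # The title is the model input; the output is a new 'description'.
        inputs = tokenizer(title, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        generated.append(
            tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return generated
```

Generating one title at a time keeps the sketch short; batching the titles would likely be much faster on a GPU.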

FakeNews_Classifier_RoBERTa Notebook:

  • Import the previously generated file 't5_generated_fake_news.csv' and preprocess it for classification. We want to be able to classify real or fake news. The 'description' column will be labelled as the real news. The 'generated_description' column will be labelled as the fake news.
  • Split that dataset into an 80/20 train and validation set.
  • Train and validate a RoBERTa base model for classification.
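The labelling and the 80/20 split can be sketched with the standard library alone. The label encoding (0 for real, 1 for fake), the seed, and the function names are assumptions made for this example; the rows are dictionaries such as `csv.DictReader` would yield for 't5_generated_fake_news.csv'.

```python
import random
from typing import Dict, Iterable, List, Sequence, Tuple

REAL, FAKE = 0, 1  # assumed label encoding, chosen for this sketch

Example = Tuple[str, int]  # (text, label)


def build_examples(rows: Iterable[Dict[str, str]]) -> List[Example]:
    """Each CSV row yields one real and one fake example."""
    examples = []
    for row in rows:
        examples.append((row["description"], REAL))
        examples.append((row["generated_description"], FAKE))
    return examples


def train_validation_split(
    examples: Sequence[Example],
    validation_fraction: float = 0.2,
    seed: int = 42,
) -> Tuple[List[Example], List[Example]]:
    """Shuffle, then hold out the given fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

With the 60K generated rows this produces 120K labelled examples, of which 96K would go to training and 24K to validation.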

FakeNews_Generator_And_Detector Notebook:

  • Download the 'test' part of the 'ag_news_subset'. This dataset contains 7600 rows of data not seen by either the T5 or the RoBERTa model.
  • Use the T5 model to generate fake news using the 'title' of the 7600 new rows as input. The generated fake news is stored in file 't5_generated_fake_news_final.csv' as the column 'generated_description', together with the original 'title' and 'description'.
  • Use the RoBERTa model to classify all the data and detect what is real (the column 'description') or fake news (the column 'generated_description').
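The final detection step amounts to scoring a classifier on labelled real and fake texts. The sketch below separates the accuracy computation from the model. The `roberta_classifier` wrapper is hypothetical: it assumes a Hugging Face `transformers` checkpoint in `model_dir` that still uses the default label names "LABEL_0" and "LABEL_1", mapped here to real (0) and fake (1).

```python
from typing import Callable, Iterable, Tuple


def accuracy(classify: Callable[[str], int],
             examples: Iterable[Tuple[str, int]]) -> float:
    """Fraction of (text, label) examples the classifier gets right."""
    correct = total = 0
    for text, label in examples:
        correct += int(classify(text) == label)
        total += 1
    if total == 0:
        raise ValueError("no examples to score")
    return correct / total


def roberta_classifier(model_dir: str) -> Callable[[str], int]:
    """Wrap a fine-tuned RoBERTa checkpoint as a text -> label function."""
    # Assumption: default "LABEL_<n>" names in the saved model config.
    from transformers import pipeline

    pipe = pipeline("text-classification", model=model_dir)

    def classify(text: str) -> int:
        result = pipe(text, truncation=True)[0]
        return int(result["label"].split("_")[-1])

    return classify
```

For the test set, the examples would be the 7600 'description' texts labelled real plus the 7600 'generated_description' texts labelled fake.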

Results

During my experiments I used T5-small and T5-base models to train and generate the fake news. All classification was done with a RoBERTa base model.

Fake News generated by the T5-small model could be classified by the RoBERTa model with roughly 99% accuracy. Fake News generated by the T5-base model could be classified by the RoBERTa model with roughly 97% accuracy.
