Fake News Detection in Python
In this project, we use various natural language processing techniques and machine learning algorithms to classify fake news articles, using the scikit-learn library in Python.
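As a rough, illustrative sketch of the overall approach (not the exact features or models used in this project), a scikit-learn pipeline that turns statements into TF-IDF features and fits a linear classifier might look like this:

```python
# Illustrative sketch only: a TF-IDF + linear classifier pipeline.
# The actual feature extraction and models in this project may differ.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # text -> TF-IDF features
    ("clf", LogisticRegression(max_iter=1000)),        # binary True/False classifier
])

# train_statements: list of news statements; train_labels: matching "True"/"False" labels
# pipeline.fit(train_statements, train_labels)
# predictions = pipeline.predict(test_statements)
```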
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
What you need to install and how to install it:
- Python 3.6
- This setup requires that your machine has Python 3.6 installed on it. You can refer to https://www.python.org/downloads/ to download Python. Once you have Python downloaded and installed, you will need to set up the PATH variable (if you want to run a Python program directly; detailed instructions are below in the how-to-run-software section). To do that, check this: https://www.pythoncentral.io/add-python-to-path-python-is-not-recognized-as-an-internal-or-external-command/.
- Setting up the PATH variable is optional, as you can also run the program without it; more instructions on this topic are given below.
- A second and easier option is to download Anaconda and use its Anaconda Prompt to run the commands. To install Anaconda, check this url: https://www.anaconda.com/download/
- You will also need to download and install the three packages below after you install either Python or Anaconda from the steps above:
- Sklearn (scikit-learn)
- numpy
- scipy
- If you have chosen to install Python 3.6, run the commands below in a command prompt/terminal to install these packages:
pip install -U scikit-learn
pip install numpy
pip install scipy
- If you have chosen to install Anaconda, run the commands below in the Anaconda Prompt to install these packages:
conda install -c anaconda scikit-learn
conda install -c anaconda numpy
conda install -c anaconda scipy
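Whichever route you choose, you can verify that the packages installed correctly with a quick import check (a simple sanity check, not part of the project code):

```python
# Sanity check: confirm the required packages are importable and print their versions.
import sklearn, numpy, scipy
print(sklearn.__version__, numpy.__version__, scipy.__version__)
```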
The data source used for this project is the LIAR dataset, which contains three files in .tsv format for train, test and validation. Below is a description of the data files used for this project.
LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION
William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.
The original dataset contains 14 variables/columns for the train, test and validation sets, as follows (a short loading sketch follows the column list):
- Column 1: the ID of the statement ([ID].json).
- Column 2: the label. (Label classes: True, Mostly-true, Half-true, Barely-true, False, Pants-fire)
- Column 3: the statement.
- Column 4: the subject(s).
- Column 5: the speaker.
- Column 6: the speaker's job title.
- Column 7: the state info.
- Column 8: the party affiliation.
- Column 9-13: the total credit history count, including the current statement.
- 9: barely true counts.
- 10: false counts.
- 11: half true counts.
- 12: mostly true counts.
- 13: pants on fire counts.
- Column 14: the context (venue / location of the speech or statement).
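If you want to inspect the original LIAR .tsv files yourself, they can be loaded with pandas (pandas is an extra assumption here; it is not one of the required packages above). The files have no header row, so the column names below are illustrative labels based on the description above, not names stored in the files:

```python
# Sketch: loading an original LIAR .tsv file with pandas (assumed to be installed).
# Column names are assigned manually based on the column list above.
import pandas as pd

columns = [
    "id", "label", "statement", "subject", "speaker", "job_title", "state",
    "party", "barely_true_counts", "false_counts", "half_true_counts",
    "mostly_true_counts", "pants_on_fire_counts", "context",
]

train = pd.read_csv("liar/train.tsv", sep="\t", header=None, names=columns)
print(train.shape)
print(train["label"].value_counts())
```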
To keep things simple, we have chosen only two variables from the original dataset for this classification. The other variables can be added later to add more complexity and enhance the features.
Below are the columns used to create the three datasets that have been used in this project:
- Column 1: Statement (News headline or text).
- Column 2: Label (Label class contains: True, False)
You will see that the newly created dataset has only two classes, compared to the six original classes. The method used for reducing the number of classes is shown below (a small code sketch follows the mapping list):
- Original -- New
- True -- True
- Mostly-true -- True
- Half-true -- True
- Barely-true -- False
- False -- False
- Pants-fire -- False
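A minimal sketch of this reduction, assuming the `train` DataFrame from the loading sketch above, could look like this:

```python
# Sketch: keep only the statement and label columns, then collapse the six
# original labels into a binary True/False label.
label_map = {
    "true": "True",
    "mostly-true": "True",
    "half-true": "True",
    "barely-true": "False",
    "false": "False",
    "pants-fire": "False",
}

binary_train = train[["statement", "label"]].copy()
binary_train["label"] = binary_train["label"].str.lower().map(label_map)
print(binary_train["label"].value_counts())
```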
The datasets used for this project are in CSV format, named train.csv, test.csv and valid.csv, and can be found in the repo. The original datasets are in the "liar" folder in TSV format.
This file contains all the preprocessing functions needed to process the input documents and texts. First we read the train, test and validation data files, then perform some preprocessing such as tokenizing and stemming. Some exploratory data analysis is also performed, such as checking the response variable distribution and data quality checks for null or missing values.
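As a rough sketch of what such preprocessing can look like (the exact functions in the project may differ; the `Statement`/`Label` column names and the use of NLTK's PorterStemmer are assumptions here):

```python
# Sketch: read the processed CSV files and apply simple preprocessing.
# Column names and the tokenizer/stemmer choices are illustrative assumptions.
import re
import pandas as pd
from nltk.stem import PorterStemmer

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
valid = pd.read_csv("valid.csv")

# Data quality checks: missing values and response variable distribution.
print(train.isnull().sum())
print(train["Label"].value_counts())

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    """Lowercase a statement, split it into word tokens, and stem each token."""
    tokens = re.findall(r"[a-z]+", str(text).lower())
    return [stemmer.stem(tok) for tok in tokens]

train["tokens"] = train["Statement"].apply(tokenize_and_stem)
print(train[["Statement", "tokens"]].head())
```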