In this notebook, we explore several embedding and model pairs for classifying news articles as either real or fake. The following approaches were explored:
- GloVe Embeddings + LSTM (both uni/bidirectional models)
- TF-IDF + Logistic Regression
- CountVectorizer + Logistic Regression
- Pretrained Tokenizer + Transformer Model from BERT
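As a rough illustration of the simplest pairing in the list above, here is a minimal sketch of TF-IDF features fed into logistic regression with scikit-learn. The toy texts, the label encoding (0 = real, 1 = fake), and the `max_iter` value are assumptions made for this example only, not the notebook's actual data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); the notebook would use the
# 'content' and 'label' columns of the news dataframe instead.
texts = [
    "government announces new budget for schools",
    "scientists publish peer reviewed climate study",
    "shocking miracle cure doctors hate revealed",
    "aliens secretly control the stock market",
]
labels = [0, 0, 1, 1]  # assumed encoding: 0 = real, 1 = fake

# The vectorizer turns each article into a sparse TF-IDF vector;
# logistic regression then learns one weight per vocabulary term.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

preds = clf.predict(texts)
print(preds)
```

Swapping `TfidfVectorizer` for `CountVectorizer` (same module) gives the raw-count variant of the same pipeline.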
To run this project, you'll need to install the required libraries. Use the following commands to set up your environment:
pip install numpy pandas matplotlib seaborn torch torchtext scikit-learn tqdm
Using the preprocessing notebook, you can download the original dataset and apply whichever preprocessing steps you'd like.
You can import either the preprocessed data (news_df_processed.csv) or the raw dataframe, which has the form {'label': (0 or 1), 'content': (article string)}.
import pandas as pd
news_df = pd.read_csv('path/to/your/news_df_processed.csv')
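Before training, it may be worth checking that the loaded dataframe really has the 'label' and 'content' layout described above. The sketch below is one possible check; the helper name `validate_news_df` and the small in-memory dataframe are hypothetical and stand in for the real CSV.

```python
import pandas as pd

def validate_news_df(df):
    """Check the assumed {'label', 'content'} layout and drop empty articles."""
    missing = {"label", "content"} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not df["label"].isin([0, 1]).all():
        raise ValueError("labels must be 0 or 1")
    # Drop rows whose article text is missing or only whitespace.
    df = df.dropna(subset=["content"])
    df = df[df["content"].str.strip().str.len() > 0]
    return df.reset_index(drop=True)

# Hypothetical stand-in for the dataframe read from the CSV.
raw = pd.DataFrame(
    {"label": [0, 1, 1], "content": ["a real article", "   ", "a fake article"]}
)
clean = validate_news_df(raw)
print(clean)
```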
If you are going to run the notebook on base-tier GPUs, we highly recommend keeping embedding_dimensions under 250. Otherwise, your session might crash and you would lose all your progress.
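To get a feel for how the embedding dimension affects memory, the following sketch estimates the size of the embedding table alone. The vocabulary size of 400,000 and the 4-byte float width are assumptions for illustration; the embedding table is only one part of the total footprint, since activations and optimizer state also grow with the dimension.

```python
def embedding_table_bytes(vocab_size, embedding_dim, bytes_per_value=4):
    """Bytes needed to store a dense vocab_size x embedding_dim table."""
    return vocab_size * embedding_dim * bytes_per_value

VOCAB_SIZE = 400_000  # assumed vocabulary size, for illustration only

for dim in (50, 100, 200, 300):
    size_mib = embedding_table_bytes(VOCAB_SIZE, dim) / (1024 ** 2)
    print(f"embedding_dim={dim}: {size_mib:.1f} MiB")
```

The table grows linearly with the dimension, so going from 100 to 300 triples its size.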
Note that each training epoch takes about 10 minutes on a Tesla T4 GPU with 15.84 GB of memory.
history = model.fit(
    x={
        'input_ids': X_train_token['input_ids'],
        'input_mask': X_train_token['attention_mask'],
    },
    y=Y_train,
    epochs=2,
    validation_split=0.2,
    batch_size=30,
    callbacks=[callback],
)
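Assuming the call above is the Keras `fit` method (an assumption based on its arguments), `validation_split=0.2` holds out the last 20% of the samples, taken before any shuffling. The pure-Python sketch below mimics that behaviour on ten dummy samples; the helper name `keras_style_split` is hypothetical.

```python
import math

def keras_style_split(samples, validation_split):
    """Mimic a tail split: the last fraction of samples becomes validation data."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

samples = list(range(10))  # dummy sample indices
train_part, val_part = keras_style_split(samples, 0.2)
print(train_part)
print(val_part)
```

If the rows are ordered by label, a tail split like this yields a skewed validation set, so shuffling the data before calling `fit` is a sensible precaution.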
I would like to thank the Neuromatch Academy for providing the platform and resources for this project.
This project is licensed under the MIT License - see the LICENSE file for details.
For any questions or suggestions, please contact:
Boran Aybak Kilic [email protected]