Update README.md
eftekhar-hossain authored Aug 17, 2020
1 parent d43f4ac commit c72565b
Showing 1 changed file with 30 additions and 1 deletion.
31 changes: 30 additions & 1 deletion README.md
@@ -1,6 +1,35 @@
## Bangla News Headlines Categorization Using Gated Recurrent Unit (GRU): Project Overview
- Created a tool that categorizes Bengali news headlines into six categories (**`National, Politics, International, Sports, Amusement, IT`**) using a deep recurrent neural network.
- Created a dataset of **`0.13 Million`** news headlines. A Chrome web scraper extension was used to scrape the headlines from Bengali online news portals such as **`Dainik Jugantor, Dainik Ittefaq, Dainik Kaler Kontho`** and others.
- A **`word embedding`** feature representation technique is used to capture the semantic meaning of the words.
- A deep learning model is built using a **`bidirectional gated recurrent network`**.
- Finally, the model's performance is evaluated using several measures: **`confusion matrix, accuracy, precision, recall and f1-score`**.
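
The evaluation measures named in the last bullet can be sketched in plain Python. This is a minimal illustration, not the project's evaluation code: the function names and the toy labels are assumptions made up for the example. In practice, scikit-learn's `confusion_matrix` and `classification_report` compute the same quantities.

```python
LABELS = ["National", "Politics", "International", "Sports", "Amusement", "IT"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f1)} from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels, for illustration only (not results from this project).
y_true = ["Sports", "Sports", "IT", "Politics", "National", "IT"]
y_pred = ["Sports", "IT", "IT", "Politics", "Politics", "IT"]

cm = confusion_matrix(y_true, y_pred, LABELS)
scores = per_class_scores(cm, LABELS)
acc = accuracy(cm)
```

Because the dataset is imbalanced, the per-class precision, recall and f1 values are usually more informative than overall accuracy alone.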

## Resources Used
- **Development Environment :** Google Colab
- **Python Version :** 3.7
- **Framework and Packages :** TensorFlow, scikit-learn, Pandas, NumPy, Matplotlib, Seaborn
- **Scraper :** [Chrome Web Scraper](https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en)

## Project Outline
- Data Collection and Cleaning
- Data Summary
- Data Preparation for Model Building
- Model Development
- Model Evaluation


## Data Collection and Cleaning
Data is collected by creating a scraping graph with the Chrome web scraper. Around `0.13 Million` Bengali news headlines in six categories are scraped from different online news portals, including **`Dainik Jugantor, Dainik Ittefaq, Dainik Kaler Kontho`** and others. The following figure shows the distribution of headlines across the categories. The dataset is imbalanced.

![](/images/data_distribution.PNG)

Because the headlines are short, removing stopwords is not required. After cleaning, the sample data looks like this.

![Sample Data](/images/data_sample.PNG)
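
The README does not list the exact cleaning steps, so the following is only a sketch of one plausible approach: keep characters from the Bengali Unicode block (U+0980 to U+09FF), drop everything else, and collapse whitespace. The function name `clean_headline` and the sample string are assumptions made up for this example.

```python
import re

# Keep the Bengali block (U+0980-U+09FF), the zero-width joiners that some
# Bengali conjuncts rely on, and whitespace. Everything else (Latin letters,
# ASCII digits, punctuation, the danda U+0964) is replaced by a space.
# Note: Bengali digits (U+09E6-U+09EF) sit inside the block and are kept.
NON_BENGALI = re.compile(r"[^\u0980-\u09FF\u200C\u200D\s]")
MULTI_SPACE = re.compile(r"\s+")


def clean_headline(text):
    """Strip non-Bengali characters and normalize whitespace."""
    text = NON_BENGALI.sub(" ", text)
    return MULTI_SPACE.sub(" ", text).strip()


# Made-up headline, for illustration only.
raw = "বাংলাদেশ  জিতেছে!! (ভিডিও) 2020"
cleaned = clean_headline(raw)
```

Stopwords are deliberately left in place here, in line with the note above that removing them is not required for short headlines.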

## Data Summary

The data summary reports the number of documents, words and unique words in each category. It also includes the length distribution of the headlines in the dataset.

| ![national](/images/national.PNG) | ![international](/images/international.PNG) | ![politics](/images/politics.PNG) | ![sports](/images/sports.PNG) |![amusement](/images/amusement.PNG) |![it](/images/it.PNG) |
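
The per-category statistics described in this section could be computed along the following lines. This is a hedged sketch: the `summarize` function, the whitespace tokenization, and the placeholder samples are assumptions for illustration, not the project's code or data.

```python
from collections import defaultdict


def summarize(samples):
    """Summarize (headline, category) pairs per category.

    Tokens are obtained by splitting on whitespace, which is an assumption;
    the project may tokenize differently.
    """
    docs = defaultdict(int)
    words = defaultdict(int)
    vocab = defaultdict(set)
    lengths = defaultdict(list)
    for headline, category in samples:
        tokens = headline.split()
        docs[category] += 1
        words[category] += len(tokens)
        vocab[category].update(tokens)
        lengths[category].append(len(tokens))
    return {
        c: {
            "documents": docs[c],
            "words": words[c],
            "unique_words": len(vocab[c]),
            "max_length": max(lengths[c]),
            "mean_length": words[c] / docs[c],
        }
        for c in docs
    }


# Placeholder samples, for illustration only.
samples = [
    ("a b c", "Sports"),
    ("a b", "Sports"),
    ("x y z w", "IT"),
]
summary = summarize(samples)
```

The `max_length` value is also a useful input when choosing a padding length for the headline sequences fed to the model.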
