This project classifies Craigslist posts into categories based on their headings. It uses machine learning models to predict the category of a given heading within a selected city and section.
- Run the Streamlit app by executing the following command: `streamlit run app.py`
- Select a city and section from the dropdown menus.
- Enter the heading of the Craigslist post.
- Click the "Predict Category" button to see the predicted category.
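The actual app.py in this repository may differ, but the interaction flow above could be sketched roughly as follows. The city/section lists, model file layout, and the assumption that each saved model is a full pipeline that accepts raw headings are all hypothetical placeholders:

```python
import streamlit as st
import joblib  # assumed serialization library; the repository may use pickle instead

st.title("Craigslist Post Category Classifier")

# Hypothetical options; the real app may load these from data files.
city = st.selectbox("City", ["newyork", "losangeles", "chicago"])
section = st.selectbox("Section", ["for-sale", "housing", "services", "community"])
heading = st.text_input("Heading of the Craigslist post")

if st.button("Predict Category"):
    # Load the model trained for the chosen section (path layout is an assumption).
    model = joblib.load(f"model/{section}.joblib")
    # Assumes the saved object is a pipeline that vectorizes raw text itself.
    prediction = model.predict([heading])[0]
    st.success(f"Predicted category: {prediction}")
```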
The project applies several preprocessing steps to the input text (a combined sketch follows the list):
- Tokenization: splitting the text into individual words or tokens so it can be analyzed in smaller units.
- Removing special characters and URLs: URLs and non-alphanumeric characters are usually noise in the text and are stripped out.
- Removing numeric characters: digits rarely contribute to the meaning of a heading and are removed.
- Removing emoticons and emojis: these seldom carry useful signal for classification and are stripped from the text.
- Stemming and lemmatization: reducing words to their base or root form so that different forms of the same word are treated as the same token.
- Removing stopwords: filtering out common words such as "and", "the", and "is", which contribute little to the meaning of the text.
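A minimal sketch of these steps combined into one cleaning function is shown below, using NLTK. The exact regexes, step ordering, and choice between stemming and lemmatization in the scripts under codes/ may differ:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads (safe to re-run).
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^\x00-\x7F]+", " ", text)       # strip emojis and non-ASCII emoticons
    text = re.sub(r"[^a-z\s]", " ", text)            # drop digits and special characters
    tokens = word_tokenize(text)                     # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]              # remove stopwords
    tokens = [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]  # normalize word forms
    return tokens

print(preprocess("Selling 2 IKEA chairs!! see https://example.com"))
# e.g. ['sell', 'ikea', 'chair', 'see']
```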
The project employs ensemble methods such as Gradient Boosting, Random Forest, and XGBoost for text classification. Models are trained separately on the different Craigslist sections, including for-sale, housing, services, and community, and each section has its own trained models to predict the most relevant category for a given heading.
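The features and hyperparameters used for training are not detailed here, so the following is only an illustrative sketch, assuming a TF-IDF representation of the headings and one set of models per section. The CSV path, column names, and output file names are hypothetical:

```python
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Hypothetical input: one CSV per section with 'heading' and 'category' columns.
df = pd.read_csv("data/for-sale.csv")
le = LabelEncoder()
y = le.fit_transform(df["category"])  # XGBoost expects integer-encoded labels
X_train, X_test, y_train, y_test = train_test_split(
    df["heading"], y, test_size=0.2, random_state=42
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "xgboost": XGBClassifier(),
}

for name, clf in candidates.items():
    # TF-IDF vectorization and the classifier packaged as one pipeline.
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipeline.fit(X_train, y_train)
    print(name, "accuracy:", pipeline.score(X_test, y_test))
    joblib.dump(pipeline, f"model/for-sale_{name}.joblib")
```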
- app.py: the Streamlit application for user interaction. It contains the main code for the web app, letting users enter their data and view the predicted category.
- codes/: Python scripts for text preprocessing and section preprocessing, responsible for preparing the text data for model input.
- model/: trained machine learning models for text classification, used to predict the category of Craigslist posts.
- utils/: helper functions and utilities used in the preprocessing steps, such as the stemmer and lemmatizer.
- Streamlit: For building interactive web applications.
- NLTK: For natural language processing tasks such as tokenization and stemming.
- scikit-learn: For machine learning algorithms and preprocessing tasks.
- XGBoost: For gradient boosting algorithms.