This project focuses on predicting Walmart's weekly store sales using advanced time-series modeling techniques. The project leverages LSTM-based models (including a Bidirectional LSTM) as well as Random Forest for comparison and ensemble modeling. The primary goal is to capture complex patterns such as trends, seasonality, and other variations within the data to provide accurate and reliable sales forecasts, aiding business strategy and decision-making.
The Walmart dataset consists of weekly sales data for multiple stores, along with various factors that may influence sales. This includes promotional events, holidays, and economic indicators, offering a challenging real-world dataset for time-series forecasting tasks. By accurately capturing this data's complexities, this project aims to improve future sales predictions and support strategic business insights.
The following is the structure of the project files and directories:
project-root/ ├── data/ │ ├── optimized_dataset.csv # Optimized and preprocessed dataset │ ├── scaler.pkl # Scaler used for data normalization │ ├── random_forest_model.pkl # Trained Random Forest model │ └── bilstm_model.h5 # Trained BiLSTM model ├── notebooks/ │ ├── Notebook1-DataCleaning.ipynb # Jupyter Notebook for data cleaning │ └── Notebook2-ModelDevelopment.ipynb # Jupyter Notebook for model development └── README.md # Project README file
- File:
Notebook1-DataCleaning.ipynb
- Key Steps:
- Loaded and preprocessed the dataset.
- Addressed missing values, outliers, and standardized sales data.
- Extracted date-based features like month, week, and day of the week for improved temporal predictions.
- Tools Used:
pandas
,numpy
- Extracted temporal features to enrich data:
- Month, Week of Year, and Day of Week features were created using the
Date
column.
- Month, Week of Year, and Day of Week features were created using the
- Resulting dataset was divided into train and test sets for modeling.
- File:
Notebook2-ModelDevelopment.ipynb
- Models Used:
- LSTM with various configurations.
- Bidirectional LSTM.
- Random Forest Regressor for baseline and ensemble comparison.
- Key Results:
- Best performance achieved using the Bidirectional LSTM with a Mean Absolute Error (MAE) of 0.4099 and R-squared value of 0.6379.
- Bidirectional LSTM:
- Improved prediction accuracy by leveraging bidirectional context of the time series data.
- Performance Metrics:
- Mean Absolute Error (MAE):
0.4099
- Mean Squared Error (MSE):
0.3300
- R-squared:
0.6379
- Mean Absolute Error (MAE):
- Random Forest:
- Used for baseline comparisons and ensemble predictions.
- Combined with LSTM predictions to improve accuracy using a simple average ensemble.
- Actual vs Predicted plots for different model segments.
- Illustrations highlighting ensemble performance compared to individual models.
- Clone this repository:
git clone https://github.com/yourusername/sales-prediction-project.git cd sales-prediction-project
2.Install the dependencies: pip install -r requirements.txt
3.Run the Jupyter notebooks: jupyter notebook Notebook1-DataCleaning.ipynb for data cleaning and preprocessing. Notebook2-ModelDevelopment.ipynb for model training, evaluation, and prediction.
optimized_dataset.csv
: The preprocessed dataset used for training and testing.bilstm_model.h5
: Saved model weights for the Bidirectional LSTM.scaler.pkl
: Scaler object for data standardization.random_forest_model.pkl
: Saved Random Forest model for baseline comparison.
- The Bidirectional LSTM model outperformed other configurations, achieving the best balance between prediction accuracy and computational complexity.
- The ensemble model combining LSTM and Random Forest predictions further demonstrated robust performance improvements.
This project is licensed under the MIT License.
For questions or collaboration, please contact [[email protected]].