Santander - Data Science Kaggle Competition

📚 Final Project Report:

Data Science Project Report: Santander Customer Transaction Prediction. The detailed report on our Santander Kaggle competition entry is available as Project_Report.pdf in this repository.

Project Description

This repository contains our complete solution to the Kaggle competition on Santander Customer Transaction Prediction. The challenge is to predict, from anonymized features, which customers will make a specific transaction in the future, regardless of the transaction amount.

We focused on building robust machine learning models using LightGBM (LGBM) and Convolutional Neural Networks (CNN), employing feature engineering and data augmentation techniques to optimize performance.

Kaggle competition overview: Santander Kaggle Competition.

Installation

To get started with this project, you’ll need Python 3.11.0 or higher. We recommend using a virtual environment to manage dependencies. Follow these steps to set it up:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

The requirements.txt includes all necessary dependencies to run the Jupyter notebooks included in the project. We also recommend using JupyterLab for running and exploring the notebooks:

pip install jupyterlab
jupyter lab

This will open JupyterLab in your browser, where you can navigate to the project’s notebooks and run them to reproduce the results.

Dataset Information

The dataset for this project was sourced from the Kaggle Santander Customer Transaction Prediction competition. It can be found in the data directory, where it is divided into:

train.csv: The training set with labels
test.csv: The test set without labels
sample_submission.csv: A sample submission file provided by Kaggle
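
As a quick sanity check after downloading the data, the label distribution of the training set can be inspected. The following is a minimal, dependency-free sketch. The function name `class_balance`, the label column name `target`, and the path `data/train.csv` are assumptions for illustration, so check them against the actual files. In the notebooks, pandas would be the more usual tool for this.

```python
import csv
from collections import Counter


def class_balance(csv_path, label_column="target"):
    """Count how many rows carry each label value in a CSV file.

    `label_column` is an assumed column name; adjust it if the file differs.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Labels are read as strings, e.g. "0" and "1".
        return Counter(row[label_column] for row in reader)


if __name__ == "__main__":
    # Assumed location of the training set inside the repository.
    print(class_balance("data/train.csv"))
```

A strongly skewed count here is what motivates the imbalance handling mentioned under Key Learnings.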

For more details or to download the data directly, visit the competition page: Santander Kaggle Data.

Repository Contents

This repository is organized as follows:

1. Data

Contains raw and processed data files for training and testing the models.

2. Engineering

This folder contains notebooks for Feature Engineering (version 1 and version 2), where we implemented techniques such as unique counts and reverse feature engineering.
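
To illustrate the idea behind unique counts, here is a minimal sketch that reads the term as "for each feature, how often does this row's value occur in that column". This reading, the function name `add_value_counts`, and the `_count` suffix are assumptions. The notebooks in this folder may compute the feature differently, for example over the combined train and test data.

```python
from collections import Counter


def add_value_counts(columns):
    """Build a frequency feature for every input column.

    `columns` maps a feature name to its list of values. The result maps
    "<name>_count" to a list giving, for each row, how many times that
    row's value appears in the column.
    """
    count_features = {}
    for name, values in columns.items():
        frequency = Counter(values)
        count_features[f"{name}_count"] = [frequency[v] for v in values]
    return count_features


if __name__ == "__main__":
    features = {"var_0": [1.0, 2.0, 1.0, 3.0]}
    print(add_value_counts(features))
```

Values that appear only once get a count of 1, which lets a tree model treat rare and repeated values differently.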

3. Exploration

The exploratory data analysis (EDA) notebook can be found here. This analysis was crucial for shaping our feature engineering decisions and model development.

4. Models

Includes subfolders for different model types, focusing primarily on LGBM and CNN models. Additional archives contain early-stage models, and the blending folder holds the notebook used for our final score blending of the best CNN and LGBM models.
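
Linear blending of two models amounts to a weighted average of their predicted scores. The sketch below shows the idea under stated assumptions: the function name `blend` and the default weight of 0.5 are illustrative, and the weights used in the blending notebook are not stated in this README.

```python
def blend(lgbm_preds, cnn_preds, lgbm_weight=0.5):
    """Linearly blend two lists of predictions.

    The CNN weight is 1 - lgbm_weight, so the two weights sum to 1.
    """
    if len(lgbm_preds) != len(cnn_preds):
        raise ValueError("prediction lists must have the same length")
    if not 0.0 <= lgbm_weight <= 1.0:
        raise ValueError("lgbm_weight must lie in [0, 1]")
    cnn_weight = 1.0 - lgbm_weight
    return [lgbm_weight * a + cnn_weight * b
            for a, b in zip(lgbm_preds, cnn_preds)]


if __name__ == "__main__":
    print(blend([0.0, 1.0], [1.0, 0.0], lgbm_weight=0.25))
```

Because AUC depends only on how predictions are ranked, it can help to check that both models output scores on a comparable scale before averaging. The weight itself is typically chosen on held-out validation predictions rather than on the test set.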

5. Submission

Contains various model-generated predictions for Kaggle submission, including the final blend of our best LGBM and CNN models in submission_blending.csv.
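
A submission file of this kind is a two-column CSV pairing each test row identifier with a predicted score. The following sketch shows one way to write it. The header names `ID_code` and `target` are assumed to match sample_submission.csv, and the function name `write_submission` is hypothetical, so verify the header against the sample file before uploading.

```python
import csv


def write_submission(path, id_codes, predictions):
    """Write identifiers and predicted scores as a two-column CSV file."""
    if len(id_codes) != len(predictions):
        raise ValueError("id_codes and predictions must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Assumed header; compare with sample_submission.csv.
        writer.writerow(["ID_code", "target"])
        writer.writerows(zip(id_codes, predictions))


if __name__ == "__main__":
    write_submission("example_submission.csv", ["test_0", "test_1"], [0.1, 0.9])
```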

Key Learnings

Feature Engineering: Feature engineering had a significant impact on model performance. Techniques like unique counts and reverse features helped us push our models beyond the baseline.
Model Blending: Combining CNN and LGBM models via linear blending proved to be a successful strategy, improving our final AUC score.
Data Imbalance Handling: Oversampling the minority class (customers with transactions) was essential to keep the models from being biased toward the majority class and to improve generalization.
Performance Optimization: Extensive experimentation with hyperparameters and the application of data augmentation techniques helped us fine-tune our models and achieve high AUC scores.
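
The oversampling mentioned above can be sketched as random duplication of minority-class rows until both classes are equally frequent. This is a minimal illustration, not necessarily the procedure used in the notebooks. The function name `oversample_minority`, the 1:1 target ratio, and the choice of label 1 as the minority class are assumptions.

```python
import random


def oversample_minority(rows, labels, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have equal size.

    Returns new, shuffled lists of rows and labels; the inputs are not modified.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority rows to sample from")
    # Draw extra minority indices with replacement to close the gap.
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    indices = minority + majority + extra
    rng.shuffle(indices)
    return [rows[i] for i in indices], [labels[i] for i in indices]


if __name__ == "__main__":
    X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
    y = [0, 0, 0, 0, 1]
    X_bal, y_bal = oversample_minority(X, y)
    print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Oversampling is usually applied to the training folds only, so that validation AUC is still measured on the original class distribution.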

Feel free to explore the code and try the models yourself. Should you have any questions or feedback, we would be happy to connect!

fabian-gubler/santander
