This repo contains two minimum viable products that import a 6-million-record .csv file into PostgreSQL. The first method uses Stateless Sessions to loop through the data file, parsing each row into strings and inserting it, while the second method uses Spring Batch processing.
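A minimal sketch of the stateless-sessions approach is below, assuming Hibernate's StatelessSession API; the entity name `FinancialRecord`, its constructor, and the comma split are illustrative assumptions, not the repo's actual classes.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class StatelessCsvLoader {

    private final SessionFactory sessionFactory;

    public StatelessCsvLoader(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Streams the .csv line by line; a StatelessSession skips Hibernate's
    // first-level cache and dirty checking, keeping memory flat across
    // millions of inserts.
    public void load(String csvPath) throws Exception {
        try (StatelessSession session = sessionFactory.openStatelessSession();
             BufferedReader reader = Files.newBufferedReader(Paths.get(csvPath))) {
            Transaction tx = session.beginTransaction();
            reader.readLine(); // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                // FinancialRecord is a hypothetical mapped entity.
                session.insert(new FinancialRecord(fields[0], fields[1], fields[2]));
            }
            tx.commit();
        }
    }
}
```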
Average runtime for the batch processor with a ThreadPoolTaskExecutor is 2 minutes 33 seconds. Average runtime for the stateless sessions parser/processor is 40 minutes.
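For reference, here is roughly how a ThreadPoolTaskExecutor attaches to a Spring Batch step. This is a sketch against Spring Batch 5's StepBuilder (earlier versions use StepBuilderFactory); the pool sizes, chunk size, and the `FinancialRecord` reader/writer beans are assumptions rather than the repo's exact configuration.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class BatchConfig {

    // Pool sizes here are illustrative; tune them to your hardware.
    @Bean
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);
        executor.setMaxPoolSize(16);
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("csv-loader-");
        return executor;
    }

    // Attaching the executor lets multiple threads process chunks of the
    // same file concurrently.
    @Bean
    public Step loadStep(JobRepository jobRepository,
                         PlatformTransactionManager transactionManager,
                         FlatFileItemReader<FinancialRecord> reader,
                         JdbcBatchItemWriter<FinancialRecord> writer) {
        return new StepBuilder("loadStep", jobRepository)
                .<FinancialRecord, FinancialRecord>chunk(1000, transactionManager)
                .reader(reader)
                .writer(writer)
                .taskExecutor(taskExecutor())
                .build();
    }
}
```

Note that FlatFileItemReader is not thread-safe, so a multi-threaded step over a single file generally needs the reader wrapped in a synchronizing delegate or its state saving disabled.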
Both methods will be improved in the future by incorporating a MultiResourcePartitioner into the Spring Batch configuration and splitting the large dataset into smaller files, so that multiple threads can operate on different files at the same time.
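The planned partitioning could look roughly like the sketch below, building on the hypothetical BatchConfig above (only the new imports are shown); the split-file location and grid size are assumptions.

```java
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

// One partition per split file; each worker thread runs the existing
// reader/writer step against its assigned resource.
@Bean
public Step partitionedStep(JobRepository jobRepository, Step loadStep) throws Exception {
    Resource[] splits = new PathMatchingResourcePatternResolver()
            .getResources("file:resource/data/splits/*.csv"); // hypothetical split location
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(splits);
    return new StepBuilder("partitionedStep", jobRepository)
            .partitioner("loadStep", partitioner)
            .step(loadStep)
            .gridSize(splits.length)
            .taskExecutor(taskExecutor())
            .build();
}
```

MultiResourcePartitioner exposes each file under the `fileName` key of the step execution context, so the reader would need to be `@StepScope` and bind its resource from `#{stepExecutionContext['fileName']}`.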
To run this project:
1. Clone this repository to your local machine.
2. Download the financial data from Kaggle. Add it to "resource/data" and be sure to include the .csv file in your .gitignore!
3. Within main/java/com there are two distinct packages, "batch" and "session", which contain the batch processor and the stateless-sessions processor, respectively.
4. Each package has its own main class that can be run independently.
5. Once the application launches without issues, head over to Postman and hit the "/load" route on your configured port (see the controller sketch after this list).
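The actual handler lives in each package's code, but a `/load` route would look something like this sketch; the HTTP method, class name, and CSV path are assumptions for illustration.

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class LoadController {

    private final StatelessCsvLoader loader; // hypothetical loader from the earlier sketch

    public LoadController(StatelessCsvLoader loader) {
        this.loader = loader;
    }

    // Hitting GET /load on the configured port kicks off the import.
    @GetMapping("/load")
    public String load() throws Exception {
        loader.load("resource/data/financial.csv"); // hypothetical file path
        return "Import complete";
    }
}
```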