This repo contains two minimum viable products that import a 6-million-record .csv file into PostgreSQL. The first method uses Stateless Sessions to loop through the data file, parsing each row into strings and inserting it, while the second method uses Spring Batch processing.
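A minimal sketch of the stateless-sessions approach is below, assuming Hibernate's StatelessSession API; the entity name `FinancialRecord`, its constructor, and the comma split are illustrative assumptions, not the repo's actual classes.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class StatelessCsvLoader {

    private final SessionFactory sessionFactory;

    public StatelessCsvLoader(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Streams the .csv line by line; a StatelessSession skips Hibernate's
    // first-level cache and dirty checking, keeping memory flat across
    // millions of inserts.
    public void load(String csvPath) throws Exception {
        try (StatelessSession session = sessionFactory.openStatelessSession();
             BufferedReader reader = Files.newBufferedReader(Paths.get(csvPath))) {
            Transaction tx = session.beginTransaction();
            reader.readLine(); // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                // FinancialRecord is a hypothetical mapped entity.
                session.insert(new FinancialRecord(fields[0], fields[1], fields[2]));
            }
            tx.commit();
        }
    }
}
```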
Average runtime for the batch processor with a ThreadPoolTaskExecutor is 2 minutes 33 seconds. Average runtime for the stateless sessions parser/processor is 40 minutes.
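For reference, here is roughly how a ThreadPoolTaskExecutor attaches to a Spring Batch step. This is a sketch against Spring Batch 5's StepBuilder (earlier versions use StepBuilderFactory); the pool sizes, chunk size, and the `FinancialRecord` reader/writer beans are assumptions rather than the repo's exact configuration.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class BatchConfig {

    // Pool sizes here are illustrative; tune them to your hardware.
    @Bean
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);
        executor.setMaxPoolSize(16);
        executor.setQueueCapacity(100);
        executor.setThreadNamePrefix("csv-loader-");
        return executor;
    }

    // Attaching the executor lets multiple threads process chunks of the
    // same file concurrently.
    @Bean
    public Step loadStep(JobRepository jobRepository,
                         PlatformTransactionManager transactionManager,
                         FlatFileItemReader<FinancialRecord> reader,
                         JdbcBatchItemWriter<FinancialRecord> writer) {
        return new StepBuilder("loadStep", jobRepository)
                .<FinancialRecord, FinancialRecord>chunk(1000, transactionManager)
                .reader(reader)
                .writer(writer)
                .taskExecutor(taskExecutor())
                .build();
    }
}
```

Note that FlatFileItemReader is not thread-safe, so a multi-threaded step over a single file generally needs the reader wrapped in a synchronizing delegate or its state saving disabled.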
Both methods will be improved in the future by incorporating a MultiResourcePartitioner into the Spring Batch configuration and splitting the large dataset into smaller files, so that multiple threads can operate on different files at the same time.
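The planned partitioning could look roughly like the sketch below, building on the hypothetical BatchConfig above (only the new imports are shown); the split-file location and grid size are assumptions.

```java
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

// One partition per split file; each worker thread runs the existing
// reader/writer step against its assigned resource.
@Bean
public Step partitionedStep(JobRepository jobRepository, Step loadStep) throws Exception {
    Resource[] splits = new PathMatchingResourcePatternResolver()
            .getResources("file:resource/data/splits/*.csv"); // hypothetical split location
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(splits);
    return new StepBuilder("partitionedStep", jobRepository)
            .partitioner("loadStep", partitioner)
            .step(loadStep)
            .gridSize(splits.length)
            .taskExecutor(taskExecutor())
            .build();
}
```

MultiResourcePartitioner exposes each file under the `fileName` key of the step execution context, so the reader would need to be `@StepScope` and bind its resource from `#{stepExecutionContext['fileName']}`.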
To run this project:
1. Clone this repository to your local machine.
2. Download the financial data from Kaggle. Add it to "resource/data" and be sure to include the .csv file in your .gitignore!
3. Within main/java/com there are two distinct packages, "batch" and "session", which contain the batch processor and the stateless-sessions processor, respectively.
4. Each package has its own main class that can be run independently.
5. Once the application launches without issues, head over to Postman and hit the "/load" route on your configured port (see the controller sketch after this list).
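The actual handler lives in each package's code, but a `/load` route would look something like this sketch; the HTTP method, class name, and CSV path are assumptions for illustration.

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class LoadController {

    private final StatelessCsvLoader loader; // hypothetical loader from the earlier sketch

    public LoadController(StatelessCsvLoader loader) {
        this.loader = loader;
    }

    // Hitting GET /load on the configured port kicks off the import.
    @GetMapping("/load")
    public String load() throws Exception {
        loader.load("resource/data/financial.csv"); // hypothetical file path
        return "Import complete";
    }
}
```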