The Collaborne application logs all actions taken by its users. We use this data to optimize the user experience. For that, our Data Science loads the logs into AWS Redshift, a hosted data warehouse system.
During the hackathon, you will explore how to extract data from these logs and prepare them for a load to Redshift.
You can either use Java or Node for the hackathon, depending on your personal preference. The hackathon should take ca. 1.5h to complete (excl. Q&A session).
Please execute the following steps to prepare your development environment (example commands are for macOS/Linux).
- Unpack the ZIP file
$ unzip hackathon-log-exporter-developer.zip
-
Build and run the application
$ cd hackathon-log-exporter-developer/java/ $ mvn clean install $ java -cp target/classes/ com.collaborne.hackathon.exporter.log.App
-
Install NodeJS and npm
-
Run the application
$ cd hackathon-log-exporter-developer/node/ $ npm start
You see in your console a couple of sample log entries? Congratulations! You are ready to go. Good luck!
One of Collaborne's log files captures all changes sent to the user database. Each log entry contains a time stamp and then further information as JSON.
The following log entry represents when Joe's account was created:
2017-03-13 02:45:58,660 {"date":"2017-03-13T10:45:58.660+0000","type":"User","op":"CREATE","params":{"obj":{"id":"bc68f233-09bd-4e51-3257-76bb1ec8c854","emails":["[email protected]"],"title":"Junior Coder","collaborne":{"createdBy":"admin"}}}}
This log entry shows when Joe changed his email address:
2017-03-13 09:45:11,962 {"date":"2017-03-13T09:45:11.962+0000","type":"User","op":"UPDATE","params":{"id":"bc68f233-09bd-4e51-3257-76bb1ec8c854","up":[{"field":"emails","val":["[email protected]"]}}
This section describes all tasks of the hackathon. All tasks are independent. In case you run out of time, please spend at least some time on all tasks for a fruitful discussion.
We're looking for code quality, correctness, technical choices, and creativity.
You can use the provided skeleton code as a starting point, which reads the sample log file and outputs its contents to the command line.
The Collaborne log files store the core of its data in a nested JSON format. In contrast, Redshift requires the data in a flat format. Importantly, whereas each row in the log file might describe changes to multiple fields, the CSV file must contain a separated row for each field. Similarly, the nested structure of the CREATE log entries needs to mapped to flat field names, e.g. collaborne.roles.
Please implement the code to convert a Collaborne log file into a CSV file with a format similar to this:
2017-03-13 02:45:58,660,User,61cc1b08-17ce-4215-9d4b-544fc9135fb1,CREATE,emails,["[email protected]"]
2017-03-13 02:45:58,660,User,61cc1b08-17ce-4215-9d4b-544fc9135fb1,CREATE,title,Junior Coder
2017-03-13 02:45:58,660,User,61cc1b08-17ce-4215-9d4b-544fc9135fb1,CREATE,collaborne.roles,admin
2018-03-13 02:45:58,660,User,61cc1b08-17ce-4215-9d4b-544fc9135fb1,UPDATE,title,Senior Coder
Creating the CSV file is the first step of the journey. Here are the other required steps until our user data is loaded into Redshift:
-
Upload the CSV file to AWS S3, a hosted storage service
-
Call the Redshift's REST endpoint to trigger the load (whereby one of the request parameters is the S3 location of the CSV file)
For the following, assume that one of your colleagues wrote already a service that triggers the Redshift load itself (step 2).
Please put on your architect hat and prepare to discuss these questions during the Q&A session:
-
Right now, neither your code nor your colleague's service does the S3 upload (step 1). Where would you add this functionality?
-
How would you connect your code with the other service? Assume that you can tell your colleague to implement any mechanism of communication that you wish.
-
Detail the content of the messages that you would exchange with this other service.
-
What are the advantages and disadvantages of your chosen communication approach vs. possible alternatives?
Hint: Please consider especially reliability, security, and ease-of-development.
After you successfully loaded the user persistency logs, new requirements have been popping up:
-
Load the persistency logs for groups into Redshift, for example a user might be a member in a R&D group
-
Load realtime click data into Redshift, for example a user clicked on Edit profile
-
Send reminder emails when the user didn't update its contact details within the last 12 months ("Is your phone number still correct?")
Please prepare to discuss during the Q&A how to handle those cases. In particular, how does the functionality developed in task 1+2 relate to these new requirements?