One billion matches is challenge to implement generic delivery algorithm, that may be suitable for variety of tasks.
To give you some context, you may think this problem as:
- delivering marketing email
- sending group messages in messaging platform
- delivering ads in the Internet
Read the problem definition below and implement your solution in solutions folder.
There are 100 000 items that are classified to have N labels, here N is between 1 and 50.
There are 100 000 subscribers that are interested in items of specific labels, here again one subscriber may have 1 to 50 labels.
There are 50 000 labels.
We are seeking for solution to efficiently calculate which items should each subscriber recieve.
Items have timestamps, subscribers want items ordered by timestamp.
Item:
- id
- timestamp
- label_ids
Label:
- id
Subscriber:
- id
- label_ids
[Item] --(has many)--> [Label] <--(has many)-- [Subscriber]
To generate test data run:
make generate
This generates random test data in data/data
Build and run example solution:
make build solution='example'
make run solution='example'
This starts Docker container that runs until solution is found.
Give name to your solution, it should be filepath compatible, below we use your-solution
You can make copy of example solution:
cp -r ./solutions/example ./solutions/your-solution
Then you can implement your solution.
We expect to have various different solutions, so implement yours in Dockerfile.
There should be Dockerfile at solutions/your-solution/Dockerfile
Running the dockerfile should produce output file solutions/your-solution/output/output.csv
Document your solution and results in solutions/your-solution/readme.md
Input data contains three files as UTF-8 CSV:
- items.csv
- labels.csv
- subscribers.csv
Notes:
- First line is header containing column names
- Label ids in items and subscribers are joined in single column and can be separated by pipe character
|
- Output file containing delivery solution should be UTF-8 CSV
id,timestamp,label_ids
1,2,2|4|6
2,1,5
3,3,6
id
2
4
5
6
id,label_ids
1,5|6
2,2
3,2|6
Notes:
- item_ids should be ordered by timestamp
- subscribers should be ordered by id
- same item cannot be sent twice to one subcriber even when matches multiple labels
subscriber_id,item_ids
1,2|1|3
2,1
3,1|3