This project demonstrates a data streaming pipeline built on the modern data stack.
- Streaming message data:
bot -> kafka -> druid -> superset
- Detect hate speech in the Discord Server
- Language: Python 3.9
- Data Source: Discord Bot
- Storage Layer: PostgreSQL
- Streaming Layer: Apache Kafka
- Analytics Layer: Apache Druid
- Visualization Layer: Apache Superset
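As a rough illustration of the first hop (bot -> kafka), the sketch below shows how a Discord message might be turned into a JSON record before it is published to Kafka. The function name `build_record` and the field names (`timestamp`, `author`, `channel`, `content`) are assumptions made for illustration, not the project's actual schema. A timestamp is included because Druid ingestion is organized around a primary timestamp column.

```python
import json
from datetime import datetime, timezone


def build_record(author: str, channel: str, content: str, sent_at: datetime) -> bytes:
    """Serialize one chat message as UTF-8 JSON, usable as a Kafka message value.

    All field names are hypothetical; adapt them to the real schema.
    """
    record = {
        # ISO 8601 timestamp normalized to UTC
        "timestamp": sent_at.astimezone(timezone.utc).isoformat(),
        "author": author,
        "channel": channel,
        "content": content,
    }
    # ensure_ascii=False keeps non-ASCII chat text readable in the payload
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


if __name__ == "__main__":
    payload = build_record("alice", "general", "hello", datetime.now(timezone.utc))
    print(payload.decode("utf-8"))
```

The returned bytes could be passed as the message value to whichever Kafka client the bot uses; the client itself is left out here to keep the sketch dependency-free.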
Given the scope and purpose of this project, everything runs in Docker containers via simple make commands. It is therefore essential to have make, docker, and docker-compose properly installed.
- make
- docker & docker-compose
- Platform: Mac or Linux (for automatic setup to work)
- Minimum 12GB memory allocated for Docker
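A small pre-flight check along these lines can confirm that the required tools are on your PATH before running any make command. This is a minimal sketch: the helper name `missing_tools` is hypothetical, and it does not verify the platform or the memory allocated to Docker.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tools named in the prerequisites above
REQUIRED_TOOLS = ("make", "docker", "docker-compose")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("make, docker and docker-compose were all found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.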
- Step 1.
make setup
- Step 2.1.
Fill in your Discord Bot Token in the .env file.
- Step 2.2.
- Add TALISMAN_ENABLED=False to the superset service as an extra environment variable in Superset's docker-compose.yml
- Add the below networks block in Superset's docker-compose.yml
    networks:
      discord:
        name: discord_ai_bot_network
        external: true
- Add the below for all services in Superset's docker-compose.yml
    networks:
      - discord
- Comment out or remove the below in the superset-tests-worker service in Superset's docker-compose.yml
    network_mode: host
- Add
- Step 3.
make kafka
Note 1: This step creates a messages topic by default.
Note 2: Make sure all containers are up and running before moving to the next step.
- Step 4.
make bot
Note 1: This step instantiates the bot and connects it to Kafka.
- Step 5.
Install the bot on your Discord server and type some text in a Discord channel to test it. Monitor the Discord bot's log for errors.
- Step 7.
make druid
Note 1: Kafka host: broker:29092
Note 2: Kafka topic: messages
- Step 8.
make superset
Note 1: Druid connection URL: druid://<druid_router_ip>:8888/druid/v2/sql
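Before wiring up Superset, it can help to confirm that the Druid router answers SQL queries on the same host, port, and path used in the connection URL above. The sketch below posts a query to Druid's SQL HTTP endpoint using only the standard library. The helper names `build_sql_request` and `run_sql` are hypothetical, and the datasource name `messages` in the example comment is an assumption (it mirrors the Kafka topic name, but the actual datasource name depends on how the ingestion was configured).

```python
import json
import urllib.request


def build_sql_request(router_host: str, sql: str, port: int = 8888) -> urllib.request.Request:
    """Build a POST request for Druid's SQL endpoint (/druid/v2/sql) on the router."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{router_host}:{port}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(router_host: str, sql: str, timeout: float = 10.0):
    """Send the query to the router and return the decoded JSON response."""
    request = build_sql_request(router_host, sql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the Druid containers to be running; the host and the
# datasource name "messages" are assumptions):
# rows = run_sql("localhost", 'SELECT COUNT(*) AS message_count FROM "messages"')
```

If the query returns rows, the same router address should work in place of `<druid_router_ip>` in the Superset connection URL, provided Superset can reach it over the shared Docker network.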
TODO