==========================
Welcome to our Data Engineering Roadmap!
This roadmap is designed to help you navigate the world of data engineering, from the fundamentals to advanced topics. Whether you're a beginner or an experienced professional, this roadmap will guide you through the key concepts, tools, and technologies you need to master.
- Introduction to Data Engineering
- Data Engineering Lifecycle:
- Data Collection
- Data Ingestion
- Data Storage and Management
- Data Transformation
- Data Serving
- Overview of Data Pipelines (ETL/ELT)
- Overview of Data Modeling
- Overview of Cloud Data Engineering
- Soft Skills for Data Engineers
- Basic Linux Commands:
- Introduction to the Command Line
- Creating and Navigating Directories
- Listing Files in Directories
- Creating and Viewing Files
- Copying and Moving Files
- Renaming Files
- Absolute and Relative Paths
- Viewing and Managing Processes
- GitHub:
- Creating a Repo
- Cloning a Repo
- Git Add
- Git Commit
- Git Push
- Git Branch
- Pull Request
- Resolving Git Conflicts
- Creating a README and Documenting Projects
- Introduction to Databases and Data Warehousing
- Installing the Postgres Server Locally
- Basic Queries
- DDL - Data Definition Language
- DML - Data Manipulation Language
- DCL - Data Control Language
- Joins
- SQL Data Cleaning
- Window Functions
- Introduction to Advanced SQL (Subqueries and CTEs)
- Creating Tables/Views (Working with Tables)
- Stored Procedures
- Entity-Relationship Diagrams (ERDs)
- Python Basics
- Control Flow
- Operators:
- Arithmetic Operators
- Assignment Operators
- Comparison Operators
- Logical Operators
- Identity Operators
- Membership Operators
- Logical Statements
- If and Else Statements
- Loops:
- For Loop
- While Loop
- Functions:
- Normal Functions
- Generic Functions
- Non-Default Arguments
- Default Arguments
- *args and **kwargs
- Modules & Packages:
- Built-in Modules
- Custom Modules
- Packages
- Errors and Exceptions
- Data Structures
- File Handling
- Data Manipulation with Pandas
- Database Interaction
- API and Web Scraping
- ETL Process and Data Pipeline
- Version Control for Projects
- Introduction to OOP
- Fundamental Concepts
- Basic Techniques for Dimension Tables
- Basic Techniques for Fact Tables
- Slowly Changing Dimensions
- dbt Fundamentals
- Understanding Jinja, Macros, and Testing in dbt
- dbt Packages
- Introduction to dbt Cloud
- Overview of Docker and Internals of Docker
- Dockerfile
- Docker Images
- Docker Containers
- Understanding Docker Volumes
- Docker Networking
- Introduction to YAML
- Docker Compose and Anchors in Docker Compose
- GitHub Actions
- Airbyte:
- Airbyte Concepts
- Source
- Destination
- Connection
- Connector
- Sync
- Airbyte Architecture:
- Architecture Overview
- WebApp
- API Server
- Metadata Database
- Temporal
- Worker
- Running Airbyte in Docker
- Understanding Source Configuration
- Understanding Destination Configuration
- Configuring a Full Synchronization between Source and Destination:
- S3 to Postgres Database
- Postgres Database to Redshift
- How Sync Works Under the Hood
- Introduction to Apache Airflow
- Airflow Concepts:
- Workflow
- DAG
- Task
- Operators
- Dependencies
- Installation and Setup:
- Prerequisites
- Installation
- Configuration
- Airflow Architecture:
- Architecture Overview
- WebServer
- Metadata Database
- Scheduler
- Worker
- How Airflow Works
- Creating Your First DAG
- Understanding DAG Configuration
- Understanding Task Configuration
- Understanding Airflow Variables
- Advanced DAG Concepts
- Monitoring and Debugging
- Airflow Configuration and Best Practices
- Projects
- Introduction to the Cloud
- IAM
- Data Lake
- Python Libraries to Interact with the Cloud
- Data Catalog
- Relational Database Services (RDS)
- Data Warehouse
- ETL Services
- Orchestration Services
- Compute
- Introduction to Spark
- Installation
- Spark SQL and DataFrame API
- RDDs
- Transformations and Actions
- Spark Streaming
- Structured Streaming
- Tuning and Optimization
- What Terraform Is
- How Terraform Works
- Terraform State File
- Remote State File
- Basic Provisioning
- Resource Referencing
- Data Source
- Local Usage
- Modules
- Module Overview
- Module Structure
- Creating a Simple Resource Module
- Apache Kafka Overview
- Kafka Architecture
- Kafka Topic, Partition, and Offset
- Producer and Consumer
- Consumer Group and Rebalancing Protocol
- Kafka Connect
- Introduction to Kubernetes