Day 16 - Advanced Spark Topics 🔥

Overview:

Today's focus is Spark's APIs, both the newer Structured APIs and the older low-level APIs. You will learn different ways of handling data in Spark, along with the optimization techniques data engineers use to process data efficiently and draw conclusions from it.

Goals:

  • Continue exploring advanced topics in Apache Spark to deepen your knowledge and skills in big data processing.
  • Build on the concepts and techniques covered in the previous days.

II. Structured APIs - DataFrames, SQL, and Datasets

Topics Covered:

  • Structured API Overview
  • Basic Structured Operations
  • Working with Different Types of Data
  • Aggregations
  • Joins
  • Data Sources
  • Spark SQL
  • Datasets

Structured API Overview

  • DataFrames and Datasets
  • Schemas
  • Overview of Structured Spark Types
  • Overview of Structured API Execution

Basic Structured Operations

  • Schemas
  • Columns and Expressions
  • Records and Rows
  • DataFrame Transformations

Working with Different Types of Data

  • Converting to Spark Types
  • Working with Booleans
  • Working with Numbers
  • Working with Strings
  • Working with Dates and Timestamps
  • Working with Nulls in Data
  • Ordering
  • Working with Complex Types

Aggregations

  • Aggregation Functions
  • Grouping
  • Window Functions
  • Grouping Sets
  • User-Defined Aggregation Functions
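
The difference between grouping and window functions is easy to lose in a topic list, so here is a minimal plain-Python sketch (not Spark code) of the idea. Grouping collapses each group to one result, while a window function keeps every row and attaches a value computed over that row's partition. The function names (`group_sum`, `running_total`) and the `sales` rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical sample data: one dict per row.
sales = [
    {"store": "A", "day": 1, "amount": 10},
    {"store": "A", "day": 2, "amount": 5},
    {"store": "B", "day": 1, "amount": 7},
    {"store": "A", "day": 3, "amount": 20},
    {"store": "B", "day": 2, "amount": 3},
]


def group_sum(rows, key, value):
    """Grouping: one output value per distinct key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def running_total(rows, partition_by, order_by, value):
    """Window-style: every input row survives and gains a running total."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)

    out = []
    for part_key in sorted(partitions):
        # Order rows inside the partition, then accumulate over that order.
        ordered = sorted(partitions[part_key], key=lambda r: r[order_by])
        totals = accumulate(r[value] for r in ordered)
        out.extend({**r, "running_total": t} for r, t in zip(ordered, totals))
    return out


totals = group_sum(sales, "store", "amount")
windowed = running_total(sales, "store", "day", "amount")
```

Here `totals` has one entry per store, while `windowed` has as many rows as `sales`, each carrying its store's cumulative amount up to that day.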

Joins

  • Join Expressions
  • Join Types
  • Inner Joins
  • Left Outer Joins
  • Natural Joins
  • Challenges When Using Joins
  • How Spark Performs Joins
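
As a way into "How Spark Performs Joins", the following plain-Python sketch (not Spark code) shows the idea behind a hash-based join: one side, ideally the smaller one, is indexed by the join key, and each row of the other side probes that index. It covers the inner and left outer join types listed above. The name `hash_join` and the sample rows are hypothetical, and Spark's real join strategies are distributed and considerably more involved.

```python
from collections import defaultdict


def hash_join(left, right, key, how="inner"):
    """Join two lists of dicts on `key` by hashing the right side."""
    # Build phase: index the right side by join key.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    # Right-side columns to fill with None when a left row has no match.
    right_cols = {col for row in right for col in row if col != key}

    joined = []
    # Probe phase: look up each left row's key in the index.
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        elif how == "left_outer":
            joined.append({**row, **{col: None for col in right_cols}})
    return joined


# Hypothetical sample data.
people = [
    {"name": "Ana", "program_id": 10},
    {"name": "Ben", "program_id": 20},
    {"name": "Cy", "program_id": 99},  # no matching program
]
programs = [
    {"program_id": 10, "degree": "MSc"},
    {"program_id": 20, "degree": "PhD"},
]

inner = hash_join(people, programs, "program_id")
left_outer = hash_join(people, programs, "program_id", how="left_outer")
```

The inner join drops the unmatched person, while the left outer join keeps that row and fills the missing `degree` with `None`, which is one of the null-handling surprises covered under "Challenges When Using Joins".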

Data Sources

  • The Structure of the Data Sources API
  • CSV Files
  • JSON Files
  • Parquet Files
  • Advanced I/O Concepts

Spark SQL

  • Big Data and SQL: Apache Hive
  • Big Data and SQL: Spark SQL
  • How to Run Spark SQL Queries
  • Catalog
  • Tables
  • Views
  • Databases
  • Select Statements
  • Advanced Topics
  • Miscellaneous Features

Datasets

  • When to Use Datasets
  • Creating Datasets

III. Low-Level APIs

Topics Covered:

  • Resilient Distributed Datasets (RDDs)
  • Advanced RDDs
  • Distributed Shared Variables

Resilient Distributed Datasets (RDDs)

  • What Are the Low-Level APIs?
  • About RDDs
  • Creating RDDs
  • Manipulating RDDs
  • Transformations
  • Actions
  • Saving Files
  • Caching
  • Checkpointing
  • Pipe RDDs to System Commands
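
The split between transformations and actions is central to RDDs: transformations are lazy and only describe work, while an action triggers the computation. The toy class below is a plain-Python sketch of that behaviour, not the Spark API. `LazyDataset` and its methods are hypothetical names that loosely mirror `map`, `filter`, `collect`, and `count`.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def _run(self):
        # Replay every recorded step over the source data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: these are what actually execute the recorded steps.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the map step actually touches


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
ran_before_action = len(calls)  # still 0: nothing has executed yet
result = ds.collect()           # the action runs the whole pipeline
```

Calling a second action such as `ds.count()` replays every recorded step from the source data, which is the kind of repeated work that caching is meant to avoid.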

Advanced RDDs

  • Key-Value Basics (Key-Value RDDs)
  • Aggregations
  • Joins
  • Controlling Partitions
  • Custom Serialization

Distributed Shared Variables

  • Broadcast Variables
  • Accumulators

Wrapping Up:

Great job on Day 16! You've explored Spark's Structured APIs (DataFrames, SQL, and Datasets) and its low-level APIs (RDDs and distributed shared variables). Continue practicing and fine-tuning your Spark applications for maximum efficiency and scalability. Tomorrow, we'll dive into Spark SQL.

Discuss the day's learnings with your mentor and explore potential project applications. Reflect on why these advanced Spark topics matter and how you can apply them in your big data projects.

Discussion and Q&A (1 hour): 🙋

  • Engage in a discussion with mentors and peers to share experiences related to running Spark on a cluster.
  • Ask questions and seek advice on tuning and optimizing Spark applications.

Action Items:

  • Identify areas for deeper exploration.
  • Get recommendations on resources for further learning.

Recommended Videos: 💡