Spark-for-Big-Data

This repository demonstrates how to use Spark to work with big data and build machine learning models at scale.

Goals

Practice processing and cleaning datasets to get comfortable with Spark’s SQL and DataFrame APIs (Spark SQL, PySpark).
Debug and optimize for data skew (unevenly sized partitions) when running on a cluster.
Use Spark’s Machine Learning Library (MLlib) to train machine learning models at scale.
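
The skew goal can be illustrated without a cluster. The sketch below is plain Python, not Spark code, and is not taken from this repository. It simulates a hash partitioner with `zlib.crc32` and shows how key salting, one common remedy for skew, spreads a single hot key across partitions. The names `partition_of`, `salt_keys`, `NUM_PARTITIONS`, and `NUM_SALTS` are hypothetical, and the toy data is made up for illustration. In Spark the analogous step would typically be adding a salt column before a wide operation such as a join or aggregation.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count
NUM_SALTS = 8       # hypothetical number of salt values per key


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministic stand-in for a hash partitioner (assumption, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys):
    """Count how many records land in each simulated partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def salt_keys(keys, num_salts: int = NUM_SALTS, seed: int = 0):
    """Append a random salt suffix so one hot key becomes several distinct keys."""
    rng = random.Random(seed)
    return [f"{k}_{rng.randrange(num_salts)}" for k in keys]


# Toy skewed data: one hot key accounts for 90% of the records.
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

before = partition_sizes(keys)
after = partition_sizes(salt_keys(keys))

print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

Salting trades a more even partition layout for an extra step: results computed per salted key have to be combined again per original key.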
