andreaghezzi/big-data-computing

 
 

Big Data Computing (2020-2021)

News | General Information | Syllabus | Environment Setup | Class Schedules | Previous Years

News

  • July 2021 Exam Session
    Registrations for the July 2021 exam session are now open on Infostud (id 765581) and will remain open until July 8, 2021. Project submission week opens on July 1, 2021 at 00:00 CEST (Central European Summer Time) and closes on July 7, 2021 at 23:59 CEST.
  • June 2021 Exam Session
    Registrations for the June 2021 exam session are now open on Infostud (id 765579) and will remain open until June 17, 2021. Project submission week opens on June 11, 2021 at 00:00 CEST (Central European Summer Time) and closes on June 17, 2021 at 23:59 CEST.
  • Project Guidelines: A document containing the main guidelines for the final project is available here.
  • April 2021 Exam Session (Extra): Project Presentation Schedule
    Presentation of the project that has been accepted for oral discussion will take place remotely via Google Meet on April 16, 2021 at 12:00PM CEST. Everyone is welcome to join!
  • IMPORTANT: Back to In-Person Classes
    Starting from April 12, 2021, classes will take place in blended mode again (with a 30% attendance limit).
  • April 2021 Exam Session (Extra)
    Registrations for the April 2021 exam session (extra) are now open on Infostud (id 762156) and will remain open until April 4, 2021. Project submission week opens on April 5, 2021 at 00:00 CEST (Central European Summer Time) and closes on April 11, 2021 at 23:59 CEST.
    NOTE: This extra session is reserved for part-time or working students, students with learning disabilities, students who have not completed their university exams within the set time period, and students who are about to graduate.
  • IMPORTANT: In-Person Classes Suspended
    Starting from Monday, March 15, 2021 all the educational activities for all Sapienza's degree programmes will be held remotely only. Therefore, our classes will continue on the same Zoom meeting room, as per the original schedule.
    For any further information, please refer to this link on the Sapienza website.
  • 2020-21 classes are starting!
    Classes are starting on February 23, 2021 at 5:00PM CET and will be held in blended mode. To attend classes either in person or remotely, please check out the instructions below.
  • February 2021 Exam Session: Final Grades
    Final grades are available at this link.
  • February 2021 Exam Session: Project Presentation Schedule
    Presentations of the projects that have been accepted for oral discussion will take place remotely via Google Meet on a one-day session on February 10, 2021 at 9:00AM CET. Everyone is welcome to join!
  • February 2021 Exam Session
    Registrations for the February 2021 exam session are now open on Infostud (id 752692) and will remain open until February 7, 2021. Project submission week opens on February 1, 2021 at 00:00 CET (Central European Time) and closes on February 7, 2021 at 23:59 CET.
    (Please see the announcement below for additional details on how to submit your project during this session, which is the first one of the academic year 2020-21.)
  • Students who are planning to submit their project after the January 2021 session should refer to the Big Data Computing 2020-21 Moodle page, rather than the current one (i.e., Big Data Computing 2019-20). This is to align exam sessions with the correct academic year, since academic year 2019-20 formally ends on January 31, 2021. As such, from February 2021 until January 2022 all the exam sessions will be displayed on the newly created Moodle page indicated above, where students will be able to submit their work during the corresponding Project Submission Week, which will be opened as each session approaches, as usual.
    (NOTE: Only students who expect to complete the exam in one of the upcoming 2020-21 sessions must subscribe to the Big Data Computing 2020-21 Moodle page!)

General Information

Welcome to the Big Data Computing class!

This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.

This repository contains class material along with any useful information for the 2020-2021 academic year.

Class Schedule

  • Tuesday from 5:00PM to 7:00PM
  • Wednesday from 4:00PM to 7:00PM

How to Attend Classes

According to the guidelines provided by Sapienza University to counter the COVID-19 pandemic, the course will be held both in person and remotely. For any further information, students must refer to the official documentation available on the Sapienza website.

Attending Classes in Person: Room G50 - Building G, Viale Regina Elena 295

Students who wish to attend classes in person must submit their request through the Infostud Lab App or the Prodigit Sapienza online booking system, according to the established rules (please see here). Once the booking is confirmed, and according to the class schedule above, students must go to Room G50, which is located on the 3rd floor of Building G at Viale Regina Elena 295.

Attending Classes Remotely: Zoom

Students who wish to attend classes remotely will need to register for the dedicated Zoom conference, using the following link: https://uniroma1.zoom.us/meeting/register/tZUtd-mupz8rGt3uK2Mz_cKmOGDyVQpNmMfm

Moodle Web Page

Students must subscribe to the Moodle web page, using the same credentials (username/password) they use to access the Wi-Fi network and Infostud services, at the following link: https://elearning.uniroma1.it/course/view.php?id=12771

Office Hours

  • Tuesday from 2:00PM to 4:00PM, Room 106, located on the 1st floor of Building E at Viale Regina Elena 295.
    (NOTE: Due to the COVID-19 emergency, office hours will be held exclusively online via Google Meet or Zoom, upon request by email to the following address: [email protected])

Contacts

Description and Goals

The amount, variety, and rate at which data are being generated nowadays, both by humans and machines, are unprecedented. This opens up a number of challenges on how to deal with those data, as traditional computing paradigms were not designed to operate at such a scale.

"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.

This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.

Prerequisites

The course assumes that students are familiar with the basics of data analysis and machine learning, supported by a strong knowledge of the foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably in the Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.

Exams

Students must prove their level of comprehension of the subject by developing a software project that leverages the methodologies and tools introduced during classes. Projects must address typical Big Data tasks (e.g., clustering, prediction, or recommendation) using very large datasets in any application domain of interest. The topic of the project must be agreed with the professor in advance; sources of interesting project ideas (e.g., Kaggle) will be suggested throughout the course. Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam where submitted projects will be discussed in English; other questions on any topic addressed during the course may also be asked, and those can be answered in either English or Italian, as the student prefers.
A document containing the main guidelines for the final project is available here.

Recommended Textbooks

No textbooks are mandatory to successfully follow this course. However, there is a large set of references worth mentioning, especially for those who want to dig deeper into specific topics. Among those, some readings I would like to suggest are as follows:

  • Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] available online.
  • Big Data Analysis with Python [Marin, Shukla, VK]
  • Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
  • Spark: The Definitive Guide [Chambers, Zaharia]
  • Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
  • Hadoop: The Definitive Guide [White]
  • Python for Data Analysis [McKinney]

Syllabus

[Tentative]

Introduction

  • The Big Data Phenomenon
  • The Big Data Infrastructure
    • Distributed File Systems (HDFS)
    • MapReduce (Hadoop)
    • Spark
  • PySpark + Databricks
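
As a rough illustration of the MapReduce programming model listed above, the following is a minimal single-machine sketch of word count in plain Python. It is not taken from the class material: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real Hadoop or Spark job would run these phases in parallel across a cluster rather than in one process.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group together all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases sequentially (illustrative names, not a real API)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data computing", "big data tools"])
print(counts)
```

The point of the split is that map and reduce only ever look at one document or one key at a time, which is what lets a framework distribute them across machines.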

Unsupervised Learning: Clustering

  • Similarity Measures
  • Algorithms: K-means
  • Example: Document Clustering
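
To give a flavour of the K-means algorithm named above, here is a small plain-Python sketch of the standard assign-then-update iteration (Lloyd's algorithm). It is only an illustration under simplifying assumptions: points are tuples of floats, the initial centroids are supplied by the caller, and the helper names are hypothetical. It is not the PySpark implementation used in class.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```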

Dimensionality Reduction

  • Feature Extraction
  • Algorithms: Principal Component Analysis (PCA)
  • Example: PCA + Handwritten Digit Recognition

Supervised Learning

  • Basics of Machine Learning
  • Regression/Classification
  • Algorithms: Linear Regression/Logistic Regression/Random Forest
  • Examples:
    • Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
    • Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank term deposit)

Recommender Systems

  • Content-based vs. Collaborative filtering
  • Algorithms: k-NN, Matrix Factorization (MF)
  • Example: Movie Recommender System (MovieLens)

Graph Analysis

  • Link Analysis
  • Algorithms: PageRank
  • Example: Ranking (a sample of) the Google Web Graph
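
The PageRank algorithm named above can be sketched as a simple power iteration. The snippet below is a minimal plain-Python illustration, not the class notebook: the toy graph, the damping factor of 0.85, and the fixed number of iterations are assumptions chosen for the example, and every link target is assumed to appear as a key of the graph dictionary.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping node -> list of out-links."""
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same "teleportation" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                # Split this node's rank evenly among the pages it links to.
                share = damping * ranks[node] / len(out_links)
                for target in out_links:
                    new_ranks[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * ranks[node] / n
                for target in nodes:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page web graph used only for illustration.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)
```

In this toy graph, page C collects links from both A and B, so it ends up with the highest score, while B, linked only by A, ends up with the lowest.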

Real-time Analytics

  • Streaming Data Processing
  • Example: Twitter Hate Speech Detector

Environment Setup

PySpark + Databricks

In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Databricks. This will allow you to write and execute PySpark (as well as pure Python, for that matter) in your browser, with:

  • Zero configuration required;
  • Free access to Databricks' powerful cloud infrastructure (including GPUs);
  • Easy sharing.

Why Databricks?

Starting from this year, our Big Data Computing class at Sapienza has joined the Databricks University Alliance. This is a very active community of educators and faculty members who share ideas, thoughts, and actual material on how to improve the teaching of data science classes, which ultimately allows students to learn the latest data science tools used in industry.

Where Should I Start with Databricks?

The first thing you have to do in order to start using Databricks is to set up a personal account. Databricks accounts come in two flavours:

  • Full Platform (paid, 14-day trial)
  • Community Edition (free)

The former is the standard paid account, which gives you access to the full-fledged Databricks data analytics platform based on either Microsoft Azure or Amazon AWS computational resources. The latter, instead, allows you to use Databricks on Amazon AWS for free (with some limitations, of course).

For the purposes of our class, all students must sign up for a personal Databricks Community Edition account using this link. Please be sure to select the correct type of account, as highlighted in the snapshot below:

Databricks Account Sign Up

For any further information, please follow the instructions provided in the documentation.

What Databricks Resources Should I Use?

Many big companies have started relying on the Databricks platform to run their data analytics tasks. As such, Databricks is well documented and provides a lot of useful material to consult. Among such material, I would suggest checking out the following:

Optionally, you may also want to install PySpark on your own local machine.

(NOTE: This step is not required for passing this class)

Local Mode Setup [Optional]

If you would also like to install and configure PySpark on your local machine, please follow the instructions described here. Note that those guidelines may refer to older (or, even worse, deprecated) versions of the required installation packages; please see the official PySpark documentation for the most up-to-date installation instructions.


Class Schedules

Lecture # | Date | Topic | Material
Lecture 1 | 02/23/2021 | Introduction to Big Data: Motivations and Challenges | [slides: PDF]
Lecture 2 | 02/24/2021 | MapReduce Programming Model | [slides: PDF]
Lecture 3 | 03/03/2021 | Apache Spark | [slides: PDF]
Lecture 4 | 03/09/2021 | PySpark Tutorial (with Databricks) | [notebook: ipynb]
Lecture 5-6 | 03/10/2021 - 03/16/2021 | Clustering | [slides: PDF]
Lecture 7-8 | 03/17/2021 - 03/23/2021 | Clustering Algorithms: K-means | [slides: PDF]
Lecture 9 | 03/24/2021 | Document Clustering with PySpark | [slides: PDF, notebook: ipynb]
Lecture 10-11 | 03/30/2021 - 03/31/2021 | Dimensionality Reduction (Principal Component Analysis) | [slides: PDF, notes: PDF]
Lecture 12 | 04/07/2021 | Principal Component Analysis with PySpark | [notebook: ipynb]
Lecture 13 | 04/13/2021 | Supervised Learning | [slides: PDF]
Lecture 14-15 | 04/14/2021 - 04/20/2021 | Linear Regression | [slides: PDF]
Lecture 16 | 04/21/2021 | Linear Regression with PySpark | [notebook: ipynb]
Lecture 17-18 | 04/27/2021 - 04/28/2021 | Logistic Regression | [slides: PDF, notes: PDF]
Lecture 19 | 05/04/2021 | Decision Trees and Ensembles | [slides: PDF]
Lecture 20 | 05/05/2021 | Evaluation Metrics for Classification | [slides: PDF, notebook: ipynb]
Lecture 21 | 05/11/2021 | Recommender Systems (Part I) | [slides: PDF]
Lecture 22 | 05/12/2021 | Recommender Systems (Part II) | [slides: PDF]

Previous Years

In the following, you can quickly navigate through Big Data Computing class information and material from previous years.

NOTE: There is a single folder containing the class material, and it is subject to changes and/or updates; as such, there may be differences between the content displayed on this website and what has been shown in class in the past.
