Big Data Computing (2021-2022)

News | General Information | Syllabus | Environment Setup | Class Schedules

News

  • January 2023 Exam Session: Final Grades
    Final grades are available at this link.
  • Students who are planning to submit their projects after the January 2023 session should refer to the new Big Data Computing 2022-23 Moodle page, rather than the current one (i.e., Big Data Computing 2021-22). This is to align exam sessions to the correct academic year since the academic year 2021-22 formally ends on January 31, 2023. As such, from February 2023 until January 2024, all the exam sessions will be displayed on the newly created Moodle page indicated above, where students will be allowed to submit their proposals and projects on the corresponding submission links opened along the way, as usual. For example, the upcoming Call for Project Proposals for February 2023 Exam Session is available at the following link.
    (NOTE: Only students who expect to complete the exam in one of the upcoming 2022-23 sessions must subscribe to the Big Data Computing 2022-23 Moodle page!)
  • January 2023 Exam Session: Registration, Project Submission and Presentation
    Registrations to the January 2023 exam session are open on Infostud (id 858976), and so they will be until January 22, 2023.
    Project submission week opens up on January 16, 2023, at 00:00 a.m. CET (Central European Time) and closes on January 22, 2023, at 11:59 p.m. CET.
    More information will be provided soon about how projects accepted for oral discussion will be presented.
  • January 2023 Exam Session: Call for Project Proposals
    Project proposals for the January 2023 session will be accepted starting from December 5, 2022, at 00:00 a.m. CET (Central European Time) until December 11, 2022, at 11:59 p.m. CET.
    Please, be sure you upload your proposal on time using the appropriate Moodle assignment link, following the specified guidelines.
  • October 2022 Exam Session (Extra): Final Grades
    Final grades are available at this link.
  • October 2022 Exam Session (Extra): Registration, Project Submission and Presentation
    Registrations to the October 2022 exam session (extra) are open on Infostud (id 840632), and so they will be until October 23, 2022.
    Project submission week opens up on October 17, 2022, at 00:00 a.m. CEST (Central European Summer Time) and closes on October 23, 2022, at 11:59 p.m. CEST.
    Projects accepted for oral discussion will be presented on October 24, 2022, at 10:00 a.m. CEST. The discussion will take place exclusively remotely via Google Meet. More information about how to join this meeting will be sent close to the presentation date.
    IMPORTANT NOTE: This session is reserved only for eligible students who issued their request to the administration office by the specified deadline (September 25, 2022, at 11:59 p.m. CEST).
    Please, do not register to this exam on Infostud if you are not allowed to do so; otherwise, your project will not be considered for grading anyway.
  • October 2022 Exam Session (Extra): Call for Project Proposals
    Project proposals for the October 2022 exam (extra) session will be accepted starting from September 15, 2022, at 00:00 a.m. CEST (Central European Summer Time) until September 21, 2022, at 11:59 p.m. CEST.
    Please, be sure you upload your proposal on time using the appropriate Moodle assignment link, following the specified guidelines.
  • September 2022 Exam Session: Final Grades
    Final grades are available at this link.
  • September 2022 Exam Session: Registration, Project Submission and Presentation
    Registrations to the September 2022 exam session are open on Infostud (id 824956), and so they will be until September 6, 2022.
    Project submission week opens up on August 31, 2022, at 00:00 a.m. CEST (Central European Summer Time) and closes on September 6, 2022, at 11:59 p.m. CEST.
    Projects accepted for oral discussion will be presented on September 8 at 9:30 a.m. CEST and, if needed, on September 9 at 2:00 p.m. CEST. The discussion will take place exclusively remotely via Google Meet. More information about how to join this meeting will be sent close to the presentation date.
  • September 2022 Exam Session: Call for Project Proposals
    Project proposals for the September 2022 session will be accepted starting from July 6, 2022, at 00:00 a.m. CEST (Central European Summer Time) until July 12, 2022, at 11:59 p.m. CEST.
    Please, be sure you upload your proposal on time using the appropriate Moodle assignment link, following the specified guidelines.
  • July 2022 Exam Session: Final Grades
    Final grades are available at this link.
  • June 2022 Exam Session: Final Grades
    Final grades are available at this link.
  • July 2022 Exam Session: Registration, Project Submission and Presentation
    Registrations to the July 2022 exam session are open on Infostud (id 822764), and so they will be until June 28, 2022.
    Project submission week opens up on June 22, 2022, at 00:00 a.m. CEST (Central European Summer Time) and closes on June 28, 2022, at 11:59 p.m. CEST.
    Projects accepted for oral discussion will be presented on June 30 and, if needed, July 1, 2022, at 9:30 a.m. CEST. The discussion will take place exclusively remotely via Google Meet. More information about how to join this meeting will be sent close to the presentation date.
    Everyone is welcome to join!
  • July 2022 Exam Session: Call for Project Proposals
    Project proposals for the July 2022 session will be accepted starting from May 23, 2022, at 00:00 a.m. CEST (Central European Summer Time) until May 29, 2022, at 11:59 p.m. CEST.
    Please, be sure you upload your proposal on time using the appropriate Moodle assignment link, following the specified guidelines.
  • June 2022 Exam Session: Registration, Project Submission and Presentation
    Registrations to the June 2022 exam session are open on Infostud (id 819129), and so they will be until June 7, 2022.
    Project submission week opens up on June 1, 2022, at 00:00 a.m. CEST (Central European Summer Time) and closes on June 7, 2022, at 11:59 p.m. CEST.
    Projects accepted for oral discussion will be presented on June 9 and 10, 2022, at 9:30 a.m. CEST. The discussion will take place in person in Room G50, which is located in Viale Regina Elena 295. Alternatively, candidates who cannot participate in person can join the meeting remotely via Google Meet, using the link that will be sent close to the presentation date. Everyone is welcome to join!
  • June 2022 Exam Session: Call for Project Proposals
    Project proposals for the June 2022 session will be accepted starting from April 25, 2022, at 00:00 a.m. CEST (Central European Summer Time) until May 1, 2022, at 11:59 p.m. CEST.
    Please, be sure you upload your proposal on time using the appropriate Moodle assignment link, following the specified guidelines.
  • A document containing the main guidelines for the final project is available here. Please, check it out!
  • IMPORTANT NOTICE:
    At least the next two lectures, scheduled for Tuesday, April 12, from 5:00 p.m. to 7:00 p.m., and Wednesday, April 13, from 8:00 a.m. to 11:00 a.m., will take place entirely and exclusively online via Zoom at the usual link.
  • IMPORTANT NOTICE:
    Both today's (April 5) and tomorrow's (April 6) lectures are canceled.
  • March 2022 Exam Session (Extra): Final Grades
    Final grades are available at this link.
  • March 2022 Exam Session (Extra): Project Presentation Schedule
    Presentations of the projects that have been accepted for oral discussion are scheduled for March 28, 2022, at 3:00 p.m. CEST. The discussion will take place in person in Room G0, which is located in Viale Regina Elena 295. Alternatively, candidates who cannot participate in person can join the meeting remotely via Google Meet, using the link indicated in the message sent on the Moodle forum. Everyone is welcome to join!
  • March 2022 Exam Session (Extra)
    Registrations to the March 2022 exam (extra) session are open on Infostud (id 814638), and so they will be until March 25, 2022. Project submission week opens up on March 19, 2022, at 00:00 a.m. CET (Central European Time) and closes on March 25, 2022, at 11:59 p.m. CET.
    NOTE: This extra session is reserved only for part-time or working students, students with learning disabilities, students who have not completed university exams within the set time period, as well as students who are about to graduate. Only the students who successfully issued their request on time will be allowed to participate in this session!
  • February 2022 Exam Session: Final Grades
    Final grades are available at this link.
  • February 2022 Exam Session: Project Presentation Schedule
    Presentations of the projects that have been accepted for oral discussion will take place remotely via Google Meet on February 8, 2022, at 10:00 a.m. CET, using the link indicated in the message sent on the Moodle forum. Everyone is welcome to join!
  • February 2022 Exam Session
    Registrations to the February 2022 exam session are open on Infostud (id 793404), and so they will be until February 4, 2022. Project submission week opens up on January 29, 2022, at 00:00 a.m. CET (Central European Time) and closes on February 4, 2022, at 11:59 p.m. CET.
    (Please, see the announcement below for additional details on how to submit your project during this session, which is the first one of the academic year 2021-22.)
  • Students who are planning to submit their projects after the January 2022 session should refer to the Big Data Computing 2021-22 Moodle page, rather than the current one (i.e., Big Data Computing 2020-21). This is to align exam sessions to the correct academic year, since the academic year 2020-21 formally ends on January 31, 2022. As such, from February 2022 until January 2023, all the exam sessions will be displayed on the newly created Moodle page indicated above, where students will be allowed to submit their work during the corresponding Project Submission Week that will be opened along the way, as usual. For example, the upcoming February Submission Week is available at the following link.
    (NOTE: Only students who expect to complete the exam in one of the upcoming 2021-22 sessions must subscribe to the Big Data Computing 2021-22 Moodle page!)

General Information

Welcome to the Big Data Computing class!

This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.

This repository contains class material along with any useful information for the 2021-2022 academic year.

Class Schedule

  • Tuesday from 5:00 p.m. to 7:00 p.m.
  • Wednesday from 8:00 a.m. to 11:00 a.m.

How to Attend Classes

According to the guidelines provided by Sapienza University to counter the COVID-19 pandemic, the course will be held both in person and remotely. For any further information, students must refer to the official documentation available on the Sapienza website.

Attending Classes in Person: Room 1L - Via del Castro Laurenziano 7a

Students who wish to attend classes in person must issue their request through the Infostud Lab App or the Prodigit Sapienza online booking system, according to the rules established (please, see here). Once the booking is confirmed, students must go to Room 1L, which is located in Via del Castro Laurenziano 7a, according to the class schedule above.

Attending Classes Remotely: Zoom

Students who wish to attend classes remotely must register for the dedicated Zoom conference, using the following link: https://uniroma1.zoom.us/meeting/register/tZAkdOysqjkiG9SU5I1rG-oENGV-RIfCxLwv

Moodle Web Page

Students must subscribe to the Moodle web page, using the same credentials (username/password) used to access the Wi-Fi network and Infostud services, at the following link: https://elearning.uniroma1.it/course/view.php?id=14454

Contacts

Office Hours

Please, drop me a message at [email protected] in case you would like to schedule a meeting, either online (i.e., via Google Meet or Zoom) or in person (i.e., in Room 106, located on the 1st floor of Building E in Viale Regina Elena 295).

Description and Goals

The amount, variety, and rate at which data is being generated nowadays, both by humans and machines, are unprecedented. This opens up a number of challenges on how to deal with such data, as traditional computing paradigms were not conceived to operate at this scale.

"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.

This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry-standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.

Prerequisites

The course assumes that students are familiar with the basics of data analysis and machine learning, properly supported by a strong knowledge of foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably using the Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.

Exams

Students must prove their level of comprehension of the subject by developing a software project, leveraging the set of methodologies and tools introduced during classes. Projects must of course refer to typical Big Data tasks, e.g., clustering, prediction, recommendation (just to name a few), using very large datasets in any application domain of interest.
In any case, the topic of the project must first be agreed upon with the teacher through a proposal that must be sent at least one month before the targeted project submission deadline. NOTE: Only the projects that have been successfully approved will be considered for grading!
Sources of interesting project ideas will be suggested throughout the course (e.g., Kaggle). However, I strongly encourage you to come up with your own original ideas, as creativity will be very much appreciated.
Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam where submitted projects will be discussed in English; other questions on any topic addressed during the course may also be asked, but those can be answered either in English or in Italian, as the student prefers.
A document containing the main guidelines for the final project is available here.

Recommended Textbooks

No textbooks are mandatory to successfully follow this course. However, there is a large set of references that may be worth mentioning, especially for those who want to dig deeper into some specific topics. Among those, some readings I would like to suggest are as follows:

  • Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] available online.
  • Big Data Analysis with Python [Marin, Shukla, VK]
  • Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
  • Spark: The Definitive Guide [Chambers, Zaharia]
  • Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
  • Hadoop: The Definitive Guide [White]
  • Python for Data Analysis [McKinney]

Syllabus

Introduction

  • The Big Data Phenomenon
  • The Big Data Infrastructure
    • Distributed File Systems (HDFS)
    • MapReduce (Hadoop)
    • Spark
  • PySpark + Databricks

Unsupervised Learning: Clustering

  • The Curse of Dimensionality (Similarity Measures)
  • Algorithms: K-means
  • Example: Document Clustering

Dimensionality Reduction

  • Feature Extraction
  • Algorithms: Principal Component Analysis (PCA)
  • Example: PCA + Handwritten Digit Recognition

Supervised Learning

  • Basics of Machine Learning
  • Regression/Classification
  • Algorithms: Linear Regression/Logistic Regression/Random Forest
  • Examples:
    • Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
    • Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank term deposit)

Recommender Systems

  • Content-based vs. Collaborative filtering
  • Algorithms: k-NN, Matrix Factorization (MF)
  • Example: Movie Recommender System (MovieLens)

Graph Analysis

  • Link Analysis
  • Algorithms: PageRank
  • Example: Ranking (a sample of) the Google Web Graph

Environment Setup

PySpark + Google Colaboratory

In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Google Colaboratory (or "Colab" for short). This will allow you to write and execute PySpark code (as well as pure Python, for that matter) in your browser, with:

  • Zero configuration required;
  • Free access to Google's powerful cloud infrastructure (including GPUs);
  • Easy sharing.

Of course, the same can also be achieved on your own local machine, but that would require: (i) dealing with clumsy installation issues that are very specific to your platform, and (ii) sticking to "small" rather than real "big" data, as your machine cannot compare with Google's infrastructure!

Optionally, you may also want to install PySpark on your own local machine.

(NOTE: This step is not required for passing this class)

Local Mode Setup [Optional]

In case you would like to install and configure PySpark on your local machine as well, please follow the instructions described here. Note that those guidelines may refer to older (or, even worse, deprecated) versions of the required installation packages; please, see the official PySpark documentation for the most up-to-date installation instructions.


Class Schedules

| Lecture # | Date | Topic | Material |
|---|---|---|---|
| Lecture 1 | 02/22/2022 | Introduction to Big Data: Motivations and Challenges | [slides: PDF] |
| Lecture 2 | 02/23/2022 | MapReduce Programming Model | [slides: PDF] |
| Lecture 3 | 03/01/2022 | Apache Spark | [slides: PDF] |
| Lecture 4 | 03/02/2022 | PySpark Tutorial | [notebook: ipynb] |
| Lecture 5 | 03/08/2022 | The Curse of Dimensionality | [slides: PDF] |
| Lecture 6 | 03/09/2022 | Clustering Algorithms (Part I): K-means | [slides: PDF] |
| Lecture 7 | 03/15/2022 | Clustering Algorithms (Part II): Validity Measures | [slides: PDF] |
| Lecture 8 | 03/16/2022 | Document Clustering with PySpark | [slides: PDF, notebook: ipynb] |
| Lecture 9 | 03/22/2022 | Dimensionality Reduction: Principal Component Analysis (Part I) | [slides: PDF] |
| Lecture 10 | 03/23/2022 | Dimensionality Reduction: Principal Component Analysis (Part II) | [slides: PDF, notes: PDF] |
| Lecture 11 | 03/29/2022 | Principal Component Analysis with PySpark | [notebook: ipynb] |
| Lecture 12 | 03/30/2022 | Supervised Learning (Part I): Data Preparation | [slides: PDF] |
| Lecture 13 | 04/12/2022 | Supervised Learning (Part II): Model Training | [slides: PDF] |
| Lecture 14 | 04/13/2022 | Linear Regression (OLS) | [slides: PDF, notebook: ipynb] |
| Lecture 15 | 04/20/2022 | Logistic Regression (Part I): Model | [slides: PDF] |
| Lecture 16 | 04/26/2022 | Logistic Regression (Part II): Cost Function | [slides: PDF, notes: PDF] |
| Lecture 17 | 04/27/2022 | Gradient Descent | [slides: PDF] |
| Lecture 18 | 05/03/2022 | Decision Trees and Ensembles (Part I) | [slides: PDF] |
| Lecture 19 | 05/04/2022 | Decision Trees and Ensembles (Part II) | [slides: PDF] |
| Lecture 20 | 05/10/2022 | Evaluation Metrics for Classification | [slides: PDF, notebook: ipynb] |
| Lecture 21 | 05/11/2022 | Recommender Systems (Part I) | [slides: PDF] |
| Lecture 22 | 05/17/2022 | Recommender Systems (Part II) | [slides: PDF] |
| Lecture 23 | 05/18/2022 | Recommender Systems (Part III) | [slides: PDF, notebook: ipynb] |
| Lecture 24 | 05/24/2022 | Graph Link Analysis | [slides: PDF] |
| Lecture 25 | 05/25/2022 | PageRank | [slides: PDF, notes: PDF] |
| ---------- | 05/25/2022 | The Last Take Home Message | [slides: PDF] |