News | General Information | Syllabus | Environment Setup | Class Schedules
-
Students who plan to submit their project after the January 2021 session should refer to the Big Data Computing 2020-21 Moodle page, rather than the current one (i.e., Big Data Computing 2019-20). This aligns exam sessions with the correct academic year, since academic year 2019-20 formally ends on January 31, 2021. As such, from February 2021 until January 2022 all exam sessions will be displayed on the newly created Moodle page indicated above, where students will be allowed to submit their work during the corresponding Project Submission Week that will be opened along the way, as usual.
(NOTE: Only students who expect to complete the exam in one of the upcoming 2020-21 sessions must subscribe to the Big Data Computing 2020-21 Moodle page!) -
February 2021 Exam Session
Registrations for the February 2021 exam session are now open on Infostud (id 752692), and they will remain open until February 7, 2021. Project submission week opens on February 1, 2021 at 00:00 CET (Central European Time) and closes on February 7, 2021 at 23:59 CET.
(Please, see the announcement above for additional details on how to submit your project during this session, which is the first one of the academic year 2020-21.) -
January 2021 Exam Session: Final Grades
Final grades are available at this link -
January 2021 Exam Session: Project Presentation Schedule
Presentations of the 4 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on January 20 at 9:00AM CET. -
January 2021 Exam Session
Registrations for the January 2021 exam session are now open on Infostud (id 745153), and they will remain open until January 17, 2021. Project submission week opens on January 11, 2021 at 00:00 CET (Central European Time) and closes on January 17, 2021 at 23:59 CET. -
November 2020 Exam Session: Final Grades
Final grades are available at this link -
November 2020 Exam Session: Project Presentation Schedule
Presentations of the 2 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on November 5 at 9:30AM CET. -
November 2020 Exam Session (Extra)
Registrations for the November 2020 exam session (extra) are now open on Infostud (id 734617), and they will remain open until November 1, 2020. Project submission week opens on October 26, 2020 at 00:00 CET (Central European Time) and closes on November 1, 2020 at 23:59 CET.
NOTE: This extra session is reserved for part-time or working students, students with learning disabilities, students who have not completed their university exams within the set time period, and students who are about to graduate. No one else is allowed to register for this session; moreover, it is up to each student to prove their eligibility for this session by filling in this form. Additional information is available at this link (in Italian). -
September 2020 Exam Session: Final Grades
Final grades are available at this link -
September 2020 Exam Session: Project Presentation Schedule
Presentations of the 6 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on September 15 at 9:30AM CEST. -
September 2020 Exam Session
Registrations for the September 2020 exam session are now open on Infostud (id 719728), and they will remain open until September 8, 2020. Project submission week opens on September 2, 2020 at 00:00 CEST (Central European Summer Time) and closes on September 8, 2020 at 23:59 CEST. -
July 2020 Exam Session: Final Grades
Final grades are available at this link -
July 2020 Exam Session: Project Presentation Schedule
Presentations of the 7 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on July 22 at 9:30AM CEST.
All the details on how to join this event have already been specified in a message on the Moodle forum. -
July 2020 Exam Session
Registrations for the July 2020 exam session are now open on Infostud (id 718065), and they will remain open until July 10, 2020. Project submission week opens on July 4, 2020 at 00:00 CEST (Central European Summer Time) and closes on July 10, 2020 at 23:59 CEST. -
June 2020 Exam Session: Final Grades
Final grades are available at this link -
June 2020 Exam Session: Project Presentation Schedule
Presentations of the 11 projects that have been accepted for oral discussion will take place remotely via Google Meet in a two-day session:
- Day 1: July 1 at 9:30AM CEST (6 projects)
- Day 2: July 2 at 9:30AM CEST (5 projects)
All the details on how to join these events have already been specified in a message on the Moodle forum.
-
June 2020 Exam Session
Registrations for the June 2020 exam session are now open on Infostud (id 714643 and 716333), and they will remain open until June 19, 2020.
Project submission week opens on June 13, 2020 at 00:00 CEST (Central European Summer Time) and closes on June 19, 2020 at 23:59 CEST. -
### No class on Tuesday, May 19 ###
As already announced, there will be only one class next week. More specifically, our next class will be on Wednesday, May 20 at 3:00PM. -
A document containing the main guidelines for the final project is now available here.
-
Online classes will be suspended from Thursday, April 9 to Tuesday, April 14, due to Easter holidays. Our next class will take place on Wednesday, April 15 at 3:00PM on Google Meet.
-
Online classes will also be recorded, so as to allow you to watch them offline at your own convenience.
Video recordings will be compressed and uploaded to a shared Google Drive folder. I have already granted access to that folder to all the students who are enrolled in our Moodle web page at the time of writing.
In case some of you haven't subscribed to Moodle yet, please do so as soon as you can!
Also, note that you must use your institutional credentials (i.e., @studenti.uniroma1.it) to access the shared folder above. -
Due to the COVID-19 emergency, office hours are also suspended. Meanwhile, I will be reachable via email.
-
Next week, classes will be streamed online via Google Meet at the times they were originally scheduled, i.e., Tuesday from 8:00AM to 10:00AM and Wednesday from 3:00PM to 6:00PM. To join the virtual room, please access the following URL on Tuesday a couple of minutes before the class starts, using your institutional address: https://meet.google.com/pvs-cubs-gfw
-
Following the restrictions imposed by the latest ordinance issued by the Italian government to counter the COVID-19 emergency, all classes are suspended until Friday, April 3, as reported on Sapienza's website.
-
Due to unexpected issues, class lectures will start on Tuesday, March 3 rather than February 25 as originally scheduled. Apologies for the very short notice.
Welcome to the Big Data Computing class!
This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.
This repository contains class material along with any useful information for the 2019-2020 academic year.
- Tuesday from 8:00AM to 10:00AM (Room Alfa, Via Salaria 113)
- Wednesday from 3:00PM to 6:00PM (Room Alfa, Via Salaria 113)
- Tuesday from 2:00PM to 4:00PM, Room G39 located at the 2nd floor of Building G in viale Regina Elena 295.
Students must subscribe to the Moodle web page, using the same credentials (username/password) used to access the Wi-Fi network and Infostud services, at the following link: https://elearning.uniroma1.it/course/view.php?id=8460
The amount, variety, and rate at which data is being generated nowadays both by humans and machines are unprecedented. This opens up a number of challenges on how to deal with those data, as traditional computing paradigms are not conceived to operate at such a scale.
"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.
This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.
The course assumes that students are familiar with the basics of data analysis and machine learning, properly supported by a strong knowledge of foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably using Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.
Students must prove their comprehension of the subject by developing a software project that leverages the methodologies and tools introduced during classes. Projects must of course address typical Big Data tasks (e.g., clustering, prediction, or recommendation) using very large datasets in any application domain of interest. In any case, the topic of the project must be agreed with the professor in advance; however, references from which to select interesting projects (e.g., Kaggle) will be suggested throughout the course.
Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam where submitted projects are discussed in English; other questions on any topic addressed during the course may also be asked, but those can be answered either in English or in Italian, as the student prefers.
A document containing the main guidelines for the final project is available here.
No textbooks are mandatory to successfully follow this course. However, there is a huge set of references that may be worth mentioning, especially for those who want to dig deeper into specific topics. Among those, some readings I would like to suggest are as follows:
- Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] available online.
- Big Data Analysis with Python [Marin, Shukla, VK]
- Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
- Spark: The Definitive Guide [Chambers, Zaharia]
- Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
- Hadoop: The Definitive Guide [White]
- Python for Data Analysis [Mckinney]
Introduction
- The Big Data Phenomenon
- The Big Data Infrastructure
- Distributed File Systems (HDFS)
- MapReduce (Hadoop)
- Spark
- PySpark + Google Colaboratory
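Before diving into Hadoop and Spark, the MapReduce programming model itself can be illustrated in a few lines of plain Python. The sketch below, on a couple of made-up documents, mimics the three phases of the classic word count job; Hadoop runs the same logic distributed across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 3
```

In Hadoop or Spark the shuffle is performed by the framework over the network, so the programmer only supplies the map and reduce functions.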
Unsupervised Learning: Clustering
- Similarity Measures
- Algorithms: K-means
- Example: Document Clustering
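As a toy illustration of the K-means topic above, here is a minimal pure-Python version of Lloyd's algorithm on a made-up 2D dataset (the course labs use PySpark's MLlib implementation instead):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: alternate between assigning each point to its
    # nearest centroid and recomputing each centroid as the cluster mean.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

# Two well-separated blobs of invented 2D points
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
cents = sorted(kmeans(pts, 2))  # roughly (0.05, 0.1) and (5.1, 4.95)
```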
Dimensionality Reduction
- Feature Extraction
- Algorithms: Principal Component Analysis (PCA)
- Example: PCA + Handwritten Digit Recognition
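The core idea of PCA, finding the direction of maximum variance, can be sketched with power iteration on the covariance matrix. The data points below are invented; real uses such as the handwritten digit example rely on library implementations:

```python
import math

def leading_component(data, iters=100):
    # PCA sketch: center the data, build the covariance matrix, then use
    # power iteration to extract its leading eigenvector (the 1st PC).
    n, d = len(data), len(data[0])
    means = [sum(col) / n for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # renormalize at every step
    return v

# Made-up points lying roughly on the line y = x
pts = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
pc1 = leading_component(pts)  # roughly (0.7, 0.7)
```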
Supervised Learning
- Basics of Machine Learning
- Regression/Classification
- Algorithms: Linear Regression/Logistic Regression/Random Forest
- Examples:
- Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
- Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a term deposit at a bank)
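For intuition on the linear regression example above: with a single feature, the ordinary least squares fit has a closed form. The sizes/prices below are invented numbers generated exactly from y = 2x + 1:

```python
def fit_ols(xs, ys):
    # Simple linear regression: slope = cov(x, y) / var(x),
    # intercept = mean(y) - slope * mean(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

# Hypothetical house sizes vs. sale prices, generated from y = 2x + 1
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_ols(sizes, prices)
print(slope, intercept)  # 2.0 1.0
```

Spark's LinearRegression solves the same least squares problem, but with many features and data partitioned across a cluster.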
Recommender Systems
- Content-based vs. Collaborative filtering
- Algorithms: k-NN, Matrix Factorization (MF)
- Example: Movie Recommender System (MovieLens)
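A minimal sketch of the k-NN idea behind collaborative filtering: represent each user as a vector of item ratings and compare users with cosine similarity. The rating matrix below is made up:

```python
import math

def cosine(u, v):
    # Cosine similarity between two rating vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical user-item rating matrix (rows: users, columns: movies)
ratings = {
    "alice": [5, 4, 0, 1],
    "bob":   [4, 5, 0, 1],
    "carol": [1, 0, 5, 4],
}

def most_similar(user):
    # k-NN with k=1: the neighbour with the highest cosine similarity
    others = [u for u in ratings if u != user]
    return max(others, key=lambda u: cosine(ratings[user], ratings[u]))

print(most_similar("alice"))  # bob
```

A recommender would then suggest items that the nearest neighbours rated highly; matrix factorization instead learns low-dimensional user and item vectors whose dot products approximate the observed ratings.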
Graph Analysis
- Link Analysis
- Algorithms: PageRank
- Example: Ranking (a sample of) the Google Web Graph
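The PageRank algorithm named above reduces to a few lines of power iteration; the tiny graph below is invented (the lab applies the same idea to a sample of the Google web graph with PySpark):

```python
def pagerank(graph, damping=0.85, iters=50):
    # Power iteration: repeatedly redistribute each node's rank
    # along its out-links, mixed with a uniform teleport term.
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in graph}
        for node, out_links in graph.items():
            share = damping * ranks[node] / len(out_links)
            for target in out_links:
                new[target] += share
        ranks = new
    return ranks

# Tiny made-up web graph; every node has at least one out-link
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Node "c" ends up ranked highest here because it is linked both by "a" and by "b"; ranks always sum to 1.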
Real-time Analytics
- Streaming Data Processing
- Example: Twitter Hate Speech Detector
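Streaming processing means handling records one at a time while keeping only bounded state. As a stand-in for the Twitter hate speech detector (which uses a trained classifier rather than keywords), here is a toy filter over an invented stream:

```python
def stream_classify(tweets, blocklist):
    # Process the stream one tweet at a time, keeping only O(1) state:
    # a running count of the tweets flagged so far.
    flagged = 0
    for tweet in tweets:
        if any(word in blocklist for word in tweet.lower().split()):
            flagged += 1
        yield tweet, flagged

stream = ["have a nice day", "you are awful", "hello world"]
results = list(stream_classify(stream, {"awful"}))
print(results[-1][1])  # 1
```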
In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Google Colaboratory (or "Colab" for short). This allows you to write and execute PySpark (as well as pure Python, for that matter) in your browser, with:
- Zero configuration required;
- Free access to Google's powerful cloud infrastructure (including GPUs);
- Easy sharing.
Of course, the same can also be achieved on your own local machine, but that would require: (i) dealing with clumsy installation issues that are very specific to your platform, and (ii) sticking to "small" rather than truly "big" data, as your machine cannot compare with Google's infrastructure!
Still, in case you would also like to perform a local installation, the following are the steps (along with some references) you need to take.
- Install Python 3.6 (or later) via Anaconda along with Jupyter Notebook
- Install Java 8
- If your system has multiple JDK installations, use jenv to manage them (e.g., for macOS users, please refer to this link)
- In your `~/.profile`, `~/.bash_profile`, or `~/.bashrc`, enable `jenv` to manage multiple JDKs by adding the following two lines: `export PATH="$HOME/.jenv/bin:$PATH"` and `eval "$(jenv init -)"`
- Run `jenv enable-plugin export` to allow `jenv` to automatically set `JAVA_HOME` upon changes to the local/shell/global Java version
- In your `~/.profile`, `~/.bash_profile`, or `~/.bashrc`, set the default (system-wide) `JAVA_HOME` as follows: `export JAVA_HOME=$(/usr/libexec/java_home -v $(jenv version-name))`
- Create a `conda` environment specifically for PySpark in combination with Python 3.6 (or later), and call it for instance "PySpark" (although you can choose any name you want): `conda create -n PySpark python=3.6`
- Install the required packages inside the newly created conda environment, either via `conda` or via `pip`:
  - `conda activate PySpark`
  - `conda install pip numpy scipy pandas scikit-learn seaborn ipykernel findspark`
- Install any additional packages you may need (e.g., `conda install autopep8`, ...), then run `conda deactivate`
- Prepare a kernel for the newly created environment on Jupyter Notebook:
  - `conda activate PySpark`
  - `python -m ipykernel install --user --name PySpark --display-name "PySpark"`
  - `conda deactivate`
- Download the latest version of Spark from Apache (e.g., 2.4.5)
- Untar the downloaded archive: `tar -xzf spark-2.4.5-bin-hadoop2.7.tgz`
- Move the directory to a local folder (e.g., `/opt/`, `/opt/local/`, `/usr/local/`, etc.) [might require sudo/administrator's password]: `mv spark-2.4.5-bin-hadoop2.7 /usr/local/spark-2.4.5`
- Create a symlink so as to allow multiple versions of Spark: `ln -s /usr/local/spark-2.4.5 /usr/local/spark`
- Update your `~/.profile`, `~/.bash_profile`, or `~/.bashrc` file as follows:
  - `export SPARK_HOME=/usr/local/spark`
  - `export PATH=$SPARK_HOME/bin:$PATH`
- NOTE: THE FOLLOWING IS ONLY NEEDED IF YOU HAVE MULTIPLE JDK VERSIONS INSTALLED
  - Go to `/usr/local/spark/conf` and create a `spark-env.sh` file (copying it from the template provided)
  - Force Spark to run on top of JDK 1.8 by copy-pasting the following into `spark-env.sh`: `export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)`
You can start running your local Spark installation either interactively, by typing the shell command `pyspark`, or using Jupyter Notebook in combination with the virtual environment created as indicated above. Either way, before being able to import any PySpark library, you will need to use `findspark` to allow Python to correctly locate your own PySpark installation. To do so, just type the following (either at the prompt after you have executed `pyspark`, in the first code cell of your Jupyter Notebook, or at the top of your Python script):

`import findspark`
`findspark.init("${SPARK_HOME}")`

where `${SPARK_HOME}` contains the path to your local Spark installation (e.g., `/usr/local/spark`).
Lecture # | Date | Topic | Material |
---|---|---|---|
Lecture 1 | 03/03/2020 | Introduction to Big Data: Motivations and Challenges | [slides: PDF] |
Lecture 2 | 03/04/2020 | MapReduce Programming Model | [slides: PDF] |
Lecture 3 | 03/10/2020 | Apache Spark + PySpark Tutorial (Part I) | [slides: PDF, notebook: ipynb] |
Lecture 4 | 03/11/2020 | PySpark Tutorial (Part II) + Clustering: Data Representation | [slides: PDF, notebook: ipynb] |
Lecture 5 | 03/17/2020 | Clustering: Distance Measures | [slides: PDF] |
Lecture 6 | 03/18/2020 | Clustering Algorithms: K-means | [slides: PDF] |
Lecture 7 | 03/24/2020 | Document Clustering with PySpark | [slides: PDF, notebook: ipynb] |
Lecture 8 | 03/25/2020 | Dimensionality Reduction (Principal Component Analysis) | [slides: PDF, notes: PDF] |
Lecture 9 | 03/31/2020 | Principal Component Analysis with PySpark | [notebook: ipynb] |
Lecture 10 | 04/01/2020 | Supervised Learning | [slides: PDF] |
Lecture 11 | 04/07/2020 | Linear Regression | [slides: PDF] |
Lecture 12 | 04/08/2020 | Linear Regression with PySpark | [notebook: ipynb] |
Lecture 13 | 04/15/2020 | Logistic Regression | [slides: PDF, notes: PDF] |
Lecture 14 | 04/21/2020 | Gradient Descent | [slides: PDF] |
Lecture 15 | 04/22/2020 | Decision Trees and Ensembles | [slides: PDF] |
Lecture 16 | 04/28/2020 | Evaluation Metrics for Classification | [slides: PDF] |
Lecture 17 | 04/29/2020 | Classification with PySpark | [notebook: ipynb] |
Lecture 18 | 05/05/2020 | Recommender Systems (Part I) | [slides: PDF] |
Lecture 19 | 05/06/2020 | Recommender Systems (Part II) | [slides: PDF] |
Lecture 20 | 05/12/2020 | Recommender Systems (Matrix Factorization) with PySpark | [notebook: ipynb] |
Lecture 21 | 05/13/2020 | Graph Link Analysis | [slides: PDF] |
Lecture 22 | 05/20/2020 | PageRank with PySpark | [slides: PDF, notes: PDF, notebook: ipynb] |
Lecture 23 | 05/26/2020 | Streaming Data Processing | [slides: PDF] |
Lecture 24 | 05/27/2020 | Streaming Classification with PySpark + The Last Take Home Message | [notebook: ipynb, slides: PDF] |