Big Data Computing (2019-2020)

News | General Information | Syllabus | Environment Setup | Class Schedules

News

  • Students who are planning to submit their project after the January 2021 session should refer to the Big Data Computing 2020-21 Moodle page rather than the current one (i.e., Big Data Computing 2019-20). This aligns exam sessions with the correct academic year, since academic year 2019-20 formally ends on January 31, 2021. As such, from February 2021 until January 2022 all exam sessions will be listed on the newly created Moodle page indicated above, where students will be able to submit their work during the corresponding Project Submission Week, which will be opened for each session as usual.
    (NOTE: Only students who expect to complete the exam in one of the upcoming 2020-21 sessions must subscribe to the Big Data Computing 2020-21 Moodle page!)

  • February 2021 Exam Session
    Registration for the February 2021 exam session is now open on Infostud (id 752692) and will remain open until February 7, 2021. The project submission week opens on February 1, 2021 at 00:00 CET (Central European Time) and closes on February 7, 2021 at 23:59 CET.
    (Please see the announcement above for additional details on how to submit your project during this session, which is the first of academic year 2020-21.)

  • January 2021 Exam Session: Final Grades
    Final grades are available at this link

  • January 2021 Exam Session: Project Presentation Schedule
    Presentations of the 4 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on January 20 at 9:00AM CET.

  • January 2021 Exam Session
    Registration for the January 2021 exam session is now open on Infostud (id 745153) and will remain open until January 17, 2021. The project submission week opens on January 11, 2021 at 00:00 CET (Central European Time) and closes on January 17, 2021 at 23:59 CET.

  • November 2020 Exam Session: Final Grades
    Final grades are available at this link

  • November 2020 Exam Session: Project Presentation Schedule
    Presentations of the 2 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on November 5 at 9:30AM CET.

  • November 2020 Exam Session (Extra)
    Registration for the November 2020 exam session (extra) is now open on Infostud (id 734617) and will remain open until November 1, 2020. The project submission week opens on October 26, 2020 at 00:00 CET (Central European Time) and closes on November 1, 2020 at 23:59 CET.
    NOTE: This extra session is reserved for part-time or working students, students with learning disabilities, students who have not completed their university exams within the set time period, and students who are about to graduate. No one else is allowed to register for this session; moreover, each student is responsible for proving their eligibility by filling in this form. Additional information is available at this link (in Italian).

  • September 2020 Exam Session: Final Grades
    Final grades are available at this link

  • September 2020 Exam Session: Project Presentation Schedule
    Presentations of the 6 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on September 15 at 9:30AM CEST.

  • September 2020 Exam Session
    Registration for the September 2020 exam session is now open on Infostud (id 719728) and will remain open until September 8, 2020. The project submission week opens on September 2, 2020 at 00:00 CEST (Central European Summer Time) and closes on September 8, 2020 at 23:59 CEST.

  • July 2020 Exam Session: Final Grades
    Final grades are available at this link

  • July 2020 Exam Session: Project Presentation Schedule
    Presentations of the 7 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on July 22 at 9:30AM CEST.
    All the details on how to join this event have already been specified in a message on the Moodle forum.

  • July 2020 Exam Session
    Registration for the July 2020 exam session is now open on Infostud (id 718065) and will remain open until July 10, 2020. The project submission week opens on July 4, 2020 at 00:00 CEST (Central European Summer Time) and closes on July 10, 2020 at 23:59 CEST.

  • June 2020 Exam Session: Final Grades
    Final grades are available at this link

  • June 2020 Exam Session: Project Presentation Schedule
    Presentations of the 11 projects that have been accepted for oral discussion will take place remotely via Google Meet in a two-day session:

    • Day 1: July 1 at 9:30AM CEST (6 projects)
    • Day 2: July 2 at 9:30AM CEST (5 projects)

    All the details on how to join these events have already been specified in a message on the Moodle forum.

  • June 2020 Exam Session
    Registration for the June 2020 exam session is now open on Infostud (ids 714643 and 716333) and will remain open until June 19, 2020.
    The project submission week opens on June 13, 2020 at 00:00 CEST (Central European Summer Time) and closes on June 19, 2020 at 23:59 CEST.

  • ### No class on Tuesday, May 19 ###
    As already announced, there will be only one class in next week's schedule. More specifically, our next class will be on Wednesday, May 20 at 3:00PM.

  • A document containing the main guidelines for the final project is now available here.

  • Online classes will be suspended from Thursday, April 9 to Tuesday, April 14, due to Easter holidays. Our next class will take place on Wednesday, April 15 at 3:00PM on Google Meet.

  • Online classes will also be recorded so that you can watch them offline at your convenience.
    Video recordings will be compressed and uploaded to a shared Google Drive folder. I have already granted access to that folder to all the students who are enrolled in our Moodle web page at the time of writing.
    If you haven't subscribed to Moodle yet, please do so as soon as you can!
    Also, note that you must use your institutional credentials (i.e., @studenti.uniroma1.it) to access the shared folder above.

  • Due to the COVID-19 emergency, office hours are also suspended. Meanwhile, I will be reachable via email.

  • Next week, classes will be streamed online via Google Meet at the same times they were originally scheduled, i.e., Tuesday from 8:00AM to 10:00AM and Wednesday from 3:00PM to 6:00PM. To join the virtual room, please open the following URL on Tuesday a couple of minutes before the class starts, using your institutional address: https://meet.google.com/pvs-cubs-gfw

  • Following the restrictions imposed by the latest ordinance issued by the Italian government to counter the COVID-19 emergency, all classes are suspended until Friday, April 3, as reported on Sapienza's website.

  • Due to unexpected issues, class lectures will start on Tuesday, March 3 rather than February 25 as originally scheduled. Apologies for the very short notice.

General Information

Welcome to the Big Data Computing class!

This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.

This repository contains class material along with any useful information for the 2019-2020 academic year.

Class Schedule

  • Tuesday from 8:00AM to 10:00AM (Room Alfa, Via Salaria 113)
  • Wednesday from 3:00PM to 6:00PM (Room Alfa, Via Salaria 113)

Office Hours

  • Tuesday from 2:00PM to 4:00PM, Room G39, located on the 2nd floor of Building G in viale Regina Elena 295.

Moodle Web Page

Students must subscribe to the Moodle web page, using the same credentials (username/password) they use to access the Wi-Fi network and Infostud services, at the following link: https://elearning.uniroma1.it/course/view.php?id=8460

Description and Goals

The amount, variety, and rate at which data is being generated nowadays, both by humans and machines, are unprecedented. This opens up a number of challenges on how to deal with such data, as traditional computing paradigms were not designed to operate at this scale.

"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.

This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.

Prerequisites

The course assumes that students are familiar with the basics of data analysis and machine learning, supported by a strong knowledge of the foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably using the Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.

Exams

Students must demonstrate their understanding of the subject by developing a software project that leverages the methodologies and tools introduced in class. Projects must address typical Big Data tasks (e.g., clustering, prediction, or recommendation) using very large datasets in any application domain of interest. The topic of the project must be agreed upon with the professor in advance; sources of interesting project ideas (e.g., Kaggle) will be suggested throughout the course. Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam in which submitted projects will be discussed in English; questions on any other topic addressed during the course may also be asked, and those can be answered in either English or Italian, as the student prefers.
A document containing the main guidelines for the final project is available here.

Recommended Textbooks

No textbooks are mandatory to successfully follow this course. However, there is a large set of references worth mentioning, especially for those who want to dig deeper into specific topics. Among those, some readings I would like to suggest are as follows:

  • Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] available online.
  • Big Data Analysis with Python [Marin, Shukla, VK]
  • Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
  • Spark: The Definitive Guide [Chambers, Zaharia]
  • Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
  • Hadoop: The Definitive Guide [White]
  • Python for Data Analysis [McKinney]

Syllabus

Introduction

  • The Big Data Phenomenon
  • The Big Data Infrastructure
    • Distributed File Systems (HDFS)
    • MapReduce (Hadoop)
    • Spark
  • PySpark + Google Colaboratory

Unsupervised Learning: Clustering

  • Similarity Measures
  • Algorithms: K-means
  • Example: Document Clustering

Dimensionality Reduction

  • Feature Extraction
  • Algorithms: Principal Component Analysis (PCA)
  • Example: PCA + Handwritten Digit Recognition

Supervised Learning

  • Basics of Machine Learning
  • Regression/Classification
  • Algorithms: Linear Regression/Logistic Regression/Random Forest
  • Examples:
    • Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
    • Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank term deposit)

Recommender Systems

  • Content-based vs. Collaborative filtering
  • Algorithms: k-NN, Matrix Factorization (MF)
  • Example: Movie Recommender System (MovieLens)

Graph Analysis

  • Link Analysis
  • Algorithms: PageRank
  • Example: Ranking (a sample of) the Google Web Graph

Real-time Analytics

  • Streaming Data Processing
  • Example: Twitter Hate Speech Detector

Environment Setup

PySpark + Google Colaboratory

In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Google Colaboratory (or "Colab" for short). This will allow you to write and execute PySpark (as well as pure Python, for that matter) in your browser, with:

  • Zero configuration required;
  • Free access to Google's powerful cloud infrastructure (including GPUs);
  • Easy sharing.

Of course, the same can also be achieved on your own local machine, but that would require: (i) dealing with clumsy installation issues that are very specific to your platform, and (ii) sticking to "small" rather than real "big" data, as your machine cannot compare with Google's infrastructure!

Still, in case you would also like to install Spark in local mode, the following are the steps (along with some references) you need to take.

Local Mode Setup

Prerequisites:

  • Install Python 3.6 (or later) via Anaconda along with Jupyter Notebook
  • Install Java 8
    • If your system has multiple JDK installations, use jenv to manage them (e.g., for macOS users, please refer to this link)
    • In your ~/.profile, ~/.bash_profile, or ~/.bashrc, enable jenv to manage multiple JDKs by adding the following two lines:
      • export PATH="$HOME/.jenv/bin:$PATH"
      • eval "$(jenv init -)"
    • Run jenv enable-plugin export to allow jenv to automatically set JAVA_HOME upon changes to Java local/shell/global versions
    • In your ~/.profile, ~/.bash_profile, or ~/.bashrc, set default JAVA_HOME (system-wide) as follows:
      • export JAVA_HOME=$(/usr/libexec/java_home -v $(jenv version-name))

Installation:

  • Create a conda environment specifically for PySpark in combination with Python 3.6 (or later), and call it for instance "PySpark" (although you can choose any name you want):
    • conda create -n PySpark python=3.6
    • Install required packages inside the newly created conda environment either via conda or via pip:
      • conda activate PySpark
      • conda install pip
      • conda install numpy
      • conda install scipy
      • conda install pandas
      • conda install scikit-learn
      • conda install seaborn
      • conda install ipykernel
      • conda install findspark
    • Install any additional packages:
      • conda install autopep8
      • ...
      • conda deactivate
    • Prepare a kernel for the newly created environment on Jupyter Notebook:
      • conda activate PySpark
      • python -m ipykernel install --user --name PySpark --display-name "PySpark"
      • conda deactivate
    • Download from Apache the latest version of Spark (e.g., 2.4.5)
    • Untar the downloaded archive:
      • tar -xzf spark-2.4.5-bin-hadoop2.7.tgz
    • Move the directory to a local folder (e.g., /opt/, /opt/local/, /usr/local/, etc.) [might require sudo/administrator's password]:
      • mv spark-2.4.5-bin-hadoop2.7 /usr/local/spark-2.4.5
    • Create a symlink so as to allow multiple versions of Spark:
      • ln -s /usr/local/spark-2.4.5 /usr/local/spark
    • Update your ~/.profile, ~/.bash_profile, or ~/.bashrc file as follows:
      • export SPARK_HOME=/usr/local/spark
      • export PATH=$SPARK_HOME/bin:$PATH
    • NOTE: THIS STEP IS ONLY NEEDED IF YOU HAVE MULTIPLE JDK VERSIONS INSTALLED
      • Go to /usr/local/spark/conf and create a spark-env.sh file (copying it from the template provided)
      • Force Spark to run on top of JDK 1.8 by copy-pasting the following into spark-env.sh:
        • export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
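
Once the steps above are done, a quick sanity check can confirm that the variables they set are actually visible to Python. The following is only a minimal sketch and not part of the official setup: the function name check_spark_environment and the keys it reports are hypothetical, and it merely inspects SPARK_HOME, JAVA_HOME, and PATH without starting Spark.

```python
import os
import shutil
from pathlib import Path


def check_spark_environment(env=None):
    """Report whether the local-mode prerequisites look satisfied.

    `env` defaults to the current process environment; passing a plain
    dict makes the check easy to try out in isolation.
    """
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    return {
        # SPARK_HOME should be exported in your shell profile
        "SPARK_HOME set": bool(spark_home),
        # ...and should point at the Spark directory (or its symlink)
        "SPARK_HOME is a directory": bool(spark_home) and Path(spark_home).is_dir(),
        # JAVA_HOME is set by jenv or by your shell profile
        "JAVA_HOME set": bool(env.get("JAVA_HOME")),
        # a `java` executable must be reachable through PATH
        "java on PATH": shutil.which("java", path=env.get("PATH", "")) is not None,
    }
```

Calling check_spark_environment() from a fresh terminal (so that your shell profile has been re-read) and looking for any False entry should point at the step that needs revisiting.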

Usage:

You can start running your local Spark installation either interactively by typing the shell command pyspark, or using Jupyter Notebook in combination with the virtual environment created as indicated above. Either way, before being able to import any PySpark library you will need to use findspark to allow Python to correctly locate your own PySpark installation. To do so, just type the following (again, either at the prompt after you have executed pyspark, or in the first code cell of your Jupyter Notebook, or anyway at the top of your Python script):

import findspark

findspark.init("${SPARK_HOME}")

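where ${SPARK_HOME} is a placeholder for the path to your local Spark installation (e.g., /usr/local/spark). Replace it with the actual path, since Python does not expand shell variables inside string literals.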
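
To avoid hard-coding the path, one option is to read it from the environment before calling findspark. The sketch below assumes findspark and Spark are installed as described above; the helper names resolve_spark_home and start_local_session, the fallback path, and the application name are illustrative choices rather than part of the course material.

```python
import os


def resolve_spark_home(default="/usr/local/spark"):
    """Prefer the SPARK_HOME environment variable; otherwise use `default`."""
    return os.environ.get("SPARK_HOME") or default


def start_local_session(app_name="smoke-test"):
    """Initialise findspark, then create a SparkSession in local mode.

    Imports are deferred so that this module can be loaded even on a
    machine where findspark/PySpark are not installed yet.
    """
    import findspark

    # Make the PySpark libraries under SPARK_HOME importable
    findspark.init(resolve_spark_home())

    from pyspark.sql import SparkSession

    # local[*] runs Spark in-process, using all available cores
    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()
```

With a working installation, calling spark = start_local_session(), running a tiny job such as spark.range(5).count(), and finishing with spark.stop() is a reasonable smoke test.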


Class Schedules

| Lecture # | Date | Topic | Material |
|---|---|---|---|
| Lecture 1 | 03/03/2020 | Introduction to Big Data: Motivations and Challenges | [slides: PDF] |
| Lecture 2 | 03/04/2020 | MapReduce Programming Model | [slides: PDF] |
| Lecture 3 | 03/10/2020 | Apache Spark + PySpark Tutorial (Part I) | [slides: PDF, notebook: ipynb] |
| Lecture 4 | 03/11/2020 | PySpark Tutorial (Part II) + Clustering: Data Representation | [slides: PDF, notebook: ipynb] |
| Lecture 5 | 03/17/2020 | Clustering: Distance Measures | [slides: PDF] |
| Lecture 6 | 03/18/2020 | Clustering Algorithms: K-means | [slides: PDF] |
| Lecture 7 | 03/24/2020 | Document Clustering with PySpark | [slides: PDF, notebook: ipynb] |
| Lecture 8 | 03/25/2020 | Dimensionality Reduction (Principal Component Analysis) | [slides: PDF, notes: PDF] |
| Lecture 9 | 03/31/2020 | Principal Component Analysis with PySpark | [notebook: ipynb] |
| Lecture 10 | 04/01/2020 | Supervised Learning | [slides: PDF] |
| Lecture 11 | 04/07/2020 | Linear Regression | [slides: PDF] |
| Lecture 12 | 04/08/2020 | Linear Regression with PySpark | [notebook: ipynb] |
| Lecture 13 | 04/15/2020 | Logistic Regression | [slides: PDF, notes: PDF] |
| Lecture 14 | 04/21/2020 | Gradient Descent | [slides: PDF] |
| Lecture 15 | 04/22/2020 | Decision Trees and Ensembles | [slides: PDF] |
| Lecture 16 | 04/28/2020 | Evaluation Metrics for Classification | [slides: PDF] |
| Lecture 17 | 04/29/2020 | Classification with PySpark | [notebook: ipynb] |
| Lecture 18 | 05/05/2020 | Recommender Systems (Part I) | [slides: PDF] |
| Lecture 19 | 05/06/2020 | Recommender Systems (Part II) | [slides: PDF] |
| Lecture 20 | 05/12/2020 | Recommender Systems (Matrix Factorization) with PySpark | [notebook: ipynb] |
| Lecture 21 | 05/13/2020 | Graph Link Analysis | [slides: PDF] |
| Lecture 22 | 05/20/2020 | PageRank with PySpark | [slides: PDF, notes: PDF, notebook: ipynb] |
| Lecture 23 | 05/26/2020 | Streaming Data Processing | [slides: PDF] |
| Lecture 24 | 05/27/2020 | Streaming Classification with PySpark + The Last Take Home Message | [notebook: ipynb, slides: PDF] |