Applied Data Analysis and Machine Learning

This site contains all material relevant for the course on Applied Data Analysis and Machine Learning.

Introduction

Probability theory and statistical methods play a central role in science. Nowadays we are surrounded by huge amounts of data. For example, there are more than one trillion web pages; more than one hour of video is uploaded to YouTube every second, amounting to years of content every day; and the genomes of thousands of people, each with a length of more than a billion base pairs, have been sequenced by various labs. This deluge of data calls for automated methods of data analysis, which is exactly what machine learning aims to provide.

Learning outcomes

This course aims at giving you insights and knowledge about many of the central algorithms used in Data Analysis and Machine Learning. The course is project based, and through various numerical projects and weekly exercises you will be exposed to fundamental research problems in these fields, with the aim of reproducing state-of-the-art scientific results. Both supervised and unsupervised methods will be covered. The emphasis is on a frequentist approach, with a focus on predictions and correlations. However, we will try, where appropriate, to link our machine learning models with a Bayesian approach as well. You will learn to develop and structure large codes for studying different cases where machine learning is applied, get acquainted with computing facilities and learn to handle large scientific projects. Good scientific and ethical conduct is emphasized throughout the course. More specifically, after this course you will

  • Learn about basic data analysis, statistical analysis, Bayesian statistics, Monte Carlo sampling, data optimization and machine learning;
  • Be capable of extending the acquired knowledge to other systems and cases;
  • Have an understanding of central algorithms used in data analysis and machine learning;
  • Understand linear methods for regression and classification, from ordinary least squares, via Lasso and Ridge to Logistic regression and Kernel regression;
  • Learn about neural networks and deep learning methods for supervised and unsupervised learning, with emphasis on feed forward neural networks, convolutional and recurrent neural networks;
  • Learn about decision trees, random forests, bagging and boosting methods;
  • Learn about support vector machines and kernel transformations;
  • Learn about reduction of data sets and unsupervised learning, from PCA to clustering;
  • Learn about autoencoders and reinforcement learning;
  • Work on numerical projects to illustrate the theory. The projects play a central role and you are expected to know modern programming languages like Python or C++ and/or Fortran (Fortran2003 or later).

Prerequisites and background

Basic knowledge of programming and mathematics, with an emphasis on linear algebra. Knowledge of Python and/or C++ as programming languages is strongly recommended, and experience with Jupyter notebooks is recommended. We also recommend refreshing your knowledge of statistics and probability theory. The lecture notes at https://compphysics.github.io/MachineLearning/doc/LectureNotes/_build/html/intro.html offer a review of statistics and probability theory.

The course has two central parts

  1. Statistical analysis and optimization of data
  2. Machine learning algorithms and Deep Learning

Statistical analysis and optimization of data

The following topics are normally covered:

  • Basic concepts, expectation values, variance, covariance, correlation functions and errors;
  • Simpler models, binomial distribution, the Poisson distribution, simple and multivariate normal distributions;
  • Central elements of Bayesian statistics and modeling;
  • Gradient methods for data optimization;
  • Monte Carlo methods, Markov chains, Gibbs sampling and Metropolis-Hastings sampling;
  • Estimation of errors and resampling techniques such as the cross-validation, blocking, bootstrapping and jackknife methods;
  • Principal Component Analysis (PCA) and its mathematical foundation
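As an illustration of the resampling techniques in the list above, here is a minimal sketch of a bootstrap estimate of the standard error of a sample mean. It uses only the Python standard library. The function name `bootstrap_standard_error`, the number of resamples and the synthetic Gaussian data are assumptions made for this example and are not taken from the course material.

```python
import random
import statistics


def bootstrap_standard_error(data, n_resamples=1000, seed=0):
    """Estimate the standard error of the sample mean by bootstrapping."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    resampled_means = []
    for _ in range(n_resamples):
        # Draw n points from the data with replacement and record their mean.
        resample = rng.choices(data, k=n)
        resampled_means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(resampled_means)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data (an assumption for illustration): 200 draws from N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(200)]
    se_boot = bootstrap_standard_error(data)
    se_formula = statistics.stdev(data) / len(data) ** 0.5
    print(f"bootstrap SE: {se_boot:.4f}, analytic SE: {se_formula:.4f}")
```

For the sample mean the bootstrap value should land close to the textbook estimate, the sample standard deviation divided by the square root of the sample size. The bootstrap becomes more useful for estimators where no such closed-form expression is available.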

Machine learning

The following topics are typically covered:

  • Linear Regression and Logistic Regression;
  • Neural networks and deep learning, including convolutional and recurrent neural networks
  • Decision trees, Random Forests, Bagging and Boosting
  • Support vector machines
  • Bayesian linear and logistic regression
  • Boltzmann Machines and generative models
  • Unsupervised learning: dimensionality reduction, PCA, k-means and clustering
  • Autoencoders
  • Generative algorithms
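To give a flavor of the first topic in this list, the following is a minimal sketch of ordinary least squares and Ridge regression in closed form with NumPy. The function names `ols_fit` and `ridge_fit`, the synthetic straight-line data and the penalty value are assumptions chosen for illustration. Note that this simple version also penalizes the intercept, which real applications often avoid, for example by centering the data first.

```python
import numpy as np


def ols_fit(X, y):
    """Ordinary least squares via the pseudo-inverse of the design matrix."""
    return np.linalg.pinv(X) @ y


def ridge_fit(X, y, lmbda):
    """Ridge regression: solve (X^T X + lambda I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lmbda * np.eye(n_features), X.T @ y)


if __name__ == "__main__":
    rng = np.random.default_rng(2023)
    # Synthetic data (an assumption): y = 2 + 3x plus small Gaussian noise.
    x = rng.uniform(0.0, 1.0, size=100)
    y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=100)
    # Design matrix with an intercept column and a linear term.
    X = np.column_stack([np.ones_like(x), x])
    print("OLS coefficients:  ", ols_fit(X, y))
    print("Ridge coefficients:", ridge_fit(X, y, lmbda=1.0))
```

With `lmbda=0` the Ridge solution coincides with OLS whenever `X.T @ X` is invertible. Increasing `lmbda` shrinks the coefficient vector toward zero.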

Not all of these topics are necessarily covered in FYS-STK3155/4155. Some of them, like generative models and Bayesian statistics, are covered by the advanced course FYS5429.

Hands-on demonstrations, exercises and projects aim at deepening your understanding of these topics.

Computational aspects play a central role and you are expected to work on numerical examples and projects which illustrate the theory and various algorithms discussed during the lectures. We strongly recommend forming small project groups of 2-3 participants, if possible.

Instructor information

Teaching team Fall 2023

Discord as discussion tool

Practicalities

  1. The sessions on Tuesdays and Wednesdays last four hours for each group (four groups in total) and include lectures in a flipped mode (promoting active learning) and work on exercises and projects. Each session begins with lectures and questions and answers about the material to be covered that week.

  2. There are four groups: Tuesdays 8:15am-12pm and 12:15pm-4pm, and Wednesdays 8:15am-12pm and 12:15pm-4pm. Please sign up for one of the groups as soon as possible. The maximum capacity per group is 30-40 participants.

  3. On Mondays we have a regular lecture. These lectures start at 10:15am and end at 12pm, and they are recorded.

  4. There are three graded projects, each counting 1/3 of the final grade.

  5. There is a selected number of weekly assignments. The weekly assignments can be handed in and for all assignments you can get an extra score of 20 points to the final grade.

  6. The course is part of the CS Master of Science program, but is open to other bachelor and Master of Science students at the University of Oslo.

  7. The course is offered as a so-called cloned course: FYS-STK4155 at the Master of Science level and FYS-STK3155 as a senior undergraduate course.

  8. Videos of teaching material are available via the links at https://compphysics.github.io/MachineLearning/doc/web/course.html.

  9. A weekly email with a summary of activities will be sent to all participants.

Grading

Grading scale: Grades are awarded on a scale from A to F, where A is the best grade and F is a fail. There are three projects which are graded and each project counts 1/3 of the final grade. The total score is thus the average from all three projects.

The final number of points is based on the average of all projects (including any additional points), and the grade is determined by the following table:

  • 92-100 points: A
  • 77-91 points: B
  • 58-76 points: C
  • 46-57 points: D
  • 40-45 points: E
  • 0-39 points: F-failed
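The grade table can be expressed as a small lookup, sketched here under stated assumptions. The function names `letter_grade` and `final_grade` are made up for this example. A fractional average is compared against the lower bound of each band without rounding, which is an assumption since the table only lists whole points. Extra-credit points are left out.

```python
def letter_grade(points):
    """Map a final score in points (0-100) to a letter grade from the table."""
    if not 0 <= points <= 100:
        raise ValueError("points must be between 0 and 100")
    # Lower bound of each band, from best to worst grade.
    thresholds = [(92, "A"), (77, "B"), (58, "C"), (46, "D"), (40, "E")]
    for lower_bound, grade in thresholds:
        if points >= lower_bound:
            return grade
    return "F"  # below 40 points is a fail


def final_grade(project_scores):
    """Average the project scores (equal weights) and convert to a letter."""
    average = sum(project_scores) / len(project_scores)
    return letter_grade(average)


if __name__ == "__main__":
    # Hypothetical scores for the three projects.
    print(final_grade([80, 90, 100]))
```

Three project scores of 80, 90 and 100 average to 90 points, which falls in the 77-91 band and therefore gives a B.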

In summary

| Activity | Fraction of total grade |
| --- | --- |
| First project, due October 7 | 1/3 |
| Second project, due November 4 | 1/3 |
| Third project, due December 9 | 1/3 |
| Extra credit (not mandatory): weekly exercise assignments, 10 in total (due Fridays) | 10% |

Required Technologies

Course participants are expected to have their own laptops/PCs. We use Git as version control software and the use of providers like GitHub, GitLab or similar is strongly recommended. If you are not familiar with Git as version control software, the following video may be of interest: https://www.youtube.com/watch?v=RGOj5yH7evk&ab_channel=freeCodeCamp.org

We will make extensive use of Python as programming language and its myriad of available libraries. You will find Jupyter notebooks invaluable in your work. You can also run R code in the Jupyter/IPython notebooks, with the immediate benefit of visualizing your data. You can also use compiled languages like C++, Rust, Julia, Fortran etc. if you prefer. The focus in these lectures will be on Python.

If you have Python installed and you feel pretty familiar with installing different packages, we recommend that you install the following Python packages via pip as

  • pip install numpy scipy matplotlib ipython scikit-learn mglearn sympy pandas pillow
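After running the pip command, a short script like the following sketch can confirm that each package is importable. The helper name `check_packages` is an assumption for this example. The mapping from pip names to import names reflects the fact that some packages are imported under a different name than the one used for installation (for instance scikit-learn is imported as `sklearn` and pillow as `PIL`).

```python
import importlib.util

# pip package name -> name used in an import statement
PACKAGES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "ipython": "IPython",
    "scikit-learn": "sklearn",
    "mglearn": "mglearn",
    "sympy": "sympy",
    "pandas": "pandas",
    "pillow": "PIL",
}


def check_packages(packages=PACKAGES):
    """Return {pip_name: bool} telling whether each module can be found."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in packages.items()
    }


if __name__ == "__main__":
    for pip_name, found in check_packages().items():
        print(f"{pip_name:15s} {'ok' if found else 'MISSING'}")
```

The check only looks up whether a module can be located and does not import it, so it runs quickly and does not fail on the first missing package.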

For macOS (OSX) users we recommend installing Xcode first and then Homebrew (brew). Brew allows for a seamless installation of additional software, for example via

  • brew install python

For Linux users, for example on the widely used Ubuntu distribution, you can use pip as well and install Python with

  • sudo apt-get install python3

You can specify the Python version you wish to install.

For various dependencies, we recommend installing a light variant of conda.

Python installers

If you don't want to perform these operations separately and venture into the hassle of exploring how to set up dependencies and paths, we recommend two widely used distributions which set up all relevant dependencies for Python, namely

Anaconda, which is an open source distribution of the Python and R programming languages for large-scale data processing, predictive analytics, and scientific computing, that aims to simplify package management and deployment. Package versions are managed by the package management system conda.

Enthought Canopy is a Python distribution and analysis environment for scientific and analytic computing, available for free and under a commercial license.

Furthermore, Google's Colab:https://colab.research.google.com/notebooks/welcome.ipynb is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. Try it out!

Useful Python libraries

Here we list several useful Python libraries we strongly recommend (if you use Anaconda, many of these are already installed):

  • NumPy:https://www.numpy.org/ is a highly popular library for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays

  • The pandas:https://pandas.pydata.org/ library provides high-performance, easy-to-use data structures and data analysis tools

  • Xarray:http://xarray.pydata.org/en/stable/ is a Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

  • SciPy:https://www.scipy.org/ (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

  • Matplotlib:https://matplotlib.org/ is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

  • Autograd:https://github.com/HIPS/autograd can automatically differentiate native Python and Numpy code. It can handle a large subset of Python's features, including loops, ifs, recursion and closures, and it can even take derivatives of derivatives of derivatives

  • JAX:https://jax.readthedocs.io/en/latest/index.html has now more or less replaced Autograd. JAX is Autograd and XLA, brought together for high-performance numerical computing and machine learning research. It provides composable transformations of Python+NumPy programs: differentiate, vectorize, parallelize, Just-In-Time compile to GPU/TPU, and more.

  • SymPy:https://www.sympy.org/en/index.html is a Python library for symbolic mathematics.

  • scikit-learn:https://scikit-learn.org/stable/ has simple and efficient tools for machine learning, data mining and data analysis

  • TensorFlow:https://www.tensorflow.org/ is a Python library for fast numerical computing created and released by Google

  • Keras:https://keras.io/ is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano

  • And many more, such as PyTorch:https://pytorch.org/, Theano:https://pypi.org/project/Theano/ etc.

Textbooks

Recommended textbooks:

The lecture notes are collected as a jupyter-book at https://compphysics.github.io/MachineLearning/doc/LectureNotes/_build/html/intro.html. In addition to the lecture notes, we recommend the books of Goodfellow et al. and Raschka et al. We will follow these texts closely and the weekly reading assignments refer to these two texts.

The text by Raschka et al. is well adapted to this course and contains many coding examples. In addition, you may find the following textbooks interesting.

Additional textbooks:

General books on statistical analysis:

  • Christian Robert and George Casella, Monte Carlo Statistical Methods, Springer
  • Peter Hoff, A First Course in Bayesian Statistical Methods, Springer

Links to relevant courses at the University of Oslo