diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md
index f1bd7f1..1ed41b5 100644
--- a/_episodes/02-regression.md
+++ b/_episodes/02-regression.md
@@ -17,15 +17,23 @@ keypoints:
 - "Scikit Learn includes a polynomial modelling function which is useful for modelling non-linear data."
 ---
 
-## About Scikit-Learn
+# About Scikit-Learn
 
 [Scikit-Learn](http://github.com/scikit-learn/scikit-learn) is a python package designed to give access to well-known machine learning algorithms within Python code, through a clean API. It has been built by hundreds of contributors from around the world, and is used across industry and academia.
 
 Scikit-Learn is built upon Python's [NumPy (Numerical Python)](http://numpy.org) and [SciPy (Scientific Python)](http://scipy.org) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, Scikit-Learn is not specifically designed for extremely large datasets, though there is [some work](https://github.com/ogrisel/parallel_ml_tutorial) in this area. For this introduction to ML we are going to stick to processing small to medium datasets with Scikit-Learn, without the need for a graphical processing unit (GPU).
 
-# Supervised Learning intro
+# Supervised Learning
 
-blah
+Classical machine learning is often divided into two categories: Supervised and Unsupervised Learning.
+
+In supervised learning we act as a "supervisor" or "teacher" for our ML algorithm by providing it with "labelled data", that is, data that contains example answers of what we wish the algorithm to achieve.
+
+For instance, if we wish to train our algorithm to distinguish between images of cats and dogs, we would provide it with images that have already been labelled as "cat" or "dog" so that it can learn from these examples. If we wished to train our algorithm to predict house prices over time, we would provide it with example house prices that are "labelled" with time values.
+
+Supervised learning is split into two further categories: classification and regression. For classification the labelled data is discrete, such as the "cat" or "dog" example, whereas for regression the labelled data is continuous, such as the house price example.
+
+In this episode we will explore how we can use regression to build a "model" that can be used to make predictions.
 
 ## Linear Regression with Scikit-Learn
diff --git a/_episodes/04-clustering.md b/_episodes/04-clustering.md
index bd03ea1..f31d2d9 100644
--- a/_episodes/04-clustering.md
+++ b/_episodes/04-clustering.md
@@ -20,6 +20,19 @@ keypoints:
 - "Scikit-Learn has functions to create example data."
 ---
 
+# Unsupervised Learning
+
+In episode 2 we learnt about Supervised Learning. Now it is time to explore Unsupervised Learning.
+
+Sometimes we do not have the luxury of using labelled data. This could be for a number of reasons:
+
+* We have labelled data, but not enough to accurately train our model
+* Our existing labelled data is low-quality or inaccurate
+* It is too time-consuming to (manually) label more data
+* We have data, but no idea what correlations might exist that we could model!
+
+In this case we need to use unsupervised learning. As the name suggests, this time we do not "supervise" the ML algorithm by providing it with labels; instead we let it try to find its own patterns in the data and report back on any correlations that it might find. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself.
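+
+As a minimal sketch of this idea, we can hand Scikit-Learn some synthetic, unlabelled points and let the k-means algorithm discover groupings on its own (the `make_blobs` and `KMeans` functions and the parameter values used here are just illustrative choices; clustering itself is covered properly below):
+
+```python
+import sklearn.cluster as skl_cluster
+import sklearn.datasets as skl_datasets
+
+# Generate some example data and deliberately ignore the labels it could give us.
+data, _ = skl_datasets.make_blobs(n_samples=300, centers=4, random_state=1)
+
+# Ask k-means to find 4 groups purely from the structure of the data.
+kmeans = skl_cluster.KMeans(n_clusters=4, random_state=1)
+discovered_labels = kmeans.fit_predict(data)
+
+print(discovered_labels[:10])  # cluster ids the algorithm assigned by itself
+```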
+
 # Clustering
 
 Clustering is the grouping of data points which are similar to each other. It can be a powerful technique for identifying patterns in data.
diff --git a/_episodes/05-dimensionality-reduction.md b/_episodes/05-dimensionality-reduction.md
index 6760aaa..f4c0335 100644
--- a/_episodes/05-dimensionality-reduction.md
+++ b/_episodes/05-dimensionality-reduction.md
@@ -16,7 +16,7 @@ keypoints:
 
 # Dimensionality reduction
 
-As seen in the last episode, general clustering algorithms work well with low-dimensional data. In this episode we will work with higher-dimension data such as images of handwritten text or numbers. The dataset we will be using is the Modified National Institute of Standards and Technology (MNIST) dataset. The MNIST dataset contains 60,000 handwritten labelled images from 0-9. An illustration of the dataset is presented below.
+As seen in the last episode, general clustering algorithms work well with low-dimensional data. In this episode we will work with higher-dimension data such as images of handwritten text or numbers. The dataset we will be using is the Modified National Institute of Standards and Technology (MNIST) dataset. The MNIST dataset contains 60,000 handwritten labelled images from 0-9. An illustration of the dataset is presented below. Our MNIST data has 3 dimensions: an x-component, a y-component, and an alpha (pixel intensity) value at each (x, y) coordinate.
 
 ![MNIST example illustrating all the classes in the dataset](../fig/MnistExamples.png)
 
@@ -44,9 +44,10 @@ y = digits.target
 
 Linear clustering approaches such as k-means would require all the images to be binned into a pre-determined number of clusters, which might not adequately capture the variability in the images.
 
-Non-linear spectral clustering might fare better, but it would require the images to be projected into a higher dimension space, and separating the complex projections in higher-order spaces would necessitate complex non-linear separators.
+Non-linear clustering, such as spectral clustering, might fare better, but it requires the images to be projected into a higher-dimensional space. Separating the complex data in these higher-order spaces requires complex non-linear separators, which greatly increases the computational cost.
+
+We can reduce the computational cost of clustering by transforming our higher-dimension input dataset into lower-order projections. Conceptually, this is done by determining which combination of variables explains the most variance in the data, and then working with those variables.
 
-One option is to reduce the dimensions of the input dataset into a 2D vector space while preserving their local representations. This would transform the high-dimension input dataset into lower-order projections. These lower-order projections can then be separated using linear separators while preserving the variability of images within the dataset.
 ## Dimensionality reduction with Scikit-Learn
 
 We will look at two commonly used techniques for dimensionality reduction: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). Both of these techniques are supported by Scikit-Learn.
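+
+As a minimal sketch of what this looks like in practice, both techniques can project the 64-dimensional digits data down to 2 dimensions in just a few lines. The `n_components` and `random_state` values below are illustrative, and we reload Scikit-Learn's `digits` dataset (the one referred to above) so the snippet is self-contained:
+
+```python
+from sklearn import datasets, decomposition, manifold
+
+digits = datasets.load_digits()   # 1,797 8x8 images, flattened to 64 features each
+x = digits.data
+
+# PCA: keep the 2 directions (principal components) that explain the most variance.
+pca = decomposition.PCA(n_components=2)
+x_pca = pca.fit_transform(x)
+
+# t-SNE: a non-linear embedding of the same data into 2 dimensions.
+tsne = manifold.TSNE(n_components=2, random_state=1)
+x_tsne = tsne.fit_transform(x)
+
+print(x.shape, x_pca.shape, x_tsne.shape)   # (1797, 64) (1797, 2) (1797, 2)
+```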