diff --git a/_episodes/02-regressionJens.md b/02-regressionJens.md similarity index 100% rename from _episodes/02-regressionJens.md rename to 02-regressionJens.md diff --git a/_episodes/01-introduction.md b/_episodes/01-introduction.md index 3edaf6c..4de21d6 100644 --- a/_episodes/01-introduction.md +++ b/_episodes/01-introduction.md @@ -1,10 +1,10 @@ --- title: "Introduction" -teaching: 30 +teaching: 20 exercises: 10 questions: -- What is machine learning? -- What are some useful machine learning techniques? +- "What is machine learning?" +- "What are some useful machine learning techniques?" objectives: - "Gain an overview of what machine learning is and the techniques available." - "Understand how machine learning and artificial intelligence differ." @@ -19,7 +19,7 @@ keypoints: # What is machine learning? -Machine learning is a set of techniques that enable computers to improve in their performance of a given task. This is similar in concept to how humans learn to make predictions based upon previous experience and knowledge. Machine learning encompasses a wide range of activities, but broadly speaking it can be used to: find trends in a dataset, classify data into groups or categories, make decisions and predictions based upon data, and even "learn" how to interact with an environment when provided with goals to achieve. +Machine learning is a set of techniques that enable computers to use data to improve in their performance of a given task. This is similar in concept to how humans learn to make predictions based upon previous experience and knowledge. Machine learning encompasses a wide range of activities, but broadly speaking it can be used to: find trends in a dataset, classify data into groups or categories, make decisions and predictions based upon data, and even "learn" how to interact with an environment when provided with goals to achieve. ### Machine learning in our daily lives @@ -45,15 +45,26 @@ Machine learning has quickly become an important technology and is now frequentl The term machine learning (ML) is often mentioned alongside artificial intelligence (AI) and deep learning (DL). Deep learning is a subset of machine learning, and machine learning is a subset of artificial intelligence. -AI is a broad term used to describe a system possessing a "general intelligence" that can be applied to solve problems, often mimicking the behaviour of intelligent biological systems. Another definition of AI dates back to the 1950s and Alan Turing's "Immitation Game". Turing said we could consider a system intelligent when it could fool a human into thinking they were talking to another human when they were actually talking to a computer. Modern attempts are getting close to fooling humans, but although there have been great advances in AI and ML research, human-like intelligence is only possible in a few specialist areas. +AI is a broad term used to describe a system possessing a "general intelligence" that can be applied to solve a diverse range of problems, often mimicking the behaviour of intelligent biological systems. Another definition of AI dates back to the 1950s and Alan Turing's "Imitation Game". Turing said we could consider a system intelligent when it could fool a human into thinking they were talking to another human when they were actually talking to a computer. Modern attempts are getting close to fooling humans, but although there have been great advances in AI and ML research, human-like intelligence is only possible in a few specialist areas.
-ML refers to techniques where a computer can "learn" patterns in data, usually by being shown many training examples. While computers can learn to solve specific problems, or multiple similar problems, they are not considered to possess a general intelligence. Computers often need hundreds or thousands of examples to learn a task and are confined to relatively simple classifications. A human-like system could learn much quicker, and potentially learn from a single example by using it's knowledge of many other problems. +ML refers to techniques where a computer can "learn" patterns in data, usually by being shown many training examples. While ML-algorithms can learn to solve specific problems, or multiple similar problems, they are not considered to possess a general intelligence. ML-algorithms often need hundreds or thousands of examples to learn a task and are confined to tasks such as simple classifications. A human-like system could learn much quicker than this, and potentially learn from a single example by using its knowledge of many other problems. -DL is a particular field of machine learning where algorithms called neural networks are used to create highly-complex systems. Large collections of neural networks are able to learn from vast quantities of data. Deep learning can be used to solve a wide range of problems, but it can also require huge amounts of input data and computational resources to train. The image below shows some of the relationships between artificial intelligence, machine learning and deep learning. +DL is a particular field of machine learning where algorithms called neural networks are used to create highly-complex systems. Large collections of neural networks are able to learn from vast quantities of data. Deep learning can be used to solve a wide range of problems, but it can also require huge amounts of input data and computational resources to train. + +The image below shows the relationships between artificial intelligence, machine learning and deep learning. ![An infographic showing some of the relationships between AI, ML, and DL](../fig/01_AI_ML_DL_differences.png) The image above is by Tukijaaliwa, CC BY-SA 4.0, via Wikimedia Commons, original source +> ## Where have you encountered machine learning already? +> Now that we have explored machine learning in a bit more detail, discuss with the person next to you: +> +> 1. Where have I seen machine learning in use? +> 2. What kind of input data does that machine learning system use to make predictions/classifications? +> 3. Is there any evidence that your interaction with the system contributes to further training? +> 4. Do you have any examples of the system failing? +{: .challenge} + # What are some useful types of Machine Learning? This lesson will introduce you to some of the key concepts and sub-domains of ML such as supervised learning, unsupervised learning, and neural networks. @@ -67,7 +78,7 @@ The figure below provides a nice overview of some of the sub-domains of ML and t ### Garbage in = garbage out -There is a classic expression in computer science, "garbage in = garbage out". This means that if the input data we use is garbage then the ouput will be too. If, for eample, we try to use a machine learning system to find a link between two unlinked variables then it may well manage to produce a model attempting this, but the output will be meaningless. +There is a classic expression in computer science, "garbage in = garbage out". This means that if the input data we use is garbage then the output will be too. If, for example, we try to use a machine learning system to find a link between two unlinked variables then it may well manage to produce a model attempting this, but the output will be meaningless.
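+To make this concrete, here is a minimal sketch (it assumes the scikit-learn and NumPy packages, which later episodes introduce properly, and uses randomly generated data purely for illustration). Fitting a regression model to two completely unrelated random variables "succeeds", but the fitted model explains essentially nothing:
+
+~~~
+import numpy as np
+from sklearn.linear_model import LinearRegression
+
+rng = np.random.default_rng(0)
+x = rng.normal(size=(100, 1))          # a "feature" that is pure noise
+y = rng.normal(size=100)               # an unrelated variable, also pure noise
+
+model = LinearRegression().fit(x, y)   # the fit runs without complaint...
+print(model.score(x, y))               # ...but the R^2 score is close to 0: the "link" is meaningless
+~~~
+{: .language-python}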
### Biases due to training data @@ -85,13 +96,4 @@ Sometimes ML algorithms become over-trained and subsequently don't perform well Machine learning techniques will return an answer based on the input data and model parameters even if that answer is wrong. Most systems are unable to explain the logic used to arrive at that answer. This can make detecting and diagnosing problems difficult. -> ## Where have you encountered machine learning already? -> Now that we have explored machine learning in a bit more detail, discuss with the person next to you: -> -> 1. Where have I seen machine learning in use? -> 2. What kind of input data does that machine learning system use to make predictions/classifications? -> 3. Is there any evidence that your interaction with the system contributes to further training? -> 4. Do you have any examples of the system failing? -{: .challenge} - {% include links.md %} diff --git a/_episodes/02-regression.md b/_episodes/02-regression.md index 169a0ac..1ed41b5 100644 --- a/_episodes/02-regression.md +++ b/_episodes/02-regression.md @@ -3,11 +3,11 @@ title: "Regression" teaching: 30 exercises: 20 questions: -- "How can I process data using Scikit-Learn?" +- "What is Supervised Learning?" +- "How can I model data and make predictions using regression?" objectives: -- "Be aware of the built-in linear regression functions in Scikit-Learn." -- "Measure the error between a regression model and real data." - "Apply linear regression with Scikit-Learn to create a model." +- "Measure the error between a regression model and real data." - "Analyse and assess the accuracy of a linear model using Scikit-Learn's metrics library." - "Understand how more complex models can be built with non-linear equations." - "Apply polynomial modelling to non-linear data using Scikit-Learn." @@ -17,15 +17,23 @@ keypoints: - "Scikit Learn includes a polynomial modelling function which is useful for modelling non-linear data." --- - ## About Scikit-Learn +# About Scikit-Learn [Scikit-Learn](http://github.com/scikit-learn/scikit-learn) is a python package designed to give access to well-known machine learning algorithms within Python code, through a clean API. It has been built by hundreds of contributors from around the world, and is used across industry and academia. Scikit-Learn is built upon Python's [NumPy (Numerical Python)](http://numpy.org) and [SciPy (Scientific Python)](http://scipy.org) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, Scikit-Learn is not specifically designed for extremely large datasets, though there is [some work](https://github.com/ogrisel/parallel_ml_tutorial) in this area. For this introduction to ML we are going to stick to processing small to medium datasets with Scikit-Learn, without the need for a graphical processing unit (GPU). -# Supervised Learning intro +# Supervised Learning + +Classical machine learning is often divided into two categories – Supervised and Unsupervised Learning. + +For the case of supervised learning we act as a "supervisor" or "teacher" for our ML-algorithms by providing the algorithm with "labelled data" that contains example answers of what we wish the algorithm to achieve. + +For instance, if we wish to train our algorithm to distinguish between images of cats and dogs, we would provide our algorithm with images that have already been labelled as "cat" or "dog" so that it can learn from these examples. If we wished to train our algorithm to predict house prices over time we would provide our algorithm with example data of house prices that are "labelled" with time values. + +Supervised learning is split up into two further categories: classification and regression. For classification the labelled data is discrete, such as the "cat" or "dog" example, whereas for regression the labelled data is continuous, such as the house price example.
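+To make the idea of "labelled data" concrete, here is a minimal sketch of the house-price case (the numbers are invented purely for illustration, and Scikit-Learn's regression functions are covered properly below):
+
+~~~
+import numpy as np
+from sklearn.linear_model import LinearRegression
+
+years = np.array([[2016], [2017], [2018], [2019], [2020]])  # inputs (features)
+prices = np.array([180, 192, 205, 214, 230])                # labels: the example answers we supervise with
+
+model = LinearRegression()
+model.fit(years, prices)          # learn from the labelled examples
+print(model.predict([[2021]]))    # predict a price for a year the model has not seen
+~~~
+{: .language-python}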
-blah +In this episode we will explore how we can use regression to build a "model" that can be used to make predictions. ## Linear Regression with Scikit-Learn diff --git a/_episodes/03-classification.md b/_episodes/03-classification.md index 33506e7..66be274 100644 --- a/_episodes/03-classification.md +++ b/_episodes/03-classification.md @@ -3,28 +3,23 @@ title: "Classification" teaching: 15 exercises: 20 questions: -- "How can I use scikit-learn to classify data?" +- "How can I classify data into known categories?" objectives: -- "Use two different methods to classify data" -- "Understand the difference between supervised and unsupervised learning" +- "Use two different supervised methods to classify data." +- "Learn about the concept of Hyper-parameters." - "Learn to validate and ?cross-validate? models" -[//]: # (- "Recall that scikit-learn has built in linear regression functions.") keypoints: - "Classification requires labelled data (is supervised)" -- --- # Classification -Classification is the process of assigning items to classes, based on observation of some features. Where regression uses observations (x) to predict a numerical value (y), classification predicts a categorical fit to a class. +Classification is a supervised method to recognise and group data objects into pre-determined categories. Where regression uses labelled observations to predict a continuous numerical value, classification predicts a discrete categorical fit to a class. Classification in ML leverages a wide range of algorithms to classify a set of data/datasets into their respective categories. -## Supervised vs. unsupervised learning -(this is probably introduced in Regression, so not needed?) +In this lesson we are going to introduce the concept of supervised classification by classifying penguin data into different species of penguins using Scikit-Learn. ## The Penguin dataset -We're going to be using the penguins dataset, which comprises 342 observations of penguins of three different species: Adelie, Chinstrap & Gentoo. For each penguin we're given measurements of its bill length and depth (mm), flipper length (mm) and body mass (g). - -source: [HERE](https://github.com/allisonhorst/palmerpenguins) +We're going to be using the penguins dataset of Allison Horst, published [here](https://github.com/allisonhorst/palmerpenguins) in 2020, which comprises 342 observations of three species of penguins: Adelie, Chinstrap & Gentoo. For each penguin we have measurements of its bill length and depth (mm), flipper length (mm) and body mass (g), as well as information on its species, island, and sex. ~~~ import seaborn as sns @@ -34,12 +29,20 @@ dataset.head() ~~~ {: .language-python} -Our aim is to develop a classification model that will predict the species of a penguin given those measurements.
+Our aim is to develop a classification model that will predict the species of a penguin based upon measurements of those variables. + +As a rule of thumb for ML/DL modelling, it is best to start with a simple model and progressively add complexity in order to meet our desired classification performance. + +While we are learning these classification methods we will limit our dataset to only the numerical features bill_length, bill_depth, flipper_length, and body_mass as we attempt to classify species. + +The above table contains multiple categorical objects such as species. If we attempt to include the other categorical fields, island and sex, we hinder classification performance due to the complexity of the data. ### Training-testing split When undertaking any machine learning project, it's important to be able to evaluate how well your model works. In order to do this, we set aside some data (usually 20%) as a testing set, leaving the rest as your training dataset. -{callout} It's important to do this early, and to do all of your work with the training dataset - this avoids any risk of you as the developer introducing bias to the model based on your own observations of data in the testing set. +> ## Why do we do this? +> It's important to do this early, and to do all of your work with the training dataset - this avoids any risk of you introducing bias to the model based on your own observations of data in the testing set, and can highlight when you are over-fitting on your training data. +{: .callout} ~~~ # Extract the data we need @@ -59,17 +62,17 @@ Having extracted our features (X) and labels (Y), we can now split the data ~~~ from sklearn.model_selection import train_test_split -X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0) +x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0) ~~~ {: .language-python} -We'll use X_train and y_train to develop our model, and only look at X_test and y_test when it's time to evaluate its performance. +We'll use x_train and y_train to develop our model, and only look at x_test and y_test when it's time to evaluate its performance. ### Visualising the data In order to better understand how a model might classify this data, we can first take a look at the data visually, to see what patterns we might identify. ~~~ -fig01 = sns.scatterplot(X_train, x=feature_names[0], y=feature_names[1], hue=dataset['species']) +fig01 = sns.scatterplot(x_train, x=feature_names[0], y=feature_names[1], hue=dataset['species']) plt.show() ~~~ {: .language-python} @@ -77,35 +80,33 @@ plt.show() As there are four measurements for each penguin, we need a second plot to visualise all four dimensions: ~~~ -fig23 = sns.scatterplot(X_train, x=feature_names[2], y=feature_names[3], hue=dataset['species']) +fig23 = sns.scatterplot(x_train, x=feature_names[2], y=feature_names[3], hue=dataset['species']) plt.show() ~~~ {: .language-python} -We can see that penguins from each species form fairly distinct spatial clusters in these plots, so that you could draw lines between those clusters to delineate each species. This is effectively what many classification algorithms do - using the training data to delineate the observation space (the 4 measurement dimensions) into classes. When given new observations, the model then finds which of those class areas that observation falls in to.
+We can see that penguins from each species form fairly distinct spatial clusters in these plots, so that you could draw lines between those clusters to delineate each species. This is effectively what many classification algorithms do - using the training data to delineate the observation space, in this case the 4 measurement dimensions, into classes. When given new observations the model then finds which of those class areas that observation falls into. ## Decision Tree We'll first apply a decision tree classifier to the data. Decisions trees are conceptually similar to flow diagrams (or more precisely for the biologists: dichotomous keys) - they split the classification problem into a binary tree of comparisons, at each step comparing a measurement to a value, and moving left or right down the tree until a classification is reached. (figure) -pros & cons - Training and using a decision tree in scikit-learn is straightforward: ~~~ from sklearn.tree import DecisionTreeClassifier, plot_tree clf = DecisionTreeClassifier() -clf.fit(X_train, y_train) +clf.fit(x_train, y_train) -clf.predict(X_test) +clf.predict(x_test) ~~~ {: .language-python} We can conveniently check how our model did with the .score() function, which will make predictions and report what proportion of them were accurate: ~~~ -clf.score(X_test, y_test) +clf.score(x_test, y_test) ~~~ {: .language-python} @@ -120,7 +121,9 @@ plt.show() ~~~ {: .language-python} -We can see from this that there's some very tortuous logic being used to tease out every single observation in the training set - for example the single purple Gentoo node at the bottom of the tree. If we truncated that branch to the second level (Chinstrap), we'd have a little inaccuracy, 5 non-Chinstraps in with 47 Chinstraps, but a less convoluted model. All of which is to say that, this model is clearly over-fit - it's developed a very complex delineation of the classification space in order to match every single observation, which will likely lead to poor results for new observations.
### Visualising the classification space We can visualise the delineation produced, but only for two parameters at a time, so the model produced here isn't exactly that same as that used above: @@ -132,17 +135,17 @@ f1 = feature_names[2] f2 = feature_names[3] clf = DecisionTreeClassifier() -clf.fit(X_train[[f1, f2]], y_train) +clf.fit(x_train[[f1, f2]], y_train) -d = DecisionBoundaryDisplay.from_estimator(clf, X_train[[f1, f2]]) +d = DecisionBoundaryDisplay.from_estimator(clf, x_train[[f1, f2]]) # labels = [class_names[i] for i in y_train] -sns.scatterplot(X_train, x=f1, y=f2, hue=y_train, palette='husl') +sns.scatterplot(x_train, x=f1, y=f2, hue=y_train, palette='husl') plt.show() ~~~ {: .language-python} -We can see that rather than clean lines between species, the decision tree produces orthogonal regions (as each decision only considers a single parameter). Again, we can see that the model is overfit - the decision space is far more complex than needed, with regions that only select a single point. +We can see that rather than clean lines between species, the decision tree produces orthogonal regions as each decision only considers a single parameter. Again, we can see that the model is overfit as the decision space is far more complex than needed, with regions that only select a single point. ## SVM Next, we'll look at another commonly used classification algorithm, and see how it compares. Support Vector Machines (SVM) work in a way that is conceptually similar to your own intuition when first looking at the data - they devise a set of hyperplanes that delineate the parameter space, such that each region contains ideally only observations from one class, and the boundaries fall between classes. @@ -156,9 +159,9 @@ Normalising maps each parameter to a new range, so that it has a mean of 0, and from sklearn import preprocessing scalar = preprocessing.StandardScaler() -scalar.fit(X_train) -X_train_scaled = pd.DataFrame(scalar.transform(X_train), columns=X_train.columns, index=X_train.index) -X_test_scaled = pd.DataFrame(scalar.transform(X_test), columns=X_test.columns, index=X_test.index) +scalar.fit(x_train) +x_train_scaled = pd.DataFrame(scalar.transform(x_train), columns=x_train.columns, index=x_train.index) +x_test_scaled = pd.DataFrame(scalar.transform(x_test), columns=x_test.columns, index=x_test.index) ~~~ {: .language-python} @@ -170,16 +173,16 @@ With this scaled data, training the models works exactly the same as before. 
from sklearn import svm SVM = svm.SVC(kernel='poly', degree=3, C=1.5) -SVM.fit(X_train_scaled, y_train) +SVM.fit(x_train_scaled, y_train) -SVM.score(X_test_scaled, y_test) +SVM.score(x_test_scaled, y_test) ~~~ {: .language-python} We can again visualise the decision space produced, also using only two parameters: ~~~ -x2 = X_train_scaled[[feature_names[0], feature_names[1]]] +x2 = x_train_scaled[[feature_names[0], feature_names[1]]] SVM = svm.SVC(kernel='poly', degree=3, C=1.5) SVM.fit(x2, y_train) @@ -201,8 +204,8 @@ max_depths = [1, 2, 3, 4, 5] accuracy = [] for i, d in enumerate(max_depths): clf = DecisionTreeClassifier(max_depth=d) - clf.fit(X_train, y_train) - acc = clf.score(X_test, y_test) + clf.fit(x_train, y_train) + acc = clf.score(x_test, y_test) accuracy.append((d, acc)) @@ -220,7 +223,7 @@ Reusing our visualisation code from above, we can inspect our simplified decisio ~~~ clf = DecisionTreeClassifier(max_depth=2) -clf.fit(X_train, y_train) +clf.fit(x_train, y_train) fig = plt.figure(figsize=(12, 10)) plot_tree(clf, class_names=class_names, feature_names=feature_names, filled=True, ax=fig.gca()) @@ -235,21 +238,22 @@ f1 = feature_names[2] f2 = feature_names[3] clf = DecisionTreeClassifier(max_depth=2) -clf.fit(X_train[[f1, f2]], y_train) +clf.fit(x_train[[f1, f2]], y_train) -d = DecisionBoundaryDisplay.from_estimator(clf, X_train[[f1, f2]]) +d = DecisionBoundaryDisplay.from_estimator(clf, x_train[[f1, f2]]) -sns.scatterplot(X_train, x=f1, y=f2, hue=y_train, palette='husl') +sns.scatterplot(x_train, x=f1, y=f2, hue=y_train, palette='husl') plt.show() ~~~ {: .language-python} We can see that both the tree and the decision space are much simpler, but still do a good job of classifying our data. We've succeeded in reducing over-fitting. -{callout box thing} 'Max Depth' is an example of a *hyper-parameter* to the decision tree model. Where models use the parameters of an observation to predict a result, hyper-parameters are used to tune how a model works. Each model you encounter will have its own set of hyper-parameters, each of which affects model behaviour and performance in a different way. The process of adjusting hyper-parameters in order to improve model performance is called hyper-parameter tuning. +> ## 'Max Depth' is an example of a Hyper-Parameter +> 'Max Depth' is an example of a *hyper-parameter* to the decision tree model. Where models use the parameters of an observation to predict a result, hyper-parameters are used to tune how a model works. Each model you encounter will have its own set of hyper-parameters, each of which affects model behaviour and performance in a different way. The process of adjusting hyper-parameters in order to improve model performance is called hyper-parameter tuning. +{: .callout} -# September ### Note that care is needed when splitting data - You generally want to ensure that each class is represented proportionately in both training + testing (beware just taking the first 80%) - Sometimes you want to make sure a group is excluded from the train/test split, e.g.: when multiple samples come from one individual diff --git a/_episodes/04-clustering.md b/_episodes/04-clustering.md index 9653a4c..f31d2d9 100644 --- a/_episodes/04-clustering.md +++ b/_episodes/04-clustering.md @@ -3,8 +3,10 @@ title: "Clustering with Scikit-Learn" teaching: 15 exercises: 20 questions: +- "What is Unsupervised learning?" - "How can we use clustering to find data points with similar attributes?" 
objectives: +- "Understand the difference between supervised and unsupervised learning" - "Identify clusters in data using k-means clustering." - "Understand the limitations of k-means when clusters overlap." - "Use spectral clustering to overcome the limitations of k-means." @@ -18,6 +20,19 @@ keypoints: - "Scikit-Learn has functions to create example data." --- +# Unsupervised Learning + +In episode 2 we learnt about Supervised Learning. Now it is time to explore Unsupervised Learning. + +Sometimes we do not have the luxury of using labelled data. This could be for a number of reasons: + +* We have labelled data, but not enough to accurately train our model +* Our existing labelled data is low-quality or inaccurate +* It is too time-consuming to (manually) label more data +* We have data, but no idea what correlations might exist that we could model! + +In these cases we need to use unsupervised learning. As the name suggests, this time we do not "supervise" the ML-algorithm by providing it labels, but instead we let it try to find its own patterns in the data and report back on any correlations that it might find. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself.
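+As a minimal preview of what this looks like in practice (this sketch uses k-means clustering and one of Scikit-Learn's example-data functions, both of which are introduced properly later in this episode):
+
+~~~
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+
+X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # data points with no labels attached
+kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
+discovered_labels = kmeans.fit_predict(X)   # the algorithm groups the points by itself
+print(discovered_labels[:10])               # cluster labels "discovered" from the data
+~~~
+{: .language-python}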
+ # Clustering Clustering is the grouping of data points which are similar to each other. It can be a powerful technique for identifying patterns in data. @@ -25,7 +40,7 @@ Clustering analysis does not usually require any training and is therefore known ## Applications of clustering * Looking for trends in data -* Reducing the data around a point to just that point using data compression (e.g. reducing colour depth in an image) +* Reducing the data around a point to just that point as a form of data compression (e.g. reducing colour depth in an image) * Pattern recognition ## K-means clustering diff --git a/_episodes/05-dimensionality-reduction.md b/_episodes/05-dimensionality-reduction.md index e6335ed..f4c0335 100644 --- a/_episodes/05-dimensionality-reduction.md +++ b/_episodes/05-dimensionality-reduction.md @@ -1,9 +1,9 @@ --- -title: "Reducing the Dimensionality of Data" -teaching: 0 -exercises: 0 +title: "Dimensionality reduction" +teaching: 10 +exercises: 10 questions: -- "How do we apply machine learning techniques to data with higher dimensions? +- How do we apply machine learning techniques to data with higher dimensions? objectives: - "Recall that most data is inherently multidimensional." - "Understand that reducing the number of dimensions can simplify modelling and allow classifications to be performed." @@ -16,7 +16,7 @@ keypoints: # Dimensionality reduction -As seen in the last episode, general clustering algorithms work well with low-dimensional data. In this episode we will work with higher-dimension data such as images of handwritten text or numbers. The dataset we will be using is the Modified National Institute of Standards and Technology (MNIST) dataset. The MNIST dataset contains 60,000 handwritten labelled images from 0-9. An illustration of the dataset is presented below. +As seen in the last episode, general clustering algorithms work well with low-dimensional data. In this episode we will work with higher-dimension data such as images of handwritten text or numbers. The dataset we will be using is the Modified National Institute of Standards and Technology (MNIST) dataset. The MNIST dataset contains 60,000 handwritten labelled images from 0-9. An illustration of the dataset is presented below. Our MNIST data has 3 dimensions: an x-component, a y-component, and an alpha value at each (x,y) coordinate. ![MNIST example illustrating all the classes in the dataset](../fig/MnistExamples.png) @@ -44,9 +44,10 @@ y = digits.target Linear clustering approaches such as k-means would require all the images to be binned into a pre-determined number of clusters, which might not adequately capture the variability in the images. -Non-linear spectral clustering might fare better, but it would require the images to be projected into a higher dimension space, and separating the complex projections in higher-order spaces would necessitate complex non-linear separators. +Non-linear clustering, such as spectral clustering, might fare better, but requires the images to be projected into a higher dimension space. Separating the existing complex data in higher-order spaces would require complex non-linear separators to divide this data, which would greatly increase the computational cost. + +We can help reduce the computational cost of clustering by transforming our higher-dimension input dataset into lower-order projections. Conceptually this is done by determining which combination of variables explains the most variance in the data, and then working with those variables. -One option is to reduce the dimensions of the input dataset in 2D vector space while preserving their local representations. This would transform the high-dimension input dataset into lower-order projections. These lower-order projections can be easily separated using linear separators while preserving the variability of images within the dataset. ## Dimensionality reduction with Scikit-Learn We will look at two commonly used techniques for dimensionality reduction: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). Both of these techniques are supported by Scikit-Learn. @@ -114,7 +115,7 @@ The major drawback of applying t-SNE to datasets is the large computational requ > ## Exercise: Working in three dimensions > The above example has considered only two dimensions since humans -> can only visualize two dimensions well. However, there can be cases +> can visualize two dimensions very well. However, there can be cases > where a dataset requires more than two dimensions to be appropriately > decomposed. Modify the above programs to use three dimensions and > create appropriate plots. diff --git a/_episodes/06-neural-networks.md b/_episodes/06-neural-networks.md index c9f2bc1..510c63b 100644 --- a/_episodes/06-neural-networks.md +++ b/_episodes/06-neural-networks.md @@ -3,6 +3,7 @@ title: "Neural Networks" teaching: 20 exercises: 30 questions: +- "What are Neural Networks?" - "How can we classify images using a neural network?" objectives: - "Understand the basic architecture of a perceptron." diff --git a/_episodes/old_02-regression.md b/old_02-regression.md similarity index 100% rename from _episodes/old_02-regression.md rename to old_02-regression.md