tweaks to classification
mike-ivs committed Mar 21, 2023
1 parent 259f598 commit 79f4c55
Showing 1 changed file with 42 additions and 40 deletions: _episodes/03-classification.md

# Classification

Classification is a supervised method that groups data objects into pre-determined categories based on observations of their features. Where regression uses labelled observations to predict a continuous numerical value, classification predicts a discrete categorical fit to a class. Classification in ML leverages a wide range of algorithms to sort data into their respective categories.

In this lesson we are going to introduce the concept of supervised classification by classifying penguin data into different species of penguins using Scikit-Learn.

## The Penguin dataset
We're going to be using the penguins dataset of Allison Horst, published [here](https://github.com/allisonhorst/palmerpenguins) in 2020, which comprises 342 observations of three species of penguins: Adelie, Chinstrap & Gentoo. For each penguin we have measurements of its bill length and depth (mm), flipper length (mm) and body mass (g), as well as information on its species, island, and sex.

~~~
import seaborn as sns

# load the penguins dataset bundled with seaborn
dataset = sns.load_dataset('penguins')

dataset.head()
~~~
{: .language-python}

Our aim is to develop a classification model that will predict the species of a penguin based upon measurements of those variables.

As a rule of thumb for ML/DL modelling, it is best to start with a simple model and progressively add complexity until we meet our desired classification performance.

While we are learning some classification methods we will limit our dataset to the numerical features bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g, and attempt to classify species.

The table above also contains categorical fields such as island and sex; if we attempted to include these as well, the added complexity of the data would hinder our classification performance.

### Training-testing split
When undertaking any machine learning project, it's important to be able to evaluate how well your model works. In order to do this, we set aside some data (usually 20%) as a testing set, leaving the rest as your training dataset.

> ## Why do we do this?
> It's important to do this early, and to do all of your work with the training dataset - this avoids any risk of you introducing bias to the model based on your own observations of data in the testing set, and can highlight when you are over-fitting on your training data.
{: .callout}

~~~
# Extract the data we need: four numerical features and the species label
feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# drop any rows with missing measurements, leaving the 342 complete observations
dataset.dropna(subset=feature_names, inplace=True)

class_names = dataset['species'].unique()

X = dataset[feature_names]
Y = dataset['species']
~~~
{: .language-python}

Having extracted our features (X) and labels (Y), we can now split the data:

~~~
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
~~~
{: .language-python}

We'll use x_train and y_train to develop our model, and only look at x_test and y_test when it's time to evaluate its performance.

### Visualising the data
In order to better understand how a model might classify this data, we can first take a look at the data visually, to see what patterns we might identify.

~~~
import matplotlib.pyplot as plt

fig01 = sns.scatterplot(x_train, x=feature_names[0], y=feature_names[1], hue=dataset['species'])
plt.show()
~~~
{: .language-python}

As there are four measurements for each penguin, we need a second plot to visualise all four dimensions:

~~~
fig23 = sns.scatterplot(x_train, x=feature_names[2], y=feature_names[3], hue=dataset['species'])
plt.show()
~~~
{: .language-python}

We can see that penguins from each species form fairly distinct spatial clusters in these plots, so that you could draw lines between those clusters to delineate each species. This is effectively what many classification algorithms do - using the training data to delineate the observation space, in this case the 4 measurement dimensions, into classes. When given a new observation, the model finds which of those class regions the observation falls into.

## Decision Tree
We'll first apply a decision tree classifier to the data. Decision trees are conceptually similar to flow diagrams (or, more precisely for the biologists, dichotomous keys) - they split the classification problem into a binary tree of comparisons, at each step comparing a measurement to a value, and moving left or right down the tree until a classification is reached.

(figure)

pros & cons
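
To make the flow-diagram analogy concrete, a depth-2 decision tree behaves like a nest of if/else comparisons. The sketch below is purely illustrative - the feature choices and threshold values are invented, not the splits a model trained on our data would learn:

~~~
# An illustrative, hand-written "decision tree" of depth 2.
# The features and thresholds are invented for demonstration;
# a trained model would learn its own splits from the data.
def classify_penguin(flipper_length_mm, bill_depth_mm):
    if flipper_length_mm < 207.0:        # comparison at the root node
        if bill_depth_mm < 16.5:         # comparison at the second level
            return 'Adelie'
        else:
            return 'Chinstrap'
    else:
        return 'Gentoo'

print(classify_penguin(181.0, 18.7))     # 'Chinstrap' under these made-up thresholds
~~~
{: .language-python}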

Training and using a decision tree in scikit-learn is straightforward:
~~~
from sklearn.tree import DecisionTreeClassifier, plot_tree

clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)

clf.predict(x_test)
~~~
{: .language-python}

We can conveniently check how our model did with the .score() function, which will make predictions and report what proportion of them were accurate:

~~~
clf.score(x_test, y_test)
~~~
{: .language-python}
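
For a classifier, .score() reports the accuracy: the fraction of predictions that match the true labels. As a quick check of our understanding, the same number can be computed by hand:

~~~
import numpy as np

# fraction of test predictions that match the true species
predictions = clf.predict(x_test)
np.mean(predictions == y_test)
~~~
{: .language-python}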

We can also visualise the fitted tree itself, using the plot_tree function we imported earlier:

~~~
fig = plt.figure(figsize=(12, 10))
plot_tree(clf, class_names=class_names, feature_names=feature_names, filled=True, ax=fig.gca())
plt.show()
~~~
{: .language-python}

We can see from this that there's some very tortuous logic being used to tease out every single observation in the training set - for example the single purple Gentoo node at the bottom of the tree. If we truncated that branch to the second level (Chinstrap), we'd introduce a little inaccuracy, 5 non-Chinstraps in with 47 Chinstraps, but get a less convoluted model.

Tortuous logic like that bottom purple Gentoo node is a clear indication that the model is over-fit: it has developed a very complex delineation of the classification space in order to match every single training observation, which will likely lead to poor results for new observations.

### Visualising the classification space
We can visualise the delineation produced, but only for two parameters at a time, so the model produced here isn't exactly the same as that used above:

~~~
from sklearn.inspection import DecisionBoundaryDisplay

f1 = feature_names[2]
f2 = feature_names[3]

clf = DecisionTreeClassifier()
clf.fit(x_train[[f1, f2]], y_train)

d = DecisionBoundaryDisplay.from_estimator(clf, x_train[[f1, f2]])

sns.scatterplot(x_train, x=f1, y=f2, hue=y_train, palette='husl')
plt.show()
~~~
{: .language-python}

We can see that rather than clean lines between species, the decision tree produces orthogonal regions, as each decision only considers a single parameter. Again, we can see that the model is overfit: the decision space is far more complex than needed, with regions that only select a single point.

## SVM
Next, we'll look at another commonly used classification algorithm, and see how it compares. Support Vector Machines (SVM) work in a way that is conceptually similar to your own intuition when first looking at the data - they devise a set of hyperplanes that delineate the parameter space, such that each region contains ideally only observations from one class, and the boundaries fall between classes.
SVMs are sensitive to the relative scales of the input features, so before training we first normalise the data. Normalising maps each parameter to a new range, so that it has a mean of 0 and a standard deviation of 1:

~~~
import pandas as pd
from sklearn import preprocessing

# fit the scaler on the training data only, then apply it to both sets
scalar = preprocessing.StandardScaler()
scalar.fit(x_train)

x_train_scaled = pd.DataFrame(scalar.transform(x_train), columns=x_train.columns, index=x_train.index)
x_test_scaled = pd.DataFrame(scalar.transform(x_test), columns=x_test.columns, index=x_test.index)
~~~
{: .language-python}

With this scaled data, training the models works exactly the same as before.

~~~
from sklearn import svm

SVM = svm.SVC(kernel='poly', degree=3, C=1.5)
SVM.fit(x_train_scaled, y_train)

SVM.score(x_test_scaled, y_test)
~~~
{: .language-python}

We can again visualise the decision space produced, also using only two parameters:

~~~
x2 = x_train_scaled[[feature_names[0], feature_names[1]]]

SVM = svm.SVC(kernel='poly', degree=3, C=1.5)
SVM.fit(x2, y_train)

d = DecisionBoundaryDisplay.from_estimator(SVM, x2)

sns.scatterplot(x2, x=feature_names[0], y=feature_names[1], hue=y_train, palette='husl')
plt.show()
~~~
{: .language-python}

This decision space is much smoother than the one produced by our decision tree. Let's return to the decision tree and try to reduce its over-fitting by limiting the depth of the tree with the max_depth hyper-parameter, comparing the accuracy of models trained with a range of values:

~~~
max_depths = [1, 2, 3, 4, 5]

accuracy = []
for i, d in enumerate(max_depths):
    clf = DecisionTreeClassifier(max_depth=d)
    clf.fit(x_train, y_train)
    acc = clf.score(x_test, y_test)

    accuracy.append((d, acc))
~~~
{: .language-python}

Inspecting the resulting accuracies suggests that a shallow tree already classifies the test data well, so below we adopt a max_depth of 2. Reusing our visualisation code from above, we can inspect our simplified decision tree:

~~~
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(x_train, y_train)

fig = plt.figure(figsize=(12, 10))
plot_tree(clf, class_names=class_names, feature_names=feature_names, filled=True, ax=fig.gca())
plt.show()
~~~
{: .language-python}

and our simplified decision space:

~~~
f1 = feature_names[2]
f2 = feature_names[3]

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(x_train[[f1, f2]], y_train)

d = DecisionBoundaryDisplay.from_estimator(clf, x_train[[f1, f2]])

sns.scatterplot(x_train, x=f1, y=f2, hue=y_train, palette='husl')
plt.show()
~~~
{: .language-python}

We can see that both the tree and the decision space are much simpler, but still do a good job of classifying our data. We've succeeded in reducing over-fitting.

> ## 'Max Depth' is an example of a Hyper-Parameter
> 'Max Depth' is an example of a *hyper-parameter* to the decision tree model. Where models use the parameters of an observation to predict a result, hyper-parameters are used to tune how a model works. Each model you encounter will have its own set of hyper-parameters, each of which affects model behaviour and performance in a different way. The process of adjusting hyper-parameters in order to improve model performance is called hyper-parameter tuning.
{: .callout}
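
Rather than looping over values by hand as we did with max_depth above, hyper-parameter tuning can be automated. A minimal sketch using scikit-learn's GridSearchCV (the candidate values here are just an example) would be:

~~~
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# try each candidate max_depth with 5-fold cross-validation on the training set
search = GridSearchCV(DecisionTreeClassifier(), {'max_depth': [1, 2, 3, 4, 5]}, cv=5)
search.fit(x_train, y_train)

print(search.best_params_, search.best_score_)
~~~
{: .language-python}

Note that GridSearchCV scores each candidate by cross-validation within the training set, so the testing set remains untouched until the final evaluation.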


# September
### Note that care is needed when splitting data
- You generally want to ensure that each class is represented proportionately in both the training and testing sets (beware of just taking the first 80% of the data); the stratified split sketched below handles this
- Sometimes you want to make sure an entire group ends up on only one side of the train/test split, e.g. when multiple samples come from one individual; see the group-based split in the sketch below
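
As a minimal sketch of both points (the individual_id grouping column is hypothetical - the penguins dataset doesn't include one):

~~~
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# stratify=Y keeps each species at the same proportion in both sets
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y)

# A group-aware split keeps all rows sharing a group label on one side.
# 'individual_id' is a hypothetical column, used here for illustration only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, Y, groups=dataset['individual_id']))
x_train, x_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = Y.iloc[train_idx], Y.iloc[test_idx]
~~~
{: .language-python}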
