This repo is a collection of Jupyter Notebooks to accompany the Udacity Connect Intensive Machine Learning Nanodegree. The code is written for Python 2.7, but should be (mostly) compatible with Python 3.x.
If you haven't already done so, you'll need to download and install Python 2.7. If using Mac OS X, you may want to use Homebrew as a package manager, following these instructions to install Python 2.7 or Python 3. You can also use Anaconda as a package manager. Then, you can follow these instructions to install the Jupyter Notebook. These instructions explain how to install both Python 2 and Python 3 kernels.
You can follow these instructions to create a fork of the ConnectIntensive repo, and clone it to your local machine. Once you've done so, you can navigate to your local clone of the ConnectIntensive repo and follow these instructions to run the Jupyter Notebook App.
The required packages and libraries vary in each of these Jupyter Notebooks; the most commonly used ones include `pandas`, `matplotlib`, and `sklearn`. Each lesson notebook lists its own specific prerequisites along with the objectives.
Most lesson notebooks have a corresponding solutions notebook with the outputs of each cell shown. For example, the notebook `solutions-01.ipynb` displays the output and shows the solutions to the exercises from `lesson-01.ipynb`.
- `lesson-00.ipynb`: Hello Jupyter Notebook!
  - A "hello world" notebook to introduce the Jupyter IDE
  - Introduces import statements for commonly-used modules and packages
- `lesson-01.ipynb`: An intro to Statistical Analysis using `pandas`
  - Introduces the `Series` and `DataFrame` objects in `pandas`
  - Defines categorical variables
  - Covers basic descriptive statistics: mean, median, min/max
  - Label-based `.loc` and position-based `.iloc` indexing in `pandas`
  - Boolean indexing, how to slice a `DataFrame` in `pandas`
  - Exercises in exploratory data analysis, emphasizing `groupby` and `plot`
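The `pandas` idioms from this lesson can be sketched in a few lines. The tiny DataFrame below is purely illustrative, not the lesson's actual dataset:

```python
import pandas as pd

# A tiny illustrative DataFrame; each column is a pandas Series.
df = pd.DataFrame(
    {"city": ["NYC", "NYC", "LA", "LA"],
     "temp": [60, 75, 70, 85]},
    index=["a", "b", "c", "d"])

# Label-based selection with .loc vs. position-based selection with .iloc.
by_label = df.loc["b", "temp"]       # row labeled "b"
by_position = df.iloc[1]["temp"]     # second row, same value

# Boolean indexing: keep only the rows where a condition holds.
warm = df[df["temp"] > 65]

# groupby + an aggregate, the workhorse of exploratory analysis.
mean_by_city = df.groupby("city")["temp"].mean()
print(mean_by_city)
```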
- `lesson-02.ipynb`: Working with the Enron Data Set
  - Covers the `pickle` module for saving objects
  - Magic commands in Jupyter notebooks
  - Use of the `stack` and `unstack` functions in `pandas`
  - Exercises in exploratory data analysis on the Enron data set
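As a rough sketch of the `pickle` and `stack`/`unstack` techniques, using toy data rather than the Enron set:

```python
import pickle
import pandas as pd

# pickle round-trips arbitrary Python objects through bytes (or a file).
payload = {"dataset": "toy", "rows": 2}
restored = pickle.loads(pickle.dumps(payload))

# stack()/unstack() pivot a DataFrame between wide and long layouts.
wide = pd.DataFrame({"salary": [100, 200], "bonus": [10, 20]},
                    index=["alice", "bob"])
long_form = wide.stack()      # Series keyed by (person, field)
back = long_form.unstack()    # recovers the wide layout
print(long_form)
```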
- `lesson-03-part-01.ipynb`: Building and Evaluating Models with `sklearn` (part 1)
  - Perform exploratory data analysis on a dataset
  - Tidy a data set so that it will be compatible with the `sklearn` library
  - Use the `pandas.get_dummies()` function to convert categorical variables to dummy or indicator variables
  - Impute missing values to ensure variables are numeric
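A minimal sketch of these tidying steps, on made-up data rather than the lesson's dataset:

```python
import pandas as pd

# Toy data with a categorical column and a missing numeric value.
df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": [1.0, None, 3.0]})

# Convert the categorical column to indicator ("dummy") variables.
dummies = pd.get_dummies(df["color"], prefix="color")

# Impute the missing value with the column mean so the feature is numeric.
df["size"] = df["size"].fillna(df["size"].mean())

# Combine into a frame that sklearn estimators can consume.
tidy = pd.concat([df[["size"]], dummies], axis=1)
print(tidy)
```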
- `lesson-03-part-02.ipynb`: Building and Evaluating Models with `sklearn` (part 2)
  - Make decision tree classifiers on the tidied dataset from part 01
  - Compute the accuracy score of a model on both the training and validation (testing) data
  - Adjust hyperparameters to see the effects on model accuracy
  - Use `export_graphviz` to visualize decision trees
  - Introduce the Gini impurity
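The train/validate loop above might look like the following sketch; the iris data here is just a stand-in for the tidied dataset from part 01:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# DecisionTreeClassifier splits on Gini impurity by default
# (criterion="gini"); max_depth is a hyperparameter to adjust.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(train_acc, test_acc)
```

A fitted tree can then be exported for visualization with `sklearn.tree.export_graphviz(clf, out_file="tree.dot")` and rendered with Graphviz.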
- `lesson-04-part-01.ipynb`: Bayes NLP Mini-Project
  - Understand how Bayes' Rule derives from conditional probability
  - Write methods applying Bayesian learning to simple word-prediction tasks
  - Practice with Python string methods, e.g. `str.split()`, and Python dictionaries
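A minimal sketch of maximum-likelihood next-word prediction using plain dictionaries and `str.split()`; the corpus and function names here are made up for illustration, not the mini-project's actual code:

```python
# Bayes' Rule, P(A|B) = P(B|A) * P(A) / P(B), follows from writing the
# joint probability both ways: P(A and B) = P(A|B)P(B) = P(B|A)P(A).
# Next-word prediction then reduces to counting: P(next | prev) is
# proportional to count(prev, next) in the corpus.

corpus = "so if you could just go ahead and pack up your stuff"

def next_word_counts(text):
    """Map each word to a dict counting the words that follow it."""
    words = text.split()                      # str.split() on whitespace
    counts = {}
    for prev, nxt in zip(words, words[1:]):
        counts.setdefault(prev, {})
        counts[prev][nxt] = counts[prev].get(nxt, 0) + 1
    return counts

def most_likely_next(text, word):
    """Return the most frequent word observed after `word`, or None."""
    following = next_word_counts(text).get(word, {})
    return max(following, key=following.get) if following else None

print(most_likely_next(corpus, "you"))
```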
- `lesson-05.ipynb`: Classification with Support Vector Machines
  - Introduces additional plotting functionality in `matplotlib.pyplot`
    - Boxplots for depicting interquartile range (IQR), median, max, min, and outliers
    - Scatterplots for 2-D representation of two features
  - Introduction to Support Vector Machines in `sklearn`
    - An introduction to kernels
    - Hard-margin versus soft-margin SVMs
    - Overview of `SVC` hyperparameters: `C`, `gamma`, `degree`, etc.
    - Visualize decision boundaries resulting from the different kernels
    - Practice with the `GridSearchCV` class
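A short sketch of tuning the `SVC` hyperparameters with `GridSearchCV`; the iris data stands in for the lesson's dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C trades off margin width against misclassification (soft vs. hard
# margin); gamma sets the width of the RBF kernel.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```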
- `lesson-06-part-01.ipynb`: Clustering Mini-Project
  - Perform k-means clustering on the Enron Data Set
  - Visualize different clusters that form before and after feature scaling
  - Plot decision boundaries that arise from k-means clustering using two features
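A sketch of why feature scaling matters for k-means, using synthetic blobs instead of the Enron data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
# Two blobs separated along x, plus a second feature on a huge scale.
blob_a = rng.normal([0.0, 0.0], [1.0, 1000.0], size=(50, 2))
blob_b = rng.normal([10.0, 0.0], [1.0, 1000.0], size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Unscaled, the large-variance feature dominates Euclidean distance and
# hides the blobs; MinMaxScaler puts both features on [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(km.labels_))
```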
- `lesson-06-part-02.ipynb`: PCA Mini-Project
  - Perform Principal Component Analysis (PCA) on a large set of features
  - Recognize differences between `train_test_split()` and `StratifiedShuffleSplit()`
  - Introduce the `class_weight` parameter for `SVC()`
  - Visualize the eigenfaces (orthonormal basis of components) that result from PCA
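The PCA step might be sketched as follows; the digits images here stand in for the mini-project's faces data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 8x8 digit images, flattened to 64 pixel features each.
X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=10).fit(X)

# components_ holds the orthonormal basis images (the "eigenface"
# analogues); explained_variance_ratio_ shows how much each captures.
print(pca.components_.shape)
print(round(pca.explained_variance_ratio_.sum(), 3))
```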
I find that learning Python from Jupyter Notebooks is addictive. Here are some other great resources:

- Thomas Corcoran's Connect Repo: More notebooks prepared by another talented MLND Session Lead
- Brandon Rhodes' PyCon 2015 Pandas Tutorial: One of my favorite introductions to `pandas`, with an accompanying video lecture
- Jake VanderPlas' Scikit-learn Tutorial: An introduction to `sklearn`, also with an accompanying video lecture
- Kevin Markham's Machine Learning with Text in Scikit-learn Tutorial: If you want to get started with NLP using `sklearn`, Kevin's tutorial is a great introduction (video lecture here)