Data science is a concept to unify statistics, data analysis, machine learning, domain knowledge and their related methods in order to understand and analyze actual phenomena with data.
TODO
https://www.youtube.com/watch?v=xxpc-HPKN28
Individuals vs characteristics
Population (census) vs sample
Parameters vs statistics
Descriptive vs inferential
Normal distribution and empirical rule (68-95-99.7)
z-score
Inference: Estimation, Testing, Regression
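The empirical rule and the z-score can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`z_score`, `normal_cdf`, `prob_within`) are illustrative and do not come from the video.

```python
import math

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return normal_cdf(k) - normal_cdf(-k)

# Reproduces the 68-95-99.7 rule for k = 1, 2, 3.
for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

For example, with a hypothetical mean of 100 and standard deviation of 15, `z_score(130, 100, 15)` is 2.0, so the value sits at the edge of the 95% band.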
Central limit theorem:
- The sampling distribution (the distribution of the sample means, x-bar, over all possible samples of size n) is approximately normal when n is large enough, even if the population itself is not normal.
- The mean of the x-bars is equal to the mean of the population.
- The standard deviation of the x-bars is the standard deviation of the population divided by sqrt(n).
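The three statements above can be checked with a small simulation. This is a sketch under stated assumptions: the population is a hypothetical exponential distribution (chosen because it is clearly not normal), and the sample size and number of trials are arbitrary.

```python
import random
import statistics

random.seed(0)

# Hypothetical, deliberately non-normal population (exponential, mean about 1).
population = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50          # size of each sample
trials = 5_000  # number of samples drawn (with replacement)

sample_means = [
    statistics.fmean(random.choices(population, k=n)) for _ in range(trials)
]
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

print(f"population mean {mu:.3f}, mean of x-bars {mean_of_means:.3f}")
print(f"sigma / sqrt(n) {sigma / n ** 0.5:.3f}, sd of x-bars {sd_of_means:.3f}")
```

The two printed pairs should agree closely, and a histogram of `sample_means` would look roughly bell-shaped even though the population is skewed.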
p-value - the probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true.
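As an illustration, here is a sketch of an exact two-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The function name and the 60-heads-in-100-flips scenario are assumptions made for the example.

```python
from math import comb

def binomial_two_sided_p_value(heads, flips, p_null=0.5):
    """Exact two-sided p-value for observing `heads` in `flips` coin flips.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null hypothesis.
    """
    def pmf(k):
        return comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # Small relative tolerance so that symmetric outcomes are not lost to rounding.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))

p = binomial_two_sided_p_value(60, 100)
print(f"p-value for 60 heads in 100 flips: {p:.4f}")
```

With the common 0.05 threshold, this hypothetical result would not quite be enough to reject the hypothesis of a fair coin.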
- Deep Learning Specialization, GitHub
- Building your Deep Neural Network - Step by Step, Google Colab
- Deep Neural Network - Application, Google Colab
DNN implementations:
- Python: TODO
- Python + numpy:
- TensorFlow: TODO
- Keras: TODO
It is better to have a single optimization metric, because it makes choosing the better model easier. When a single optimization metric is not enough, you can add satisficing metrics, which only have to meet a threshold. For example, error rate is an optimization metric and the time it takes to classify an object is a satisficing metric.
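One way to apply this when comparing models: filter by the metric that only has to meet a threshold, then pick the best remaining model by the optimization metric. The candidates, field names and the 100 ms limit below are hypothetical.

```python
def pick_model(candidates, max_latency_ms):
    """Return the lowest-error model among those that meet the latency threshold."""
    feasible = [m for m in candidates if m["latency_ms"] <= max_latency_ms]
    if not feasible:
        return None  # no model meets the threshold
    return min(feasible, key=lambda m: m["error_rate"])

# Hypothetical evaluation results for three models.
candidates = [
    {"name": "A", "error_rate": 0.08, "latency_ms": 40},
    {"name": "B", "error_rate": 0.05, "latency_ms": 90},
    {"name": "C", "error_rate": 0.03, "latency_ms": 400},
]

best = pick_model(candidates, max_latency_ms=100)
print(best["name"])
```

Model C has the lowest error but is rejected for being too slow, so the choice falls on B.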
Bayes error rate is the lowest possible error rate for any classifier of a random outcome (into, for example, one of two categories) and is analogous to the irreducible error.
If your algorithm is performing worse than a human, then to improve your algorithm you can:
- Get labeled data from humans.
- Gain insight from manual error analysis: why did a person get this right?
- Better analysis of bias/variance.
When to focus on bias and when on variance:
- If human error is 1%, train error is 8%, dev error is 10%, then focus on avoidable bias, i.e. reducing the train error, because it can potentially be reduced by 7 pp, compared to just 2 pp for dev error. To reduce the train error you can try the following:
  - Train a bigger model
  - Train longer
  - Try better optimization algorithms: RMSProp, Adam
  - Try a different NN architecture: RNN, CNN
  - Hyperparameter search
- If human error is 7%, train error is 8%, dev error is 10%, then focus on variance, i.e. reducing the dev error, because it can potentially be reduced by 2 pp, compared to just 1 pp for train error. To reduce the dev error you can try the following:
  - Add more data
  - Regularization: L1, L2, dropout
  - Data augmentation
  - Try a different NN architecture: RNN, CNN
  - Hyperparameter search
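The two cases above reduce to comparing two gaps. A minimal sketch, with errors given in percent and human error used as a proxy for Bayes error; the function name and the tie-breaking rule (ties go to variance) are assumptions.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus), all gaps in percentage points."""
    avoidable_bias = train_error - human_error  # gap to human-level performance
    variance = dev_error - train_error          # gap between train and dev
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print(diagnose(1, 8, 10))  # first case from the notes
print(diagnose(7, 8, 10))  # second case from the notes
```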
Problems where ML significantly surpasses human-level performance:
- Online advertising
- Product recommendations
- Logistics (predicting transit time)
- Loan approvals
All of the above are ML on structured data as opposed to natural perception.
Advice from Andrej Karpathy for learning ML: try implementing a NN from scratch, without relying on any libraries like TensorFlow. This will help you learn how deep learning works under the hood.
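In that spirit, here is a minimal sketch of a network written without any libraries: one hidden tanh layer, a sigmoid output, log loss, and hand-written backpropagation. The layer size, learning rate, seed and the single training example are arbitrary choices for illustration, not a recommended setup.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """One hidden tanh layer followed by a single sigmoid output unit."""
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y_hat

def loss(params, x, y):
    """Binary cross-entropy (log loss) for a single example."""
    _, y_hat = forward(params, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(params, x, y):
    """Gradients of the loss with respect to W1, b1, W2, b2 (backpropagation)."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    dz2 = y_hat - y                                          # sigmoid + log loss
    dW2 = [dz2 * hi for hi in h]
    db2 = dz2
    dz1 = [dz2 * w * (1 - hi * hi) for w, hi in zip(W2, h)]  # tanh' = 1 - tanh^2
    dW1 = [[d * xi for xi in x] for d in dz1]
    db1 = dz1
    return dW1, db1, dW2, db2

def train_step(params, x, y, lr=0.1):
    """One step of plain gradient descent; returns the updated parameters."""
    W1, b1, W2, b2 = params
    dW1, db1, dW2, db2 = backward(params, x, y)
    W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, dW1)]
    b1 = [b - lr * g for b, g in zip(b1, db1)]
    W2 = [w - lr * g for w, g in zip(W2, dW2)]
    b2 = b2 - lr * db2
    return W1, b1, W2, b2

random.seed(1)
hidden = 3
# Small random weights, zero biases.
params = (
    [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(hidden)],
    [0.0] * hidden,
    [random.uniform(-0.5, 0.5) for _ in range(hidden)],
    0.0,
)

x, y = [1.0, 0.0], 1  # a single hypothetical training example
loss_before = loss(params, x, y)
for _ in range(50):
    params = train_step(params, x, y)
loss_after = loss(params, x, y)
print(f"loss before {loss_before:.3f}, after {loss_after:.3f}")
```

A useful habit when writing backpropagation by hand is to compare each analytic gradient against a finite-difference estimate of the loss.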
Error Analysis: focus on what contributes most to the algorithm's error. For example, if 90% of errors are due to blurry images and 10% are due to cats misclassified as dogs, then focus on blurry images to reduce the error.
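In practice this is a tally over manually reviewed mistakes. A small sketch, assuming each misclassified dev example has been given one hand-assigned tag; the tag names and counts are hypothetical.

```python
from collections import Counter

# Hypothetical tags assigned by hand to 10 misclassified dev-set examples.
error_tags = ["blurry"] * 9 + ["cat_mistaken_for_dog"] * 1

counts = Counter(error_tags)
total = sum(counts.values())

# Share of all errors per category, largest first. Each share is an upper
# bound on how much of the error could be removed by fixing that category.
shares = {tag: n / total for tag, n in counts.most_common()}
print(shares)
```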
Incorrectly labeled data:
- fix it if it contributes a significant portion to error;
- fix it across train, dev, test datasets universally. Otherwise it may introduce bias to the dataset.
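It's important to make your dev and test datasets as close as possible to real-world data, even if this results in the train and dev/test datasets being drawn from different distributions. This way you optimise for the right target. In this case, to perform bias/variance analysis, introduce a train-dev dataset: data drawn from the train distribution but not used for training. The gap between train and train-dev error measures variance, and the gap between train-dev and dev error measures data mismatch.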
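With a train-dev set held out from the training distribution, the total error splits into three gaps. A sketch with errors in percent; the function name and the example numbers are hypothetical.

```python
def error_breakdown(human, train, train_dev, dev):
    """Split the error into gaps, in percentage points."""
    return {
        "avoidable_bias": train - human,   # train error vs human-level error
        "variance": train_dev - train,     # same distribution, unseen data
        "data_mismatch": dev - train_dev,  # different distribution
    }

# Hypothetical errors: human 1%, train 2%, train-dev 3%, dev 10%.
print(error_breakdown(human=1, train=2, train_dev=3, dev=10))
```

In this made-up case most of the dev error comes from data mismatch, so more regularization would not help much.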
Transfer Learning - using intermediate NN layers that were pre-trained on some problem A for a different problem B. For example, problem A can be classifying cats and dogs, and problem B can be classifying lung disease in radiology images. It makes sense when:
- Problems A and B have the same type of input (e.g. images).
- There is a lot more data for problem A than for problem B.
- Low-level features from A could be helpful for learning B.
Multi-task Learning - training a NN for a classification problem where an input can be assigned multiple classes, for example an image which can contain cars, pedestrians, stop signs, traffic lights, or any combination of those. It can give better results than training a separate NN for each class, because the intermediate layers are reused.
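A common way to train such a network is to give it one sigmoid output per class and to sum an independent log loss term for each of them. The sketch below shows only that loss; the label order and the predicted probabilities are hypothetical.

```python
import math

def multi_label_loss(y_true, y_prob, eps=1e-12):
    """Sum of independent binary cross-entropy terms, one per label."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Labels: [car, pedestrian, stop sign, traffic light]; the image has a car and a stop sign.
y_true = [1, 0, 1, 0]
good_prediction = [0.9, 0.1, 0.8, 0.2]
bad_prediction = [0.2, 0.7, 0.3, 0.9]

print(multi_label_loss(y_true, good_prediction))
print(multi_label_loss(y_true, bad_prediction))
```

Unlike softmax classification, the outputs are not forced to sum to 1, so any combination of classes can be present at once.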
End-to-end ML - solving a problem using just an ML algorithm without any hand-designed components as part of the whole system. For example, for a speech recognition task an end-to-end ML approach is to use audio as an input for an ML algorithm and the transcript as the output, as opposed to manually extracting features from the audio first, then phonemes, then words, and then generating a transcript.
- Pros: let the data speak, less hand-designing of components needed.
- Cons: may need a large amount of data, excludes potentially useful hand-designed components.
Colab Notebooks:
- Rock-paper-scissors classification
- Fashion MNIST image classification (with intermediate layers visualization)
- Cats vs dogs (with intermediate layers visualization)
Why convolutions:
- Parameter sharing
- Sparsity of connections
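The effect of parameter sharing is easy to see by counting parameters. The sketch below compares a convolutional layer with a fully connected layer that maps the same input to the same number of outputs. The sizes (a 32x32x3 input, six 5x5 filters, stride 1, no padding) are assumptions chosen for the example.

```python
def conv_params(filter_size, in_channels, num_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return (filter_size * filter_size * in_channels + 1) * num_filters

def dense_params(num_inputs, num_outputs):
    """One weight per input/output pair plus one bias per output."""
    return num_inputs * num_outputs + num_outputs

in_h, in_w, in_c = 32, 32, 3
f, n_filters = 5, 6
out_h, out_w = in_h - f + 1, in_w - f + 1  # 28 x 28 with stride 1 and no padding

conv = conv_params(f, in_c, n_filters)
dense = dense_params(in_h * in_w * in_c, out_h * out_w * n_filters)
print(conv, dense)
```

The convolutional layer needs 456 parameters, while the equivalent dense layer needs more than 14 million.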
Computer Vision Networks:
- AlexNet
- VGG-16 - 16 weight layers: stacks of 3x3 "same" ConvLayers with MaxPooling layers between them, followed by fully connected layers
- ResNet
- Inception Network
Face Recognition:
- One Shot Learning, Triplet Loss
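Triplet loss compares an anchor image with a positive (same person) and a negative (different person). It pushes the anchor-positive distance to be smaller than the anchor-negative distance by at least a margin. A minimal sketch on already computed embeddings; the margin value and the 2-D embeddings are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(A, P) - d(A, N) + margin, 0)."""
    return max(
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
        0.0,
    )

# Easy triplet: the positive is already much closer than the negative.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))
# Hard triplet: the negative is closer than the positive, so the loss is large.
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [0.1, 0.0]))
```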
Neural Style Transfer:
DeepFake Colab:
- https://colab.research.google.com/github/AliaksandrSiarohin/first-order-model/blob/master/demo.ipynb
YOLO (You Only Look Once) algorithm:
TODO: Add more details
- RNN (Recurrent Neural Network) - has a problem of exploding/vanishing gradients.
- LSTM (Long Short-Term Memory Network) - mitigates the vanishing gradient problem by adding gated memory cells.
- GRU (Gated Recurrent Unit) - simplified version of LSTM.
- Attention Model - adds attention mechanism to LSTM: Colab Notebook
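The vanishing gradient problem of a plain RNN can be shown with a one-unit example. In the scalar RNN h_t = tanh(w * h_{t-1} + x_t), the gradient of the last state with respect to the first is a product of one factor per time step. This is a toy sketch; the weight and inputs are arbitrary.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: d h_t / d h_{t-1}
    return grad

short = gradient_through_time(0.5, [0.1] * 5)
long = gradient_through_time(0.5, [0.1] * 50)
print(short, long)
```

Because each factor is at most |w| in absolute value, with |w| < 1 the gradient shrinks at least geometrically with the sequence length, so early inputs barely influence learning.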
- Andrew Ng - Co-Founder of Coursera; Stanford CS adjunct faculty. Former head of Baidu AI Group/Google Brain.
- Andrej Karpathy - Director of AI at Tesla.
- Geoffrey Hinton - Works for Google Brain, Professor at the University of Toronto.
- Pieter Abbeel - Director of the Berkeley Robot Learning Lab.
- Ian Goodfellow - Director of machine learning in the Special Projects Group at Apple.
- Ruslan Salakhutdinov - UPMC Professor of Computer Science at Carnegie Mellon University.
- Yuanqing Lin - CEO & Founder of Aibee, Former Head of Baidu Research.
- Yann LeCun - Professor at NYU. Chief AI Scientist at Facebook.
- Lex Fridman - Research in machine learning, autonomous vehicles and human-centered AI.
- Jeremy Howard - Distinguished research scientist: @usfca; Co-founder: http://fast.ai; Chair: http://WAMRI.ai.
AI for Medicine Specialization, Coursera
AI for Diagnosis:
- Applications of AI for diagnosis (mostly computer vision):
- Diagnosing edema in lungs from X-ray scans.
- Dermatology: detecting whether a mole is a skin cancer: https://www.nature.com/articles/nature21056.
- Ophthalmology: diagnosing eye disorders using retinal fundus photos (e.g. diagnosing diabetic retinopathy).
- Histopathology: determining the extent to which a cancer has spread from microscopic images of tissues.
- Identifying tumors in MRI data - image segmentation. A CNN called U-Net is used for this.
- Challenges:
- Patient Overlap - as an example, the model can memorize a necklace on X-ray images of a single patient and give an over-optimistic test evaluation. To fix this, split train and test sets by patient, so that all images of the same patient are either in the train set or in the test set.
- Set Sampling - needed when the dataset is imbalanced. Minority class sampling is used.
- Ground Truth / Reference Standard - consensus voting.
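The fix for patient overlap can be sketched as a split over patient IDs rather than over images. The record layout (`patient_id`, `image`) and the test fraction are assumptions made for the example.

```python
import random

def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split records so that each patient appears in exactly one of the two sets."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test

# Hypothetical dataset: 10 patients with 3 X-ray images each.
records = [
    {"patient_id": p, "image": f"patient{p}_scan{i}.png"}
    for p in range(10)
    for i in range(3)
]
train, test = split_by_patient(records)
print(len(train), len(test))
```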
AI for Prognosis:
- Applications of AI for prognosis (mostly applications of Survival analysis):
- Predicting risk of an event or when an event is likely to happen. E.g. death, heart attack or stroke, for people with a specific condition or for the general population. It's used to inform the patient and to guide the treatment.
- Risk of breast or ovarian cancer using data from blood tests.
- Risk of death for a person with a particular cancer.
- Risk of a heart attack.
- Risk of lung cancer recurrence after therapy.
- Survival analysis is a field in statistics that is used to predict when an event of interest will happen. The field emerged from medical research as a way to model a patient's survival, hence the term "survival analysis".
- Censored data (end-of-study censoring, loss-to-follow-up censoring) - we don't know the exact time of an event, but we know that the event didn't happen before time X.
- Missing data: completely at random, at random, not at random. E.g. blood pressure measurements are missing for younger patients.
- Hazard, Survival to Hazard, Cumulative Hazard - functions that describe the probability of an event over time.
- C-index - a measure of performance for a survival model (concordance - patients with a worse outcome should have a higher risk score).
- Mortality score - the sum of hazards for different times.
- Python library for Survival analysis https://github.com/square/pysurvival/.
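The C-index with censored data can be sketched in a few lines. This is a simplified version of Harrell's C-index, not the pysurvival implementation: a pair is counted only when the patient with the shorter time had an observed event, ties in risk score count as half, and ties in time are ignored. The data is hypothetical.

```python
def c_index(times, events, risk_scores):
    """Fraction of permissible pairs in which the earlier event has the higher risk score.

    events[i] is 1 if the event was observed and 0 if the patient was censored.
    """
    concordant, permissible = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Permissible pair: patient i had an observed event strictly before time j.
            if events[i] == 1 and times[i] < times[j]:
                permissible += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if permissible == 0:
        raise ValueError("no permissible pairs")
    return concordant / permissible

times = [2, 4, 6, 8]   # follow-up time per patient
events = [1, 1, 0, 1]  # the third patient is censored
print(c_index(times, events, [0.9, 0.7, 0.5, 0.1]))  # risk ordered correctly
print(c_index(times, events, [0.1, 0.5, 0.7, 0.9]))  # risk ordered in reverse
```

A value of 0.5 corresponds to random risk scores and 1.0 to a perfect ranking.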
AI for Treatment:
- Applications of AI for treatment (mostly statistical methods):
- Treatment effect estimation - determining whether certain treatment will be effective for a particular patient. The input is features of the patient, e.g. age, blood pressure and the output is the number representing risk reduction or increase for an event e.g. stroke or heart attack. The data from randomized control trials is used to train the model.
- Treatment effect estimation:
- NNT (number needed to treat) = 1 / ARR (absolute risk reduction) - the number of people who need to receive the treatment in order to benefit one of them.
- Factual - what happens to the patient with/without treatment - we know it. Counterfactual - what would happen to the patient without/with treatment - we don't know it.
- Average Treatment Effect - difference between means of outcomes with treatment and without treatment.
- Conditional Average Treatment Effect - Average Treatment Effect given some conditions on the patient, e.g. age, blood pressure.
- Two Tree Method (T-Learner) - build two decision trees to estimate risk with and without treatment, then subtract the values given by these trees.
- C-for-benefit - similar to C-index but for treatment effect estimator evaluation.
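As a worked example of ARR and NNT, here is a short sketch on a hypothetical randomized trial with 100 patients per arm; the event counts are invented for illustration.

```python
def absolute_risk_reduction(control_events, control_n, treated_events, treated_n):
    """Risk of the event without treatment minus the risk with treatment."""
    return control_events / control_n - treated_events / treated_n

def number_needed_to_treat(arr):
    """Number of people who need to be treated to prevent one event."""
    return 1 / arr

# Hypothetical trial: 20 of 100 control patients and 10 of 100 treated patients had a stroke.
arr = absolute_risk_reduction(20, 100, 10, 100)
nnt = number_needed_to_treat(arr)
print(f"ARR = {arr:.2f}, NNT = {nnt:.0f}")
```

An ARR of 0.10 means that, on average, 10 people have to be treated to prevent one event.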
- The task of extracting labels from doctors' unstructured reports on lung X-ray images.
- Occurrences of specific labels are searched for in the text. E.g. if the word "edema" is found in the report, go to the next step. Because "edema" has synonyms, a special medical thesaurus called SNOMED CT is used to find synonyms and related terms.
- A negation classifier is used to determine the absence of a disease, e.g. if the report contains "no edema" or "no evidence of edema". This requires labeled data. If there is no labeled data, then simple Regex or Dependency Parse rules are used.
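The rule-based fallback can be sketched with two regular expressions: one to find a mention and one to detect a negation cue shortly before it. The synonym list and negation cues below are small hypothetical stand-ins, not SNOMED CT or a real negation lexicon.

```python
import re

# Hypothetical, deliberately small synonym list.
MENTION = re.compile(r"\b(edema|oedema)\b", re.IGNORECASE)

# A negation cue followed by at most three other words before the mention.
# Punctuation between the cue and the mention breaks the match.
NEGATED = re.compile(
    r"\b(no|without|negative for)\b(?:\s+\w+){0,3}?\s+(edema|oedema)\b",
    re.IGNORECASE,
)

def label_edema(report):
    """Return 1 if edema is mentioned, 0 if it is negated, None if it is not mentioned."""
    if not MENTION.search(report):
        return None
    return 0 if NEGATED.search(report) else 1

for text in [
    "Mild pulmonary edema is present.",
    "No evidence of edema.",
    "No pneumothorax. Mild edema.",
    "Lungs are clear.",
]:
    print(text, "->", label_edema(text))
```

Rules like these miss many phrasings (for example "edema has resolved"), which is why a trained classifier is preferred when labeled data exists.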
Applications of Deep Learning in Medicine:
TODO
Train, dev, and test datasets:
- Dev dataset prevents overfitting NN parameters (weights and biases) to the train data.
- Test dataset prevents overfitting NN hyperparameters (model architecture, number of layers, types of layers) to the train and dev data.
What stage are we at? Stages of an ML project:
- Individual contributor
- Delegation
- Digitization
- Big Data and Analytics
- Machine Learning
CRISP-DM model
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
Precision, recall, accuracy, sensitivity, specificity
- Precision and recall https://en.wikipedia.org/wiki/Precision_and_recall
- Accuracy = Sensitivity * prevalence + Specificity * (1 - prevalence)
- F1-score = 2 / ((1/P) + (1/R)) - the harmonic mean of precision (P) and recall (R); the same kind of mean gives the average speed over equal distances.
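All of these metrics follow from the four cells of a confusion matrix. The sketch below computes them and checks the accuracy identity above; the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics derived from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)       # also called sensitivity
    specificity = tn / (tn + fp)
    prevalence = (tp + fn) / total
    accuracy = (tp + tn) / total
    f1 = 2 / (1 / precision + 1 / recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "prevalence": prevalence,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical test results for 100 patients.
m = classification_metrics(tp=8, fp=2, fn=4, tn=86)
identity = m["recall"] * m["prevalence"] + m["specificity"] * (1 - m["prevalence"])
print(m["accuracy"], identity)
```

Here accuracy is 0.94 while recall is only about 0.67, which shows why accuracy alone is misleading when prevalence is low.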
Log loss (cross entropy loss):
- Most of the economic value is created by supervised learning.
- tanh activation works better than sigmoid activation in hidden layers because its output is zero-centered, which centers the data passed to the next layer. Sigmoid should only be used in the last layer for binary classification because its output is between 0 and 1 (a probability).
- ReLU works better than sigmoid or tanh because, unlike sigmoid and tanh, its derivative does not approach 0 for large positive inputs (it is 1 for any positive input and 0 for negative inputs).
- Weights of the layers of a neural net should be initialized with small random numbers. If they are initialized with zeroes, then all neurons of a layer will train to the same weights. If they are initialized with big numbers and sigmoid or tanh activation is used, the activations will saturate quickly and the learning will stall.
- Why deep neural nets work better than shallow ones for complex functions: to calculate the XOR of N inputs, a deep neural net needs only O(log N) layers (about N units in total), while a shallow neural net with a single hidden layer needs on the order of 2^N units. A shallow net would be much bigger.
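The saturation argument for sigmoid and tanh can be checked directly by evaluating the derivatives at a large input. A minimal sketch; the test point z = 10 is arbitrary.

```python
import math

def d_sigmoid(z):
    """Derivative of the sigmoid function."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh."""
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    """Derivative of ReLU (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

z = 10.0
print(d_sigmoid(z), d_tanh(z), d_relu(z))
```

For sigmoid and tanh the derivative at z = 10 is practically zero, so almost no gradient flows back through a saturated unit, while for ReLU it is still 1.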
GPT:
- GPT-3 vs Human Brain, Lex Fridman: https://www.youtube.com/watch?v=kpiY_LemaTc
- Write with Transformer (Get a modern neural network to auto-complete your thoughts): https://transformer.huggingface.co/
- 10 Minutes to Pandas
- Keras Hello World
- Fashion MNIST image classification with intermediate layers visualization
- Natality dataset in BigQuery:
- Deep Learning Specialization, Coursera
- Deep Learning Specialization, Python Notebooks on GitHub
- Deep Learning Specialization v2, Python Notebooks on GitHub
- Machine Learning with TensorFlow on GCP Specialization, Coursera
- Advanced Machine Learning with TensorFlow on GCP Specialization, Coursera
- TensorFlow and Keras in Practice Specialization, Coursera
- Neural Networks and Deep Learning, book by Michael Nielsen
- PyTorch at Tesla - Andrej Karpathy, YouTube
- Tesla Autonomy Day, YouTube
- AI for Medicine Specialization, Coursera
- Deep Learning State of the Art (2020) | MIT Deep Learning Series