diff --git a/.nojekyll b/.nojekyll
index 01d1fc7..17eff7e 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-dc5d3c6f
\ No newline at end of file
+a552ad32
\ No newline at end of file
diff --git a/sitemap.xml b/sitemap.xml
index eb8aa62..9902a5a 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,46 +2,46 @@
https://uw-psych.github.io/psych532-slides/slides/07-fair.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/06-machine-learning.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/08-viz.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/05-hpc.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/02-reproducibility.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/about.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/index.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/01-introduction.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/04-stats-with-big-data.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/09-ethics.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
https://uw-psych.github.io/psych532-slides/slides/03-working-with-big-data.html
- 2024-04-21T04:35:37.958Z
+ 2024-04-22T11:07:40.303Z
diff --git a/slides/04-stats-with-big-data.html b/slides/04-stats-with-big-data.html
index 94707c9..d14f9db 100644
--- a/slides/04-stats-with-big-data.html
+++ b/slides/04-stats-with-big-data.html
@@ -334,15 +334,15 @@
Doing statistics with big data
Statistics with Big Data
-- The curse of dimensionality.
-- The challenges of null hypothesis statistical testing.
-- Visualization as a solution.
-- Statistical solutions.
+
- Some challenges in standard data analysis methods.
+- Some solutions.
+- Estimating error.
- Resampling.
- The Jackknife.
- The Bootstrap.
+- The curse of dimensionality.
@@ -385,7 +385,7 @@ The Bayesian objection
But the inference drawn is often that:
-- \(p(H_0 | data) is small\).
+- \(p(H_0 | data)\) is small.
Which may or may not be true depending on the prior of \(H_0\).
Making \(\alpha = 0.05\) even more arbitrary.
@@ -440,10 +440,29 @@ Explicit models
+
+Some challenges
+
+
+- To calculate error bars, we need an estimate of the standard error of the statistic.
+- For simple cases, this is derived from the variance of the sampling distribution.
+
+- What is the variance across multiple samples of size \(n\)?
+
+- For some statistics (and with some assumptions), we can calculate this.
+
+- For example, the standard error of the mean (the standard deviation of its sampling distribution) is \(\frac{\sigma}{\sqrt{n}}\)
+
+- For many statistics, the sampling distribution is not well defined.
+- But it can be computed empirically (see the sketch below).
+
+
+
+
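To make the last two bullets concrete, here is a minimal NumPy sketch (mine, not part of the slides) that compares the analytic standard error of the mean, \(\sigma/\sqrt{n}\), with an empirical estimate obtained by simulating the sampling distribution; the population, sample size, and number of repetitions are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: exponential with scale 2, so sigma = 2.
sigma, n, n_reps = 2.0, 50, 10_000

# Analytic standard error of the mean: sigma / sqrt(n).
se_analytic = sigma / np.sqrt(n)

# Empirical sampling distribution: draw many samples of size n
# and look at the spread of their means.
means = np.array([rng.exponential(scale=sigma, size=n).mean()
                  for _ in range(n_reps)])
se_empirical = means.std(ddof=1)

print(f"analytic SE:  {se_analytic:.3f}")   # 2 / sqrt(50) ~ 0.283
print(f"empirical SE: {se_empirical:.3f}")  # close to the analytic value
```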
Computing to the rescue
-
+
@@ -459,10 +478,19 @@ Resampling methods
The Jackknife
+
- Originally invented by the statistician Maurice Quenouille in the 1940s.
- Championed by Tukey, who also named it for its versatility and utility.
-- The mechanics:
+
+
+
+
+
+The Jackknife
+
+
+- The algorithm:
- Consider the statistic \(\theta(X)\) calculated for data set \(X\)
- Let the sample size of the data set \(X\) be \(n\)
- For i in 1…\(n\)
@@ -476,13 +504,15 @@
The Jackknife
The estimate of the standard error \(SE_\theta\) is:
-- $SE_= $
+- \(SE_\theta = \sqrt{ \frac{n-1}{n} \sum_{i}{ (\hat{\theta} - \theta_i) ^2 }}\)
+
-
+
The jackknife
+
- The bias of the jackknife is smaller than the bias of \(\theta\) (why?)
- Can also be used to estimate the bias of \(\theta\):
@@ -490,6 +520,7 @@
The jackknife
- \(\hat{B} = \hat{\theta} - \theta\)
+
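Pulling the last few slides together, here is a short NumPy sketch of the jackknife (my illustration, not course code): compute the leave-one-out replicates \(\theta_i\), the standard error \(\sqrt{\frac{n-1}{n}\sum_i(\hat{\theta}-\theta_i)^2}\), and a bias estimate. Note that the conventional bias estimator carries a factor of \(n-1\), and the statistic and data below are arbitrary choices.

```python
import numpy as np

def jackknife(x, statistic):
    """Leave-one-out jackknife SE and bias estimate for `statistic`."""
    x = np.asarray(x)
    n = len(x)
    theta_full = statistic(x)
    # Leave-one-out replicates: theta_i computed on x with the i-th point removed.
    theta_i = np.array([statistic(np.delete(x, i)) for i in range(n)])
    theta_bar = theta_i.mean()
    # SE = sqrt((n - 1)/n * sum_i (theta_bar - theta_i)^2)
    se = np.sqrt((n - 1) / n * np.sum((theta_i - theta_bar) ** 2))
    # Conventional jackknife bias estimate: (n - 1) * (theta_bar - theta_full)
    bias = (n - 1) * (theta_bar - theta_full)
    return se, bias

rng = np.random.default_rng(1)
data = rng.normal(size=100)
# Example statistic: the (biased) sample standard deviation.
se, bias = jackknife(data, lambda x: x.std())
print(f"jackknife SE: {se:.3f}, estimated bias: {bias:.4f}")
```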
@@ -501,7 +532,7 @@ Demo
Some limitations
- Assumes data is IID
-- Assumes that \(\theta\) is $ (,,^{2}) $
+- Assumes that \(\theta\) is \(\sim \mathcal{N}(\mu,\,\sigma^{2})\)
- Can fail badly with non-smooth estimators (e.g., median)
- We’ll talk about cross-validation next week.
- And we may or may not come back to permutations later on.
@@ -510,13 +541,45 @@ Some limitations
The bootstrap
-Invented by Bradley Efron - See interview for the back-story. - Very general in its application - Consider a statistic \(\theta(X)\) - For i in \(1...b\) - Sample \(n\) samples with replacement: \(X_b\) - In the pseudo-sample, calculate \(\theta(X_b)\) and store the value - Standard error is the sample standard deviation of \(\theta\): - \(\sqrt(\frac{1}{n-1} \sum_i{(\theta - \bar{\theta})^2})\) - Bias can be estimated as: - \(\theta{X} - \bar{theta}\) (why?) - The 95% confidence interval is in the interval between 2.5 and 97.5.
+Invented by Bradley Efron
+
+- See interview for the back-story.
+- Very general in its application
+
+
+
+
+The bootstrap
+
+
+- The algorithm:
+- Consider a statistic \(\theta(X)\)
+- For i in \(1...b\)
+
+- Sample \(n\) observations with replacement: \(X_b\)
+- In the pseudo-sample, calculate \(\theta(X_b)\) and store the value
+
+- Standard error is the sample standard deviation of the \(\theta(X_b)\) values:
+
+- \(\sqrt{\frac{1}{b-1} \sum_i{(\theta_i - \bar{\theta})^2}}\)
+
+- Bias can be estimated as:
+
+- \(\bar{\theta} - \theta(X)\) (why?)
+
+- The 95% confidence interval is the interval between the 2.5th and 97.5th percentiles of the bootstrap distribution (see the sketch below).
+
+
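As with the jackknife, a minimal NumPy sketch of the bootstrap loop just described (my illustration, not from the slides): draw \(b\) pseudo-samples with replacement, compute \(\theta\) on each, and read off the standard error, the bias estimate, and the 2.5th–97.5th percentile interval. The statistic (the median), \(b = 5000\), and the data are arbitrary choices.

```python
import numpy as np

def bootstrap(x, statistic, b=5000, rng=None):
    """Bootstrap SE, bias estimate, and 95% percentile CI for `statistic`."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    theta_hat = statistic(x)
    # b pseudo-samples of size n, drawn with replacement.
    theta_star = np.array([statistic(rng.choice(x, size=n, replace=True))
                           for _ in range(b)])
    se = theta_star.std(ddof=1)            # sample SD of the replicates
    bias = theta_star.mean() - theta_hat   # mean of replicates minus original estimate
    ci = np.percentile(theta_star, [2.5, 97.5])
    return se, bias, ci

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=80)
se, bias, ci = bootstrap(data, np.median, rng=rng)
print(f"SE: {se:.3f}, bias: {bias:.3f}, 95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```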
Why is the bootstrap so effective?
+
+
+- Alleviates distributional assumptions required with other methods.
-- Alleviates distributional assumptions required with other methods.
+- “non-parametric”
+
- Flexible to the statistic that is being interrogated
- Allows interrogating sampling procedures
@@ -526,6 +589,7 @@ Why is the bootstrap so effective?
- And other complex procedures.
- Efron argues that this is the natural procedure Fisher et al. would have preferred in the 1920s, had they had computers.
+
@@ -540,6 +604,7 @@ A few pitfalls of the bootstrap
A few pitfalls
+
- Estimates of SE tend to be biased downward in small samples.
@@ -559,6 +624,7 @@ A few pitfalls
- Residuals are preferable when considering a designed experiment with fixed levels of an IV.
+
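To illustrate the residuals-versus-cases point (a sketch of mine, not the slides'): in a designed experiment the predictor levels are fixed, so rather than resampling whole rows we keep the design matrix as-is and resample the residuals, adding them back to the fitted values before refitting. The simple linear model and its parameters below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical designed experiment: fixed levels of the IV, noisy DV.
x = np.repeat([0.0, 1.0, 2.0, 3.0], 10)
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=x.size)

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# Residual bootstrap: keep the design fixed, resample only the residuals.
b = 2000
slopes = np.empty(b)
for i in range(b):
    y_star = fitted + rng.choice(resid, size=resid.size, replace=True)
    slopes[i] = np.linalg.lstsq(X, y_star, rcond=None)[0][1]

print(f"bootstrap SE of the slope: {slopes.std(ddof=1):.3f}")
```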
@@ -572,6 +638,14 @@ Building on the bootstrap
+
+Further reading
+
+John Fox & Sanford Weisberg have an excellent chapter on “Bootstrapping Regression Models”, with clear explanations and R code.
+Another good set of explanations appears in a tutorial paper by Kulesa et al.
+
+
+
The curse of dimensionality
What about large \(p\)?
@@ -583,22 +657,48 @@ The curse of dimensionality
Data is sparser in higher dimensions
-
+
The distance between points increases rapidly
+
+
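A quick simulation (mine, not from the slides) makes the distance claim tangible: draw points uniformly in the unit hypercube and watch the typical pairwise Euclidean distance grow, and the contrast between the nearest and the average neighbor shrink, as the dimensionality \(p\) increases. The point count and dimensions are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
n = 200  # number of points per sample (arbitrary)

for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, p))   # points in the unit hypercube
    d = pdist(X)                   # all pairwise Euclidean distances
    print(f"p={p:>4}: mean distance {d.mean():.2f}, "
          f"min/mean ratio {d.min() / d.mean():.2f}")
```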
+
+Multicollinearity
+
+- If the data is an \(n\)-by-\(p\) matrix:
+
+
+
\[
+\begin{bmatrix}
+X_{11} & X_{12} & \cdots & X_{1p} \\
+X_{21} & X_{22} & \cdots & X_{2p} \\
+\vdots & \vdots & \ddots & \vdots \\
+X_{n1} & X_{n2} & \cdots & X_{np}
+\end{bmatrix}
+\]
+
+
+
multicollinearity means that some column is a linear combination of the other columns.
+
When \(p > n\), multicollinearity always exists
-
+That is, there exists \(\beta\) such that
+
+
\(X_{j} = \sum_{k \neq j}{\beta_k X_{k}}\)
+
+
+
But multicollinearity can exist even when \(p < n\)!
+
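To see why \(p > n\) guarantees exact multicollinearity (again my sketch, not the slides'): with more columns than rows the design matrix cannot have full column rank, so any column can be reproduced exactly as a linear combination of the others. The dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 50                     # more predictors than observations
X = rng.normal(size=(n, p))

print("rank(X) =", np.linalg.matrix_rank(X), "<", p)

# Express the first column as a linear combination of the remaining ones:
# solve X[:, 1:] @ beta = X[:, 0] by least squares -- the fit is exact.
beta, *_ = np.linalg.lstsq(X[:, 1:], X[:, 0], rcond=None)
reconstruction_error = np.abs(X[:, 1:] @ beta - X[:, 0]).max()
print("max reconstruction error:", reconstruction_error)  # ~0 (numerical noise)
```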
The false positive rate increases
-
+
Machine learning to the rescue?
diff --git a/slides/images/cod_distance.png b/slides/images/cod_distance.png
new file mode 100644
index 0000000..58f8ab3
Binary files /dev/null and b/slides/images/cod_distance.png differ
diff --git a/slides/images/cod_sparse.png b/slides/images/cod_sparse.png
new file mode 100644
index 0000000..2d256a7
Binary files /dev/null and b/slides/images/cod_sparse.png differ
diff --git a/slides/images/code_false_positives.png b/slides/images/code_false_positives.png
new file mode 100644
index 0000000..a56fbf2
Binary files /dev/null and b/slides/images/code_false_positives.png differ