`mean(x)`: calculate the mean of a vector `x`

`mean(c(1,2,3,4,5))`
`mean(c(1,2,3,4,NA), na.rm = T)`

Make sure to use `na.rm = T` if your sample has NA values.
`sd(x)`: calculate the standard deviation of a vector `x`

`sd(c(1,2,3,4,5))`
`summary(x)`: calculate six-number summary statistics of a vector `x` (min, Q1, median, mean, Q3, max)

`summary(c(1,2,3,4,5))`
`table(x)`: counts the number of occurrences of each unique value of `x`

`table(ice_cream$ice_cream)` returns the number of students that like each of the three flavors of ice cream
`sum(x)`: calculate the sum of a vector `x`

`sum(c(1,2,3,4,5))`
`prod(x)`: calculate the product of a vector `x`

`prod(c(4,5,2))`
`sqrt(x)`: calculate the square root of a value

`sqrt(100)`
`seq(from, to, by)`: generate a sequence of values

`seq(1, 10, 1)`
`seq(10, 1000, 20)`
`rep(x, times = 1)`: generates a vector of `x` repeated `times` times

`rep(1, 10)`
`set.seed(x)`: sets the seed for the random number generator. If you provide the same seed number before your analysis, you will get the same 'random' sample each time.

`set.seed(1753)`
`sample(x, size, replace = FALSE, prob = NULL)`: takes a sample of size `size` from vector `x`, with or without replacement

`sample(seq(1,10,1), 3)`
`sample(seq(1,10,1), 3, replace = T)`

- Can use `prob` as a vector of probabilities for each element of `x`: `sample(c("H", "T"), 10, replace = T, prob = c(0.6, 0.4))`
A note on distribution functions
- "r" functions (e.g. `rbinom()`) generate a vector of random values following the given distribution. `rbinom(10, size = 1, prob = 0.5)` would simulate 10 coin flips, with 1 being heads and 0 being tails
- "d" functions (e.g. `dbinom()`) return the value of the probability mass (or density) function of the distribution. `dbinom(5, 10, 0.37)` returns a single value: the probability of getting exactly 5 out of 10 successes when the probability of each success is 0.37
- "p" functions (e.g. `pbinom()`) return the value of the cumulative distribution function. `pbinom(5, 10, 0.37)` returns a single value: the probability of getting five or fewer successes when the probability of each success is 0.37. This can be thought of as the area under the curve, and is commonly used to answer questions with a normal distribution
- "q" functions (e.g. `qbinom()`) return the quantile (or z score) that gives the specified area under the curve. You can think of "q" and "p" functions as inverses. `qbinom(0.879, 10, 0.37)` returns the value (5) where the area under the curve is 0.879. Note that the answer to `pbinom(5, 10, 0.37)` is 0.879.
Note: many of these functions have an optional argument `lower.tail = TRUE/FALSE`, meant to calculate the area under the curve to the left (TRUE) or right (FALSE) of your value.
More explanation and examples can be found here
*binom
`rbinom(n, size, prob)` - generate random values
`dbinom(x, size, prob)` - calculate the probability of seeing a specific value
`pbinom(q, size, prob)` - calculate the cumulative probability of seeing at most a specific value (inverse of `qbinom()`)
`qbinom(p, size, prob)` - calculate the specific value given a probability of at most that value (inverse of `pbinom()`)
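A quick illustration of the four `*binom` functions on a fair coin (the inputs here are arbitrary examples):

```r
dbinom(5, size = 10, prob = 0.5)    # 0.2460938: P(exactly 5 heads in 10 flips)
pbinom(5, size = 10, prob = 0.5)    # 0.6230469: P(5 or fewer heads in 10 flips)
qbinom(0.62, size = 10, prob = 0.5) # 5: smallest count with cumulative probability >= 0.62
rbinom(3, size = 10, prob = 0.5)    # three random counts of heads out of 10 flips
```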
*norm
`rnorm(n, mean = 0, sd = 1)` - generate random values
`dnorm(x, mean = 0, sd = 1)` (rarely used)
`pnorm(q, mean = 0, sd = 1)` - calculate the probability (p-value) of a given quantile (z score) (inverse of `qnorm()`)
`qnorm(p, mean = 0, sd = 1)` - calculate the quantile (z score) of a given probability (p-value) (inverse of `pnorm()`)
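The `pnorm()`/`qnorm()` pair in action, including the `lower.tail` argument described earlier:

```r
pnorm(1.96)                      # ~0.975: area to the left of z = 1.96
pnorm(1.96, lower.tail = FALSE)  # ~0.025: area to the right of z = 1.96
qnorm(0.975)                     # ~1.96: recovers the z score; inverse of pnorm()
```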
Q-Q plots
```r
# generate random normal data
data <- rnorm(100)
# plot Q-Q plot and line
qqnorm(data)
qqline(data)
```
*t
`rt(n, df)` - generate random values
`dt(x, df)` (rarely used)
`pt(q, df)` - calculate the probability (p-value) of a given quantile (or t statistic) (inverse of `qt()`)
`qt(p, df)` - calculate the quantile (t statistic) of a given probability (p-value) (inverse of `pt()`)
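For example, the t quantile used for a 95% confidence interval with 9 degrees of freedom, and the matching two-sided p-value:

```r
qt(0.975, df = 9)                          # ~2.262: critical t value for df = 9
2 * pt(2.262, df = 9, lower.tail = FALSE)  # ~0.05: two-sided p-value for t = 2.262
```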
*hyper
`rhyper(nn, m, n, k)` - generate random values (not often used)
`dhyper(x, m, n, k)` - calculate the probability of seeing a specific value
`phyper(q, m, n, k)` - calculate the cumulative probability (p-value) of a given value (inverse of `qhyper()`)
`qhyper(p, m, n, k)` - calculate the value/quantile of a given probability (p-value) (inverse of `phyper()`) (not often used)
*chisq
`rchisq(n, df)` - generate random values (not often used)
`dchisq(x, df)` - calculate the probability of seeing a specific value
`pchisq(q, df)` - calculate the cumulative probability (p-value) of a given value (inverse of `qchisq()`)
`qchisq(p, df)` - calculate the value/quantile of a given probability (p-value) (inverse of `pchisq()`) (not often used)
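For example, the familiar chi-square critical value for one degree of freedom:

```r
qchisq(0.95, df = 1)                       # ~3.841: critical value at alpha = 0.05
pchisq(3.841, df = 1, lower.tail = FALSE)  # ~0.05: p-value for a statistic of 3.841
```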
*f
`rf(n, df1, df2)` - generate random values (not often used)
`df(x, df1, df2)` - calculate the probability of seeing a specific value (not often used)
`pf(q, df1, df2)` - calculate the cumulative probability (p-value) of a given value (inverse of `qf()`)
`qf(p, df1, df2)` - calculate the value/quantile of a given probability (p-value) (inverse of `pf()`)
Parametric
`t.test(x, mu)` - One sample t-test
`t.test(x, y)` - Two (independent) sample t-test
`t.test(x, y, paired = T)` - Paired (dependent) sample t-test
`lm(y ~ x)` - linear model of y given x
`anova(lm(y ~ x))` - analysis of variance of y given x, where x is categorical with several values
`TukeyHSD(model)` - Pairwise t-tests with adjusted p-values for multiple testing. Requires the output of `aov()` as input.
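A minimal sketch of these tests on simulated data (the samples and group labels below are made up for illustration):

```r
set.seed(42)
x <- rnorm(20, mean = 5)      # simulated sample 1
y <- rnorm(20, mean = 6)      # simulated sample 2
t.test(x, mu = 0)             # one sample: is the mean of x different from 0?
t.test(x, y)                  # two independent samples (Welch's t-test by default)
t.test(x, y, paired = TRUE)   # paired samples (x and y measured on the same units)
fit <- lm(y ~ x)              # linear model of y given x
anova(fit)                    # ANOVA table for the model

g <- factor(rep(c("a", "b", "c"), each = 10))  # a 3-level grouping factor
v <- rnorm(30) + as.numeric(g)                 # response shifted by group
TukeyHSD(aov(v ~ g))          # pairwise comparisons from an aov() fit
```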
Non-parametric
`wilcox.test(x)` - One sample Wilcoxon signed-rank test
`wilcox.test(x, y)` - Two (independent) sample Mann-Whitney (or Wilcoxon rank-sum) test
`wilcox.test(x, y, paired = T)` - Paired (dependent) sample test (aka Wilcoxon signed-rank)
`fisher.test(x, alternative = "greater")` - Fisher's exact test. Use `alternative = "greater"` to test one-sided for enrichment; can also use a non-directional alternative. `x` is a 2x2 contingency table
`chisq.test(x, correct = F)` - Chi-square test for contingency tables and goodness-of-fit tests on count data. The default is `correct = T`, which applies a continuity correction; to get the same answers as when doing it by hand, you need to use `correct = F`
`mcnemar.test(x, correct = F)` - Chi-square test for paired data (not really covered in class, just a note in case it is useful in your research)
`kruskal.test(y ~ x)` - Non-parametric alternative to `anova`. Uses the rank-sum method.
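A minimal sketch of the non-parametric tests on made-up data (the counts in the 2x2 table are invented for illustration):

```r
set.seed(42)
x <- rnorm(15)
y <- rnorm(15, mean = 1)
wilcox.test(x, y)                        # two-sample Mann-Whitney / rank-sum test

tbl <- matrix(c(8, 2, 1, 9), nrow = 2)   # invented 2x2 contingency table
fisher.test(tbl, alternative = "greater")
chisq.test(tbl, correct = FALSE)         # warns about small expected counts here
```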
Note: most hypothesis tests have the optional argument `alternative = c("two.sided", "less", "greater")` for one-sided or two-sided tests.
Choosing a test
`quantile(x, probs)` - calculate the quantile of a sequence or distribution at a given probability (`probs`). Note: `quantile(rnorm(100), probs = 0.95)` returns the 95th percentile of the random sample.
`shapiro.test(x)` - test of non-normality (a large p-value, e.g. P > 0.1, is consistent with a normal distribution)
`power.t.test(n, delta, sd, sig.level, power)` - perform a power calculation (provide 4 of the 5 parameters and get the 5th)
`p.adjust(p, method)` - adjust p-values for multiple testing. Several different methods are available, including `fdr` and `bonferroni`
`cor(x, y, method = c("pearson", "spearman", "kendall"))` - calculate the correlation between two variables. Defaults to Pearson (parametric) but can use Spearman or Kendall as non-parametric options
`cor.test(x, y, method = c("pearson", "spearman", "kendall"))` - test if the correlation between two variables is different from zero. Defaults to Pearson (parametric) but can use Spearman or Kendall as non-parametric options
`cov(x, y)` - calculates the covariance between two variables
`summary(model)` - returns summary statistics for a model (e.g. a linear model from `lm()`)
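As a quick illustration (the p-values below are made up; `mtcars` is a built-in dataset):

```r
p <- c(0.01, 0.02, 0.04, 0.20)        # made-up p-values from four tests
p.adjust(p, method = "bonferroni")    # 0.04 0.08 0.16 0.80
p.adjust(p, method = "fdr")           # Benjamini-Hochberg adjustment

quantile(1:100, probs = 0.95)         # 95th percentile of the sequence 1..100
cor(mtcars$mpg, mtcars$wt)            # Pearson correlation (negative: heavier cars, lower mpg)
cor.test(mtcars$mpg, mtcars$wt, method = "spearman")  # non-parametric version
```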
- ggplot2 cheatsheet
- Base R plotting cheatsheet
`qqnorm(data)` - Q-Q plot. Can also add `qqline(data)` to add the reference line to check for normality
`plot(model)` - diagnostic plots for a linear model
`plot(model, 1)` - plot residuals vs fitted values from a model
`ggplot2::qplot(x, y)` - easy ggplot plots
`interaction.plot(x1, x2, y)` - plot the interaction between two factors (`x1` and `x2`) on a response `y` for a two-way ANOVA (note the response is the third argument)