Skip to content

r_keyuxi

James Chen edited this page Apr 19, 2018 · 1 revision

#R notes

Author: 18-Yuxi Ke

基本分布的内置函数

四种函数

以正态分布为例,

  • dnorm() : p.d.f.
  • pnorm(): c.d.f.
  • qnorm(): quantile, 分位点
  • rnorm(): 根据分布生成随机数

例子:

dnorm(x, mean=0, sd=1, log = FALSE)# default

pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)

qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)

rnorm(n, mean=0, sd=1)

常用统计检验

R中的检验函数

t.test
var.test
wilcox.test# Wilcoxon signed rank test
binom.test
prop.test
fisher.test# Fisher's exact test (2 binomial distributions)
chisq.test

几个非参数检验

  • Sign Test: 检验中位数,可以使用’binom.test‘:
#sample size = 200, H1: m < 0.5
binom.test( sum(x>0), 200, 0.5, alternative = "1" )
  • Signed Rank Test: 检验样本是否关于一个点对称.

Wilcoxon signed rank statistic: $$\sum_{i=1}^{n} \psi (X_i) R^+$$, $$R^+$$是$$|X|$$在所有样本绝对值排序后所处的位置,$$\psi()$$是示性函数.

wilcox.test(x, y, "t", paired = "T")
  • Rank sum test: 双样本

Rankpool两组样本之后排序,求统计量,并分别求和

Mann-Whitney统计量:$$\sum_i\sum_j \psi (Y_j - X_i)$$

评估使用的检验是否合适

  • t test
  • Sign test (binom.test)
  • Wilcoxon signed rank test

Populations

  • Normal
  • Unform
  • Cauchy
  • Laplace
# 1. cutoff points
alpha<-0.05
qt(alpha,19,lower.tail=FALSE)#自由度为19的t test
qbinom(alpha,20,0.5,lower.tail=FALSE)
qsignrank(alpha,20,lower.tail=FALSE)

# declare
m0<-0
n_power<-matrix(data=0,nrow=151,ncol=3)
u_power<-matrix(data=0,nrow=151,ncol=3)
c_power<-matrix(data=0,nrow=151,ncol=3)
l_power<-matrix(data=0,nrow=151,ncol=3)

sample<-matrix(NA,nrow=4,ncol=20)
reject<-matrix(data=0,nrow=4,ncol=3)

# cutoff points
alpha<-0.05
t_cut<-qt(alpha,19,lower.tail=FALSE)
b_cut<-qbinom(alpha,20,0.5,lower.tail=FALSE)
w_cut<-qsignrank(alpha,20,lower.tail=FALSE)

# simulate
for (i in 0:150)
{
m<-3*i/150

for ( rtrial in 1:1000 )
{
# generate random samples
sample[1,]<-rnorm(20,mean=m,sd=1)
sample[2,]<-runif(20,min=m-0.5,max=m+0.5)
sample[3,]<-rcauchy(20,location=m,scale=1)
for (num in 1:20)
{
x<-rexp(1,rate=1)
y<-rbinom(1,1,0.5)
if (y)
x<-x+m
else
x<-m-x
sample[4,num]<-x
}

# tests for each distribution
for (dis in 1:4)
{
if ((mean(sample[dis,])-m0)/sqrt(var(sample[dis,])/20)>t_cut)
reject[dis,1]<-reject[dis,1]+1

if (sum(sample[dis,]>m0)>b_cut)
reject[dis,2]<-reject[dis,2]+1

sample_rank<-rank(abs(sample[dis,]) )
if ( sum(sample_rank*(sample[dis,]>m0)) > w_cut )
reject[dis,3]<-reject[dis,3]+1
}
}

# calculate power function
reject<-reject/1000
n_power[i+1,]<-reject[1,]
u_power[i+1,]<-reject[2,]
c_power[i+1,]<-reject[3,]
l_power[i+1,]<-reject[4,]

}

plot results

# x-axis
xaxis=seq(0,3,by=3/150)

# normal
plot(xaxis,n_power[,1],type="l",col="red",xlab="theta",ylab="power",ylim=c(0,1))
lines(xaxis,n_power[,2],col="green")
lines(xaxis,n_power[,3],col="blue")
title(main="normal")
legend(2,0.3,c("T test", "B test", "W test"),col = c(2,3,4),lty = c(1, 1, 1))

# uniform
plot(xaxis,u_power[,1],type="l",col="red",xlab="theta",ylab="power",ylim=c(0,1))
lines(xaxis,u_power[,2],col="green")
lines(xaxis,u_power[,3],col="blue")
title(main="uniform")
legend(2,0.3,c("T test", "B test", "W test"),col = c(2,3,4),lty = c(1, 1, 1))


# Cauchy
plot(xaxis,c_power[,1],type="l",col="red",xlab="theta",ylab="power",ylim=c(0,1))
lines(xaxis,c_power[,2],col="green")
lines(xaxis,c_power[,3],col="blue")
title(main="Cauchy")
legend(2,0.3,c("T test", "B test", "W test"),col = c(2,3,4),lty = c(1, 1, 1))


# Laplace
plot(xaxis,l_power[,1],type="l",col="red",xlab="theta",ylab="power",ylim=c(0,1))
lines(xaxis,l_power[,2],col="green")
lines(xaxis,l_power[,3],col="blue")
title(main="Laplace")
legend(2,0.3,c("T test", "B test", "W test"),col = c(2,3,4),lty = c(1, 1, 1))
Population Ranking
Normal $W\approx T&gt;B$
Uniform $W\approx T&gt;B$
Cauchy $B&gt;W&gt;T$
Laplace $W&gt;T&gt;B$

可以看到Wilcoxon signed rank test 一直有较好的表现;t test在normal或uniform分布下表现较好,但在Cauchy分布下很不理想,$\theta$ 较大时也难以正确地拒绝原假设;Sign test在normal或uniform分布下表现一般,但对于Cauchy分布这种比较奇特的情况效果较好。

应考虑到样本具体的性质,合理用概率模型建模,譬如当分布显然很skew的时候就不宜使用t test。

P-P plot & Q-Q plot

Q: quantile

评估真实数据是否符合某一理论分布

p.s. 也可以用percentile画P-P plot

# generate Q-Q plot against normal hypothesis
# adapted from http://bbs.pinggu.org/thread-2732625-1-1.html
# AirPassengers is built-in data in R
expeQ <- quantile(AirPassengers,seq(0,1,0.01))
theoQ <- qnorm(seq(0,1,0.01),mean(AirPassengers),sd(AirPassengers))
plot(theoQ, expeQ, ylim=c(0,max(AirPassengers)))
abline(a=0, b=1, col="red", lwd = 2)

经验分布函数

plot.ecdf(AirPassengers)

可以与pnorm生成的理论cdf相比较

Reference: slides from Statistical Methods and Applications