种群遗传变异数据处理流程
- gatk过滤snp (过滤前7779362,过滤后6310310)
conda install -c bioconda gatk4
mkdir ./tmp
##给未通过过滤阈值的snp生成一个标签“SNP_filter”.
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" VariantFiltration -V pop.variant.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name 'SNP_filter' -O pop.variant.filter.vcf
##根据标签“SNP_filter”删除不满足要求的snp
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" SelectVariants -V pop.variant.vcf --exclude-filtered -O all.filtered.snp.vcf
注:-R ../01.ref/genome.fasta在有参考基因组的情况下需要这个命令,没有参考基因组可以略去
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" SelectVariants -V all.merge_raw.vcf --select-type SNP -O all.raw.snp.vcf
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" VariantFiltration -R ../01.ref/genome.fasta -V all.raw.snp.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name 'SNP_filter' -O all.filter.snp.vcf
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" SelectVariants -R ../01.ref/genome.fasta -V all.filter.snp.vcf --exclude-filtered -O all.filtered.snp.vcf
##给未通过过滤阈值的snp生成一个标签“SNP_filter”.
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" VariantFiltration -V pop.variant.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name 'SNP_filter' -O pop.variant.filter.vcf
##根据标签“SNP_filter”删除不满足要求的snp
gatk --java-options "-Xmx4g -Djava.io.tmpdir=./tmp" SelectVariants -V pop.variant.filter.vcf --exclude-filtered -O all.filtered.snp.vcf
注:-R ../01.ref/genome.fasta在有参考基因组的情况下需要这个命令,没有参考基因组可以略去
- plink进一步过滤snp (仅适用于全基因组重测序call的snp,不适用于转录组)
conda install -c bioconda plink
## filter missing and maf()
plink --vcf all.filtered2.snp.vcf --geno 0.2 --maf 0.05 --out all.missing_maf --recode vcf-iid --allow-extra-chr --set-missing-var-ids @:# --keep-allele-order
参数解释:--geno表示missing data rate,maf表示次等位基因的频率,目的是为了保留多态性较高的位点。--allow-extra-chr允许不同格式的染色体id。--set-missing-var-ids自动补齐缺失的snp。id--keep-allele-order保持原始vcf文件的顺序。
## filter LD (pass连锁的snp,做structure必须用到,做进化树或者PCA可用可不用。第一步,生成符合条件snp位点的id。注:这一步不要重复多次,因为每重复一次,位点都会减少。第二步,提取snp位点)
plink --vcf all.missing_maf.vcf --indep-pairwise 50 10 0.2 --out tmp.ld --allow-extra-chr --set-missing-var-ids @:#
plink --vcf all.missing_maf.vcf --make-bed --extract tmp.ld.prune.in --out all.LDfilter --recode vcf-iid --keep-allele-order --allow-extra-chr --set-missing-var-ids @:#
- vcftools过滤snp(适用于转录组,过滤前6310310,过滤后72870个snp)
conda install -c bioconda vcftools
##删除indels,只保留snp的位点
vcftools --vcf all.filtered.vcf --remove-indels --recode --recode-INFO-all --out all.filtered.snp.vcf
##过滤缺失率大于0.05,次等位基因频率(maf)小于0.05的位点
vcftools --vcf all.filtered.snp.vcf --max-missing 0.5 --min-meanDP 30 --maf 0.05 --recode --recode-INFO-all --out all.filtered2.snp
##此步骤选做,加了一个参数kept only biallelic SNPs
vcftools --vcf all.filtered.snp.vcf --max-missing 0.5 --min-meanDP 3 --maf 0.02 --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out all.filtered2.snp
#filter LD
mkdir ./tmp
plink --vcf all.filtered2.snp.recode.vcf --indep-pairwise 50 10 0.2 --out tmp.ld --allow-extra-chr --double-id --set-missing-var-ids @:#
plink --vcf all.filtered2.snp.recode.vcf --make-bed --extract tmp.ld.prune.in --out all.LDfilter --recode vcf-iid --keep-allele-order --double-id --allow-extra-chr --set-missing-var-ids @:#
##过滤后的snp进行计数
grep -v "#" all_keep_filtered5_snp.vcf.recode.vcf|wc -l
以下为补充
##当需要bed文件时
#使用VCFtools转换成plink需要的格式
vcftools --vcf all.filtered2.snp.recode.vcf --plink-tped --out all.filtered2.snp.recodebed
#生成bed文件
plink --tfile all.filtered2.snp.recodebed --make-bed --out all.filtered2.snp.recodebed
##删除缺失率过高的样本 (这一步不需要做,仅供学习参考)
#查看各个样本的缺失率
vcftools --vcf all.filtered2.snp.recode.vcf --missing-indv
#利用awk 输出缺失率》0.2的样本
awk '($5 >0.05){print $0}' out.imiss |cut -f1 >lowDP.indv
#去除高缺失率的样本
vcftools --vcf all.filtered2.snp.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out all.filtered3
- 进化树构建分析
软件安装
conda install -c bioconda raxml-ng
https://github.com/edgardomortiz/vcf2phylip下载vcf2phylip.py脚本
##基于vcf2phylip.py脚本将VCF文件转换成phylip格式
python vcf2phylip.py -i all.LDfilter.vcf
##使用raxml-ng建ML树
#选择最佳模型
下载软件 wget https://github.com/ddarriba/modeltest/files/3790700/modeltest-ng-0.1.6-static-linux64.tar.gz
nohup modeltest-ng-static -i all.LDfilter.tmp6.phy -d nt -p 6 2>&1 &
#建树
nohup raxml-ng -all -msa all.LDfilter.min4.phy --model GTR+I+G4 --bs-trees 100 --prefix out_treeBS --threads 2 --seed 123 1>raxml_bs.log 2>raxml_bs.err 2>&1 &
##使用iqtree建ML树
conda install -c bioconda iqtree
#对于snp数据
nohup iqtree -s all.LDfilter.min4.phy -m MFP+ASC -bb 1000 -bnni -nt AUTO -cmax 15 2>&1 &
- structure分析
软件安装
conda install -c bioconda admixture
#生成bed格式文件
plink --vcf all.LDfilter.vcf --make-bed --out all --allow-extra-chr --keep-allele-order --double-id --set-missing-var-ids @:#
#生成9个计算k值的脚本
seq 2 9 | awk '{print "admixture --cv -j2 all.bed "$1" 1>admix."$1".log 2>&1"}' > admixture.sh
#vi编辑器将所有生成的命令改成如下形式
nohup admixture --cv -j2 all.bed 2 1>admix.2.log 2>&1 &
nohup admixture --cv -j2 all.bed 3 1>admix.3.log 2>&1 &
#修改bim文件以供admixture识别。即将第一列替换成数字1
awk '{print $2, $3, $4, $5, $6}' OFS="\t" all.bim > change.bim
sed -i 's/^/1 /g' change.bim
手动将change.bim修改为all.bim
sh admixture.sh
#R包画图,需安装"argparser"和"pophelper"两个包
#for argparser
install.packages("argparser")
#for pophelper参考网站 http://www.royfrancis.com/pophelperShiny/
首先:install.packages(c("ggplot2","gridExtra","label.switching","tidyr","remotes",
"colourpicker","DT","highcharter","htmlwidgets","magrittr",
"markdown","RColorBrewer","shiny","shinyAce","shinyBS",
"shinythemes","shinyWidgets","viridisLite","writexl"),
repos = "http://cran.us.r-project.org")
第二:remotes::install_github('royfrancis/pophelper')
#基于树文件提取样本顺序,由于样本名称多样化,不一定前面就是“GA”,所以这一步可自己手动添加样本顺序
less file.nwk |sed 's/[(),]\+/\n/g' |grep "GA"|awk -F ":" '{print $1}' > sample_order.txt
##R包画图脚本,使用现成脚本,基于wnidows中的命令终端调用R进行画图
windows+R cmd 打开windows中的命令终端
#将R添加进环境变量,依据R的bin文件夹所在的位置
path=D:\R-4.0.0\bin
#依据相应文件夹,文件的位置画图
Rscript C:\Users\25012\Desktop\script\draw_admixture.R C:\Users\25012\Desktop\tmp7\structure\result C:\Users\25012\Desktop\tmp7\structure\all.nosex structure
- PCA分析
#--pca表示pca个数,pca个数要小于等于样本个数
plink --vcf all.LDfilter.vcf --pca 100 --out PCA_out --allow-extra-chr --double-id --set-missing-var-ids @:#
plink --vcf all.filtered2.snp.recode.vcf --pca 100 --out PCA_out --allow-extra-chr --double-id --set-missing-var-ids @:#
#计算主成分一,主成分二的方差解释度(PCA_out.eigenval特征值,里面是每个PCA的方差)
awk '{sum += $1}END{print sum}' PCA_out.eigenval
#123.343为上一步得到的值
awk '{print $1/123.343}' PCA_out.eigenval
#作图
path=D:\R-4.0.0\bin
Rscript C:\Users\25012\Desktop\script\draw_PCA.R C:\Users\25012\Desktop\tmp7\PCA_out.eigenvec 1 2 C:\Users\25012\Desktop\tmp7\sample.pop PCA_out_figure
- 群体选择分析
##计算PI值
#移除离群样本
vcftools --vcf all.LDfilter.vcf --recode --recode-INFO-all --stdout --remove-indv LTH_1 --remove-indv LTH_4 --remove-indv YH_2 --remove-indv YH_3 --remove-indv MG_1 --remove-indv MG_7 --remove-indv MG_10 > out.vcf
#指定计算某一类群
vcftools --vcf all.LDfilter.vcf --window-pi 1000 --window-pi-step 100 --keep /home/yuxiaolei/snp/tmp7/pi/remove_outlier/MG.txt --out ./MG.txt
#计算全部类群
vcftools --vcf all.LDfilter.vcf --window-pi 1000 --window-pi-step 100 --out ./Pi.pop
##计算种群间的ROD值
pi1=Pi.pop1.windowed.pi
pi2=Pi.pop2.windowed.pi
perl ../../script/merge2pi_ROD.pl $pi1 $pi2 >Pi.ROD
#去掉包含NA的行
grep -v "NA" Pi.ROD > Pi.ROD.table
#绘制ROD图
Rscript ../../script/manhattan_ROD.R Pi.ROD.table Pi.ROD.table.fig
##计算种群间的TajimaD值
#分别计算各亚群的
vcftools --vcf $vcf --TajimaD $window 1000 --keep ../../data/pop.SC.table --out TajimaD.pop1
vcftools --vcf $vcf --TajimaD $window 1000 --keep ../../data/pop.YZR.table --out TajimaD.pop2
or
#一次性计算全部亚群间的?
vcftools --vcf $vcf --TajimaD $window --keep ../../data/pop.SC.table --out TajimaD.pop1
#将多个亚群的TajimaD值合并在一个文件中
paste TajimaD.pop1.Tajima.D TajimaD.pop2.Tajima.D |awk '{print $1"\t"$2"\t"$4"\t"$8}' |sed 's/TajimaD\tTajimaD/pop1\tpop2/' |sed 's/^Chr0//' |grep -v "nan" > TajimaD.pop1.pop2.tab
#计算Fst
vcftools --vcf all.LDfilter.vcf --fst-window-size 1000 --fst-window-step 100 --weir-fst-pop ../../data/pop.SC.table --weir-fst-pop ../../data/pop.YZR.table --out ./Fst.pop1.pop2
vcftools --vcf all.LDfilter.vcf --fst-window-size 1000 --fst-window-step 100 --weir-fst-pop /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YH.txt --weir-fst-pop /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YBJ.txt --out ./Fst.YH.YBJ
vcftools --vcf all.LDfilter.vcf --weir-fst-pop /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YH.txt --weir-fst-pop /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YBJ.txt --out ./Fst2.YH.YBJ
#gcta
- 计算gene的Ed与pi之间的相关性
#首先分别对两个文件进行排序
sort -k 1b,1 CLL.txt> CLL.sorted.txt
sort -k 1b,1 CLL_ED.txt> CLL_ED.sorted.txt
#根据两个表格中第一列的重复值将它们所在的行提取出来,得到每个unigene的Ed和pi值
join -o 1.1 1.2 2.2 CLL_ED.sorted.txt CLL.sorted.txt >CLLrelationship.txt
#去除换行符
cat CLLrelationship.txt | tr -d "\r" > CLLrelationship2.txt
join YHrelationship2.txt GCrelationship2.txt QNDXrelationship2.txt LTHrelationship2.txt SZXrelationship2.txt LBCrelationship2.txt GRCrelationship2.txt MYXrelationship2.txt DDCrelationship2.txt LSDrelationship2.txt WQCrelationship2.txt NYGKrelationship2.txt LDSrelationship2.txt YBJrelationship2.txt LWSrelationship2.txt SWCrelationship2.txt MZrelationship2.txt DXXrelationship2.txt CLLrelationship2.txt >allrelationship.txt
join YHrelationship2.txt GCrelationship2.txt >allrelationship1.txt
join allrelationship1.txt QNDXrelationship2.txt >allrelationship2.txt
join allrelationship2.txt LTHrelationship2.txt >allrelationship3.txt
join allrelationship3.txt SZXrelationship2.txt >allrelationship4.txt
join allrelationship4.txt LBCrelationship2.txt >allrelationship5.txt
join allrelationship5.txt GRCrelationship2.txt >allrelationship6.txt
join allrelationship6.txt SZXrelationship2.txt >allrelationship7.txt
join allrelationship7.txt MYXrelationship2.txt >allrelationship8.txt
join allrelationship8.txt DDCrelationship2.txt >allrelationship9.txt
join allrelationship9.txt LSDrelationship2.txt >allrelationship10.txt
join allrelationship10.txt WQCrelationship2.txt >allrelationship11.txt
join allrelationship11.txt NYGKrelationship2.txt >allrelationship12.txt
join allrelationship12.txt LDSrelationship2.txt >allrelationship13.txt
join allrelationship13.txt YBJrelationship2.txt >allrelationship14.txt
join allrelationship14.txt LWSrelationship2.txt >allrelationship15.txt
join allrelationship15.txt SWCrelationship2.txt >allrelationship16.txt
join allrelationship16.txt MZrelationship2.txt >allrelationship17.txt
join allrelationship17.txt DXXrelationship2.txt >allrelationship18.txt
join allrelationship18.txt CLLrelationship2.txt >allrelationship19.txt
cat allrelationship19.txt | tr -d "\r" > allrelationship20.txt