种群遗传变异数据处理流程

gatk过滤snp （过滤前7779362，过滤后6310310）

conda install -c bioconda gatk4

mkdir ./tmp

##给未通过过滤阈值的snp生成一个标签“SNP_filter”.
gatk  --java-options "-Xmx4g -Djava.io.tmpdir=./tmp"  VariantFiltration -V pop.variant.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name 'SNP_filter' -O pop.variant.filter.vcf

##根据标签“SNP_filter”删除不满足要求的snp
gatk  --java-options "-Xmx4g -Djava.io.tmpdir=./tmp"  SelectVariants -V pop.variant.vcf --exclude-filtered  -O all.filtered.snp.vcf
注：-R ../01.ref/genome.fasta在有参考基因组的情况下需要这个命令，没有参考基因组可以略去

gatk  --java-options "-Xmx4g -Djava.io.tmpdir=./tmp"  SelectVariants -V all.merge_raw.vcf --select-type SNP -O all.raw.snp.vcf

gatk  --java-options "-Xmx4g -Djava.io.tmpdir=./tmp"  VariantFiltration -R ../01.ref/genome.fasta -V all.raw.snp.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name 'SNP_filter' -O all.filter.snp.vcf

gatk  --java-options "-Xmx4g -Djava.io.tmpdir=./tmp"  SelectVariants  -R ../01.ref/genome.fasta -V all.filter.snp.vcf --exclude-filtered  -O all.filtered.snp.vcf

##给未通过过滤阈值的snp生成一个标签“SNP_filter”.
gatk  --java-options "-Xmx4g -Djava.io.tmpdir=./tmp"  VariantFiltration -V pop.variant.vcf --filter-expression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name 'SNP_filter' -O pop.variant.filter.vcf

##根据标签“SNP_filter”删除不满足要求的snp
gatk  --java-options "-Xmx4g -Djava.io.tmpdir=./tmp"  SelectVariants -V pop.variant.filter.vcf --exclude-filtered  -O all.filtered.snp.vcf
注：-R ../01.ref/genome.fasta在有参考基因组的情况下需要这个命令，没有参考基因组可以略去

plink进一步过滤snp (仅适用于全基因组重测序call的snp，不适用于转录组)

conda install -c bioconda plink

## filter missing and maf（）
plink --vcf all.filtered2.snp.vcf --geno 0.2 --maf 0.05 --out all.missing_maf --recode vcf-iid  --allow-extra-chr --set-missing-var-ids @:# --keep-allele-order
参数解释：--geno表示missing data rate，maf表示次等位基因的频率，目的是为了保留多态性较高的位点。--allow-extra-chr允许不同格式的染色体id。--set-missing-var-ids自动补齐缺失的snp。id--keep-allele-order保持原始vcf文件的顺序。

## filter LD （pass连锁的snp，做structure必须用到，做进化树或者PCA可用可不用。第一步，生成符合条件snp位点的id。注：这一步不要重复多次，因为每重复一次，位点都会减少。第二步，提取snp位点）
plink --vcf  all.missing_maf.vcf  --indep-pairwise 50 10 0.2 --out tmp.ld   --allow-extra-chr --set-missing-var-ids @:# 

plink --vcf  all.missing_maf.vcf  --make-bed --extract tmp.ld.prune.in  --out all.LDfilter --recode vcf-iid  --keep-allele-order  --allow-extra-chr --set-missing-var-ids @:#

vcftools过滤snp（适用于转录组，过滤前6310310，过滤后72870个snp）

conda install -c bioconda vcftools

##删除indels，只保留snp的位点
vcftools --vcf all.filtered.vcf --remove-indels --recode --recode-INFO-all --out all.filtered.snp.vcf

##过滤缺失率大于0.05，次等位基因频率（maf)小于0.05的位点
vcftools --vcf all.filtered.snp.vcf --max-missing 0.5 --min-meanDP 30 --maf 0.05 --recode --recode-INFO-all --out all.filtered2.snp

##此步骤选做，加了一个参数kept only biallelic SNPs
vcftools --vcf all.filtered.snp.vcf --max-missing 0.5 --min-meanDP 3 --maf 0.02 --min-alleles 2 --max-alleles 2  --recode --recode-INFO-all --out all.filtered2.snp

#filter LD
mkdir ./tmp
plink --vcf  all.filtered2.snp.recode.vcf --indep-pairwise 50 10 0.2 --out tmp.ld --allow-extra-chr --double-id --set-missing-var-ids @:#

plink --vcf  all.filtered2.snp.recode.vcf --make-bed --extract tmp.ld.prune.in  --out all.LDfilter --recode vcf-iid  --keep-allele-order --double-id --allow-extra-chr --set-missing-var-ids @:# 

##过滤后的snp进行计数
grep -v "#" all_keep_filtered5_snp.vcf.recode.vcf|wc -l

以下为补充
##当需要bed文件时
#使用VCFtools转换成plink需要的格式
vcftools --vcf all.filtered2.snp.recode.vcf --plink-tped --out all.filtered2.snp.recodebed
#生成bed文件
plink --tfile all.filtered2.snp.recodebed --make-bed --out all.filtered2.snp.recodebed

##删除缺失率过高的样本 (这一步不需要做，仅供学习参考)
#查看各个样本的缺失率
vcftools --vcf all.filtered2.snp.recode.vcf --missing-indv

#利用awk 输出缺失率》0.2的样本
awk '($5 >0.05){print $0}' out.imiss |cut -f1 >lowDP.indv

#去除高缺失率的样本
vcftools --vcf all.filtered2.snp.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out all.filtered3

进化树构建分析

软件安装
conda install -c bioconda raxml-ng

https://github.com/edgardomortiz/vcf2phylip下载vcf2phylip.py脚本

##基于vcf2phylip.py脚本将VCF文件转换成phylip格式
python vcf2phylip.py -i all.LDfilter.vcf

##使用raxml-ng建ML树
#选择最佳模型
下载软件 wget https://github.com/ddarriba/modeltest/files/3790700/modeltest-ng-0.1.6-static-linux64.tar.gz

nohup modeltest-ng-static -i all.LDfilter.tmp6.phy -d nt -p 6 2>&1 &

#建树
nohup raxml-ng -all -msa all.LDfilter.min4.phy --model GTR+I+G4 --bs-trees 100 --prefix out_treeBS --threads 2 --seed 123 1>raxml_bs.log 2>raxml_bs.err 2>&1 &

##使用iqtree建ML树
conda install -c bioconda iqtree
#对于snp数据
nohup iqtree -s all.LDfilter.min4.phy -m MFP+ASC -bb 1000 -bnni -nt AUTO -cmax 15 2>&1 &

structure分析

软件安装
conda install -c bioconda admixture

#生成bed格式文件
plink --vcf all.LDfilter.vcf  --make-bed --out all  --allow-extra-chr --keep-allele-order --double-id --set-missing-var-ids @:#

#生成9个计算k值的脚本
seq 2 9 | awk '{print "admixture --cv -j2 all.bed "$1" 1>admix."$1".log 2>&1"}' > admixture.sh

#vi编辑器将所有生成的命令改成如下形式
nohup admixture --cv -j2 all.bed 2 1>admix.2.log 2>&1 &
nohup admixture --cv -j2 all.bed 3 1>admix.3.log 2>&1 &

#修改bim文件以供admixture识别。即将第一列替换成数字1
awk '{print $2, $3, $4, $5, $6}' OFS="\t" all.bim > change.bim
sed -i 's/^/1 /g' change.bim
手动将change.bim修改为all.bim
sh admixture.sh

#R包画图，需安装"argparser"和"pophelper"两个包
#for argparser
install.packages("argparser")

#for pophelper参考网站 http://www.royfrancis.com/pophelperShiny/
首先：install.packages(c("ggplot2","gridExtra","label.switching","tidyr","remotes",
                   "colourpicker","DT","highcharter","htmlwidgets","magrittr",
                   "markdown","RColorBrewer","shiny","shinyAce","shinyBS",
                   "shinythemes","shinyWidgets","viridisLite","writexl"),
                   repos = "http://cran.us.r-project.org")
第二：remotes::install_github('royfrancis/pophelper')

#基于树文件提取样本顺序,由于样本名称多样化，不一定前面就是“GA”，所以这一步可自己手动添加样本顺序
less file.nwk |sed 's/[(),]\+/\n/g' |grep "GA"|awk -F ":" '{print $1}' > sample_order.txt

##R包画图脚本，使用现成脚本，基于wnidows中的命令终端调用R进行画图
windows+R   cmd 打开windows中的命令终端
#将R添加进环境变量，依据R的bin文件夹所在的位置
path=D:\R-4.0.0\bin
#依据相应文件夹，文件的位置画图
Rscript C:\Users\25012\Desktop\script\draw_admixture.R C:\Users\25012\Desktop\tmp7\structure\result C:\Users\25012\Desktop\tmp7\structure\all.nosex structure

PCA分析

#--pca表示pca个数，pca个数要小于等于样本个数
plink --vcf all.LDfilter.vcf --pca 100 --out  PCA_out --allow-extra-chr --double-id --set-missing-var-ids @:#

plink --vcf all.filtered2.snp.recode.vcf --pca 100 --out  PCA_out --allow-extra-chr --double-id --set-missing-var-ids @:#


#计算主成分一，主成分二的方差解释度(PCA_out.eigenval特征值，里面是每个PCA的方差)
awk '{sum += $1}END{print sum}' PCA_out.eigenval
#123.343为上一步得到的值
awk '{print $1/123.343}' PCA_out.eigenval

#作图

path=D:\R-4.0.0\bin

Rscript C:\Users\25012\Desktop\script\draw_PCA.R C:\Users\25012\Desktop\tmp7\PCA_out.eigenvec 1 2 C:\Users\25012\Desktop\tmp7\sample.pop PCA_out_figure

群体选择分析

##计算PI值
#移除离群样本
vcftools --vcf all.LDfilter.vcf --recode --recode-INFO-all --stdout  --remove-indv  LTH_1  --remove-indv  LTH_4 --remove-indv  YH_2 --remove-indv  YH_3 --remove-indv  MG_1 --remove-indv  MG_7 --remove-indv  MG_10 > out.vcf
#指定计算某一类群
vcftools --vcf all.LDfilter.vcf --window-pi 1000 --window-pi-step 100 --keep /home/yuxiaolei/snp/tmp7/pi/remove_outlier/MG.txt --out ./MG.txt
#计算全部类群
vcftools --vcf all.LDfilter.vcf --window-pi 1000 --window-pi-step 100 --out ./Pi.pop

##计算种群间的ROD值
pi1=Pi.pop1.windowed.pi
pi2=Pi.pop2.windowed.pi
perl ../../script/merge2pi_ROD.pl $pi1 $pi2 >Pi.ROD

#去掉包含NA的行
grep -v "NA" Pi.ROD > Pi.ROD.table
#绘制ROD图
Rscript ../../script/manhattan_ROD.R   Pi.ROD.table  Pi.ROD.table.fig

##计算种群间的TajimaD值
#分别计算各亚群的
vcftools --vcf $vcf --TajimaD $window 1000 --keep ../../data/pop.SC.table --out TajimaD.pop1
vcftools --vcf $vcf --TajimaD $window 1000 --keep ../../data/pop.YZR.table --out TajimaD.pop2
or
#一次性计算全部亚群间的？
vcftools --vcf $vcf --TajimaD  $window  --keep  ../../data/pop.SC.table  --out TajimaD.pop1

#将多个亚群的TajimaD值合并在一个文件中
paste TajimaD.pop1.Tajima.D TajimaD.pop2.Tajima.D |awk '{print $1"\t"$2"\t"$4"\t"$8}' |sed 's/TajimaD\tTajimaD/pop1\tpop2/' |sed 's/^Chr0//' |grep -v "nan" > TajimaD.pop1.pop2.tab

#计算Fst
vcftools  --vcf all.LDfilter.vcf --fst-window-size 1000 --fst-window-step 100 --weir-fst-pop  ../../data/pop.SC.table --weir-fst-pop ../../data/pop.YZR.table --out  ./Fst.pop1.pop2

vcftools  --vcf all.LDfilter.vcf --fst-window-size 1000 --fst-window-step 100 --weir-fst-pop  /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YH.txt --weir-fst-pop /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YBJ.txt --out ./Fst.YH.YBJ

vcftools  --vcf all.LDfilter.vcf --weir-fst-pop  /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YH.txt --weir-fst-pop /home/yuxiaolei/snp/tmp6/pi/remove_outlier/YBJ.txt --out ./Fst2.YH.YBJ

#gcta

计算gene的Ed与pi之间的相关性

#首先分别对两个文件进行排序
sort -k 1b,1 CLL.txt> CLL.sorted.txt
sort -k 1b,1 CLL_ED.txt> CLL_ED.sorted.txt
#根据两个表格中第一列的重复值将它们所在的行提取出来，得到每个unigene的Ed和pi值
join -o 1.1 1.2 2.2 CLL_ED.sorted.txt CLL.sorted.txt >CLLrelationship.txt
#去除换行符
cat CLLrelationship.txt | tr -d "\r" > CLLrelationship2.txt

join YHrelationship2.txt GCrelationship2.txt QNDXrelationship2.txt LTHrelationship2.txt SZXrelationship2.txt LBCrelationship2.txt GRCrelationship2.txt MYXrelationship2.txt DDCrelationship2.txt LSDrelationship2.txt WQCrelationship2.txt NYGKrelationship2.txt LDSrelationship2.txt YBJrelationship2.txt LWSrelationship2.txt SWCrelationship2.txt MZrelationship2.txt DXXrelationship2.txt CLLrelationship2.txt >allrelationship.txt

join YHrelationship2.txt GCrelationship2.txt >allrelationship1.txt

join allrelationship1.txt QNDXrelationship2.txt >allrelationship2.txt

join allrelationship2.txt LTHrelationship2.txt >allrelationship3.txt

join allrelationship3.txt SZXrelationship2.txt >allrelationship4.txt

join allrelationship4.txt LBCrelationship2.txt >allrelationship5.txt

join allrelationship5.txt GRCrelationship2.txt >allrelationship6.txt

join allrelationship6.txt SZXrelationship2.txt >allrelationship7.txt

join allrelationship7.txt MYXrelationship2.txt >allrelationship8.txt

join allrelationship8.txt DDCrelationship2.txt >allrelationship9.txt

join allrelationship9.txt LSDrelationship2.txt >allrelationship10.txt

join allrelationship10.txt WQCrelationship2.txt >allrelationship11.txt

join allrelationship11.txt NYGKrelationship2.txt >allrelationship12.txt

join allrelationship12.txt LDSrelationship2.txt >allrelationship13.txt

join allrelationship13.txt YBJrelationship2.txt >allrelationship14.txt

join allrelationship14.txt LWSrelationship2.txt >allrelationship15.txt

join allrelationship15.txt SWCrelationship2.txt >allrelationship16.txt

join allrelationship16.txt MZrelationship2.txt >allrelationship17.txt

join allrelationship17.txt DXXrelationship2.txt >allrelationship18.txt

join allrelationship18.txt CLLrelationship2.txt >allrelationship19.txt

cat allrelationship19.txt | tr -d "\r" > allrelationship20.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

种群遗传变异数据处理流程.md

种群遗传变异数据处理流程.md

Files

种群遗传变异数据处理流程.md

Latest commit

History

种群遗传变异数据处理流程.md

File metadata and controls