Katie Siewert edited this page Nov 14, 2019 · 4 revisions

BetaScan outputs a two- or three-column tab-delimited file. The first column contains the coordinate of the core variant, and the second column contains its Beta score. If the -std flag is used, a third column contains the value of the standardized statistic. Only core SNPs above your minimum folded frequency threshold, specified by -m, will appear in this file.

Caution: if variant calls within the specified window around the core variant are not confident, the value of Beta may be incorrect, because the number of mutations in the window can be artificially elevated or reduced. For this reason, we encourage you to use quality filters.

Interpreting the output

After you get your Beta scores, there are a few tasks you may want to do.

  1. Intersect with some measure of quality. The Beta statistics (like any statistic designed to detect selection) are sensitive to poor variant calling. If a quality filter mask is available, as there is for 1000 Genomes, you can intersect it with your calls using the bedtools intersect function. To convert the BetaScan output to BED format, you can use the command below. Replace Z with the chromosome Beta was calculated on.
awk -v OFS='\t' 'NR>1 {print "chrZ",$1,$1+1,$2}' Betaoutput.txt > Betaoutput.bed
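Once the BED file exists, the intersection itself might look like the sketch below. The function name and all file names are placeholders, and it assumes the mask lists the regions that pass the quality filter, so overlapping records are the ones to keep.

```shell
# Hypothetical helper: keep only Beta scores that overlap a quality mask.
# $1 = BetaScan output converted to BED
# $2 = mask in BED format (assumed to list regions that PASS the filter)
# $3 = output file
filter_by_mask() {
    bedtools intersect -a "$1" -b "$2" > "$3"
}

# Example call (file names are placeholders):
# filter_by_mask Betaoutput.bed mask.bed Betaoutput.filtered.bed
```

If your mask instead lists regions to exclude, adding -v to bedtools intersect reports only the records in -a that have no overlap with -b.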
  1. Look at the top scores. To see the top scores, you can use a simple sort command in your terminal. High, positive values of Beta are evidence of balancing selection. Negative values of Beta have no particular meaning and should not be interpreted as evidence of selection. By chance, you would expect a fairly even distribution of positive and negative scores. If you see only positive or only negative scores, this could be a sign that something is going wrong.
sort -g -r -k 2,2 Betaoutput.txt | head
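To check the balance of positive and negative scores described above, a quick count is one option. This sketch assumes, as the awk command for BED conversion does, that the first line of the output is a header and that the Beta score is in column 2. The function name is only illustrative.

```shell
# Hypothetical helper: count positive, negative, and zero Beta scores.
# Assumes one header line and the Beta score in column 2.
count_beta_signs() {
    awk 'NR > 1 {
             if ($2 > 0) pos++
             else if ($2 < 0) neg++
             else zero++
         }
         END { printf "positive=%d negative=%d zero=%d\n", pos, neg, zero }' "$1"
}

# Example call:
# count_beta_signs Betaoutput.txt
```

A heavily one-sided count does not by itself show what went wrong, but it is a cheap signal that the input or parameters deserve a second look.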
  1. Visualize your results. We don't recommend a QQ-plot for most users. The distribution of Beta under the null hypothesis of no balancing selection is unknown (and is definitely not normal or uniform, even under a basic demographic model), so a QQ-plot can be misleading. A Manhattan-type plot can be a good way to visualize where the top scores fall in the genome. Note that in this case the y-axis would be the Beta score, not a p-value. If desired, you could also calculate an empirical p-value based on simulation results and use that for the y-axis.
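If you do compute empirical p-values, one common approach is to take the fraction of simulated neutral scores that are at least as large as the observed score. The sketch below assumes a hypothetical file of null scores (one Beta value per line, no header) that you have generated by running BetaScan on your own neutral simulations. The function name, file names, and the +1 pseudocount are choices made for this example, not part of BetaScan.

```shell
# Hypothetical helper: empirical p-value for each core SNP.
# $1 = simulated null Beta scores, one per line, no header (assumed)
# $2 = BetaScan output with one header line and Beta in column 2 (assumed)
# p = (1 + number of null scores >= observed) / (1 + number of null scores)
empirical_pvalues() {
    awk 'NR == FNR { sim[++n] = $1; next }   # first file: load null scores
         FNR == 1  { next }                  # second file: skip the header
         {
             ge = 0
             for (i = 1; i <= n; i++)
                 if (sim[i] + 0 >= $2 + 0) ge++
             # print coordinate, Beta score, empirical p-value
             printf "%s\t%s\t%.6g\n", $1, $2, (ge + 1) / (n + 1)
         }' "$1" "$2"
}

# Example call (file names are placeholders):
# empirical_pvalues null_scores.txt Betaoutput.txt > Beta_pvalues.txt
```

This compares every SNP against every null score, which is fine for a quick look but slow for genome-wide data; sorting the null scores and using a rank lookup would scale better.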