Katie Siewert edited this page Apr 14, 2021 · 5 revisions

How many samples are needed to run Beta?

We have found that in humans a relatively low number of samples, around 5, is sufficient to detect a fairly large proportion of sites. Maximum power is obtained with a sample size of around 20 (so 10 diploid individuals).

How do I choose a value of p?

Results are fairly robust to the choice of p. However, we have found that a value of 2 performs well under a wide array of parameters, so we recommend using 2 (the default as of 12/11/18) unless you have a reason not to.

Should I use Beta1, Beta1* or Beta2?

If you only have a folded site frequency spectrum (i.e. you don't know which alleles are ancestral and which are derived), you need to use Beta1*.

If you have an unfolded site frequency spectrum, but don't have substitution information with an outgroup species, use Beta1.

If you have substitution information and an unfolded site frequency spectrum, use Beta2.
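The three rules above can be written as a small decision function. This is only a sketch: the function name and its two boolean parameters are hypothetical and are not part of BetaScan. It also assumes that substitution information is ignored when only a folded spectrum is available, which is one reading of the first rule.

```python
def choose_beta_statistic(has_ancestral_states: bool,
                          has_outgroup_substitutions: bool) -> str:
    """Return the statistic name suggested by the rules above."""
    if not has_ancestral_states:
        # Folded site frequency spectrum only (assumption: any substitution
        # information is ignored in this case).
        return "Beta1*"
    if has_outgroup_substitutions:
        # Unfolded spectrum plus substitution information from an outgroup.
        return "Beta2"
    # Unfolded spectrum without substitution information.
    return "Beta1"


print(choose_beta_statistic(has_ancestral_states=True,
                            has_outgroup_substitutions=False))
```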

I have frequency information I calculated using the --freq option in vcftools (and the --derived flag if I'm using the unfolded BetaScan). How do I convert the vcftools output format to the BetaScan format?

The toolkit glactools is able to convert between vcfs and BetaScan format and is probably the most robust way to do this.

Alternatively, you can use the following command in unix:

awk -F "\t|:" '(NR>1) && ($8!=0) && ($8!=1) && ($3==2) {OFS="\t"; print $2,$8*$4,$4}' yourfile.frq

This command reformats the .frq file and filters out positions that have more than 2 possible alleles, or that are at a frequency of 0% or 100%. Make sure that you use the fold option if you haven't called ancestral/derived alleles. If you have called them, then this awk script assumes that the derived allele is the second allele listed in the .frq file output by vcftools, as it is when you use the --derived flag. Also, please double check that this command outputs the right thing from your .frq file! There could always be variations in the .frq format I don't know about.
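If you would rather do this step in Python, the same filtering can be sketched as below. This is not part of BetaScan. It assumes the same column layout the awk command relies on (CHROM, POS, N_ALLELES, N_CHR, then ALLELE:FREQ pairs), and the function name and example lines are made up for illustration. Rounding the allele count to the nearest integer is also a choice made for this sketch.

```python
import re
from typing import Iterable, Iterator, Tuple


def frq_to_betascan(lines: Iterable[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (position, allele_count, sample_size) for biallelic polymorphic sites."""
    for line_number, line in enumerate(lines):
        if line_number == 0:
            continue  # skip the header line, like NR>1 in the awk command
        # Split on tabs and colons, like -F "\t|:" in the awk command.
        fields = re.split(r"[\t:]", line.rstrip("\n"))
        if len(fields) < 8:
            continue  # blank or truncated line
        # Assumed layout: CHROM, POS, N_ALLELES, N_CHR, allele1, freq1, allele2, freq2
        n_alleles = int(fields[2])
        n_chr = int(fields[3])
        freq = float(fields[7])  # frequency of the second allele listed
        if n_alleles != 2 or not (0.0 < freq < 1.0):
            continue  # drop multiallelic sites and sites at frequency 0 or 1
        yield int(fields[1]), round(freq * n_chr), n_chr


# Hypothetical .frq content, for illustration only.
example = [
    "CHROM\tPOS\tN_ALLELES\tN_CHR\t{ALLELE:FREQ}",
    "1\t100\t2\t20\tA:0.75\tG:0.25",
    "1\t200\t2\t20\tC:1\tT:0",
    "1\t300\t3\t20\tA:0.5\tC:0.25\tG:0.25",
]
for position, count, size in frq_to_betascan(example):
    print(position, count, size, sep="\t")
```

As with the awk command, check the output against a few sites from your own .frq file before trusting it.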

I ran some simulations using the simulation software SLiM, and want to convert them into BetaScan format. Is there an easy way to do this?

Once again, awk can come to the aid:

awk '{OFS="\t"}{if ($1=="Genomes:") exit }(($2=="m3") || ($2=="m1")) && ($8!="100") {print $3,$8,"100"}' SLiMFile.out

The first thing to note is that SLiM has more than one output file format, and this awk command only works with the SLiM format, not the ms format. Note that in this example there are two mutation types simulated, m1 and m3, and both are output. You should modify this so that it matches your simulation details. This command also assumes a sample size of 100. If this is not your sample size, you should replace "100" with your sample size in quotes.
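A Python version of the same conversion makes the sample size and mutation types explicit parameters. This is a sketch only: it assumes the column positions the awk command uses (mutation type in column 2, position in column 3, prevalence in column 8), and the function name and the example lines are hypothetical, not real SLiM output.

```python
from typing import Iterable, Iterator, Set, Tuple


def slim_to_betascan(lines: Iterable[str], sample_size: int,
                     mutation_types: Set[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (position, prevalence, sample_size) for the chosen mutation types."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Genomes:":
            break  # stop before the per-genome section, like `exit` in awk
        if len(fields) >= 8 and fields[1] in mutation_types:
            prevalence = int(fields[7])
            if prevalence != sample_size:  # skip mutations fixed in the sample
                yield int(fields[2]), prevalence, sample_size


# Made-up lines laid out the way the awk command expects.
example = [
    "#OUT: 1000 R p1 100",
    "Mutations:",
    "0 m1 5021 0 0.5 p1 212 37",
    "1 m3 7340 0.01 0.5 p1 15 100",
    "2 m2 8112 0 0.5 p1 640 12",
    "Genomes:",
    "p1:0 0 2",
]
for row in slim_to_betascan(example, sample_size=100, mutation_types={"m1", "m3"}):
    print(*row, sep="\t")
```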

What are some common mistakes when using BetaScan?

One common mistake is applying BetaScan (or other tests for balancing selection) to heavily structured populations. If the TMRCA of these subpopulations is long enough, BetaScan may not be able to distinguish SNPs that have a long TMRCA due to balancing selection from ones that have a long TMRCA due to population substructure. If you apply BetaScan to such a sample, it's important to look at each SNP and make sure it is shared between the different subpopulations.
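One way to automate that check is sketched below. It treats a SNP as "shared" when both alleles are present in every subpopulation; that definition is an assumption made for this sketch, not a criterion taken from BetaScan. The function name, the input layout, and the subpopulation names are hypothetical.

```python
from typing import Dict, Tuple


def is_shared_across_subpops(counts: Dict[str, Tuple[int, int]]) -> bool:
    """counts maps subpopulation name -> (allele_count, sample_size).

    Returns True only if the SNP is polymorphic in every subpopulation.
    """
    return bool(counts) and all(
        0 < allele_count < size for allele_count, size in counts.values()
    )


# Segregating in both subpopulations: consistent with a shared polymorphism.
print(is_shared_across_subpops({"north": (4, 20), "south": (11, 20)}))

# Fixed for different alleles in each subpopulation: likely substructure.
print(is_shared_across_subpops({"north": (20, 20), "south": (0, 20)}))
```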

It's also common to get poor-quality results because of a poor genome assembly or poor variant calling. It's important to apply tests for selection only to regions where you are confident in the variant calls.

In addition, BetaScan has only been tested on diploid organisms. Hypothetically, it may be fine to apply to species of higher (but not lower) ploidy, but do so cautiously.

Finally, we always recommend doing simulations using appropriate demographic parameters for your species to verify that BetaScan (or any other scan for selection) is appropriate for use. We know this is a pain, but it's the responsible thing to do. Recombination rates, mutation rates, population substructure, ploidy, etc. can all determine whether a test for selection is suitable. For example, in species where the recombination rate is much higher than the mutation rate, haplotypes may break down faster than they can accumulate signals of balancing selection.