Skip to content

Basic Usage

Katie Siewert edited this page Oct 24, 2019 · 1 revision

Getting Started

BetaScan is a command line program implemented in python.

Prerequisites

  • Python 2 -Language code is written in
  • Numpy -Python package used for arrays

Parameters

  • -i: Path of input file
  • -w: Total window size (default: 1000, corresponding to 500 bp on either side of the center (i.e. core) SNP)
  • -p: Value of p (default: 2)
  • -m: Minimum folded frequency of core SNP, exclusive, can range from 0 to .5 (default: 0)
  • -fold: Use folded version (default: false)
  • -o: Output file name (default: print to screen)
  • -B2: Calculate the Beta2 statistic.
  • -DivTime: An estimate of the divergence time between your species (only used with the -B2 flag).
  • -std: Instead of just return the Beta value, also return the standardized Beta value. This is equal to the Beta value divided by the square root of the variance. Note that this calculation is slower than calculating just the Beta values, however we do have plans to speed it up.
  • -theta: An estimate of the genome-wide mutation rate (only used with the -std option).
  • -thetaMap: Path to a file containing window-based mutation rate.
  • -onewin: Instead of using a sliding window, calculate Beta using all SNPs in input file

Explanation of parameters

  • w: The sliding window size to use. The expected size of the window in which most of the signal of balancing selection will be found is exponentially distributed with rate rho*T, where rho is the individual recombination rate, and T is an estimate of the age of the selection in generations. We found that somewhere around the 95th percentile on either side of the core site works well. To calculate the total window size in R, you can use the command 2qexp(.95, rhoT). Of course, in practice you don't know T, but a ballpark estimate should do.
  • -m: In theory, Beta has good performance down to very low and high frequencies. However, the chance of seeing an allele at very low or high equilibrium frequency that is under long-term balancing selection is very low. This is because genetic drift is expected to drift the allele out of the population (see Ewens & Thomson 1970 or Carr & Nassar 1970). We see this phenomenon for variants at folded frequency ~15% or less when we simulate overdominance with human parameters. In addition, poor variant calling can cause false variant calls at low allele frequencies.

    For this reason, we give the user the option to exclude all variants from consideration at or below a certain folded frequency. These variants will still be used when calculating the Beta score for other SNPs, but their score will not be calculated by the program. On top of decreasing false positives, this has the added benefit of speeding up run-time. Although by default the minimum frequency is zero, we recommend that the user be cautious with interpreting any signal of balancing selection, whether detected using Beta or another method, if the core SNP is at a very extreme frequency.

  • -fold: The default version of Beta takes into account the frequency of each variant. However, if ancestral state cannot be confidently called, perhaps due to there being no suitable outgroup, the folded version of Beta should be used. The formulation for this statistic can be found in the supplement of our 2017 paper.

  • -B2: This calculates Beta2, which is the new version detailed in our bioRxiv preprint. It has higher power than the Beta1 statistics, but requires substitution data and an estimate of divergence time.

  • -DivTime: An estimate of the divergence time between your two species. This can either be obtained from prior demographic inference of your species, or using the BALLET software. In practice, this value affects the relative ranking of loci very little, but will affect the standardized statistic. To convert from generations (g) to coalescent units (c), the formula is g=cNe2 where Ne is the effective population size.

  • -theta: The estimated mutation rate. It's equal to 2lN_e*u, where u is the locus neutral mutation rate, Ne is the effective population size and l is the ploidy. This can be based on prior estimates from your species, or you can estimate it from your own data. We have future plans to automate this calculation, but in the meantime you can calculate Watterson's theta on your data.

  • -thetaMap: Instead of using a constant mutation-rate across the genome as with the -theta flag, this flag instead allows the user to specify different mutation rates throughout the genome. For more information see the Mutation Rate Map section of the wiki.

  • -onewin: Instead of using a sliding window approach, Beta will be calculated using on SNPs and substitutions in the input file. This can be useful if you have a locus you want to calculate Beta on the entirety of (instead of just in units of -w basepairs). The input file format is the same as the standard format, but the position column can contain any arbitrary numbers, instead of needing to be basepairs. For instance, you could have the position column be in cM or some arbitrary SNP identifier number. This flag can be used with any of the three beta statistics but will only return standardized statistics, as Beta calculated on arbitrary window sizes are not comparable with each other and their significance cannot be assessed. Currently, the background mutation rate in the region is automatically calculated using Watterson's estimator on all SNPs in the region.

Sample Commands

To run BetaScan on our file SNPFreqs.txt with default parameters:

python BetaScan.py -i SNPFreqs.txt

To run with a 2000 base pair window, a p parameter value of 50 and using the folded version of Beta and with a specified output file path:

python BetaScan.py -i SNPFreqs.txt -w 2000 -p 50 -fold -o /path/chr1.beta.out

To run with a 5000 base pair window, a p parameter value of 20 and excluding all core SNPs that are of frequency 10% or less, or of frequency 90% or greater:

python BetaScan.py -i SNPFreqs.txt -w 5000 -p 20 -m .1

To output a gzipped file:

python BetaScan.py -i SNPFreqs.txt | gzip > chr1.beta.out.gz

To input a gzipped file:

python BetaScan.py -i < (zcat chr1.beta.gz )

To calculate standardized Beta2 with an estimated divergence time of 12.5 and an estimated mutation rate of 0.001:

python BetaScan.py -B2 -DivTime 12.5 -std -Theta 0.001 -i SNPFreqsWithSubs.txt
Clone this wiki locally