A program for the efficient computation of a number of population genetics summary statistics. msums can read ms-format data on (nearly) arbitrary numbers of populations.
The code in this project was originally written for internal use, with ease of use by others only a secondary consideration. As a result it is poorly documented and, in places, awkward and complicated because of optimization requirements.
Furthermore, while we made every effort to ensure the code is correct, please use it at your own risk. It is prudent to have a look at the (not terribly legible, sorry) code first to make sure it does what you think it does.
You need to have Git installed, then:
git clone https://github.com/mhinsch/msums
Building msums requires the boost and boost-dev libraries. On Manjaro/Arch Linux, type:
sudo pacman -S boost boost-libs
touch Makefile.dep
make
You can safely ignore warnings like:
stats_multi.h:365:8: warning: unused parameter ‘stop2’ [-Wunused-parameter]
double patterson_f4
^
stats_multi.h:365:8: warning: unused parameter ‘stop3’ [-Wunused-parameter]
stats_multi.h:365:8: warning: unused parameter ‘stop4’ [-Wunused-parameter]
...
stats_multi.h:343:8: warning: unused parameter ‘stop2’ [-Wunused-parameter]
double patterson_f3
^
stats_multi.h:343:8: warning: unused parameter ‘stop3’ [-Wunused-parameter]
...
stats_multi.h:325:8: warning: unused parameter ‘stop2’ [-Wunused-parameter]
double patterson_D(
^
In the following text:
- The suffixes 'mean' and 'std' stand for average and standard deviation, respectively.
- Populations are referred to by indices i and j, which can range from 0 to n, where n is the number of sampled populations.
- As an example, 'FST_ixj_mean' refers to the mean Fst across loci between populations i and j.
- The coding of ancestral/derived alleles follows the ms convention -> 0: ancestral, 1: derived.
- Sampled alleles at a given SNP are represented by strings of 0s and 1s in which populations are separated by a slash. E.g., the string 0100/110101 represents the alleles of a SNP for which we have sampled 4 and 6 chromosomes from populations i and j, respectively.
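As a rough illustration of this string convention, the following Python sketch splits a site string into one allele list per population and counts the derived alleles in each. The function names `parse_site` and `derived_counts` are made up for this example; they are not part of msums.

```python
def parse_site(site):
    """Split a site string such as '0100/110101' into one list of
    0/1 alleles per population (0: ancestral, 1: derived)."""
    pops = []
    for block in site.split("/"):
        # Reject empty blocks and anything other than the characters 0 and 1.
        if not block or set(block) - {"0", "1"}:
            raise ValueError(f"invalid allele block: {block!r}")
        pops.append([int(c) for c in block])
    return pops


def derived_counts(site):
    """Return (number of derived alleles, sample size) for each population."""
    return [(sum(pop), len(pop)) for pop in parse_site(site)]


if __name__ == "__main__":
    # 4 chromosomes sampled in the first population, 6 in the second.
    print(derived_counts("0100/110101"))  # [(1, 4), (4, 6)]
```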
- pairdif: sum of pairwise allele differences
- segr: number of segregating sites (i.e. SNPs) per locus
- singlet: overall number of singleton alleles (across all sites)
- thpi: Tajima's Theta, i.e. nucleotide diversity
- thW: Watterson's Theta
- flDstar: Fu & Li’s D*
- flFstar: Fu & Li’s F*
- tD: Tajima's D [(pi - theta_W) / sqrt(Var(pi - theta_W))]
- R2: Ramos-Onsins & Rozas's R2 test (Ramos-Onsins & Rozas, Mol. Biol. Evol. 2002)
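To make some of the single-population definitions above concrete, here is a minimal Python sketch of segr, thpi, thW and tD computed from a list of 0/1 haplotype strings. It follows the textbook formulas (Tajima 1989 for the constants of D) and is only an illustration: the function names are hypothetical, and msums itself may normalize differently (for instance per site rather than per locus).

```python
import math
from itertools import combinations


def segregating_sites(haplotypes):
    """Number of sites at which both alleles occur in the sample (segr)."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def theta_pi(haplotypes):
    """Mean number of pairwise differences between sampled chromosomes."""
    pairs = list(combinations(haplotypes, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / len(pairs)


def theta_w(haplotypes):
    """Watterson's estimator: S divided by the harmonic number a1."""
    n = len(haplotypes)
    a1 = sum(1 / i for i in range(1, n))
    return segregating_sites(haplotypes) / a1


def tajima_d(haplotypes):
    """Tajima's D: (pi - theta_W) over the estimated standard deviation."""
    n = len(haplotypes)
    s = segregating_sites(haplotypes)
    if s == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (theta_pi(haplotypes) - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    sample = ["0010", "0110", "1100", "0100"]  # 4 chromosomes, 4 sites
    print(segregating_sites(sample), theta_pi(sample), theta_w(sample))
    print(tajima_d(sample))
```

The sketch needs at least two chromosomes per sample, since both the pairwise comparisons and the harmonic number a1 are empty otherwise.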
- dixj_: Raw nucleotide divergence between species i and j (Nei's Dxy, eq. 12.66, Nei and Kumar 2000).
- dnixj_: Net nucleotide divergence between species i and j (Nei's DA, eq. 12.67, Nei and Kumar 2000).
- FSTixj_: 1 - ((pi_i + pi_j)/2)/pi_total; also called FST sensu Hudson or Nst.
- bialsitesixj_: mean number of bi-allelic sites per population (infinite site model). I.e. (number of bi-allelic sites in popi + number of bi-allelic sites in popj)/2
- multisitesixj_: mean number of multi-allelic (more than 2 alleles) sites per population (infinite site model). I.e. (number of multi-allelic sites in popi + number of multi-allelic sites in popj)/2 [to be checked]
- sfAixj_: number of sites fixed for the derived allele in species A and fixed for the ancestral allele in species B (1111/0000), where A=i and B=j.
- sfBixj_: number of sites fixed for the ancestral allele in species A and fixed for the derived allele in species B (0000/1111), where A=i and B=j.
- sfoutixj_: ? [to be checked]
- sxAixj_: number of sites that are polymorphic in species A and fixed for the ancestral allele in species B (0101/0000), where A=i and B=j.
- sxBixj_: number of sites that are fixed for the ancestral allele in species A and polymorphic in species B (0000/0101), where A=i and B=j.
- sxAfBixj_: number of sites polymorphic in species A and fixed for the derived allele in species B (0101/1111), where A=i and B=j.
- sxBfAixj_: number of sites fixed for the derived allele in species A and polymorphic in species B (1111/0010), where A=i and B=j.
- ssixj_: number of sites with shared derived alleles between popi and popj (1010/1110). Mean over populations. [to be checked: this statistic should be symmetrical between populations, so the mean should equal the value in each population, unless we divide by a total number of populations other than 2, which would be surprising.]
- Rfixj_: see Navascues et al. BMC Evol. Biol 2014. [to be detailed]
- Rsixj_: see Navascues et al. BMC Evol. Biol 2014. [to be detailed]
- Wx2s1ixj_: see Navascues et al. BMC Evol. Biol 2014. [to be detailed]
- Wx1s2ixj_: see Navascues et al. BMC Evol. Biol 2014. [to be detailed]
- pattDixj_: Patterson's D statistic used in the "ABBA-BABA" test (Patterson et al. Genetics 2012). [How is it implemented? Does it not need 4 populations? Is it F2 maybe? Is this test really implemented? Martin, can you confirm?]
The following are Patterson's tests described in Patterson et al. Genetics 2012:
- f3: is this test really implemented? [Martin, can you confirm?]
- f4: is this test really implemented? [Martin, can you confirm?]
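The two-population site categories listed above (sfA, sfB, sxA, sxB, sxAfB, sxBfA, ss) and the divergence measures (Dxy, DA, FST) can be sketched in Python as follows. This is an illustration under stated assumptions, not the msums implementation: the function names are hypothetical, pi_total is assumed to be the nucleotide diversity of the pooled sample, and the FST formula is read as 1 - ((pi_i + pi_j)/2)/pi_total.

```python
from itertools import combinations


def classify_site(site):
    """Assign a two-population site string (e.g. '0101/0000') to a category."""
    def state(block):
        alleles = set(block)
        if alleles == {"0"}:
            return "anc"   # fixed for the ancestral allele
        if alleles == {"1"}:
            return "der"   # fixed for the derived allele
        return "poly"      # both alleles present

    pop_a, pop_b = site.split("/")
    categories = {
        ("der", "anc"): "sfA",
        ("anc", "der"): "sfB",
        ("poly", "anc"): "sxA",
        ("anc", "poly"): "sxB",
        ("poly", "der"): "sxAfB",
        ("der", "poly"): "sxBfA",
        ("poly", "poly"): "ss",
    }
    # Sites fixed for the same allele in both populations are not in the list.
    return categories.get((state(pop_a), state(pop_b)), "other")


def mean_differences(pairs):
    """Mean number of differing positions over an iterable of haplotype pairs."""
    pairs = list(pairs)
    return sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)


def divergence(pop_i, pop_j):
    """Return (Dxy, Da, FST) for two lists of equal-length 0/1 haplotypes.

    Each population needs at least two haplotypes.
    """
    pi_i = mean_differences(combinations(pop_i, 2))
    pi_j = mean_differences(combinations(pop_j, 2))
    pi_total = mean_differences(combinations(pop_i + pop_j, 2))  # pooled sample
    dxy = mean_differences((x, y) for x in pop_i for y in pop_j)
    da = dxy - (pi_i + pi_j) / 2
    fst = 1 - ((pi_i + pi_j) / 2) / pi_total if pi_total > 0 else float("nan")
    return dxy, da, fst


if __name__ == "__main__":
    print(classify_site("0101/1111"))  # sxAfB
    print(divergence(["0000", "0001"], ["1110", "1111"]))
```

Keeping the categories in a lookup table keyed on the (population A, population B) state pair makes it easy to compare the sketch line by line with the definitions in the list.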
Early versions of this program originated as a rewrite of mscalc (back then known under the somewhat unfortunate name AnalMS). Other versions of mscalc here and here.