These scripts will allow you to identify rare SNPs that discriminate individual strains and to track these SNPs between hosts to elucidate transmission patterns.
Before running these scripts, you'll need to have run:
merge_midas.py snps
- Scan across the entire genome of a particular species
- At each genomic site, compute the presence-absence of the four nucleotides across metagenomic samples from unrelated individuals
- Identify SNPs (a particular nucleotide at a genomic site) that rarely occur in different unrelated samples
- Because these SNPs are rarely found in different individuals, they serve as good markers of host-specific strains
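The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code inside `strain_tracking.py`: the function name `find_marker_alleles`, the `presence` data layout, and the sample and site names are all hypothetical, and the presence-absence calls are assumed to have been made already.

```python
from collections import Counter

def find_marker_alleles(presence, max_samples_with_allele=1):
    """Return (site, allele) pairs seen in at most `max_samples_with_allele` samples.

    `presence` maps site_id -> {sample_id -> set of nucleotides called present}.
    This layout is an assumption made for illustration only.
    """
    markers = []
    for site_id, per_sample in presence.items():
        # Count, for each nucleotide, how many samples carry it at this site.
        counts = Counter()
        for alleles in per_sample.values():
            counts.update(alleles)
        for allele, n_samples in sorted(counts.items()):
            if 1 <= n_samples <= max_samples_with_allele:
                markers.append((site_id, allele))
    return markers

# Toy input: three unrelated samples, two genomic sites (hypothetical data).
presence = {
    "site1": {"s1": {"A"}, "s2": {"A"}, "s3": {"A", "G"}},
    "site2": {"s1": {"C"}, "s2": {"C"}, "s3": {"C"}},
}
markers = find_marker_alleles(presence)
print(markers)
```

With the default limit of one sample, only the `G` allele at `site1` qualifies, because every other allele is present in all three samples.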
strain_tracking.py id_markers --indir <PATH> --out <PATH> [options]
--samples STR
Comma-separated list of samples to use for training
Useful for specifying the subset of samples from unrelated subjects in the SNP matrix
By default, all samples are used
--min_freq FLOAT
Minimum allele frequency (proportion of reads) per site for SNP calling (0.10)
--min_reads INT
Minimum number of reads supporting allele per site for SNP calling (3)
--allele_freq INT
Maximum occurrences of allele across samples (1)
Setting this to 1 (default) will pick alleles found in exactly 1 sample
--max_sites INT
Maximum number of genomic sites to process (use all)
Useful for quick tests
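One plausible reading of `--min_freq` and `--min_reads` is that an allele is called present at a site only when it passes both thresholds. The sketch below shows that reading on per-site read counts. The function `call_alleles` and its input format are hypothetical, and the assumption that both filters are combined with a logical AND is not confirmed by the text above.

```python
def call_alleles(read_counts, min_freq=0.10, min_reads=3):
    """Return the set of nucleotides called present at one site in one sample.

    `read_counts` maps nucleotide -> number of supporting reads (assumed layout).
    An allele is kept when it has at least `min_reads` supporting reads AND makes
    up at least `min_freq` of all reads at the site (assumed combination rule).
    """
    depth = sum(read_counts.values())
    if depth == 0:
        return set()  # no coverage, so nothing can be called
    return {
        allele
        for allele, count in read_counts.items()
        if count >= min_reads and count / depth >= min_freq
    }

# 18 reads support A, 2 support C: C fails the default min_reads of 3.
print(call_alleles({"A": 18, "C": 2, "G": 0, "T": 0}))
# Stricter thresholds keep only the dominant allele.
print(call_alleles({"A": 24, "C": 6, "G": 0, "T": 0}, min_freq=0.90, min_reads=5))
```

Raising both thresholds, as in the strict example, trades sensitivity for fewer false marker calls caused by sequencing errors.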
- Use a subset of samples in the SNP matrix for training:
strain_tracking.py id_markers --indir merged_snps/species_id --out species.markers --samples sample1,sample2,sample3
- Run a quick test:
strain_tracking.py id_markers --indir merged_snps/species_id --out species.markers --max_sites 10000
- Use strict criteria for picking marker alleles:
strain_tracking.py id_markers --indir indir --out outfile --min_freq 0.90 --min_reads 5 --allele_freq 1
- Compute the presence of marker SNPs (identified in Step 1) across all metagenomic samples, including those from related individuals
- Quantify the number and fraction of marker SNPs that are shared between all pairs of metagenomic samples
- Based on a SNP sharing cutoff (e.g. 5%), determine whether a strain is shared
- Because these SNPs are rarely found in unrelated individuals (Step 1), their presence in multiple samples is strong evidence of strain sharing/transmission
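The pairwise comparison described above might look like the following sketch. It is not the implementation in `strain_tracking.py`: the function `marker_sharing`, the input layout, and the sample names are hypothetical. In particular, the text does not say which denominator defines the "fraction" of shared markers, so the code assumes the smaller of the two samples' marker counts.

```python
from itertools import combinations

def marker_sharing(markers_by_sample, cutoff=0.05):
    """Compare every pair of samples by the marker alleles they both carry.

    `markers_by_sample` maps sample_id -> set of (site, allele) markers detected
    in that sample (assumed layout). Returns one tuple per pair:
    (sample_a, sample_b, n_shared, fraction_shared, is_shared).
    """
    results = []
    for a, b in combinations(sorted(markers_by_sample), 2):
        set_a, set_b = markers_by_sample[a], markers_by_sample[b]
        n_shared = len(set_a & set_b)
        # Assumption: normalize by the smaller marker set of the pair.
        denom = min(len(set_a), len(set_b))
        fraction = n_shared / denom if denom else 0.0
        results.append((a, b, n_shared, fraction, fraction >= cutoff))
    return results

# Hypothetical samples: a related pair and one unrelated individual.
markers_by_sample = {
    "mother": {("site1", "G"), ("site7", "T"), ("site9", "A"), ("site12", "C")},
    "infant": {("site1", "G"), ("site7", "T")},
    "unrelated": {("site3", "C"), ("site5", "A")},
}
for row in marker_sharing(markers_by_sample):
    print(row)
```

Here the infant and mother share both of the infant's markers and are flagged as sharing a strain, while neither shares any marker with the unrelated sample.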
strain_tracking.py track_markers --indir /path/to/snps/species_id --out species_id.marker_sharing --markers species_id.markers [options]
--min_freq FLOAT
Minimum allele frequency (proportion of reads) per site for SNP calling (0.10)
--min_reads INT
Minimum number of reads supporting allele per site for SNP calling (3)
--max_sites INT
Maximum number of genomic sites to process (use all)
Useful for quick tests
--max_samples INT
Maximum number of samples to process (use all)
Useful for quick tests