Computational methods used in Wu et al.

`EM`: Estimating parameters using expectation maximization (EM)

Use expectation maximization (EM) to estimate parameters for mixture of dirichlet-multinomials (MDM) models from counts-by-ref file generated by the python script.

Requirement

Scripts and usages

`search_em_mdm.R`: EM for trio data

3 parameters required.

Rscript search_em_mdm.R numTrial numComponent countsByRefFile

numTrail: The number of attemps to estimate the parameters
numComponent: The number of components in the MDM model. A suffix 'p' or "P" can be appended to the number, which will use an alternative filtering process (See Note below).
countsByRefFile: Counts-by-ref file generated from the python script get_base_counts_trio.py
The default read count filter is 10 and 150. This can be change by
```
upperLimit <- 150
lowerLimit <- 10
```

`search_em_mdm_single.R`: EM for single data

3 parameters required.

Rscript em_mdm_single.R numTrial numComponent countsByRefFileSingle

numTrail: The number of attemp to estimate the parameters. Same as above
numComponent: The number of components in the MDM model.
countsByRefFileSingle: Counts-by-ref file generated from the python script get_base_counts_single.py

`mdm.R`: Core functions

Usually don't use this directly.

Example

Run EM on 2 components MDM model with alternative filter 5 times on counts_example_byref.txt

Rscript search_em_mdm.R 5 2P counts_example_byref.txt

The expected output should be something like the following and you should have different ll values. The max ll we are able to obtain for the example dataset is -2593.181911539119

Searching for maximum likelihood of 2-component model....
Using potential heterozygous data with 10x to 150x coverage....
  Trial 1: ll = -2593.377037122174 (max ll = -2593.377037122174)
  Trial 2: ll = -2593.424801477642 (max ll = -2593.377037122174)
  Trial 3: ll = -2593.18827510774 (max ll = -2593.18827510774)
  Trial 4: ll = -2593.199191810304 (max ll = -2593.18827510774)
  Trial 5: ll = -2593.560652839873 (max ll = -2593.18827510774)
Saving results to em_mdm_example_2P_results_10275.RData ...

Note

There are two different filtering processes. The default filter used both read counts (default values: between 10 and 150) and previous SNPs calls result. In our case, we are using the results from 1000 Genomes Project. This is used to generate the True Heterozygote (TH) dataset in our paper.

An alternative filter only filtering on read counts. This is used to generate the Potential Heterozygote (PH) dataset in our paper. When the users specify the number of components, if a suffix "p" or "P" is appended (e.g. 3p or 4P), then the alternative filter is applied.

`SNP`: Calling heterozygous sites

Call heterozygous sites using Samtools and BCFtools. Filter the results and generate base counts-by-ref files for expectation maximization analyses.

Requirement

Samtools version 1.2 or later
BCFtools version 1.2 or later
Python 2.7
Recommendation: GNU Parallel

Scripts and usages

`pipelineCore.sh`: Main scriipt to run the analyses

Calls other scripts in the folder

The following parameters are required.

datasetName="CEU13_example" #Can be any name
chromosome=21 #Which chromosome would you like to perforem the analyses on
chromosomePosition=":9411180-9421180"  #Perfrom analyses on a specific region e.g. chromosomePosition=":10000-20000"
#chromosomePosition="" #Blank: perform analyses on the whole chromosame.

workingDir="./data/${datasetName}_C${chromosome}/"  #Location of the output files
scriptDir="./"  #Location of the scripts

variableSiteFile="./data/snp_chr21_v3_partial.vcf.gz" #Location of known snp file
refGenome="./data/human_g1k_v37_c21_partial.fasta" #Location of the reference file
dataDir="./data/"   #Location of the datafile
dataFiles=('CEU13_NA12878.bam' 'CEU13_NA12892.bam' 'CEU13_NA12891.bam') #File names of each individual

parallelCount=3 #Number of parallel jobs
numSplitFiles=5 #Number of split files
isTrio=1 #1:trio 0:single caller
isOriginalCaller=1 #[1|0] # BCFtools call:classic mode -c, --consensus-caller  or new mode -m, --multiallelic-caller
exome=0 #1: turn on exome analyses
exome_file=0 #Location of the exome files

`get_base_counts_trio.py`: Python script to parse the output for both trio and single callers

6 parameters required.

python2 get_base_counts_trio.py chromosome datasetName variableSiteFile workingDir pileupFile.gz pileupExomeFile

# chromosome: chromasome number (21)
# datasetName: Any name (CEU13_example)
# variableSiteFile: Location of known snp file (/data/snp_chr21_v3_partial.vcf.gz)
# workingDir: Location of the output files (/data/CEU13_example_C21/original/)
# pileupFile: pileupFile created by Samtools, MUST be is gzip format. (chr21_CEU13_example.pileups.gz)
# pileupExomeFile: pileupFile for exame. 0 if no exame. (0)

`get_base_counts_single.py`: Python script to parse the output for single caller

Same usage as above

`checkSnps.sh`: Calling SNPs using trio and single caller

Called by pipelineCore.sh, usually don't call this directly.
Trio: 9 parameters required.

checkSnps.sh chromosome isOriginalCaller outfileDir refGenome infilePrefix infileChild infileParent1 infileParent2 "startPos ndPos"

Single: 6 parameters required.

checkSnps.sh chromosome isOriginalCaller outfileDir refGenome infile "startPos endPos"

`setupParameters.sh`: Setup parameters for the pipeline

This should not be call directly.

Authors' script

These two scripts are used by authors to perform the analyese. They record all the dataset, sequence files, refenece genome ... etc used in the analyese.
The directories and location of these files need to be reconfigured if users
runCEU1X_ChromY.sh: Script for the CEU dataset.
- Usage: runCEU1X_ChromY.sh ceuDataset chromosomeNumber
- Example: runCEU1X_ChromY.sh 13 21 #Analyese on CEU13 chromosome 21
runCHM1_ChromY.sh: Script for the CHM1 dataset.
- Usage: runCHM1_ChromY.sh chromosomeNumber
- Example: runCHM1_ChromY.sh 21 #Analyese on CHM1 chromosome 21

Example

#setup example files
cd WuEtAl2016/SNP/data
tar -xzf example.tar.gz
./setupCEU13.sh
#Run example dataset
cd ..
./pipelineCore.sh

The expected output should be someting like the following

===== Setup other parameters =====
===== Samtools found. samtools --version-only: ...
...
...
===== Setup =====
DatasetName: CEU13_example
WorkingDir: ./data/CEU13_example_C21//original/
scriptDir: ./
dataDir: ./data/
VariableSiteFile: ./data/snp_chr21_v3_partial.vcf.gz
ReferenceGenome: ./data/human_g1k_v37_c21_partial.fasta
...
...
===== Parse output files =====
Parse calls from trio
Parse calls from single
Get variable sites
Process homos_ref
Process homos_alt
Process hets

`R`: R scripts

Various R scripts for generating summary tables and figures for the manuscripts.
Use with cation! Some sections of the scripts might be hard coded or design to run interactively.
summarySetup.R: Setup folders and filenames.
summaryFunctions.R: Lots of functions for various analyses.
summaryModel.R: Analyese results from CEU.
summaryModelChm1.R: Analyese results from CHM1.
summaryQqplotExp.R: QQ plots for CEU.
summaryQqplotExpChm1.R: QQ plots for CHM1.
summaryModelPlots.R: Phi plot.
summaryCompareMajMinCnvLcr.R: Analyses for major vs minor compoents.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Computational methods used in Wu et al.

Table of Contents

`EM`: Estimating parameters using expectation maximization (EM)

Requirement

Scripts and usages

`search_em_mdm.R`: EM for trio data

`search_em_mdm_single.R`: EM for single data

`mdm.R`: Core functions

Example

Note

`SNP`: Calling heterozygous sites

Requirement

Scripts and usages

`pipelineCore.sh`: Main scriipt to run the analyses

`get_base_counts_trio.py`: Python script to parse the output for both trio and single callers

`get_base_counts_single.py`: Python script to parse the output for single caller

`checkSnps.sh`: Calling SNPs using trio and single caller

`setupParameters.sh`: Setup parameters for the pipeline

Authors' script

Example

`R`: R scripts

About

Releases 2

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
EM		EM
R		R
SNP		SNP
README.md		README.md

CartwrightLab/WuEtAl2017

Folders and files

Latest commit

History

Repository files navigation

Computational methods used in Wu et al.

Table of Contents

EM: Estimating parameters using expectation maximization (EM)

Requirement

Scripts and usages

search_em_mdm.R: EM for trio data

search_em_mdm_single.R: EM for single data

mdm.R: Core functions

Example

Note

SNP: Calling heterozygous sites

Requirement

Scripts and usages

pipelineCore.sh: Main scriipt to run the analyses

get_base_counts_trio.py: Python script to parse the output for both trio and single callers

get_base_counts_single.py: Python script to parse the output for single caller

checkSnps.sh: Calling SNPs using trio and single caller

setupParameters.sh: Setup parameters for the pipeline

Authors' script

Example

R: R scripts

About

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

`EM`: Estimating parameters using expectation maximization (EM)

`search_em_mdm.R`: EM for trio data

`search_em_mdm_single.R`: EM for single data

`mdm.R`: Core functions

`SNP`: Calling heterozygous sites

`pipelineCore.sh`: Main scriipt to run the analyses

`get_base_counts_trio.py`: Python script to parse the output for both trio and single callers

`get_base_counts_single.py`: Python script to parse the output for single caller

`checkSnps.sh`: Calling SNPs using trio and single caller

`setupParameters.sh`: Setup parameters for the pipeline

`R`: R scripts

Packages