Skip to content

Latest commit

 

History

History
executable file
·
173 lines (114 loc) · 8.84 KB

Execution_examples.md

File metadata and controls

executable file
·
173 lines (114 loc) · 8.84 KB

How to run AmpliSolve ??

First make sure that you have the correct environment, resolved all program’s dependencies and generated the binaries:

AmpliSolveErrorEstimation 

and

AmpliSolveVariantCalling

Prepare the input files

AmpliSolve requires a pre-processing step that utilizes ASEQ. To make the execution faster, this is a separate step that can be executed in parallel using a bash script for example in a cluster. ASEQ software is available at https://demichelislab.unitn.it/doku.php?id=public:aseq , but we also provide a simplified version of ASEQ named computeCounts (only the binary, please contact us for more info) in our Pre-compiled_binaries DIR. This program deploys the PILEUP mode of ASEQ which is the one we need for this pre-processing step.

We recommend to place all available normal samples (germline) in a separete DIR say NORMAL_BAM_DIR , and all tumour samples in another DIR named TUMOUR_BAM_DIR

To run ASEQ (or equally computeCounts) type:

./ASEQ [vcf=dummyVCF.txt] [bam=myBAM.bam] [threads=int] [mbq=int] [mrq=int] [mdc=int] [out=Out_DIR]

where:

  1. dummyVCF.txt is a VCF-like file that contains all positions in the gene panel. Remember that this is not the actual panel design, as the panel design contains amplicons written as intervals, whereas here we enumate all positions in the panel.

Assuming that we have a small panel with one amplicon chr8:23437012-23437020 the dummyVCF.txt will look like:

chr8	23437012	.	.	.	.	.	.
chr8	23437013	.	.	.	.	.	.
chr8	23437014	.	.	.	.	.	.
chr8	23437015	.	.	.	.	.	.
chr8	23437016	.	.	.	.	.	.
chr8	23437017	.	.	.	.	.	.
chr8	23437018	.	.	.	.	.	.
chr8	23437019	.	.	.	.	.	.
chr8	23437020	.	.	.	.	.	.
  1. myBam.bam is the bam files of interest. Remember that we need to run this command for ALL files in NORMAL_BAM_DIR and TUMOUR_BAM_DIR

  2. threads is the number of threads to use. Usually I put 8, but this depends on your system

  3. mbq, mrq , mdc are quality parameters ; we recommend the triplet 20-20-20 for Ion AmpliSeq data

  4. Out_DIR is the directory to store the read count files ending with .PILEUP.ASEQ

Since we need a DIR of normal read counts to compute the error levels, we must store all .PILEUP.ASEQ of normal samples in one DIR say NORMAL_ASEQ_DIR

In a similar fashion, we need a DIR for all tumour read counts to perform variant calling, so we must store all .PILEUP.ASEQ of tumour samples in another DIR say TUMOUR_ASEQ_DIR

Once we have all .PILEUP.ASEQ files for normal and tumour files stored in two separete DIRs say NORMAL_ASEQ_DIR and TUMOUR_ASEQ_DIR we are ready and we can run the AmpliSolve program.

Run AmpliSolveErrorEstimation

Once the normal input data are ready, we need first to estimate the error levels per position, nucleotide and strand. This is performed using program AmpliSolveErrorEstimation. Remember that the error estimation process is performed once per panel design and the same information can be re-used many times for variant calling.

Type:

./AmpliSolveErrorEstimation panel_design=/your/panel/design/file reference_genome=/your/reference/genome germline_dir=/your/dir/with/germline/read/count/files C_value=/the/error/rescaling/factor coverage_cutoff=/minimum/germline/coverage/required/for/error/estimation default_error=/platform/specific/error/level/if/normal/not/available output_dir=/dir/to/store/the/intermediate/results

where

  1. panel_degign is a bed (chrom TAB start TAB end) file that contains all amplicons coordinates

  2. reference_genome is the FASTA format file used as reference

  3. germline_dir is the DIR with the read counts generated by ASEQ (see how above). Following our previous example we give:

germline_dir=NORMAL_ASEQ_DIR
  1. C_value is the normalization factor described in the manuscript. For ctDNA screening we recommend the value 0.002,but this depends strongly on the detection limit we want to achieve.

  2. coverage_cutoff is the minimum level of coverage (FW and BW) per position required for error estimation. If individual positions have less coverage are excluded from error estimation. Default value is 100 for both strands.

  3. default_error is the sequencing platform default error. This parameter is used ONLY when germline_dir==not_available in order to assign default error levels for all positions. If there are germline files available, this parameter is simply skipped and the program generates position-specific error levels based on normals.

  4. output_dir is a DIR to store some intermediate results

One complete execution example looks as follows:

./AmpliSolveErrorEstimation panel_design=AmpliSeq_30genes_Designed-1.bed reference_genome=/Users/Reference_genome/hg19_chr.fa coverage_cutoff=150 default_error=0.012 germline_dir=NORMAL_ASEQ_DIR C_value=0.002 output_dir=FirstExecution

Execution of AmpliSolveErrorEstimation will produce a flat file that contains the error levels per position in the panel design

One example is the following:

chrom	position	reference	duplicate	Thres_A	Thres_C	Thres_G	Thres_T	Germ_Max_A	Germ_Max_C	Germ_Max_G	Germ_Max_T
chr8	23437012	T	NO	0.002000_0.002000	0.002000_0.002000	0.002000_0.002000	-1_-1	0	0	0	-
chr8	23437013	A	NO	-1_-1	0.002000_0.002000	0.002167_0.002000	0.002000_0.002000	-	0	0.00170648	0
chr8	23437014	T	NO	0.002000_0.002000	0.003513_0.002000	0.002000_0.002000	-1_-1	0	0.00172117	0	-
chr8	23437015	C	NO	0.002339_0.002000	-1_-1	0.002169_0.002000	0.002000_0.002000	0.00178253	-	0.000799361	0
chr8	23437016	A	NO	-1_-1	0.002000_0.002000	0.002543_0.003835	0.002000_0.002000	-	0	0.00338409	0
chr8	23437017	A	NO	-1_-1	0.002000_0.002000	0.003193_0.002319	0.002000_0.002000	-	0	0.00169779	0
chr8	23437018	G	NO	0.002861_0.002339	0.002000_0.002000	-1_-1	0.002000_0.002339	0.00272851	0	-	0.000273373
chr8	23437019	A	NO	-1_-1	0.002000_0.002000	0.002183_0.002595	0.002000_0.003189	-	0	0.00151057	0.00173461
chr8	23437020	A	NO	-1_-1	0.002168_0.002000	0.003176_0.002300	0.002000_0.002000	-	0.00131752	0.00334448	0

Once this file is ready, we need to give it as input to the AmpliSolveVariantCalling program that deploys a statistical framework for identifying SNVs.

Run AmpliSolveVariantCalling

Once the tumour input data are ready, and the error levels per position, nucleotide and strand are ready (see above), we can identify SNVs using the binary AmpliSolveVariantCalling

Type:

./AmpliSolveVariantCalling errorFile=/the/file/produced/by/AmpliSolveErrorEstimation tumour_dir=/your/dir/with/tumour/read/count/files output_dir=/dir/to/store/all/output coverage_cutoff=/coverage/per/strand/required/to/do/predictions p_value=/Fisher's/exact/test/p/value/for/strand/bias

where

  1. errorFile is the error estimation file produced just before using the AmpliSolveErrorEstimation

  2. tumour_dir is the DIR with the read counts generated by ASEQ (see how above). Following our previous example we give:

tumour_dir=TUMOUR_ASEQ_DIR
  1. output_dir is a DIR to store some intermediate results

  2. coverage_cutoff is the minimum level of coverage for FW and BW strand required to make a prediction. Positions with less coverage are not considered "callable". Default value is 100 for both strands.

  3. p_value is the p value of Fisher's exact test used to identify positions with strand bias. Default value is 0.05

One complete execution example looks as follows:

./AmpliSolveVariantCalling errorFile=positionSpecificNoise_0.0020.txt tumour_dir=TUMOUR_ASEQ_DIR output_dir=VariantCalling_Testing coverage_cutoff=100 p_value=0.05

Execution of AmpliSolveVariantCalling with this command will produce a flat file named VariantCalling_Testing/Summary_Variant_Info.txt that contains all the variants found for all tumour samples stored in the tumour_dir=TUMOUR_ASEQ_DIR. In the TODO list (sortly), we will enable to write the results per sample in the typical VCF format.

Can I find some toy data to test AmpliSolve ??

Yes, please check the Toy_data directory that contains:

AmpliSeq_30genes_Designed-1.bed

The gene panel design we used during our experimentation (see manuscript)

dummyVCF.txt 

The corresponding dummVCF file for the previous panel. This is the input VCF we need to run ASEQ for the pre-processing step.

positionSpecificNoise_0.0020.txt

The complete error estimation for the previous panel design. The calculations are based on a training set of 120 normal samples with C=0.002

NORMAL_ASEQ_DIR and TUMOUR_ASEQ_DIR

Contain 5 normal and 3 tumour samples in the .ASEQ format. You can use this toy examples to run AmpliSolveErrorEstimation and then call SNVs using AmpliSolveVariantCalling

Can I have pre-compiled binaries to run ?

Yes, we provide compiled binaries for the Mac we used during the developemnt for programs AmpliSolveErrorEstimation, AmpliSolveVariantCalling and computeCounts

The computeCounts binary, deploys the ASEQ PILEUP mode and it can be used during the input data generation process if the original ASEQ is not available.