First make sure that you have the correct environment, resolved all program’s dependencies and generated the binaries:
AmpliSolveErrorEstimation
and
AmpliSolveVariantCalling
AmpliSolve requires a pre-processing step that utilizes ASEQ. To make the execution faster, this is a separate step that can be executed in parallel using a bash script for example in a cluster. ASEQ software is available at https://demichelislab.unitn.it/doku.php?id=public:aseq , but we also provide a simplified version of ASEQ named computeCounts (only the binary, please contact us for more info) in our Pre-compiled_binaries DIR. This program deploys the PILEUP mode of ASEQ which is the one we need for this pre-processing step.
We recommend to place all available normal samples (germline) in a separete DIR say NORMAL_BAM_DIR , and all tumour samples in another DIR named TUMOUR_BAM_DIR
To run ASEQ (or equally computeCounts) type:
./ASEQ [vcf=dummyVCF.txt] [bam=myBAM.bam] [threads=int] [mbq=int] [mrq=int] [mdc=int] [out=Out_DIR]
where:
- dummyVCF.txt is a VCF-like file that contains all positions in the gene panel. Remember that this is not the actual panel design, as the panel design contains amplicons written as intervals, whereas here we enumate all positions in the panel.
Assuming that we have a small panel with one amplicon chr8:23437012-23437020 the dummyVCF.txt will look like:
chr8 23437012 . . . . . .
chr8 23437013 . . . . . .
chr8 23437014 . . . . . .
chr8 23437015 . . . . . .
chr8 23437016 . . . . . .
chr8 23437017 . . . . . .
chr8 23437018 . . . . . .
chr8 23437019 . . . . . .
chr8 23437020 . . . . . .
-
myBam.bam is the bam files of interest. Remember that we need to run this command for ALL files in NORMAL_BAM_DIR and TUMOUR_BAM_DIR
-
threads is the number of threads to use. Usually I put 8, but this depends on your system
-
mbq, mrq , mdc are quality parameters ; we recommend the triplet 20-20-20 for Ion AmpliSeq data
-
Out_DIR is the directory to store the read count files ending with .PILEUP.ASEQ
Since we need a DIR of normal read counts to compute the error levels, we must store all .PILEUP.ASEQ of normal samples in one DIR say NORMAL_ASEQ_DIR
In a similar fashion, we need a DIR for all tumour read counts to perform variant calling, so we must store all .PILEUP.ASEQ of tumour samples in another DIR say TUMOUR_ASEQ_DIR
Once we have all .PILEUP.ASEQ files for normal and tumour files stored in two separete DIRs say NORMAL_ASEQ_DIR and TUMOUR_ASEQ_DIR we are ready and we can run the AmpliSolve program.
Once the normal input data are ready, we need first to estimate the error levels per position, nucleotide and strand. This is performed using program AmpliSolveErrorEstimation. Remember that the error estimation process is performed once per panel design and the same information can be re-used many times for variant calling.
Type:
./AmpliSolveErrorEstimation panel_design=/your/panel/design/file reference_genome=/your/reference/genome germline_dir=/your/dir/with/germline/read/count/files C_value=/the/error/rescaling/factor coverage_cutoff=/minimum/germline/coverage/required/for/error/estimation default_error=/platform/specific/error/level/if/normal/not/available output_dir=/dir/to/store/the/intermediate/results
where
-
panel_degign is a bed (chrom TAB start TAB end) file that contains all amplicons coordinates
-
reference_genome is the FASTA format file used as reference
-
germline_dir is the DIR with the read counts generated by ASEQ (see how above). Following our previous example we give:
germline_dir=NORMAL_ASEQ_DIR
-
C_value is the normalization factor described in the manuscript. For ctDNA screening we recommend the value 0.002,but this depends strongly on the detection limit we want to achieve.
-
coverage_cutoff is the minimum level of coverage (FW and BW) per position required for error estimation. If individual positions have less coverage are excluded from error estimation. Default value is 100 for both strands.
-
default_error is the sequencing platform default error. This parameter is used ONLY when germline_dir==not_available in order to assign default error levels for all positions. If there are germline files available, this parameter is simply skipped and the program generates position-specific error levels based on normals.
-
output_dir is a DIR to store some intermediate results
./AmpliSolveErrorEstimation panel_design=AmpliSeq_30genes_Designed-1.bed reference_genome=/Users/Reference_genome/hg19_chr.fa coverage_cutoff=150 default_error=0.012 germline_dir=NORMAL_ASEQ_DIR C_value=0.002 output_dir=FirstExecution
Execution of AmpliSolveErrorEstimation will produce a flat file that contains the error levels per position in the panel design
One example is the following:
chrom position reference duplicate Thres_A Thres_C Thres_G Thres_T Germ_Max_A Germ_Max_C Germ_Max_G Germ_Max_T
chr8 23437012 T NO 0.002000_0.002000 0.002000_0.002000 0.002000_0.002000 -1_-1 0 0 0 -
chr8 23437013 A NO -1_-1 0.002000_0.002000 0.002167_0.002000 0.002000_0.002000 - 0 0.00170648 0
chr8 23437014 T NO 0.002000_0.002000 0.003513_0.002000 0.002000_0.002000 -1_-1 0 0.00172117 0 -
chr8 23437015 C NO 0.002339_0.002000 -1_-1 0.002169_0.002000 0.002000_0.002000 0.00178253 - 0.000799361 0
chr8 23437016 A NO -1_-1 0.002000_0.002000 0.002543_0.003835 0.002000_0.002000 - 0 0.00338409 0
chr8 23437017 A NO -1_-1 0.002000_0.002000 0.003193_0.002319 0.002000_0.002000 - 0 0.00169779 0
chr8 23437018 G NO 0.002861_0.002339 0.002000_0.002000 -1_-1 0.002000_0.002339 0.00272851 0 - 0.000273373
chr8 23437019 A NO -1_-1 0.002000_0.002000 0.002183_0.002595 0.002000_0.003189 - 0 0.00151057 0.00173461
chr8 23437020 A NO -1_-1 0.002168_0.002000 0.003176_0.002300 0.002000_0.002000 - 0.00131752 0.00334448 0
Once this file is ready, we need to give it as input to the AmpliSolveVariantCalling program that deploys a statistical framework for identifying SNVs.
Once the tumour input data are ready, and the error levels per position, nucleotide and strand are ready (see above), we can identify SNVs using the binary AmpliSolveVariantCalling
Type:
./AmpliSolveVariantCalling errorFile=/the/file/produced/by/AmpliSolveErrorEstimation tumour_dir=/your/dir/with/tumour/read/count/files output_dir=/dir/to/store/all/output coverage_cutoff=/coverage/per/strand/required/to/do/predictions p_value=/Fisher's/exact/test/p/value/for/strand/bias
where
-
errorFile is the error estimation file produced just before using the AmpliSolveErrorEstimation
-
tumour_dir is the DIR with the read counts generated by ASEQ (see how above). Following our previous example we give:
tumour_dir=TUMOUR_ASEQ_DIR
-
output_dir is a DIR to store some intermediate results
-
coverage_cutoff is the minimum level of coverage for FW and BW strand required to make a prediction. Positions with less coverage are not considered "callable". Default value is 100 for both strands.
-
p_value is the p value of Fisher's exact test used to identify positions with strand bias. Default value is 0.05
./AmpliSolveVariantCalling errorFile=positionSpecificNoise_0.0020.txt tumour_dir=TUMOUR_ASEQ_DIR output_dir=VariantCalling_Testing coverage_cutoff=100 p_value=0.05
Execution of AmpliSolveVariantCalling with this command will produce a flat file named VariantCalling_Testing/Summary_Variant_Info.txt that contains all the variants found for all tumour samples stored in the tumour_dir=TUMOUR_ASEQ_DIR. In the TODO list (sortly), we will enable to write the results per sample in the typical VCF format.
Yes, please check the Toy_data directory that contains:
AmpliSeq_30genes_Designed-1.bed
The gene panel design we used during our experimentation (see manuscript)
dummyVCF.txt
The corresponding dummVCF file for the previous panel. This is the input VCF we need to run ASEQ for the pre-processing step.
positionSpecificNoise_0.0020.txt
The complete error estimation for the previous panel design. The calculations are based on a training set of 120 normal samples with C=0.002
NORMAL_ASEQ_DIR and TUMOUR_ASEQ_DIR
Contain 5 normal and 3 tumour samples in the .ASEQ format. You can use this toy examples to run AmpliSolveErrorEstimation and then call SNVs using AmpliSolveVariantCalling
Yes, we provide compiled binaries for the Mac we used during the developemnt for programs AmpliSolveErrorEstimation, AmpliSolveVariantCalling and computeCounts
The computeCounts binary, deploys the ASEQ PILEUP mode and it can be used during the input data generation process if the original ASEQ is not available.