Understanding the input
Several samples prepared in exactly the same way as you prepare your test samples, in sorted BAM format. WISECONDOR tries to determine how read frequencies behave over several samples, and as such it requires several samples to train on. During development there were about 15 to 20 samples, which seemed to work reliably. For diagnostics, it's better to use more samples spread out over several sequencing runs, so WISECONDOR learns the read frequency behaviour over different points in time with varying environmental properties. If you want to use it for diagnostics, aim for about 100. On the other hand, I've had reports of people using a type of sequencer I do not have access to who were able to correctly classify a trisomy 18 case using 4 reference samples and lower coverages than I ever tried, so if you want to try the method before spending all your budget you may experiment with a smaller set first.
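To give an idea of what WISECONDOR does with these BAM files, here is a minimal sketch (not WISECONDOR's own code) of turning a sorted BAM into per-bin read counts, which is the kind of representation the reference is trained on. The use of pysam and the 1 Mb bin size are assumptions for illustration:

```python
# A sketch (not WISECONDOR's own code) of how reads from a sorted BAM
# end up as per-bin counts; only each read's mapped position matters.
import pysam

BIN_SIZE = 1_000_000  # illustrative bin size, an assumption


def bin_reads(bam_path):
    counts = {}  # counts[chrom][bin_index] = number of reads
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped:
                continue
            chrom = read.reference_name
            bin_index = read.reference_start // BIN_SIZE
            counts.setdefault(chrom, {})
            counts[chrom][bin_index] = counts[chrom].get(bin_index, 0) + 1
    return counts
```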
There is read depth normalization, but it's done implicitly by the LOWESS GC-correction. WISECONDOR previously had a separate step to normalize the data (which was only useful after applying the RETRO filter, due to the read towers or spikes in the data), but I decided to remove it as I applied LOWESS using a division:

    correctedValue = sample[chrom][bin] / lowessCurve.pop(0)

As this step scales the actual read count of any bin to about 1, the separate normalization step became obsolete; results with and without normalizing the data prior to GC-correction showed no noticeable differences.
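To make the implicit normalization concrete, here is a minimal sketch of GC correction by dividing each bin's count by a LOWESS fit of count versus GC content. This is not WISECONDOR's actual implementation; the statsmodels `lowess` function, the `frac` value, and the toy arrays are assumptions for illustration:

```python
# Sketch of GC correction by LOWESS division, assuming per-bin read
# counts and GC fractions are available; illustrative values only.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess


def gc_correct(counts, gc):
    # Fit the expected read count as a smooth function of GC content.
    expected = lowess(counts, gc, frac=0.6, return_sorted=False)
    # Dividing by the fit removes the GC bias and, since each bin ends
    # up scaled to roughly 1, also normalizes for total read depth.
    return counts / expected


counts = np.array([120.0, 95.0, 143.0, 88.0, 130.0, 101.0])
gc = np.array([0.38, 0.52, 0.41, 0.55, 0.44, 0.50])
print(gc_correct(counts, gc))
```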
The data we obtained from our lab contains enough reads to allow us to use this setting. We prefer using only the most reliable data we can get over more, less reliably mapped data. Of course, you are free to test WISECONDOR with mismatches allowed; just make sure your reference set is built using the same settings as your test samples. WISECONDOR does not care about these mismatches, it only counts reads based on their position on the genome. I suspect a problem occurs when you use relatively long reads. We used 51 bp single-end reads, which are rather affordable these days. If you happen to use longer reads, for example 250 bp, mapping with 0 mismatches allowed will significantly decrease the percentage of reads mapped. The reasons for this are the increased chance of a sequencing error as the sequence becomes longer and the higher chance of covering a SNP in the patient. If you work with such long reads, consider allowing mismatches (preferably near the ends of reads, as some mappers do) or cutting the reads short before mapping, such as trimming them down to 51 bp; see the sketch below.
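If you do choose to trim, a minimal sketch of cutting FASTQ records down to 51 bp before mapping could look like this. The file names and the helper are hypothetical; dedicated tools such as seqtk or cutadapt do the same job more robustly:

```python
# Hypothetical helper: trim every read in a FASTQ file to 51 bp before
# mapping, as suggested above for long-read data.
def trim_fastq(in_path, out_path, length=51):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            # In a 4-line FASTQ record, the 2nd (sequence) and 4th
            # (quality) lines carry per-base data and get trimmed.
            if i % 4 in (1, 3):
                line = line[:length]
            fout.write(line + "\n")


trim_fastq("sample.fastq", "sample_51bp.fastq")
```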
Not much: WISECONDOR will probably not report any aberrated bins (at least not fetal ones), but it won't check for fetal percentages; it simply assumes there is enough fetal DNA.
We believe about 12 million mappable reads is safe; in general 10 million seems enough, and 8 million appears to be the absolute lower threshold. The more reads the merrier, up to a certain point. Keep in mind that a reduced number of reads per bin quickly increases the standard deviation of that bin relative to the small changes we are looking for, thus decreasing WISECONDOR's sensitivity. Also realize that high coverages (I have no actual numbers, but let's assume anything over 25 million reads per sample, giving us a window of 10 to 25 million reads in daily practice) may cause the RETRO filter to drop reads that should be kept for analysis. If you work with significantly higher coverage samples, consider experimenting with the RETRO filter options in your analysis.
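As a rough back-of-the-envelope illustration of why fewer reads hurt sensitivity: if counting noise is approximately Poisson, a bin's relative standard deviation is about 1/sqrt(reads per bin). The bin count of 3000 below is an assumption for illustration, not a WISECONDOR setting:

```python
import math

# Illustrative only: with roughly Poisson counting noise, a bin's
# relative standard deviation is about 1/sqrt(reads per bin), so small
# copy-number shifts need enough reads per bin to stand out.
BINS = 3000  # assumed number of usable bins, purely for illustration

for total_reads in (8e6, 10e6, 12e6, 25e6):
    per_bin = total_reads / BINS
    rel_sd = 1 / math.sqrt(per_bin)
    print(f"{total_reads / 1e6:.0f}M reads: ~{per_bin:.0f} reads/bin, "
          f"relative SD ~{rel_sd:.1%}")
```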
I built a reference using relatively messy samples; what will a test on a good sample show me using this reference?
Hard to predict, and it depends on what you call messy. If you use mostly the same protocol to prepare the samples in the laboratory, WISECONDOR may happily get along with your messy reference set: any structurally comparable behaviour among bins can still be identified when building the reference set, and even if the tested sample has completely different read depths over all bins, the calls are not necessarily bad as long as the bins identified in the reference still behave structurally alike.
The problem occurs when the read depths over bins start to behave differently, which may happen when the workflow in a laboratory changes, although even that may be less influential than expected. Still, the general bioinformatics rule applies here:
- Rubbish in is rubbish out.
An interesting side note: we have seen pretty good results using a dirty reference set. While most samples in it were unaberrated, some were not (trisomy 21, for example), yet when using the built reference on that very same set of samples, WISECONDOR called the trisomy 21 cases without mistakes. We do vouch for using reference sets that do not contain such erroneous samples, but if there is no other data available, you may give it a shot.
If you run into issues, please create a ticket so I can take care of it.
If you have other troubles running WISECONDOR or any related questions, feel free to contact me through the e-mail address on my GitHub page.