Discrepancy allele length and sequence #42
Comments
In addition, the count and the sequence don't perfectly add up in other cases either. For instance, the second ALT allele in the example below is 771 bp long if you count the A's and G's. However, I would have expected it to be 814 nucleotides. Both lengths, however, fall within the range 738-909. Are sequence and count determined based on different consensus sequences?
Thanks for reporting @ljohansson, really appreciate it. For the first case, there should be 2 ALT alleles, one shorter (11 RU) and one longer (332 RU) than the reference (50 RU). The GT should be "1/2" instead of "0/1". For the second case, the GT field is actually correct (1/2), as neither allele matches the REF, and both the short and the long sequences are displayed in the ALT field. The ALT sequence reported is chosen from one of the supporting reads that matches closest in length (bp) to the assigned genotype. It seems the assigned length (814) and the reported length (772) are pretty far apart. Can you post the TSV output of the second case so I can check whether 772 is indeed the closest? Having seen the second case (which is similar to the first), I don't know why the first case reported the wrong genotype (0/1 instead of 1/2) and missed reporting the longer sequence in the ALT field.
After a second look, I figured out the reason. Your guess is right: the script is using the lower bound of the size range (26, and minus 10%) to compare against the reference length. That's why it thinks the second allele is REF size!
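To make the failure mode concrete, here is a minimal Python sketch contrasting a range-based REF test with a point-based one. It is not Straglr's code: the function names, the 10% tolerance applied to both ends of the range, and the point-based alternative are all assumptions for illustration. The numbers are the AC/ACR values from the FGF14 example in this issue.

```python
def is_ref_by_range(size_range, ref_size, tol=0.10):
    """Range-based test (assumed): call REF if ref_size falls inside the
    allele's size range after widening both ends by tol."""
    low, high = size_range
    return low * (1 - tol) <= ref_size <= high * (1 + tol)


def is_ref_by_point(allele_size, ref_size, tol=0.10):
    """Point-based test (hypothetical alternative): call REF only if the
    allele's own size estimate is within tol of ref_size."""
    return abs(allele_size - ref_size) <= tol * ref_size


ref_cn = 50.0  # reference copy number at the locus, from the issue
alleles = [
    {"cn": 11.2, "cn_range": (4.3, 14.7)},
    {"cn": 331.9, "cn_range": (26.0, 405.3)},
]

for allele in alleles:
    by_range = is_ref_by_range(allele["cn_range"], ref_cn)
    by_point = is_ref_by_point(allele["cn"], ref_cn)
    print(allele["cn"], "range-based REF:", by_range, "point-based REF:", by_point)
```

With these inputs the range-based test flags the 331.9-copy allele as REF because one outlying 26.0-copy read stretches its range below 50, while the point-based test flags neither allele.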
Hi @readmanchiu:
You are correct that the current experiment is a targeted one. It was a region cut out using Cas9 and enriched and sequenced using a MinION. Therefore coverages are quite high in some cases, although not so much in the second case. I am not sure what would be a good implementation, but I have understood from a laboratory specialist that expanded STRs can be quite unstable and vary in length within a single person, so I guess a high spread may be present more often. However, in the current use-case given the read sequence, the '26' read seems off as a whole.
Do I understand correctly that the sequence in the VCF is not a consensus sequence, but the sequence of a single selected read, including possible sequencing errors unique to that read?
You are right, the ALT sequence is extracted from one of the supporting read sequences. Yes, there may be sequencing errors.
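The selection described above (take the supporting read whose repeat length is closest to the assigned allele size) can be sketched as follows. This is a hedged illustration only: `pick_closest_read` and the toy sequences are hypothetical, and how Straglr breaks ties or extracts the repeat from each read is not stated in this thread.

```python
def pick_closest_read(read_seqs, target_len):
    """Return the read sequence whose length (bp) is closest to target_len.

    The result is a single read, not a consensus, so any sequencing errors
    in that read are carried over unchanged."""
    return min(read_seqs, key=lambda seq: abs(len(seq) - target_len))


# Toy supporting reads for one allele: 30, 36 and 45 bp of GAA repeats.
reads = ["GAA" * 10, "GAA" * 12, "GAA" * 15]
chosen = pick_closest_read(reads, target_len=38)
print(len(chosen))
```

This also shows why the reported sequence length and the assigned allele length can differ: the assigned size is a cluster-level estimate, while the sequence comes from whichever individual read happens to lie nearest to it.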
Hi @ljohansson,
Hi @readmanchiu
Dear @readmanchiu,
I think the 2 outlier reads threw the clustering off. Since you have a pretty deep sequencing depth, if you increase the parameter
Changing the min_cluster_size to 5 indeed got rid of the two outliers. However, the two alleles still got combined into a single allele (see tsv output below). In this case this is important, because one of the two has the benign GAAGGA pattern and the other one the pathogenic GAAGAA pattern. As there are slightly more reads with the GAAGAA pattern (the ~151/302 RU allele), this is the one represented in the vcf file, but it could have been a GAAGGA read as well if the balance had been slightly different.
A second question: in our panel, the coverage is not equally high for all the different repeats in the catalogue. Is there also a minimum percentage of usable reads that can be set? A min_cluster_percentage of 0.2, for instance? This could be very useful, since the regions with lower coverage might become problematic with min_cluster_size values that are too high.
On a side note: interestingly, the two outlier alleles have aberrant patterns, with the blue allele (from the picture in the previous post) showing a long stretch of GGAGGA (briefly interrupted by a few AGGG) and the red one showing an AGGGAGGG pattern. I guess these aberrant sequences were not counted. Both alleles had a 'short' GAAGAA sequence at the end, which is probably what was counted as the repeat for those sequences.
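A fractional threshold like the one suggested above could be resolved per locus along these lines. This is a minimal sketch under stated assumptions: `effective_min_cluster_size` is a hypothetical helper, and treating values strictly between 0 and 1 as a fraction of the usable reads (rounded up) is only one possible convention.

```python
import math


def effective_min_cluster_size(min_cluster_size, n_reads):
    """Resolve a cluster-size threshold to an absolute read count.

    Values strictly between 0 and 1 are read as a fraction of n_reads
    (rounded up, at least 1); anything else is taken as an absolute count."""
    if 0 < min_cluster_size < 1:
        return max(1, math.ceil(min_cluster_size * n_reads))
    return int(min_cluster_size)


# A deep locus and a shallow locus with the same fractional setting.
print(effective_min_cluster_size(0.2, 187))
print(effective_min_cluster_size(0.2, 12))
print(effective_min_cluster_size(5, 187))
```

The point of the fractional form is that one setting scales with depth, so a well-covered locus can demand dozens of reads per cluster while a poorly covered one still reports alleles.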
Increasing the min_cluster_size should screen out the outliers. Unfortunately, there is a hidden, hard-coded threshold in Straglr that essentially merges clusters that are <= 50 bp apart. I guess I set this back in the days when the reads were way more noisy and I was not confident in reporting alleles with that little size difference. An unrelated point: I don't see any GAAGGA in the actual motif (column 9, after the read name) reported, but GAAGGA is reported as the consensus motif in column 4. I'm puzzled. I thought this was a bug, but you actually said the larger allele is GAAGGA whereas the smaller (benign) allele is GAAGAA. Can you confirm this is true?
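For readers following along, the effect of such a merge threshold can be sketched like this. The code is an illustration, not Straglr's implementation: the thread only says clusters "<= 50 bp apart" are merged, so measuring the distance between cluster means, and the name `merge_close_clusters`, are assumptions.

```python
def merge_close_clusters(clusters, max_gap=50):
    """Merge size clusters (lists of allele sizes in bp) whose means lie
    within max_gap bp of the previous (possibly already merged) cluster."""

    def mean(values):
        return sum(values) / len(values)

    ordered = sorted((list(c) for c in clusters if c), key=mean)
    merged = []
    for cluster in ordered:
        if merged and mean(cluster) - mean(merged[-1]) <= max_gap:
            merged[-1].extend(cluster)  # too close: fold into previous cluster
        else:
            merged.append(cluster)
    return merged


# Two clusters about 33 bp apart plus one distant cluster (toy values).
clusters = [[450, 455, 460], [480, 490, 495], [900, 910]]
result = merge_close_clusters(clusters)
print(len(result), [len(c) for c in result])
```

Here the first two clusters collapse into one allele even though each is internally tight, which mirrors the situation reported above.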
Hi, I am not sure if I am oversimplifying things, but could the merging of clusters possibly depend on the variation between read lengths? Here, the variation within both clusters looks relatively low, and when taking a Z=3 threshold there is no overlap. With noisier data the overlap would become apparent, leading to the current situation of merging the clusters.
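The spread-based idea could look roughly like the sketch below: merge two clusters only if their mean ± Z·SD intervals overlap. The function name and the toy read lengths are made up for illustration, and the sample standard deviation needs at least two reads per cluster.

```python
from statistics import mean, stdev


def clusters_overlap(sizes_a, sizes_b, z=3.0):
    """Return True if the mean +/- z*SD intervals of two clusters overlap."""
    lo_a, hi_a = mean(sizes_a) - z * stdev(sizes_a), mean(sizes_a) + z * stdev(sizes_a)
    lo_b, hi_b = mean(sizes_b) - z * stdev(sizes_b), mean(sizes_b) + z * stdev(sizes_b)
    return lo_a <= hi_b and lo_b <= hi_a


# Tight clusters about 30 bp apart: intervals stay separate, so no merge.
print(clusters_overlap([450, 452, 455, 457], [480, 483, 486, 488]))

# Noisy clusters with similar means: intervals overlap, so merge.
print(clusters_overlap([430, 450, 470, 480], [460, 480, 500, 520]))
```

Compared with a fixed bp cutoff, this adapts to read noise, but it becomes unreliable when a cluster has only a handful of reads.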
In this context we may have another interesting case: a male patient with fragile X. Here, if the CGG repeat is between 44 and 200 RU there is a premutation, and above 200 it becomes a full mutation. In our case the patient has repeat lengths ranging from 147 to 397 repeat units, leading to an average of 223 RU, which is pathogenic. I have learnt from a specialist that such patterns of different lengths can occur between cells because of instability of the repeat. Even though in this case the result of the average cluster size is 'pathogenic', if all reads were e.g. 50 bp shorter, it would fall below the threshold. The range is shown in the vcf, so it can be picked up, but it may be an interesting use-case to consider.
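One way to surface such cases downstream is to classify the range bounds as well as the mean and flag alleles that straddle a threshold. The sketch below uses only the cutoffs quoted in this comment (44-200 RU premutation, above 200 RU full mutation); the function names and the label for sizes below 44 are assumptions.

```python
def classify_cgg(repeat_units):
    """Classify a CGG repeat count using the cutoffs quoted in this thread."""
    if repeat_units > 200:
        return "full mutation"
    if repeat_units >= 44:
        return "premutation"
    return "below premutation range"


def summarize_allele(mean_ru, min_ru, max_ru):
    """Classify mean and range bounds, and flag alleles spanning a cutoff."""
    calls = {
        "mean": classify_cgg(mean_ru),
        "min": classify_cgg(min_ru),
        "max": classify_cgg(max_ru),
    }
    calls["spans_threshold"] = len({calls["mean"], calls["min"], calls["max"]}) > 1
    return calls


# Values from the case described above: range 147-397 RU, mean 223 RU.
print(summarize_allele(223, 147, 397))
```

For this patient the mean and the upper bound fall in the full-mutation class while the lower bound is in the premutation class, so the allele would be flagged as spanning the cutoff.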
I have tried using the spread; it works for loci with nice depths in clusters (like your case) but not for others. And thanks for flagging the second case to me. I totally understand that the repeat sizes can be a smear, and clustering into 2 alleles in these cases can be totally arbitrary. The ranges in size or copy numbers for each allele are reported in the
I also adopted your idea of using a fractional number (between 0 and 1) for
Hi @readmanchiu. Thank you so much for these developments.
Thanks for reporting. This is a fairly complicated scenario, essentially with 2 motif classes: I would need to first segregate the reads by motif after motif detection, and then do size clustering within each individual motif subgroup.
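The two-step approach described here can be sketched as below. This is only an outline under assumptions: reads are given as hypothetical `(motif, size_bp)` pairs, and the simple gap-based 1-D clustering is a stand-in for whatever size clustering the tool actually performs.

```python
from collections import defaultdict


def group_by_motif(reads):
    """Split (motif, size_bp) reads into per-motif lists of sizes."""
    groups = defaultdict(list)
    for motif, size in reads:
        groups[motif].append(size)
    return dict(groups)


def cluster_sizes(sizes, max_gap=50):
    """Gap-based 1-D clustering: start a new cluster whenever the next
    sorted size is more than max_gap bp above the previous one."""
    clusters = []
    for size in sorted(sizes):
        if clusters and size - clusters[-1][-1] <= max_gap:
            clusters[-1].append(size)
        else:
            clusters.append([size])
    return clusters


def alleles_by_motif(reads, max_gap=50):
    """Cluster sizes separately within each motif subgroup."""
    return {m: cluster_sizes(s, max_gap) for m, s in group_by_motif(reads).items()}


reads = [("GAAGGA", 900), ("GAAGGA", 910), ("GAAGAA", 930), ("GAAGAA", 940)]
print(cluster_sizes([size for _, size in reads]))  # sizes alone: one cluster
print(alleles_by_motif(reads))                     # per motif: two alleles
```

Clustering on size alone merges the four toy reads into one allele, whereas segregating by motif first keeps the GAAGGA and GAAGAA alleles apart even though their sizes are similar.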
@readmanchiu is there anywhere I can privately share the cram with you?
If it's not too big, can you try email first? [email protected]
Dear @readmanchiu
I am currently using Straglr 1.5.0 and produced the following output for the FGF14 locus. If I am correct, there is a discrepancy between the sequence shown in the vcf and the AL/ALR and AC/ACR values. If my breakdown is correct, then:
GT = 0/1 (meaning that there is one REF allele and one ALT allele)
DP = 187
AL=33.6/995.8
ALR=13-44/78-1216
AC=11.2/331.9
ACR=4.3-14.7/26.0-405.3
AD=106/72
No ALT_MOTIF (./.)
If I look at the sequences, I see a reference sequence of 50 GAA repeat units and an alternative sequence of around 11 copies, matching the first allele. Combined with the 0/1 GT, this would make a sample with two repeats, one of 11 RU and one of 50 RU. However, I was expecting a sequence of around 332 RU, given the AC values.
EDIT: could it be because the ref length (50) falls within the range 26.0-405.3? If so, this is incorrect in my opinion. There are only a few reads supporting an intermediate-length allele with a length close to the reference length.
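As a quick sanity check on the breakdown above, the AL and AC values can be compared directly, assuming AL is the allele length in bp, AC the copy number, and the motif is the 3 bp GAA. The parsing helpers below are hypothetical and only handle the "a/b" and "lo-hi/lo-hi" shapes shown in this post.

```python
def parse_pair(value):
    """Parse 'a/b' into a list of floats, one per allele."""
    return [float(x) for x in value.split("/")]


def parse_range_pair(value):
    """Parse 'lo-hi/lo-hi' into a list of (lo, hi) float tuples."""
    return [tuple(float(x) for x in part.split("-")) for part in value.split("/")]


fields = {
    "AL": "33.6/995.8",
    "ALR": "13-44/78-1216",
    "AC": "11.2/331.9",
    "ACR": "4.3-14.7/26.0-405.3",
}
motif_len = 3  # GAA

al = parse_pair(fields["AL"])
ac = parse_pair(fields["AC"])
acr = parse_range_pair(fields["ACR"])

for length_bp, copies, (lo, hi) in zip(al, ac, acr):
    expected_bp = round(copies * motif_len, 1)
    print(f"AC={copies} -> ~{expected_bp} bp (AL={length_bp}), copy range {lo}-{hi}")
```

Both alleles are internally consistent (AC × 3 matches AL to within rounding), so a sequence of roughly 996 bp would be expected for the second allele rather than the ~11-copy sequence shown.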
A slice of the middle of the tsv file also shows the two 11.2 and 331.9 alleles:
And the bed results:
These are the 26.0 copy read details (not sure if they help):
In IGV it looks like this
The subsetted read is the same one as in the normal cram file, but extracted to better see the GAA/GGA pattern.
Can you help me on how to interpret these results?