Questions about the alignment of MashMap #12
Comments
|
Thanks for your reply! I'm sorry for the delayed response. As you recommended, I included several genomes from the same order as my target insect. Then I ran MashMap like this:
$path2mashmap -t 20 -r contaminants.and.neighbors.fa -q third_all.fasta -s 500 --pi 80 -o mashmap4.out
Here are the logs:
Start time is 2018/07/17--13:53
>>>>>>>>>>>>>>>>>>
Reference = [contaminants.and.neighbors.fa]
Query = [third_all.fasta]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap4.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads = 20
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 16054318849
INFO, skch::Sketch::index, unique minimizers = 770476138
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 142744242) ... (255123, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 4307 times during lookup.
INFO, skch::main, Time spent computing the reference index: 149116 sec
INFO, skch::Map::mapQuery, [count of mapped reads, reads qualified for mapping, total input reads] = [6214500, 6214500, 6633142]
INFO, skch::main, Time spent mapping the query : 825763 sec
INFO, skch::main, mapping results saved in : mashmap4.out
Finish time is 2018/07/28--20:54
Then I counted the number of mapped reads (good: mapped only to insects; bad: mapped only to contaminants; ambivalent: mapped to both insects and contaminants) and of unmapped reads. These are the numbers:
No. of total reads: 6633142
No. of reads in the mashmap.output: 6214500
No. of good: 853515
No. of bad: 138799
No. of ambivalent: 5222186
No. of reads not in the mashmap.output: 418642
So most reads were mapped to the reference, but the majority of them were mapped ambivalently, which makes it hard to extract the good ones. Here is an example of such a read:
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3500 4015 + NW_017852934.1 2683736 1681820 1682319 79.4204
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1000 1999 + NW_019280650.1 1003565 813077 813577 78.0766
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 2500 2999 - LJIG01019880.1 38067 34865 35364 79.3626
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 500 999 + NC_007418.3 31381287 24981081 24981580 79.5573
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3000 3499 - kraken:taxid|76857|NZ_CP022123.1 2521394 1537365 1537864 81.3159
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 0 499 - kraken:taxid|1202539|NC_018417.1 157543 40864 41363 79.4397
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1500 2499 + kraken:taxid|1936081|NZ_CP019389.1 3752836 1813582 1814081 76.7726
The first 4 lines are hits to insect sequences, and the last 3 lines are hits to contaminant sequences. As can be seen, in this case I don't know which alignment is more reliable and should be kept for downstream analysis. Thank you again for your quick reply! |
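A minimal sketch of how such a read could be summarized, so that the insect and contaminant hits can be compared side by side. It assumes the 10-column, space-separated layout of the example lines above (read, read length, query start, query end, strand, reference, reference length, reference start, reference end, identity) and, as a labeled assumption, that contaminant references are exactly those whose names start with `kraken:taxid|`. The function names are illustrative only and are not part of MashMap.

```python
from collections import defaultdict

# Assumption: contaminant references carry the "kraken:taxid|" prefix, as in
# the example lines above; every other reference is treated as insect.
CONTAMINANT_PREFIX = "kraken:taxid|"


def parse_mashmap(lines):
    """Yield (read, query_start, query_end, reference, identity) per mapping."""
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        yield fields[0], int(fields[2]), int(fields[3]), fields[5], float(fields[9])


def summarize(lines):
    """Per read and per source: mapped bases and length-weighted mean identity.

    Overlapping segments are counted once per mapping, so bases can exceed
    the read length.
    """
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))
    for read, start, end, ref, identity in parse_mashmap(lines):
        source = "contaminant" if ref.startswith(CONTAMINANT_PREFIX) else "insect"
        length = end - start + 1  # end coordinates look inclusive in the example
        totals[read][source][0] += length
        totals[read][source][1] += length * identity
    return {
        read: {src: (bases, weighted / bases) for src, (bases, weighted) in by_src.items()}
        for read, by_src in totals.items()
    }


def label(summary_for_read):
    """good = insect hits only, bad = contaminant hits only, ambivalent = both."""
    sources = set(summary_for_read)
    if sources == {"insect"}:
        return "good"
    if sources == {"contaminant"}:
        return "bad"
    return "ambivalent"
```

Calling `summarize(open("mashmap4.out"))` and then `label(...)` on each entry would give one label per read. The per-source base counts and mean identities do not by themselves say which hit is correct; they only make the comparison explicit.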
I still think many of the reads don't have a good (closely related) reference in the DB, which is causing false hits. I have a few ideas to avoid this, and plan to implement them soon. Meanwhile, did you try the second suggestion from my previous response? |
Thank you for your reply! Sorry, which suggestion do you mean? Including insect genomes in the reference list? I have already done that: I included 12 other insects from the same order as my target insects. By the way, I also mapped the same FASTA file to the same library using minimap2, and I found that the majority of reads were unmapped.
No. of reads in the SAM: 6633142
No. of mapped: 661646
No. of good: 490150
No. of bad: 125390
No. of ambivalent: 46106
No. of Unmapped: 5971496
I'm not an expert on "mapping" things, but I guess that may be the reason why MashMap cannot distinguish them (when using Jaccard similarity to assign the reads to the reference). |
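To illustrate why a Jaccard-based estimate can struggle near the 80% identity threshold used above, here is a small sketch. It assumes the Mash distance relation D = -(1/k) ln(2J / (1 + J)) with identity = 1 - D; whether MashMap applies exactly this formula internally is an assumption here and is not confirmed in this thread. The k-mer size of 16 is taken from the log above.

```python
import math


def jaccard_to_identity(jaccard, k):
    """Mash-style identity estimate from a Jaccard value for k-mers of size k."""
    if jaccard <= 0.0:
        return 0.0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, 1.0 - distance)


def identity_to_jaccard(identity, k):
    """Inverse of the relation above: expected Jaccard at a given identity."""
    x = math.exp(-k * (1.0 - identity))  # x equals 2J / (1 + J)
    return x / (2.0 - x)


if __name__ == "__main__":
    # Expected Jaccard for a few identity levels at k = 16.
    for pct in (95, 90, 85, 80):
        print(pct, round(identity_to_jaccard(pct / 100.0, 16), 4))
```

Under this relation, the expected Jaccard at 80% identity and k = 16 is only about 0.02, so a handful of shared minimizers in a 500 bp segment can be enough to reach the threshold. That would be consistent with many low-identity hits being spurious, although it does not prove it.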
These results are good to know; they suggest that the majority of "ambivalent" MashMap mappings are false positives. This is good feedback; I will soon add checks to avoid this. For your downstream analysis, I guess it's clear that you have roughly 100,000 contaminant reads.
|
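For reference, a rough sketch of how the minimap2 tallies quoted earlier in the thread (good, bad, ambivalent, unmapped) could be derived from a SAM file without extra libraries. It relies on the SAM specification's 0x4 FLAG bit for unmapped records and, as an assumption carried over from the reference names shown in this thread, treats references starting with `kraken:taxid|` as contaminants. The function name is illustrative only.

```python
def tally_sam(lines, contaminant_prefix="kraken:taxid|"):
    """Count reads as good / bad / ambivalent / unmapped from SAM text lines."""
    sources = {}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        read, flag, rname = fields[0], int(fields[1]), fields[2]
        hits = sources.setdefault(read, set())
        if flag & 0x4 or rname == "*":
            continue  # 0x4 is the SAM "segment unmapped" flag bit
        # Assumption: the prefix identifies contaminant references.
        hits.add("contaminant" if rname.startswith(contaminant_prefix) else "insect")

    counts = {"good": 0, "bad": 0, "ambivalent": 0, "unmapped": 0}
    for hits in sources.values():
        if not hits:
            counts["unmapped"] += 1
        elif hits == {"insect"}:
            counts["good"] += 1
        elif hits == {"contaminant"}:
            counts["bad"] += 1
        else:
            counts["ambivalent"] += 1
    return counts
```

Secondary and supplementary records are pooled with the primary one here, so a read counts as ambivalent as soon as any of its records hits the other source. Whether the counts in the thread were produced this way is not stated.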
I counted the number of reads in different categories again. Since I've used
Here are the results:
No. of total reads: 6633142
No. of reads in the mashmap.output: 6221577
  No. of above threshold: 5909169
    No. of good: 637490
    No. of bad: 62995
    No. of ambivalent: 5208684
  No. of below threshold: 312408
    No. of good: 222457
    No. of bad: 79505
    No. of ambivalent: 10446
No. of reads not in the mashmap.output: 418642
Most reads had alignments above the threshold, so using that as the threshold may not be appropriate in my case. Thank you for your help! |
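A sketch of one way an above/below-threshold split like the one above could be computed. The threshold value and the exact rule used in the comment are not stated in the thread, so both are assumptions here: `threshold` is a hypothetical parameter, and a read is counted as "above" when its best mapping identity (the last column of the MashMap output) reaches that value.

```python
def split_by_identity(lines, threshold):
    """Return (above, below) sets of read names based on best mapping identity."""
    best = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        read, identity = fields[0], float(fields[9])
        if identity > best.get(read, float("-inf")):
            best[read] = identity
    # Assumption: "above threshold" means best identity >= threshold.
    above = {read for read, identity in best.items() if identity >= threshold}
    below = set(best) - above
    return above, below
```

Each of the two sets could then be broken down into good, bad, and ambivalent reads in the same way as the overall counts.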
Hello,
Following #6, I checked the outputs of MashMap using blastn and have some questions about the alignments. I think I should open a new issue to understand the output of MashMap.
I wanted to remove possible contaminants from the PacBio data, which came from several insects (whole organisms were used to extract DNA, so there may be some bacterial DNA). I downloaded all the archaea, bacteria, fungi, protozoa, and viral sequences and merged them together as a contamination library. I also included mitochondrial sequences of the insect in the library.
Then I used MashMap with different parameters and got several outputs:
And the outputs from three runs varied:
As can be seen, nearly all the sequences were aligned to the contaminant library. That really shocked me.
Then I used blastn to check the 10 sequences with the highest identity and the 10 sequences with the lowest identity from the first run.
The highest ones were fine. There were some differences between the hits reported by blastn and MashMap, but maybe that's because they used different databases. But the lowest ones were problematic. Most of them returned 'No significant similarity found' when the default parameters of blastn were used. And when I unselected 'Low complexity regions', the alignments were unreliable. There may be something going on with low-complexity regions or repeats.
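One quick way to check the low-complexity suspicion on the mapped segments is a crude k-mer diversity score. The sketch below is a simple stand-in and is not the filter blastn applies; the function name, the default k of 3, and any cutoff chosen on top of the score are assumptions.

```python
def kmer_complexity(seq, k=3):
    """Fraction of distinct k-mers among those that could appear in seq.

    Values near 0 suggest homopolymers or short tandem repeats; values
    near 1 suggest a diverse sequence.
    """
    seq = seq.upper()
    windows = len(seq) - k + 1
    if windows <= 0:
        return 0.0
    distinct = {seq[i:i + k] for i in range(windows)}
    # At most 4**k distinct DNA k-mers exist, and at most `windows` can occur.
    return len(distinct) / min(windows, 4 ** k)


if __name__ == "__main__":
    for name, s in [("poly-A", "A" * 50), ("tandem repeat", "ACGT" * 10)]:
        print(name, round(kmer_complexity(s), 3))
```

Scoring the query segments behind the lowest-identity hits with something like this would show whether they are dominated by repeats, which might explain why blastn reports no significant similarity once its low-complexity filter is on.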
So my questions are:
Thank you! Sorry if I missed something.
Bests,
Yiwei Niu