Issue and fix for kreport2mpa.py script - domain misrepresented with "k" and taxon underscore flexibility #84

Closed
jmboccacino opened this issue Sep 15, 2023 · 2 comments

Comments

@jmboccacino
Contributor

Hello,

First of all, thank you very much for developing the KrakenTools scripts, they are very helpful.

My colleague @IanVermes and I noticed that kreport2mpa.py, when generating the MPA-style report, uses the prefix "k" instead of "d" for domains/superkingdoms. Please see the example below:

Expected output, generated by running kraken2 [...] --report path/to/output --use-mpa-style:

d__Eukaryota	20339891
d__Eukaryota|k__Metazoa	19889334
d__Eukaryota|k__Metazoa|p__Chordata	19889334
d__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia	19889334
d__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates	19889334
d__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates|f__Hominidae	19889334
d__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates|f__Hominidae|g__Homo	19889334
d__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates|f__Hominidae|g__Homo|s__Homo sapiens	19889334

Actual output, generated by running kreport2mpa.py:

k__Eukaryota	20339891
k__Eukaryota|k__Metazoa	19889334
k__Eukaryota|k__Metazoa|p__Chordata	19889334
k__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia	19889334
k__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates	19889334
k__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates|f__Hominidae	19889334
k__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates|f__Hominidae|g__Homo	19889334
k__Eukaryota|k__Metazoa|p__Chordata|c__Mammalia|o__Primates|f__Hominidae|g__Homo|s__Homo_sapiens	19889334

In our fork of your repository, we made a small amendment to fix that issue.
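
For readers who want to see the idea, here is a minimal sketch of how a rank code could be mapped to its MPA prefix so that domains get "d" rather than "k". It is not the code from the fork or from kreport2mpa.py; the names `RANK_PREFIX`, `mpa_label` and `mpa_path` are hypothetical, and skipping intermediate ranks such as D1 or K2 is an assumption based on the expected output above.

```python
# Hypothetical mapping from Kraken2 main rank codes to MPA-style prefixes.
# "D" (domain/superkingdom) maps to "d", not "k".
RANK_PREFIX = {
    "D": "d", "K": "k", "P": "p", "C": "c",
    "O": "o", "F": "f", "G": "g", "S": "s",
}


def mpa_label(rank_code, name):
    """Return e.g. 'd__Eukaryota', or None for codes not in RANK_PREFIX (D1, K2, ...)."""
    prefix = RANK_PREFIX.get(rank_code)
    if prefix is None:
        return None
    return f"{prefix}__{name}"


def mpa_path(lineage):
    """Join the labels of a lineage given as (rank_code, name) pairs, root first."""
    labels = (mpa_label(rank, name) for rank, name in lineage)
    return "|".join(label for label in labels if label is not None)


lineage = [("D", "Eukaryota"), ("D1", "Opisthokonta"), ("K", "Metazoa")]
print(mpa_path(lineage))
```

With this mapping, the lineage above is rendered with a leading `d__` for Eukaryota and the intermediate rank is left out.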

We also noticed that kreport2mpa.py introduces underscores in the taxon names, which can be seen in the actual output demonstrated above. This could become an issue if the user wanted to map the taxon names from the MPA-style output back to the ones in the Kraken2 report file, as the Kraken2 report file does not have underscores in the taxon names - please see the example below:

Kraken2 report, generated by running kraken2 [...] --report path/to/output --report-minimizer-data:

31.25	20339891	438892	231217348	8316675	D	2759	    Eukaryota
 30.58	19898435	9097	226780344	8248218	D1	33154	      Opisthokonta
 30.56	19889334	0	226686519	8248218	K	33208	        Metazoa
 30.56	19889334	0	226686519	8248218	K1	6072	          Eumetazoa
 30.56	19889334	0	226686519	8248218	K2	33213	            Bilateria
 30.56	19889334	0	226686519	8248218	K3	33511	              Deuterostomia
 30.56	19889334	0	226686519	8248218	P	7711	                Chordata
 30.56	19889334	0	226686519	8248218	P1	89593	                  Craniata
 30.56	19889334	0	226686519	8248218	P2	7742	                    Vertebrata
 30.56	19889334	0	226686519	8248218	P3	7776	                      Gnathostomata
 30.56	19889334	0	226686519	8248218	P4	117570	                        Teleostomi
 30.56	19889334	0	226686519	8248218	P5	117571	                          Euteleostomi
 30.56	19889334	0	226686519	8248218	P6	8287	                            Sarcopterygii
 30.56	19889334	0	226686519	8248218	P7	1338369	                              Dipnotetrapodomorpha
 30.56	19889334	0	226686519	8248218	P8	32523	                                Tetrapoda
 30.56	19889334	0	226686519	8248218	P9	32524	                                  Amniota
 30.56	19889334	0	226686519	8248218	C	40674	                                    Mammalia
 30.56	19889334	0	226686519	8248218	C1	32525	                                      Theria
 30.56	19889334	0	226686519	8248218	C2	9347	                                        Eutheria
 30.56	19889334	0	226686519	8248218	C3	1437010	                                          Boreoeutheria
 30.56	19889334	0	226686519	8248218	C4	314146	                                            Euarchontoglires
 30.56	19889334	0	226686519	8248218	O	9443	                                              Primates
 30.56	19889334	0	226686519	8248218	O1	376913	                                                Haplorrhini
 30.56	19889334	0	226686519	8248218	O2	314293	                                                  Simiiformes
 30.56	19889334	0	226686519	8248218	O3	9526	                                                    Catarrhini
 30.56	19889334	0	226686519	8248218	O4	314295	                                                      Hominoidea
 30.56	19889334	0	226686519	8248218	F	9604	                                                        Hominidae
 30.56	19889334	0	226686519	8248218	F1	207598	                                                          Homininae
 30.56	19889334	0	226686519	8248218	G	9605	                                                            Homo
 30.56	19889334	19889334	226686519	8248218	S	9606	                                                              Homo sapiens

We therefore edited the code so that the user can decide whether they would like the spaces to be replaced with underscores or not, adding arguments to remove (--remove-spaces) or keep (--keep-spaces) the spaces in each taxon's name.
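
As a rough illustration of how such a pair of options might be wired up with `argparse`, here is a small sketch. Only the flag names `--remove-spaces` and `--keep-spaces` come from the description above; `build_parser`, `format_name`, the shared `remove_spaces` destination and the choice of underscores as the default are assumptions, not the fork's actual code.

```python
import argparse


def build_parser():
    """Build a parser with two mutually exclusive flags controlling spaces."""
    parser = argparse.ArgumentParser(description="Sketch of space-handling options")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--remove-spaces", dest="remove_spaces", action="store_true",
                       help="replace spaces in taxon names with underscores")
    group.add_argument("--keep-spaces", dest="remove_spaces", action="store_false",
                       help="leave spaces in taxon names untouched")
    # Assumption: underscores stay the default, matching the behaviour shown above.
    parser.set_defaults(remove_spaces=True)
    return parser


def format_name(name, remove_spaces):
    """Strip the report's indentation, then optionally replace spaces."""
    name = name.strip()
    return name.replace(" ", "_") if remove_spaces else name


args = build_parser().parse_args(["--keep-spaces"])
print(format_name("      Homo sapiens", args.remove_spaces))
```

Putting both flags in one mutually exclusive group means passing them together is rejected by `argparse` rather than silently resolved.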

Kind regards,
Jacqueline and @IanVermes

@Sabrin2020

Sabrin2020 commented Sep 18, 2023

Hi @jacqueline, sorry, my question is unrelated, but could you help me with issue #83?

@jenniferlu717
Owner

Thank you!
