Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seqtk telo -m works partially with 8mer wasp telomere #201

Open
jbh-cas opened this issue Jun 9, 2023 · 5 comments
Open

seqtk telo -m works partially with 8mer wasp telomere #201

jbh-cas opened this issue Jun 9, 2023 · 5 comments

Comments

@jbh-cas
Copy link

jbh-cas commented Jun 9, 2023

Thank you for adding the telo command, it is much faster than the grep_telomere script I had put together.

I was hoping to use this with a wasp assembly to see its telomeres. As this pub mentions https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8977481 wasp telomeres can be from 5 to 9mers. Our assembly contains the 8mer TTATTGGG. Setting this with -m does not find any telos, whereas using the revcomp -m CCCAATAA finds this set but not the TTATTGGG at the bottom of the test file. I suspect it is the 6mer hash data structure. Thought I would mention it, though not sure how much effort you think is worth extending the -m usage scenarios.

best,
Jim Henderson

Test text I have been using:

>Chr1_chal_sis
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
CAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACC
ATAACCGCCCTTCCTTATGGTATAACCAACCCCACTATAATATAACCACTCCCACTTACGGCTTGACCACTCCCACTTACAACTTAACCACTCCAACTTGCGGTTTAACAACCCCCCTGG
CGTTTAAATACCCCCACATACGATTTAACCACCACCCTATACAGTTTAACTACCCTACTTGCGGTTTAACAATCCCCCTGGCGGTAAACCACTCCCCACTTATGGTTTAACCACTCCCAC
TTTACGGTTTGACAAGCCTCATATTTAGGTTTAACCACCCTACTTTAAGGTTTAACCAACCCCCTTTTATAGTATAACTATCCTCACTTATACTATGACAACCCCGCTCATAGTATAACT
ATCCCAACCTAACACATAACCACCCGTGCTTACGATATAACCCATCCTACTTACACTATAACCACCTACTTACAGTGTAACTACCCCAAATTATGATATAACCATCCCCACTTACAGTAC
AACCCCCCTAAATCATAACAAAACCACTACAACTTTACGGTTTAACCACCCCAGCTCATGATTTAACTTACAACATAACCATCCACCATTGTATAACCATCCCCATTTAATACACAACTA
CCCTCGCTTACAGTATAACCTACACAATTACGGTACAACCACCCTACTCACAATATAACCACCCCTACTTACGGTATAACAATCCGTAACATACCCCCGTCGGGTATAATTATTCCGTTA
CGGCAATAACCCACCAGTTACAGCATAAACCCTAATTACGGCAATGCCTCCAGTACGATATAACCGCAATACGGTATAACCCCAGTAAGGTATTACCTCAATACAGCACACCCTTGGTAC
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
CGTTTAAATACCCCCACATACGATTTAACCACCACCCTATACAGTTTAACTACCCTACTTGCGGTTTAACAATCCCCCTGGCGGTAAACCACTCCCCACTTATGGTTTAACCACTCCCAC
@lh3
Copy link
Owner

lh3 commented Jun 9, 2023

At least for human, (TTAGGG)n only occurs at the 3'-end of a chromosome. I assume Chr1_chal_sis is very long and you are only showing part of the sequence. If this is the case, seqtk telo wouldn't consider (TTATTGGG)n as part of a telomere because it is on the 5'-end.

@lh3
Copy link
Owner

lh3 commented Jun 9, 2023

Just checked the paper. They were using short reads. The inverted motif could be an assembly artifact. It would be more convincing if we have HiFi reads.

@jbh-cas
Copy link
Author

jbh-cas commented Jun 9, 2023

We do have a HiFi assembly made with hifiasm using PacBio long reads. And, as you say, the test text was a very truncated view of the Chr, thanks for pointing out the folly of using it as a reliable test.

I found the 8mer from the short read paper but confirmed its existence in our HiFi assembly using grep for the 8mer.
I'll go back to that and get additional info.

Thank you.

@lh3
Copy link
Owner

lh3 commented Jun 9, 2023

its existence in our HiFi assembly using grep for the 8mer.

Where are TTATTGGG 8-mers? Are they located on the 3'-ends of contigs, or following (CCCAATAA)n on the 5'-end like the example you have shown?

@jbh-cas
Copy link
Author

jbh-cas commented Jun 9, 2023

At the bottom or very near bottom.

I made a 8Mbp test file named tst2 with the CCCAATAA at top TTATTGGG at bottom and revcomp version tst2_rc to flip them and these were the results:

$ seqtk telo -m CCCAATAA tst2
Chr1r_chal_sis	0	960	8403000
960	8403000
$ seqtk telo -m CCCAATAA tst2_rc
Chr1r_chal_sis	8402040	8403000	8403000
960	8403000
$ seqtk telo -m TTATTGGG tst2
0	8403000
$ seqtk telo -m TTATTGGG tst2_rc
0	8403000

$ grep TTATTGGG tst2_rc -n 
70019:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70020:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70021:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70022:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70023:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70024:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70025:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG
70026:GGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTGGGTTATTG

I used seqtk to fold at 120 chars for the file, so first TTATTGGG hit is at 8,402,280 bases in.

I'm sure you have more important things, I just thought having tried in this non-standard case that I would pass it along. I can always use a squishy regex with grep to find as I have been doing with the vertebrate 6mer.

As an aside, I do hope that telomere awareness can be used in assemblers and scaffolders as an option. We run a telomere script after every assembly or use of yahs and show the records and their telos as TOP, TOP_NEAR, BOTTOM, BOTTOM_NEAR or MIDDLE.

Thanks for taking the time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants