make_Lastz on Cactus-447-mammalian-genome dataset #69

Open
KabitaBaral1 opened this issue Nov 4, 2024 · 3 comments

Comments

@KabitaBaral1

Hi,
I have a question about running LASTZ the way it was done for the TOGA paper. I have the Cactus 447-mammal genome alignment. I converted it from HAL to FASTA and removed the ancestral sequences, so I now have two FASTA files from that dataset: one with just the human genome sequence and another with the remaining 446 mammalian genomes combined in a single file. Can I run make_lastz_chains on that combined query FASTA file? Thank you.

@MichaelHiller
Collaborator

Good question. I don't think there is any point in extracting the genomic FASTA sequences from the Cactus alignment and then aligning them to human again to get chains. If you want to do that, you can also just start with the full genomes of these species.

But I guess the best approach would be to extract pairwise alignments (in chain format) from the Cactus alignment.
This should hopefully be possible, but questions about how to do it should please be directed to Benedict Paten and the Cactus developers.

@KabitaBaral1
Author

Hi Michael,
Thank you for getting back to me. I have a couple of follow-up questions.
I am trying to run LASTZ and then TOGA to get the coordinates of protein-coding regions for all 447 mammalian genomes in the Cactus dataset.
I thought that, as in your TOGA paper, the approach would be to run LASTZ and then TOGA on the dataset.
Is there a better way to do this? Or an alternative?
"If you want to do that, you can also start with the full genomes of these species." Could you please elaborate on this?
Thank you

@MichaelHiller
Collaborator

Hi,

The coordinates of all orthologs that TOGA found are in the BED or GTF files we provided. If this is what you need, you don't have to run anything.

If you have new genomes, then the easiest approach is to align them to a reference using our lastz/chain pipeline and then run TOGA.

Hope this helps
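
For anyone who does go the lastz/chain route starting from a combined multi-genome FASTA like the one described in the original question: the pipeline aligns one query genome to one reference per run, so the combined file would first need to be split into one FASTA per genome. Below is a minimal sketch of that step, not part of make_lastz_chains or TOGA. It assumes every header has the form `>Genome.sequence` (genome name, a dot, then the sequence name); that naming scheme and the function name `split_fasta_by_genome` are hypothetical, so adjust the separator to however your headers were written.

```python
from pathlib import Path


def split_fasta_by_genome(combined_fasta, out_dir, sep="."):
    """Split a multi-genome FASTA into one FASTA file per genome.

    ASSUMPTION: each header looks like '>Genome<sep>sequence ...', where the
    text before the first separator is the genome name. This is a hypothetical
    convention; check your own headers before relying on it.
    Returns the sorted list of genome names that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}    # genome name -> open output file
    current = None  # output file for the record being read
    try:
        with open(combined_fasta) as fh:
            for line in fh:
                if line.startswith(">"):
                    # Keep only the first whitespace-delimited token of the header.
                    name = line[1:].split()[0]
                    genome, _, seq_name = name.partition(sep)
                    if not seq_name:
                        raise ValueError(f"header without genome prefix: {name}")
                    if genome not in handles:
                        handles[genome] = open(out_dir / f"{genome}.fa", "w")
                    current = handles[genome]
                    # Write the header without the genome prefix.
                    current.write(f">{seq_name}\n")
                elif current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Each resulting `<genome>.fa` file could then be used as the query in a separate pipeline run against the human reference. The sketch keeps one output file open per genome, which is fine for a few hundred genomes but may hit the open-file limit on some systems for much larger sets.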
