cgMLST schemes? #11
Comments
If you take a look at the indexer docs, they should look very familiar since you're a stringMLST user. If you have a scheme you've built with stringMLST, you can use the same input files for STing. The docs and examples focus around the 7 locus schemes since they tend to be easier to work with and faster to build. Did you have a particular scheme in mind? |
Thanks for the rapid response. The Neisseria cgMLST v1.0 cgc400 scheme on pubMLST (Harrison et al.) is what I am trying to work with. One problem I am having is downloading all of the allele sequences: pubMLST seems to limit the number I can download at once, and the scheme has over 1.6K loci. https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=batchSequenceQuery |
It looks like the download links are in the format:
    https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=downloadAlleles&locus=
So let's try something a little brutish to get everything. First, head over to the export tab and just export the profile table for the STs (saved as profiles.txt):
    https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=job&id=BIGSdb_043821_1608874811_82059
Extract the header and clean it up:
    head -n 1 profiles.txt | tr '\t' '\n' | tail -n +2 | head
Hopefully you see something like:
Download the alleles:
    head -n 1 profiles.txt | tr '\t' '\n' | tail -n +2 | xargs -I nmb wget "https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=downloadAlleles&locus=nmb" -O nmb.fasta
That should get you all the files you need. I'm on my phone and haven't tested it though. Good luck! |
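For a scheme with 1,600+ loci, the download step above could also be written as a small script, which makes it easier to pause between requests and to see which loci failed. This is only a sketch under a few assumptions: the function names (`list_loci`, `locus_url`, `download_alleles`), the `alleles` output directory, and the one-second pause are made up for illustration, and it assumes every header column after the first is a locus name. If the exported table has trailing non-locus columns, they would need to be filtered out first.

```shell
# Sketch: download one FASTA file per locus named in a PubMLST profile table.
# Assumption: column 1 of the header is the ST and all other columns are loci.

BASE_URL='https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=downloadAlleles&locus='

# Print the locus names from the header row, one per line (skips column 1).
list_loci() {
    head -n 1 "$1" | tr -d '\r' | tr '\t' '\n' | tail -n +2
}

# Build the download URL for a single locus.
locus_url() {
    printf '%s%s\n' "$BASE_URL" "$1"
}

# Download every locus into an output directory (default: alleles),
# reporting failures on stderr and pausing between requests.
download_alleles() {
    local profiles=$1 outdir=${2:-alleles} locus
    mkdir -p "$outdir"
    list_loci "$profiles" | while IFS= read -r locus; do
        [ -n "$locus" ] || continue
        wget -q "$(locus_url "$locus")" -O "$outdir/$locus.fasta" \
            || echo "failed: $locus" >&2
        sleep 1
    done
}
```

After sourcing the file, `download_alleles profiles.txt` would fetch everything into `alleles/`, and any locus printed as `failed:` can be retried by hand.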
Oh, and merry Christmas 🙂 |
I figured there was some bash script I could use to pull the allele sequences out! Thanks so much I will test this later today. And a very blessed and Merry Christmas to you too!!! |
Hello again! That worked. However I'm having issues with creating the database. First I tried your demo but I get an error message:
I also tried building my own database file and I get a message stating that I'm missing the loci section. I'm not sure why that is as I followed your example database file format. I've uploaded my file. Thanks! |
The first set of errors looks like you're running python on a system with old/outdated (or completely missing) SSL certificates or that's sitting behind an SSL-stripping middleware. Running This config.txt works for me. |
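With this many loci, writing the config by hand is error-prone, so generating it may help. The sketch below assumes the stringMLST-style layout mentioned earlier in this thread: a `[loci]` section with one `name<TAB>path` line per locus, followed by a `[profile]` section. The function name `write_config`, the `.fasta` extension, and the `profile` key on the last line are assumptions for illustration and should be checked against the indexer docs and the example config.

```shell
# Sketch: emit a config with a [loci] section from a directory of FASTA files.
# Assumptions: files are named <locus>.fasta, fields are separated by a real
# tab character, and the profile line uses the key "profile".
write_config() {
    local fasta_dir=$1 profile=$2 f
    printf '[loci]\n'
    for f in "$fasta_dir"/*.fasta; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if the directory is empty
        printf '%s\t%s\n' "$(basename "$f" .fasta)" "$f"
    done
    printf '[profile]\n'
    printf 'profile\t%s\n' "$profile"
}
```

For example, `write_config alleles profiles.txt > config.txt` would list every FASTA file under `alleles/`. Using `printf` with an explicit `\t` avoids an editor silently turning tabs into spaces.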
Okay strangely...my tab delimited file does not seem to work...but it does work when I delete everything below [loci] in your file and then paste the relative paths to the fasta files in my directories. So I'm almost there. However now I get this error:
If you'd prefer I open a new issue please let me know. Thanks! |
That's just your machine running out of allocatable memory |
Hmmm...okay let me try submitting this as a job. Thanks! |
After submitting this as a job on my HPC, it's still creating a database for the 1605 loci in the cgMLST scheme I am working on. So far over 500K database files have been created and it is still not done. I've just left it running, but I was curious whether this is normal? |
That sounds about right for a cg |
Happy New Year! Okay, I'm almost there. Question: I have my database in a folder called "Ng" containing 24K files for 4 alleles. The files look like this:
I just want to verify I am running this correctly, particularly the database prefix portion:
I should mention that even with these 4 alleles typer seems to be taking a very long time even on the compute cluster. Thanks again! |
Did you build the full profile and did it finish running? A ~1600 loci scheme shouldn't take more than 40-50 minutes on a 60-70x coverage sequencing run. Admittedly the machine we run things on has a fair amount of memory, but since you could build the database (which takes more memory), you shouldn't have any issues (it'll take ~14GB). If your reads are a mixture of more than one species, though, (1) you shouldn't run them in STing, since it is designed for single-sample runs, and (2) the allele selection and refinement step will take a long time. |
I did build the full 1600 loci scheme. Admittedly, I do have 200+ WGS samples. If it takes an hour per sample that's essentially a week and a half of processing. I'll give it a try again. |
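The runtime estimate above is simple arithmetic: samples times minutes per sample, divided by 60, divided by the number of jobs running at once. A rough sketch follows; `estimate_hours` is a made-up helper, and it assumes every sample takes about the same time and that parallel jobs do not slow each other down.

```shell
# Sketch: wall-clock hours = samples * minutes_per_sample / 60 / parallel_jobs
# The third argument (parallel jobs) defaults to 1.
estimate_hours() {
    awk -v n="$1" -v m="$2" -v j="${3:-1}" \
        'BEGIN { printf "%.1f\n", n * m / 60 / j }'
}

estimate_hours 200 60      # serial, one hour per sample: 200.0 hours (about 8 days)
estimate_hours 200 45 10   # ten cluster jobs at 45 minutes each: 15.0 hours
```

On a cluster, submitting one job per sample is what turns the serial estimate into something closer to the second figure.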
@ar0ch Okay finally ran my samples and it was relatively fast to run all 200. I did notice the profiler didn't predict any STs despite identifying the majority of alleles for each of the 1600+ loci. Happy to share the data. Can I email you? Thanks! |
Sure -- mail [@] aroonchande.com |
I messaged you. Did it go through? The first attempt bounced back. Thanks! |
Got it. I'll try and take a look this evening |
(I haven't forgotten this issue, just busy with other things) |
Could you explain a little bit how to download a cgMLST scheme for Neisseria spp. for use with this tool? I'm interested in doing this for a few samples, but the documentation seems relevant only to traditional 7-locus MLST. As a long-time stringMLST user, thanks for this tool!