Guidelines for making your own database #70

Open
TreeT2 opened this issue Dec 8, 2017 · 5 comments

Comments

@TreeT2

TreeT2 commented Dec 8, 2017

Just wondering if you have any tips for making your own database. Specifically, can you use a multi-FASTA file containing different species and strains to make the database, or are you better off using individual FASTA files?
Thanks

@ondovb
Member

ondovb commented Dec 8, 2017

This depends on the input data - if your genomes are single contigs (like polished bacterial or viral), then a multifasta with -i will work. However, if they are assemblies or have multiple chromosomes, there is no way for Mash to distinguish the genomes once they are concatenated, so they really need to be kept in separate files if you want a sketch for each genome.
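As a rough illustration of the two cases described above (an editorial sketch, not the maintainers' own commands): the function names and file names are hypothetical, while `-i` (sketch individual sequences) and `-o` (output prefix) are `mash sketch` options.

```shell
# Case 1: every sequence in the multi-FASTA is one complete genome.
# -i makes Mash produce one sketch per sequence instead of one per file.
# $1: multi-FASTA file, $2: output prefix (Mash writes <prefix>.msh)
sketch_per_sequence() {
    mash sketch -i -o "$2" "$1"
}

# Case 2: genomes span several contigs or chromosomes.
# Keep one FASTA file per genome; Mash then makes one sketch per file.
# $1: output prefix, remaining arguments: one FASTA file per genome
sketch_per_file() {
    out=$1
    shift
    mash sketch -o "$out" "$@"
}
```

For example, `sketch_per_sequence viruses.fasta viral_db` or `sketch_per_file bacteria_db genomes/*.fna`, where both the inputs and the prefixes are placeholders.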

As far as general guidelines, this is definitely something we should add to the tutorials but haven't gotten to yet. To summarize our RefSeq sketching strategy, we mirror the NCBI genomes/ directory, use find to crawl it and create a flat directory of symlinks, and pass these to mash as a FOFN (file of file names) with -l.
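The strategy above might look roughly like the following. This is an editorial sketch under assumptions, not the maintainers' actual script: the function names, directory layout, and the `*_genomic.fna.gz` file pattern are assumed, and only `find`, `ln`, and the `mash sketch` options `-l` and `-o` are taken as given.

```shell
# Build a flat directory of symlinks from a mirrored genomes tree, then
# write a file of file names (FOFN) listing them.
# $1: mirror root (e.g. a local copy of genomes/), $2: flat link directory,
# $3: path of the FOFN to write
make_fofn() {
    mkdir -p "$2"
    # Link each genomic FASTA; skip the cds/rna "*_from_genomic" files that
    # sit next to it. Absolute targets keep the links valid from any directory.
    find "$(cd "$1" && pwd)" -type f -name '*_genomic.fna.gz' \
        ! -name '*_from_genomic.fna.gz' -exec ln -sf {} "$2"/ \;
    # One path per line, sorted for reproducibility.
    find "$2" -name '*.fna.gz' | sort > "$3"
}

# Sketch every file listed in the FOFN into one combined <prefix>.msh.
# $1: FOFN, $2: output prefix
sketch_from_fofn() {
    mash sketch -l "$1" -o "$2"
}
```

A possible invocation would be `make_fofn ./genomes ./flat genomes.fofn` followed by `sketch_from_fofn genomes.fofn refseq_custom`; all four names are placeholders.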

@alienzj

alienzj commented May 14, 2019

Hello, everyone~

When I run:

mash info RefSeq88n.msh | head -n 50

I get (partial output):

Header:
Hash function (seed):          MurmurHash3_x64_128 (0)
K-mer size:                    21 (64-bit hashes)
Alphabet:                      ACGT (canonical)
Target min-hashes per sketch:  1000
Sketches:                      127219
Sketches:
[Hashes]  [Length]    [ID]                                                                                                                         [Comment]
1000      143726002   GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz                                                                        [1870 seqs] NC_004354.4 Drosophila melanogaster chromosome X [...]
1000      3241953429  GCF_000001405.36_GRCh38.p10_genomic.fna.gz                                                                                   [557 seqs] NC_000001.11 Homo sapiens chromosome 1, GRCh38.p7 Primary Assembly [...]
1000      3257319537  GCF_000001405.38_GRCh38.p12_genomic.fna.gz                                                                                   [594 seqs] NC_000001.11 Homo sapiens chromosome 1, GRCh38.p12 Primary Assembly [...]
1000      3231170666  GCF_000001515.7_Pan_tro_3.0_genomic.fna.gz                                                                                   [44449 seqs] NC_006468.4 Pan troglodytes isolate Yerkes chimp pedigree #C0471 (Clint) chromosome 1, Pan_tro 3.0, whole genome shotgun sequence [...]

So, I want to build an up-to-date RefSeq bacterial Mash database from the FTP site below:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*.genomics.fna.gz

But I can't find per-genome files like GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz there.
Do I need to split the RefSeq release bacterial genomic files into one file per genome?

Thanks~

@alienzj

alienzj commented May 14, 2019

I think I should use ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/
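One possible way to get one file per genome from that directory is sketched below. It rests on assumptions not stated in this thread: that the directory provides a tab-separated assembly_summary.txt whose header names an `ftp_path` column, and that each assembly directory contains a file named `<directory name>_genomic.fna.gz`. The function name is hypothetical.

```shell
# Print one genomic FASTA URL per assembly listed in an assembly summary file.
# $1: path to a downloaded assembly_summary.txt (assumed layout, see above)
genomic_fasta_urls() {
    awk -F '\t' '
        /^#/ {
            # Comment/header lines: remember which column is named ftp_path.
            for (i = 1; i <= NF; i++) if ($i == "ftp_path") col = i
            next
        }
        col && $col != "" && $col != "na" {
            # The file name repeats the last component of the directory path.
            n = split($col, parts, "/")
            print $col "/" parts[n] "_genomic.fna.gz"
        }
    ' "$1"
}
```

The resulting list could be saved with `genomic_fasta_urls assembly_summary.txt > urls.txt` and fetched with a downloader such as `wget -i urls.txt`, leaving one FASTA file per genome to sketch.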

@MarionHu

MarionHu commented Jun 4, 2020

> This depends on the input data - if your genomes are single contigs (like polished bacterial or viral), then a multifasta with -i will work. However, if they are assemblies or have multiple chromosomes, there is no way for Mash to distinguish the genomes once they are concatenated, so they really need to be kept in separate files if you want a sketch for each genome.
>
> As far as general guidelines, this is definitely something we should add to the tutorials but haven't gotten to yet. To summarize our RefSeq sketching strategy, we mirror the NCBI genomes/ directory, use find to crawl it and create a flat directory of symlinks, and pass these to mash as a FOFN with -l.

Hi,
Does this mean that the pre-sketched RefSeq database we can download from the tutorial page is updated regularly? Could you specify which release of the NCBI database it corresponds to?

@dominikjasiczek

> > This depends on the input data - if your genomes are single contigs (like polished bacterial or viral), then a multifasta with -i will work. However, if they are assemblies or have multiple chromosomes, there is no way for Mash to distinguish the genomes once they are concatenated, so they really need to be kept in separate files if you want a sketch for each genome.
> >
> > As far as general guidelines, this is definitely something we should add to the tutorials but haven't gotten to yet. To summarize our RefSeq sketching strategy, we mirror the NCBI genomes/ directory, use find to crawl it and create a flat directory of symlinks, and pass these to mash as a FOFN with -l.
>
> Hi,
> Does this mean that the pre-sketched RefSeq we can download from the tutorial page are updated regularly? Could you specify to which release of the ncbi database it corresponds?

Hi,
I would also like to know whether the pre-sketched RefSeq database is updated. If not, as a beginner in bioinformatics, what would you suggest I do to update it myself?
When I used Mash in my project, it seemed that more recent entries are not included in the pre-sketched RefSeq database. Thank you in advance for the help!


5 participants