Guidelines for making your own database #70

TreeT2 · 2017-12-08T12:36:40Z

Just wondering if you have any tips for making your own database. Specifically, can you use a multifasta file containing different species and strains to make the database, or are you better off using individual FASTA files?
Thanks

ondovb · 2017-12-08T19:43:17Z

This depends on the input data. If each of your genomes is a single contig (like a polished bacterial or viral genome), then a multifasta with -i will work, since -i makes mash sketch each sequence individually rather than the file as a whole. However, if they are draft assemblies or have multiple chromosomes, there is no way for Mash to tell the genomes apart once they are concatenated, so they really need to be kept in separate files if you want one sketch per genome.
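A minimal shell sketch of the two cases described above. The file names and output prefixes are placeholders, and count_records is a hypothetical helper (not part of Mash) for checking how many sequences a file holds; -i and -o are mash sketch options.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mash): count the FASTA records in a file.
# If a single genome's file reports more than one record, that genome is
# multi-contig and should be sketched as a whole file, not with -i.
count_records() {
    grep -c '^>' "$1" || true
}

# Case 1: one multifasta in which every record is a complete genome.
# -i sketches each sequence individually, giving one sketch per record.
sketch_multifasta() {
    # $1 = multifasta file, $2 = output prefix (mash appends .msh)
    mash sketch -i -o "$2" "$1"
}

# Case 2: multi-contig assemblies, one file per genome.
# Without -i, all sequences in a file are combined into one sketch per file.
sketch_separate_files() {
    # $1 = output prefix, remaining arguments = one FASTA file per genome
    out=$1
    shift
    mash sketch -o "$out" "$@"
}

# Example calls (placeholder file names):
#   sketch_multifasta viruses.fasta viruses_db
#   sketch_separate_files assemblies_db genomes/*.fna
```

Running count_records on a per-genome file before choosing between the two cases is only a sanity check; it cannot tell where one genome ends and the next begins in a concatenated file.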

As for general guidelines, this is definitely something we should add to the tutorials but haven't gotten to yet. To summarize our RefSeq sketching strategy: we mirror the NCBI genomes/ directory, use find to crawl it and create a flat directory of symlinks, and pass these to mash as a FOFN (file of file names) with -l.
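The strategy above could be sketched roughly as follows. This is not the authors' actual script: the function names, directory names, and the *_genomic.fna.gz file pattern are assumptions (the pattern is taken from the IDs that mash info shows for the RefSeq sketches), and the mirroring step itself is left out.

```shell
#!/bin/sh
# Crawl a mirrored genomes/ tree, symlink every genome FASTA into one flat
# directory, and write the link paths to a file of file names (FOFN).
# Usage: build_fofn MIRROR_DIR LINK_DIR FOFN_PATH   (all placeholders)
build_fofn() {
    mirror=$1
    links=$2
    fofn=$3
    mkdir -p "$links"
    # Resolve to absolute paths so the symlinks stay valid from anywhere.
    mirror=$(cd "$mirror" && pwd)
    links=$(cd "$links" && pwd)
    # Skip the CDS/RNA files, whose names also end in _genomic.fna.gz.
    find "$mirror" -type f -name '*_genomic.fna.gz' \
        ! -name '*_from_genomic.fna.gz' | sort |
    while IFS= read -r path; do
        ln -sf "$path" "$links/$(basename "$path")"
    done
    # One path per line, which is the input format that -l expects.
    find "$links" -type l -name '*_genomic.fna.gz' | sort > "$fofn"
}

# Sketch every file listed in the FOFN into a single .msh database.
# -l tells mash that the input is a list of paths, not a sequence file.
sketch_fofn() {
    # $1 = FOFN, $2 = output prefix (mash appends .msh)
    mash sketch -l "$1" -o "$2"
}

# Example calls (placeholder paths):
#   build_fofn ./genomes ./refseq_links refseq.fofn
#   sketch_fofn refseq.fofn refseq_custom
```

Because each genome stays in its own file, every line of the FOFN yields one sketch, regardless of how many contigs or chromosomes the file contains.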

alienzj · 2019-05-14T10:08:47Z

Hello, everyone~

When I run:

mash info RefSeq88n.msh | head -n 50

I get (partial output):

Header:
Hash function (seed):          MurmurHash3_x64_128 (0)
K-mer size:                    21 (64-bit hashes)
Alphabet:                      ACGT (canonical)
Target min-hashes per sketch:  1000
Sketches:                      127219
Sketches:
[Hashes]  [Length]    [ID]                                                                                                                         [Comment]
1000      143726002   GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz                                                                        [1870 seqs] NC_004354.4 Drosophila melanogaster chromosome X [...]
1000      3241953429  GCF_000001405.36_GRCh38.p10_genomic.fna.gz                                                                                   [557 seqs] NC_000001.11 Homo sapiens chromosome 1, GRCh38.p7 Primary Assembly [...]
1000      3257319537  GCF_000001405.38_GRCh38.p12_genomic.fna.gz                                                                                   [594 seqs] NC_000001.11 Homo sapiens chromosome 1, GRCh38.p12 Primary Assembly [...]
1000      3231170666  GCF_000001515.7_Pan_tro_3.0_genomic.fna.gz                                                                                   [44449 seqs] NC_006468.4 Pan troglodytes isolate Yerkes chimp pedigree #C0471 (Clint) chromosome 1, Pan_tro 3.0, whole genome shotgun sequence [...]

So I want to build an up-to-date RefSeq bacterial Mash database from the FTP site below:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*.genomics.fna.gz

But I can't find per-genome files there like GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz.
Do I need to split the RefSeq release bacteria files back into one file per genome?

Thanks~

alienzj · 2019-05-14T15:48:34Z

I think I should use ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/
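For anyone following the same route, here is a hedged sketch of how per-genome file URLs might be derived. It assumes that each assembly directory on the NCBI genomes FTP site holds a file named after the directory plus _genomic.fna.gz, and that the assembly_summary.txt file in the directory above lists those directories in its ftp_path column (assumed here to be column 20). Check both against the current NCBI layout before relying on them. genomic_url is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: given an assembly directory URL, print the URL of its
# genomic FASTA, assuming the file is named <directory name>_genomic.fna.gz.
genomic_url() {
    dir=${1%/}    # drop a trailing slash, if any
    printf '%s/%s_genomic.fna.gz\n' "$dir" "$(basename "$dir")"
}

# Example (assumption: ftp_path is column 20 of assembly_summary.txt):
#   awk -F '\t' '!/^#/ && $20 != "na" { print $20 }' assembly_summary.txt |
#   while IFS= read -r d; do genomic_url "$d"; done > genomic_urls.txt
```

Each downloaded file then corresponds to one genome, so the files can be sketched as they are, without splitting anything.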

MarionHu · 2020-06-04T14:43:40Z

(Quoting ondovb's reply of 2017-12-08 above.)

Hi,
Does this mean that the pre-sketched RefSeq database we can download from the tutorial page is updated regularly? Could you specify which release of the NCBI database it corresponds to?

dominikjasiczek · 2020-06-11T12:00:48Z

(Quoting ondovb's reply and MarionHu's follow-up question above.)

Hi,
I would also like to know whether the pre-sketched RefSeq database is updated. If not, what would you suggest that I, as a beginner in bioinformatics, do to update it myself?
I used Mash in my project, and more recent entries do not seem to be included in the pre-sketched RefSeq database. Thank you in advance for your help!

ondovb added the enhancement label Dec 8, 2017
