Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assembly genomes #5

Open
jgnunes opened this issue Apr 9, 2018 · 0 comments
Open

Assembly genomes #5

jgnunes opened this issue Apr 9, 2018 · 0 comments

Comments

@jgnunes
Copy link
Contributor

jgnunes commented Apr 9, 2018

General information

The strategy to be applied is to run several assemblers in parallel and then compare their results. If the assembly validation shows that the assembly isn't good enough, then we re-run the assembly adjusting the parameters. This trategy is the usual protocol used by bioinformaticians, as suggested by the paper "Ten steps to get started in Genome Assembly and Annotation".
Should we use a de novo or a reference-guided strategy for the genomes' assembly? Since we already have a reference genome of high quality, it seems unreasonable not to take advantage of it. However, it is known that a reference-guided assembly has some disadvantages, as "the resulting assemblies may contain some biases towards the used reference (...)" and as a consequence "(...) more diverged regions may not be reconstructed and missing". (Lischer, H., 2017)

Some good candidate de novo assemblers are:

  • Masurca:
    • Recommended by the "Ten steps (...) paper" and used by Marcela for assembly of L. fortunei nuclear genome
    • "Hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies" (Zimin, A., 2013)
    • Requirements:
      • Linux system (maybe it works on other UNIX-like systems)
      • Mammalian genomes (up to 3Gb): 512Gb RAM, 32+ cores, 5Tb disk space
    • Expected runtime:
      • Mammalian genomes (up to 3Gb): 15-20 days
  • AllPaths-LG:
    • Recommended by the "Ten steps (...) paper" and used by Marcela for assembly of L. fortunei nuclear genome
    • It uses a de Bruijn graph strategy (as explained at the paper of its first version (AllPaths))
    • Requirements:

Some good candidates for reference-guided assemblers are:

  • Heidi Lischer's pipeline

    • "We have shown that our extended reference-guided de novo assembly approach almost always outperforms the corresponding de novo assembly program even when a reference genome of a closely related species is used. The combination of reference mapping and de novo assembly provides a powerful strategy for genome assembly, as it combines the advantages of both approaches" (Lischer, H., 2017)
    • What assembler should we choose for the de novo assembly step?
      • The pipeline supports the following de novo assemblers:
        • AllPaths-LG (v 51279)
        • idba (v 1.1.1)
        • abyss-pe (v 1.5.2)
        • SOAPdenovo2 (v r240)
      • According to the reference paper, the best assemblies were obtained using the ALLPATHS-LG assembler: "The overall best assembly can be achieved with our reference-guided de novo assembly pipeline using ALLPATHS-LG. However, one should be aware that this is not an ultimate ranking. Other studies have shown that assemblers may perform quite differently on varying data sets and species" (Lischer, H., 2017)
    • Requirements:
      • The reference-guided strategy requires about 50% less RAM than the respective de novo assembly (see citation below). A estimate says one should need around 160 bytes of RAM per genome base, meaning around 256 GB RAM for the golden mussel genome (1,6 Gb). Then, a initial set of 0,5*256= 128 GB RAM could be tested.
      • "The reference-guided de novo with ALLPATHS-LG needs 16% less RAM than de novo assembly with ALLPATHS-LG. This is even more pronounced if the closer reference A. lyrata is used: only 109 GB memory is needed, which is less than half of the de novo assembly." Lischer, H., 2017
    • Runtime
      • "However, the lower memory requirements of the reference-guided de novo assembly approach comes with the cost of run time, which is much longer due to several de novo assembly and alignment steps." Lischer, H., 2017
  • MIRA

    • There isn't a general recommendation, but one can run a MIRA command to estimate the resources required. It's called miramem (find out more at MIRA's manual)
    • The mapping assembly can be found at topic 6 on the manual
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant