Assembly genomes #5

jgnunes · 2018-04-09T17:27:09Z

General information

The strategy to be applied is to run several assemblers in parallel and then compare their results. If the assembly validation shows that the assembly isn't good enough, then we re-run the assembly adjusting the parameters. This trategy is the usual protocol used by bioinformaticians, as suggested by the paper "Ten steps to get started in Genome Assembly and Annotation".
Should we use a de novo or a reference-guided strategy for the genomes' assembly? Since we already have a reference genome of high quality, it seems unreasonable not to take advantage of it. However, it is known that a reference-guided assembly has some disadvantages, as "the resulting assemblies may contain some biases towards the used reference (...)" and as a consequence "(...) more diverged regions may not be reconstructed and missing". (Lischer, H., 2017)

Some good candidate de novo assemblers are:

Masurca:
- Recommended by the "Ten steps (...) paper" and used by Marcela for assembly of L. fortunei nuclear genome
- "Hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies" (Zimin, A., 2013)
- Requirements:
  - Linux system (maybe it works on other UNIX-like systems)
  - Mammalian genomes (up to 3Gb): 512Gb RAM, 32+ cores, 5Tb disk space
- Expected runtime:
  - Mammalian genomes (up to 3Gb): 15-20 days
AllPaths-LG:
- Recommended by the "Ten steps (...) paper" and used by Marcela for assembly of L. fortunei nuclear genome
- It uses a de Bruijn graph strategy (as explained at the paper of its first version (AllPaths))
- Requirements:
  - Linux/UNIX system
  - 512 Gb for mammalian sized genomes or
  - 160 bytes of RAM per genome base

Some good candidates for reference-guided assemblers are:

Heidi Lischer's pipeline
- "We have shown that our extended reference-guided de novo assembly approach almost always outperforms the corresponding de novo assembly program even when a reference genome of a closely related species is used. The combination of reference mapping and de novo assembly provides a powerful strategy for genome assembly, as it combines the advantages of both approaches" (Lischer, H., 2017)
- What assembler should we choose for the de novo assembly step?
  - The pipeline supports the following de novo assemblers:
    - AllPaths-LG (v 51279)
    - idba (v 1.1.1)
    - abyss-pe (v 1.5.2)
    - SOAPdenovo2 (v r240)
  - According to the reference paper, the best assemblies were obtained using the ALLPATHS-LG assembler: "The overall best assembly can be achieved with our reference-guided de novo assembly pipeline using ALLPATHS-LG. However, one should be aware that this is not an ultimate ranking. Other studies have shown that assemblers may perform quite differently on varying data sets and species" (Lischer, H., 2017)
- Requirements:
  - The reference-guided strategy requires about 50% less RAM than the respective de novo assembly (see citation below). A estimate says one should need around 160 bytes of RAM per genome base, meaning around 256 GB RAM for the golden mussel genome (1,6 Gb). Then, a initial set of 0,5*256= 128 GB RAM could be tested.
  - "The reference-guided de novo with ALLPATHS-LG needs 16% less RAM than de novo assembly with ALLPATHS-LG. This is even more pronounced if the closer reference A. lyrata is used: only 109 GB memory is needed, which is less than half of the de novo assembly." Lischer, H., 2017
- Runtime
  - "However, the lower memory requirements of the reference-guided de novo assembly approach comes with the cost of run time, which is much longer due to several de novo assembly and alignment steps." Lischer, H., 2017
MIRA
- There isn't a general recommendation, but one can run a MIRA command to estimate the resources required. It's called miramem (find out more at MIRA's manual)
- The mapping assembly can be found at topic 6 on the manual

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assembly genomes #5

Assembly genomes #5

jgnunes commented Apr 9, 2018 •

edited

Loading

Assembly genomes #5

Assembly genomes #5

Comments

jgnunes commented Apr 9, 2018 • edited Loading

General information

Some good candidate de novo assemblers are:

Some good candidates for reference-guided assemblers are:

jgnunes commented Apr 9, 2018 •

edited

Loading