The Metagenome Assembly Pipeline

Summary

This workflow was developed by Brian Foster at JGI and originates from his repo. It takes in paired-end Illumina short reads or PacBio long reads.

For short reads, the workflow performs error correction with bbcms (BBTools) and reformats the interleaved file into two FASTQ files for downstream tasks. The corrected reads are assembled using metaSPAdes. After assembly, the reads are mapped back to the contigs by bbmap (BBTools) to obtain coverage information. The .wdl (Workflow Description Language) file includes five tasks: bbcms, assy, create_agp, read_mapping_pairs, and make_output.

For long reads, the workflow uses Flye for assembly, pbmm2 for alignment, Racon for polishing, and minimap2 for read mapping and coverage analysis. The .wdl file includes six tasks: combine_fastq, assy, racon, format_assembly, map, and make_info_file.
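The two branches described above can be pictured as a simple lookup keyed on the read type. The sketch below is illustrative only: it is written in Python rather than WDL, the function name `tasks_for` is hypothetical, and the task names are copied from this README, not read from the .wdl file.

```python
# Task names as listed in this README; the real definitions live in the .wdl file.
SHORT_READ_TASKS = ["bbcms", "assy", "create_agp", "read_mapping_pairs", "make_output"]
LONG_READ_TASKS = ["combine_fastq", "assy", "racon", "format_assembly", "map", "make_info_file"]


def tasks_for(short_read: bool) -> list:
    """Return the task names this README lists for the chosen read type."""
    return list(SHORT_READ_TASKS if short_read else LONG_READ_TASKS)


if __name__ == "__main__":
    print("short reads:", ", ".join(tasks_for(True)))
    print("long reads:", ", ".join(tasks_for(False)))
```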

The Docker images and Dockerfiles can be found here:

microbiomedata/bbtools:39.03

microbiomedata/spades:4.0.0

Input files

  1. The path to the input FASTQ file (Illumina paired-end interleaved FASTQ or PacBio long-read FASTQ); the output of the Reads QC workflow is recommended.

  2. Project name: nmdc:XXXXXX

  3. Memory (optional) e.g., "jgi_metaAssembly.memory": "105G"

  4. Threads (optional) e.g., "jgi_metaAssembly.threads": "16"

  5. Whether the input is short reads (boolean) e.g., "jgi_metaAssembly.shortRead": true

{
        "jgi_metaAssembly.input_files": ["https://portal.nersc.gov/project/m3408/test_data/smalltest.int.fastq.gz"],
        "jgi_metaAssembly.proj": "nmdc:XXXXXX",
        "jgi_metaAssembly.memory": "105G",
        "jgi_metaAssembly.threads": "16",
        "jgi_metaAssembly.shortRead": true
}
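If the inputs file is generated by a script, something along these lines may help. This is a minimal sketch, not part of the workflow: the helper name `build_inputs` is hypothetical, and only the JSON keys are taken from the example above. Memory and threads are treated as optional, matching the list of inputs.

```python
import json


def build_inputs(fastq, proj, short_read, memory=None, threads=None):
    """Assemble the inputs dictionary using the keys shown in the example above."""
    inputs = {
        "jgi_metaAssembly.input_files": [fastq],
        "jgi_metaAssembly.proj": proj,
        "jgi_metaAssembly.shortRead": bool(short_read),
    }
    # Memory and threads are optional, so they are only added when given.
    if memory is not None:
        inputs["jgi_metaAssembly.memory"] = memory
    if threads is not None:
        # The example above passes the thread count as a string.
        inputs["jgi_metaAssembly.threads"] = str(threads)
    return inputs


if __name__ == "__main__":
    example = build_inputs(
        "https://portal.nersc.gov/project/m3408/test_data/smalltest.int.fastq.gz",
        "nmdc:XXXXXX",
        short_read=True,
        memory="105G",
        threads=16,
    )
    print(json.dumps(example, indent=8))
```

The printed text can be saved to a file and passed to whichever WDL runner is in use.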

Output files

Below is a partial list of the output files. The main assembly contigs output is in final_assembly/assembly.contigs.fasta.

# Short Reads
    output/
    ├── nmdc_XXXXXX_metaAsm.info
    ├── nmdc_XXXXXX_covstats.txt
    ├── nmdc_XXXXXX_contigs.fna
    ├── nmdc_XXXXXX_bbcms.fastq.gz
    ├── nmdc_XXXXXX_scaffolds.fna
    ├── nmdc_XXXXXX_assembly.agp
    ├── stats.json
    ├── nmdc_XXXXXX_pairedMapped.sam.gz
    └── nmdc_XXXXXX_pairedMapped_sorted.bam
# Long Reads
    output/
    ├── nmdc_XXXXXX_assembly.legend
    ├── nmdc_XXXXXX_contigs.fna
    ├── nmdc_XXXXXX_pairedMapped_sorted.bam
    ├── nmdc_XXXXXX_read_count_report.txt
    ├── nmdc_XXXXXX_metaAsm.info
    ├── nmdc_XXXXXX_summary.stats
    ├── nmdc_XXXXXX_scaffolds.fna
    ├── nmdc_XXXXXX_pairedMapped.sam.gz
    ├── stats.json
    ├── nmdc_XXXXXX_contigs.sam.stats
    ├── nmdc_XXXXXX_contigs.sorted.bam.pileup.basecov
    ├── nmdc_XXXXXX_assembly.agp
    └── nmdc_XXXXXX_contigs.sorted.bam.pileup.out
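To check that a short-read run produced the files listed above, a small script can compare the listing against a directory. The sketch below rests on an assumption inferred from the listing, namely that the file prefix is the project name with the colon replaced by an underscore (nmdc:XXXXXX becomes nmdc_XXXXXX). The function names are hypothetical, and since the listing is partial, a real run may contain more files.

```python
from pathlib import Path

# Suffixes copied from the short-read listing above.
SHORT_READ_SUFFIXES = [
    "metaAsm.info",
    "covstats.txt",
    "contigs.fna",
    "bbcms.fastq.gz",
    "scaffolds.fna",
    "assembly.agp",
    "pairedMapped.sam.gz",
    "pairedMapped_sorted.bam",
]


def expected_outputs(proj):
    """Return the short-read file names expected for a project name."""
    # Assumption: the prefix replaces ':' with '_', as the listing suggests.
    prefix = proj.replace(":", "_")
    names = [f"{prefix}_{suffix}" for suffix in SHORT_READ_SUFFIXES]
    names.append("stats.json")  # listed without a project prefix
    return names


def missing_outputs(output_dir, proj):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected_outputs(proj) if not (root / name).exists()]


if __name__ == "__main__":
    for name in missing_outputs("output", "nmdc:XXXXXX"):
        print("missing:", name)
```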

Link to Doc Site

Please refer here for more information.