# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make
# Run on test data (use -f0 for small datasets)
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz
./hifiasm -o test -t4 -f0 chr11-2M.fa.gz 2> test.log
awk '/^S/{print ">"$2;print $3}' test.p_ctg.gfa > test.p_ctg.fa # get primary contigs in FASTA
# Assemble inbred/homozygous genomes (-l0 disables duplication purging)
hifiasm -o CHM13.asm -t32 -l0 CHM13-HiFi.fa.gz 2> CHM13.asm.log
# Assemble heterozygous with built-in duplication purging
hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz
# Trio binning assembly (requiring https://github.com/lh3/yak)
yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz)
yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz)
hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz
Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. It can assemble a human genome in several hours and works with the California redwood genome, one of the most complex genomes sequenced so far. Hifiasm can produce primary/alternate assemblies of quality competitive with the best assemblers. It also introduces a new graph binning algorithm and achieves the best haplotype-resolved assembly given trio data.
- Hifiasm delivers high-quality assemblies. It tends to generate longer contigs and resolve more segmental duplications than other assemblers.
- Given sequence reads from the parents, hifiasm can produce overall the best haplotype-resolved assembly so far. It is the assembler of choice of the Human Pangenome Project for the first batch of samples.
- Hifiasm can purge duplications between haplotigs without relying on third-party tools such as purge_dups. Hifiasm does not need polishing tools like pilon or racon, either. This simplifies the assembly pipeline and saves running time.
- Hifiasm is fast. It can assemble a human genome in half a day and assemble a ~30Gb redwood genome in three days. No genome is too large for hifiasm.
- Hifiasm is trivial to install and easy to use. It does not require python, R or C++11 compilers and can be compiled into a single executable. The default setting works well with a variety of genomes.
A typical hifiasm command line looks like:
hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
where NA12878.fq.gz provides the input reads, -t sets the number of CPUs in use and -o specifies the prefix of output files. For this example, the primary contigs are written to NA12878.asm.p_ctg.gfa and alternate contigs to NA12878.asm.a_ctg.gfa. On the first run, hifiasm saves corrected reads and overlaps to disk as NA12878.asm.*.bin. It reuses the saved results to avoid the time-consuming all-vs-all overlap calculation next time. You may specify -i to ignore precomputed overlaps and redo overlapping from raw reads.
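
Because the contigs are written in GFA, a FASTA copy is often convenient for downstream tools. The sketch below wraps the awk one-liner from the quick-start commands in a small shell function. The name `gfa2fa` is hypothetical (it is not part of hifiasm), and the sketch assumes the sequences are stored inline on the `S` lines.

```shell
# gfa2fa is a hypothetical helper name, not part of hifiasm.
# It prints every GFA segment (S line) as a FASTA record:
# field 2 is the segment name, field 3 is the sequence.
gfa2fa() {
    awk '/^S/{print ">"$2; print $3}' "$1"
}

# Usage, with the file names from the NA12878.asm example:
#   gfa2fa NA12878.asm.p_ctg.gfa > NA12878.asm.p_ctg.fa
#   gfa2fa NA12878.asm.a_ctg.gfa > NA12878.asm.a_ctg.fa
```

Link (`L`) lines and any other record types are skipped, so only contig sequences reach the FASTA file.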
Hifiasm purges haplotig duplications by default. For inbred or homozygous genomes, you may disable purging with option -l0. Old HiFi reads may contain short adapter sequences at the ends of reads. You can specify -z20 to trim both ends of reads by 20bp. For small genomes, use -f0 to disable the initial Bloom filter, which takes 16GB of memory at the beginning. For genomes much larger than human, applying -f38 or even -f39 is preferred to save memory on k-mer counting.
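
The choices in the paragraph above can be collected in a small wrapper. This is only a sketch: the function name `hifiasm_extra_opts` and the size classes `small`, `human` and `huge` are invented for illustration, and the sketch picks -f38 for the `huge` class even though the text also allows -f39.

```shell
# hifiasm_extra_opts is a hypothetical helper, not part of hifiasm.
# Arguments (all labels are assumptions made for this sketch):
#   $1  genome size class: small | human | huge
#   $2  inbred/homozygous sample: yes | no
#   $3  old HiFi reads that may carry adapters: yes | no
hifiasm_extra_opts() {
    opts=""
    case "$1" in
        small) opts="$opts -f0" ;;   # disable the initial Bloom filter
        huge)  opts="$opts -f38" ;;  # save memory on k-mer counting
    esac
    if [ "$2" = yes ]; then
        opts="$opts -l0"             # disable duplication purging
    fi
    if [ "$3" = yes ]; then
        opts="$opts -z20"            # trim 20bp from both read ends
    fi
    # Drop the leading space before printing.
    printf '%s\n' "${opts# }"
}

# Usage (the unquoted substitution lets the shell split the options):
#   hifiasm -o asm -t32 $(hifiasm_extra_opts small yes no) reads.fq.gz
```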
When parental short reads are available, hifiasm can generate a pair of haplotype-resolved assemblies with trio binning. To perform such an assembly, you need to count k-mers with yak first and then run the assembly:
yak count -k31 -b37 -t16 -o pat.yak paternal.fq.gz
yak count -k31 -b37 -t16 -o mat.yak maternal.fq.gz
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878.fq.gz
Here NA12878.asm.hap1.p_ctg.gfa and NA12878.asm.hap2.p_ctg.gfa give the two haplotype assemblies. In the binning mode, hifiasm does not purge haplotig duplications by default. Because hifiasm reuses saved overlaps, you can generate both primary/alternate assemblies and trio binning assemblies with:
hifiasm -o NA12878.asm -t 32 NA12878.fq.gz 2> NA12878.asm.pri.log
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak /dev/null 2> NA12878.asm.trio.log
The second command line will run much faster than the first. You can also dump error-corrected reads in FASTA and/or overlaps in PAF with:
hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null
For non-trio assembly, hifiasm generates the following files:
- Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information, including somatic mutations and recurrent sequencing errors.
- Haplotype-resolved processed unitig graph without small bubbles (prefix.p_utg.gfa). Small bubbles might be caused by somatic mutations or noise in data, which are not the real haplotype information.
- Primary assembly contig graph (prefix.p_ctg.gfa). This graph collapses different haplotypes.
- Alternate assembly contig graph (prefix.a_ctg.gfa). This graph consists of all contigs that are discarded from the primary contig graph.
For trio assembly, hifiasm generates the following files:
- Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information.
- Phased paternal/haplotype1 contig graph (prefix.hap1.p_ctg.gfa). This graph keeps the phased paternal/haplotype1 assembly.
- Phased maternal/haplotype2 contig graph (prefix.hap2.p_ctg.gfa). This graph keeps the phased maternal/haplotype2 assembly.
Hifiasm writes error-corrected reads to the prefix.ec.bin binary file and writes overlaps to prefix.ovlp.source.bin and prefix.ovlp.reverse.bin.
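
Since reruns depend on these files, it can help to check that they exist before passing /dev/null as the read input. The sketch below does that with a hypothetical function, `has_saved_overlaps`, which is not part of hifiasm. It assumes the three file names listed above; other hifiasm versions may write a different set of files.

```shell
# has_saved_overlaps is a hypothetical helper, not part of hifiasm.
# It succeeds only if all three saved files named in the text exist
# and are non-empty for the given output prefix.
has_saved_overlaps() {
    for suffix in ec.bin ovlp.source.bin ovlp.reverse.bin; do
        [ -s "$1.$suffix" ] || return 1
    done
    return 0
}

# Usage: fall back to the real reads when the saved files are missing.
#   if has_saved_overlaps NA12878.asm; then
#       reads=/dev/null
#   else
#       reads=NA12878.fq.gz
#   fi
#   hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak "$reads"
```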
The following table shows the statistics of several hifiasm primary assemblies:
Dataset | Size | Cov. | Asm options | CPU time | Wall time | RAM | N50 |
---|---|---|---|---|---|---|---|
Mouse (C57/BL6J) | 2.6Gb | ×25 | -t48 -l0 | 172.9h | 4.8h | 76G | 21.1Mb |
Maize (B73) | 2.2Gb | ×22 | -t48 -l0 | 203.2h | 5.1h | 68G | 36.7Mb |
Strawberry | 0.8Gb | ×36 | -t48 -D10 | 152.7h | 3.7h | 91G | 17.8Mb |
Frog | 9.5Gb | ×29 | -t48 | 2834.3h | 69.0h | 463G | 9.3Mb |
Redwood | 35.6Gb | ×28 | -t80 | 3890.3h | 65.5h | 699G | 5.4Mb |
Human (CHM13) | 3.1Gb | ×32 | -t48 -l0 | 310.7h | 8.2h | 114G | 88.9Mb |
Human (HG00733) | 3.1Gb | ×33 | -t48 | 269.1h | 6.9h | 135G | 69.9Mb |
Human (HG002) | 3.1Gb | ×36 | -t48 | 305.4h | 7.7h | 137G | 98.7Mb |
Hifiasm can assemble a 3.1Gb human genome in several hours or a ~30Gb hexaploid redwood genome in a few days on a single machine. For trio binning assembly:
Dataset | Cov. | CPU time | Elapsed time | RAM | N50 |
---|---|---|---|---|---|
HG00733, [father], [mother] | ×33 | 269.1h | 6.9h | 135G | 35.1Mb (paternal), 34.9Mb (maternal) |
HG002, [father], [mother] | ×36 | 305.4h | 7.7h | 137G | 41.0Mb (paternal), 40.8Mb (maternal) |
NA12878, [father], [mother] | ×30 | 180.8h | 4.9h | 123G | 27.7Mb (paternal), 27.0Mb (maternal) |
Except NA12878, the assemblies above were produced by hifiasm v0.12 and can be downloaded at
ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/submission/hifiasm-0.12/
NA12878 was assembled with an older version of hifiasm and is available at
ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/NA12878-r253/
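
To get an N50 figure comparable to the tables above from your own assembly, a sketch along the following lines may be enough. The function name `gfa_n50` is hypothetical, and this is not necessarily how the numbers in the tables were computed. It assumes sequences are stored inline on the GFA `S` lines and reports the length of the contig at which the cumulative length, taken from the longest contig down, first reaches half of the total.

```shell
# gfa_n50 is a hypothetical helper name, not part of hifiasm.
# It prints the N50 (in bp) of the segment sequences in a GFA file.
gfa_n50() {
    awk '/^S/{print length($3)}' "$1" |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 acc = 0
                 for (i = 1; i <= NR; i++) {
                     acc += len[i]
                     # Stop at the first contig that brings the running
                     # sum to at least half of the total length.
                     if (2 * acc >= total) { print len[i]; exit }
                 }
             }'
}

# Usage:
#   gfa_n50 NA12878.asm.p_ctg.gfa
```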
For a detailed description of options, please see man ./hifiasm.1. The -h option of hifiasm also provides a brief description of options. If you have further questions, please raise an issue at the issue page.
- Purging haplotig duplications may introduce misassemblies.