GraphMB is a metagenomic binner developed for long-read assemblies that takes advantage of graph machine learning algorithms and the assembly graph generated during assembly. It has been tested on (meta)Flye assemblies.
Our peer-reviewed paper:
Andre Lamurias, Mantas Sereika, Mads Albertsen, Katja Hose, Thomas Dyhre Nielsen, Metagenomic binning with assembly graph embeddings, Bioinformatics, 2022, btac557, https://doi.org/10.1093/bioinformatics/btac557
GraphMB was developed with Python 3.7. All of its dependencies are automatically installed. For pre-processing and post-processing, check this README.
Documentation can be found here: https://graphmb.readthedocs.io/en/latest/
pip install graphmb
pip install https://github.com/AndreLamurias/GraphMB/releases/download/v0.1.5/graphmb-0.2.2-py3-none-any.whl
Clone this repository, and then:
cd GraphMB
python -m venv venv; source venv/bin/activate # optional
pip install .
Either build the docker image with the Dockerfile or pull from dockerhub:
docker pull andrelamurias/graphmb
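Alternatively, to build the image locally (assuming you have cloned the repository and the Dockerfile sits at its root):
docker build -t graphmb .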
Then you can run GraphMB in a container. The image includes the Strong100 dataset. If you want to use other datasets, use the -v option to mount the path to your data, as shown below.
docker run -it andrelamurias/graphmb bash
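For example, to mount a local data directory into the container (the paths here are placeholders, adjust them to your setup):
docker run -it -v /path/to/your/data:/data andrelamurias/graphmb bash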
conda create -n graphmb -c conda-forge make cmake libgcc python=3.7 pip tensorflow
conda activate graphmb
pip install graphmb
The only files required are the contigs in FASTA format and the assembly graph in GFA format. For optimal performance, the assembly graph should be generated with Flye 2.9, since it includes the number of reads mapping to each pair of contigs. For better results, CheckM should also be run on each contig using the general Bacteria marker set. This is optional though: you can instead run the model for a fixed number of epochs and pick the last model. The marker genes are also used to improve model training through a constrained loss function. By default, GraphMB runs with early stopping.
In summary, you need to have a directory with these files (names can be changed with arguments):
- assembly.fasta: assembly graph edge sequences or contigs
- assembly_graph.gfa: assembly graph; its nodes should be the sequences of assembly.fasta (see the sketch after this list)
- assembly_depth.txt: output of jgi_summarize_bam_contig_depths
- marker_gene_stats.tsv (optional): output of CheckM for each contig
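For reference, a GFA file declares the node sequences in S lines and the edges between them in L lines. A minimal illustrative sketch (not real output; Flye's actual GFA carries additional tags):
S	edge_1	ACGTACGTAA
S	edge_2	TTGCAATCGG
L	edge_1	+	edge_2	-	0M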
If you do not specify this directory with the --assembly parameter, you have to use the full path of all files. Otherwise, GraphMB will assume that the files are inside the directory given by --assembly.
You can get an example of these files here: download from this link and extract to data/strong100. The datasets used in our experiments are available here. These datasets include the VAE embeddings obtained with VAMB, which are automatically used by GraphMB. If you want to regenerate these embeddings, use this fork of VAMB.
If you have your assembly in some directory, with the files mentioned above:
graphmb --assembly data/strong100/ --outdir results/strong100/ --markers marker_gene_stats.tsv
The output directory will be created if it does not exist. The marker file is optional. You can also specify the filenames explicitly:
graphmb --assembly data/strong100/ --outdir results/strong100/ --assembly_name edges.fasta --graph_file assembly_graph.gfa --depth edges_depth.txt --markers marker_gene_stats.tsv
By default, GraphMB saves a TSV file mapping each contig to a bin to the assembly directory, as well as the weights and output embeddings of the best model. The output directory can be changed with the --outdir argument.
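The contig-to-bin TSV has one contig per line with its assigned bin; a hypothetical excerpt (contig names and bin IDs are made up):
edge_1	0
edge_2	0
edge_3	1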
To prevent GraphMB from running clustering after each epoch, do not provide the markers param, or set the --evalepochs option to a higher number (default is 10). This will make it run faster, but the results might not be optimal.
graphmb --assembly data/strong100/ --outdir results/strong100/
To use the GPU for both training and clustering, use the --cuda param:
graphmb --assembly data/strong100/ --outdir results/strong100/ --cuda
You can also run on CPU and limit the number of threads to use:
graphmb --assembly data/strong100/ --outdir results/strong100/ --numcores 4
GraphMB was tested on graphs where the nodes are contig paths and not full contigs. This gives more granularity to the algorithm, and the edges are directly obtained from the assembly graph. However, in some cases this might be inconvenient, so we implemented an option to use a graph where the nodes are full contigs.
graphmb --assembly data/strong100/ --outdir results/strong100/ --assembly_name contigs.fasta --depth contig_depth.txt --contignodes
Our workflows are available here. In this section we present an overview of how to get your data ready for GraphMB.
- Assemble your reads with metaflye:
flye --nano-raw <reads_file> -o <output> --meta
- Filter and polish assembly if necessary (or extract edge sequences and polish edge sequences instead)
# keep the original contigs under another name
mv assembly.fasta contigs.fasta
# extract the edge sequences from the GFA S-lines into FASTA (fold wraps long lines)
awk '/^S/{print ">"$2"\n"$3}' assembly_graph.gfa | fold > assembly.fasta
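As an optional sanity check (using standard grep), the number of FASTA records should match the number of S-lines in the GFA:
grep -c "^>" assembly.fasta
grep -c "^S" assembly_graph.gfa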
- Convert the assembly graph to a contig-based graph if you want to use full contigs instead of edges
- Run CheckM on sequences with Bacteria markers:
# split assembly.fasta into one file per sequence for per-sequence CheckM evaluation
mkdir edges
cd edges; cat ../assembly.fasta | awk '{ if (substr($0, 1, 1)==">") {filename=(substr($0,2) ".fa")} print $0 > filename }'; cd ..
# replace spaces in filenames with underscores
find edges/ -name "* *" -type f | rename 's/ /_/g'
# evaluate edges
checkm taxonomy_wf -x fa domain Bacteria edges/ checkm_edges/
checkm qa checkm_edges/Bacteria.ms checkm_edges/ -f checkm_edges_polished_results.txt --tab_table -o 2
- Get abundances with jgi_summarize_bam_contig_depths:
minimap2 -I 64GB -d assembly.mmi assembly.fasta # make index
minimap2 -I 64GB -ax map-ont assembly.mmi <reads_file> > assembly.sam # map reads to assembly
samtools sort assembly.sam > assembly.bam # sort alignments and convert to BAM
jgi_summarize_bam_contig_depths --outputDepth assembly_depth.txt assembly.bam # compute per-contig depths
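The resulting assembly_depth.txt is a tab-separated table; a hypothetical excerpt (the header follows the jgi_summarize_bam_contig_depths convention, the values are made up):
contigName	contigLen	totalAvgDepth	assembly.bam	assembly.bam-var
edge_1	12000	20.5	20.5	3.1
edge_2	8000	35.2	35.2	4.8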
- Now you should have all the files needed to run GraphMB: assembly.fasta, assembly_graph.gfa and assembly_depth.txt. The marker_gene_stats.tsv file will be saved to checkm_edges/storage/.
We have only tested GraphMB on Flye assemblies. Flye generates a repeat graph where the nodes do not correspond to full contigs. Depending on your setup, you need to either use the edges as contigs or convert the graph to use full contigs as nodes (--contignodes).
To evaluate the binning output, we run CheckM with the --reduced_tree option and count HQ bins (>90% completeness and <5% contamination). The bins will be written to FASTA files if writebins is included in the --post argument. Alternatively, the --post contig2bin option can be used (on by default) and the bins from that file can be written to FASTA with src/write_fasta_bins.py.
Then you can run checkM on the bins directory:
checkm lineage_wf -x fa --reduced_tree -t 4 --tab_table bins_dir/ outputdir/ -f outputdir/checkm.tsv
We also provide a script to filter the checkM output and get the number of HQ bins:
python src/process_checkm_results_expanded.py outputdir/checkm.tsv 90 5
GraphMB has many parameters to run various experiments, from changing the data inputs to the architecture of the model, training loss, data preprocessing, and output type. The defaults were chosen to obtain the published results on the WWTP datasets, but may require some tuning for different datasets and scenarios. The full list of parameters is below, but this section focuses on the most relevant ones.
If --assembly is given, it is used as the base data directory, and every other path is relative to that directory. Otherwise, every data file path must be given relative to the current directory.
Note about markers: this file is not mandatory, but it is assumed to exist by default, because we have run all of our experiments with it. Without it, the number of HQ bins will probably be worse.
- --outdir / --outname: Where to write the output files, including caches, and what prefix to use. If not given, GraphMB writes to the data directory given by --assembly.
- --reload: Ignore cache and reprocess data files.
- --nruns: Repeat the experiment nruns times. Use --seed to specify the initial seed, which is changed with every run to get different results.
- --cuda: Run model training and clustering on GPU.
- --contignodes: If the sequences given by the --assembly_name parameter are actual contigs and not assembly graph edges, use this parameter, which will transform the assembly graph to use full contigs as nodes.
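For example, a run combining several of these options could look like this (the flag values here are arbitrary):
graphmb --assembly data/strong100/ --outdir results/strong100/ --nruns 3 --seed 42 --cuda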
Model to be used by GraphMB. By default it uses a Graph Convolution Network, and trains a Variational Autoencoder first to generate node features. The VAE embeddings are saved to make it faster to rerun the GCN.
Other models:
- sage_lstm: original GraphMB GraphSAGE model, requires DGL installation
- gat and sage: alternative GNN models to GCN (VAEGBin)
- gcn_ccvae, gat_ccvae, sage_ccvae: combined VAE and GNN models, trained end-to-end, or without GNN if layers_gnn=0
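For instance, to train the combined GCN and VAE model end-to-end, select it with --model_name (model names taken from the list above):
graphmb --assembly data/strong100/ --outdir results/strong100/ --model_name gcn_ccvae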
usage: graphmb [-h] [--assembly ASSEMBLY] [--assembly_name ASSEMBLY_NAME] [--graph_file GRAPH_FILE] [--edge_threshold EDGE_THRESHOLD] [--depth DEPTH] [--features FEATURES] [--labels LABELS] [--embs EMBS] [--model_name MODEL_NAME]
[--activation ACTIVATION] [--layers_vae LAYERS_VAE] [--layers_gnn LAYERS_GNN] [--hidden_gnn HIDDEN_GNN] [--hidden_vae HIDDEN_VAE] [--embsize_gnn EMBSIZE_GNN] [--embsize_vae EMBSIZE_VAE] [--batchsize BATCHSIZE]
[--batchtype BATCHTYPE] [--dropout_gnn DROPOUT_GNN] [--dropout_vae DROPOUT_VAE] [--lr_gnn LR_GNN] [--lr_vae LR_VAE] [--graph_alpha GRAPH_ALPHA] [--kld_alpha KLD_ALPHA] [--ae_alpha AE_ALPHA] [--scg_alpha SCG_ALPHA]
[--clusteringalgo CLUSTERINGALGO] [--kclusters KCLUSTERS] [--aggtype AGGTYPE] [--decoder_input DECODER_INPUT] [--vaepretrain VAEPRETRAIN] [--ae_only] [--negatives NEGATIVES] [--quick] [--classify] [--fanout FANOUT]
[--epoch EPOCH] [--print PRINT] [--evalepochs EVALEPOCHS] [--evalskip EVALSKIP] [--eval_split EVAL_SPLIT] [--kmer KMER] [--rawfeatures] [--clusteringloss] [--targetmetric TARGETMETRIC] [--concatfeatures]
[--no_loss_weights] [--no_sample_weights] [--early_stopping EARLY_STOPPING] [--nruns NRUNS] [--mincontig MINCONTIG] [--minbin MINBIN] [--mincomp MINCOMP] [--randomize] [--labelgraph] [--binarize] [--noedges]
[--read_embs] [--reload] [--markers MARKERS] [--post POST] [--skip_preclustering] [--outname OUTNAME] [--cuda] [--noise] [--savemodel] [--tsne] [--numcores NUMCORES] [--outdir OUTDIR]
[--assembly_type ASSEMBLY_TYPE] [--contignodes] [--seed SEED] [--quiet] [--read_cache] [--version] [--loglevel LOGLEVEL] [--configfile CONFIGFILE]
Train graph embedding model
options:
-h, --help show this help message and exit
--assembly ASSEMBLY Assembly base path
--assembly_name ASSEMBLY_NAME
File name with contigs
--graph_file GRAPH_FILE
File name with graph
--edge_threshold EDGE_THRESHOLD
Remove edges with weight lower than this (keep only >=)
--depth DEPTH Depth file from jgi
--features FEATURES Features file mapping contig name to features
--labels LABELS File mapping contig to label
--embs EMBS No train, load embs
--model_name MODEL_NAME
One of the implemented models
--activation ACTIVATION
Activation function to use (relu, prelu, sigmoid, tanh)
--layers_vae LAYERS_VAE
Number of layers of the VAE
--layers_gnn LAYERS_GNN
Number of layers of the GNN
--hidden_gnn HIDDEN_GNN
Dimension of hidden layers of GNN
--hidden_vae HIDDEN_VAE
Dimension of hidden layers of VAE
--embsize_gnn EMBSIZE_GNN, --zg EMBSIZE_GNN
Output embedding dimension of GNN
--embsize_vae EMBSIZE_VAE, --zl EMBSIZE_VAE
Output embedding dimension of VAE
--batchsize BATCHSIZE
batchsize to train the VAE
--batchtype BATCHTYPE
Batch type, nodes or edges
--dropout_gnn DROPOUT_GNN
dropout of the GNN
--dropout_vae DROPOUT_VAE
dropout of the VAE
--lr_gnn LR_GNN learning rate
--lr_vae LR_VAE learning rate
--graph_alpha GRAPH_ALPHA
Coefficient for graph loss
--kld_alpha KLD_ALPHA
Coefficient for KLD loss
--ae_alpha AE_ALPHA Coefficient for AE loss
--scg_alpha SCG_ALPHA
Coefficient for SCG loss
--clusteringalgo CLUSTERINGALGO
clustering algorithm
--kclusters KCLUSTERS
Number of clusters (only for some clustering methods)
--aggtype AGGTYPE Aggregation type for GraphSAGE (mean, pool, lstm, gcn)
--decoder_input DECODER_INPUT
What to use for input to the decoder
--vaepretrain VAEPRETRAIN
How many epochs to pretrain VAE
--ae_only Do not use GNN (ae model must be used and decoder input must be ae)
--negatives NEGATIVES
Number of negatives to train GraphSAGE
--quick Reduce number of nodes to run quicker
--classify Run classification instead of clustering
--fanout FANOUT Fan out, number of positive neighbors sampled at each level
--epoch EPOCH Number of epochs to train model
--print PRINT Print interval during training
--evalepochs EVALEPOCHS
Epoch interval to run eval
--evalskip EVALSKIP Skip eval of these epochs
--eval_split EVAL_SPLIT
Percentage of dataset to use for eval
--kmer KMER
--rawfeatures Use raw features
--clusteringloss Train with clustering loss
--targetmetric TARGETMETRIC
Metric to pick best epoch
--concatfeatures Concat learned and original features before clustering
--no_loss_weights Do not use edge weights for loss (positive only)
--no_sample_weights Do not use edge weights to sample negatives
--early_stopping EARLY_STOPPING
Stop training if delta between last two losses is less than this
--nruns NRUNS Number of runs
--mincontig MINCONTIG
Minimum size of input contigs
--minbin MINBIN Minimum size of clusters in bp
--mincomp MINCOMP Minimum size of connected components
--randomize Randomize graph
--labelgraph Create graph based on labels (ignore assembly graph)
--binarize Binarize adj matrix
--noedges Remove all but self edges from adj matrix
--read_embs Read embeddings from file
--reload Reload data
--markers MARKERS File with precomputed checkm results to eval
--post POST Output options
--skip_preclustering Use precomputed checkm results to eval
--outname OUTNAME, --outputname OUTNAME
Output (experiment) name
--cuda Use gpu
--noise Use noise generator
--savemodel Save best model to disk
--tsne Plot tsne at checkpoints
--numcores NUMCORES Number of cores to use
--outdir OUTDIR, --outputdir OUTDIR
Output dir (same as input assembly dir if not defined)
--assembly_type ASSEMBLY_TYPE
flye or spades
--contignodes Use contigs as nodes instead of edges
--seed SEED Set seed
--quiet, -q Do not output epoch progress
--read_cache Do not check assembly files, read cached files only
--version, -v Print version and exit
--loglevel LOGLEVEL, -l LOGLEVEL
Log level