LocalHGT: an ultrafast horizontal gene transfer detection method from large microbial communities

Install

git clone https://github.com/deepomicslab/LocalHGT.git --depth 1
cd LocalHGT/
conda env create --name localhgt -f environment.yml
conda activate localhgt
make

python scripts/main.py -h  # detect HGT breakpoints
python scripts/infer_HGT_event.py -h # detect complete HGT events

Test

cd test/
sh run_BKP_detection.sh
sh run_event_detection.sh

See output/test_sample.acc.csv for breakpoint results, and see test_event_output.csv for event results.

Construct reference database

Users can construct customized reference databases. The reference database (a single fasta file) should contain all the representative genomes of your concerning bacteria. For example, to analyze HGT events between bacteria A, B, and C, we should collect the representative genomes of A, B, and C and merge them into a single fasta file.

To analyze HGTs in the human gut microbiome, we use gut-specific representative genome collection of the UHGG database. The script paper_results/build_UHGG_reference.py can download the human gut-specific UHGG v1 database. The command is

usage: python build_UHGG_reference.py -h

Construct the reference from the gut-specific UHGG V1 database.

optional arguments:
  -r        <str> Generated reference file. (default: uhgg_v1.rep.fasta)
  -b        <str> Folder saves all the downloaded assemblies. (default:
              genomes/)
  -m        <int> Try this number of times until all the genomes are
              downloaded. (default: 10)
  -h, --help

Example: python paper_results/build_UHGG_reference.py -r my_ref.fasta -b genomes_dir -m 4

Then index the database using samtools

samtools faidx ref_database.fasta

Also, at the first run, LocalHGT will index the database automatically, which will take several hours.

Note:

reference index file size is approx (reference size) * 4 * (number of denoted hash functions), make sure the disk has enough space.

Run

Refine sequencing reads

It is highly advised to refine the sequencing reads using tools like fastp before calling HGT. We include fastp in the conda environment, and refine the reads like

fastp -i raw_1.fq -I raw_2.fq -o refine_1.fq -O refine_2.fq

Note:

Currently, LocalHGT only supports paired-end short-read sequencing data (e.g., Illumina data).

Detect HGT breakpoints

First, infer HGT breakpoints by running python main.py like

usage: python main.py -h

Detect HGT breakpoints from metagenomics sequencing data.

required arguments:
  -r             <str> reference file which contains all the representative
                   references of concerned bacteria. (default: None)
  --fq1          <str> unzipped fastq 1 file. (default: None)
  --fq2          <str> unzipped fastq 2 file. (default: None)
  -s             <str> Sample name. (default: sample)
  -o             <str> Output folder. (default: ./)

optional arguments:
  -k             <int> kmer length. (default: 32)
  -t             <int> number of threads. (default: 10)
  -e             <int> number of hash functions (1-9). (default: 3)
  -a             <0/1> 1 indicates retain reads with XA tag. (default: 1)
  -q             <int> minimum read mapping quality in BAM. (default: 20)
  --seed         <int> seed to initialize a pseudorandom number generator.
                   (default: 1)
  --use_kmer     <1/0> 1 means using kmer to extract HGT-related segment, 0
                   means using original reference. (default: 1)
  --hit_ratio    <float> minimum fuzzy kmer match ratio to extract a
                   reference fragment. (default: 0.1)
  --match_ratio  <float> minimum exact kmer match ratio to extract a
                   reference fragment. (default: 0.08)
  --max_peak     <int> maximum candidate BKP count. (default: 300000000)
  --sample       <float> down-sample in kmer counting: (0-1) means sampling
                   proportion, (>1) means sampling base count (bp). (default:
                   2000000000)
  --read_info    <0/1> 1 indicates including reads info, 0 indicates not
                   (just for evaluation). (default: 1)
  -h, --help

A command example:

python scripts/main.py -r reference.fa --fq1 /mnt/d/breakpoints/HGT/uhgg_length//species20_snp0.01_depth50_reads100_sample_0.1.fq --fq2 /mnt/d/breakpoints/HGT/uhgg_length//species20_snp0.01_depth50_reads100_sample_0.2.fq -s species20_snp0.01_depth50_reads100_sample_0  -o test

The detected HGT breakpoints are stored in the <sample name>.acc.csv file within the output folder.

Note:

With a small reference, we can skip extracting HGT-related segments by setting --use_kmer 0.
With a small reference, while maintaining the extraction of HGT-related segments, we can set a small value of -k to reduce memory usage.

Detect complete HGT events

Second, infer complete HGT events by matching breakpoints after detecting HGT breakpoints for all the samples.

usage: python infer_HGT_event.py -h

Infer complete HGT events by matching breakpoint pairs.

required arguments:
  -r        <str> Reference file. (default: None)
  -b        <str> Folder saves all the breakpoint results from all samples.
              (default: None)
  -f        <str> Output file to save all inferred HGT events. (default:
              complete_HGT_event.csv)

optional arguments:
  -n        <int> minimum supporting split read number (default: 2)
  -m        <int> minimum transfer sequence length (default: 500)
  -h, --help

Note:

It is recommended to detect HGT breakpoints for each sample and store the results in a common output folder. Subsequently, when detecting complete HGT events, specify the output folder using the -b parameter. This approach allows LocalHGT to consider all the samples collectively, resulting in more reliable results for complete HGT events.

Output interpretation

HGT breakpoints

The HGT breakpoints are saved in the *acc.csv file. Here is an example:

# the number of reads in the sample is: 41723899; Insert size is 681.
from_ref,from_pos,from_side,from_strand,to_ref,to_pos,to_side,to_strand,if_reverse,read_seq,ref_seq,similarity,from_split_reads,to_split_reads,cross_split_reads,pair_end
GUT_GENOME000031_3,41994,head,+,GUT_GENOME096518_4,692224,tail,+,False,GTGTCGGGGCTTATGATAATCATATCTTATTTTTC,GTGTCGGGGCTTATGATAATCATATCTTTTTTTTC,1.857,3,2,2,5
GUT_GENOME096518_4,725079,head,+,GUT_GENOME000031_3,41992,tail,+,False,TACGCGGAGGGATTATGGGAATGCTCACGGCAATCGAAATGGGAA,CCCGCGGCGGGATTATGGGAATGCTCACGGCAATCGAAATGGGAA,1.8,1,3,1,5

Interpret each column as:

Header	Description
from_ref	from reference ID
from_pos	from reference breakpoint position
from_side	from reference side
from_strand	from reference strand
to_ref	to reference ID
to_pos	to reference breakpoint position
to_side	to reference side
to_strand	to reference strand
if_reverse	if the transferred gene is reversely inserted to receptor
read_seq	used reads sequence in local alignment
ref_seq	aligned reference sequence in local alignment
similarity	alignment similarity in local alignment
from_split_reads	number of split reads mapped to the from_pos
to_split_reads	number of split reads mapped to the to_pos
cross_split_reads	number of split reads supported the breakpoint pair
pair_end	number of paired-end reads supported the breakpoint pair

Note:

Before conducting the analysis, it is advisable to filter out any instances of gene transfer between different contigs of the same species that may be present in the results. For example, GUT_GENOME000031_1 and GUT_GENOME000031_2 belong to the same species, the detected HGT breakpoint pairs or HGT events between them should be discarded.

HGT events

HGT event results are like

sample,receptor,insert_locus,donor,delete_start,delete_end,reverse_flag
species20_snp0,GUT_GENOME000149_2,369882,GUT_GENOME000537_12,40443,91965,True
species20_snp0,GUT_GENOME000189_7,22927,GUT_GENOME000534_3,78150,79127,True
species20_snp0,GUT_GENOME001261_3,336654,GUT_GENOME000202_1,72162,123917,False

Interpret each column as:

Header	Description
sample	Sample ID
receptor	receptor genome
insert_locus	the transfer sequence is inserted at this locus
donor	donor genome
delete_start	the start site of the transfer sequence on the donor genome
delete_end	the end site of the transfer sequence on the donor genome
reverse_flag	if the transferred gene is reversely inserted to receptor

Dependency

We recommend constructing the environment using conda with the environment.yml. Users can also prepare the env as follows:

Python 3.7+
Python Module: scipy, numpy, pandas, sklearn, pysam, scikit-bio, biopython, pyfaidx

Please install samtools and bwa, and add them to the system path, the version should be

samtools==1.11+
bwa-0.7.17+

Getting help

Should you have any queries, please feel free to contact us, we will reply as soon as possible ([email protected]).

Name		Name	Last commit message	Last commit date
Latest commit History 221 Commits
.vscode		.vscode
__pycache__		__pycache__
paper_results		paper_results
scripts		scripts
src		src
test		test
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LocalHGT: an ultrafast horizontal gene transfer detection method from large microbial communities

Install

Test

Construct reference database

Run

Refine sequencing reads

Detect HGT breakpoints

Detect complete HGT events

Output interpretation

HGT breakpoints

HGT events

Dependency

Getting help

About

Releases

Packages

Languages

yiqijiang17/LocalHGT

Folders and files

Latest commit

History

Repository files navigation

LocalHGT: an ultrafast horizontal gene transfer detection method from large microbial communities

Install

Test

Construct reference database

Run

Refine sequencing reads

Detect HGT breakpoints

Detect complete HGT events

Output interpretation

HGT breakpoints

HGT events

Dependency

Getting help

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages