Skip to content

Commit

Permalink
Merge pull request #181 from polio-nanopore/phylo-mods
Browse files Browse the repository at this point in the history
Phylo mods
  • Loading branch information
aineniamh authored Nov 5, 2023
2 parents 38fe0b6 + 462d2cd commit 02017ea
Show file tree
Hide file tree
Showing 11 changed files with 329 additions and 215 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/piranha.yml
Original file line number Diff line number Diff line change
Expand Up @@ -26,5 +26,5 @@ jobs:
run: piranha --version
- name: Run piranha with test data
run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes01.csv -t 2 2>&1 | tee piranha.log
- name: Run piranha with all data
run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes.csv -t 2 2>&1 | tee piranha_all.log
- name: Run piranha in phylo mode
run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes.csv -t 2 -rp -ud -sd piranha/test/supp_data 2>&1 | tee piranha_phylo.log
15 changes: 12 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -367,17 +367,21 @@ and piranha will check which ones you have installed with your version of medaka

## Optional phylogenetics module **\*NEW FEATURE\***

Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus builing. There are 3 additional dependencies needed if you wish to run this module:
Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus builing. If you have previously installed piranha, are 3 additional dependencies needed if you wish to run this module:
- IQTREE2
- mafft
- jclusterfunk
The latest environment file contains this dependencies, so to install you can just update your environment (`conda env update -f environment.yml`) or run using the latest image for piranha GUI.

This module will cluster any consensus sequences generated during the run into `reference_group`, so either `Sabin1-related`, `Sabin2-related`, `Sabin3-related` or `WPV1` and will ultimately build one maximum-likelihood phylogeny for each reference group with consensus sequences in a given sequencing run. To annotate the phylogeny with certain metadata from the barcodes.csv file, specify columns to include with `-pcol/--phylo-metadata-columns`.

Piranha then extracts any relevant reference sequences from the installed reference file (identified by having `display_name=Sabin1-related` in their sequence header, or whichever reference group the relevant phylogeny will be for).

An optional file of local sequences can be supplied to supplement the phylogenetic analysis with `-ss/--supplementary-sequences`. This file should be in FASTA format, but does not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, this file should have the reference group annotated in the header in the format `display_name=Sabin1-related`, for example.
An optional set of local sequences can be supplied to supplement the phylogenetic analysis. To supply them to piranha, point to the correct directory using `-sd,--supplementary-datadir`. The sequence files should be in FASTA format, but do not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, the sequence files should have the reference group annotated in the header in the format `display_name=Sabin1-related`, for example.

This supplementary sequence files can be accompanied with csv metadata files (one row per supplementary sequence) and this metadata can be included in the final report and annotated onto the phylogenies (`-smcol/--supplementary-metadata-columns`). By default, the metadata is matched to the FASTA sequence name with a column titled `sequence_name` but this header name can be configured by specifying `-smid/--supplementary-metadata-id-column`.

This supplementary sequence file can be accompanied with a csv metadata file (one row per supplementary sequence) (`-sm/--supplementary-metadata`) and this metadata can be included in the final report and annotated onto the phylogenies (`-smcol/--supplementary-metadata-columns`). By default, the metadata is matched to the FASTA sequence name with a column titled `sequence_name` but this header name can be configured by specifying `-smid/--supplementary-metadata-id-column`
Piranha will iterate accross the directory supplied and amalgamate the FASTA files, retaining any sequences with `display_name=X` in the header description, where X can be one of `Sabin1-related`, `Sabin2-related`, `Sabin3-related` or `WPV1`. It then will read in every csv file it detects in this directory and attempts to match any metadata to the gathered fasta records. These will be added to the relevant phylogenies.

The phylogenetic pipeline is activated by running with the flag `-rp/--run-phylo`, which then triggers the following analysis steps:
- Amalgamate the newly generated consensus sequences for all barcodes into their respective reference groups.
Expand All @@ -391,6 +395,11 @@ The phylogenetic pipeline is activated by running with the flag `-rp/--run-phylo
- Annotate the tree newick files with the specified metadata (Default: just whether it's a new consensus sequence or not).
- Extract phylogenetic trees and embed in interactive report.

## Update local database **\*NEW FEATURE\***

If you supply a path to the `-sd,--supplementary-datadir` for the phylogenetics module, you have the option of updating this data directory with the new consesnsus sequences generated during the piranaha analysis. If you run with the `-ud,--update-local-database` flag, piranha will write out the new sequences and any accompanying metadata supplied into the directory provided.


## Output options

By default the output directory will be created in the current working directory and will be named `analysis-YYYY-MM-DD`, where YYYY-MM-DD is today's date. This output can be configured in a number of ways. For example, the prefix `analysis` can be overwritten by using the `-pre/--output-prefix new_prefix` flag (or `output_prefix: new_prefix` in a config file) and this will change the default behaviour to `new_prefix_YYYY-MM-DD`. It's good practice not to include spaces or special characters in your directory names.
Expand Down
9 changes: 4 additions & 5 deletions environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,12 @@ channels:
- defaults
dependencies:
- python=3.10
- bcftools=1.11
- coreutils=9.1
- iqtree>=2.1
- cov-ert::jclusterfunk>=0.0.25
- mafft
- minimap2=2.17
- samtools=1.11
- bcftools=1.11
- tabix=1.11
- snakemake-minimal
- iqtree>=2.1
- mafft
- cov-ert::jclusterfunk>=0.0.25
- tabix=1.11
125 changes: 82 additions & 43 deletions piranha/analysis/phylo_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
seq_clusters = collections.defaultdict(list)

header = VALUE_PHYLO_HEADER

written = {}
for record in SeqIO.parse(sample_seqs,KEY_FASTA):
for ref_group in config[KEY_REFERENCES_FOR_CNS]:

Expand All @@ -26,8 +26,8 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
"""

fields = record.description.split(" ")
record_id = fields[0]
record_sample,reference_group,cns_id,epid,sample_date = record_id.split("|")

record_sample,reference_group,cns_id,epid,sample_date = record.id.split("|")

description_dict = {}
for field in fields[1:]:
Expand All @@ -37,14 +37,12 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
if ref_group == reference_group:

new_record = record
new_record.description = ""
new_record.id = record.id

barcode = description_dict[KEY_BARCODE]
name = record_id
name = new_record.id

new_record.description = name
new_record.id = name

seq_clusters[ref_group].append(new_record)

seq_metadata[name][KEY_NAME] = name
seq_metadata[name][KEY_SAMPLE] = record_sample
Expand All @@ -66,6 +64,8 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
call = "Sabin-like"

seq_metadata[name][KEY_CALL] = call
seq_clusters[ref_group].append(new_record)
continue


print(green("Reference groups for phylo pipeline:"))
Expand All @@ -74,15 +74,20 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence

if supplementary_sequences:
for record in SeqIO.parse(supplementary_sequences,KEY_FASTA):
for ref_group in seq_clusters:
if ref_group in record.description:
seq_clusters[ref_group].append(record)

seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id
seq_metadata[record.id][KEY_SOURCE] = "Background"
seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group
if record:
for ref_group in seq_clusters:
if ref_group in record.description:


seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id
seq_metadata[record.id][KEY_SOURCE] = "Background"
seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group

new_record = record
new_record.description = ""
seq_clusters[ref_group].append(new_record)

if supplementary_metadata:
with open(supplementary_metadata, "r") as f:
Expand All @@ -99,27 +104,29 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
seq_metadata[sample][col] = row[col]

for record in SeqIO.parse(reference_sequences, KEY_FASTA):
for ref_group in seq_clusters:
if ref_group in record.description:
seq_clusters[ref_group].append(record)
if record:
for ref_group in seq_clusters:
if ref_group in record.description:
seq_clusters[ref_group].append(record)

seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id

seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group
seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id
seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group

if "Sabin" in record.description:
seq_metadata[record.id][KEY_SOURCE] = "Sabin"
else:
seq_metadata[record.id][KEY_SOURCE] = "Reference"
if "Sabin" in record.description:
seq_metadata[record.id][KEY_SOURCE] = "Sabin"
else:
seq_metadata[record.id][KEY_SOURCE] = "Reference"

for record in SeqIO.parse(outgroup_sequences, KEY_FASTA):
for ref_group in seq_clusters:
if ref_group in record.description:
new_record = record
new_record.description = ""
new_record.id = "outgroup"
new_record.description = "outgroup"

seq_clusters[ref_group].append(new_record)

with open(barcodes_csv, "r") as f:
Expand Down Expand Up @@ -162,7 +169,12 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
writer0.writerow(row)

with open(os.path.join(phylo_outdir, f"{i}.fasta"),"w") as fw:
SeqIO.write(seq_clusters[i], fw, KEY_FASTA)
records = seq_clusters[i]
record_dict = {}
for record in records:
record_dict[record.id] = record
unique_records = [r for r in record_dict.values()]
SeqIO.write(unique_records, fw, KEY_FASTA)

tree_annotations = config[KEY_TREE_ANNOTATIONS]
for i in header:
Expand All @@ -172,22 +184,49 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence

return list(seq_clusters.keys()),tree_annotations

def update_local_database(supplementary_sequences,sample_sequences,output_file):
with open(output_file,"w") as fw:
countall = 0
def update_local_database(sample_sequences,detailed_csv,new_db_seqs,new_db_metadata,config):

record_ids = {}
with open(new_db_seqs,"w") as fw:
countnew = 0
for record in SeqIO.parse(supplementary_sequences, "fasta"):
SeqIO.write(record, fw, "fasta")
countall+=1

for record in SeqIO.parse(sample_sequences, "fasta"):
new_record = record
desc_list = new_record.description.split(" ")
new_desc_list = [i for i in desc_list if not i.startswith("barcode=")]
new_record.description = " ".join(new_desc_list)
SeqIO.write(new_record, fw, "fasta")
countall+=1
countnew+=1
write_record = True

for i in desc_list:
if i.startswith("variant_count"):
count = int(i.split("=")[1])
if count < 6:
write_record = False

if write_record:
new_desc_list = [i for i in desc_list if not i.startswith("barcode=")]
new_record.description = " ".join(new_desc_list)

SeqIO.write(new_record, fw, "fasta")
countnew+=1
sample = record.id.split("|")[0]
record_ids[record.id] = sample

with open(new_db_metadata,"w") as fw:
with open(detailed_csv,"r") as f:
reader = csv.DictReader(f)
header = reader.fieldnames
header.append(config[KEY_SUPPLEMENTARY_METADATA_ID_COLUMN])

writer = csv.DictWriter(fw, fieldnames=header, lineterminator="\n")
writer.writeheader()
sample_data = {}
for row in reader:
sample = row[KEY_SAMPLE]
sample_data[sample] = row

for record_id in record_ids:
sample = record_ids[record_id]
row = sample_data[sample]
row[config[KEY_SUPPLEMENTARY_METADATA_ID_COLUMN]] = record_id
writer.writerow(row)

print(green(f"Local database updated with ")+ f"{countnew}"+ green(" newly generated records."))
print(green(f"Total records in local database:"), countall)
85 changes: 43 additions & 42 deletions piranha/analysis/stool_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,58 +33,59 @@ def gather_fasta_files(summary_info, barcodes_csv, input_cns_list,all_metdata,ru
cns_counter = collections.Counter()
for cns_file in input_cns_list:
for record in SeqIO.parse(cns_file, KEY_FASTA):
cns_info= record.description.split(" ")
ref,barcode,var_count,var_string=cns_info[0].split("|")

info = []
for row in analysis_info[barcode]:
if row[KEY_REFERENCE] == ref:
info = row
if record:
cns_info= record.description.split(" ")
ref,barcode,var_count,var_string=cns_info[0].split("|")

info = []
for row in analysis_info[barcode]:
if row[KEY_REFERENCE] == ref:
info = row

metadata = input_metadata[barcode]
metadata = input_metadata[barcode]

record_id = f"{metadata[KEY_SAMPLE]}|{info[KEY_REFERENCE_GROUP]}"
cns_counter[record_id] += 1
record_id = f"{metadata[KEY_SAMPLE]}|{info[KEY_REFERENCE_GROUP]}"
cns_counter[record_id] += 1

record_id += f"|CNS{cns_counter[record_id]}"
record_id += f"|CNS{cns_counter[record_id]}"

if KEY_EPID in metadata:
record_id += f"|{metadata[KEY_EPID]}"
else:
record_id += "|"
if KEY_EPID in metadata:
record_id += f"|{metadata[KEY_EPID]}"
else:
record_id += "|"

if KEY_DATE in metadata:
record_id += f"|{metadata[KEY_DATE]}"
else:
record_id += "|"
if KEY_DATE in metadata:
record_id += f"|{metadata[KEY_DATE]}"
else:
record_id += "|"

record_id += f" {KEY_BARCODE}={barcode}"
record_id += f" {KEY_REFERENCE}={ref}"
record_id += f" {KEY_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"
record_id += f" {KEY_BARCODE}={barcode}"
record_id += f" {KEY_REFERENCE}={ref}"
record_id += f" {KEY_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"

if runname:
record_id += f" {KEY_RUNNAME}={runname}"
if runname:
record_id += f" {KEY_RUNNAME}={runname}"

if "Sabin" in ref:
record_id += f" {KEY_VARIANT_COUNT}={var_count}"
record_id += f" {KEY_VARIANTS}={var_string}"
else:
record_id += f" {KEY_VARIANT_COUNT}=NA"
record_id += f" {KEY_VARIANTS}=NA"
if "Sabin" in ref:
record_id += f" {KEY_VARIANT_COUNT}={var_count}"
record_id += f" {KEY_VARIANTS}={var_string}"
else:
record_id += f" {KEY_VARIANT_COUNT}=NA"
record_id += f" {KEY_VARIANTS}=NA"

if all_metdata:

for col in metadata:
if col != KEY_SAMPLE and col != KEY_BARCODE:
record_id += f" {col}={metadata[col]}"
"""
record header is:
>SAMPLE|REFERENCE_GROUP|CNS_ID|EPID|DATE barcode=barcode01 variant_count=8 variants=17:CT;161:CT;427:GA;497:AC;507:CT;772:AG;822:CT;870:CA
if all_metdata:
for col in metadata:
if col != KEY_SAMPLE and col != KEY_BARCODE:
record_id += f" {col}={metadata[col]}"
"""
record header is:
>SAMPLE|REFERENCE_GROUP|CNS_ID|EPID|DATE barcode=barcode01 variant_count=8 variants=17:CT;161:CT;427:GA;497:AC;507:CT;772:AG;822:CT;870:CA
if "all_metadata" then everything else gets added to the description
"""
fw.write(f">{record_id}\n{record.seq}\n")
handle_dict[barcode].write(f">{record_id}\n{record.seq}\n")
if "all_metadata" then everything else gets added to the description
"""
fw.write(f">{record_id}\n{record.seq}\n")
handle_dict[barcode].write(f">{record_id}\n{record.seq}\n")

for handle in handle_dict:
handle_dict[handle].close()
Expand Down
Loading

0 comments on commit 02017ea

Please sign in to comment.