Phylo mods #181

Merged
merged 24 commits into from Nov 5, 2023

Changes from all commits · 24 commits
9338444
Initial script upload - no integration attempted
ZoeVance Oct 31, 2023
850740f
Fixing vcf parsing bug
ZoeVance Oct 31, 2023
55ce68b
Tidying python scripts
ZoeVance Oct 31, 2023
31fc728
adding env updates with version numbers
aineniamh Nov 3, 2023
98b3727
inital key for haplo pipe
aineniamh Nov 3, 2023
9804608
basic boilerplate pipeline
aineniamh Nov 3, 2023
0f88194
changing CLI from 2 args with fa and csv to just a single directory t…
aineniamh Nov 3, 2023
edbd6c1
modifying the phylo input qc and parsing to now parse through a direc…
aineniamh Nov 3, 2023
8116a5b
adding new phylo keys
aineniamh Nov 3, 2023
982fe97
moving update db code to after def of detailed csv. now call to updat…
aineniamh Nov 3, 2023
fef65b2
mod function for updating local database- now writes out fa and metad…
aineniamh Nov 3, 2023
c850c35
adding new keys to config
aineniamh Nov 3, 2023
a0db8d3
writing to correct out files
aineniamh Nov 3, 2023
bcaae8a
writing to amalgamated metadata file
aineniamh Nov 3, 2023
2bf7d89
changing to directory structure for eg supp data
aineniamh Nov 3, 2023
a725b09
updating phylo pipeline for unique and only record id in tree- annota…
aineniamh Nov 4, 2023
90a11e6
mod for name
aineniamh Nov 4, 2023
2240b47
update readme
aineniamh Nov 5, 2023
eee4df2
only writing seqs if more than 5 SNPs
aineniamh Nov 5, 2023
3b2e893
phylo mode to github actions
aineniamh Nov 5, 2023
5c89029
rm env for haplot
aineniamh Nov 5, 2023
d3dca41
rm haplo dev from this branch
aineniamh Nov 5, 2023
06835ce
reorder phylo input parsing so that it has access to runname and today
aineniamh Nov 5, 2023
462d2cd
pass over data fasta file matching runname.today.fasta
aineniamh Nov 5, 2023
4 changes: 2 additions & 2 deletions .github/workflows/piranha.yml
@@ -26,5 +26,5 @@ jobs:
run: piranha --version
- name: Run piranha with test data
run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes01.csv -t 2 2>&1 | tee piranha.log
- name: Run piranha with all data
run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes.csv -t 2 2>&1 | tee piranha_all.log
- name: Run piranha in phylo mode
run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes.csv -t 2 -rp -ud -sd piranha/test/supp_data 2>&1 | tee piranha_phylo.log
15 changes: 12 additions & 3 deletions README.md
@@ -364,17 +364,21 @@ and piranha will check which ones you have installed with your version of medaka

## Optional phylogenetics module **\*NEW FEATURE\***

Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus building. There are 3 additional dependencies needed if you wish to run this module:
Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus building. If you have previously installed piranha, there are 3 additional dependencies needed if you wish to run this module:
- IQTREE2
- mafft
- jclusterfunk
The latest environment file contains these dependencies, so to install them you can just update your environment (`conda env update -f environment.yml`) or run using the latest image for piranha GUI.

This module will cluster any consensus sequences generated during the run by `reference_group` (either `Sabin1-related`, `Sabin2-related`, `Sabin3-related` or `WPV1`) and will ultimately build one maximum-likelihood phylogeny for each reference group with consensus sequences in a given sequencing run. To annotate the phylogeny with certain metadata from the barcodes.csv file, specify columns to include with `-pcol/--phylo-metadata-columns`.

Piranha then extracts any relevant reference sequences from the installed reference file (identified by having `display_name=Sabin1-related` in their sequence header, or whichever reference group the relevant phylogeny will be for).

An optional file of local sequences can be supplied to supplement the phylogenetic analysis with `-ss/--supplementary-sequences`. This file should be in FASTA format, but does not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, this file should have the reference group annotated in the header in the format `display_name=Sabin1-related`, for example.
An optional set of local sequences can be supplied to supplement the phylogenetic analysis. To supply them to piranha, point to the correct directory using `-sd/--supplementary-datadir`. The sequence files should be in FASTA format, but do not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, the sequence files should have the reference group annotated in the header in the format `display_name=Sabin1-related`, for example.

These supplementary sequence files can be accompanied by CSV metadata files (one row per supplementary sequence), and this metadata can be included in the final report and annotated onto the phylogenies (`-smcol/--supplementary-metadata-columns`). By default, the metadata is matched to the FASTA sequence name with a column titled `sequence_name`, but this header name can be configured by specifying `-smid/--supplementary-metadata-id-column`.

This supplementary sequence file can be accompanied by a csv metadata file (one row per supplementary sequence; `-sm/--supplementary-metadata`) and this metadata can be included in the final report and annotated onto the phylogenies (`-smcol/--supplementary-metadata-columns`). By default, the metadata is matched to the FASTA sequence name with a column titled `sequence_name` but this header name can be configured by specifying `-smid/--supplementary-metadata-id-column`.
Piranha will iterate across the directory supplied and amalgamate the FASTA files, retaining any sequences with `display_name=X` in the header description, where X can be one of `Sabin1-related`, `Sabin2-related`, `Sabin3-related` or `WPV1`. It will then read in every CSV file it detects in this directory and attempt to match any metadata to the gathered FASTA records. These will be added to the relevant phylogenies.
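
As a rough illustration of this gathering step, here is a minimal sketch (assuming Biopython; the function name, file-matching patterns and return structure are hypothetical, not piranha's internal API):

```python
import csv
from pathlib import Path

from Bio import SeqIO

REFERENCE_GROUPS = ["Sabin1-related", "Sabin2-related", "Sabin3-related", "WPV1"]

def gather_supplementary_data(datadir, id_column="sequence_name"):
    """Collect reference-group-annotated FASTA records and matching CSV metadata."""
    records, metadata = {}, {}
    # gather every FASTA file in the supplied directory (recursion assumed)
    for fasta in Path(datadir).glob("**/*.fasta"):
        for record in SeqIO.parse(str(fasta), "fasta"):
            # keep only records annotated with a recognised reference group
            if any(f"display_name={group}" in record.description
                   for group in REFERENCE_GROUPS):
                records[record.id] = record
    for csv_path in Path(datadir).glob("**/*.csv"):
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                # match metadata rows to gathered records by sequence name
                if row.get(id_column) in records:
                    metadata[row[id_column]] = row
    return records, metadata
```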

The phylogenetic pipeline is activated by running with the flag `-rp/--run-phylo`, which then triggers the following analysis steps:
- Amalgamate the newly generated consensus sequences for all barcodes into their respective reference groups.
@@ -388,6 +392,11 @@ The phylogenetic pipeline is activated by running with the flag `-rp/--run-phylo`
- Annotate the tree newick files with the specified metadata (Default: just whether it's a new consensus sequence or not).
- Extract phylogenetic trees and embed in interactive report.

## Update local database **\*NEW FEATURE\***

If you supply a path to `-sd/--supplementary-datadir` for the phylogenetics module, you have the option of updating this data directory with the new consensus sequences generated during the piranha analysis. If you run with the `-ud/--update-local-database` flag, piranha will write out the new sequences and any accompanying metadata supplied into the directory provided.
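
The logic behind this lives in `update_local_database` in this PR; the following is a condensed, renamed sketch of the sequence-filtering side only (metadata handling omitted, assuming Biopython):

```python
from Bio import SeqIO

def write_new_db_seqs(sample_sequences, new_db_seqs, min_variants=6):
    """Write this run's consensus sequences into the local database FASTA."""
    kept = []
    for record in SeqIO.parse(sample_sequences, "fasta"):
        fields = record.description.split(" ")
        keep = True
        for field in fields:
            # only keep sequences with more than 5 SNPs vs the reference;
            # variant_count can be "NA" for non-Sabin references
            if field.startswith("variant_count="):
                value = field.split("=")[1]
                if value.isdigit() and int(value) < min_variants:
                    keep = False
        if keep:
            # drop the run-specific barcode field from the stored header
            record.description = " ".join(
                f for f in fields if not f.startswith("barcode=")
            )
            kept.append(record)
    SeqIO.write(kept, new_db_seqs, "fasta")
    return len(kept)
```

Only sequences with more than 5 SNPs relative to their reference are retained, matching commit eee4df2 ("only writing seqs if more than 5 SNPs").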


## Output options

By default the output directory will be created in the current working directory and will be named `analysis-YYYY-MM-DD`, where YYYY-MM-DD is today's date. This output can be configured in a number of ways. For example, the prefix `analysis` can be overwritten by using the `-pre/--output-prefix new_prefix` flag (or `output_prefix: new_prefix` in a config file) and this will change the default behaviour to `new_prefix_YYYY-MM-DD`. It's good practice not to include spaces or special characters in your directory names.
9 changes: 4 additions & 5 deletions environment.yml
@@ -5,13 +5,12 @@ channels:
- defaults
dependencies:
- python=3.10
- bcftools=1.11
- coreutils=9.1
- iqtree>=2.1
- cov-ert::jclusterfunk>=0.0.25
- mafft
- minimap2=2.17
- samtools=1.11
- bcftools=1.11
- tabix=1.11
- snakemake-minimal
- iqtree>=2.1
- mafft
- cov-ert::jclusterfunk>=0.0.25
- tabix=1.11
125 changes: 82 additions & 43 deletions piranha/analysis/phylo_functions.py
@@ -14,7 +14,7 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
seq_clusters = collections.defaultdict(list)

header = VALUE_PHYLO_HEADER

written = {}
for record in SeqIO.parse(sample_seqs,KEY_FASTA):
for ref_group in config[KEY_REFERENCES_FOR_CNS]:

@@ -26,8 +26,8 @@
"""

fields = record.description.split(" ")
record_id = fields[0]
record_sample,reference_group,cns_id,epid,sample_date = record_id.split("|")

record_sample,reference_group,cns_id,epid,sample_date = record.id.split("|")

description_dict = {}
for field in fields[1:]:
@@ -37,14 +37,12 @@
if ref_group == reference_group:

new_record = record
new_record.description = ""
new_record.id = record.id

barcode = description_dict[KEY_BARCODE]
name = record_id
name = new_record.id

new_record.description = name
new_record.id = name

seq_clusters[ref_group].append(new_record)

seq_metadata[name][KEY_NAME] = name
seq_metadata[name][KEY_SAMPLE] = record_sample
@@ -66,6 +64,8 @@
call = "Sabin-like"

seq_metadata[name][KEY_CALL] = call
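# metadata for this record is complete: append it to its reference-group cluster and move on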
seq_clusters[ref_group].append(new_record)
continue


print(green("Reference groups for phylo pipeline:"))
@@ -74,15 +74,20 @@

if supplementary_sequences:
for record in SeqIO.parse(supplementary_sequences,KEY_FASTA):
for ref_group in seq_clusters:
if ref_group in record.description:
seq_clusters[ref_group].append(record)

seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id
seq_metadata[record.id][KEY_SOURCE] = "Background"
seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group
if record:
for ref_group in seq_clusters:
if ref_group in record.description:


seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id
seq_metadata[record.id][KEY_SOURCE] = "Background"
seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group

new_record = record
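# clear the description so only the record id appears in the cluster FASTA header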
new_record.description = ""
seq_clusters[ref_group].append(new_record)

if supplementary_metadata:
with open(supplementary_metadata, "r") as f:
@@ -99,27 +104,29 @@
seq_metadata[sample][col] = row[col]

for record in SeqIO.parse(reference_sequences, KEY_FASTA):
for ref_group in seq_clusters:
if ref_group in record.description:
seq_clusters[ref_group].append(record)
if record:
for ref_group in seq_clusters:
if ref_group in record.description:
seq_clusters[ref_group].append(record)

seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id

seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group
seq_metadata[record.id][KEY_NAME] = record.id
seq_metadata[record.id][KEY_SAMPLE] = record.id
seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
seq_metadata[record.id][KEY_CALL] = ref_group

if "Sabin" in record.description:
seq_metadata[record.id][KEY_SOURCE] = "Sabin"
else:
seq_metadata[record.id][KEY_SOURCE] = "Reference"
if "Sabin" in record.description:
seq_metadata[record.id][KEY_SOURCE] = "Sabin"
else:
seq_metadata[record.id][KEY_SOURCE] = "Reference"

for record in SeqIO.parse(outgroup_sequences, KEY_FASTA):
for ref_group in seq_clusters:
if ref_group in record.description:
new_record = record
new_record.description = ""
new_record.id = "outgroup"
new_record.description = "outgroup"

seq_clusters[ref_group].append(new_record)

with open(barcodes_csv, "r") as f:
Expand Down Expand Up @@ -162,7 +169,12 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
writer0.writerow(row)

with open(os.path.join(phylo_outdir, f"{i}.fasta"),"w") as fw:
SeqIO.write(seq_clusters[i], fw, KEY_FASTA)
records = seq_clusters[i]
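# deduplicate by record id so each sequence is written to the cluster FASTA only once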
record_dict = {}
for record in records:
record_dict[record.id] = record
unique_records = [r for r in record_dict.values()]
SeqIO.write(unique_records, fw, KEY_FASTA)

tree_annotations = config[KEY_TREE_ANNOTATIONS]
for i in header:
@@ -172,22 +184,49 @@

return list(seq_clusters.keys()),tree_annotations

def update_local_database(supplementary_sequences,sample_sequences,output_file):
with open(output_file,"w") as fw:
countall = 0
def update_local_database(sample_sequences,detailed_csv,new_db_seqs,new_db_metadata,config):

record_ids = {}
with open(new_db_seqs,"w") as fw:
countnew = 0
for record in SeqIO.parse(supplementary_sequences, "fasta"):
SeqIO.write(record, fw, "fasta")
countall+=1

for record in SeqIO.parse(sample_sequences, "fasta"):
new_record = record
desc_list = new_record.description.split(" ")
new_desc_list = [i for i in desc_list if not i.startswith("barcode=")]
new_record.description = " ".join(new_desc_list)
SeqIO.write(new_record, fw, "fasta")
countall+=1
countnew+=1
write_record = True
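# drop records with fewer than 6 variants: only sequences with more than 5 SNPs are written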

for i in desc_list:
if i.startswith("variant_count"):
count = int(i.split("=")[1])
if count < 6:
write_record = False

if write_record:
new_desc_list = [i for i in desc_list if not i.startswith("barcode=")]
new_record.description = " ".join(new_desc_list)

SeqIO.write(new_record, fw, "fasta")
countnew+=1
sample = record.id.split("|")[0]
record_ids[record.id] = sample

with open(new_db_metadata,"w") as fw:
with open(detailed_csv,"r") as f:
reader = csv.DictReader(f)
header = reader.fieldnames
header.append(config[KEY_SUPPLEMENTARY_METADATA_ID_COLUMN])

writer = csv.DictWriter(fw, fieldnames=header, lineterminator="\n")
writer.writeheader()
sample_data = {}
for row in reader:
sample = row[KEY_SAMPLE]
sample_data[sample] = row

for record_id in record_ids:
sample = record_ids[record_id]
row = sample_data[sample]
row[config[KEY_SUPPLEMENTARY_METADATA_ID_COLUMN]] = record_id
writer.writerow(row)

print(green(f"Local database updated with ")+ f"{countnew}"+ green(" newly generated records."))
print(green(f"Total records in local database:"), countall)
85 changes: 43 additions & 42 deletions piranha/analysis/stool_functions.py
@@ -33,58 +33,59 @@ def gather_fasta_files(summary_info, barcodes_csv, input_cns_list,all_metdata,ru
cns_counter = collections.Counter()
for cns_file in input_cns_list:
for record in SeqIO.parse(cns_file, KEY_FASTA):
cns_info= record.description.split(" ")
ref,barcode,var_count,var_string=cns_info[0].split("|")

info = []
for row in analysis_info[barcode]:
if row[KEY_REFERENCE] == ref:
info = row
if record:
cns_info= record.description.split(" ")
ref,barcode,var_count,var_string=cns_info[0].split("|")

info = []
for row in analysis_info[barcode]:
if row[KEY_REFERENCE] == ref:
info = row

metadata = input_metadata[barcode]
metadata = input_metadata[barcode]

record_id = f"{metadata[KEY_SAMPLE]}|{info[KEY_REFERENCE_GROUP]}"
cns_counter[record_id] += 1
record_id = f"{metadata[KEY_SAMPLE]}|{info[KEY_REFERENCE_GROUP]}"
cns_counter[record_id] += 1

record_id += f"|CNS{cns_counter[record_id]}"
record_id += f"|CNS{cns_counter[record_id]}"

if KEY_EPID in metadata:
record_id += f"|{metadata[KEY_EPID]}"
else:
record_id += "|"
if KEY_EPID in metadata:
record_id += f"|{metadata[KEY_EPID]}"
else:
record_id += "|"

if KEY_DATE in metadata:
record_id += f"|{metadata[KEY_DATE]}"
else:
record_id += "|"
if KEY_DATE in metadata:
record_id += f"|{metadata[KEY_DATE]}"
else:
record_id += "|"

record_id += f" {KEY_BARCODE}={barcode}"
record_id += f" {KEY_REFERENCE}={ref}"
record_id += f" {KEY_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"
record_id += f" {KEY_BARCODE}={barcode}"
record_id += f" {KEY_REFERENCE}={ref}"
record_id += f" {KEY_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"

if runname:
record_id += f" {KEY_RUNNAME}={runname}"
if runname:
record_id += f" {KEY_RUNNAME}={runname}"

if "Sabin" in ref:
record_id += f" {KEY_VARIANT_COUNT}={var_count}"
record_id += f" {KEY_VARIANTS}={var_string}"
else:
record_id += f" {KEY_VARIANT_COUNT}=NA"
record_id += f" {KEY_VARIANTS}=NA"
if "Sabin" in ref:
record_id += f" {KEY_VARIANT_COUNT}={var_count}"
record_id += f" {KEY_VARIANTS}={var_string}"
else:
record_id += f" {KEY_VARIANT_COUNT}=NA"
record_id += f" {KEY_VARIANTS}=NA"

if all_metdata:

for col in metadata:
if col != KEY_SAMPLE and col != KEY_BARCODE:
record_id += f" {col}={metadata[col]}"
"""
record header is:
>SAMPLE|REFERENCE_GROUP|CNS_ID|EPID|DATE barcode=barcode01 variant_count=8 variants=17:CT;161:CT;427:GA;497:AC;507:CT;772:AG;822:CT;870:CA
if all_metdata:
for col in metadata:
if col != KEY_SAMPLE and col != KEY_BARCODE:
record_id += f" {col}={metadata[col]}"
"""
record header is:
>SAMPLE|REFERENCE_GROUP|CNS_ID|EPID|DATE barcode=barcode01 variant_count=8 variants=17:CT;161:CT;427:GA;497:AC;507:CT;772:AG;822:CT;870:CA

if "all_metadata" then everything else gets added to the description
"""
fw.write(f">{record_id}\n{record.seq}\n")
handle_dict[barcode].write(f">{record_id}\n{record.seq}\n")
if "all_metadata" then everything else gets added to the description
"""
fw.write(f">{record_id}\n{record.seq}\n")
handle_dict[barcode].write(f">{record_id}\n{record.seq}\n")

for handle in handle_dict:
handle_dict[handle].close()
Expand Down
Loading
Loading