Merge pull request #181 from polio-nanopore/phylo-mods

Phylo mods
polio-nanopore · Nov 5, 2023 · 02017ea · 02017ea
2 parents 38fe0b6 + 462d2cd
commit 02017ea
Show file tree

Hide file tree

Showing 11 changed files with 329 additions and 215 deletions.
diff --git a/.github/workflows/piranha.yml b/.github/workflows/piranha.yml
@@ -26,5 +26,5 @@ jobs:
         run: piranha --version
       - name: Run piranha with test data
         run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes01.csv -t 2 2>&1 | tee piranha.log
-      - name: Run piranha with all data
-        run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes.csv -t 2 2>&1 | tee piranha_all.log
+      - name: Run piranha in phylo mode
+        run: piranha -i piranha/test/pak_run/demultiplexed --verbose -b piranha/test/pak_run/barcodes.csv -t 2 -rp -ud -sd piranha/test/supp_data 2>&1 | tee piranha_phylo.log
diff --git a/README.md b/README.md
@@ -367,17 +367,21 @@ and piranha will check which ones you have installed with your version of medaka
 
 ## Optional phylogenetics module **\*NEW FEATURE\***
 
-Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus builing. There are 3 additional dependencies needed if you wish to run this module:
+Piranha allows the user to optionally run a phylogenetics module in addition to variant calling and consensus builing. If you have previously installed piranha, are 3 additional dependencies needed if you wish to run this module:
 - IQTREE2
 - mafft
 - jclusterfunk
+The latest environment file contains this dependencies, so to install you can just update your environment (`conda env update -f environment.yml`) or run using the latest image for piranha GUI.
+
 This module will cluster any consensus sequences generated during the run into `reference_group`, so either `Sabin1-related`, `Sabin2-related`, `Sabin3-related` or `WPV1` and will ultimately build one maximum-likelihood phylogeny for each reference group with consensus sequences in a given sequencing run. To annotate the phylogeny with certain metadata from the barcodes.csv file, specify columns to include with `-pcol/--phylo-metadata-columns`.
 
 Piranha then extracts any relevant reference sequences from the installed reference file (identified by having `display_name=Sabin1-related` in their sequence header, or whichever reference group the relevant phylogeny will be for). 
 
-An optional file of local sequences can be supplied to supplement the phylogenetic analysis with `-ss/--supplementary-sequences`. This file should be in FASTA format, but does not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, this file should have the reference group annotated in the header in the format `display_name=Sabin1-related`, for example.
+An optional set of local sequences can be supplied to supplement the phylogenetic analysis. To supply them to piranha, point to the correct directory using `-sd,--supplementary-datadir`. The sequence files should be in FASTA format, but do not need to be aligned. To allow piranha to assign the sequences to the relevant phylogeny, the sequence files should have the reference group annotated in the header in the format `display_name=Sabin1-related`, for example.
+
+This supplementary sequence files can be accompanied with csv metadata files (one row per supplementary sequence) and this metadata can be included in the final report and annotated onto the phylogenies (`-smcol/--supplementary-metadata-columns`). By default, the metadata is matched to the FASTA sequence name with a column titled `sequence_name` but this header name can be configured by specifying `-smid/--supplementary-metadata-id-column`. 
 
-This supplementary sequence file can be accompanied with a csv metadata file (one row per supplementary sequence) (`-sm/--supplementary-metadata`) and this metadata can be included in the final report and annotated onto the phylogenies (`-smcol/--supplementary-metadata-columns`). By default, the metadata is matched to the FASTA sequence name with a column titled `sequence_name` but this header name can be configured by specifying `-smid/--supplementary-metadata-id-column`
+Piranha will iterate accross the directory supplied and amalgamate the FASTA files, retaining any sequences with `display_name=X` in the header description, where X can be one of `Sabin1-related`, `Sabin2-related`, `Sabin3-related` or `WPV1`. It then will read in every csv file it detects in this directory and attempts to match any metadata to the gathered fasta records. These will be added to the relevant phylogenies.
 
 The phylogenetic pipeline is activated by running with the flag `-rp/--run-phylo`, which then triggers the following analysis steps:
 - Amalgamate the newly generated consensus sequences for all barcodes into their respective reference groups.
@@ -391,6 +395,11 @@ The phylogenetic pipeline is activated by running with the flag `-rp/--run-phylo
 - Annotate the tree newick files with the specified metadata (Default: just whether it's a new consensus sequence or not). 
 - Extract phylogenetic trees and embed in interactive report.
 
+## Update local database  **\*NEW FEATURE\***
+
+If you supply a path to the `-sd,--supplementary-datadir` for the phylogenetics module, you have the option of updating this data directory with the new consesnsus sequences generated during the piranaha analysis. If you run with the `-ud,--update-local-database` flag, piranha will write out the new sequences and any accompanying metadata supplied into the directory provided.
+
+
 ## Output options
 
 By default the output directory will be created in the current working directory and will be named `analysis-YYYY-MM-DD`, where YYYY-MM-DD is today's date. This output can be configured in a number of ways. For example, the prefix `analysis` can be overwritten by using the `-pre/--output-prefix new_prefix` flag (or `output_prefix: new_prefix` in a config file) and this will change the default behaviour to `new_prefix_YYYY-MM-DD`. It's good practice not to include spaces or special characters in your directory names. 

diff --git a/environment.yml b/environment.yml
@@ -5,13 +5,12 @@ channels:
   - defaults
 dependencies:
   - python=3.10
+  - bcftools=1.11
   - coreutils=9.1
+  - iqtree>=2.1
+  - cov-ert::jclusterfunk>=0.0.25
   - mafft
   - minimap2=2.17
   - samtools=1.11
-  - bcftools=1.11
-  - tabix=1.11
   - snakemake-minimal
-  - iqtree>=2.1
-  - mafft
-  - cov-ert::jclusterfunk>=0.0.25
+  - tabix=1.11
diff --git a/piranha/analysis/phylo_functions.py b/piranha/analysis/phylo_functions.py
@@ -14,7 +14,7 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
     seq_clusters = collections.defaultdict(list)
 
     header = VALUE_PHYLO_HEADER
-
+    written = {}
     for record in SeqIO.parse(sample_seqs,KEY_FASTA):
         for ref_group in config[KEY_REFERENCES_FOR_CNS]:
 
@@ -26,8 +26,8 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
             """
 
             fields = record.description.split(" ")
-            record_id = fields[0]
-            record_sample,reference_group,cns_id,epid,sample_date = record_id.split("|")
+
+            record_sample,reference_group,cns_id,epid,sample_date = record.id.split("|")
 
             description_dict = {}
             for field in fields[1:]:
@@ -37,14 +37,12 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
             if ref_group == reference_group:
 
                 new_record = record
+                new_record.description = ""
+                new_record.id = record.id
 
                 barcode = description_dict[KEY_BARCODE]
-                name = record_id
+                name = new_record.id
 
-                new_record.description = name
-                new_record.id = name
-
-                seq_clusters[ref_group].append(new_record)
 
                 seq_metadata[name][KEY_NAME] = name
                 seq_metadata[name][KEY_SAMPLE] = record_sample
@@ -66,6 +64,8 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
                         call = "Sabin-like"
 
                 seq_metadata[name][KEY_CALL] = call
+                seq_clusters[ref_group].append(new_record)
+                continue
 
 
     print(green("Reference groups for phylo pipeline:"))
@@ -74,15 +74,20 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
 
     if supplementary_sequences:
         for record in SeqIO.parse(supplementary_sequences,KEY_FASTA):
-            for ref_group in seq_clusters:
-                if ref_group in record.description:
-                    seq_clusters[ref_group].append(record)
-
-                    seq_metadata[record.id][KEY_NAME] = record.id
-                    seq_metadata[record.id][KEY_SAMPLE] = record.id
-                    seq_metadata[record.id][KEY_SOURCE] = "Background"
-                    seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
-                    seq_metadata[record.id][KEY_CALL] = ref_group
+            if record:
+                for ref_group in seq_clusters:
+                    if ref_group in record.description:
+
+
+                        seq_metadata[record.id][KEY_NAME] = record.id
+                        seq_metadata[record.id][KEY_SAMPLE] = record.id
+                        seq_metadata[record.id][KEY_SOURCE] = "Background"
+                        seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
+                        seq_metadata[record.id][KEY_CALL] = ref_group
+
+                        new_record = record
+                        new_record.description = ""
+                        seq_clusters[ref_group].append(new_record)
 
     if supplementary_metadata:
         with open(supplementary_metadata, "r") as f:
@@ -99,27 +104,29 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
                             seq_metadata[sample][col] = row[col]
 
     for record in SeqIO.parse(reference_sequences, KEY_FASTA):
-        for ref_group in seq_clusters:
-            if ref_group in record.description:
-                seq_clusters[ref_group].append(record)
+        if record:
+            for ref_group in seq_clusters:
+                if ref_group in record.description:
+                    seq_clusters[ref_group].append(record)
 
-                seq_metadata[record.id][KEY_NAME] = record.id
-                seq_metadata[record.id][KEY_SAMPLE] = record.id
-
-                seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
-                seq_metadata[record.id][KEY_CALL] = ref_group
+                    seq_metadata[record.id][KEY_NAME] = record.id
+                    seq_metadata[record.id][KEY_SAMPLE] = record.id
+                    
+                    seq_metadata[record.id][KEY_REFERENCE_GROUP] = ref_group
+                    seq_metadata[record.id][KEY_CALL] = ref_group
 
-                if "Sabin" in record.description:
-                    seq_metadata[record.id][KEY_SOURCE] = "Sabin"
-                else:
-                    seq_metadata[record.id][KEY_SOURCE] = "Reference"
+                    if "Sabin" in record.description:
+                        seq_metadata[record.id][KEY_SOURCE] = "Sabin"
+                    else:
+                        seq_metadata[record.id][KEY_SOURCE] = "Reference"
 
     for record in SeqIO.parse(outgroup_sequences, KEY_FASTA):
         for ref_group in seq_clusters:
             if ref_group in record.description:
                 new_record = record
+                new_record.description = ""
                 new_record.id = "outgroup"
-                new_record.description = "outgroup"
+
                 seq_clusters[ref_group].append(new_record)
 
     with open(barcodes_csv, "r") as f:
@@ -162,7 +169,12 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
                         writer0.writerow(row)
 
             with open(os.path.join(phylo_outdir, f"{i}.fasta"),"w") as fw:
-                SeqIO.write(seq_clusters[i], fw, KEY_FASTA)
+                records = seq_clusters[i]
+                record_dict = {}
+                for record in records:
+                    record_dict[record.id] = record
+                unique_records = [r for r in record_dict.values()]
+                SeqIO.write(unique_records, fw, KEY_FASTA)
 
     tree_annotations = config[KEY_TREE_ANNOTATIONS]
     for i in header:
@@ -172,22 +184,49 @@ def get_seqs_and_clusters(sample_seqs,supplementary_sequences,reference_sequence
 
     return list(seq_clusters.keys()),tree_annotations
 
-def update_local_database(supplementary_sequences,sample_sequences,output_file):
-    with open(output_file,"w") as fw:
-        countall = 0
+def update_local_database(sample_sequences,detailed_csv,new_db_seqs,new_db_metadata,config):
+
+    record_ids = {}
+    with open(new_db_seqs,"w") as fw:
         countnew = 0
-        for record in SeqIO.parse(supplementary_sequences, "fasta"):
-            SeqIO.write(record, fw, "fasta")
-            countall+=1
 
         for record in SeqIO.parse(sample_sequences, "fasta"):
             new_record = record
             desc_list = new_record.description.split(" ")
-            new_desc_list = [i for i in desc_list if not i.startswith("barcode=")]
-            new_record.description = " ".join(new_desc_list)
-            SeqIO.write(new_record, fw, "fasta")
-            countall+=1
-            countnew+=1
+            write_record = True
+
+            for i in desc_list:
+                if i.startswith("variant_count"):
+                    count = int(i.split("=")[1])
+                    if count < 6:
+                        write_record = False
+
+            if write_record:
+                new_desc_list = [i for i in desc_list if not i.startswith("barcode=")]
+                new_record.description = " ".join(new_desc_list)
+
+                SeqIO.write(new_record, fw, "fasta")
+                countnew+=1
+                sample = record.id.split("|")[0]
+                record_ids[record.id] = sample
+
+    with open(new_db_metadata,"w") as fw:
+        with open(detailed_csv,"r") as f:
+            reader = csv.DictReader(f)
+            header = reader.fieldnames
+            header.append(config[KEY_SUPPLEMENTARY_METADATA_ID_COLUMN])
+
+            writer = csv.DictWriter(fw, fieldnames=header, lineterminator="\n")
+            writer.writeheader()
+            sample_data = {}
+            for row in reader:
+                sample = row[KEY_SAMPLE]
+                sample_data[sample] = row
+
+            for record_id in record_ids:
+                sample = record_ids[record_id]
+                row = sample_data[sample]
+                row[config[KEY_SUPPLEMENTARY_METADATA_ID_COLUMN]] = record_id
+                writer.writerow(row)
 
     print(green(f"Local database updated with ")+ f"{countnew}"+ green(" newly generated records."))
-    print(green(f"Total records in local database:"), countall)
diff --git a/piranha/analysis/stool_functions.py b/piranha/analysis/stool_functions.py
@@ -33,58 +33,59 @@ def gather_fasta_files(summary_info, barcodes_csv, input_cns_list,all_metdata,ru
         cns_counter = collections.Counter()
         for cns_file in input_cns_list:
             for record in SeqIO.parse(cns_file, KEY_FASTA):
-                cns_info= record.description.split(" ")
-                ref,barcode,var_count,var_string=cns_info[0].split("|")
-
-                info = []
-                for row in analysis_info[barcode]:
-                    if row[KEY_REFERENCE] == ref:
-                        info = row
+                if record:
+                    cns_info= record.description.split(" ")
+                    ref,barcode,var_count,var_string=cns_info[0].split("|")
+
+                    info = []
+                    for row in analysis_info[barcode]:
+                        if row[KEY_REFERENCE] == ref:
+                            info = row
 
-                metadata = input_metadata[barcode]
+                    metadata = input_metadata[barcode]
 
-                record_id = f"{metadata[KEY_SAMPLE]}|{info[KEY_REFERENCE_GROUP]}"
-                cns_counter[record_id] += 1
+                    record_id = f"{metadata[KEY_SAMPLE]}|{info[KEY_REFERENCE_GROUP]}"
+                    cns_counter[record_id] += 1
 
-                record_id += f"|CNS{cns_counter[record_id]}"
+                    record_id += f"|CNS{cns_counter[record_id]}"
 
-                if KEY_EPID in metadata:
-                    record_id += f"|{metadata[KEY_EPID]}"
-                else:
-                    record_id += "|"
+                    if KEY_EPID in metadata:
+                        record_id += f"|{metadata[KEY_EPID]}"
+                    else:
+                        record_id += "|"
 
-                if KEY_DATE in metadata:
-                    record_id += f"|{metadata[KEY_DATE]}"
-                else:
-                    record_id += "|"
+                    if KEY_DATE in metadata:
+                        record_id += f"|{metadata[KEY_DATE]}"
+                    else:
+                        record_id += "|"
 
-                record_id += f" {KEY_BARCODE}={barcode}"
-                record_id += f" {KEY_REFERENCE}={ref}"
-                record_id += f" {KEY_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"
+                    record_id += f" {KEY_BARCODE}={barcode}"
+                    record_id += f" {KEY_REFERENCE}={ref}"
+                    record_id += f" {KEY_REFERENCE_MATCH_FIELD}={info[KEY_REFERENCE_GROUP]}"
 
-                if runname:
-                    record_id += f" {KEY_RUNNAME}={runname}"
+                    if runname:
+                        record_id += f" {KEY_RUNNAME}={runname}"
 
-                if "Sabin" in ref:
-                    record_id += f" {KEY_VARIANT_COUNT}={var_count}"
-                    record_id += f" {KEY_VARIANTS}={var_string}"
-                else:
-                    record_id += f" {KEY_VARIANT_COUNT}=NA"
-                    record_id += f" {KEY_VARIANTS}=NA"
+                    if "Sabin" in ref:
+                        record_id += f" {KEY_VARIANT_COUNT}={var_count}"
+                        record_id += f" {KEY_VARIANTS}={var_string}"
+                    else:
+                        record_id += f" {KEY_VARIANT_COUNT}=NA"
+                        record_id += f" {KEY_VARIANTS}=NA"
 
-                if all_metdata:
-
-                    for col in metadata:
-                        if col != KEY_SAMPLE and col != KEY_BARCODE:
-                            record_id += f" {col}={metadata[col]}"
-                """
-                record header is:
-                >SAMPLE|REFERENCE_GROUP|CNS_ID|EPID|DATE barcode=barcode01 variant_count=8 variants=17:CT;161:CT;427:GA;497:AC;507:CT;772:AG;822:CT;870:CA 
+                    if all_metdata:
+                        
+                        for col in metadata:
+                            if col != KEY_SAMPLE and col != KEY_BARCODE:
+                                record_id += f" {col}={metadata[col]}"
+                    """
+                    record header is:
+                    >SAMPLE|REFERENCE_GROUP|CNS_ID|EPID|DATE barcode=barcode01 variant_count=8 variants=17:CT;161:CT;427:GA;497:AC;507:CT;772:AG;822:CT;870:CA 
 
-                if "all_metadata" then everything else gets added to the description
-                """
-                fw.write(f">{record_id}\n{record.seq}\n")
-                handle_dict[barcode].write(f">{record_id}\n{record.seq}\n")
+                    if "all_metadata" then everything else gets added to the description
+                    """
+                    fw.write(f">{record_id}\n{record.seq}\n")
+                    handle_dict[barcode].write(f">{record_id}\n{record.seq}\n")
 
     for handle in handle_dict:
         handle_dict[handle].close()