Merge pull request #57 from opentargets/il-panelapp-parser

[Il panelapp parser] Version for release before rewrite
opentargets · Jan 7, 2021 · 67812ea
2 parents e7f839b + e324548
commit 67812ea
Showing 3 changed files with 53 additions and 33 deletions.
diff --git a/README.md b/README.md
@@ -5,9 +5,12 @@ Each folder in module corresponds to a datasource.
 In each folder we have one or more standalone python scripts.
 
 Generally these scripts:
-1. map the disease terms (if any) to our ontology, sometimes using [OnToma](https://ontoma.readthedocs.io)
-2. save the mappings in https://github.com/opentargets/mappings
-3. Read the **github mappings** to generate evidence objects (JSON strings) according to our JSON schema
+1. map the disease terms (if any) to our ontology in various ways:
+      - by using [OnToma](https://ontoma.readthedocs.io)
+      - by using the [RareDiseasesUtils](https://github.com/opentargets/evidence_datasource_parsers/blob/master/common/RareDiseasesUtils.py) script
+      - by using [Ontology Utils](https://github.com/opentargets/ontology-utils)
+      - by importing manually curated files. Some of these are stored in the [mappings repo](https://github.com/opentargets/mappings)
+2. Once the mapping is handled, evidence objects are generated in the form of JSON strings according to our JSON schema
 
 Code used by more than one script (that does not live in a python package)
 is stored in the `common` folder and imported as follows:
@@ -86,6 +89,7 @@ To use the parser configure the python environment and run it as follows:
 ```
 
 ### Gene2Phenotype
+
 The Gene2Phenotype parser processes the four gene panels (Developmental Disorders - DD, eye disorders, skin disorders and cancer) that can be downloaded from https://www.ebi.ac.uk/gene2phenotype/downloads/.
 
 The mapping of the diseases, i.e. the "disease name" column, is done on the fly using [OnToma](https://pypi.org/project/ontoma/):
@@ -113,6 +117,31 @@ To use the parser configure the python environment and run it as follows:
 (venv)$ python3 modules/Gene2Phenotype.py -s 1.7.1 -v 2020-08-19 -d DDG2P_19_8_2020.csv.gz -e EyeG2P_19_8_2020.csv.gz -k SkinG2P_19_8_2020.csv.gz -c CancerG2P_19_8_2020.csv.gz -o gene2phenotype-19-08-2020.json -u gene2phenotype-19-08-2020_unmapped_diseases.txt 
 ```
 
+### Genomics England Panel App
+
+The Genomics England parser processes the associations between genes and diseases described in the _Gene Panels Data_ table. This data is provided by Genomics England and can be downloaded [here](https://storage.googleapis.com/otar000-evidence_input/PanelApp/20.11/All_genes_20200928-1959.tsv) from the _otar000-evidence_input_ bucket.
+
+The source table is then formatted into a compressed set of JSON lines following the schema of the version to be used.
+
+The mapping of the diseases is done on the fly using [OnToma](https://pypi.org/project/ontoma/):
+1. Exact matches to an EFO term are used directly.
+2. The disease string sometimes contains an OMIM code. OnToma is then queried for both the OMIM code and the disease term. If OnToma returns a fuzzy match for both, the parser checks whether they point to the same EFO term. If they do, the term is treated as an exact match.
+
+By default the results of the disease and code mappings are stored locally as _phenotypes_mapping.json_ and _codes_mapping.json_ respectively. This is intended for analysis purposes and to ease a potential rerun of the parser.
+
+The parser requires three parameters:
+- `-i`, `--input_file`: Name of tsv file located in the [Panel App bucket](https://storage.googleapis.com/otar000-evidence_input/PanelApp/20.11/All_genes_20200928-1959.tsv).
+- `-o`, `--output_file`: Name of evidence JSON file containing the evidence strings.
+- `-s`, `--schema_version`: JSON schema version to use, e.g. 1.7.5. It must be a branch or a tag available in https://github.com/opentargets/json_schema.
+
+There is also an optional parameter to load a dictionary containing the results of querying OnToma with the disease terms:
+- `-d`, `--dictionary`: If specified, the disease mappings will be imported from this JSON file.
+
+To use the parser configure the python environment and run it as follows:
+```bash
+(venv)$ python3 modules/GenomicsEnglandPanelApp.py -i All_genes_20200928-1959.tsv -o genomics_england-2021-01-05.json -s 1.7.5 -d disease_queries.json
+```
+
 ### IntOGen
 
 The intOGen parser generates evidence strings from three files that need to be in the working directory or in the _resources_ directory:
@@ -240,3 +269,4 @@ python ${repo_directory}/modules/GeneticsPortal.py \
 ```
 
 **Important**: to ensure the resulting json schema is valid, we are using the [python_jsonschema_objects](https://pypi.org/project/python-jsonschema-objects/0.0.13/) library, which enforces the proper structure. The only caveat is that this library uses draft-4 JSON schema, while our JSON schema is written in draft-7. To resolve this discrepancy, our JSON schema repository has a parallel draft-4 compatible branch that we are using for evidence generation.
+
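
The cross-referencing rule described in the new Genomics England Panel App section of the README (two fuzzy hits that agree on the same term are promoted to an exact match) can be sketched as below. This is a minimal illustration, not the parser's actual `OMIM_phenotype_xref` implementation: the function name `agrees_on_term`, the `quality` key and the example term URLs are assumptions made for the sketch, while `term` is the key the module reads from the OnToma output.

```python
def agrees_on_term(phenotype_hit, code_hit):
    """Return True when both hits are fuzzy and point to the same term.

    Each hit is assumed to be a dict shaped like an OnToma result, with a
    "quality" key ("match" or "fuzzy") and a "term" key (ontology URL).
    A missing hit (None or an empty dict) never agrees with anything.
    """
    if not phenotype_hit or not code_hit:
        return False
    both_fuzzy = (
        phenotype_hit.get("quality") == "fuzzy"
        and code_hit.get("quality") == "fuzzy"
    )
    return both_fuzzy and phenotype_hit.get("term") == code_hit.get("term")


# Hypothetical hits; the term URLs are placeholders, not real mappings.
same_a = {"quality": "fuzzy", "term": "http://example.org/EFO_A"}
same_b = {"quality": "fuzzy", "term": "http://example.org/EFO_A"}
other = {"quality": "fuzzy", "term": "http://example.org/EFO_B"}

promoted = agrees_on_term(same_a, same_b)      # both fuzzy, same term
not_promoted = agrees_on_term(same_a, other)   # both fuzzy, different terms
missing_code = agrees_on_term(same_a, None)    # no OMIM code result
```

Only the first pair would be treated as an exact match under the rule stated in the README; disagreeing or missing results stay fuzzy.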
diff --git a/modules/GenomicsEnglandPanelApp.py b/modules/GenomicsEnglandPanelApp.py
@@ -19,7 +19,9 @@
 }
 
 class PanelAppEvidenceGenerator():
+
     def __init__(self, schema_version=Config.VALIDATED_AGAINST_SCHEMA_VERSION):
+
         # Build JSON schema url from version
         self.schema_version = schema_version
         schema_url = f"https://raw.githubusercontent.com/opentargets/json_schema/{self.schema_version}/draft4_schemas/opentargets.json"
@@ -48,17 +50,15 @@ def __init__(self, schema_version=Config.VALIDATED_AGAINST_SCHEMA_VERSION):
     def build_publications(dataframe):
         '''
         Populates a dataframe with the publications fetched from the PanelApp API and cleans them to match PubMed IDs.
-
         Args:
             dataframe (pd.DataFrame): Initial .tsv data converted to a Pandas dataframe
         Returns:
             dataframe (pd.DataFrame): Original dataframe with an additional column: Publications
         '''
         populated_groups = []
-
         for (PanelId), group in dataframe.groupby("Panel Id"):
             request = PanelAppEvidenceGenerator.publications_from_panel(PanelId)
-            group["Publications"] = group.apply(lambda X: publication_from_symbol(X.Symbol, request), axis=1)
+            group["Publications"] = group.apply(lambda X: PanelAppEvidenceGenerator.publication_from_symbol(X.Symbol, request), axis=1)
             populated_groups.append(group)
 
         dataframe = pd.concat(populated_groups, ignore_index=True, sort=False)
@@ -80,7 +80,7 @@ def build_publications(dataframe):
     def publications_from_panel(panel_id):
         '''
         Queries the PanelApp API to obtain a list of the publications for every gene within a panel_id
-        
+
         Args:
             panel_id (str): Panel ID extracted from the "Panel Id" column
         Returns:
@@ -97,7 +97,8 @@ def publications_from_panel(panel_id):
     @staticmethod
     def publication_from_symbol(symbol, response):
         '''
-        Returns the list of publications for a given symbol in a PanelApp query response.
+        Returns the list of publications for a given symbol in a PanelApp query response
+
         Args:
             symbol (str): Gene symbol extracted from the "Symbol" column
             response (dict): Response of the API containing all genes related to a panel and their publications
@@ -141,7 +142,6 @@ def clean_dataframe(dataframe):
 
     def ontoma_query(self, iterable, dict_name="ontoma_queries.json"):
         '''
-        Queries the OnToma utility to map a phenotype to a disease.
         OnToma is used to query the ontology OBO files, the manual mapping file and the Zooma and OLS APIs.
 
         Args:
@@ -151,6 +151,7 @@ def ontoma_query(self, iterable, dict_name="ontoma_queries.json"):
             mappings (dict): Output file. Keys: queried term (phenotype or OMIM code), Values: OnToma output
         '''
         mappings = dict()
+
         for e in iterable:
             try:
                 tmp = self.otmap.find_term(e, verbose=True)
@@ -207,7 +208,6 @@ def OMIM_phenotype_xref(phenotype, code, mappings_dict, codes_dict):
     def build_mappings(mappings_dict, dataframe):
         '''
         Populates the dataframe with the mappings resulted from OnToma.
-
         Args:
             mappings_dict (dict): All mapping results for every phenotype
             dataframe (pd.DataFrame): DataFrame with transformed PanelApp data
@@ -231,7 +231,7 @@ def build_mappings(mappings_dict, dataframe):
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Action"] = match[phenotype]["action"]
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Term"] = match[phenotype]["term"]
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Label"] = match[phenotype]["label"]
-
+            
         for phenotype in fuzzy.keys():
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Result"] = "fuzzy"
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Action"] = fuzzy[phenotype]["action"]
@@ -243,9 +243,8 @@ def build_mappings(mappings_dict, dataframe):
     def build_pub_array(self):
         '''
         Takes a list of PMIDs and returns a list of reference dictionaries
-
         Returns:
-            pub_array (array): List of objects with the reference link to every publication    
+            pub_array (array): List of objects with the reference link to every publication   
         '''
         pub_array = []
 
@@ -381,21 +380,18 @@ def get_evidence_string(self, row):
             )
             return evidence.serialize()
         except Exception as e:
-            print(e)
             logging.error(f'Evidence generation failed for row: {row.name}')
             raise
 
     def write_evidence_strings(self, dataframe, mappings_dict):
         '''
         Processing of the dataframe to build all the evidences from its data
-
         Args:
             dataframe (pd.DataFrame): Initial .tsv file
             mappings_dict (dict): All mapping results for every phenotype
         Returns:
             evidences (array): Object with all the generated evidences strings
         '''
-
         # Read input file
         dataframe = pd.read_csv(dataframe, sep='\t')
 
@@ -417,12 +413,12 @@ def write_evidence_strings(self, dataframe, mappings_dict):
 
         if len(mappings_dict) == 0:
             # Checks whether the dictionary is not provided as a parameter 
-            mappings_dict = self.ontoma_query(phenotypes)
+            mappings_dict = self.ontoma_query(phenotypes, dict_name="phenotypes_mapping.json")
             logging.info("Disease mappings completed.")
         else:
             logging.info("Disease mappings imported.")
 
-        codes_dict = self.ontoma_query(codes)
+        codes_dict = self.ontoma_query(codes, dict_name="codes_mapping.json")
 
         # Cross-referencing the fuzzy results from the phenotype query and the OMIM code query
         phenotypes_list = dataframe["Phenotype"].to_list()
@@ -460,9 +456,9 @@ def removing_redundant_evidences(evidences):
                 'json_data': evidence
             })
         panelapp_df = pd.DataFrame(parsed_data)  
-        
+
         # Grouping to make the evidence unique: by target, disease and panel id
-        updated_evidences = []
+        updated_evidences = []  
         for (target, disease, panel_id), group in panelapp_df.groupby(['target','disease','panel_id']):
             # Extracting evidence data:
             data = group["json_data"].tolist()[0]
@@ -508,14 +504,17 @@ def main():
                 mappings_dict = json.load(f)
     else:
         mappings_dict = {}
-    
+
     # Writing evidence strings into a json file
     evidences = evidence_builder.write_evidence_strings(dataframe, mappings_dict)
-
+
+    # Exporting the outfile
     with open(output_file, "wt") as f:
-        evidences.apply(lambda x: f.write(str(x)+'\n'))
+        for evidence in evidences:
+            json.dump(evidence, f)
+            f.write('\n')
 
     logging.info(f"Evidence strings saved into {output_file}. Exiting.")
 
 if __name__ == '__main__':
-    main()
+    main()
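
The export change in `main()` above replaces `str(x)` with `json.dump`. A minimal sketch of why that matters, assuming each evidence is a plain dict (the diff does not show the type returned by `write_evidence_strings`): `str()` on a dict produces Python's repr with single quotes, which is not valid JSON, whereas `json.dump` writes one parseable object per line. The helper name `write_json_lines` and the sample records are hypothetical.

```python
import io
import json


def write_json_lines(evidences, handle):
    """Write one JSON document per line to an open text handle."""
    for evidence in evidences:
        json.dump(evidence, handle)
        handle.write("\n")


# Hypothetical records standing in for evidence strings.
records = [{"id": 1}, {"id": 2}]

buffer = io.StringIO()
write_json_lines(records, buffer)
lines = buffer.getvalue().splitlines()

# Every line parses back to the original record.
round_trip = [json.loads(line) for line in lines]

# For comparison: str() yields a Python repr, which json.loads rejects.
repr_line = str(records[0])
```

An in-memory `io.StringIO` is used here so the sketch runs without touching the file system; the same helper works with a handle returned by `open(output_file, "wt")`.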
diff --git a/settings.py b/settings.py
@@ -86,15 +86,6 @@ class Config:
     G2P_cancer_FILENAME = file_or_resource('CancerG2P_26_3_2020.csv.gz')
     G2P_EVIDENCE_FILENAME = 'gene2phenotype-19-08-2019.json'
 
-
-    # Genomics England
-    GE_PANEL_MAPPING_FILENAME = file_or_resource('genomicsenglandpanelapp_panelmapping.csv')
-    GE_EVIDENCE_FILENAME = 'genomics_england-17-06-2019.json'
-    GE_LINKOUT_URL = 'https://panelapp.genomicsengland.co.uk/panels/'
-    GE_ZOOMA_DISEASE_MAPPING = 'tmp/zooma_disease_mapping.csv'
-    GE_ZOOMA_DISEASE_MAPPING_NOT_HIGH_CONFIDENT = 'tmp/zooma_disease_mapping_low_confidence.csv'
-    GE_PANEL_VERSION = 'v5.7'
-
     # IntoGEN
     INTOGEN_DRIVER_GENES_FILENAME = file_or_resource('intogen_Compendium_Cancer_Genes.tsv')
     INTOGEN_EVIDENCE_FILENAME = 'intogen-02-02-2020.json'