Commit

Merge pull request #57 from opentargets/il-panelapp-parser
[Il panelapp parser] Version for release before rewrite
ireneisdoomed authored Jan 7, 2021
2 parents e7f839b + e324548 commit 67812ea
Showing 3 changed files with 53 additions and 33 deletions.
36 changes: 33 additions & 3 deletions README.md
@@ -5,9 +5,12 @@ Each folder in module corresponds to a datasource.
In each folder we have one or more standalone python scripts.

Generally these scripts:
-1. map the disease terms (if any) to our ontology, sometimes using [OnToma](https://ontoma.readthedocs.io)
-2. save the mappings in https://github.com/opentargets/mappings
-3. Read the **github mappings** to generate evidence objects (JSON strings) according to our JSON schema
+1. map the disease terms (if any) to our ontology in various ways:
+    - by using [OnToma](https://ontoma.readthedocs.io)
+    - by using the [RareDiseasesUtils](https://github.com/opentargets/evidence_datasource_parsers/blob/master/common/RareDiseasesUtils.py) script
+    - by using [Ontology Utils](https://github.com/opentargets/ontology-utils)
+    - by importing manually curated files. Some of these are stored in the [mappings repo](https://github.com/opentargets/mappings)
+2. Once the mapping is handled, evidence objects are generated in the form of JSON strings according to our JSON schema

Code used by more than one script (that does not live in a python package)
is stored in the `common` folder and imported as follows:
@@ -86,6 +89,7 @@ To use the parser configure the python environment and run it as follows:
```

### Gene2Phenotype

The Gene2Phenotype parser processes the four gene panels (Developmental Disorders - DD, eye disorders, skin disorders and cancer) that can be downloaded from https://www.ebi.ac.uk/gene2phenotype/downloads/.

The mapping of the diseases, i.e. the "disease name" column, is done on the fly using [OnToma](https://pypi.org/project/ontoma/):
@@ -113,6 +117,31 @@ To use the parser configure the python environment and run it as follows:
(venv)$ python3 modules/Gene2Phenotype.py -s 1.7.1 -v 2020-08-19 -d DDG2P_19_8_2020.csv.gz -e EyeG2P_19_8_2020.csv.gz -k SkinG2P_19_8_2020.csv.gz -c CancerG2P_19_8_2020.csv.gz -o gene2phenotype-19-08-2020.json -u gene2phenotype-19-08-2020_unmapped_diseases.txt
```

### Genomics England Panel App

The Genomics England parser processes the associations between genes and diseases described in the _Gene Panels Data_ table. This data is provided by Genomics England and can be downloaded [here](https://storage.googleapis.com/otar000-evidence_input/PanelApp/20.11/All_genes_20200928-1959.tsv) from the _otar000-evidence_input_ bucket.

The source table is then formatted into a compressed set of JSON lines following the schema of the version to be used.
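
As a rough illustration of what a compressed set of JSON lines looks like, the sketch below writes one JSON object per line into a gzip file and reads it back. The helper names `write_json_lines` and `read_json_lines` and the record fields are hypothetical; they are not taken from the parser.

```python
import gzip
import json


def write_json_lines(records, path):
    """Write one JSON object per line to a gzip-compressed file (hypothetical helper)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Read the records back, skipping any empty lines (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Keeping one object per line means the output can be processed line by line without loading the whole file.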

The mapping of the diseases is done on the fly using [OnToma](https://pypi.org/project/ontoma/):
1. Exact matches to an EFO term are used directly.
2. Sometimes an OMIM code is present in the disease string. OnToma is then queried for both the OMIM code and the respective disease term. If OnToma returns a fuzzy match for both, the parser checks whether they point to the same EFO term. If they do, the term is treated as an exact match.
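
The two rules above can be sketched roughly as follows. This is not the parser's own code: the function name `cross_reference` and the shape of each hit (a dict with `term` and `quality` keys, or `None` when nothing was found) are assumptions made for the example.

```python
def cross_reference(phenotype_hit, omim_hit):
    """Classify a disease mapping as "match", "fuzzy" or None (unmapped).

    Each hit is assumed to look like {"term": <EFO id>, "quality": "match" or "fuzzy"},
    or to be None when the query returned nothing.
    """
    if phenotype_hit is None:
        return None
    # Rule 1: exact matches are used directly.
    if phenotype_hit["quality"] == "match":
        return "match"
    # Rule 2: a fuzzy phenotype hit is promoted only when the OMIM code
    # also maps fuzzily and both point to the same term.
    if (
        omim_hit is not None
        and omim_hit["quality"] == "fuzzy"
        and omim_hit["term"] == phenotype_hit["term"]
    ):
        return "match"
    return "fuzzy"
```

Two independent fuzzy hits that agree on the term are taken as corroborating evidence, which is why they are promoted to an exact match.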

By default, the results of the disease and code mappings are stored locally as _phenotypes_mapping.json_ and _codes_mapping.json_ respectively. This is intended for analysis purposes and to ease a potential rerun of the parser.

The parser requires three parameters:
- `-i`, `--input_file`: Name of tsv file located in the [Panel App bucket](https://storage.googleapis.com/otar000-evidence_input/PanelApp/20.11/All_genes_20200928-1959.tsv).
- `-o`, `--output_file`: Name of evidence JSON file containing the evidence strings.
- `-s`, `--schema_version`: JSON schema version to use, e.g. 1.7.5. It must be a branch or a tag available in https://github.com/opentargets/json_schema.

There is also an optional parameter to load a dictionary containing the results of querying OnToma with the disease terms:
- `-d`, `--dictionary`: If specified, the disease mappings will be imported from this JSON file.
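
For reference, a command-line interface with these four options could be declared with `argparse` roughly as below. This is a sketch under assumptions: the flag names come from the list above, while `build_parser`, the description and the help texts are hypothetical and may differ from the actual script.

```python
import argparse


def build_parser():
    """Build an argument parser with three required options and one optional one."""
    parser = argparse.ArgumentParser(
        description="Generate evidence strings from a PanelApp TSV file (illustrative)."
    )
    parser.add_argument("-i", "--input_file", required=True, help="Name of the input TSV file.")
    parser.add_argument("-o", "--output_file", required=True, help="Name of the evidence JSON file.")
    parser.add_argument("-s", "--schema_version", required=True, help="JSON schema branch or tag, e.g. 1.7.5.")
    # Optional: when omitted, the value is None and the mappings would be computed from scratch.
    parser.add_argument("-d", "--dictionary", default=None, help="JSON file with precomputed disease mappings.")
    return parser
```

Calling `build_parser().parse_args()` exposes the values as `args.input_file`, `args.output_file`, `args.schema_version` and `args.dictionary`.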

To use the parser configure the python environment and run it as follows:
```bash
(venv)$ python3 modules/GenomicsEnglandPanelApp.py -i All_genes_20200928-1959.tsv -o genomics_england-2021-01-05.json -s 1.7.5 -d disease_queries.json
```

### IntOGen

The intOGen parser generates evidence strings from three files that need to be in the working directory or in the _resources_ directory:
@@ -240,3 +269,4 @@ python ${repo_directory}/modules/GeneticsPortal.py \
```

**Important**: to ensure the resulting JSON is valid against the schema, we are using the [python_jsonschema_objects](https://pypi.org/project/python-jsonschema-objects/0.0.13/) library, which enforces the proper structure. The only caveat is that this library uses draft-4 JSON schema, while our JSON schema is written in draft-7. To resolve this discrepancy, our JSON schema repository has a parallel draft-4 compatible branch that we are using for evidence generation.

41 changes: 20 additions & 21 deletions modules/GenomicsEnglandPanelApp.py
@@ -19,7 +19,9 @@
 }

 class PanelAppEvidenceGenerator():
+
     def __init__(self, schema_version=Config.VALIDATED_AGAINST_SCHEMA_VERSION):
+
         # Build JSON schema url from version
         self.schema_version = schema_version
         schema_url = f"https://raw.githubusercontent.com/opentargets/json_schema/{self.schema_version}/draft4_schemas/opentargets.json"
@@ -48,17 +50,15 @@ def __init__(self, schema_version=Config.VALIDATED_AGAINST_SCHEMA_VERSION):
     def build_publications(dataframe):
         '''
         Populates a dataframe with the publications fetched from the PanelApp API and cleans them to match PubMed IDs.
         Args:
             dataframe (pd.DataFrame): Initial .tsv data converted to a Pandas dataframe
         Returns:
             dataframe (pd.DataFrame): Original dataframe with an additional column: Publications
         '''
         populated_groups = []

         for (PanelId), group in dataframe.groupby("Panel Id"):
             request = PanelAppEvidenceGenerator.publications_from_panel(PanelId)
-            group["Publications"] = group.apply(lambda X: publication_from_symbol(X.Symbol, request), axis=1)
+            group["Publications"] = group.apply(lambda X: PanelAppEvidenceGenerator.publication_from_symbol(X.Symbol, request), axis=1)
             populated_groups.append(group)

         dataframe = pd.concat(populated_groups, ignore_index=True, sort=False)
@@ -80,7 +80,7 @@ def build_publications(dataframe):
     def publications_from_panel(panel_id):
         '''
         Queries the PanelApp API to obtain a list of the publications for every gene within a panel_id
         Args:
             panel_id (str): Panel ID extracted from the "Panel Id" column
         Returns:
@@ -97,7 +97,8 @@ def publications_from_panel(panel_id):
     @staticmethod
     def publication_from_symbol(symbol, response):
         '''
-        Returns the list of publications for a given symbol in a PanelApp query response.
+        Returns the list of publications for a given symbol in a PanelApp query response
         Args:
             symbol (str): Gene symbol extracted from the "Symbol" column
             response (dict): Response of the API containing all genes related to a panel and their publications
@@ -141,7 +142,6 @@ def clean_dataframe(dataframe):

     def ontoma_query(self, iterable, dict_name="ontoma_queries.json"):
         '''
         Queries the OnToma utility to map a phenotype to a disease.
         OnToma is used to query the ontology OBO files, the manual mapping file and the Zooma and OLS APIs.
         Args:
@@ -151,6 +151,7 @@ def ontoma_query(self, iterable, dict_name="ontoma_queries.json"):
             mappings (dict): Output file. Keys: queried term (phenotype or OMIM code), Values: OnToma output
         '''
         mappings = dict()
+
         for e in iterable:
             try:
                 tmp = self.otmap.find_term(e, verbose=True)
@@ -207,7 +208,6 @@ def OMIM_phenotype_xref(phenotype, code, mappings_dict, codes_dict):
     def build_mappings(mappings_dict, dataframe):
         '''
         Populates the dataframe with the mappings resulted from OnToma.
         Args:
             mappings_dict (dict): All mapping results for every phenotype
             dataframe (pd.DataFrame): DataFrame with transformed PanelApp data
@@ -231,7 +231,7 @@ def build_mappings(mappings_dict, dataframe):
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Action"] = match[phenotype]["action"]
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Term"] = match[phenotype]["term"]
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Label"] = match[phenotype]["label"]

         for phenotype in fuzzy.keys():
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Result"] = "fuzzy"
             dataframe.loc[dataframe["Phenotype"] == phenotype, "OnToma Action"] = fuzzy[phenotype]["action"]
@@ -243,9 +243,8 @@ def build_mappings(mappings_dict, dataframe):
     def build_pub_array(self):
         '''
         Takes a list of PMIDs and returns a list of reference dictionaries
         Returns:
-            pub_array (array): List of objects with the reference link to every publication
+            pub_array (array): List of objects with the reference link to every publication
         '''
         pub_array = []

@@ -381,21 +380,18 @@ def get_evidence_string(self, row):
             )
             return evidence.serialize()
         except Exception as e:
             print(e)
             logging.error(f'Evidence generation failed for row: {row.name}')
             raise

     def write_evidence_strings(self, dataframe, mappings_dict):
         '''
         Processing of the dataframe to build all the evidences from its data
         Args:
             dataframe (pd.DataFrame): Initial .tsv file
             mappings_dict (dict): All mapping results for every phenotype
         Returns:
             evidences (array): Object with all the generated evidences strings
         '''

         # Read input file
         dataframe = pd.read_csv(dataframe, sep='\t')

@@ -417,12 +413,12 @@ def write_evidence_strings(self, dataframe, mappings_dict):

         if len(mappings_dict) == 0:
             # Checks whether the dictionary is not provided as a parameter
-            mappings_dict = self.ontoma_query(phenotypes)
+            mappings_dict = self.ontoma_query(phenotypes, dict_name="phenotypes_mapping.json")
             logging.info("Disease mappings completed.")
         else:
             logging.info("Disease mappings imported.")

-        codes_dict = self.ontoma_query(codes)
+        codes_dict = self.ontoma_query(codes, dict_name="codes_mapping.json")

         # Cross-referencing the fuzzy results from the phenotype query and the OMIM code query
         phenotypes_list = dataframe["Phenotype"].to_list()
@@ -460,9 +456,9 @@ def removing_redundant_evidences(evidences):
                 'json_data': evidence
             })
         panelapp_df = pd.DataFrame(parsed_data)

         # Grouping to make the evidence unique: by target, disease and panel id
-        updated_evidences = []
+        updated_evidences = []
         for (target, disease, panel_id), group in panelapp_df.groupby(['target','disease','panel_id']):
             # Extracting evidence data:
             data = group["json_data"].tolist()[0]
@@ -508,14 +504,17 @@ def main():
             mappings_dict = json.load(f)
     else:
         mappings_dict = {}

     # Writing evidence strings into a json file
     evidences = evidence_builder.write_evidence_strings(dataframe, mappings_dict)


     # Exporting the outfile
     with open(output_file, "wt") as f:
-        evidences.apply(lambda x: f.write(str(x)+'\n'))
+        for evidence in evidences:
+            json.dump(evidence, f)
+            f.write('\n')

     logging.info(f"Evidence strings saved into {output_file}. Exiting.")

 if __name__ == '__main__':
-    main()
+    main()
9 changes: 0 additions & 9 deletions settings.py
@@ -86,15 +86,6 @@ class Config:
     G2P_cancer_FILENAME = file_or_resource('CancerG2P_26_3_2020.csv.gz')
     G2P_EVIDENCE_FILENAME = 'gene2phenotype-19-08-2019.json'

-
-    # Genomics England
-    GE_PANEL_MAPPING_FILENAME = file_or_resource('genomicsenglandpanelapp_panelmapping.csv')
-    GE_EVIDENCE_FILENAME = 'genomics_england-17-06-2019.json'
-    GE_LINKOUT_URL = 'https://panelapp.genomicsengland.co.uk/panels/'
-    GE_ZOOMA_DISEASE_MAPPING = 'tmp/zooma_disease_mapping.csv'
-    GE_ZOOMA_DISEASE_MAPPING_NOT_HIGH_CONFIDENT = 'tmp/zooma_disease_mapping_low_confidence.csv'
-    GE_PANEL_VERSION = 'v5.7'
-
     # IntoGEN
     INTOGEN_DRIVER_GENES_FILENAME = file_or_resource('intogen_Compendium_Cancer_Genes.tsv')
     INTOGEN_EVIDENCE_FILENAME = 'intogen-02-02-2020.json'
