Requesting feedback on mapping files #24

aegururaj · 2018-08-01T10:10:24Z

The Oxygen team has generated mapping files that map the metadata to the crosscut metadata model (aka the DATS model). We would like to request feedback and start a discussion on improving the mappings. The PR for the mapping files is dcppc/crosscut-metadata#22.

cmungall · 2018-08-02T16:48:41Z

I looked at AGR_FB_Mapping (I assume the other MOD files are identical since the format is identical across the Alliance files).

I previously expressed some of my concerns here: dcppc/crosscut-metadata#21

I'm not sure the DATS Dimension model is appropriate for representing the Alliance data. Even when representing basic gene information, the mapping is lossy. For example, multiple fields with distinct semantics (displayName, prefix, localid) are all mapped to relatedIdentifiers.

On row 22, type is mapped to title. Is this a mistake?

It looks like the ortholog mapping is lossy; it's not clear how a homology query could be performed on the transformed data.

It may be the case that I am misunderstanding the mappings file. Is there an example DATS JSON file? That would really help.

aegururaj · 2018-08-02T23:37:10Z

@cmungall We have the Elasticsearch endpoint here that has 5 MGI sample DATS JSON files: MGI/5622662, MGI/5622581, MGI/1346023, MGI/106092, MGI/99205

cmungall · 2018-08-02T23:40:06Z

Thanks! Can you provide URLs to get to the JSON?

aegururaj · 2018-08-03T02:13:30Z

Sure, there you go, MGI sample DATS JSON

cmungall · 2018-08-03T04:27:16Z

Thanks again!

It looks like things are not being mapped at the correct level. For example,

"isAbout": [
    {
        "@type": "MolecularEntity",
        "name": "",
        "taxonomy": [
            {
                "@type": "TaxonomicInformation",
                "name": "Zfp58",
                "identifier": {
                    "identifier": "NCBITaxon:10090",
                    "identifierSource": "NCBITaxon:10090"
                },
                "relatedIdentifiers": [
                    {
                        "identifier": "RIKEN cDNA A530094I17 gene",
                        "identifierSource": "RIKEN cDNA A530094I17 gene",
                        "relationType": "RIKEN cDNA A530094I17 gene"
                    }
                ]
            }
        ],

Zfp58 is the gene symbol, not the name of the taxon (which should be Mus musculus). Similarly, the RIKEN identifiers are at the level of the gene, not the taxon.

For the identifiers, there are things like:
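To make the level distinction concrete, here is a minimal sketch of how the same record might be assembled with gene-level fields on the MolecularEntity and taxon-level fields on the TaxonomicInformation. The helper name `build_is_about`, the placement of `relatedIdentifiers` on the MolecularEntity, and the `"synonym"` relation type are assumptions made for illustration; they are not taken from the mapping files or confirmed against the DATS schema.

```python
import json


def build_is_about(gene_symbol, gene_synonyms, taxon_curie, taxon_name):
    """Hypothetical helper: keep gene fields and taxon fields at separate levels."""
    return {
        "@type": "MolecularEntity",
        # Gene-level information lives on the entity itself.
        "name": gene_symbol,
        "relatedIdentifiers": [
            # "synonym" is an assumed relationType, used only for illustration.
            {"identifier": synonym, "relationType": "synonym"}
            for synonym in gene_synonyms
        ],
        "taxonomy": [
            {
                "@type": "TaxonomicInformation",
                # Taxon-level information only: the species name and its ID.
                "name": taxon_name,
                "identifier": {
                    "identifier": taxon_curie,
                    "identifierSource": "NCBITaxon",
                },
            }
        ],
    }


entity = build_is_about(
    gene_symbol="Zfp58",
    gene_synonyms=["RIKEN cDNA A530094I17 gene"],
    taxon_curie="NCBITaxon:10090",
    taxon_name="Mus musculus",
)
print(json.dumps({"isAbout": [entity]}, indent=4))
```

With this layout the taxon block carries only "Mus musculus" and its NCBITaxon identifier, while the symbol and the RIKEN string stay with the gene.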

"relatedIdentifiers": [
    {
        "identifierSource": "MGI",
        "identifier": "MGI:99205",
        "relationType": "gene"
    },
    {
        "identifier": "MGI:99205",
        "relationType": "gene",
        "identifierSource": "MGI"
    },
    {
        "identifierSource": "MGI",
        "identifier": "99205",
        "relationType": "gene"
    },
    {
        "identifierSource": "MGI",
        "identifier": "MGI:99205",
        "relationType": "gene"
    }
],

I suggest having a single canonical identifier and using a CURIE such as MGI:99205, which would facilitate JSON-LD to RDF conversion using a canonical context file.

I'm looking for the homology information; it seems to be embedded inside Material objects:
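As a rough illustration of the suggestion, the sketch below collapses the four entries shown above into one canonical CURIE. The function names `to_curie` and `canonical_identifiers` are hypothetical, and the rule "prepend identifierSource when the identifier lacks that prefix" is an assumption about how the bare `99205` form should be interpreted.

```python
def to_curie(source, identifier):
    """Return PREFIX:LOCAL_ID, adding the prefix only when it is missing."""
    prefix = source + ":"
    return identifier if identifier.startswith(prefix) else prefix + identifier


def canonical_identifiers(related):
    """Normalize each entry to a CURIE and drop exact duplicates, keeping order."""
    seen = set()
    result = []
    for entry in related:
        curie = to_curie(entry["identifierSource"], entry["identifier"])
        key = (curie, entry.get("relationType"))
        if key in seen:
            continue
        seen.add(key)
        result.append(
            {
                "identifier": curie,
                "identifierSource": entry["identifierSource"],
                "relationType": entry.get("relationType"),
            }
        )
    return result


# The four relatedIdentifiers entries from the sample above.
related = [
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
    {"identifier": "MGI:99205", "relationType": "gene", "identifierSource": "MGI"},
    {"identifierSource": "MGI", "identifier": "99205", "relationType": "gene"},
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
]
deduplicated = canonical_identifiers(related)
print(deduplicated)
```

All four entries reduce to a single record whose identifier is MGI:99205.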

"characteristics": [
    {
        "name": "",
        "@type": "Material",
        "identifier": {
            "identifier": "HGNC:28857"
        },
        "values": [
            "low",
            "false",
            "false",
            "13",
            "67500555",
            "67490167"
        ],
        "relatedIdentifiers": [
            {
                "identifier": "ZNF682"
            },
            {
                "identifier": "ZNF675"
            },
            {
                "identifier": "ZNF430"
            },

I don't really know what a Material is here, or what the list of values is intended to represent.

Overall, I'm still not quite sure I grok the data model. Each gene is modeled as a DatasetDistribution. The DatasetDistribution conformsTo an SO type such as 'gene' and isAbout a MolecularEntity (which doesn't have a type field). The MolecularEntity has characteristics that are Materials. Each Material also has identifiers, but these seem to be gene symbols. It looks like the Materials actually represent the orthologous genes, yet there is nothing to indicate that these are homologs, and it's not clear why a gene is a MolecularEntity if it's in the species of interest but a Material if it's in another species.

I'm trying to map this all onto my own mental map of biology and not having much luck.

sarala · 2018-08-03T08:47:22Z

Hi,

Would it be possible to use the compact identifier form [1] for all the identifiers? This means you will also be able to resolve the identifiers using identifiers.org or n2t (KC2, team Sodium work).

Cheers,
Sarala

[1] Wimalaratne, S.M., et al., Uniform resolution of compact identifiers for biomedical data. Sci Data, 2018. 5: p. 180029.
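For reference, a compact identifier of the form PREFIX:ACCESSION can be turned into a resolver URL by appending it to the resolver's base address. The sketch below only builds the URL strings and makes no network request; the helper name `resolver_urls` is hypothetical, and it does not check whether a given prefix is actually registered with either resolver.

```python
def resolver_urls(compact_id):
    """Build identifiers.org and n2t.net URLs for a PREFIX:ACCESSION identifier."""
    prefix, separator, accession = compact_id.partition(":")
    if not separator or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return {
        "identifiers.org": f"https://identifiers.org/{compact_id}",
        "n2t": f"https://n2t.net/{compact_id}",
    }


# Identifiers taken from the samples discussed earlier in this thread.
for compact_id in ("MGI:99205", "HGNC:28857", "NCBITaxon:10090"):
    print(compact_id, resolver_urls(compact_id))
```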

aegururaj · 2018-08-03T20:11:21Z

@cmungall thanks for looking into this. We will review and get back to you soon.

bheavner · 2018-09-14T16:50:31Z

I'm sorry, I'm just seeing this issue. What's the best way for me/TOPMed to get more context about this? I'm not sure what we're reviewing for, or who the best person would be.

aegururaj · 2018-09-18T18:40:16Z

@bheavner Please let the Oxygen team (Anu) know if you would like to get additional information about the mapping process.

aegururaj mentioned this issue Aug 1, 2018

Mapping files for TOPMed, GTEx and Alliance dcppc/crosscut-metadata#22

Open
