Requesting feedback on mapping files #24

Open
aegururaj opened this issue Aug 1, 2018 · 9 comments

Comments

@aegururaj

aegururaj commented Aug 1, 2018

The Oxygen team has generated mapping files that map the metadata to the crosscut metadata model (also known as the DATS model). We would like to request feedback and start a discussion on improving the mappings. The PR for the mapping files is dcppc/crosscut-metadata#22.

@cmungall

cmungall commented Aug 2, 2018

I looked at AGR_FB_Mapping (I assume the other MOD files are identical since the format is identical across the Alliance files).

I previously expressed some of my concerns here: dcppc/crosscut-metadata#21

I'm not sure the DATS Dimension model is appropriate for representing the Alliance data. Even when representing basic gene information, the mapping is lossy. For example, multiple fields with distinct semantics (displayName, prefix, localid) are all mapped to relatedIdentifiers.

On row 22, type is mapped to title. Is this a mistake?

It looks like the ortholog mapping is lossy; it's not clear how a homology query could be performed on the transformed data.

It may be that I am misunderstanding the mappings file. Is there an example DATS JSON file? That would really help.

@aegururaj
Author

@cmungall We have the Elasticsearch endpoint here that has 5 MGI sample DATS JSON files: MGI/5622662, MGI/5622581, MGI/1346023, MGI/106092, MGI/99205

@cmungall

cmungall commented Aug 2, 2018

Thanks! Can you provide URLs to get to the JSON?

@aegururaj
Author

Sure, there you go, MGI sample DATS JSON

@cmungall

cmungall commented Aug 3, 2018

Thanks again!

It looks like things are not being mapped at the correct level. For example,

"isAbout": [
    {
        "@type": "MolecularEntity",
        "name": "",
        "taxonomy": [
            {
                "@type": "TaxonomicInformation",
                "name": "Zfp58",
                "identifier": {
                    "identifier": "NCBITaxon:10090",
                    "identifierSource": "NCBITaxon:10090"
                },
                "relatedIdentifiers": [
                    {
                        "identifier": "RIKEN cDNA A530094I17 gene",
                        "identifierSource": "RIKEN cDNA A530094I17 gene",
                        "relationType": "RIKEN cDNA A530094I17 gene"
                    }
                ]
            }
        ],

Zfp58 is the gene symbol, not the name of the taxon (which should be Mus musculus). Similarly, the RIKEN identifiers belong at the level of the gene, not the taxon.

For the identifiers, there are things like:
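To make the level mismatch concrete, here is a minimal Python sketch that moves the gene-level fields off the taxon and onto the enclosing MolecularEntity. The function name `lift_gene_fields` and the `TAXON_NAMES` lookup are hypothetical, introduced only for illustration. Whether `name` and `relatedIdentifiers` on the MolecularEntity are the right DATS slots for the symbol and the RIKEN name is an assumption, not something settled in this thread.

```python
import copy

# Assumption: a one-entry lookup for illustration only; a real pipeline
# would resolve taxon names from a taxonomy source.
TAXON_NAMES = {"NCBITaxon:10090": "Mus musculus"}


def lift_gene_fields(entity):
    """Return a copy of a MolecularEntity with gene-level fields moved off the taxon."""
    fixed = copy.deepcopy(entity)
    for taxon in fixed.get("taxonomy", []):
        # The gene symbol was stored as the taxon name; move it to the entity
        # if the entity has no name yet.
        if not fixed.get("name"):
            fixed["name"] = taxon.get("name", "")
        # Gene-level names and identifiers describe the gene, not the taxon.
        fixed.setdefault("relatedIdentifiers", []).extend(
            taxon.pop("relatedIdentifiers", [])
        )
        # Name the taxon from its own identifier (blank if unknown).
        taxon_id = taxon.get("identifier", {}).get("identifier")
        taxon["name"] = TAXON_NAMES.get(taxon_id, "")
    return fixed


# Input trimmed from the excerpt above.
entity = {
    "@type": "MolecularEntity",
    "name": "",
    "taxonomy": [
        {
            "@type": "TaxonomicInformation",
            "name": "Zfp58",
            "identifier": {
                "identifier": "NCBITaxon:10090",
                "identifierSource": "NCBITaxon:10090",
            },
            "relatedIdentifiers": [
                {"identifier": "RIKEN cDNA A530094I17 gene"}
            ],
        }
    ],
}

fixed = lift_gene_fields(entity)
```

After the call, `fixed["name"]` holds the gene symbol and the taxon entry carries only taxon-level information; the input dictionary is left unmodified.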

"relatedIdentifiers": [
    {
        "identifierSource": "MGI",
        "identifier": "MGI:99205",
        "relationType": "gene"
    },
    {
        "identifier": "MGI:99205",
        "relationType": "gene",
        "identifierSource": "MGI"
    },
    {
        "identifierSource": "MGI",
        "identifier": "99205",
        "relationType": "gene"
    },
    {
        "identifierSource": "MGI",
        "identifier": "MGI:99205",
        "relationType": "gene"
    }
],

I suggest having a single canonical identifier per gene, expressed as a CURIE such as MGI:99205. That would make the JSON-LD to RDF conversion straightforward with a canonical context file.

I'm looking for the homology information. It seems to be embedded inside Material objects:
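As a rough illustration of the canonical-identifier suggestion, the following Python sketch collapses the four entries shown above into a single CURIE-keyed entry. The helper names `to_curie` and `canonicalize` are hypothetical, and the sketch assumes that a bare local ID (such as `99205`) should simply be prefixed with its `identifierSource`.

```python
def to_curie(identifier, source):
    """Prefix a bare local ID with its source; leave existing CURIEs untouched."""
    prefix = source + ":"
    return identifier if identifier.startswith(prefix) else prefix + identifier


def canonicalize(related):
    """Collapse duplicate identifier entries into one entry per CURIE."""
    seen = {}
    for entry in related:
        curie = to_curie(entry["identifier"], entry["identifierSource"])
        # Keep the first entry seen for each CURIE, in a fixed key order.
        seen.setdefault(
            curie,
            {
                "identifier": curie,
                "identifierSource": entry["identifierSource"],
                "relationType": entry.get("relationType"),
            },
        )
    return list(seen.values())


# The four entries from the excerpt above.
related = [
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
    {"identifier": "MGI:99205", "relationType": "gene", "identifierSource": "MGI"},
    {"identifierSource": "MGI", "identifier": "99205", "relationType": "gene"},
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
]

canonical = canonicalize(related)
```

Here `canonical` contains one entry with the identifier `MGI:99205`, which is the form a JSON-LD context could then expand to an IRI.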

"characteristics": [
    {
        "name": "",
        "@type": "Material",
        "identifier": {
            "identifier": "HGNC:28857"
        },
        "values": [
            "low",
            "false",
            "false",
            "13",
            "67500555",
            "67490167"
        ],
        "relatedIdentifiers": [
            {
                "identifier": "ZNF682"
            },
            {
                "identifier": "ZNF675"
            },
            {
                "identifier": "ZNF430"
            },

I don't really know what a Material is here, or what the list of values is intended to represent.

Overall, I'm still not sure I grok the data model. As far as I can tell: each gene is modeled as a DatasetDistribution; the DatasetDistribution conformsTo an SO type such as 'gene'; the DatasetDistribution isAbout a MolecularEntity (which doesn't have a type field); and the MolecularEntity has characteristics, which are Materials. Each Material also has identifiers, but these seem to be gene symbols. It looks like the Materials actually represent the orthologous genes, but there is nothing to indicate that these are homologs, and it's not clear why a gene is a MolecularEntity when it's in the species of interest but a Material when it's in another species.
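The nesting described above can be sketched as a small traversal. This is only an illustration in Python: the function name `candidate_orthologs` is hypothetical, the `record` below is a trimmed stand-in assembled from the excerpts in this comment rather than a real export, and treating every Material as an ortholog is exactly the guess a consumer is forced to make, since nothing in the data says so.

```python
def candidate_orthologs(distribution):
    """Collect identifiers of Material objects nested under each MolecularEntity."""
    found = []
    for entity in distribution.get("isAbout", []):
        if entity.get("@type") != "MolecularEntity":
            continue
        for item in entity.get("characteristics", []):
            # Nothing marks these as homologs; the @type is the only hint.
            if item.get("@type") == "Material":
                found.append(item.get("identifier", {}).get("identifier"))
    return found


# Trimmed stand-in record based on the excerpts above (not a full DATS file).
record = {
    "@type": "DatasetDistribution",
    "isAbout": [
        {
            "@type": "MolecularEntity",
            "name": "",
            "characteristics": [
                {
                    "name": "",
                    "@type": "Material",
                    "identifier": {"identifier": "HGNC:28857"},
                    "relatedIdentifiers": [{"identifier": "ZNF682"}],
                }
            ],
        }
    ],
}

orthologs = candidate_orthologs(record)
```

The traversal has to rely on `@type` alone, which shows why an explicit relation (for example, a relation type naming the homology) would make the transformed data easier to query.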

I'm trying to map this all onto my own mental map of biology and not having much luck.

@sarala

sarala commented Aug 3, 2018

Hi,

Would it be possible to use the compact identifier form [1] for all the identifiers? That would also let you resolve the identifiers using identifiers.org or n2t (KC2, team Sodium work).

Cheers,
Sarala

[1] Wimalaratne, S.M., et al., Uniform resolution of compact identifiers for biomedical data. Sci Data, 2018. 5: p. 180029.
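As a small illustration of what resolving a compact identifier could look like, the Python sketch below builds resolver URLs from a `prefix:accession` string. The `RESOLVERS` table and the `resolver_url` function are hypothetical names, and the URL pattern (resolver base followed by the compact identifier) is an assumption that should be checked against each resolver's documentation.

```python
# Assumption: both resolvers accept "<base>/<prefix>:<accession>".
RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "n2t": "https://n2t.net/",
}


def resolver_url(compact_id, resolver="identifiers.org"):
    """Build a resolver URL for a compact identifier such as 'MGI:99205'."""
    prefix, sep, accession = compact_id.partition(":")
    if not (sep and prefix and accession):
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return RESOLVERS[resolver] + compact_id


url_identifiers = resolver_url("MGI:99205")
url_n2t = resolver_url("MGI:99205", resolver="n2t")
```

A bare local ID such as `99205` is rejected, which is one practical reason to store every identifier in the compact form.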

@aegururaj
Author

@cmungall thanks for looking into this. We will review and get back to you soon.

@bheavner
Contributor

I'm sorry, I'm just seeing this issue. What's the best way for me/TOPMed to get more context about this? I'm not sure what we're reviewing for, or who the best person would be.

@aegururaj
Author

@bheavner Please let the Oxygen team (Anu) know if you would like to get additional information about the mapping process.
