gcn-matching

Matching nodes in a knowledge graph using Graph Convolutional Networks and investigating the interplay between formal semantics and GCNs.

A detailed description of the motivation and the algorithms is available in the related paper.

Citing

When citing, please use the following reference:

Pierre Monnin, Chedy Raïssi, Amedeo Napoli, Adrien Coulet: Discovering alignment relations with Graph Convolutional Networks: A biomedical case study. Semantic Web 13(3): 379-398 (2022)

@article{monninRNC22,
  author    = {Pierre Monnin and
               Chedy Ra{\"{\i}}ssi and
               Amedeo Napoli and
               Adrien Coulet},
  title     = {Discovering alignment relations with Graph Convolutional Networks:
               {A} biomedical case study},
  journal   = {Semantic Web},
  volume    = {13},
  number    = {3},
  pages     = {379--398},
  year      = {2022},
  url       = {https://doi.org/10.3233/SW-210452},
  doi       = {10.3233/SW-210452}
}

Scripts

1. Query similarity set

  • In query_simset.py
  • Retrieve the individuals to match (instances of the classes listed under individuals-classes in the JSON configuration file)
  • Retrieve the similarity links between these individuals (to use in the train/valid/test sets).
    • Similarity links are listed under similarity-links in the JSON configuration file
    • For symmetric predicates, once the link (url1, url2) is retrieved, (url2, url1) is not added, to avoid a symmetry bias in training
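The rule for symmetric predicates can be sketched as follows. This is a minimal illustration and not the code of query_simset.py: the helper name `dedupe_symmetric_links` and the in-memory list of URL pairs are assumptions made for the example.

```python
def dedupe_symmetric_links(links):
    """Keep a single direction per unordered pair, in first-seen order.

    Hypothetical helper: `links` is an iterable of (url1, url2) tuples
    retrieved for one symmetric predicate.
    """
    seen = set()
    kept = []
    for url1, url2 in links:
        key = frozenset((url1, url2))
        if key in seen:
            continue  # the same pair was already kept, in either direction
        seen.add(key)
        kept.append((url1, url2))
    return kept


# Illustrative URLs only
links = [("ex:a", "ex:b"), ("ex:b", "ex:a"), ("ex:a", "ex:c")]
print(dedupe_symmetric_links(links))
```

Using a frozenset as the key makes (url1, url2) and (url2, url1) compare equal, so only the first direction encountered is kept.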

2. Query graph

  • In query_graph.py
  • Retrieve the adjacency of the RDF graph (except the similarity links already retrieved in step 1)
  • Retrieve predicates together with their inverses (or whether they are symmetric)
  • Must be used with the cache manager resulting from the previous step

3. Similarity analysis

  • In similarity_analysis.py
  • Output PDF files with histograms showing, for each model, the number of similarity clusters of each size (computed from the similarity links considered by the model)
  • Similarity clusters for each model are computed over all similarity links considered by the model, regardless of their predicate, treating every link as undirected (symmetric) and transitive
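Treating links as undirected and transitive means that similarity clusters are the connected components of the link graph. The sketch below illustrates this with a small union-find structure; the function name and the list-of-pairs input are assumptions, and similarity_analysis.py may compute clusters differently.

```python
from collections import Counter


def similarity_clusters(links):
    """Return the connected components induced by (a, b) similarity links."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Illustrative links only
links = [("a", "b"), ("b", "c"), ("d", "e")]
clusters = similarity_clusters(links)

# Histogram data: cluster size -> number of clusters of that size
size_histogram = Counter(len(cluster) for cluster in clusters)
print(size_histogram)
```

The `size_histogram` mapping holds the data that a histogram of cluster sizes would plot.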

4. N Fold Split

  • In n_fold_split.py
  • Output an n-fold split of the similarity links (after shuffling)
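A shuffle followed by an n-fold split can be sketched as below. The function name, the `seed` parameter, and the round-robin assignment of links to folds are assumptions for illustration; n_fold_split.py may distribute links differently.

```python
import random


def n_fold_split(links, n_folds, seed=0):
    """Shuffle the links, then deal them into n_folds near-equal folds."""
    shuffled = list(links)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Fold i receives every n_folds-th link starting at position i
    return [shuffled[i::n_folds] for i in range(n_folds)]


# Illustrative links only
links = [("ex:a%d" % i, "ex:b%d" % i) for i in range(10)]
folds = n_fold_split(links, n_folds=3)
print([len(fold) for fold in folds])
```

Each link lands in exactly one fold, and fold sizes differ by at most one.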

5. Transform graph

  • In transform_graph.py
  • Output a DGL graph from the given RDF graph applying one of the following transformations:
    • G0: RDF graph + adding an abstract inverse for each predicate
    • G1: RDF graph after owl:sameAs edges contraction (only considering canonical nodes)
    • G2: RDF graph with consideration of inverse predicates / symmetry (to avoid adding abstract inverses when not needed)
    • G3: RDF graph with links added based on the hierarchy of predicates: if (a, rel1, b) and (rel1, subPropertyOf, rel2), we add (a, rel2, b)
    • G4: RDF graph with rdf:type links added based on the hierarchy of classes: if (a, type, b) and (b, subClassOf, c), we add (a, type, c)
    • G5: all transformations from G1 to G4 combined
  • The graph is limited to the considered neighborhood of individuals to match based on the number of layers
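The inference rules of G3 and G4 can be illustrated on plain triples. This is a sketch under assumptions, not the code of transform_graph.py: triples are kept in a Python set rather than a DGL graph, the hierarchies are given as dicts, and the rules are applied repeatedly up the hierarchy (the rules above are stated for a single step).

```python
def add_inferred_links(triples, sub_property_of, sub_class_of, type_pred="rdf:type"):
    """Add links implied by the predicate hierarchy (G3) and class hierarchy (G4).

    triples: set of (subject, predicate, object)
    sub_property_of: dict mapping a predicate to its direct super-predicates
    sub_class_of: dict mapping a class to its direct superclasses
    """
    inferred = set(triples)
    frontier = list(inferred)
    while frontier:
        s, p, o = frontier.pop()
        # G3: (a, rel1, b) and (rel1, subPropertyOf, rel2) give (a, rel2, b)
        candidates = [(s, sup, o) for sup in sub_property_of.get(p, ())]
        # G4: (a, type, b) and (b, subClassOf, c) give (a, type, c)
        if p == type_pred:
            candidates += [(s, p, sup) for sup in sub_class_of.get(o, ())]
        for triple in candidates:
            if triple not in inferred:
                inferred.add(triple)
                frontier.append(triple)  # may trigger further inferences
    return inferred


# Illustrative data only
triples = {("ex:d1", "ex:treats", "ex:x"), ("ex:d1", "rdf:type", "ex:Drug")}
sub_property_of = {"ex:treats": {"ex:relatedTo"}}
sub_class_of = {"ex:Drug": {"ex:Chemical"}, "ex:Chemical": {"ex:Entity"}}
for triple in sorted(add_inferred_links(triples, sub_property_of, sub_class_of)):
    print(triple)
```

The membership check on `inferred` keeps the loop finite even if a hierarchy contains cycles.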

6. Learning

  • In learning.py
  • Output a Python dict keyed by the index of the test fold; each value contains:
    • logits_history: Python list associating each epoch with its logits
    • train_loss_history: Python list associating each epoch with its training loss
    • val_loss_history: Python list associating each epoch with its validation loss
    • test_loss_history: Python list associating each epoch with its test loss
    • temperature_history: Python list associating each epoch with its temperature
    • model: Python list associating each epoch with the parameters of the GCN model
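The shape of this output can be sketched as below. The key names come from the list above; the helper functions, the dummy values, and the assumption that list position corresponds to the epoch number are illustrative only and are not taken from learning.py.

```python
HISTORY_KEYS = (
    "logits_history",
    "train_loss_history",
    "val_loss_history",
    "test_loss_history",
    "temperature_history",
    "model",
)


def new_fold_record():
    """Create an empty record for one test fold (hypothetical helper)."""
    return {key: [] for key in HISTORY_KEYS}


def record_epoch(record, logits, train_loss, val_loss, test_loss,
                 temperature, model_params):
    """Append the values of one epoch to each history list."""
    record["logits_history"].append(logits)
    record["train_loss_history"].append(train_loss)
    record["val_loss_history"].append(val_loss)
    record["test_loss_history"].append(test_loss)
    record["temperature_history"].append(temperature)
    record["model"].append(model_params)


results = {}
for test_fold in range(2):
    record = new_fold_record()
    for epoch in range(3):
        # Dummy values standing in for what training would produce
        record_epoch(record, logits=[0.0], train_loss=1.0 / (epoch + 1),
                     val_loss=1.5 / (epoch + 1), test_loss=2.0 / (epoch + 1),
                     temperature=1.0, model_params={})
    results[test_fold] = record

# Validation loss of the last epoch when fold 0 is the test fold
print(results[0]["val_loss_history"][-1])
```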

7. Clustering analysis

  • In clustering_analysis.py
  • Output for each fold:
    • A distance analysis based on all links, links whose nodes are in the training set, in the validation set, and in the test set
    • A UMAP projection computed on all nodes and displayed for all nodes, nodes in the training set, in the validation set, and in the test set. Only the --umap-colors biggest similarity clusters are colored. Only similarity clusters containing more than --umap-size nodes are displayed (set to 0 to disable this filter)
    • A plot of the training, validation, and test losses
    • A plot of the temperature
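The effect of --umap-size and --umap-colors on the UMAP display can be sketched as a selection over similarity clusters. The function below is hypothetical and leaves out the projection and plotting; in particular, choosing the colored clusters among the displayed ones is an assumption about how the two options interact.

```python
def select_clusters(clusters, umap_size, umap_colors):
    """Return (displayed, colored) clusters, biggest first.

    umap_size: display only clusters with more than this many nodes
               (0 disables the filter)
    umap_colors: number of clusters to color, starting at the biggest
    """
    ordered = sorted(clusters, key=len, reverse=True)
    if umap_size > 0:
        displayed = [c for c in ordered if len(c) > umap_size]
    else:
        displayed = ordered
    colored = displayed[:umap_colors]
    return displayed, colored


# Illustrative clusters only
clusters = [{"a"}, {"b", "c"}, {"d", "e", "f"}, {"g", "h", "i", "j"}]
displayed, colored = select_clusters(clusters, umap_size=1, umap_colors=2)
print(len(displayed), len(colored))
```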

Dependencies

  • Python 3.7
  • tqdm
  • requests
  • pytorch
  • dgl
  • matplotlib
  • scikit-learn
  • umap-learn
  • pynndescent

Experiments

Models

(called gold clusterings in the preprint)

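| Similarity links | owl:sameAs | skos:closeMatch | skos:relatedMatch | skos:related | skos:broadMatch |
|------------------|------------|-----------------|-------------------|--------------|-----------------|
| Properties       | T / S      | T / S           | nT / S            | nT / S       | T / nS          |
| M0               | X          | X               | X                 | X            | X               |
| M1               | X          | X               | X                 | X            |                 |
| M2               | X          |                 |                   |              |                 |
| M3               |            | X               |                   |              |                 |
| M4               |            |                 | X                 |              |                 |
| M5               |            |                 |                   | X            |                 |
| M6               |            |                 |                   |              | X               |
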
  • T: transitivity
  • S: symmetry
  • nT: non-transitivity
  • nS: non-symmetry