CFDE Data Distillery Data Dictionary

Introduction

The Data Distillery project aims to integrate summarized ("distilled") Common Fund data within a knowledge graph. The purpose of the Data Distillery Knowledge Graph (DDKG) is to link multiple sources of expertly curated data, thus providing data integration across multiple Common Fund data coordinating centers (DCCs). The summarized data are provided by participating DCCs and funded as part of the Common Fund Data Ecosystem (CFDE) project. The DDKG schema is based on the Unified Biomedical Knowledge Graph (UBKG), which originates from the Unified Medical Language System (UMLS). The UBKG supports the DDKG with over 180 ontologies and standards that either are native to UMLS or were explicitly added to support biomolecular data (see Figure 1). The DDKG supports queries ranging from simple to complex, and the results can be used in a range of applications involving Common Fund data. Use cases with example queries and results are included in the Data Distillery User Guide.

For the first phase of the project, the participating DCCs have submitted 29 different datasets for integration into the DDKG. This document outlines these 29 datasets within the DDKG and describes the schema and information on each dataset.

Base Datasets

Information on the base set of ontologies included in the Data Distillery Knowledge Graph can be found in the documentation for the Unified Biomedical Knowledge Graph (UBKG), upon which the Data Distillery is built. See Figure 1 for a general schematic.

DCC Datasets

4D Nucleome (4DN) DCC

4DN datasets


Dataset SAB(s)	4DNQ, 4DNL, 4DNF, 4DND
DCC Website	data.4dnucleome.org
DCC	4DN-DCIC
Authority	Andy Schroeder (PM) Harvard Medical School, Boston
Source Information	Chromatin loops called from Hi-C experiments performed in select cell lines.
Purpose	Representing topologically associated domains and loops by chromosomal location can allow exploration of gene expression, genomic variation and other biological information in the context of chromatin architecture.
Description	Hi-C chromatin capture assays generate information on regions of the genome that can be located far apart along the linear sequence of DNA but are in close physical proximity in nuclear chromatin. Architectural features of the chromatin, including topologically associated domains (TADs), loops and dots, can be derived algorithmically from the results of Hi-C experiments. Loop calls from several 4DNucleome Hi-C datasets generated from select cell lines and tissues are provided to the Data Distillery for ingestion. A subset of loop calls were generated by two different 4DN research labs on datasets from the H1-ESC human ES cell line, H1 differentiated to endoderm, and HFFc6, a human foreskin derived cell line. Loop calls from the Dekker lab were generated using the cooltools re-implementation of HICCUPS as described in Oksuz et al. 2021 https://pubmed.ncbi.nlm.nih.gov/34480151/. Loop calls from the Cremins lab were generated as described in Emerson et al. 2022 https://pubmed.ncbi.nlm.nih.gov/35676475/. Additional loop calls from the Cremins lab, part of the data from the Emerson paper, are also provided; these come from the HCT116 colorectal cancer cell line with or without depletion of WAPL or RAD21, genes that encode proteins important for chromatin architecture. In addition, the 4DN-DCIC generated loop calls on 4 additional datasets: in situ Hi-C performed on H1-ESC or GM12878 cell lines (4DNESFSCP5L8 and 4DNES3JX38V5, respectively) and DNase-Hi-C in fetal heart tissue (4DNESZFHB53P) or RUES2 stem cells differentiated to cardiomyocytes (4DNESGTHHJAC). The loop calls from the 25 kb resolution matrices of these datasets were further filtered for loops that overlapped expressed genes identified from gene expression data from the same or comparable cells and tissues.
Summarization of Methodology	This document indicates the datasets and files used as input to the 4DN distilled data. For the genome-wide loop calls from the Dekker and Cremins groups, the indicated files were the direct input into the summarization process. For the 4DN-DCIC loop calls, the mcool files indicated were used to call loops, which were further summarized utilizing expression data to provide a file of loops that overlap expressed genes, as described in this document. The provided loop files were then prepared for ingestion by creating four kinds of nodes: dataset nodes (SAB: '4DND') whose terms contain the dataset information (assay type, lab and cell type involved); file nodes (SAB: '4DNF') whose terms contain the file information; loop nodes (SAB: '4DNL') attached to HSCLO nodes at the 1 kbp resolution level, corresponding to the start and end of the upstream anchor and the start and end of the downstream anchor of the loop; and q-value nodes (SAB: '4DNQ') corresponding to the donut q-value of the loops. These nodes are then used to create concept nodes with the connections depicted in the schematic below.
Summarization of Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/4DN
Total Nodes	216,212
Total Edges	1,294,980
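
The attachment of loop nodes to 1 kbp HSCLO nodes described in the Summarization of Methodology can be illustrated with a minimal Python sketch. The predicate names are taken from the 4DN Edge Counts table; everything else is an assumption made for illustration: the `Anchor` input shape, the function names, the "chrom:start-end" bin label (not the actual HSCLO code format), and the convention that each coordinate is assigned to the bin that contains it.

```python
from typing import Dict, Tuple

BIN_SIZE = 1_000  # 1 kbp resolution, as stated in the methodology

# (chromosome, start, end); hypothetical input shape for one loop anchor
Anchor = Tuple[str, int, int]


def bin_label(chrom: str, position: int, bin_size: int = BIN_SIZE) -> str:
    """Label the fixed-width bin that contains a position.

    The "chrom:start-end" format is an assumption for illustration,
    not the actual HSCLO code format.
    """
    start = (position // bin_size) * bin_size
    return f"{chrom}:{start}-{start + bin_size}"


def loop_edges(upstream: Anchor, downstream: Anchor) -> Dict[str, str]:
    """Map the four anchor coordinates of a loop to the bins they fall in."""
    us_chrom, us_start, us_end = upstream
    ds_chrom, ds_start, ds_end = downstream
    return {
        "loop_us_start": bin_label(us_chrom, us_start),
        "loop_us_end": bin_label(us_chrom, us_end),
        "loop_ds_start": bin_label(ds_chrom, ds_start),
        "loop_ds_end": bin_label(ds_chrom, ds_end),
    }


# Made-up coordinates, used only to show the shape of the result.
edges = loop_edges(("chr1", 25_000, 50_000), ("chr1", 125_000, 150_000))
for predicate, label in edges.items():
    print(predicate, label)
```

Each loop yields exactly four bin edges in this sketch, which is consistent with the four HSCLO predicates in the edge table sharing the same count as the 4DNL nodes.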

4DN Schema Diagram

4DN Node Counts

SAB	Count
4DND	12
4DNF	12
4DNL	215,822
4DNQ	354

4DN Edge Counts

Subject SAB	Predicate	Object SAB	Count
4DND	has_assay_type	EFO	12
4DND	has_assay_type	OBI	12
4DND	dataset_involved_cell_type	EFO	11
4DND	dataset_involved_cell_type	UBERON	1
4DND	dataset_has_file	4DNF	12
4DNF	file_has_loop	4DNL	215,822
4DNL	loop_has_qvalue_bin	4DNQ	215,822
4DNL	loop_us_start	HSCLO	215,822
4DNL	loop_us_end	HSCLO	215,822
4DNL	loop_ds_start	HSCLO	215,822
4DNL	loop_ds_end	HSCLO	215,822

Extracellular RNA Communication Program (ERCC) DCC

ERCC RBP dataset


Dataset SAB(s)	ENSEMBL, UBERON, UNIPROTKB, ENCODE.RBS.150.NO.OVERLAP, ENCODE.RBS.HepG2, ENCODE.RBS.HepG2.K562, ENCODE.RBS.K562
DCC Website	https://exrna.org/
DCC	Extracellular RNA Communication Consortium (ERCC)
Authority	Aleksandar Milosavljevic
Source Information	The genomic coordinates of eCLIP peaks of 150 RNA binding proteins (RBPs) were taken from eCLIP-seq analysis results published by the ENCODE project. Control extracellular RNA (exRNA) sequencing profiles available through the exRNA Atlas were used to draw several relationships.
Purpose	To help identify minimally invasive biomarkers of disease.
Description	Assertions describe relationships between RBPs, RBP binding sites, genes, and biofluids. "RBP binding sites" refers to both eCLIP peaks and a second type of genomic locus. This second type is the result of trimming the eCLIP loci so that there are no overlaps between sets of loci from a given pair of RBPs.
Summarization of Methodology	Relationships between RBPs and biofluids, and between eCLIP loci and biofluids are the result of a correlation-based analysis. This analysis was performed using the coverage of trimmed eCLIP loci within control exRNA profiles made available through the exRNA Atlas. This analysis is described in detail by LaPlante et al., Cell Genomics, 2023.
Summarization of Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/ERCC/check_ERCC_submissions.ipynb
Total Nodes	1,169,178
Total Edges	2,431,786
Source Data DOI(s)	https://doi.org/10.1038/nmeth.3810
Source Data URL(s)	https://www.encodeproject.org/encore-matrix/?type=Experiment&status=released&internal_tags=ENCORE

ERCC RBP Schema Diagram

ERCC RBP Node Counts

SAB	Count
UBERON	5
UNIPROTKB	150
ENSEMBL	15,807
ENCODE.RBS.150.NO.OVERLAP	462,297
ENCODE.RBS.K562	304,175
ENCODE.RBS.HepG2	335,238
ENCODE.RBS.HepG2.K562	51,506

ERCC RBP Edge Counts

Subject SAB	Predicate	Object SAB	Count
ENCODE.RBS.150.NO.OVERLAP	overlaps	ENSEMBL	500,660
ENCODE.RBS.HepG2	overlaps	ENSEMBL	265,453
UNIPROTKB	molecularly_interacts_with	ENCODE.RBS.HepG2	335,238
ENCODE.RBS.K562	overlaps	ENSEMBL	333,141
UNIPROTKB	molecularly_interacts_with	ENCODE.RBS.K562	304,175
ENCODE.RBS.150.NO.OVERLAP	is_subsequence_of	ENCODE.RBS.HepG2	225,437
ENCODE.RBS.150.NO.OVERLAP	is_subsequence_of	ENCODE.RBS.K562	192,320
ENCODE.RBS.HepG2.K562	overlaps	ENSEMBL	56,565
UNIPROTKB	molecularly_interacts_with	ENCODE.RBS.HepG2.K562	51,506
ENCODE.RBS.150.NO.OVERLAP	is_subsequence_of	ENCODE.RBS.HepG2.K562	44,713
ENCODE.RBS.150.NO.OVERLAP	correlated_in	UBERON	22,092
UNIPROTKB	predicted_in	UBERON	268
UNIPROTKB	not_predicted_in	UBERON	116
ENCODE.RBS.150.NO.OVERLAP	not_correlated_in	UBERON	102

ERCC Regulatory Element dataset


Dataset SAB(s)	CLINGEN.ALLELE.REGISTRY, ENSEMBL, GTEXEQTL, UBERON, ENCODE.CCRE, ENCODE.CCRE.ACTIVITY, ENCODE.CCRE.CTCF, ENCODE.CCRE.H3K27AC, ENCODE.CCRE.H3K4ME3 (node SABs) ERCCREG, ERCCRBP (edge SABs)
DCC Website	https://exrna.org/
DCC	Extracellular RNA Communication Consortium (ERCC)
Authority	Aleksandar Milosavljevic
Source Information	The results of ChIP-seq experiments conducted by the ENCODE project were used to identify regulatory elements active within specific tissues and their transcriptional role. Similarly, we used data published by the GTEx project to identify eQTLs active within specific tissues.
Purpose	To identify regulatory elements active within a specific tissue which are also supported by having an active eQTL within the range of its genomic coordinates.
Description	The tissue specific regulation of a gene by an eQTL is modeled using variant, tissue, eQTL, and gene nodes. The same model structure is also used for regulatory elements. In this case a "regulatory element activity" (SAB=ENCODE.CCRE.ACTIVITY) node is used as the central node rather than the eQTL node. Regulatory element activity nodes are also decorated with relationships to other nodes to assist in determining the tissue specific transcriptional role of the regulatory element. eQTL and regulatory element models are connected by a relationship between variant and regulatory element nodes.
Summarization of Methodology	To summarize regulatory element data, ENCODE biosamples were grouped by their respective tissue or cell line ontology code. These biosample groups were then further grouped by the number of samples within each biosample group. Next, within the DNase z-score data matrix provided by ENCODE, for each larger group and each regulatory element, the number of z-scores above 1.64 was counted within the samples of each biosample (or small) group. This process was used to build a reference distribution of counts specific to biosample groups with a given number of members. Regulatory elements were then classified as active within a specific tissue or cell type if the count of z-scores greater than 1.64 within ENCODE biosamples belonging to that group was above the median value of the reference distribution. Only regulatory elements classified as active in at least one tissue or cell type are included. The process described above was repeated for the H3K4me3, H3K27ac, and CTCF z-score data matrices to decorate regulatory element activity nodes with relationships to other nodes that help the user identify the transcriptional role of each regulatory element.
Summarization of Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/ERCC/check_ERCC_submissions.ipynb
Total Nodes	2,918,828
Total Edges	14,897,093
Source Data DOI(s)	https://doi.org/10.1038/s41586-020-2493-4
Source Data URL(s)	https://screen.wenglab.org/ https://www.gtexportal.org/home/
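
The activity classification described in the Summarization of Methodology can be sketched in Python as follows. This is an illustration under stated assumptions, not the ERCC pipeline: the nested-dictionary input, the function names, and the choice to pool counts from all biosample groups of the same size into one reference distribution are one reading of the text. Only the 1.64 cutoff and the "above the median" rule come from the description.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Set

Z_THRESHOLD = 1.64  # cutoff stated in the methodology


def count_above(zscores: List[float], threshold: float = Z_THRESHOLD) -> int:
    """Count the samples whose z-score exceeds the threshold."""
    return sum(1 for z in zscores if z > threshold)


def classify_active(
    zscores: Dict[str, Dict[str, List[float]]]
) -> Dict[str, Set[str]]:
    """Return, per biosample group, the elements classified as active.

    zscores maps group code -> element id -> one z-score per sample
    (hypothetical layout). Groups with the same number of samples share
    a reference distribution of counts (an assumed reading of the text).
    """
    counts: Dict[str, Dict[str, int]] = {}
    sizes: Dict[str, int] = {}
    reference: Dict[int, List[int]] = defaultdict(list)

    for group, elements in zscores.items():
        counts[group] = {e: count_above(zs) for e, zs in elements.items()}
        sizes[group] = max((len(zs) for zs in elements.values()), default=0)
        reference[sizes[group]].extend(counts[group].values())

    # One median cutoff per group size.
    cutoffs = {size: median(vals) for size, vals in reference.items() if vals}

    return {
        group: {e for e, c in group_counts.items() if c > cutoffs[sizes[group]]}
        for group, group_counts in counts.items()
    }


# Made-up z-scores for two three-sample groups (illustrative only).
example = {
    "GROUP_A": {"e1": [2.0, 1.7, 0.1], "e2": [0.0, 0.5, 1.0], "e3": [2.0, 2.0, 2.0]},
    "GROUP_B": {"e1": [0.0, 0.0, 0.0], "e2": [1.7, 0.0, 0.0], "e3": [0.0, 0.0, 0.0]},
}
active = classify_active(example)
print(active)
```

In this toy input the pooled counts for three-sample groups are 0, 0, 0, 1, 2 and 3, so the median cutoff is 0.5 and any element with at least one z-score above 1.64 is classified as active.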

ERCC Regulatory Element Schema Diagram

ERCC Regulatory Element Node Counts

SAB	Count
UBERON	34
ENSEMBL	49,987
GTEXEQTL	265,965
ENCODE.CCRE.ACTIVITY	2,196,935
ENCODE.CCRE	342,850
CLINGEN.ALLELE.REGISTRY	63,051
ENCODE.CCRE.H3K4ME3	2
ENCODE.CCRE.CTCF	2
ENCODE.CCRE.H3K27AC	2

ERCC Regulatory Element Edge Counts

Subject SAB	Predicate	Object SAB	Count
ENCODE.CCRE.ACTIVITY	regulates	ENSEMBL	4,804,247
ENCODE.CCRE	part_of	ENCODE.CCRE.ACTIVITY	2,196,935
UBERON	part_of	ENCODE.CCRE.ACTIVITY	2,196,935
ENCODE.CCRE.ACTIVITY	isa	ENCODE.CCRE.H3K4ME3	1,712,682
ENCODE.CCRE.ACTIVITY	isa	ENCODE.CCRE.H3K27AC	1,570,234
ENCODE.CCRE.ACTIVITY	isa	ENCODE.CCRE.CTCF	1,510,150
CLINGEN.ALLELE.REGISTRY	part_of	GTEXEQTL	265,965
UBERON	part_of	GTEXEQTL	265,965
GTEXEQTL	negatively_regulates	ENSEMBL	156,836
GTEXEQTL	positively_regulates	ENSEMBL	154,093
CLINGEN.ALLELE.REGISTRY	located_in	ENCODE.CCRE	63,051

GlyGen DCC

GlyGen datasets


Dataset SAB(s)	FALDO, GLYCOCOO, GLYCORDF, UNIPROTKB, PROTEOFORM, GLYCANS
DCC Website	https://www.glygen.org/
DCC	GlyGen
Authority	Raja Mazumder (PI) George Washington University; Mike Tiemeyer (PI) University of Georgia
Source Information	Data for GlyGen are retrieved from multiple glycomics databases (e.g. GlyTouCan, GlyConnect, MatrixDB), proteomics databases (e.g. UniProtKB) and other domain databases (e.g. Ensembl, RefSeq, BioMuta, OMA, MGI, Bgee). All data are transformed into a standardized representation and integrated in GlyGen.
Purpose	Provide computational and informatics resources and tools for glycosciences research. Integrate data and knowledge from diverse disciplines relevant to glycobiology. Address needs inside and outside the glycoscience community.
Description	GlyGen is a data integration and dissemination project for carbohydrate and glycoconjugate related data. GlyGen retrieves information from multiple international data sources and integrates and harmonizes this data. This web portal allows exploring this data and performing unique searches that cannot be executed in any of the integrated databases alone.
Summarization of Methodology	The data ingested for the knowledge graph are from ontologies associated with the glycan and proteoform domains. Select nodes and edges for glycans are retrieved from GlycoCoO and GlycoRDF, ontologies that describe the properties of glycans. The assertion data received from GlyGen in N-Triples format (glycan.nt and proteoform.nt) were imported into the Neo4j environment using the n10s plug-in functions. Once the data were imported for each of the glycan and proteoform datasets, subgraphs were created. Finally, the resulting graph nodes and edges were exported as .csv files using APOC plug-in procedures. The resulting nodes and edges were reformatted by curating the relationship names and adding SABs for all entities (either by using existing SABs, e.g. UNIPROTKB and GLYTOUCAN, or by creating custom SABs such as GLYGEN.MOTIF or GLYCOPROTEIN) and saved as OWLNETS_node_metadata.tsv and OWLNETS_edgelist.tsv for ingestion. More information on FALDO can be found here: https://bioportal.bioontology.org/ontologies/FALDO. More information on GlycoRDF can be found here: https://github.com/glycoinfo/GlycoRDF. More information on GlycoCoO can be found here: https://github.com/glycoinfo/GlycoCoO.
Summarization of Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/GLYGEN/GlyGen_workfolw.md
Total Nodes	241,770 (PROTEOFORM) 182,269 (GLYCANS)
Total Edges	455,469 (PROTEOFORM) 464,659 (GLYCANS)
Source Data URL(s)	Download from https://sparql.glygen.org/ Data file: https://sparql.glygen.org/ln2triplestoredata/triples.tar.gz

GlyGen GLYCANS Schema Diagram

GlyGen PROTEOFORM Schema Diagram

GlyGen FALDO Node Counts

SAB	Count
FALDO	16

GlyGen FALDO Edge Counts

Subject SAB	Predicate	Object SAB	Count
FALDO	isa	FALDO	13
FALDO	begin	FALDO	2
FALDO	end	FALDO	2
FALDO	member	FALDO	2
FALDO	after	FALDO	1
FALDO	before	FALDO	1

GlyGen GLYCOCOO Node Counts

SAB	Count
GLYCAN	10
FALDO	4
SIO	5
CODAO	1
CONJUGATE	4
PROTEIN	1
GLYCOSYLATION	1

GlyGen GLYCOCOO Edge Counts

Subject SAB	Predicate	Object SAB	Count
GLYCAN	isa	GLYCOSYLATION	1
GLYCAN	isa	PROTEIN	1
SIO	isa	SIO	3
SIO	SIO_000628	SIO	1
CODAO	isa	SIO	1
FALDO	isa	FALDO	3
GLYCAN	isa	GLYCAN	5
CONJUGATE	isa	GLYCAN	4

GlyGen GLYCORDF Node Counts

SAB	Count
GLYCAN	107
IMAGE	1
PROTEIN	1
GLYCOSYLATION	1

GlyGen GLYCORDF Edge Counts

Subject SAB	Predicate	Object SAB	Count
GLYCAN	isa	GLYCOSYLATION	1
GLYCAN	isa	IMAGE	1
GLYCAN	isa	GLYCAN	92
PROTEIN	isa	GLYCAN	1
GLYCAN	glycan_has_monosaccharide	GLYCAN	1
GLYCAN	glycan_has_signal	GLYCAN	1

GlyGen PROTEOFORM Node Counts

SAB	Count
UNIPROTKB	16,810
GLYCOPROTEIN	63,441
GLYCOPROTEIN.EVIDENCE	2,088
GLYCOSYLATION.SITE	52,451
GLYGEN.LOCATION	52,450
UNIPROTKB.ISOFORM	8,406
GLYGEN.CITATION	2,088
GP.ID2PRO	61,353
GLYTOUCAN	1,554
AMINO.ACID	7

GlyGen PROTEOFORM Edge Counts

Subject SAB	Predicate	Object SAB	Count
UNIPROTKB	has_isoform	UNIPROTKB.ISOFORM	8,404
GLYCOPROTEIN	has_evidence	GLYCOPROTEIN.EVIDENCE	120,158
GLYCOPROTEIN	sequence	UNIPROTKB.ISOFORM	61,353
GLYCOPROTEIN.EVIDENCE	citation	GLYGEN.CITATION	2,088
GLYCOPROTEIN	has_pro_entry	GP.ID2PRO	61,353
GLYCOPROTEIN	glycosylated_at	GLYCOSYLATION.SITE	52,450
GLYCOSYLATION.SITE	location	GLYGEN.LOCATION	52,450
GLYCOSYLATION.SITE	has_saccharide	GLYTOUCAN	44,763
GLYGEN.LOCATION	has_amino_acid	AMINO.ACID	52,450

GlyGen GLYCANS Node Counts

SAB	Count
GLYGEN.GLYCOSYLATION	91
GLYCOSYLTRANSFERASE.REACTION	91
GLYTOUCAN	33,755
GLYGEN.RESIDUE	80
GLYGEN.SRC	30,986
GLYGEN.GLYCOSEQUENCE	117,146
GLYCAN.MOTIF	120

GlyGen GLYCANS Edge Counts

Subject SAB	Predicate	Object SAB	Count
GLYGEN.GLYCOSYLATION	has_enzyme_protein	UNIPROTKB	91
GLYCOSYLTRANSFERASE.REACTION	has_enzyme_protein	UNIPROTKB	91
GLYTOUCAN	is_from_source	GLYGEN.SRC	30,986
GLYTOUCAN	has_glycosequence	GLYGEN.GLYCOSEQUENCE	117,146
GLYGEN.RESIDUE	attached_by	GLYGEN.GLYCOSYLATION	349
GLYTOUCAN	synthesized_by	GLYCOSYLTRANSFERASE.REACTION	210,563
GLYTOUCAN	has_motif	GLYCAN.MOTIF	19,321
GLYTOUCAN	has_canonical_residue	GLYGEN.RESIDUE	86,033
GLYGEN.RESIDUE	has_parent	GLYGEN.RESIDUE	79

Genotype Tissue Expression (GTEx) DCC

GTEx datasets


Dataset SAB(s)	GTEXEXP, GTEXEQTL, EXPBINS, PVALUEBINS
DCC Website	https://www.gtexportal.org/home/
DCC	GTEx
Authority	Kristen Ardlie
Source Information	Documentation on the sources of GTEx data can be found here: https://biospecimens.cancer.gov/resources/sops/docs/GTEx_SOPs/BBRB-PR-0004-W1%20GTEx%20Tissue%20Harvesting%20Work%20Instruction.pdf
Purpose	To include bulk RNA-seq gene expression levels from adult tissues as well as correlations between genotype and tissue-specific gene expression levels as expression quantitative trait loci (eQTLs) that identify regions of the genome that influence whether and how much a gene is expressed.
Description	The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. This database includes expression levels for genes by tissue in terms of transcripts per million (TPM). The database also contains the p-values and relationships between loci and genes as expression quantitative trait loci (eQTLs).
Summarization of Methodology	Three types of GTEx data were summarized and ingested into the knowledge graph, listed below by SAB: 1. GTEXEXP - Transcripts per million (TPM) values, which represent gene-tissue expression levels, were ingested as is, except that edges to 'bin nodes' (EXPBINS) were created. For example, a GTEXEXP node with a TPM of 10.5 will have an edge to the bin node that represents [10,11] TPM. 2. GTEXEQTL - GTEx eQTLs were filtered to include only those that are present in every tissue. This reduced the total set of eQTLs to ~2 million. P-values for the eQTLs are also included in the graph; however, they are represented as bin nodes, just like the TPM values for the GTEXEXP dataset. 3. GTEXCOEXP - A GTEx co-expression dataset was made by first calculating Pearson's correlation coefficient for all genes in the intersection of GTEx with the HGNC master list, separately for each of the 54 tissues listed in GTEx, using the provided TPMs. Pairs of genes with a correlation coefficient > 0.99 were then tagged in each tissue as strongly correlated and reported as assertions, with relationship types and counts given in the GTEXCOEXP table below.
Summarization of Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/GTEx
Total Nodes	6,280,011
Total Edges	31,904,034
Source Data DOI(s)	N/A
Source Data URL(s)	https://www.gtexportal.org/home/datasets (GTEx_Analysis_v8_eQTL.tar and GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz)
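
Two steps from the Summarization of Methodology, TPM binning and the co-expression cutoff, can be sketched in Python. This is a hedged illustration rather than the GTEx workflow code: the function names and input layout are hypothetical, and the unit-width bin is inferred only from the [10,11] example (the EXPBINS node count suggests the actual bins are not uniformly one TPM wide). The 0.99 threshold is the one stated in the text.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def tpm_bin(tpm: float, width: float = 1.0) -> Tuple[float, float]:
    """Return the (lower, upper) bounds of the bin containing a TPM value.

    A constant bin width is an assumption made for illustration.
    """
    lower = math.floor(tpm / width) * width
    return (lower, lower + width)


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def strongly_correlated(
    tpm_by_gene: Dict[str, List[float]], threshold: float = 0.99
) -> List[Tuple[str, str]]:
    """List gene pairs whose correlation within one tissue exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(tpm_by_gene), 2)
        if pearson(tpm_by_gene[a], tpm_by_gene[b]) > threshold
    ]


print(tpm_bin(10.5))

# Made-up TPM vectors for three placeholder genes in a single tissue.
tissue = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 1, 3, 2]}
print(strongly_correlated(tissue))
```

Running the pairwise comparison separately for each tissue, as the text describes, would produce one set of pairs per tissue-specific `coexpression_*` predicate.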

GTEx GTEXEXP Schema Diagram

GTEx GTEXEQTL Schema Diagram

GTEx GTEXEXP/EXPBINS Node Counts

SAB	Count
UBERON	42
GTEXEXP	1,573,380
EXPBINS	159

GTEx GTEXEXP/EXPBINS Edge Counts

Subject SAB	Predicate	Object SAB	Count
GTEXEXP	expressed_in	HGNC	1,573,380
GTEXEXP	has_expression	EXPBINS	1,573,380
GTEXEXP	expressed_in	UBERON	1,503,452
GTEXEXP	expressed_in	EFO	69,928

GTEx GTEXEQTL/PVALUEBINS Node Counts

SAB	Count
GTEXEQTL	1,240,810
PVALUEBINS	17
HSCLO	3,431,153

GTEx GTEXEQTL/PVALUEBINS Edge Counts

Subject SAB	Predicate	Object SAB	Count
GTEXEQTL	located_in	HGNC	2,047,088
GTEXEQTL	p_value	PVALUEBINS	1,251,403
GTEXEQTL	located_in	HSCLO	1,240,810
GTEXEQTL	located_in	UBERON	1,193,338
GTEXEQTL	located_in	EFO	47,472

GTEx GTEXCOEXP Node Counts

SAB	Count
HGNC	34,448

GTEx GTEXCOEXP Edge Counts

Subject SAB	Predicate	Object SAB	Count
GTEXCOEXP	coexpression_Adipose___Subcutaneous	HGNC	15,485
GTEXCOEXP	coexpression_Adipose___Visceral_(Omentum)	HGNC	2,646
GTEXCOEXP	coexpression_Adrenal_Gland	HGNC	37,897
GTEXCOEXP	coexpression_Artery___Aorta	HGNC	642,521
GTEXCOEXP	coexpression_Artery___Coronary	HGNC	612,950
GTEXCOEXP	coexpression_Artery___Tibial	HGNC	10,237
GTEXCOEXP	coexpression_Bladder	HGNC	529,181
GTEXCOEXP	coexpression_Brain___Amygdala	HGNC	22,221
GTEXCOEXP	coexpression_Brain___Anterior_cingulate_cortex_(BA24)	HGNC	102,887
GTEXCOEXP	coexpression_Brain___Caudate_(basal_ganglia)	HGNC	7,309
GTEXCOEXP	coexpression_Brain___Cerebellar_Hemisphere	HGNC	40,983
GTEXCOEXP	coexpression_Brain___Cerebellum	HGNC	106,195
GTEXCOEXP	coexpression_Brain___Cortex	HGNC	764,276
GTEXCOEXP	coexpression_Brain___Frontal_Cortex_(BA9)	HGNC	27,760
GTEXCOEXP	coexpression_Brain___Hippocampus	HGNC	84,051
GTEXCOEXP	coexpression_Brain___Hypothalamus	HGNC	185,487
GTEXCOEXP	coexpression_Brain___Nucleus_accumbens_(basal_ganglia)	HGNC	1,198,329
GTEXCOEXP	coexpression_Brain___Putamen_(basal_ganglia)	HGNC	1,146,393
GTEXCOEXP	coexpression_Brain___Spinal_cord_(cervical_c_1)	HGNC	267,533
GTEXCOEXP	coexpression_Brain___Substantia_nigra	HGNC	143,792
GTEXCOEXP	coexpression_Breast___Mammary_Tissue	HGNC	3,094
GTEXCOEXP	coexpression_Cells___Cultured_fibroblasts	HGNC	11,652
GTEXCOEXP	coexpression_Cells___EBV_transformed_lymphocytes	HGNC	90,051
GTEXCOEXP	coexpression_Cervix___Ectocervix	HGNC	817,624
GTEXCOEXP	coexpression_Cervix___Endocervix	HGNC	805,252
GTEXCOEXP	coexpression_Colon___Sigmoid	HGNC	11,358
GTEXCOEXP	coexpression_Colon___Transverse	HGNC	22,786
GTEXCOEXP	coexpression_Esophagus___Gastroesophageal_Junction	HGNC	12,874
GTEXCOEXP	coexpression_Esophagus___Mucosa	HGNC	20,463
GTEXCOEXP	coexpression_Esophagus___Muscularis	HGNC	79,416
GTEXCOEXP	coexpression_Fallopian_Tube	HGNC	769,999
GTEXCOEXP	coexpression_Heart___Atrial_Appendage	HGNC	168,057
GTEXCOEXP	coexpression_Heart___Left_Ventricle	HGNC	600,676
GTEXCOEXP	coexpression_Kidney___Cortex	HGNC	583,782
GTEXCOEXP	coexpression_Kidney___Medulla	HGNC	10,461,695
GTEXCOEXP	coexpression_Liver	HGNC	20,645
GTEXCOEXP	coexpression_Lung	HGNC	17,156
GTEXCOEXP	coexpression_Minor_Salivary_Gland	HGNC	47,164
GTEXCOEXP	coexpression_Muscle___Skeletal	HGNC	4,061
GTEXCOEXP	coexpression_Nerve___Tibial	HGNC	12,460
GTEXCOEXP	coexpression_Ovary	HGNC	20,177
GTEXCOEXP	coexpression_Pancreas	HGNC	24,183
GTEXCOEXP	coexpression_Pituitary	HGNC	28,152
GTEXCOEXP	coexpression_Prostate	HGNC	4,197
GTEXCOEXP	coexpression_Skin___Not_Sun_Exposed_(Suprapubic)	HGNC	1,516
GTEXCOEXP	coexpression_Skin___Sun_Exposed_(Lower_leg)	HGNC	84,793
GTEXCOEXP	coexpression_Small_Intestine___Terminal_Ileum	HGNC	73,157
GTEXCOEXP	coexpression_Spleen	HGNC	375,064
GTEXCOEXP	coexpression_Stomach	HGNC	12,964
GTEXCOEXP	coexpression_Testis	HGNC	141,440
GTEXCOEXP	coexpression_Thyroid	HGNC	31,593
GTEXCOEXP	coexpression_Uterus	HGNC	16,498
GTEXCOEXP	coexpression_Vagina	HGNC	20,584
GTEXCOEXP	coexpression_Whole_Blood	HGNC	13,845

The Human BioMolecular Atlas Program (HuBMAP) DCC

HuBMAP datasets


Dataset SAB(s)	AZ, HUBMAP
DCC Website	https://hubmapconsortium.org/
DCC	HuBMAP
Authority	Jonathan Silverstein, Phil Blood
Source Information	https://azimuth.hubmapconsortium.org/
Purpose	HuBMAP data provides tissue, cell-type and gene specific markers from single-cell data. The purpose of the HuBMAP/AZ data is to provide cell-type-specific gene expression markers from single-cell experiments across each tissue.
Description	HuBMAP is working to catalyze the development of a framework for mapping the human body at single cell resolution and developing the tools to create an open, global atlas of the human body at the cellular level. In this database, we include cell-type specific gene markers from the Azimuth project from a subset of tissues including heart, liver and kidney.
Summarization of Methodology	https://azimuth.hubmapconsortium.org/references/
Summarization of Methodology Code Repository URL	https://azimuth.hubmapconsortium.org/references/
Total Nodes	769
Total Edges	910

HuBMAP Az Schema Diagram

HuBMAP Az Node Counts

SAB	Count
HGNC	677
AZ	92

HuBMAP Az Edge Counts

Subject SAB	Predicate	Object SAB	Count
AZ	has_marker_gene_in_kidney	HGNC	485
AZ	has_marker_gene_in_liver	HGNC	225
AZ	has_marker_gene_in_heart	HGNC	200

Illuminating the Druggable Genome (IDG) DCC

IDG datasets


Dataset SAB(s)	IDGP (compound/protein), IDGD (compound/disease) (both are edge SABs)
DCC Website	https://pharos.nih.gov/ https://commonfund.nih.gov/idg
DCC	Illuminating the Druggable Genome Data Coordinating Center - Engagement Plan with the CFDE
Authority	Christophe Lambert (PI), University of New Mexico Health Sciences Center
Source Information	Relationships between compounds, diseases, and proteins drawn from the IDG Target Central Resource Database (TCRD) hosted at https://pharos.nih.gov and at DrugCentral https://drugcentral.org/.
Purpose	The Illuminating the Druggable Genome (IDG) project elucidates the relationships between diseases, targets, and compounds, providing insights into lesser-known proteins, empowering researchers to discover novel therapeutic targets and accelerate drug development for various diseases.
Description	The IDG contributions to the Knowledge Graph include compounds, diseases and proteins and their relationships. A full description of IDG data sources is here: https://pharos.nih.gov/about. Target Central Resource Database (TCRD) is the central resource supporting the IDG-KMC. TCRD has information about human drug targets with a focus on GPCRs, kinases, and ion channels. TCRD categorizes all drug targets into four Target Development Levels (TDLs) by making use of activity thresholds. Protein drug targets from TCRD with known bioactive compounds were incorporated into the IDG KG. Also included in our KG are diseases from TCRD that have known "indication" relationships to approved drugs from DrugCentral. The IDG KG can be used, for example, to explore compounds and proteins related to a specific disease. The IDG KG can be combined with data from other DCCs such as LINCS, GTEx, and others to create interesting scientific use cases.
Summarization of Methodology	Compound nodes were sourced from TCRD, DrugCentral, and PubChem using the PUBCHEM_CID ontology. Specifically, chemical compounds from DrugCentral were included if TCRD indicated known bioactivity against protein targets. Compound node properties include SMILES as the node_definition, DrugBank ID as node_dbxrefs, name as node_label, and 'IDG' as the node_namespace. The PubChem API was used to assign the node_synonyms and node_dbxrefs node properties. Protein nodes were obtained from TCRD and DrugCentral, and rely on the UNIPROTKB ontology. Protein targets from DrugCentral and TCRD with known bioactive compounds were included. A protein's symbol is denoted as node_label, protein names as node_definition, Ensembl IDs as node_dbxrefs, and 'IDG' as node_namespace. The disease nodes use the SNOMEDCT_US ontology and are sourced from TCRD and DrugCentral. Diseases from DrugCentral and TCRD with known indication relationships to approved drugs were included. The disease OMOP concept names are included as the node property node_label. Additionally, OMOP IDs are included as node_dbxrefs and 'IDG' as the node_namespace. The bioactivity relationship is defined between compounds and proteins. While ChEMBL and PubChem offer complex and differing ontologies for bioactivity relationships, for the sake of simplicity and efficiency in this early version we use the custom term "bioactivity". The type of bioactivity measurement (e.g. IC50, Kd, EC50) is included as evidence_class. The indication relationship links compounds and diseases. For simplicity, we introduce the custom, simple term "indication". These indication relationships are defined between diseases and approved drugs from DrugCentral.
Summarization of Methodology Code Repository URL	Code by IDG team: https://github.com/unmtransinfo/cfde-distillery Code from core DD team (further processing and formatting): https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/IDG/check_IDG.ipynb
Total Nodes	Total: 331,788; compounds: 327,951; disease: 1,472; protein: 2,365
Total Edges	Total: 463,972; compound/protein (Bioactivity): 454,957; compound/disease (Indication): 9,015
Source Data DOI(s)	N/A
Source Data URL(s)	https://app.globus.org/file-manager?origin_id=24c2ee95-146d-4513-a1b3-ac0bfdb7856f&origin_path=%2Fprojects%2Fdata-distillery%2FImport%2FIDG%2F
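
The compound node property mapping described in the Summarization of Methodology can be illustrated with a small Python sketch. The property names (node_label, node_definition, node_dbxrefs, node_synonyms, node_namespace) come from the text; the `node_id` format, the "|" synonym delimiter, the function name, and the example compound are assumptions made for illustration and are not taken from the IDG submission.

```python
from typing import Dict, List


def compound_node(
    cid: int, name: str, smiles: str, drugbank_id: str, synonyms: List[str]
) -> Dict[str, str]:
    """Build one compound node record following the mapping in the text."""
    return {
        "node_id": f"PUBCHEM_CID:{cid}",      # hypothetical identifier format
        "node_label": name,                   # compound name
        "node_definition": smiles,            # SMILES string
        "node_dbxrefs": drugbank_id,          # DrugBank ID
        "node_synonyms": "|".join(synonyms),  # delimiter is an assumption
        "node_namespace": "IDG",
    }


# Illustrative input only; not a row from the ingested dataset.
node = compound_node(
    cid=2244,
    name="aspirin",
    smiles="CC(=O)OC1=CC=CC=C1C(=O)O",
    drugbank_id="DB00945",
    synonyms=["acetylsalicylic acid"],
)
for key, value in node.items():
    print(f"{key}\t{value}")
```

Protein and disease nodes would follow the same pattern with the property sources listed in the text (symbol, protein names and Ensembl IDs for proteins; OMOP concept names and OMOP IDs for diseases).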

IDG Schema Diagram

IDG IDGP (compound/protein) Node Counts

SAB	Count
UNIPROTKB	2,365
PUBCHEM	324,293

IDG IDGP (compound/protein) Edge Counts

Subject SAB	Predicate	Object SAB	Count
PUBCHEM	bioactivity	UNIPROTKB	454,957

IDG IDGD (compound/disease) Node Counts

SAB	Count
SNOMEDCT_US	1,472
PUBCHEM	325,299

IDG IDGD (compound/disease) Edge Counts

Subject SAB	Predicate	Object SAB	Count
PUBCHEM	indication	SNOMEDCT_US	9,015

Gabriella Miller Kids First (GMKF) DCC

GMKF datasets


Dataset SAB(s)	KFGENEBIN, KFPT, KFCOHORT
DCC Website	https://kidsfirstdrc.org/
DCC	Gabriella Miller Kids First (GMKF) Pediatric Research Program Data Resource Center (DRC)
Authority	Deanne Taylor (PI), Children's Hospital of Philadelphia
Source Information	Genomic and phenotypic data, broadly summarized from trio cohorts with cardiac birth defects from the Pediatric Cardiac Genetics Consortium cohort in Kids First (cohort SD_PREASA7S).
Purpose	The main purpose of the KF DRC is to better understand the genetic causes and links between childhood cancer and structural birth defects.
Description	The Kids First DRC is a collaborative pediatric research effort created to accelerate data-driven discoveries and the development of novel precision-based approaches for children diagnosed with cancer or a structural birth defect using large genomic datasets. The Kids First DRC is comprised of integrated core teams that support development of leading-edge big data infrastructure and provide the necessary resources and tools to empower researchers and clinicians.
Summarization of Methodology	Variant data from a Congenital Heart Defects (CHD) cohort was queried and filtered using the Kids First variant workbench platform. We filtered for variants that were scored as 'high impact' by the variant effect predictor tool (VEP). Variant counts per gene were then computed by counting how many times each gene appeared. The number of variations per gene is stored in the 'value' property of the SAB KFGENEBIN Code nodes. Kids First cohorts are stored in the graph as their own nodes and have an SAB of KFCOHORT. Patient IDs from the CHD cohort have also been ingested into the graph as their own nodes and have an SAB of KFPT.
Summarization of Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/tree/main/DCC_workflows/KidsFirst
Total Nodes	18,719
Total Edges	76,690
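
The per-gene counting step described in the methodology can be sketched as follows. This is a minimal illustration, not the Kids First workflow itself: the record layout (the `gene` and `impact` fields), the `HIGH` label, and the sample values are assumptions made for this example. Only the high-impact filter and the count-per-gene logic come from the description above.

```python
from collections import Counter

# Hypothetical variant records; field names and values are assumptions.
variants = [
    {"gene": "GATA4", "impact": "HIGH"},
    {"gene": "GATA4", "impact": "MODERATE"},
    {"gene": "NKX2-5", "impact": "HIGH"},
    {"gene": "GATA4", "impact": "HIGH"},
    {"gene": "TBX5", "impact": "LOW"},
]


def count_high_impact_variants(records):
    """Count how many high-impact variants were observed per gene."""
    return Counter(r["gene"] for r in records if r["impact"] == "HIGH")


gene_counts = count_high_impact_variants(variants)

# Each count would populate the 'value' property of a KFGENEBIN Code node.
for gene, value in sorted(gene_counts.items()):
    print(gene, value)
```

Genes with no high-impact variants simply do not appear in the result, so they would not produce a KFGENEBIN node under this sketch.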

GMKF Schema Diagram

GMKF Node Counts

SAB	Count
KFGENEBIN	13,375
KFPT	5,329
KFCOHORT	15

GMKF Edge Counts

Subject SAB	Predicate	Object SAB	Count
KFPT	has_phenotype	HPO	44,611
KFGENEBIN	belongs_to_cohort	KFCOHORT	13,375
KFGENEBIN	gene_has_variants	HGNC	13,375
KFPT	belongs_to_cohort	KFCOHORT	5,329

The Library of Integrated Network-Based Cellular Signatures (LINCS) DCC

LINCS datasets


Dataset SAB(s)	LINCS (edge SAB only)
DCC Website	https://lincsproject.org/
DCC	Library of Integrated Network-Based Cellular Signatures (LINCS) Data Coordination and Integration Center (DCIC)
Authority	Avi Ma'ayan (PI), Icahn School of Medicine at Mount Sinai
Source Information	Gene expression changes resulting from drug/small-molecule perturbations across cell lines, and gene expression signature similarity between drugs/small molecules based on LINCS L1000 signature similarity
Purpose	Understand cellular responses to various drug and pre-clinical compound treatments through L1000 transcriptomics assays
Description	The LINCS assertions include drug-gene associations and drug-drug similarity associations computed from the LINCS L1000 consensus signatures dataset. Each drug is linked to the top 25 most up-regulated and top 25 most down-regulated genes in the L1000 consensus signatures for the drug/small-molecule, as well as to the top 5 most similar other drugs in the dataset based on the correlation between the consensus signatures for each drug.
Summarization of Methodology	Level 3 L1000 profiles, drug metadata, and gene metadata were first downloaded from CLUE.io. The L1000 Level 5 signatures were then computed using the Characteristic Direction method [BMC Bioinformatics 15, 79 (2014)]. For each signature, replicate L1000 profiles for a given perturbagen and dosage were compared against all other L1000 profiles from the same cell line batch. Consensus signatures for each drug were then computed by taking the mean of all gene expression vectors corresponding to the given drug across cell lines, timepoints, and dosages. Drugs were filtered to only those with known PubChem IDs in the original CLUE.io metadata, resulting in a final set of 4,523 drugs. The top 25 up- and down-regulated genes in each consensus signature with known Ensembl IDs from the metadata were determined by the greatest positive and negative Characteristic Direction coefficients, respectively. In total, 225,509 edges and 4,419 unique genes are represented in this collection of knowledge graph assertions. Additionally, a drug-drug similarity matrix was generated by computing the cosine similarity between all possible pairs of the consensus drug signatures. For each drug, the top 5 other drugs with the greatest positive cosine similarity values were retained. Duplicate edges were removed, resulting in 20,785 total edges representing consensus signature-based drug-drug similarity between the 4,523 drugs with known PubChem IDs.
Summarization of Methodology Code Repository URL	The methods are described in the following publication: Evangelista, J.E., Clarke, D.J.B., Xie, Z. et al. Toxicology knowledge graph for structural birth defects. Commun Med 3, 98 (2023). The code to produce the assertions can be found at: https://github.com/nih-cfde/ReproToxTables
Total Nodes	8,942 (drugs: 4,523; genes: 4,419)
Total Edges	246,294 (drug-gene: 225,509; drug-drug: 20,785)
Source Data DOI(s)	N/A
Source Data URL(s)	https://maayanlab.cloud/sigcom-lincs/#/Download
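
The consensus, top-gene, and similarity steps in the methodology can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code from the repository above: it starts from already-computed signature vectors (the Characteristic Direction step is not implemented), the drug and gene names are made up, and treating A-B and B-A as the same edge is one possible reading of "duplicate edges were removed".

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized signature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def top_genes(signature, genes, k):
    """Return the k most up-regulated and k most down-regulated genes."""
    ranked = sorted(zip(genes, signature), key=lambda pair: pair[1], reverse=True)
    up = [gene for gene, score in ranked[:k] if score > 0]
    down = [gene for gene, score in ranked[::-1][:k] if score < 0]
    return up, down


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_edges(consensus, k):
    """Keep each drug's top-k positively similar drugs as undirected edges."""
    edges = set()
    for drug, signature in consensus.items():
        scored = [
            (cosine(signature, other_signature), other)
            for other, other_signature in consensus.items()
            if other != drug
        ]
        positive = [pair for pair in scored if pair[0] > 0]
        for _, other in sorted(positive, reverse=True)[:k]:
            edges.add(frozenset((drug, other)))  # assumption: A-B equals B-A
    return edges


# Hypothetical signatures: one list of vectors per drug, one value per gene.
genes = ["G1", "G2", "G3", "G4"]
signatures = {
    "drugA": [[2.0, -1.0, 0.5, 0.0], [4.0, -3.0, 0.5, 0.0]],
    "drugB": [[1.0, -1.0, 0.0, 0.5]],
    "drugC": [[-2.0, 1.0, 0.0, 0.0]],
}

consensus = {drug: mean_vector(vectors) for drug, vectors in signatures.items()}
up, down = top_genes(consensus["drugA"], genes, k=1)
edges = similarity_edges(consensus, k=1)
```

In the dataset itself, `k` is 25 for the gene lists and 5 for the drug-drug similarity edges.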

LINCS Schema Diagram

LINCS Node Counts

SAB	Count
NCBI	1
PUBCHEM	4,523
HGNC	4,418

LINCS Edge Counts

Subject SAB	Predicate	Object SAB	Count
HGNC	negatively_regulated_by	PUBCHEM	112,759
HGNC	positively_regulated_by	PUBCHEM	112,747
PUBCHEM	in_similarity_relationship_with	PUBCHEM	20,785
NCBI	positively_regulated_by	PUBCHEM	3

The Molecular Transducers of Physical Activity Consortium (MoTrPAC) DCC

MoTrPAC datasets


Dataset SAB(s)	MOTRPAC
DCC Website	https://motrpac-data.org
DCC	Molecular Transducers of Physical Activity Consortium (MoTrPAC) Bioinformatics Center (BIC)
Authority	Euan Ashley MD PhD (PI), Matthew Wheeler MD PhD (PI)
Source Information	Gene differential expression changes resulting from the RNA-seq data of young adult rats (6 month old) performing endurance training exercise at the 1 week, 2 week, 4 week and 8 week time points.
Purpose	The Molecular Transducers of Physical Activity Consortium (MoTrPAC) aims to elucidate how exercise improves health and ameliorates diseases by building a map of the molecular responses to endurance exercise.
Description	MoTrPAC is a multi-site collaboration across the US encompassing various scientific disciplines: preclinical animal study sites and human clinical exercise sites, which perform the exercise testing and biospecimen collection; a consortium coordinating center and biorepository, which manages sample collection, distribution of samples, and consortium logistics; chemical analysis sites, which are responsible for omics analysis of the samples collected; and a bioinformatics center, which collaboratively analyzes and maps the data generated by the other sites and disseminates the data and other resources to the public. The animal studies enable analysis of the effects of exercise on many different tissues that are not readily obtainable in humans, whereas the collection of accessible human tissues (muscle, blood, and adipose) will permit the analysis of the direct effect of exercise in humans. Additional information can be found at the main consortium page (https://motrpac.org) or at the data portal (https://motrpac-data.org). The MoTrPAC study is divided into two main parts, animal (rat) and human, with multiple phases or interventions in each. Preclinical animal study sites conduct the endurance exercise and training intervention in young adult (6-month-old) and middle-aged adult (18-month-old) rats, while clinical study sites conduct the human endurance and resistance training interventions in pediatric participants, adults, and highly active adults.
Summarization of Methodology
Summarization of Methodology Code Repository URL	https://github.com/TaylorResearchLab/CFDE_DataDistillery/blob/main/DCC_workflows/MoTrPAC/MOTRPAC.ipynb
Total Nodes	16,149
Total Edges	25,714

MoTrPAC Schema Diagram

MoTrPAC Node Counts

SAB	Count
ENSEMBL	5,919
MOTRPAC	8,571

MoTrPAC Edge Counts

Subject SAB	Predicate	Object SAB	Count
MOTRPAC	associated_with	ENSEMBL	8,570
MOTRPAC	located_in	UBERON	8,571
MOTRPAC	sex	PATO	8,571

Metabolomics Workbench (MW) DCC

MW datasets


Dataset SAB(s)	MW (REFMET; metabolite nodes)
DCC Website	https://www.metabolomicsworkbench.org/
DCC	Metabolomics Workbench
Authority	Professor Shankar Subramaniam (PI)
Source Information	Gene-metabolite relationships: MW database tables based on KEGG and other resources. Disease-metabolite relationships: publication based on HMDB data (https://pubmed.ncbi.nlm.nih.gov/32426349/). Cell-metabolite relationships: MW database tables for data submitted to NMDR.
Purpose	Understand what metabolites may be regulated by various genes and their spatial (anatomical) and disease context.
Description	The National Institutes of Health (NIH) Common Fund Metabolomics Program was developed with the goal of increasing national capacity in metabolomics by supporting the development of next generation technologies, providing training and mentoring opportunities, increasing the inventory and availability of high quality reference standards, and promoting data sharing and collaboration. In support of this effort, the Metabolomics Common Fund's National Metabolomics Data Repository (NMDR), housed at the San Diego Supercomputer Center (SDSC), University of California, San Diego, has developed the Metabolomics Workbench. The Metabolomics Workbench serves as a national and international repository for metabolomics data and metadata and provides analysis tools and access to metabolite standards, protocols, tutorials, training, and more. NMDR houses data on metabolomics studies conducted by various centers and research laboratories across the nation and the world, spanning many species, sample sources, diseases, metabolomics experimental techniques and metabolite classes [https://www.metabolomicsworkbench.org/data/browse.php]. The data we have shared with the Data Distillery partnership is a key subset of all the data in NMDR, centered around metabolites. Specifically, we have shared disease-metabolite, gene-metabolite and cell/anatomy (sample source)-metabolite relationships, which, when integrated with data from other DCCs and external resources, have the potential to address interesting biological questions.
Summarization of Methodology	Gene-Metabolite: Human genes catalyzing metabolic reactions and their associated metabolites were obtained from MW database tables. The HGNC ID was used as the metabolic gene node_id, and its approved symbol and name were used as node_label and node_definition, respectively. UMLS, ENTREZ and ENSEMBL IDs were used as node_dbxrefs. For the edges, the Subject (Gene: HGNC ID) was related to the Object (Metabolite: PUBCHEM_CID) by the Predicate (RO_0002566: Causally influences). Disease-Metabolite: Disease-metabolite entities and relationships were deduced from the publication (PMID: 32426349) based on HMDB. The PUBCHEM_CID/HMDB ID was used as node_id for the metabolite. Similarly, disease entities were encoded with DOID or HPO IDs. UMLS, PUBCHEM_CID, DRUGBANK and REFMET were used as node_dbxrefs. For the edges, the Subject (Metabolite: PUBCHEM_CID/HMDB ID) was related to the Object (Disease: DOID/HPO) by the Predicate (RO_0003308: Correlated with condition). Cell-Metabolite: Metabolite-anatomy context (cell/tissue association) was obtained from the MW database. The cell/tissue entity node_id is encoded with UBERON, CL and CLO IDs and cross-referenced with UMLS. For the edges, the Subject (Spatial context: UBERON/CL/CLO) was related to the Object (Metabolite: PUBCHEM_CID) by the Predicate (RO_0003000: Produces).
Summarization of Methodology Code Repository URL	https://github.com/mano-at-sdsc/MW_DataDistillery
Total Nodes	10,009
Total Edges	51,271
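
The subject-predicate-object pattern described in the methodology can be sketched as below. This is a minimal illustration under stated assumptions: the three predicate identifiers are quoted from the methodology, but the `kind` labels, the `make_edge` helper, and the placeholder identifiers are hypothetical and do not reflect the MW ingestion code or its file format.

```python
# Predicate identifiers quoted from the methodology text.
PREDICATES = {
    "gene-metabolite": "RO_0002566",     # causally influences
    "metabolite-disease": "RO_0003308",  # correlated with condition
    "anatomy-metabolite": "RO_0003000",  # produces
}


def make_edge(kind, subject_id, object_id):
    """Return a (subject, predicate, object) triple for one MW assertion."""
    return (subject_id, PREDICATES[kind], object_id)


# Placeholder identifiers, used only to show the direction of each triple.
edges = [
    make_edge("gene-metabolite", "HGNC:1", "PUBCHEM:1"),
    make_edge("metabolite-disease", "PUBCHEM:1", "DOID:1"),
    make_edge("anatomy-metabolite", "UBERON:1", "PUBCHEM:1"),
]

for subject_id, predicate, object_id in edges:
    print(subject_id, predicate, object_id)
```

Note that the metabolite is the object of the gene and anatomy assertions but the subject of the disease assertions.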

MW Schema Diagram

MW Node Counts

SAB	Count
UBERON	67
CL	12
CLO	3
HGNC	1,061
PUBCHEM	8,543
HMDB	18
DOID	276
HPO	32

MW Edge Counts

Subject SAB	Predicate	Object SAB	Count
UBERON	produces	PUBCHEM	34,471
CL	produces	PUBCHEM	6,777
CLO	produces	PUBCHEM	536
HGNC	causally influences	PUBCHEM	5,527
PUBCHEM	correlated with condition	DOID	3,856
HMDB	correlated with condition	DOID	27
PUBCHEM	correlated with condition	HPO	77

Stimulating Peripheral Activity to Relieve Conditions (SPARC) DCC

SPARC datasets


Dataset SAB(s)	SCKAN, NPO, UBERON, PATO, NIFSTD
DCC Website	https://sparc.science/
DCC	SPARC Data and Resource Center (DRC) - Knowledge Management and Curation Core
Authority	Tom Gillespie, Fahim Imam (SPARC K-Core, University of California San Diego) Jyl Boline (PM - SPARC K-Core)
Source Information	A key component of the SPARC Program is the SPARC Connectivity Knowledge Base of the Autonomic Nervous System, referred to as SCKAN. SCKAN is a semantic store housing a comprehensive knowledge base of autonomic nervous system (ANS) nerve to end organ connectivity. Connectivity information is derived from SPARC experts, SPARC data, and the literature and textbooks using a Natural Language Processing (NLP) pipeline.
Purpose	Facilitate enhanced understanding of the peripheral nervous system to support the development of effective bioelectronic therapies by driving collaborative neurosciences and providing online resources for accessing and submitting curated data and models, as well as dynamic knowledge-management and visualization tools.
Description	The SPARC Connectivity Knowledge Base of the Autonomic Nervous System (SCKAN) is an integrated graph database composed of three parts: the SPARC dataset metadata graph, ApiNATOMY and Neuron Phenotype Ontology (NPO) models of connectivity, and the larger ontology used by SPARC, which is a combination of the NIF-Ontology and community ontologies.
Summarization of Methodology	SCKAN provides a central location to populate, discover, and query ANS connectivity knowledge over multiple scales. It supports queries such as "what are the locations of neuron somas with processes that pass through spinal cord level C4?" and the creation of a searchable visual atlas of ANS circuitry. Users of the SPARC maps can query SCKAN to find more information about routes, targets and evidence. SCKAN contains statements about neuronal connectivity at the neuron population level, largely in the form of: "Neurons with somas in structure A project to structure B via nerve C." SCKAN models connections at two levels of granularity: circuits and individual connections. A circuit represents a detailed model of connectivity that is associated with a particular organ, such as the bladder, or with a functional circuit, such as defensive breathing. Circuits contain detailed representations of the neuron populations giving rise to ANS connections. They include mappings of the locations of cell bodies, dendrites, and axon segments, as well as the synaptic endings involved in a particular circuit. Circuits in SCKAN are modelled using ApiNATOMY, a knowledge model and tool specifically created to represent multiscale connectivity. To provide comprehensive knowledge about ANS connectivity, the circuit-based approach is supplemented with well-known ANS connections derived from the literature and textbooks using a Natural Language Processing (NLP) pipeline. These individual connectivity statements do not have detailed topological information associated with them and are represented using NPO.
Summarization of Methodology Code Repository URL	https://zenodo.org/record/7476115
Total Nodes	484,768
Total Edges	1,337,124
Source Data DOI(s)	10.5281/zenodo.7476115
Source Data URL(s)	- https://doi.org/10.5281/zenodo.7476115 - https://github.com/open-physiology/apinatomy-models - https://github.com/SciCrunch/NIF-Ontology/ - https://bioportal.bioontology.org/ontologies/NPOKB

SPARC Schema Diagram

SPARC Node Counts

SAB	Count
UBERON	1,552
MBA	1,327
NPOKB	436
ILX	174
CHEBI	122
ENTREZ	122
NIFEXT	70
NLX	56
NCBITAXON	54
PR	39

SPARC Edge Counts

Subject SAB	Predicate	Object SAB	Count
NPOKB	hasInstanceInTaxon	NCBI	569
NPOKB	hasInstanceInTaxon	SNOMEDCT_US	567
NPOKB	isa	ILX.TR	434
ILX	isa	UBERON	385
ILX	isa	NCI	311
ILX	isa	FMA	288

Additional Datasets

CLINVAR

The ClinVar dataset (v2023-01-05) was used to define assertions between human genes and phenotypes. Only genes with pathogenic, likely pathogenic, and pathogenic/likely pathogenic variants were considered, and associations with no assertion criteria met were excluded. To retrieve the target phenotype/disease, we used the MedGen IDs listed in the ClinVar dataset (already present in the KG). The processed ClinVar dataset contains 214,040 relationships (including reverse relationships) with the following characteristics: [type: "gene_associated_with_disease_or_phenotype", SAB: "CLINVAR"] and [type: "inverse_gene_associated_with_disease_or_phenotype", SAB: "CLINVAR"], connecting HGNC to MONDO, HPO, EFO and MESH Concept nodes.
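
The filtering and the forward/inverse pairing described above can be sketched as follows. This is a minimal illustration, not the actual ingestion script: the record fields, the sample records, and the exact significance and review-status strings are assumptions made for this example. Only the two relationship types are taken from the text.

```python
FORWARD = "gene_associated_with_disease_or_phenotype"
INVERSE = "inverse_gene_associated_with_disease_or_phenotype"

# Assumed labels for the clinical significance values that are kept.
KEPT_SIGNIFICANCE = {
    "Pathogenic",
    "Likely pathogenic",
    "Pathogenic/Likely pathogenic",
}
# Assumed label for records that lack assertion criteria.
EXCLUDED_REVIEW_STATUS = "no assertion criteria provided"


def build_assertions(records):
    """Return forward and inverse triples for the gene-phenotype pairs kept."""
    pairs = {
        (r["gene"], r["medgen_id"])
        for r in records
        if r["significance"] in KEPT_SIGNIFICANCE
        and r["review_status"] != EXCLUDED_REVIEW_STATUS
    }
    assertions = []
    for gene, phenotype in sorted(pairs):
        assertions.append((gene, FORWARD, phenotype))
        assertions.append((phenotype, INVERSE, gene))
    return assertions


# Hypothetical records; gene symbols and MedGen IDs are placeholders.
records = [
    {"gene": "GENE1", "medgen_id": "MEDGEN:1", "significance": "Pathogenic",
     "review_status": "criteria provided, single submitter"},
    {"gene": "GENE1", "medgen_id": "MEDGEN:2", "significance": "Benign",
     "review_status": "criteria provided, single submitter"},
    {"gene": "GENE2", "medgen_id": "MEDGEN:3", "significance": "Likely pathogenic",
     "review_status": "no assertion criteria provided"},
]

assertions = build_assertions(records)
```

Because every retained pair yields one forward and one inverse triple, the relationship total is always twice the number of distinct gene-phenotype pairs.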

CMAP

The edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset were obtained from the Harmonizome database https://maayanlab.cloud. The dataset added 2,625,336 relationships (including reverse relationships) connecting the CHEBI and HGNC nodes with predicates "negatively_correlated_with_gene", "inverse_negatively_correlated_with_gene", "positively_correlated_with_gene", "inverse_positively_correlated_with_gene" (SAB: "CMAP").

DisGeNET

DisGeNET contains gene-disease associations (GDA) and variant-disease associations (VDA). The GDA data are organized by Semanticscience Integrated Ontology (SIO) codes, which represent the kind of variant affecting the gene. There are 15 different types of variants, and each one has its own SAB. Each GDA gets its own node, which is connected to an HGNC node and a disease/phenotype node, usually HPO or DOID, through 'refers_to' relationships. The VDA data also get their own nodes, which are connected to a dbSNP node and an HGNC node. There are approximately 1.1 million GDAs and 370k VDAs.

// Cypher query to reproduce the schema figure
MATCH (t:Term)-[:PT_DGN]-(code1:Code)-[:CODE]-(cui1:Concept)-[r0:refers_to]-(cui2:Concept)-[:CODE]-(code2:Code {SAB:'HP'})-[:PT]-(t2:Term)
MATCH (cui1)-[r1:refers_to]-(cui3:Concept)-[:CODE]-(code3:Code {SAB:'HGNC'})-[:PT_MONDO]-(t3:Term)
WHERE code1.SAB STARTS WITH 'DGNF'
RETURN * LIMIT 1

HPOMP

This set of assertions maps human phenotype ontology (HPO) nodes to mammalian phenotype ontology (MP) nodes through the 'is_approximately_equivalent_to' relationship. It is essentially a set of assertions mapping human phenotype codes to mouse phenotype codes. The mappings were produced using a software tool called PheKnowLator. There are 1,785 HPOMP mappings. These assertions can be queried by specifying the SAB property as HPOMP on the 'is_approximately_equivalent_to' relationship.
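
Querying by relationship type and SAB can be sketched with a small helper that assembles the Cypher text. This is a hedged sketch: the `mapping_query` helper is hypothetical, and it assumes the mapping relationships connect Concept nodes, as the other Cypher queries in this document do. Values are interpolated directly into the query string, so it is only suitable for trusted input.

```python
def mapping_query(predicate, sab, limit=10):
    """Build a Cypher query for Concept-to-Concept assertions from one SAB."""
    return (
        f"MATCH (a:Concept)-[r:{predicate} {{SAB:'{sab}'}}]-(b:Concept) "
        f"RETURN a, r, b LIMIT {limit}"
    )


query = mapping_query("is_approximately_equivalent_to", "HPOMP")
print(query)
```

The same helper could be pointed at the other mapping sets in this section by passing the relationship type and SAB named in their descriptions.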

HGNCHPO

This set of assertions maps HGNC gene nodes to human phenotype ontology (HPO) nodes through the 'associated_with' relationship. There are 671,046 HGNCHPO mappings. These assertions can be queried by specifying the SAB property as HGNCHPO on the 'associated_with' relationship.

HGNCHCOP

This set of assertions maps mouse gene nodes (HCOP) to human gene nodes (HGNC). Mouse gene nodes are referred to as 'HCOP' in the Data Distillery Knowledge Graph because the HGNC Comparison of Orthology Predictions (HCOP) tool was used to generate these mappings. The 'in_1_to_1_orthology_relationship_with' relationship is used to connect the HGNC and HCOP nodes. There are 67,027 HGNCHCOP mappings. These assertions can be queried by specifying the SAB property as HCOPHGNC on the 'in_1_to_1_orthology_relationship_with' relationship.

HCOPMP

This set of assertions maps mouse gene nodes (HCOP) to the mammalian phenotype ontology (MP) nodes through the 'involved_in' relationship. These mappings are the mouse version of the HGNCHPO mappings. Files from the International Mouse Phenotyping Consortium (IMPC) and Mouse Genome Informatics (MGI) were used to create this dataset. There are 234,043 HCOPMP mappings. These assertions can be queried by specifying the SAB property as HCOPMP on the 'involved_in' relationship.

Homo Sapiens Chromosomal Location Ontology (HSCLO)

Homo Sapiens Chromosomal Location Ontology (HSCLO) was primarily created to connect 4DN loop coordinates to the rest of the graph through the mapping between HSCLO and GENCODE. HSCLO was later used to connect GTEXEQTL locations in the graph as searchable nodes at 1 kbp resolution (the same as 4DN). Both the relationships and the nodes of the dataset use HSCLO as their SAB. HSCLO nodes are defined at 5 resolution levels: chromosome, 1 Mbp, 100 kbp, 10 kbp and 1 kbp. Each level connects to the level below it through above_(resolution level)_band relationships (e.g. "above_1Mbp_band", "above_1kbp_band"), and nodes at the same resolution level are connected through precedes_(resolution level)_band relationships (e.g. "precedes_10kbp_band"). The dataset contains 3,431,155 nodes and 6,862,195 relationships (13,724,390 bidirectional).
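
The nesting of resolution levels can be illustrated by computing which band contains a given genomic position at each level. This is a sketch under stated assumptions: it uses 1-based, closed intervals aligned to multiples of the band size, and the `containing_bands` helper is hypothetical. The actual HSCLO node identifiers and coordinate conventions are not specified here and may differ.

```python
# Band sizes in base pairs for the four sub-chromosome resolution levels.
RESOLUTIONS = {
    "1Mbp": 1_000_000,
    "100kbp": 100_000,
    "10kbp": 10_000,
    "1kbp": 1_000,
}


def containing_bands(chromosome, position):
    """Return the band containing a 1-based position at each resolution."""
    bands = {}
    for label, size in RESOLUTIONS.items():
        start = (position - 1) // size * size + 1
        bands[label] = (chromosome, start, start + size - 1)
    return bands


bands = containing_bands("chr1", 1_234_567)
for label, (chromosome, start, end) in bands.items():
    print(f"{label}: {chromosome}:{start}-{end}")
```

Each band lies entirely inside the band one level above it, which is what lets a 1 kbp location be traced up to its 10 kbp, 100 kbp, 1 Mbp and chromosome parents.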

MSIGDB

Five subsets of MSigDB v7.4 datasets were introduced as entity-gene relationships to the knowledge graph: C1 (positional gene sets), C2 (curated gene sets), C3 (regulatory target gene sets), C8 (cell type signature gene sets) and H (hallmark gene sets). With this subset, MSIGDB Concept nodes were created for MSigDB systematic names (used as Codes excluding KEGG data). The relationships between these Concept nodes and HGNC nodes were defined using the mentioned 5 subsets where the subset information was included in the relationship SABs as "MSIGDB".

RATHCOP

This set of assertions maps human ENSEMBL gene nodes to rat ENSEMBL gene nodes. These mappings were generated from the HCOP tool just like for the mouse to human assertions, except we used the ENSEMBL codes here instead of the HGNC codes. The 'has_human_ortholog' relationship is used to connect ENSEMBL Rat nodes to ENSEMBL Human nodes. There are 42,371 RATHCOP mappings and they can be queried by specifying the SAB property as RATHCOP on the 'has_human_ortholog' relationship.

Reactome

// Cypher query to reproduce the schema figure
MATCH (code1:Code {SAB:'REACTOME'})-[:CODE]-(cui1:Concept)-[{SAB:'REACTOME'}]-(cui3:Concept)-[:CODE]-(code3:Code {SAB:'GO'})
RETURN * LIMIT 1

Reactome is a database of reactions, organized into their respective pathways. Interactions between entities such as nucleic acids, proteins and small molecules make up these assertions. Reactome reactions have the SAB of REACTOME and have either a 'has_input' relationship or a 'has_GO_term' relationship with either a CHEBI, GO or UNIPROTKB Code.

WikiPathways

// Cypher query to reproduce the schema figure
MATCH (t0:Term)-[:PT]-(code1:Code {SAB:'HGNC'})-[:CODE]-(cui1:Concept)-[{SAB:'WP'}]-(cui3:Concept)-[:CODE]-(code3:Code {SAB:'HGNC'})-[:PT]-(t:Term)
RETURN * LIMIT 1

WikiPathways contains assertions defining interactions between genes within biological pathways. Genes are connected through one of seven different relationship types, in order of most frequent to least frequent: DirectedInteraction, Inhibition, Stimulation, Binding, TranscriptionTranslation, Conversion and Catalysis. There are also WikiPathway Concepts which represent pathways. Each pathway Concept is connected to the genes that have interactions in that pathway.

UNIPROTKB

UNIPROTKB dataset


Dataset SAB(s)	UNIPROTKB, HGNC
Authority	J. Alan Simmons, Department of Biomedical Informatics (DBMI), University of Pittsburgh
Source Information	Export of information from UniProtKB's REST API.
Purpose	Describes protein products of human genes
Description	Describes a select set of proteins from UniProtKB in terms of their relationship as gene products of genes from HGNC.
Summarization of Methodology	The UBKG generation framework script executes a query against the UniProtKB REST API that returns information on the proteins that are associated with Homo sapiens. The script then maps each protein to a gene, using HGNC identifiers.
Summarization of Methodology Code Repository URL	https://github.com/x-atlas-consortia/ubkg-etl/blob/main/generation_framework/uniprotkb/README.md
Total Nodes	40,339
Total Edges	20,212

UNIPROTKB Schema Diagram

UNIPROTKB Node Counts

SAB	Count
UNIPROTKB	20,208
HGNC	20,131

UNIPROTKB Edge Counts

Subject SAB	Predicate	Object SAB	Count
UNIPROTKB	gene_product_of	HGNC	20,212
HGNC	has_gene_product	UNIPROTKB	20,212

GENCODE

GENCODE dataset


Dataset SAB(s)	ENSEMBL,UNIPROTKB,GENCODE_VS,PGO,REFSEQ
Authority	J. Alan Simmons, Department of Biomedical Informatics (DBMI), University of Pittsburgh
Source Information	GENCODE FTP site for Human release 41 (GRCh38.p13)
Description	Translated gene annotation data from GENCODE
Summarization of Methodology	The UBKG generation script: downloads annotation and metadata GTF files from the GenCode FTP site; encodes information using valuesets extracted from both the annotation files and the GenCode web site; and converts encoded annotation data into UBKG Edges/Nodes format.
Summarization of Methodology Code Repository URL	https://github.com/x-atlas-consortia/ubkg-etl/blob/main/generation_framework/gencode/README.md
Total Nodes
Total Edges

GENCODE Schema Diagram

The GENCODE schema corresponds to assertions that can be derived from the annotation file. The types of relationships depend on whether the annotation is for a gene or a transcript.

Gene annotation:

Transcript annotation:

GENCODE Node Counts

The UBKG features concept-code synonymy: e.g., if a gene has both ENSEMBL and HGNC IDs, the codes for those IDs share the same concept. Because of synonymy, concepts are counted, not codes.

GENCODE is associated with 453,126 concept nodes.
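
The difference between counting codes and counting concepts can be shown with a toy mapping. This is a hypothetical illustration of concept-code synonymy only: the codes and concept identifiers below are placeholders, not values from the graph.

```python
# Placeholder code-to-concept assignments; two codes share one concept.
code_to_concept = {
    "ENSEMBL:GENE_A": "CONCEPT_A",
    "HGNC:GENE_A": "CONCEPT_A",
    "ENSEMBL:GENE_B": "CONCEPT_B",
}


def count_concepts(mapping):
    """Count distinct concepts, so synonymous codes are counted once."""
    return len(set(mapping.values()))


code_count = len(code_to_concept)
concept_count = count_concepts(code_to_concept)
print(code_count, concept_count)
```

Here three codes resolve to two concepts, which is why the figure reported above is a concept count rather than a code count.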

GENCODE Edge Counts

All edges in GENCODE have SAB='GENCODE'.

Many of the GENCODE assertions have categorical objects with values that are taken from the GENCODE_VS ontology.

edge	count	categorical?
isa	15,790	no
is_gene_biotype	241,650	yes
is_transcript_biotype	175,576	yes
is_feature_type	312,946	yes
is_directional_form_of	312,867	yes
has_reseq_id	140,164	no

GENCODE_VS

GENCODE_VS dataset


Dataset SAB(s)	GENCODE_VS
Authority	J. Alan Simmons, Department of Biomedical Informatics (DBMI), University of Pittsburgh
Source Information	GENCODE site
Description	Valuesets of categorical gene annotation information
Summarization of Methodology	Categorical annotation information from the GenCode Data Format and Biotypes pages was used to build a SimpleKnowledge spreadsheet of valueset information. The UBKG generation script refers to GENCODE_VS when building GENCODE.
Summarization of Methodology Code Repository URL	https://github.com/x-atlas-consortia/ubkg-etl/blob/main/generation_framework/gencode/README.md

Source spreadsheet

The GENCODE_VS SimpleKnowledge spreadsheet can be found here.

REFSEQ

REFSEQ dataset


Dataset SAB(s)	REFSEQ
Authority	J. Alan Simmons, Department of Biomedical Informatics (DBMI), University of Pittsburgh
Source Information	Export of information from NCBI EUtils
Purpose	RefSeq definitions for human genes
Description	Provides RefSeq definitions of genes based on Gene ID.
Summarization of Methodology	The UBKG generation framework script executes a query against the NCBI EUtils REST API that returns information on genes based on Gene ID. The script assumes that the GENCODE ingestion has occurred, which cross-references HGNC and Gene IDs.
Summarization of Methodology Code Repository URL	https://github.com/x-atlas-consortia/ubkg-etl/tree/main/generation_framework/refseq
Total Nodes	16,970
Total Edges	20,781

REFSEQ schema

AZ

AZ dataset


Dataset SAB(s)	AZ, CL
Authority	J. Alan Simmons, Department of Biomedical Informatics (DBMI), University of Pittsburgh
Source Information	SimpleKnowledge spreadsheet that maps Azimuth codes to Cell Ontology codes.
Purpose	Cross-walk between AZ and CL
Summarization of Methodology	A custom SimpleKnowledge spreadsheet maps AZ codes to CL codes.
Summarization of Methodology Code Repository URL	https://docs.google.com/spreadsheets/d/1p1gE2F5S5Q5dIUWW5tp4SKgc3K6rJejeTU61eKrb4z0/edit?usp=sharing
Total Nodes	166
Total Edges	648
