Building AlzKB (from scratch)

Overview

This guide will teach you the complete process of building the Alzheimer’s Knowledge Base (AlzKB). It’s not a concise process, but it is extensible to other applications of knowledge engineering. We use the same process for our other knowledge bases (such as ComptoxAI), so this guide can also be used to teach you how to build your own.

The following diagram gives an overview of the build process:

./img/build-abstract.png

  1. First, you use domain knowledge to create the ontology
  2. Then, you collect the data sources and use them to populate the ontology
  3. Finally, you convert the ontology into a graph database

1.: Creating the AlzKB Ontology

Important note: Most users don’t need to follow these steps, since it is already done! Unless you want to extend AlzKB or make major modifications to its node/edge types, you should skip to the next section. If you DO want to do those things, then keep reading.

AlzKB uses an OWL 2 ontology to act something like a ‘template’ for the nodes and relationships in the final knowledge graph. While the actual nodes and relationships are added automatically according to the ‘rules’ defined in the ontology, the ontology itself is constructed manually, using domain knowledge about AD. We do this using the Protégé ontology editor. If you don’t already have it, download and install Protégé Desktop on your computer.

2.: Obtaining the third-party data sources

The next step is to collect the source data files that will eventually become the nodes, relationships, and properties in AlzKB’s knowledge graph. Since databases are distributed in a variety of formats and modalities, you will have to work with a mix of plain-text “flat” files as well as relational (SQL) databases. All of the SQL databases parsed to build AlzKB are distributed for MySQL (as opposed to some other flavor of SQL).

Flat file data sources

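| Source    | Directory name | Entity type(s)                  | URL                       | Extra instructions |
|-----------+----------------+---------------------------------+---------------------------+--------------------|
| Hetionet  | hetionet       | Many - see populate_ontology.py | GitHub                    | Hetionet           |
| NCBI Gene | ncbigene       | Genes                           | Homo_sapiens.gene_info.gz | NCBI Gene          |
| Drugbank  | drugbank       | Drugs / drug candidates         | DrugBank website          | Drugbank           |
| DisGeNET  | disgenet       | Diseases and disease-gene edges | DisGeNET                  | DisGeNET           |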

Hetionet

Download the hetionet-v1.0-edges.sif.gz and hetionet-v1.0-nodes.tsv files from the Hetionet GitHub repository, and extract the former using gunzip. Both are essentially TSV files, even though one has the .sif extension.
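Because both files are tab-separated, you can inspect them with Python’s standard csv module before handing them to ista. The following is a minimal sketch, not part of AlzKB: the helper names open_text and read_tsv_rows are hypothetical, and the one-row sample is illustrative data laid out with the id, name, and kind columns that the build script refers to later.

```python
import csv
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def read_tsv_rows(handle):
    """Yield one dict per row from a tab-separated file with a header line."""
    yield from csv.DictReader(handle, delimiter="\t")


# Illustrative sample in the same layout as a nodes file.
sample = io.StringIO("id\tname\tkind\nSymptom::D000006\tAbdomen, Acute\tSymptom\n")
rows = list(read_tsv_rows(sample))
```

For a real file, pass the handle returned by open_text(...) to read_tsv_rows instead of the in-memory sample.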

Hetionet is, itself, a knowledge base, and contains many of the core biological entities used in AlzKB. Accordingly, it contains data derived from many other third-party sources.

NCBI Gene

Download the Homo_sapiens.gene_info.gz file from the NCBI FTP page and extract it (e.g., using gunzip).

Create a CUSTOM subdirectory inside the ncbigene directory. Inside of that subdirectory, place the following two files:

Then, run alzkb_parse_ncbigene.py (no external Python packages should be needed). You’ll notice that it creates two output files that are used while populating the ontology.

Drugbank

To download the Academic DrugBank datasets, you first need to create a free DrugBank account and verify your email address. After verification, DrugBank may ask for more information about your account, such as how you plan to use DrugBank, a description of your organization, who is sponsoring the research, and what the end goal of the research is. In our experience, account approval can take anywhere from several business days to weeks.

After your access has been approved, navigate to the Academic Download page on the Drugbank website (linked above) by selecting the “Download” tab and “Academic Download”. Select the “External Links” tab. In the table titled “External Drug Links”, click the “Download” button on the row labeled “All”. This will download a zip file. Extract the contents of that zip file, and make sure it is named drug_links.csv (some versions use a space instead of an underscore in the filename).
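If your extracted file uses a space in its name, a small helper can normalize it. This is a minimal sketch under the assumption that the spaced variant is called exactly "drug links.csv"; the function name is hypothetical.

```python
from pathlib import Path


def normalize_drug_links_name(directory):
    """Rename 'drug links.csv' to 'drug_links.csv' if only the spaced name exists."""
    directory = Path(directory)
    spaced = directory / "drug links.csv"
    target = directory / "drug_links.csv"
    if spaced.exists() and not target.exists():
        spaced.rename(target)
    return target
```

Call it with the path of your drugbank folder; it returns the path where drug_links.csv is expected.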

DisGeNET

Although DisGeNET is available under a Creative Commons license, the database requires users to create a free account to download the tab-delimited data files. Therefore, you should create a user account and log in. Then, navigate to the Downloads page on the DisGeNET website. Now, download the two necessary files by clicking on the corresponding links:

  • “UMLS CUI to several disease vocabularies” (under the “UMLS CUI to several disease vocabularies” section heading - the resulting file name will be disease_mappings.tsv.gz)
  • “UMLS CUI to top disease classes” (the resulting file will be named disease_mappings_to_attributes.tar.gz)

Next, download curated_gene_disease_associations.tsv.gz directly by copying the following URL into your web browser: https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz

All three files are gzipped, so extract them into the disgenet/ directory using your favorite method (e.g., gunzip from the command line, 7zip from within Windows, etc.).
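If you would rather not depend on gunzip or 7zip, the extraction can also be done from Python. The sketch below uses only the standard library; the function name gunzip_to is hypothetical. Note that a .tar.gz archive would come out as a .tar file, which still needs unpacking (for example with the tarfile module).

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to(src, dest_dir):
    """Decompress a .gz file into dest_dir and return the path of the result."""
    src = Path(src)
    dest = Path(dest_dir) / src.with_suffix("").name  # "x.tsv.gz" -> "x.tsv"
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dest
```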

Now that you have the three necessary data files, you should run the AlzKB script we wrote to filter for rows in those files corresponding to Alzheimer’s Disease, named alzkb_parse_disgenet.py. This script is in the scripts/ directory of the AlzKB repository, so either find it on your local filesystem if you already have a copy of the repository, or find it on the AlzKB GitHub repository in your web browser.

You can then run the Python script from within the disgenet/ directory, which should deposit two filtered data files in the disgenet/CUSTOM/ subdirectory. These will be automatically detected and used when you run the ontology population script, along with the unmodified curated_gene_disease_associations.tsv file.

Finally, create a directory to hold all of the raw data files. It can be D:\data or any other location you prefer. Within it, create one folder for each third-party database, and put that database’s individual csv/tsv/txt files in its folder.
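The layout described above can be created in one step. This is a minimal sketch: the function name is hypothetical, and the folder names are the directory names listed for the flat file sources in this guide.

```python
from pathlib import Path

# Directory names of the flat file sources listed in this guide.
SOURCES = ["hetionet", "ncbigene", "drugbank", "disgenet"]


def make_source_dirs(base_dir, sources=SOURCES):
    """Create one folder per data source under base_dir and return the folders."""
    base = Path(base_dir)
    created = []
    for name in sources:
        folder = base / name
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(folder)
    return created
```

For example, make_source_dirs("D:\\data") would create D:\data\hetionet, D:\data\ncbigene, and so on.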

SQL data sources

If you don’t already have MySQL installed, install it. We recommend using either a package manager (if one is available on your OS), or installing MySQL Community Server from the mysql.com website (e.g., by visiting https://dev.mysql.com/downloads/mysql/). Make sure it’s running and you have the ability to create and modify new databases.

AOP-DB

The Adverse Outcome Pathway Database (AOP-DB) is the only MySQL database you need to install to build the current version of AlzKB. It can be downloaded at: https://gaftp.epa.gov/EPADataCommons/ORD/AOP-DB/

WARNING: This is a big download (7.2G while compressed)! Make sure you have enough disk space before proceeding.

You’ll have to extract two archives - first, unzip the AOP-DB_v2.zip archive, which should contain two *.tar.gz archives and another .zip archive. Now, extract the *.tar.gz archive containing nogi in its name (the smaller of the two). Windows doesn’t natively support extracting .tar.gz archives, so you’ll either have to download another program that does this (e.g., 7-zip) or extract it in a Unix-based environment (Linux, MacOS, Windows Subsystem for Linux, Cygwin, etc.) that has the tar program available on the command line. Once you’ve extracted it, you should have a file named something like aopdb_no-orthoscores.sql.

Now, create an empty database in MySQL, and name it aopdb. Make sure you have full admin privileges on the database. Then, load the (newly extracted) .sql file into the empty database. I always find this easiest from the command line, by running a command such as:

$ mysql -u username -p database_name < aopdb_no-orthoscores.sql

Substitute your username after the -u option, replace database_name with aopdb, and enter your password when prompted. If you prefer to import it from a GUI, you can use a tool like MySQL Workbench or DataGrip.

WARNING: It can take a while to import, so be ready to take a break or do something else while you wait.
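If you want to script the import, the same command can be launched from Python. This is a rough sketch rather than part of the AlzKB tooling: the function names are hypothetical, and it assumes the mysql client is on your PATH. The dump file is fed to the client on standard input, mirroring the shell redirection above.

```python
import subprocess


def mysql_import_command(username, database):
    """Build the argument list equivalent to: mysql -u username -p database"""
    return ["mysql", "-u", username, "-p", database]


def import_dump(username, database, sql_path):
    """Feed the .sql dump to the mysql client on standard input."""
    with open(sql_path, "rb") as dump:
        subprocess.run(mysql_import_command(username, database), stdin=dump, check=True)
```

For example, import_dump("your_username", "aopdb", "aopdb_no-orthoscores.sql") corresponds to the command shown earlier.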

2.5: Populating the ontology

Now that we have an ontology (currently ‘unpopulated’, consisting of a class hierarchy, object property types, data property types, and possibly annotations), we can populate it with records from the third-party databases we collected in the previous step. Fortunately, this is a largely automated process, facilitated by a tool we call ista (ista is the Sindarin word for knowledge). With ista, you write a Python script that first tells ista where to find the third-party data sources, and then maps each of those data sources to one or two node or edge types defined in the ontology (as classes or object properties, respectively). Here, we’ll walk through the different parts of AlzKB’s ista build script and discuss what each component does. If you are reading this guide to modify or extend AlzKB, you should be able to use the information in the following few sections to write your own build script.

For reference, an up-to-date, complete copy of this build file can be found in the AlzKB source repository at the location alzkb/populate_ontology.py.

Installing ista

  • Keep MySQL Server running
  • Install mysqlclient via Anaconda-Navigator
  • Clone the ista repository onto your computer (git clone https://github.com/RomanoLab/ista)
  • cd ista
  • pip install .

Build file top-matter

At the top of the file, we do some imports of necessary Python packages. First comes ista. We don’t import the whole package, just the classes and function that we actually interact with.

from ista import FlatFileDatabaseParser, MySQLDatabaseParser
from ista.util import print_onto_stats

In order to interact with OWL 2 ontology files, we bring in the owlready2 library.

import owlready2

We put private data for our local MySQL databases (hostname, username, and password) in a file named secrets.py, and then make sure the file is added to our .gitignore file so it isn’t checked into version control. You’ll have to create that file yourself, and define the variables MYSQL_HOSTNAME, MYSQL_USERNAME, and MYSQL_PASSWORD. Then, in the build script, you’ll import the file containing those variables and wrap them into a configuration dict.

import secrets

mysql_config = {
    'host': secrets.MYSQL_HOSTNAME,
    'user': secrets.MYSQL_USERNAME,
    'passwd': secrets.MYSQL_PASSWORD
}
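For reference, a secrets.py file that satisfies the import above could look like the following. All three values are placeholders, not real credentials or defaults.

```python
# secrets.py: keep this file out of version control (list it in .gitignore).
# All three values below are placeholders; substitute your own.
MYSQL_HOSTNAME = "localhost"
MYSQL_USERNAME = "alzkb_user"
MYSQL_PASSWORD = "change-me"
```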

Telling ista where to find your data sources

Since we are populating an ontology, we need to load the ontology into owlready2. Make sure to modify this path to fit the location of the AlzKB ontology file on your system! Future versions of AlzKB will source the path dynamically. Also note the file:// prefix, which tells owlready2 to look on the local file system rather than load a web URL. Since this guide was made on a Windows desktop, you’ll notice that we have to use escaped backslashes to specify file paths that the Python interpreter will parse correctly.

onto = owlready2.get_ontology("file://D:\\projects\\ista\\tests\\projects\\alzkb\\alzkb.rdf").load()

We also set the ‘base’ directory for all of the flat files that ista will be loading. You will have determined this location already (see Obtaining the third-party data sources).

data_dir = "D:\\data\\"
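If the doubled backslashes are hard to read, Python raw strings are an equivalent way to write the same Windows paths. The short sketch below only demonstrates the string equivalence; the variable names are illustrative.

```python
# Escaped backslashes, as used in this guide.
data_dir_escaped = "D:\\data\\"

# Raw-string spelling of the same value. A raw string literal cannot end
# with a single backslash, so the trailing separator is appended separately.
data_dir_raw = r"D:\data" + "\\"

# Ontology location with the file:// prefix, written with a raw string.
onto_location = "file://" + r"D:\projects\ista\tests\projects\alzkb\alzkb.rdf"
```

Both spellings produce identical strings, so either can be passed to owlready2 or ista.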

Now, we can actually register the source databases with ista’s parser classes. We use FlatFileDatabaseParser for data sources stored as one or more delimited flat files, and MySQLDatabaseParser for data sources in a MySQL database. For flat file-based sources, the first argument given to the parser’s constructor MUST be the subdirectory (within data_dir) where that source’s data files are contained, and for MySQL sources it MUST be the name of the MySQL database. If not, ista won’t know where to find the files. The second argument is always the ontology object loaded using owlready2, and the third is either the base data directory or the MySQL config dictionary, both of which were defined above.

epa = FlatFileDatabaseParser("epa", onto, data_dir)
ncbigene = FlatFileDatabaseParser("ncbigene", onto, data_dir)
drugbank = FlatFileDatabaseParser("drugbank", onto, data_dir)
hetionet = FlatFileDatabaseParser("hetionet", onto, data_dir)
aopdb = MySQLDatabaseParser("aopdb", onto, mysql_config)
aopwiki = FlatFileDatabaseParser("aopwiki", onto, data_dir)
tox21 = FlatFileDatabaseParser("tox21", onto, data_dir)
disgenet = FlatFileDatabaseParser("disgenet", onto, data_dir)

In the following two sections, we’ll go over a few examples of how to define mappings using these parser objects. We won’t replicate every mapping in this guide for brevity, but you can see all of them in the full AlzKB build script.

Configuration for ‘flat file’ (e.g., CSV) data sources

hetionet.parse_node_type(
    node_type="Symptom",
    source_filename="hetionet-v1.0-nodes.tsv",
    fmt="tsv",
    parse_config={
        "iri_column_name": "name",
        "headers": True,
        "filter_column": "kind",
        "filter_value": "Symptom",
        "data_transforms": {
            "id": lambda x: x.split("::")[-1]
        },
        "data_property_map": {
            "id": onto.xrefMeSH,
            "name": onto.commonName
        }
    },
    merge=False,
    skip=False
)

This block indicates that the third-party database is hetionet and the source file is hetionet-v1.0-nodes.tsv, so the file ista will look for is D:\data\hetionet\hetionet-v1.0-nodes.tsv.
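To make the filter_column, filter_value, and data_transforms options in the Symptom block more concrete, here is a plain-Python sketch of the effect their names suggest. It is not ista’s implementation; the function name is hypothetical and the two sample rows are illustrative.

```python
def select_and_transform(rows, filter_column, filter_value, data_transforms):
    """Keep rows whose filter_column equals filter_value, then apply per-column transforms."""
    selected = []
    for row in rows:
        if row[filter_column] != filter_value:
            continue
        new_row = dict(row)  # leave the input row untouched
        for column, transform in data_transforms.items():
            new_row[column] = transform(new_row[column])
        selected.append(new_row)
    return selected


# Illustrative rows in the id / name / kind layout of the nodes file.
rows = [
    {"id": "Symptom::D000006", "name": "Abdomen, Acute", "kind": "Symptom"},
    {"id": "Gene::1", "name": "A1BG", "kind": "Gene"},
]
symptoms = select_and_transform(
    rows, "kind", "Symptom", {"id": lambda x: x.split("::")[-1]}
)
```

Only the Symptom row survives the filter, and its id loses the "Symptom::" prefix, which is the value that would then be mapped to xrefMeSH.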

Some of the configuration blocks will have a CUSTOM\ prefix to the filename. This means that the file was created by us manually and will need to be stored in a CUSTOM subdirectory of the database folder. For example:

disgenet.parse_node_type(
    node_type="Disease",
    source_filename="CUSTOM/disease_mappings_to_attributes_alzheimer.tsv",  # Filtered for just Alzheimer disease
    fmt="tsv-pandas",
    parse_config={
        "iri_column_name": "diseaseId",
        "headers": True,
        "data_property_map": {
            "diseaseId": onto.xrefUmlsCUI,
            "name": onto.commonName,
        }
    },
    merge=False,
    skip=False
)

This file will be D:\data\disgenet\CUSTOM\disease_mappings_to_attributes_alzheimer.tsv
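The way a full path is assembled from the base data directory, the source’s subdirectory, and source_filename can be sketched with pathlib. This only illustrates the path arithmetic described in this guide and is not ista’s own code; the function name is hypothetical.

```python
from pathlib import PureWindowsPath


def resolve_source_path(data_dir, source_dir, source_filename):
    """Join base directory, source subdirectory, and filename into a Windows path."""
    return str(PureWindowsPath(data_dir) / source_dir / source_filename)


hetionet_nodes = resolve_source_path(
    "D:\\data\\", "hetionet", "hetionet-v1.0-nodes.tsv"
)
disgenet_custom = resolve_source_path(
    "D:\\data\\", "disgenet", "CUSTOM/disease_mappings_to_attributes_alzheimer.tsv"
)
```

PureWindowsPath treats the forward slash in "CUSTOM/..." as a separator, so the result uses backslashes throughout.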

Configuration for SQL server data sources

aopdb.parse_node_type(
    node_type="Drug",
    source_table="chemical_info",
    parse_config={
        "iri_column_name": "DTX_id",
        "data_property_map": {"ChemicalID": onto.xrefMeSH},
        "merge_column": {
            "source_column_name": "DTX_id",
            "data_property": onto.xrefDTXSID
        }
    },
    merge=True,
    skip=False
)

This block indicates the third-party database is AOP-DB, and the source table is chemical_info.
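This guide does not spell out what merge=True does. One plausible reading, and it is only an assumption here, is that rows are matched to existing nodes through merge_column and the mapped properties are attached to the match instead of creating a new node. The sketch below illustrates that reading with plain dictionaries; the function name, the identifiers, and the choice to skip unmatched rows are all hypothetical.

```python
def merge_rows(existing, rows, source_column, match_property, property_map):
    """Attach mapped columns to existing records whose match_property equals the row's source_column."""
    index = {rec[match_property]: rec for rec in existing if match_property in rec}
    merged = 0
    for row in rows:
        rec = index.get(row[source_column])
        if rec is None:
            continue  # assumption: unmatched rows are skipped in this sketch
        for column, prop in property_map.items():
            rec[prop] = row[column]
        merged += 1
    return merged


# Made-up identifiers, for illustration only.
existing = [{"xrefDTXSID": "DTXSID0000001", "commonName": "example drug"}]
rows = [
    {"DTX_id": "DTXSID0000001", "ChemicalID": "D000001"},
    {"DTX_id": "DTXSID0000002", "ChemicalID": "D000002"},
]
count = merge_rows(existing, rows, "DTX_id", "xrefDTXSID", {"ChemicalID": "xrefMeSH"})
```

Check the ista source for the actual behavior before relying on this interpretation.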

Mapping data sources to ontology components

Every flat file or SQL table from a third-party data source can be mapped to a single node or relationship type. For example, a file describing diseases can be mapped to the Disease node type, where each line in the file corresponds to a disease to be inserted (or ‘merged’; see below) into the knowledge graph. If the source is being mapped to a node type (rather than a relationship type), ista can additionally populate one or more node properties from the feature columns in the source file.

Each mapping is defined using a method call in the ista Python script.

Running ista

You have now set the locations of the data sources and the ontology, and defined the mappings. Run populate_ontology.py.

The output of this step is alzkb-populated.rdf, which will be used to set up the Neo4j graph database.

3.: Converting the ontology into a Neo4j graph database

Installing Neo4j

If you haven’t done so already, download Neo4j from the Neo4j Download Center. Most users should select Neo4j Desktop, but advanced users can instead opt for Community Server (the instructions for which are well outside of the scope of this guide).

Configuring an empty graph database for AlzKB

You should now create a new graph database that will be populated with the contents of AlzKB. In Neo4j Desktop, this can be done as follows:

  • Create a new project by clicking the “New” button in the upper left, then selecting “Create project”.
  • In the project panel (on the right of the screen), you will see the default name “Project” populates automatically. Hover over this name and click the edit icon, then change the name to AlzKB.
  • To the right of the project name, click “Add”, and select “Local DBMS”. Change the Name to AlzKB DBMS, specify a password that you will remember, and use the Version dropdown to select “4.4.0” (if it is not already selected). Click “Create”. Wait for the operation to finish.
  • Install plugins:
    • Click the name of the DBMS (“AlzKB DBMS”, if you have followed the guide), and in the new panel to the right click the “Plugins” tab.
    • Expand the “APOC” option, click “Install”, and wait for the operation to complete.
    • Do the same for the “Graph Data Science Library” and “Neosemantics (n10s)” plugins.
  • Before starting the DBMS, click the ellipsis immediately to the right of the “Open” button, and then click “Settings…”. Make the following changes to the configuration file:
    • Set dbms.memory.heap.initial_size to 2048m.
    • Set dbms.memory.heap.max_size to 4G.
    • Set dbms.memory.pagecache.size to 2048m.
    • Uncomment the line containing dbms.security.procedures.allowlist=apoc.coll.*,apoc.load.*,gds.* to activate it.
    • Append n10s.*,apoc.cypher.*,apoc.help to that allowlist, so the line reads dbms.security.procedures.allowlist=apoc.coll.*,apoc.load.*,gds.*,n10s.*,apoc.cypher.*,apoc.help
    • Click the “Apply” button, then “Close”.
  • Click “Start” to start the graph database.
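The configuration changes listed above can also be applied to the text of the settings file programmatically. The following is a rough sketch, assuming simple key=value lines where a leading # marks a commented-out setting; the function and constant names are hypothetical, and the values are the ones given in this guide.

```python
ALZKB_SETTINGS = {
    "dbms.memory.heap.initial_size": "2048m",
    "dbms.memory.heap.max_size": "4G",
    "dbms.memory.pagecache.size": "2048m",
    "dbms.security.procedures.allowlist":
        "apoc.coll.*,apoc.load.*,gds.*,n10s.*,apoc.cypher.*,apoc.help",
}


def apply_settings(conf_text, settings=ALZKB_SETTINGS):
    """Return conf_text with each key set, replacing commented or existing lines."""
    seen = set()
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip("#").strip()
        key, sep, _ = stripped.partition("=")
        key = key.strip()
        if sep and key in settings:
            if key not in seen:
                out.append(f"{key}={settings[key]}")
                seen.add(key)
            # later duplicates of an already-written key are dropped
        else:
            out.append(line)
    # Append any settings that were not present in the file at all.
    out.extend(f"{k}={v}" for k, v in settings.items() if k not in seen)
    return "\n".join(out) + "\n"
```

Editing the file through the Neo4j Desktop “Settings…” dialog, as described above, achieves the same result.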

Importing the ista RDF output into Neo4j

  • Open Neo4j Browser and run the following Cypher statements to import the RDF data:
// Cleaning nodes
MATCH (n) DETACH DELETE n;
// Constraint creation
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
// Creating a graph configuration
CALL n10s.graphconfig.init();
CALL n10s.graphconfig.set({applyNeo4jNaming: true, handleVocabUris: 'IGNORE'});
// Importing RDF
CALL n10s.rdf.import.fetch("file://D:\\data\\alzkb-populated.rdf", "RDF/XML");
  • Run the Cypher statements below to clean up the nodes:
MATCH (n:Resource) REMOVE n:Resource;
MATCH (n:NamedIndividual) REMOVE n:NamedIndividual;
MATCH (n:AllDisjointClasses) REMOVE n:AllDisjointClasses;
MATCH (n:AllDisjointProperties) REMOVE n:AllDisjointProperties;
MATCH (n:DatatypeProperty) REMOVE n:DatatypeProperty;
MATCH (n:FunctionalProperty) REMOVE n:FunctionalProperty;
MATCH (n:ObjectProperty) REMOVE n:ObjectProperty;
MATCH (n:AnnotationProperty) REMOVE n:AnnotationProperty;
MATCH (n:SymmetricProperty) REMOVE n:SymmetricProperty;
MATCH (n:_GraphConfig) REMOVE n:_GraphConfig;
MATCH (n:Ontology) REMOVE n:Ontology;
MATCH (n:Restriction) REMOVE n:Restriction;
MATCH (n:Class) REMOVE n:Class;
MATCH (n) WHERE size(labels(n)) = 0 DETACH DELETE n; // Removes nodes without labels
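Since the label clean-up statements all follow one pattern, they can be generated from a list rather than typed by hand. This is a small sketch with hypothetical names; the labels are the ones removed in the statements above.

```python
# Labels introduced by the RDF import that are stripped in the clean-up step.
OWL_LABELS = [
    "Resource", "NamedIndividual", "AllDisjointClasses", "AllDisjointProperties",
    "DatatypeProperty", "FunctionalProperty", "ObjectProperty", "AnnotationProperty",
    "SymmetricProperty", "_GraphConfig", "Ontology", "Restriction", "Class",
]


def remove_label_statements(labels=OWL_LABELS):
    """Return one 'MATCH ... REMOVE ...' Cypher statement per label."""
    return [f"MATCH (n:{label}) REMOVE n:{label};" for label in labels]


statements = remove_label_statements()
```

Printing "\n".join(statements) gives text you can paste into Neo4j Browser.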

You have now built AlzKB from scratch. You can find the number of nodes of each label and the number of relationships of each type with the following two queries:

CALL db.labels() YIELD label
CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as count',{}) YIELD value
RETURN label, value.count ORDER BY label;

CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as count',{}) YIELD value
RETURN type, value.count ORDER BY type;

4.: Adding new data resources, nodes, relationships, and properties.

In version 2.0, we added “TranscriptionFactor” nodes and “TRANSCRIPTIONFACTORINTERACTSWITHGENE” relationships; the node properties “chromosome” (chromosome number) and “sourcedatabase”; and the relationship properties “correlation”, “score”, “p_fisher”, “z_score”, “affinity_nm”, “confidence”, “sourcedatabase”, and “unbiased”.

To achieve this, we added the above entities to the ontology RDF, which is now named alzkb_v2.rdf and located in the alzkb\data directory. Next, collect the additional source data files detailed in the table below.

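| Source   | Directory name | Entity type(s)                               | URL                   | Extra instructions |
|----------+----------------+----------------------------------------------+-----------------------+--------------------|
| TRRUST   | dorothea       | Transcription factors (TF) and TF-gene edges | TRRUST Download       | TRRUST             |
| DoRothEA | dorothea       | Transcription factors (TF) and TF-gene edges | DoRothEA Installation | DoRothEA RScript   |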

Prepare Source Data

Download trrust_rawdata.human.tsv from TRRUST Download. Install DoRothEA within R by following the DoRothEA Installation instructions. Place trrust_rawdata.human.tsv and alzkb_parse_dorothea.py inside the dorothea/ subdirectory, which should be within your raw data directory (e.g., D:\data). Run alzkb_parse_dorothea.py. You’ll notice that it creates a tf.tsv file that is used while populating the ontology.

Replicate Hetionet Resources

Since Hetionet has no plan for regular updates, we have replicated its resources using the Rephetio paper and source code to ensure AlzKB has current data. Follow the steps in the AlzKB-updates GitHub repository to create hetionet-custom-nodes.tsv and hetionet-custom-edges.tsv. Place these files in the hetionet/ subdirectory.

Process Data Files

Place the updated alzkb_parse_ncbigene.py, alzkb_parse_drugbank.py, and alzkb_parse_disgenet.py from the scripts/ directory in their respective raw data subdirectories. Run each script to process the data for the next step.

Populate Ontology

Now that we have the updated ontology and updated data files, run the updated alzkb/populate_ontology.py to populate the records. It creates an alzkb_v2-populated.rdf file that will be used in the next step.

5.: Converting the ontology into a Memgraph graph database

Installing Memgraph

If you haven’t done so already, download Memgraph from the Install Memgraph page. Most users install Memgraph using a pre-prepared docker-compose.yml file by executing:

  • for Linux and macOS: curl https://install.memgraph.com | sh
  • for Windows: iwr https://windows.memgraph.com | iex

More details are in Install Memgraph with Docker

Generating the CSV File

Before uploading the file to Memgraph, run alzkb/rdf_to_memgraph_csv.py with the alzkb_v2-populated.rdf file to generate alzkb-populated.csv. If you want to add edge properties to the knowledge graph, then run populate_edge_weights.py to create the alzkb_with_edge_properties.csv file.

Starting Memgraph with Docker

Follow the instructions in importing-data-into-memgraph Step 1. Starting Memgraph with Docker to upload the alzkb-populated.csv or alzkb_with_edge_properties.csv file to the container.

Open Memgraph Lab, which is available at http://localhost:3000. Click “Query Execution” in the menu on the left bar. You can then type a Cypher query in the Cypher Editor.

Gaining speed with indexes and analytical storage mode

  • To create indexes, run the following Cypher queries:
CREATE INDEX ON :Drug(nodeID);
CREATE INDEX ON :Gene(nodeID);
CREATE INDEX ON :BiologicalProcess(nodeID);
CREATE INDEX ON :Pathway(nodeID);
CREATE INDEX ON :MolecularFunction(nodeID);
CREATE INDEX ON :CellularComponent(nodeID);
CREATE INDEX ON :Symptom(nodeID);
CREATE INDEX ON :BodyPart(nodeID);
CREATE INDEX ON :DrugClass(nodeID);
CREATE INDEX ON :Disease(nodeID);
CREATE INDEX ON :TranscriptionFactor(nodeID);
  • To check the current storage mode, run:
SHOW STORAGE INFO;
  • Change the storage mode to analytical before import:
STORAGE MODE IN_MEMORY_ANALYTICAL;
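The index statements above differ only in the node label, so they can be generated from a list. A minimal sketch with hypothetical names follows; the labels and the nodeID property are taken from the statements above.

```python
NODE_LABELS = [
    "Drug", "Gene", "BiologicalProcess", "Pathway", "MolecularFunction",
    "CellularComponent", "Symptom", "BodyPart", "DrugClass", "Disease",
    "TranscriptionFactor",
]


def index_statements(labels=NODE_LABELS, prop="nodeID"):
    """Return one CREATE INDEX statement per node label."""
    return [f"CREATE INDEX ON :{label}({prop});" for label in labels]


index_queries = index_statements()
```

This is convenient if you later add a node type: extend the list and regenerate the statements.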

Importing data into Memgraph

  • Drug nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Drug' AND row.commonName <> ''
CREATE (d:Drug {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase,
                xrefCasRN: row.xrefCasRN, xrefDrugbank: row.xrefDrugbank});

MATCH (d:Drug)
RETURN count(d);
  • Gene nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Gene'
CREATE (g:Gene {nodeID: row._id, commonName: row.commonName, geneSymbol: row.geneSymbol, sourceDatabase: row.sourceDatabase,
                typeOfGene: row.typeOfGene, chromosome: row.chromosome, xrefEnsembl: row.xrefEnsembl, 
                xrefHGNC: row.xrefHGNC, xrefNcbiGene: toInteger(row.xrefNcbiGene), xrefOMIM: row.xrefOMIM});

MATCH (g:Gene)
RETURN count(g);
  • BiologicalProcess nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':BiologicalProcess'
CREATE (b:BiologicalProcess {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase,
                             xrefGeneOntology: row.xrefGeneOntology});

MATCH (b:BiologicalProcess)
RETURN count(b)
  • Pathway nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Pathway'
CREATE (p:Pathway {nodeID: row._id, pathwayId: row.pathwayId, pathwayName: row.pathwayName, sourceDatabase: row.sourceDatabase});

MATCH (p:Pathway)
RETURN count(p)
  • MolecularFunction nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':MolecularFunction'
CREATE (m:MolecularFunction {nodeID: row._id, commonName: row.commonName, xrefGeneOntology: row.xrefGeneOntology});

MATCH (m:MolecularFunction)
RETURN count(m)
  • CellularComponent nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':CellularComponent'
CREATE (c:CellularComponent {nodeID: row._id, commonName: row.commonName, xrefGeneOntology: row.xrefGeneOntology});

MATCH (c:CellularComponent)
RETURN count(c)
  • Symptom nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Symptom'
CREATE (s:Symptom {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase, xrefMeSH: row.xrefMeSH});

MATCH (s:Symptom)
RETURN count(s)
  • BodyPart nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':BodyPart'
CREATE (b:BodyPart {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase, xrefUberon: row.xrefUberon});

MATCH (b:BodyPart)
RETURN count(b)
  • DrugClass nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':DrugClass'
CREATE (d:DrugClass {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase, xrefNciThesaurus: row.xrefNciThesaurus});

MATCH (d:DrugClass)
RETURN count(d)
  • Disease nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':Disease'
CREATE (d:Disease {nodeID: row._id, commonName: row.commonName, sourceDatabase: row.sourceDatabase, 
                   xrefDiseaseOntology: row.xrefDiseaseOntology, xrefUmlsCUI: row.xrefUmlsCUI});
                   
MATCH (d:Disease)
RETURN count(d)
  • Transcription Factor nodes
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._labels = ':TranscriptionFactor'
CREATE (t:TranscriptionFactor {nodeID: row._id, sourceDatabase: row.sourceDatabase, TF: row.TF});
MATCH (t:TranscriptionFactor)
RETURN count(t)
  • GENEPARTICIPATESINBIOLOGICALPROCESS relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEPARTICIPATESINBIOLOGICALPROCESS'
MATCH (g:Gene {nodeID: row._start}) MATCH (b:BiologicalProcess {nodeID: row._end}) 
MERGE (g)-[rel:GENEPARTICIPATESINBIOLOGICALPROCESS]->(b) 
RETURN count(rel)
  • GENEREGULATESGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEREGULATESGENE'
MATCH (g:Gene {nodeID: row._start}) MATCH (g2:Gene {nodeID: row._end}) 
MERGE (g)-[rel:GENEREGULATESGENE]->(g2) 
RETURN count(rel)
  • GENEINPATHWAY relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEINPATHWAY'
MATCH (g:Gene {nodeID: row._start}) MATCH (p:Pathway {nodeID: row._end}) 
MERGE (g)-[rel:GENEINPATHWAY]->(p) 
RETURN count(rel)
  • GENEINTERACTSWITHGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEINTERACTSWITHGENE'
MATCH (g:Gene {nodeID: row._start}) MATCH (g2:Gene {nodeID: row._end}) 
MERGE (g)-[rel:GENEINTERACTSWITHGENE]->(g2) 
RETURN count(rel)
  • BODYPARTUNDEREXPRESSESGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'BODYPARTUNDEREXPRESSESGENE'
MATCH (b:BodyPart {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end}) 
MERGE (b)-[rel:BODYPARTUNDEREXPRESSESGENE]->(g) 
RETURN count(rel)
  • BODYPARTOVEREXPRESSESGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'BODYPARTOVEREXPRESSESGENE'
MATCH (b:BodyPart {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end}) 
MERGE (b)-[rel:BODYPARTOVEREXPRESSESGENE]->(g) 
RETURN count(rel)
  • GENEHASMOLECULARFUNCTION relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEHASMOLECULARFUNCTION'
MATCH (g:Gene {nodeID: row._start}) MATCH (m:MolecularFunction {nodeID: row._end}) 
MERGE (g)-[rel:GENEHASMOLECULARFUNCTION]->(m) 
RETURN count(rel)
  • GENEASSOCIATEDWITHCELLULARCOMPONENT relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEASSOCIATEDWITHCELLULARCOMPONENT'
MATCH (g:Gene {nodeID: row._start}) MATCH (c:CellularComponent {nodeID: row._end}) 
MERGE (g)-[rel:GENEASSOCIATEDWITHCELLULARCOMPONENT]->(c) 
RETURN count(rel)
  • GENECOVARIESWITHGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENECOVARIESWITHGENE'
MATCH (g:Gene {nodeID: row._start}) MATCH (g2:Gene {nodeID: row._end}) 
MERGE (g)-[rel:GENECOVARIESWITHGENE {sourceDB: row.sourceDB, unbiased: row.unbiased, correlation: row.correlation}]->(g2) 
RETURN count(rel)
  • CHEMICALDECREASESEXPRESSION relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'CHEMICALDECREASESEXPRESSION'
MATCH (d:Drug {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end}) 
MERGE (d)-[rel:CHEMICALDECREASESEXPRESSION {sourceDB: row.sourceDB, unbiased: row.unbiased, z_score: row.z_score}]->(g) 
RETURN count(rel)
  • CHEMICALINCREASESEXPRESSION relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'CHEMICALINCREASESEXPRESSION'
MATCH (d:Drug {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end}) 
MERGE (d)-[rel:CHEMICALINCREASESEXPRESSION {sourceDB: row.sourceDB, unbiased: row.unbiased, z_score: row.z_score}]->(g) 
RETURN count(rel)
  • CHEMICALBINDSGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'CHEMICALBINDSGENE'
MATCH (d:Drug {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end}) 
MERGE (d)-[rel:CHEMICALBINDSGENE {sourceDB: row.sourceDB, unbiased: row.unbiased, affinity_nM: row.affinity_nM}]->(g) 
RETURN count(rel)
  • DRUGINCLASS relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DRUGINCLASS'
MATCH (d:Drug {nodeID: row._start}) MATCH (d2:DrugClass {nodeID: row._end}) 
MERGE (d)-[rel:DRUGINCLASS]->(d2) 
RETURN count(rel)
  • GENEASSOCIATESWITHDISEASE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'GENEASSOCIATESWITHDISEASE'
MATCH (g:Gene {nodeID: row._start}) MATCH (d:Disease {nodeID: row._end}) 
MERGE (g)-[rel:GENEASSOCIATESWITHDISEASE {sourceDB: row.sourceDB, score: row.score}]->(d) 
RETURN count(rel)
  • SYMPTOMMANIFESTATIONOFDISEASE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'SYMPTOMMANIFESTATIONOFDISEASE'
MATCH (s:Symptom {nodeID: row._start}) MATCH (d:Disease {nodeID: row._end}) 
MERGE (s)-[rel:SYMPTOMMANIFESTATIONOFDISEASE {sourceDB: row.sourceDB, unbiased: row.unbiased, p_fisher: row.p_fisher}]->(d) 
RETURN count(rel)
  • DISEASELOCALIZESTOANATOMY relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DISEASELOCALIZESTOANATOMY'
MATCH (d:Disease {nodeID: row._start}) MATCH (b:BodyPart {nodeID: row._end}) 
MERGE (d)-[rel:DISEASELOCALIZESTOANATOMY {sourceDB: row.sourceDB, unbiased: row.unbiased, p_fisher: row.p_fisher}]->(b) 
RETURN count(rel)
  • DRUGTREATSDISEASE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DRUGTREATSDISEASE'
MATCH (d:Drug {nodeID: row._start}) MATCH (d2:Disease {nodeID: row._end}) 
MERGE (d)-[rel:DRUGTREATSDISEASE]->(d2) 
RETURN count(rel)
  • DRUGCAUSESEFFECT relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'DRUGCAUSESEFFECT'
MATCH (d:Drug {nodeID: row._start}) MATCH (d2:Disease {nodeID: row._end}) 
MERGE (d)-[rel:DRUGCAUSESEFFECT]->(d2) 
RETURN count(rel)
  • TRANSCRIPTIONFACTORINTERACTSWITHGENE relationships
LOAD CSV FROM "/usr/lib/memgraph/alzkb-populated.csv" WITH HEADER AS row
WITH row WHERE row._type = 'TRANSCRIPTIONFACTORINTERACTSWITHGENE'
MATCH (t:TranscriptionFactor {nodeID: row._start}) MATCH (g:Gene {nodeID: row._end}) 
MERGE (t)-[rel:TRANSCRIPTIONFACTORINTERACTSWITHGENE {sourceDB: row.sourceDB, confidence: row.confidence}]->(g) 
RETURN count(rel)
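The relationship import queries above all share one template, varying only in the relationship type, the two node labels, and the optional edge properties. The sketch below generates such a query as text. The function name is hypothetical, and it uses the generic variable names a and b where the queries above use label-specific ones such as g and d; the default CSV path is the one used throughout this section.

```python
CSV_PATH = "/usr/lib/memgraph/alzkb-populated.csv"


def relationship_query(rel_type, start_label, end_label, props=(), csv_path=CSV_PATH):
    """Build a LOAD CSV query that merges one relationship type between two node labels."""
    prop_map = ""
    if props:
        pairs = ", ".join(f"{p}: row.{p}" for p in props)
        prop_map = f" {{{pairs}}}"  # e.g. " {sourceDB: row.sourceDB}"
    return (
        f'LOAD CSV FROM "{csv_path}" WITH HEADER AS row\n'
        f"WITH row WHERE row._type = '{rel_type}'\n"
        f"MATCH (a:{start_label} {{nodeID: row._start}}) "
        f"MATCH (b:{end_label} {{nodeID: row._end}})\n"
        f"MERGE (a)-[rel:{rel_type}{prop_map}]->(b)\n"
        f"RETURN count(rel)"
    )


with_props = relationship_query(
    "GENEASSOCIATESWITHDISEASE", "Gene", "Disease", ["sourceDB", "score"]
)
without_props = relationship_query("DRUGINCLASS", "Drug", "DrugClass")
```

The generated text can be pasted into Memgraph Lab in the same way as the hand-written queries.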

Switching Back to Transactional Storage Mode

After importing the data, follow these steps to switch back to the transactional storage mode:

  • Switch to Transactional Storage Mode:
STORAGE MODE IN_MEMORY_TRANSACTIONAL;
  • Verify the Storage Mode Switch:
SHOW STORAGE INFO;