Skip to content

Introduction

Paula Gearon edited this page Jun 3, 2021 · 29 revisions

Asami is a graph database, which may be a new approach to data management for some users. This page tries to explain what graph data is and how to use Asami to interact with it.

Contents

Graph Data

When we refer to a "Graph Database" we are not referring to the sort of graph that we often see in magazines, or are generated from spreadsheets. Instead, the "graph" being referenced here is a branch of mathematics called "Graph Theory". These graphs describe a set of items called "Nodes", which are connected with lines that are called "Edges".

An example of this might be a graph of social connections between people. The people in such a network can be represented by the "nodes" of a graph, and the "edges" can represent the social connections between them.

A simple social graph

Each node appears once in a graph, regardless of how many connections it may have, and any pair of nodes will only have a single edge between them. Another important thing is that an image of a graph is merely a representation of the data. The location of the nodes and the edges do not matter. Also, edges do not interact in any way, so even if the lines indicating edges cross over each other, this does not mean anything. For instance, the following 2 graphs represent identical data to the above graph:

The same social graph in 2 different presentations

While many subtleties emerge from this structure, these simple ideas describe the entire foundation of this branch of mathematics!

However, most graph databases (including Asami) add two important features to the above definition: directed edges, and labeling.

Directed Edges

A variation on the above graphs is when the edges are given a direction. We can represent this in an image using arrows instead of simple lines. Perhaps this might be used to represent a company hierarchy.

A Directed Graph

Importantly, having direction on edges means that there can now exist 2 edges between a pair of nodes: one in either direction.

Labels

A labeled graph will have labels on both the nodes and the edges. This provides a lot of new possibilities:

  • Nodes need not represent the same types of things. They can represent anything: people, organizations, inventory items, account entries – anything at all.
  • Multiple edges can connect the same nodes. For instance, a woman who is a mother to someone is also that person's parent. Both of these connections can be represented as distinct labeled edges.

Labeling some of the previous graph, we could get something like this:

An Org Chart as a Labeled, Directed Graph

Labeled, directed graphs form the basic structure of graph databases.

Entities

Different graph databases make different choices about how they store their graph information. A common approach is for each node in the graph to be an object that has multiple attributes on it. We see this sort of thing in the Neo4J Graph Database.

Using this approach, let's add some details to the organization chart we had a moment ago.

An Org Chart with Detailed Entities

This diagram shows how entities can group a lot of information together in the graph. However, Asami and other graph databases actually break this information down into a basic graph structure. The information is identical, but it just gets managed a little differently.

Standard Graph Structure

This is identical information as in the previous diagram, but instead of nodes being associated with complex data structures, every node represents only a single unit of information. Some are strings, while others provide a point to represent a group of information. These other nodes also need labels, so they have each been assigned a letter.

You may have noticed that the edge and node labels all start with a : character. Asami is a Clojure database, and this is the syntax of a Clojure keyword. Other types of labels can be used, but keywords are very common and easy. We will stick to them here for consistency.

Triples

While the graph images above are clear in their structure, they have several limitations. As the amount of information expands, the size of the graph can grow significantly. The details and labels can get lost. For instance, there is no way to provide labels in the following graph without the image becoming enormous.

Complex Graph - from Ryerson University

At the same time, the graph information has to be stored in a computer where it can be accessed in an efficient manner. This is done by storing only the information necessary for the graph.

The representation for basic graph information is a representation of the 3 elements of an edge:

start-nodeedgeend-node

These elements form a single statement, or triple. The 3 parts of a statement are given different names by different databases. For instance, in the Datomic database the elements are called:

entityattributevalue

In Resource Description Framework (RDF) databases, they are called:

subjectpredicateobject

It doesn't matter if you use entity/attribute/value or subject/predicate/object as they mean the same thing.

To see how these statements work, consider the edge:

A Single Edge

This is represented with the statement:

:B    :first-name    "Lori"

A group of statements like this provides the complete description for a graph. Using this we can represent the entire Organization Chart from above with the following statements:

Graph Statements
:A    :first-name    "Sally"
:A    :last-name    "Smith"
:A    :title               "CEO"
:B    :reports-to    :A
:B    :first-name    "Lori"
:B    :last-name    "Luck"
:B    :title               "CTO"
:C    :reports-to    :B
:C    :first-name    "Carly"
:C    :last-name    "Cool"
:C    :title               "Engineering Manager"
:D    :reports-to    :C
:D    :first-name    "Jenny"
:D    :last-name    "James"
:D    :title               "QA Engineer"
:E    :reports-to    :C
:E    :first-name    "Mary"
:E    :last-name    "Mercy"
:E    :title               "Engineer"

If we use this information to build a picture, we can reconstruct the original graph image or something very close. Remember that it doesn't matter how a graph image is laid out, so long as it has all of the correct nodes and edges.

The information here is very dense, but it is ideal for computers. It also follows a simple pattern that we can use when accessing the data.

Patterns

When we want to access data in a graph, we use an approach called a pattern.

A graph pattern looks like a triple, but with one or more of the elements not filled in. These spaces will be given a name called a variable. These variables will be used to access the data that we are looking for. Variables are represented as a name with a ? character at the front.

For instance, the following pattern looks for the first-name of node :E. The variable in the pattern is ?n:

[ :E :first-name ?n ]

The pattern can be applied to the graph to find all the statements that match it, and will fill in the variables. In this case, the pattern matches the statement:

:E :first-name "Mary"

So the variable ?n will be made equal to "Mary".

Multiple Answers

Most of the time a pattern will have multiple statements that it will match. In this case, the match will be a series of values for the variables in the pattern. To see this, let's look for all of the titles in the graph:

[?node :title ?t]

This matches a series of statements:

:A   :title   "CEO"
:B   :title   "CTO"
:C   :title   "Engineering Manager"
:D   :title   "QA Engineer"
:E   :title   "Engineer"

The series leads to a series of values for ?node and ?t:

?node=:A  ?t="CEO"
?node=:B  ?t="CTO"
?node=:C  ?t="Engineering Manager"
?node=:D  ?t="QA Engineer"
?node=:E  ?t="Engineer"

The variables in each of these values are connected, so when ?node is :A then ?t is "CEO", and when ?node is :B then ?t is "CTO". We call each of these variable/value groups a binding. Within a binding, each variable is bound.

So in the first binding the ?node is bound to :A and the ?t is bound to "CEO". In the last binding the ?node is bound to :E and the ?t is bound to "Engineer".

The result of matching a pattern to a graph is a set of bindings. In this case, the set of bindings is 5 long.

In fact, whenever a pattern returns only a single result, like in the previous section, it will also be a set of bindings. But as we saw, that set was only one binding long.

Another Example

Using what we have seen so far, we can try to answer the question, "Is 'James' a first name or a last name?"

  • We know that :first-name and :last-name refer to strings like "James" that will appear at the end of a pattern. So the pattern should end with "James".
  • We know that :first-name and :last-name are edge names, and edge names appear in the middle of a pattern. We don't know which one we're going to get, so we will need a variable to put here. That variable will get the answer.
  • We don't know which nodes will appear at the start of the statements. So we will need a variable at the start. However, we don't care what these values are.

This gives us the following pattern:

[ ?node ?edge "James" ]

When this pattern is applied to the graph, it will match the single statement:

:D :last-name "James"

So the result is this binding:

?node=:D ?edge=:last-name

Clone this wiki locally