Overview on DAS architecture and components #186

Merged Mar 14, 2024

docs/DAS-Overview.md (new file, 268 additions)
# Distributed Atomspace - Overview

Atomspace is the hypergraph OpenCog Hyperon uses to represent and store
knowledge. It is the source of knowledge for AI agents and the container for
any computational results produced during their execution.

The __Distributed Atomspace (DAS)__ is an extension of OpenCog Hyperon's
Atomspace into a more independent component designed to support multiple
simultaneous connections with different AI algorithms, providing a flexible
query interface to distributed knowledge bases. It can be used as a component
(e.g. a Python library) or as a stand-alone server to store arbitrarily large
knowledge bases and provide means for agents to traverse regions of the
hypergraph and perform global queries involving properties, connectivity,
subgraph topology, etc.

DAS can be understood as a persistence layer for knowledge bases used in
OpenCog Hyperon.

<p align="center">
<img src="assets/persistence_layer.png" width="400"/>
</p>


The data manipulation API provides a well-defined set of operations without
exposing database details such as the data modeling and the DBMS (Database
Management System) being used. This is important because it allows us to
evolve the data model inside DAS and even change the DBMS without affecting
the integration with the AI agents.

But being an abstraction for the data model is not the only purpose of DAS.
While making this connection between AI agents and knowledge bases, DAS
provides several other capabilities:

* Higher level indexes stored in the DBMS
* Query engine with pattern matching capabilities
* Traverse engine to keep track of hypergraph traversal
* Cache for query results
* Scalable connection manager to connect the DAS with multiple other DASs

This is why DAS is not just a Data Access Object or a database interface layer
but rather a more complex OpenCog Hyperon component that abstracts not only
data modeling/access itself but also several other algorithms that are closely
related to the way AI agents manipulate information.

## Table of contents

- [DAS Components](#das-components)
- [Higher Level Indexing](#higher-level-indexing)
- [Pattern Matcher](#pattern-matcher)
- [Mapping knowledge bases to nodes and links](#mapping-knowledge-bases-to-nodes-and-links)
- [DAS Server Deployment and Architecture](#das-server-deployment-and-architecture)

## DAS Components

DAS is delivered as a Python library,
[hyperon-das](https://pypi.org/project/hyperon-das/), which can be used in two
different ways (both sketched below):

1. To create a DAS server, which holds a knowledge base and serves it to many
remote clients (much like a DBMS).
2. To instantiate a DAS in a Python program, which can store a smaller local
knowledge base and can, optionally, connect to one or more remote DAS servers,
exposing their contents to the local program. In this case, the local knowledge
base can keep its contents in RAM or use a DB backend to persist them.
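
A minimal sketch of both modes follows. The constructor parameters are taken
from the hyperon-das documentation; the host and port are placeholders, not a
real deployment.

```python
from hyperon_das import DistributedAtomSpace

# A local DAS keeping its knowledge base in RAM (usage 2 above).
local_das = DistributedAtomSpace()

# A local DAS connected to a remote DAS server (placeholder address;
# this call needs an actual deployed server to succeed).
remote_das = DistributedAtomSpace(
    query_engine="remote",
    host="127.0.0.1",
    port=8080,
)
```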

<p align="center">
<img src="assets/components.png" width="900"/>
</p>

Components in the DAS architecture are designed to provide the same
[data manipulation API](https://singnet.github.io/das-query-engine/api/das/)
regardless of whether it's being used locally or remotely or, in the case of
a local DAS, whether DB persistence is being used or not.

Part of this API is delegated to the __Traverse Engine__, which interacts with
the Query Engine and the Cache to provide the user with means to traverse the
Atomspace hypergraph. Operations like finding the links pointing from/to a
given atom or finding atoms in the surrounding neighborhood are performed by
this engine, which controls the pre-fetching of surrounding atoms when a
remote DAS is being used, so that following links can be done quickly.
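
A sketch of such a traversal is shown below. The method names follow the
hyperon-das traversal API as documented; treat the details as illustrative
rather than definitive.

```python
from hyperon_das import DistributedAtomSpace

das = DistributedAtomSpace()
das.add_link({
    "type": "Inheritance",
    "targets": [
        {"type": "Concept", "name": "human"},
        {"type": "Concept", "name": "mammal"},
    ],
})

# Position a cursor on the "human" node and walk the hypergraph from it.
handle = das.get_node_handle("Concept", "human")
cursor = das.get_traversal_cursor(handle)
print(cursor.get())              # the atom currently under the cursor
for link in cursor.get_links():  # links pointing from/to the cursor
    print(link)
cursor.follow_link()             # hop to a neighboring atom
```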

The __Query Engine__ is where global queries are processed. These are queries
for specific atoms or sets of atoms that satisfy some criteria, including
pattern matching. When making a query, the user can specify whether only local
atoms should be considered or whether atoms in remote DASs should be searched
as well. In the latter case, the Query Engine connects to the remote OpenFaaS
servers to run the queries in the remote DASs and returns an answer which is a
proper combination of local and remote information. For instance, if there are
different versions of the same atom locally and in one of the remote DASs, the
local version is returned.

Both engines use the __Cache__ in order to make queries involving a remote DAS
faster. The DAS cache is not exactly like a traditional cache, where data is
stored in basically the same way in both the cache and the primary data
repository, and queries are answered by searching the former and then the
latter. The DAS cache implements this functionality, but it also sorts and
partitions query results in such a way that the caller sees the most relevant
results first.

All queries that return more than one atom return an iterator to the results
instead of the results themselves. This way, only a subset of the results is
returned in a remote query. When the caller iterates through this iterator,
further chunks of results are fetched on demand from the remote DAS until all
the results have been visited. Before splitting the results into chunks, the
resulting atoms are sorted by "relevance", which can be a measure based on
atoms' Short and Long Term Importance (STI and LTI), so that the most relevant
results are iterated first. This is important because most AI agents make
several queries and visit the results in a combinatorial fashion, so visiting
every single possible combination of results is not practical. Having results
sorted by relevance allows the agents to constrain the search and eventually
avoid fetching too many chunks of results from the remote server.
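
From the caller's point of view, this chunked iteration looks like the sketch
below. `process` and `done` are hypothetical application callbacks, and the
remote address is a placeholder.

```python
from hyperon_das import DistributedAtomSpace

# Placeholder address for a deployed DAS server.
das = DistributedAtomSpace(query_engine="remote", host="127.0.0.1", port=8080)

# get_links() returns an iterator: for a remote DAS, only the first chunk
# of relevance-sorted results is fetched up front.
for link in das.get_links(link_type="Inheritance"):
    process(link)  # hypothetical application callback
    if done():     # hypothetical stop condition; stopping early skips
        break      # fetching the remaining chunks
```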

The __AtomDB__ is somewhat like a Data Access Object or a database interface
layer that abstracts the calls to the database where atoms are actually stored.
Having this abstraction is important because it allows us to change or to
extend the actual data storage without affecting the query algorithms (such as
pattern matching) implemented in the traverse and query engines. AtomDB can be
backed by in-RAM data structures or by one or more DBMSs.
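
A rough sketch of the kind of abstraction AtomDB provides is shown below. This
is not the actual hyperon-das class, just an illustration of the idea.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

class AtomBackend(ABC):
    """Illustrative AtomDB-style interface; the engines program
    against it instead of against a concrete database."""

    @abstractmethod
    def add_node(self, node: Dict[str, Any]) -> str: ...

    @abstractmethod
    def add_link(self, link: Dict[str, Any]) -> str: ...

    @abstractmethod
    def get_atom(self, handle: str) -> Dict[str, Any]: ...

class InRamBackend(AtomBackend):
    """In-RAM storage; a Redis/MongoDB-backed class would
    implement the same interface."""

    def __init__(self) -> None:
        self._atoms: Dict[str, Dict[str, Any]] = {}

    def add_node(self, node: Dict[str, Any]) -> str:
        handle = f"node:{node['type']}:{node['name']}"
        self._atoms[handle] = node
        return handle

    def add_link(self, link: Dict[str, Any]) -> str:
        # In this sketch, targets are assumed to be atom handles.
        handle = f"link:{link['type']}:{','.join(link['targets'])}"
        self._atoms[handle] = link
        return handle

    def get_atom(self, handle: str) -> Dict[str, Any]:
        return self._atoms[handle]
```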

## Higher Level Indexing

DAS uses a DBMS to store atoms, leveraging the indexing capabilities of this
DBMS to retrieve atoms faster. But in addition to this, DAS also creates other
custom indexes and stores them in another DBMS. The most relevant of these
indexes is the Pattern Inverted Index.

An inverted index is a data structure which stores a map from contents (words,
sentences, numbers, etc.) to the places where they can be found in a given data
container (database, file system, etc.).

This type of data structure is widely used in document retrieval systems to
implement efficient search engines. The idea is to spend computational time,
when documents are inserted into the document base, indexing and recording the
words that appear in each document (and possibly the positions where they occur
inside the documents). Afterwards, this index can be used by the search engine
to efficiently locate documents that contain a given set of keywords.

The entities in OpenCog Hyperon's context are different from the ones in
typical document retrieval systems, but their roles and the general idea of the
algorithms are very similar. In OpenCog Hyperon's context, a knowledge base is
a set of toplevel links (which may point to nodes or to other links). When
the knowledge base is loaded, we can create an inverted index of the patterns
present in each toplevel link and use that index later to perform pattern
matching.

For instance, given a toplevel link like this one:

```
Inherits
<Concept A>
<Concept B>
```

We could add entries like these to the Pattern Inverted Index (where `H1`
is the handle of the toplevel link above):

```
Inherits * <Concept B> ==> H1
Inherits <Concept A> * ==> H1
Inherits * * ==> H1
```
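
A small sketch of how such entries could be generated (pure illustration, not
the hyperon-das implementation): every combination of targets is replaced by
the `*` wildcard, and each resulting key maps to the link's handle.

```python
from itertools import product

index: dict = {}

def index_link(handle: str, link_type: str, targets: list) -> None:
    """Record every wildcard pattern of a toplevel link in the index."""
    for mask in product([False, True], repeat=len(targets)):
        if not any(mask):
            continue  # skip the fully concrete key, as in the example above
        key = (link_type,) + tuple(
            "*" if wild else target for wild, target in zip(mask, targets)
        )
        index.setdefault(key, set()).add(handle)

index_link("H1", "Inherits", ["<Concept A>", "<Concept B>"])
# index[("Inherits", "*", "<Concept B>")] == {"H1"}, and so on.
```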

## Pattern Matcher

DAS' query engine can answer pattern matching queries. These are queries where
the caller specifies a _pattern_, i.e. a boolean expression of subgraphs with
nodes, links and wildcards, and the engine finds every subgraph in the
knowledge base that satisfies the given expression.

For instance, suppose we have the following knowledge base in DAS.

<p align="center">
<img src="assets/pmquery_1.png" width="500"/>
</p>

We could search for a pattern like:

```
AND
Similar(V1, V2)
NOT
AND
IS_A(V1, V3)
IS_A(V2, V3)
```

`V1`, `V2` and `V3` are wildcards, or variables. In any candidate subgraph
answer, the atom replacing `V1`, for instance, must be the same in all the
links where `V1` appears. In other words, with this pattern we are searching
for two nodes `V1` and `V2` such that there exists a similarity link between
them but there is no pair of inheritance links pointing `V1` and `V2` to the
same node `V3`, whatever the value of `V3`.

In this example, `Chimp` and `Human` are not a suitable answer to replace `V1`
and `V2` because there's a possible value for `V3` that satisfies the `AND`
clause in the pattern, as shown below.

<p align="center">
<img src="assets/pmquery_2.png" width="500"/>
</p>

On the other hand, there are other pairs of nodes which could be used to match
`V1` and `V2` without matching the `AND` clause, as shown below.

<p align="center">
<img src="assets/pmquery_3.png" width="500"/>
</p>

The answer to the query is the set of all subgraphs that satisfy the pattern.
In our example, the answer would be as follows.

<p align="center">
<img src="assets/pmquery_4.png" width="500"/>
</p>
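
For reference, the positive part of such a pattern looks like the sketch below
through the hyperon-das API. The dictionary query format follows the library's
documentation, but the boolean `NOT` composition shown above is omitted, and
the knowledge base is assumed to be already loaded.

```python
from hyperon_das import DistributedAtomSpace

das = DistributedAtomSpace()  # assume the knowledge base above is loaded

# Pattern "Similar(V1, V2)": a link template whose targets are variables.
results = das.query({
    "atom_type": "link",
    "type": "Similar",
    "targets": [
        {"atom_type": "variable", "name": "V1"},
        {"atom_type": "variable", "name": "V2"},
    ],
})

for answer in results:
    print(answer.assignment)  # maps V1 and V2 to concrete atom handles
```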

## Mapping knowledge bases to nodes and links

Before loading a knowledge base into DAS, you need to define a proper mapping
to Atomspace nodes and links. DAS doesn't make any assumptions regarding node
or link types, arity, etc. When adding nodes and links using DAS' API, one may
specify atom types freely, and the semantic meaning of such atom types is
entirely the application's concern. DAS doesn't perform any kind of processing
based on pre-defined types (actually, there are no internally pre-defined atom
types), as the sketch below illustrates.
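
A minimal sketch of adding atoms with application-chosen types; the type names
here ("Concept", "Inheritance") are arbitrary, since DAS imposes none.

```python
from hyperon_das import DistributedAtomSpace

das = DistributedAtomSpace()

# These type names carry no built-in meaning for DAS;
# the application defines their semantics.
das.add_node({"type": "Concept", "name": "human"})
das.add_node({"type": "Concept", "name": "mammal"})
das.add_link({
    "type": "Inheritance",
    "targets": [
        {"type": "Concept", "name": "human"},
        {"type": "Concept", "name": "mammal"},
    ],
})
```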

DAS also doesn't provide a way to read a text, SQL, or any other type of file
in order to load a knowledge base; there's no DAS-defined file syntax for this.
If one needs to import a knowledge base, one must provide a loader application
that parses the input file(s) and makes the proper calls to DAS' API to add the
nodes and links, along the lines of the sketch below.
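
As an illustration, here is a minimal loader for a hypothetical line-based file
format; the format and file name are invented for the example.

```python
from hyperon_das import DistributedAtomSpace

def load_pairs(das: DistributedAtomSpace, path: str) -> None:
    """Parse lines like 'Inheritance human mammal' (a made-up format)
    and add the corresponding links through DAS' API."""
    with open(path) as input_file:
        for line in input_file:
            link_type, name_a, name_b = line.split()
            das.add_link({
                "type": link_type,
                "targets": [
                    {"type": "Concept", "name": name_a},
                    {"type": "Concept", "name": name_b},
                ],
            })

das = DistributedAtomSpace()
load_pairs(das, "knowledge_base.txt")  # placeholder file name
```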

One of the interesting topics for future/ongoing work on DAS is providing
loaders (and the corresponding node/link mappings) for different knowledge base
formats like SQL, Atomese, etc. We already have such a
[loader for MeTTa files](https://github.com/singnet/das-metta-parser).

## DAS Server Deployment and Architecture

DAS server is deployed in a [Lambda Architecture](https://en.wikipedia.org/wiki/Lambda_architecture)
based either in [OpenFaaS](https://www.openfaas.com/) or [AWS Lambda](https://aws.amazon.com/pm/lambda/).
We made a comparative study of these two architectures (results are presented in
this [report](https://docs.google.com/document/d/1kQhM62T3TIb3ECoqBxmqPulMciJq7Vev3-f_Gnhes_s/))
and decided to prioritize OpenFaaS.
Although deployment in AWS Lambda is still possible, currently only OpenFaaS is
supported by [our automated deployment tool](https://github.com/singnet/das-toolbox).
This architecture is presented in the diagram below.


<p align="center">
<img src="assets/architecture.jpg" width="800"/>
</p>

When deploying in AWS Lambda, [Redis](https://redis.io/) and
[MongoDB](https://www.mongodb.com/) can be replaced by AWS'
[ElastiCache](https://aws.amazon.com/elasticache/redis/) and
[DocumentDB](https://aws.amazon.com/pt/documentdb/), respectively, but the
overall structure is basically the same.

Functions are deployed to cloud servers as
[Docker](https://www.docker.com/) containers, built by [our CI/CD pipeline](Deployment-Process.md)
through automated [GitHub Actions](https://docs.github.com/en/actions) scripts,
and stored in a private Docker Hub registry.

Clients can connect using HTTP, gRPC, or external lambda functions (OpenFaaS
functions can only connect to OpenFaaS, and the same is true for AWS
functions).

DAS is versioned and released as a [library in PyPI](https://pypi.org/project/hyperon-das/).