-
Notifications
You must be signed in to change notification settings - Fork 3
How Sossity Works
See Sossity Readme for command-line args.
Sossity allows everyone across an organization to access and transform data in real-time, without worrying about code or scale.
Sossity is the link between application code, stream processing, and cloud resources.
Given a set of data sources, sinks, and data pipelines in a config file, Sossity determines the dependencies and cloud resources necessary to execute the pipelines. It then creates a Terraform resource file that describes the individual cloud services to create.
Sossity is written in Clojure, because the language has succinct ways to transform and apply data structures -- especially graphs.
Sossity follows several steps:
- Check config file for errors
- Build dependency graph
- Verify dependency graph
- Output Terraform file
Sossity uses Plumatic Schema to define data structures for the config files (config.clj
and test_config.clj
). It applies the schema to the file, and notifies the user of any missing/misspelled/different fields. There are also some "sanity checks", including making sure there are not multiple resources with the same name, etc. There are several checks at each stage of the planning process.
A dependency graph is a collection of nodes and edges describing dependencies between resources and the order they must be built. Unlike other workflow systems, Sossity is not a Directed Acyclic Graph, because error pipelines can return messages to their originator.
Example dependency:
There are 4 main resources in Sossity:
- Sources
- Pipelines
- Sinks
- Edges
Sources are data ingestion points -- currently, only REST endpoints managed by App Engine. Pipelines are Cloud Dataflow jobs. Sinks are Kubernetes images which consume PubSub messages and write to an external source (files, external APIs, or BigQuery). Edges are PubSub queues.
Graph of Terraform resources and their dependencies:
Sossity uses Loom as its graph library.
The steps that Sossity takes to create an entire Flow:
- Create graph nodes from
sources
,sinks
, andpipelines
. A node is an operation on data. - Create graph edges from
targets
inedges
in config file. An edge is a communication channel for data between nodes. - Annotate nodes with metadata from config file, such as
name
,bucket
,transform-jar
. - Calculate dependencies of each node based on its ancestor edges (incoming data) and descendant edges (outgoing data)
- Name resources based on dependencies, for example, an edge might be named
pipeline1-to-pipeline2
. - Using metadata and connectivity, create Terraform resources in
terraform.tf.json
file.
Sossity looks at all sources
, sinks
, and pipelines
and creates a node
data structure for each one.
From the edges
config file entry, Sossity connects every node
to its relatives using an edge
data structure.
Sossity then applies all the metadata from the config file to every node and edge -- this includes information like output buckets, executable jars, etc.
Because of the transitive property, a resource only needs to know its own dependencies, because an in-order traversal of the graph will assure its relatives' dependencies are built as well.
Each resource has different dependencies -- all resources need at least one PubSub, some have other needs (Subscriptions, Cloud Storage buckets, etc.)
Sossity tries to follow a "Convention-over-Configuration" method to name resources. Naming is important, because it allows resources to be easily identifiable and unique.
Names of nodes and pipelines are determined in the config file. PubSubs are named after their inputs and outputs (e.g., pipeline1-to-pipeline2
). Error PubSubs are named after their parent (pipeline1-to-pipeline1-error
). Error sinks are also named after their parent (pipeline1-error
).
A large amount of the Sossity codebase is dedicated to translating the Dependency Graph into a Terraform file. To do this, the metadata for each node and edge is turned into a Terraform resource. For example, a Pipeline creates a Cloud Dataflow job, a Sink is turned into a Kubernetes Container, and a Source is turned into an App Engine Module. See the Sossity Output for more details.
Since every pipelines' errors are output to a queue, you can also write a pipeline jar to read the queue and re-insert the data into the main wokflow. Simply indicate the :repair-jar
in the config file entry for the pipeline
and Sossity will attach it with the correct inputs and outputs.
Error output files can also be read by a Batch Cloud Dataflow job and re-inserted to the main workflow. See this example project (soon) for more details.