Releases: Pometry/Raphtory
raphtory-akka-0.4.0
0.4.0 is a drastic overhaul of almost every component within Raphtory. This includes:
- A new analysis engine with a clearer, more intuitive API;
- A full rework of Raphtory orchestration, allowing for clean local and distributed deployment;
- Batching and backpressure added between all components, improving stability and providing 10-100x speed up in ingestion and analysis conversion;
- Integration tests for all update types, different analysis tasks and windowed perspectives;
- CI/CD built on top of the tests for automated testing of branches, publishing of nightly builds and release tagging.
Analysis Control Overhaul
- Added the `Query Manager` and `Query Handler` to deal with the organisation of a query and replace the Analysis Manager/Task Handlers.
- Added `PerspectiveControllers` which look after all `perspectives` (combinations of timestamp + windows) within a job.
- Added the `QueryExecutor` inside of each Partition to run submitted algorithms - replacing the `AnalyserExecutor`.
- Centralised `safetime` checking within the watermarker instead of asking all the Executors independently.
- Removed the REST API and replaced it with a programmatic `RaphtoryClient` which can be run via a compiled Jar or via the Scala REPL.
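
As a rough illustration of what a perspective is, the sketch below enumerates every timestamp/window combination for one job. The `Perspective` case class, the `perspectives` function and its start/end/increment parameters are hypothetical names chosen for this example; they are not the actual `PerspectiveController` interface.

```scala
object PerspectiveSketch {
  // Hypothetical model: a view of the graph at `timestamp`, optionally
  // restricted to the updates inside the preceding `window`.
  final case class Perspective(timestamp: Long, window: Option[Long])

  // Enumerate every (timestamp, window) pair for a job that walks from
  // `start` to `end` (inclusive) in steps of `increment`.
  def perspectives(start: Long, end: Long, increment: Long, windows: List[Long]): List[Perspective] = {
    require(increment > 0, "increment must be positive")
    // No windows means a single unwindowed perspective per timestamp.
    val windowOptions: List[Option[Long]] =
      if (windows.isEmpty) List(None) else windows.map(w => Some(w))
    for {
      timestamp <- (start to end by increment).toList
      window    <- windowOptions
    } yield Perspective(timestamp, window)
  }
}
```

Three timestamps combined with two windows gives six perspectives, which is the set of views a single job would have to work through under these assumptions.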
New Graph Algorithm Structure
- Added `Graph Algorithm`, `Graph Perspective` and `Table` traits as the components of our new algorithm API.
- Graph algorithms consist of calling `Step` and `Iterate` on the graph perspective - these take a function to run on each vertex. This massively expands the analytical facilities of Raphtory allowing the chaining of multiple algorithmic steps.
- Once an algorithm has been defined via step/iterate the user may call `select` which converts each vertex to a row and returns a table abstraction. This table may be `filtered`, `exploded` (turning one row into N rows as defined by a user given function) and then written to disk via `writeTo`.
- Global aggregation is currently removed as it was causing several issues in the previous analytical model. Elements such as counting, groupBy, topK, etc. will be added in the next minor version.
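
To make the step/select/filter/explode chain concrete, here is a toy in-memory model of it. `ToyVertex`, `ToyPerspective`, `ToyTable` and every signature below are hypothetical stand-ins written for this note, not Raphtory's real traits. `iterate` (which needs vertex messaging) and `writeTo` (which writes to disk) are left out to keep the sketch small.

```scala
import scala.collection.mutable

object AlgorithmSketch {
  type Row = List[Any]

  // Hypothetical vertex: an ID, its neighbours and a mutable state map.
  final class ToyVertex(val id: Long, val neighbours: List[Long]) {
    private val state = mutable.Map.empty[String, Any]
    def setState(key: String, value: Any): Unit = state.update(key, value)
    def getState[T](key: String): T = state(key).asInstanceOf[T]
  }

  // Hypothetical table abstraction over the rows produced by `select`.
  final class ToyTable(val rows: List[Row]) {
    def filter(keep: Row => Boolean): ToyTable = new ToyTable(rows.filter(keep))
    // One input row may become N output rows.
    def explode(f: Row => List[Row]): ToyTable = new ToyTable(rows.flatMap(f))
  }

  final class ToyPerspective(vertices: List[ToyVertex]) {
    // Run `f` once on every vertex, then allow further chaining.
    def step(f: ToyVertex => Unit): ToyPerspective = { vertices.foreach(f); this }
    // Convert each vertex into a row.
    def select(f: ToyVertex => Row): ToyTable = new ToyTable(vertices.map(f))
  }

  // Degree per vertex, keep vertices with degree >= 2, one row per neighbour.
  def neighbourRows(): List[Row] = {
    val graph = new ToyPerspective(List(
      new ToyVertex(1L, List(2L, 3L)),
      new ToyVertex(2L, List(1L)),
      new ToyVertex(3L, List(1L))
    ))
    graph
      .step(v => v.setState("degree", v.neighbours.size))
      .select(v => List[Any](v.id, v.getState[Int]("degree"), v.neighbours))
      .filter(row => row(1).asInstanceOf[Int] >= 2)
      .explode(row => row(2).asInstanceOf[List[Long]].map(n => List[Any](row.head, n)))
      .rows
  }
}
```

Only vertex 1 has two neighbours, so `neighbourRows()` yields one `(vertex, neighbour)` row for each of its two edges.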
New Analysis Features
- Added Explode Edges and helpers to view Temporal Edges as singular entities (one for each update).
- Added a global vertex count within Graph Algorithms via `graph.nodeCount` which changes based upon the perspective.
- Added an equivalent function to assignID inside of the graphAlgorithm called checkID. This is a helper function which allows the user to feed queries with the strings that exist in the raw data and convert these to the internal long ID.
- Changed property access to take a class tag (i.e. `getState[String]()`) instead of requiring `.asInstanceOf[T]`
- Added `getStateOrElse` to vertex visitor.
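
The class-tag based property access can be pictured roughly as below. This is a minimal sketch assuming state is held in a string-keyed map. The `ToyVertex` class, the `key` parameter and the error handling are assumptions made for illustration, and the real `getState`/`getStateOrElse` signatures may differ.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

object StateAccessSketch {
  final class ToyVertex {
    private val state = mutable.Map.empty[String, Any]
    def setState(key: String, value: Any): Unit = state.update(key, value)

    // The ClassTag lets the pattern match check the runtime type, so the
    // caller no longer writes `.asInstanceOf[T]` themselves.
    def getState[T: ClassTag](key: String): T = state.get(key) match {
      case Some(value: T) => value
      case Some(other)    => throw new ClassCastException(s"State '$key' holds a ${other.getClass.getName}")
      case None           => throw new NoSuchElementException(s"No state stored under '$key'")
    }

    // Falls back to `default` when the key is missing (or, in this sketch,
    // when the stored value has a different type).
    def getStateOrElse[T: ClassTag](key: String, default: => T): T = state.get(key) match {
      case Some(value: T) => value
      case _              => default
    }
  }
}
```

A call such as `vertex.getState[String]("label")` then reads as a typed lookup, and `getStateOrElse` saves an explicit existence check the first time a vertex is visited.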
Deployment Overhaul
- Moved away from docker fully, allowing Raphtory to be compiled into a Jar and deployed on bare metal.
- Organised deployment classes into `RaphtoryGraph` for local deployment and `RaphtoryService` for distributed deployments.
- Added a `RaphtoryClient` for users to submit jobs to either Graph or Service.
- Converted the Raphtory Component (top level akka handler of spout, builder, partition, etc) into a Component Factory Object used by all deployment classes i.e. the `RaphtoryGraph` and `RaphtoryService`
- Moved creation of Spout and Builder away from Scala reflection to allow them to have multiple parameters.
Message Batching
- Batched the messages between the spout and the builders to minimise the amount of akka messages sent between them. The size of this is configurable by `RAPHTORY_BUILDER_BATCH_SIZE`.
- Added an outgoing queue map for the builder (one queue for each partition) which will be pulled when the partition is ready for more data.
- Added `RAPHTORY_BUILDER_MAX_CACHE` for the total amount of messages a builder will hold before it stops pulling data from the spout (to stop it becoming memory overloaded itself).
- Added `RAPHTORY_PARTITION_MIN_QUEUE` which is the number of messages in the queue of each partition actor below which the Partition Manager will request more data from the builders.
- Batched effect sync messages between partitions which are pushed out every second.
- Batched vertex messages between Readers during analysis, which are flushed after each superstep has concluded on all vertices in the perspective.
- Changed ChannelID between the builder/partition to be Ints instead of strings and only send 1 per batch.
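
The pull-based flow described in this section can be sketched without any actors. Everything below is a simplified, hypothetical model: `ToyBuilder`, `pullSize` and the method names are invented for the example, updates are plain strings, and the real components exchange akka messages instead of direct method calls.

```scala
import scala.collection.mutable

object BatchingSketch {
  // Hypothetical builder-side cache: one outgoing queue per partition.
  // `pullSize` is an illustrative parameter, not one of the settings above.
  final class ToyBuilder(partitions: Int, pullSize: Int, maxCache: Int) {
    private val outgoing = Vector.fill(partitions)(mutable.Queue.empty[String])

    // Total number of updates currently held across all partition queues.
    def cached: Int = outgoing.map(_.size).sum

    // Backpressure towards the spout: stop pulling once the cache is full
    // (the role RAPHTORY_BUILDER_MAX_CACHE plays in the notes above).
    def shouldPullFromSpout: Boolean = cached < maxCache

    def enqueue(partition: Int, update: String): Unit = outgoing(partition).enqueue(update)

    // Hand a partition at most `pullSize` updates in a single batch.
    def nextBatch(partition: Int): List[String] = {
      val queue = outgoing(partition)
      List.fill(math.min(pullSize, queue.size))(queue.dequeue())
    }
  }

  // Partition side: ask the builders for more data once the mailbox drops
  // below the threshold (the role of RAPHTORY_PARTITION_MIN_QUEUE).
  def shouldRequestMore(queuedMessages: Int, minQueue: Int): Boolean =
    queuedMessages < minQueue
}
```

The point of the design is that data only moves when the receiver asks for it, so a slow partition fills the builder cache, which in turn pauses the spout instead of exhausting memory.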
Message Handling
- Swapped to Twitter Chill library for kryo serialisation instead of altoo-ag. This removed the need to specify each case class manually in the conf and allows users to send their own types.
- Moved all actors onto the large message queue - allowing heartbeats etc to use the normal queue unopposed - shrinking the queue size to 100k, but increasing each message frame to 10MB.
- Bumped up heartbeat monitoring to stop akka complaining when actors are looping through a batch of messages.
- Added a new custom mailbox which tracks the amount of messages in each actor's queue to get a better idea of workload. This is how the Partition Manager knows how much data each partition has to process.
- Swapped the vertex message queue to use ArrayBuffer instead of ListBuffer
- Swapped the watermarker to ask the watchdog when the cluster is ready before probing for timestamps to stop fake dead letters.
Testing:
- Added a new version of the all commands test, which is more extensible than the prior and works with the new API. This also doesn't require the `golden standard` data to be available, instead comparing hashes. The data has also been removed from the repo and is now pulled into /tmp within a user's first run.
- Added a Raphtory pseudo distributed deployment class (`RaphtoryPD`) which simulates a real distributed environment for testing.
- Added speed logging to All commands test for ingestion and queries
- Added speed logging to the Query Manager for all jobs.
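
Comparing hashes instead of shipping the expected output could look something like the sketch below. The notes do not say which hash function or line format the test uses, so SHA-256 over sorted result lines is purely an assumption here, and `resultHash`/`matchesExpected` are hypothetical helper names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object ResultHashSketch {
  // Hash a set of result lines. Sorting first makes the digest independent
  // of the order in which partitions happened to report.
  def resultHash(lines: Seq[String]): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    lines.sorted.foreach { line =>
      digest.update(line.getBytes(StandardCharsets.UTF_8))
      digest.update('\n'.toByte) // separator so "ab","c" differs from "a","bc"
    }
    // Render the digest bytes as lowercase hex.
    digest.digest().map(byte => f"${byte & 0xff}%02x").mkString
  }

  // A test then only needs to store the expected hash, not the full data.
  def matchesExpected(lines: Seq[String], expectedHash: String): Boolean =
    resultHash(lines) == expectedHash
}
```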
CI/CD
- Added CI pipeline workflow via GitHub actions which will
  - Run the all commands test on push to any branch that isn’t master
  - Run a nightly build of develop branch and publish a tag and release on successful build
  - Run a build on push to master branch, bump semantic version, create a tag and release on successful build
- Added in badges to readme to show latest build run status for push and scheduled events
- Added in badges to readme to show latest tag and release versions (SemVer only, so that only published releases are shown rather than nightly builds)
Partition Overhaul
- Have abstracted the graph partition so that we can work on the storage, analysis and ingestion separately. They were very intertwined before.
- Have removed the concept of a local partition/partitioned shared state as this means we can move partitions round a lot easier and support other non-object based partitions i.e. arrow.
- Turned the object based entity visitors and graph lens into interfaces so that we can easily see what a user will have access to + wrapper functions. This also further separates the storage and analysis.
- Current version of the entity storage implements these and has been renamed as `POJOGraph` to make it distinct from later implementation in frameworks such as arrow.
- Reworked all the `worker classes` (router/writer/reader) to have a unique name and not reference the machine they are on.
- Turned the actor lookup functions within the RaphtoryActor into lazy vals so that they do not reinitialise every call.
- Removed all ParTrieMaps (our last parallel data structure) as these were causing resource contention.
Partition Memory Management:
- Swapped all mutable.TreeMaps to ArrayBuffers and moved to using arrays in analysis where possible as these are apparently more efficient based on recent benchmarks (https://www.lihaoyi.com/post/BenchmarkingScalaCollections.html).
- Added a state deduplication step run periodically in the partitions to remove any state which is stored more than once (say in the instance of a vertex added at the same time as some of its edges).
- Removed the HistoricOrdering object as it is no longer used without the trees and one was being instantiated for each entity in the graph taking huge amounts of memory.
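
As an illustration of the deduplication step, the sketch below assumes an entity's history is an `ArrayBuffer` of `(time, alive)` pairs. That representation and the `deduplicate` name are assumptions for this example, and the actual storage layout may differ.

```scala
import scala.collection.mutable.ArrayBuffer

object DedupSketch {
  // Hypothetical history entry: (update time, true = added / false = deleted).
  type HistoryPoint = (Long, Boolean)

  // Drop entries that were stored more than once, e.g. a vertex added at
  // the same time as some of its edges, and keep the history time-ordered.
  def deduplicate(history: ArrayBuffer[HistoryPoint]): ArrayBuffer[HistoryPoint] =
    history.distinct.sortBy(_._1)
}
```

With plain buffers there is no tree keeping keys unique on insert, which is why a periodic pass like this becomes necessary once the tree-based maps are gone.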
Spout
- Deleted the multiline file spout as not used.
- Added a mongoDB spout.
- Added a parquet spout.
- Fixed the Kafka spout to work with the overhaul from 0.3.0
Config
- Added `RAPHTORY_LEADER_ADDRESS`, `RAPHTORY_LEADER_PORT`, `RAPHTORY_BIND_ADDRESS`, `RAPHTORY_BIND_PORT` for specifying where a Raphtory service should be binding to and where to look for the leader of a deployment.
- Added `RAPHTORY_DATA_HAS_DELETIONS` flag to only run the extra steps for handling deletions when such elements exist in the data.
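
A minimal sketch of how these settings might be gathered, assuming they are supplied as environment variables. The `ServiceConfig` case class, the `fromEnv` helper and all default values are placeholders invented for this example, not Raphtory's own loader or defaults.

```scala
object ConfigSketch {
  final case class ServiceConfig(
      leaderAddress: String,
      leaderPort: Int,
      bindAddress: String,
      bindPort: Int,
      dataHasDeletions: Boolean
  )

  // Read the settings from a key/value map (the process environment by
  // default). The fallback values are placeholders for the example only.
  def fromEnv(env: Map[String, String] = sys.env): ServiceConfig =
    ServiceConfig(
      leaderAddress    = env.getOrElse("RAPHTORY_LEADER_ADDRESS", "127.0.0.1"),
      leaderPort       = env.getOrElse("RAPHTORY_LEADER_PORT", "1600").toInt,
      bindAddress      = env.getOrElse("RAPHTORY_BIND_ADDRESS", "127.0.0.1"),
      bindPort         = env.getOrElse("RAPHTORY_BIND_PORT", "1600").toInt,
      dataHasDeletions = env.getOrElse("RAPHTORY_DATA_HAS_DELETIONS", "true").toBoolean
    )
}
```

Taking the map as a parameter keeps the parsing testable without touching the real environment.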
Misc clean up
- Removed all env variables throughout the code, notably in spouts and algorithms - these are all now taken as class arguments.
- Tidied up the Raphtory components and placed previously duplicated akka code such as the mediator into the top level Raphtory Actor.
- Added a `Windows` type as a wrapper to List[Long] to better explain what is happening when you submit a query
- Shortened Raphtory job names, removing the full algorithm path.
- Deleted env-setter as no longer using raphtory docker image.
- Deleted old docker settings inside of `build.sbt` as we no longer build straight into docker.
- Removed all the old compile at runtime code as it is now deprecated.
- Removed old Kamon logging code which caused deadlocking issues.
- Removed the Router Manager and fully renamed the router to graph builder throughout.
- Deleted unneeded utils and actors, including the original seed actor.
- Deleted the old Snapshot Manager
- Increased the frequency of watermarking
- Added logging info of the IP of each Partition joining a Raphtory cluster to better discern ...
raphtory-akka-0.3.0
A large number of changes have occurred in dev causing it to diverge largely from master and current documentation. Before any larger changes are completed (notably snapshotting, breaking away from docker and the creation of a new analysis API) we are releasing 0.3.0. The changes for this are listed below:
Major Changes - Raphtory Management
- Raphtory has been upgraded to use Akka 2.6 and swapped from Netty to Artery
- All messaging is now handled through Kryo serialiser instead of the default Java serialiser
- The Watchdog SeedNode and Watermark Manager are now combined into an orchestration actor group that manages the whole cluster.
- All Raphtory Components are now managed by a Raphtory Component Connector which ensures cluster startup by reporting to the watchdog.
- Writer Workers can now marshal and unmarshal the state of their allocated entity storage to/from parquet.
Major Changes - Analysis Management
- Raphtory’s logic for handling analysis has been totally rewritten. The Analysis Manager now spawns a task that contains the full control logic for each submitted query. This task requests the Reader Workers to create a separate actor for the analysis (the AnalysisSubtaskWorker) which contains all vertex visitors. Once the analysis is completed (across all flattenings) both the Task and SubtaskWorkers can be killed, removing all analytical states and stopping the build-up of visitors/analysis properties over time.
- The above drastically simplified the VertexMultiQueue which now only needs an odd and an even mailbox instead of storing timestamp and windowsize as well.
- PubSub was removed as a communication method between Analysis Task and SubtaskWorkers in favour of direct actor messaging. Completed for performance and practical reasons (new actors spawning requires gossiping of their location which is slow and causes intermittent errors).
- VertexMessageHandler was created to track all vertex messages between SubtaskWorkers.
- ViewLens and WindowedLens compressed into one class (GraphLens)
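
One way to picture the odd/even mailbox pair mentioned above is as double buffering between supersteps: messages sent in one superstep are read in the next, so two mailboxes are enough. The class below is an assumed reconstruction for illustration only. `ToyVertexMultiQueue` and its methods are hypothetical and not the real `VertexMultiQueue`.

```scala
import scala.collection.mutable.ArrayBuffer

object MultiQueueSketch {
  final class ToyVertexMultiQueue[T] {
    private val even = ArrayBuffer.empty[T]
    private val odd  = ArrayBuffer.empty[T]

    private def mailbox(superstep: Int): ArrayBuffer[T] =
      if (superstep % 2 == 0) even else odd

    // Messages sent during `superstep` are delivered for the next one.
    def send(superstep: Int, message: T): Unit = mailbox(superstep + 1) += message

    // Read everything addressed to `superstep`, emptying that mailbox so it
    // can be reused two supersteps later.
    def receive(superstep: Int): List[T] = {
      val box      = mailbox(superstep)
      val messages = box.toList
      box.clear()
      messages
    }
  }
}
```

Because each analysis now lives in its own subtask actor, the queue no longer has to be keyed by timestamp and window size, which is presumably what allowed the simplification.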
Major Changes - Analysis API
- Analysers now require the user to return a map of results which can then be serialised in a variety of ways. This makes the analyser more general and removes the need to edit the code when the user wants to swap from saving to a text file to saving to mongo etc.
- Raphtory queries may now be submitted with a serialiser class which contains the logic to save the results in the user's desired format.
- Raphtory Serialisers handle both windowed and unwindowed flattenings, therefore, ProcessResults and ProcessWindowResults have been replaced with extractResults, removing A LOT of duplicated code.
- The old serialiser type which extended Analyser is now removed as redundant.
- Analysers are now typed, specifying what each subtask worker is returning and, therefore, what the analysis task is aggregating. This removes the need for unpleasant casting inside of extractResults.
- The Query API has had the explicit window type arguments removed (true, false, batched). A user may now simply submit a window set (which can include one window) and Raphtory will handle it internally.
- Vertex Visitor and Edge Visitor were renamed to Vertex and Edge for user clarity.
- Double args array submission is now no longer possible with the RaphtoryGraph removing confusion.
- Example algorithms have been updated with the new API
- Example Algorithms have been given named param alternatives to the args array.
Test Changes
- All Commands Test changed to use set generated file as all scala versions seem to do something different in utils.Random (see testUpdates.txt in dev/allcommands)
- Datablast Analyser added which throws large arrays of data from all subtask workers to see how well Raphtory handles it.
- All commands test converted to unit tests which are fully automated to make comparisons between versions a lot simpler.
Minor Changes
- build.sbt has been cleaned up and organised
- A large amount of package refactoring ahead of breaking the project into core + modules
- The router has been officially renamed GraphBuilder
- Initial SnapshotManager included, but currently stub
- Initial GraphAlgorithm included, but currently a stub
- AnalysisUtils created for misc runtime compilation code.
- Swapped from 32 bit murmur3 hash to 64 bit xxHash to remove chances of collisions.
raphtory-akka-0.2.0
Changelog for v0.2.0
API Changes
- Changed the Graph Builder API such that instead of having to define a graph update case class and pass it to the sendUpdate() function, each update type has its own function. These are then overloaded to cleanly handle properties and types.
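
The shape of this change can be sketched as follows. The method names `addVertex`/`addEdge`, the `VertexAdd`/`EdgeAdd` case classes and the use of string maps for properties are all hypothetical choices made for this example, as the notes do not list the actual function names or parameter types.

```scala
import scala.collection.mutable.ArrayBuffer

object BuilderApiSketch {
  // Hypothetical update messages; previously the user built these by hand.
  sealed trait GraphUpdate
  final case class VertexAdd(time: Long, id: Long, properties: Map[String, String], vertexType: Option[String]) extends GraphUpdate
  final case class EdgeAdd(time: Long, src: Long, dst: Long, properties: Map[String, String]) extends GraphUpdate

  final class ToyGraphBuilder {
    // Stand-in for the outgoing message channel.
    val sent: ArrayBuffer[GraphUpdate] = ArrayBuffer.empty

    // One function per update type, overloaded for properties and types.
    def addVertex(time: Long, id: Long): Unit =
      addVertex(time, id, Map.empty[String, String], None)
    def addVertex(time: Long, id: Long, properties: Map[String, String]): Unit =
      addVertex(time, id, properties, None)
    def addVertex(time: Long, id: Long, properties: Map[String, String], vertexType: Option[String]): Unit =
      sent += VertexAdd(time, id, properties, vertexType)

    def addEdge(time: Long, src: Long, dst: Long): Unit =
      addEdge(time, src, dst, Map.empty[String, String])
    def addEdge(time: Long, src: Long, dst: Long, properties: Map[String, String]): Unit =
      sent += EdgeAdd(time, src, dst, properties)
  }
}
```

The caller writes `builder.addVertex(time, id)` and never constructs the update case class directly, which keeps parsing code short and lets the message format change without touching user code.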
Testing
- Created the Raphtory Component which runs in its own JVM allowing the deployment of Raphtory 'clusters' locally for better testing.
- Created an all commands test generator which runs 300k updates of all types across a distributed deployment.
- Added a StateTest algorithm which extracts the full structural and property histories from a flattening.
- Added functionality to compare versions of Raphtory
- Requires running the all commands test, then querying the new state test and connected components across the history of the resulting graph.
- The results of the prior version and new version can then be compared by a new `ResultsCompare` class which pulls the data from mongo and reports on if it is correct and if a speed up has been obtained.
- These can all be found in `com.raphtory.dev.allcommands`
Internal Improvements
- Fixed the Watermarking functionality.
- All router workers now synchronise fully for bounded datasets.
- All ingestion workers now report the correct safe time.
- Safe times are now aggregated fully in the new Watermark Manager.
- Fixed bug in Router Workers where they were sharing an outbound queue object.
- Fully refactored the messages for Update ingestion and synchronisation.
- Extracted all message parsing logic from Entity Storage and moved to Ingestion worker -- both cleaned up addressing poor function names.
Analysis
- Added a variety of new algorithms -- experimenting with Community Detection, temporal multi-layer LPA and temporal motif counting.
  - These can be found in `com.raphtory.algorithms`
- Entity Visitor Property access has been expanded, but is to be fully redone in the next version.
Extra Bits
- Fully updated readme and documentation.
- Large clean up of the overall project, removing old readme information/pictures, PDF's and other elements which can be moved to the examples repository.
- Moved the src code to be top level.
- Added multi-line Spout, to be merged into standard file spout - can be found in `com.raphtory.spouts`
- Added a dev folder for alpha algorithms and example Raphtory pipelines for testing.
Thanks @wuliaososhunhun @imanehaf @dorely103 @narnolddd @richardclegg @felixcdr @jamesalford @Haaroon @Alnaimi-
raphtory-akka-0.1
Initial Raphtory release now that it is in a useable state to ingest and build the graph. This is a test release for current users and will soon be superseded by v0.2 which will include an updated analysis model.