Cassovary is a simple "big graph" processing library for the JVM. Most JVM-hosted graph libraries are flexible but not space efficient. Cassovary is designed from the ground up with one priority: efficiently handling graphs with billions of nodes and edges. A typical use is large-scale graph mining and analysis of a big network. Cassovary is written in Scala and can be used from any JVM-hosted language. It comes with some common data structures and algorithms.
Please follow the Cassovary project on Twitter at @cassovary for updates.
See examples/ for some simple examples of using the library.
To build the library:
./sbt update (might take a couple of minutes)
./sbt test
./sbt package-dist
To use the library from another project:
./sbt publish-local
cd ../<dependent project>
sbt update
There are many excellent graph mining libraries already in existence. Most of them have one or more of the following characteristics:
- Written in C/C++. Examples include SNAP from Stanford and GraphLab from CMU. The typical way to use these from the JVM is through JNI bridges.
- Sacrifice storage efficiency for flexibility. Examples include JUNG, which is written in Java but stores nodes and edges as big objects.
- Meant to do much more, typically acting as a full graph database. Examples include Neo4J.
Cassovary, by contrast, is intended to be easy to use in a JVM-hosted
environment and yet efficient enough to scale to billions of edges.
It is deliberately not designed to provide any persistence or database functionality.
It also currently skips any concern with partitioning the graph, so it is
not directly comparable to distributed graph processing systems like
Apache Giraph. Avoiding partitioning lets complex algorithms
run efficiently on the graph; efficiency is otherwise a recurring issue for
distributed graph processing systems because good graph partitions are
known to be difficult to achieve. On the flip side, the size of the
graph it works with is bounded by the memory available in a single machine,
though with space-efficient data structures this does not seem to be a
limitation for most practical graphs. For example, an ArrayBasedDirectedGraph
instance of a unidirectional graph with 10M nodes and 1B edges consumes
less than 6GB of memory, and memory use scales linearly beyond that.
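
To give a feel for why array-based storage stays compact, here is a minimal, self-contained sketch in Java. It is not Cassovary's API or its actual memory layout: the class name `CompactGraphSketch` and all of its methods are hypothetical, and it assumes node ids are 4-byte ints stored in two flat arrays (a compressed sparse row layout). The `estimateBytes` helper is a back-of-the-envelope count of array payload only; JVM object headers and any per-node objects a real implementation keeps would come on top.

```java
import java.util.Arrays;

/** Hypothetical sketch (not Cassovary's API): a directed graph held in two int arrays. */
public final class CompactGraphSketch {
    // Out-edges of node v live in targets[offsets[v] .. offsets[v + 1]).
    private final int[] offsets;
    // Destination node ids, grouped by source node.
    private final int[] targets;

    /** Builds the graph from parallel arrays of edge sources and destinations. */
    public CompactGraphSketch(int numNodes, int[] src, int[] dst) {
        if (src.length != dst.length) {
            throw new IllegalArgumentException("src and dst must have the same length");
        }
        offsets = new int[numNodes + 1];
        for (int s : src) {
            offsets[s + 1]++;                       // count the out-degree of each node
        }
        for (int v = 0; v < numNodes; v++) {
            offsets[v + 1] += offsets[v];           // prefix sums turn counts into offsets
        }
        targets = new int[src.length];
        int[] next = Arrays.copyOf(offsets, numNodes); // next free slot for each node
        for (int i = 0; i < src.length; i++) {
            targets[next[src[i]]++] = dst[i];
        }
    }

    public int outDegree(int v) {
        return offsets[v + 1] - offsets[v];
    }

    /** Returns a copy of v's out-neighbors, in insertion order. */
    public int[] outNeighbors(int v) {
        return Arrays.copyOfRange(targets, offsets[v], offsets[v + 1]);
    }

    /** Rough payload size in bytes of the two arrays, ignoring JVM object headers. */
    public static long estimateBytes(long numNodes, long numEdges) {
        return Integer.BYTES * (numNodes + 1) + Integer.BYTES * numEdges;
    }

    public static void main(String[] args) {
        CompactGraphSketch g =
            new CompactGraphSketch(3, new int[] {0, 0, 1}, new int[] {1, 2, 2});
        System.out.println(g.outDegree(0) + " " + Arrays.toString(g.outNeighbors(0)));
        System.out.println(estimateBytes(10_000_000L, 1_000_000_000L));
    }
}
```

Under these assumptions, 10M nodes and 1B edges need 4,040,000,004 bytes of array payload (about 4 GB), dominated by the 4 bytes per edge. That is in line with the sub-6GB figure quoted above, and it shows why memory grows linearly with the number of edges. An object-per-edge design pays an object header and a reference for every edge, which is where the extra space in more flexible libraries tends to go.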
Mailing list: http://groups.google.com/group/twitter-cassovary
Please report any bugs to: https://github.com/twitter/cassovary/issues
Copyright 2012 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0