High performance distributed data processing engine
Skale-engine is a fast, general-purpose distributed data processing system. It provides a high-level API in JavaScript and an optimized parallel execution engine on top of Node.js.
The following figure is a comparison against Spark for a logistic regression program:
Word count using skale:
var sc = require('skale-engine').context();
sc.textFile('/path/...')
  .flatMap(line => line.split(' '))
  .map(word => [word, 1])
  .reduceByKey((a, b) => a + b, 0)
  .count().then(console.log);
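To make the data flow concrete, the following plain-JavaScript sketch mimics what each stage of the pipeline computes on a small in-memory array. It does not use skale-engine: `reduceByKey` and `wordCountPairs` are hypothetical local helpers, written under the assumption that `reduceByKey(reducer, init)` folds all values sharing a key starting from `init`, and that `count()` then returns the number of resulting `[key, value]` pairs.

```javascript
// Plain-JavaScript sketch of the pipeline's semantics; skale-engine is not used here.
// reduceByKey is a hypothetical local helper, not the engine's implementation.
function reduceByKey(pairs, reducer, init) {
  const acc = new Map();
  for (const [key, value] of pairs) {
    // Start from init the first time a key is seen, then fold in each value.
    const current = acc.has(key) ? acc.get(key) : init;
    acc.set(key, reducer(current, value));
  }
  return Array.from(acc.entries());
}

function wordCountPairs(lines) {
  const pairs = lines
    .flatMap(line => line.split(' '))   // one entry per word
    .map(word => [word, 1]);            // [word, 1] pairs
  return reduceByKey(pairs, (a, b) => a + b, 0);
}

const pairs = wordCountPairs(['a b a', 'b c']);
console.log(pairs);         // value: [['a', 2], ['b', 2], ['c', 1]]
console.log(pairs.length);  // value: 3, the number of distinct words
```

Under these assumptions, the final `count()` in the example above would report the number of distinct words rather than the total number of words.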
- In-memory computing
- Controlled memory usage, spill to disk when necessary
- Fast multiple distributed streams
- Real-time lazy compiling and running of execution graphs
- Workers can connect through TCP or websockets
- Simple standalone or fully distributed mode
- Very fast, see benchmark
- Documentation
- Gitter for support and discussion
- skale mailing list for discussion about use and development
- Contributing guide
- Roadmap
The best and quickest way to get started with skale-engine is to use skale to create, run and deploy skale applications.
$ sudo npm install -g skale # Install skale command once and for all
$ skale create my_app # Create a new app, install skale-engine
$ cd my_app
$ skale run # Start a local cluster if necessary and run the app
The rest of this document bypasses the skale toolbelt and uses skale-engine directly. It is aimed at readers who are more interested in the skale-engine architecture, details and internals.
To run the internal examples, clone the skale-engine repository and install the dependencies:
$ git clone https://github.com/skale-me/skale-engine.git --depth 1
$ cd skale-engine
$ npm install
Then start a skale-engine server and workers on the local host:
$ npm start
Then run any of the examples, for instance:
$ ./examples/wordcount.js /etc/hosts
The standalone mode is the default operating mode. All the processes,
master and workers, run on the local host, using the cluster core
Node.js module. This mode is the simplest to operate: no dependencies,
and no server or cluster setup and management required. It is used
like any standard Node.js package: simply require('skale-engine'), and
that's it.
This mode is perfect for development, fast prototyping and tests on a single machine (e.g. a laptop). To scale beyond a single machine, see distributed mode below.
The distributed mode runs the exact same code as standalone mode over a network of multiple machines, thus achieving horizontal scalability.
The distributed mode involves two executables, which must be running before application programs are launched:
- a skale-server process, which is the access point to which the master (user application) and workers (running slaves) connect, either by direct TCP connections or by websockets.
- a skale-worker process, which is a worker controller, running on each machine of the computing cluster and connecting to the skale-server. The worker controller will spawn worker processes on demand (typically one per CPU), each time a new job is submitted.
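The worker controller's role described above can be sketched as follows. This is only an illustration of the "one worker per CPU, per job" idea: `WorkerController`, `onJob` and the injected `spawnWorker` function are hypothetical names that do not come from skale-engine, and real workers would be separate processes rather than the strings used here.

```javascript
const os = require('os');

// Hypothetical worker controller: none of these names come from skale-engine.
class WorkerController {
  // spawnWorker is injected so the sketch stays independent of how workers are started.
  constructor(spawnWorker, workerCount = os.cpus().length) {
    this.spawnWorker = spawnWorker;
    this.workerCount = workerCount;
    this.jobs = new Map(); // jobId -> array of worker handles
  }

  // Called when a new job is submitted: start workerCount workers for that job.
  onJob(jobId) {
    const workers = [];
    for (let i = 0; i < this.workerCount; i++) {
      workers.push(this.spawnWorker(jobId, i));
    }
    this.jobs.set(jobId, workers);
    return workers;
  }
}

// Usage with a stand-in spawn function that only records what it would start.
const controller = new WorkerController((jobId, index) => `${jobId}/worker-${index}`, 2);
console.log(controller.onJob('job-1')); // value: ['job-1/worker-0', 'job-1/worker-1']
```

Injecting the spawn function keeps the sketch runnable without starting any process; an actual controller would more likely rely on something like Node's `child_process` module.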
To run in distributed mode, the environment variable SKALE_HOST must
be set to the skale-server hostname or IP address. If unset, the
application will run in standalone mode. Multiple applications, each
with its own set of workers and master processes, can run simultaneously
using the same server and worker controllers.
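The selection rule above can be summarized by a small sketch. `selectMode` is a hypothetical helper written for illustration, not a skale-engine function, and treating an empty `SKALE_HOST` the same as an unset one is an assumption of this sketch.

```javascript
// Hypothetical helper mirroring the rule described above; not part of skale-engine.
function selectMode(env) {
  const host = env.SKALE_HOST;
  // Assumption: an unset or empty SKALE_HOST means standalone mode.
  return host ? { mode: 'distributed', host } : { mode: 'standalone' };
}

console.log(selectMode(process.env));                  // depends on the current environment
console.log(selectMode({ SKALE_HOST: '192.0.2.10' })); // value: { mode: 'distributed', host: '192.0.2.10' }
```

The address `192.0.2.10` is a placeholder from the documentation address range, standing in for the skale-server hostname or IP address.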
Although not mandatory, running an external HTTP server on worker
hosts, exposing skale temporary files, allows efficient peer-to-peer
shuffle data transfer between workers. If not available, this traffic
will go through the centralized skale-server. Any external HTTP
server such as nginx, apache or busybox httpd, or even Node.js
(although not the most efficient for static file serving) will do.
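For the Node.js option, a minimal static file server might look like the sketch below. It uses only Node core modules; `createFileServer` and `resolveUnderRoot` are hypothetical names, and the directory and port in the usage comment are placeholders, since the actual location of skale temporary files and the way workers are pointed at the server are not covered here (see the skale-worker command line help).

```javascript
const http = require('http');
const fs = require('fs');
const path = require('path');

// Map a request URL to a file under rootDir, or return null if it would escape rootDir.
function resolveUnderRoot(rootDir, requestUrl) {
  const root = path.resolve(rootDir);
  let pathname;
  try {
    pathname = decodeURIComponent(new URL(requestUrl, 'http://localhost').pathname);
  } catch (err) {
    return null; // malformed URL or percent-encoding
  }
  if (pathname.includes('\0')) return null; // reject embedded null bytes
  const target = path.resolve(root, '.' + pathname);
  const inside = target === root || target.startsWith(root + path.sep);
  return inside ? target : null;
}

// Build (but do not start) an HTTP server that streams files from rootDir.
function createFileServer(rootDir) {
  return http.createServer((req, res) => {
    const file = resolveUnderRoot(rootDir, req.url);
    if (!file) {
      res.writeHead(403);
      return res.end();
    }
    const stream = fs.createReadStream(file);
    stream.on('open', () => {
      res.writeHead(200, { 'Content-Type': 'application/octet-stream' });
      stream.pipe(res);
    });
    stream.on('error', () => {
      // Open or read failure (for example a missing file): reply 404 if nothing was sent yet.
      if (!res.headersSent) res.writeHead(404);
      res.end();
    });
  });
}

// Example (directory and port are placeholders):
// createFileServer('/tmp/skale').listen(8080);
```

A dedicated server such as nginx remains the better choice for production; this sketch mainly shows how little is needed to expose a directory over HTTP.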
For further details, see command line help for skale-worker and skale-server.
To run the test suite, first install the dependencies, then run npm test:
$ npm install
$ npm test
The original authors of skale-engine are Cedric Artigue and Marc Vertes.