Erlang elixir distributed computing model versus Transient universe model

I have been watching this interesting presentation of distributed computing in Erlang-elixir by Maciej Kaszubowski:

I want to mention the remarkable points and compare them with the transient way of distributed computing:

min 1:30 It uses a cookie defined as a command-line parameter. It is a good idea. Connections would be required to send the same cookie.

min 3:00 Erlang nodes connect all nodes with all by default and maintain a heartbeat between then, probably because the communication protocol is low level in order to be very fast. It uses UDP probably.

Transient connect only on demand and a node may route request to other destination nodes. It is made for weakly coupled internet nodes rather than for a closed coupled local area network. For this reason transient uses TCP sockets and websockets. A node behind a firewall can be accessed using the HTTP port as long as websockets are enabled. Once connected the remote node can address nodes internal to the firewall trough the gateway node.

4:30 Erlang send asynchronous messages between processes identified by a PID that is unique across the network. The send/receive is transparent. it means that serialization and deserialization is done by the OTP runtime. The ordering of messages between single nodes is guaranteed, but the delivery of every single message is not guaranteed (at most once). If two processes are sending to a third, the ordering of messages by time are not guaranteed.

Transient does not use the concept of the process, considered as computation which run his own thread for all the lifetime of the program, which send and receive messages trough his mailbox. Although this can be simulated since transient-universe has mailboxes (see below), Transient communications are more like teleporting running programs from a node to another (in fact the primitive that does that is called teleport). What is transferred is not a message, but the local variables of the computation. In other words: the execution stack, the closure, the local variables or whatever you may want to call it. The remote node continue the execution until it executes another distributed primitive that translates this computation to some other node.

The fundamental distributed primitive, is teleport which translate the computation to the remote node most recently connected. Then, the second teleport return the computation back to the calling node. wormhole node is used to connect to a node. with both primitives any other distributed primitive can be made:

To run a computation in a remote node and return back, runAt is defined as:

runAt remotenode computation= wormhole node $ do 
                                 teleport
                                 result <- computation
                                 teleport
                                 return result

So a transient program is a "do" sequence where the program execution switch nodes back and forth without any notion of a process statically being run in any particular place sending and receiving messages. In this sense the processes are created and destroyed in the nodes dynamically while the distributed computation is executed.

do
   result <- runAt here dothis
   runAt there dothat
   ...

Don't think that transporting the computation context from node to node is heavy: basically the monad create and interpret REST-like paths which address the remote location where the code must continue in the remote node and gives the variable values necessary for such continuation. There are other optimizations: variables already sent are never being resent again.

Since the execution is synchronous by default, ordering of messages can be guaranteed. although, if required, asynchronous messaging can be used too; if the remote computation executes empty, the computation never will return back. So it would be asynchronous.

min 11:50 As in the case of Erlang/elixir, there is only a single socket connection among two nodes since wormhole connections reuse the same socket. A big message can force a delay for the rest of the messages.

min 12:48 Since remote processes are created and destroyed on demand there is no need to discover already running processes in remote nodes. So no global registration service is necessary.

With one exception: A process can be created permanently in another node, for example, to stream back some events.

   text <- runAt remotenode $ local $ waitEvents   getLine
   localIO $ putStrLn text

That program would read the standard input in the remote node and return back them to the calling node. In the remote node, waitEvents set a thread that execute getLine endlessly in a loop. The second teleport of runAt transport each string back to the calling node where it is printed.

At some time it would be necessary to kill this branch of computation created by the remote thread and any child thread. For this purpose a thread can be named and killed by setRemoteJob name node and killRemoteJob name.

If the remote process does not generate additional threads, this problem does not exist, since teleport in the remote node executes empty, so the remote process dies and his data is garbage collected.

min 14:08: Erlang uses monitors to guarantee that processes are alive, they also do the housekeeping when the processes fail, restart them etc. Since transient processes are created and (if necessary) killed on demand there is a lesser need of a third process in charge of monitoring.

If something fails in a remote node, transient has a unique feature called "cloudExceptions": The programmer can set the program so that exceptions in remote nodes can be handled in the calling node, which can take the appropriate actions depending on the nature of the problem. The calling node is both the client and the monitor.

However there is something additional. Erlang programs in different nodes are supposed to share the same code. Transient programs with completely different codebases can communicate using the "Transient.Move.Services" module. There is a executable that is build with transient-universe called "monitorService" which is in charge of downloading/compiling/executing different services, that are transient executables. they may be demanded by other transient executables (using callService).

Therefore, "monitorService" is a program-level monitor, not a process-level monitor.

16:26: In case of partitions/splits of the network, the registering of process names, necessary in Erlang, produce problems that are absent in transient. However an split of course may inherently produce a loss of data in any distributed system.

In transient, a communication failure may produce an exception in both nodes. The programmer may know the node with which the connection failed and he could handle the problem depending on his particular use case.

19:30 pub-sub mechanism is used heavily in Erlang, since it is an asynchronous architecture. To send the same message to many nodes asyncronously, transient can use runAt to update the mailbox of each node, followed by empty. Since they use empty, it is asynchronous and the alternative operator <|> execute both terms:

  runAt node1 do local (putMailBox "hello" ; empty) <|>  runAt node2 local (putMailBox "hello" ; empty)

  clustered $ do  local (putMailBox "hello" ; empty)

The first snippet above put "hello" in the mailbox for the string type in two remote nodes. There is a different mailbox for each datatype in a node. The second form does it for all the nodes known by the local node.

There is a more optimized pubsub mechanism that is being developed. Since transient networks consider web nodes as part of the cloud, if the program is compiled with GHCJS - the ghc-to-js compiler- a web browser will receive a copy of the program when clicking the URL of a transient node. So a single server can have hundred or thousands of nodes connected via websockets. publish by indiscriminate broadcasting is not efficient in this case. Also in transient, a node is not directly connected to all the rest of the nodes.

21:40 There is work in progress to allow strong consistent solutions in transient, but in general, as Maciej Kaszubowski says, this is a responsibility of the particular programmer of each particular application which has his own particular trade-ofs.

26:47 Erlang distributed computing as well as almost every other distributed computing architecture is asynchronous. Distributed systems in Transient are not necessarily about asynchronous messages. a distributed call like runAt node todo can be asynchronous if it has empty. That means that it return 0 results. It also can be synchronous if return a single result to the caller. And also can be streamed, if it run a non-deterministic primitive like waitEvents in the remote node.

<><