[Almost] Standalone Map/Reduce. Works on top of a PUB/SUB engine (SCBR).
Work in progress: adaptation to be pluggable into Hadoop and [maybe] Spark.
The MapReduce programming model has consistently gained ground as a viable solution for ensuring the scalability of distributed data processing. The generic model, composed of two main functions, map and reduce, is widely used to implement applications that leverage parallel task processing. In this model, the data to be processed is typically distributed over a set of mapper nodes, which apply in parallel a map function that maps individual data items to a finite set of predefined keys. The output is redistributed in a shuffle step, based on the key mappings, to reducer nodes, which execute in parallel a reduce function that processes the set of data items corresponding to each key. This model covers processing tasks ranging from simple, composable operations such as counting, sorting, and searching to more complex algorithms such as cross-correlation or PageRank.
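To make this flow concrete, the framework-independent sketch below simulates the three steps locally for a word count, written in Lua (the scripting language used for job code here). In the actual system, the map and reduce calls run in parallel on distinct worker nodes; this is only an illustration of the model.

```lua
-- Framework-independent sketch of the map/shuffle/reduce pipeline (word count).

-- map: emit (word, 1) for every word in an input line
local function map(line)
  local emitted = {}
  for word in line:gmatch("%a+") do
    emitted[#emitted + 1] = { key = word:lower(), value = 1 }
  end
  return emitted
end

-- shuffle: group emitted values by key
local function shuffle(all_pairs)
  local groups = {}
  for _, p in ipairs(all_pairs) do
    groups[p.key] = groups[p.key] or {}
    table.insert(groups[p.key], p.value)
  end
  return groups
end

-- reduce: combine the values of one key (here, sum the counts)
local function reduce(key, values)
  local total = 0
  for _, v in ipairs(values) do total = total + v end
  return total
end

-- driver: in the real framework, map and reduce run in parallel on worker nodes
local input = { "the quick brown fox", "the lazy dog" }
local emitted = {}
for _, line in ipairs(input) do
  for _, p in ipairs(map(line)) do emitted[#emitted + 1] = p end
end
for key, values in pairs(shuffle(emitted)) do
  print(key, reduce(key, values))
end
```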
Package dependencies:
- libzmq3-dev
- libboost-system-dev
- libboost-chrono-dev
- libcrypto++-dev
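On Debian/Ubuntu-based systems, these can typically be installed with:
$ sudo apt-get install libzmq3-dev libboost-system-dev libboost-chrono-dev libcrypto++-dev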
In such a system, we have two roles:
- Client : sends code and data to workers, receives the output
- Worker : responsible for processing data based on received code
Workers are autonomous and must be started separately. The system interface lies on the client side. Since the framework reuses SCBR as its dissemination channel, the SCBR message and communication interfaces also apply here. However, the Lightweight MapReduce framework has a protocol of its own; refer to the paper for more information.
If the three repositories below are cloned in a common root, the relative paths should already be correctly configured. Otherwise, the Makefiles must be adjusted.
Source code:
$ git clone https://gitlab.securecloud.works/rafael.pires/mapreduce.git
$ git clone https://gitlab.securecloud.works/rafael.pires/scbr.git
Dependency:
$ git clone https://gitlab.securecloud.works/rafael.pires/sgx_common.git
Within the SCBR publication header we use the following fields:
- sid : Session identification. It can be a random number used to identify each session, so that messages belonging to distinct ones are not mixed.
- type : tags each publication according to its role in the protocol, as follows:
  - INVALID_TYPE : undefined message
  - MAP_CODETYPE : the publication conveys code regarding the map step
  - REDUCE_CODETYPE : the publication conveys code regarding the reduce step
  - MAP_DATATYPE : the publication conveys data to be processed by mappers
  - REDUCE_DATATYPE : the publication conveys data to be processed by reducers
  - RESULT_DATATYPE : the publication conveys the resulting data after processing
  - JOB_ADVERTISE : the publication announces that a new task is available
  - JOB_REQUEST : the publication conveys data that allows a worker to be assigned to a given task
The session identification (sid) and the type of message (type) in which the subscriber is interested are also used in the subscription filter. In addition, subscription headers include the node identification. All three fields use the equality operator, meaning that all three must match a publication header.
- dst : identification of the node intended to receive messages belonging to a given type and session.
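As a purely illustrative sketch of the equality-matching semantics described above (plain Lua tables with example values, not the actual SCBR wire format or matching code):

```lua
-- Illustration only: equality matching of a subscription against a publication header.
-- Field names follow the list above; sid, type and dst values are made-up examples.
local publication_header = { sid = 42, type = "MAP_DATATYPE", dst = "worker-3" }
local subscription       = { sid = 42, type = "MAP_DATATYPE", dst = "worker-3" }

local function matches(sub, pub)
  for field, value in pairs(sub) do
    if pub[field] ~= value then return false end  -- every constraint uses equality
  end
  return true
end

print(matches(subscription, publication_header))  -- true: all three fields match
```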
Input: sender identification, vector of header keys and values
Output: SCBR headers
Users: clients
Input: filenames for code (in Lua scripting language) and data
Output: none. It dispatches jobs
Users: clients
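Since the entry points expected in the Lua code file are not detailed in this README (see the paper and the examples in the repository), the skeleton below is only an assumed shape for a word-count code file; the function names map and reduce and their signatures are assumptions for illustration.

```lua
-- Assumed shape of a Lua code file for a word-count job.
-- The actual entry-point names/signatures expected by the framework may differ.

function map(item)
  local out = {}
  for word in tostring(item):gmatch("%a+") do
    out[#out + 1] = { word:lower(), 1 }   -- emit (word, 1)
  end
  return out
end

function reduce(key, values)
  local sum = 0
  for _, v in ipairs(values) do sum = sum + v end
  return sum                              -- total count for this word
end
```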