This repository has been archived by the owner on May 29, 2020. It is now read-only.
dlwh edited this page Sep 12, 2010 · 4 revisions

In progress: port to Hadoop

SMR is now positioning itself as a layer on top of Hadoop. What’s available now is more or less a straight port of the original SMR to Hadoop; more Hadoop-specific features will come soon.

Using the code is very similar:


import smr.hadoop._;
import org.apache.hadoop.fs.Path;

val h = Hadoop(Array(""), new Path("output"));
println(h.distribute(1 to 1000, 3).reduce(_ + _));
println(h.distribute(1 to 1000, 3).map(2 * _).reduce(_ + _));

where 3 is the number of shards and 1 to 1000 is the range to distribute.
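For reference, the two distributed pipelines above compute the same values as the following plain-Scala expressions, which make a handy sanity check you can run without a Hadoop installation:

```scala
// Plain-Scala equivalents of the two distributed pipelines above.
// Each shard reduces locally and the partial sums are combined, so
// the result matches a sequential reduce over the whole range.
val sum        = (1 to 1000).reduce(_ + _)            // sum of 1..1000
val doubledSum = (1 to 1000).map(2 * _).reduce(_ + _) // twice that sum

println(sum)        // 500500
println(doubledSum) // 1001000
```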

Building should be fairly easy: point $SCALA_HOME and $HADOOP_HOME at your Scala and Hadoop installations, then run ant.
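Concretely, a build session might look like this (the installation paths below are placeholders; substitute your own):

```shell
# Point the build at your local installs (example paths, adjust to taste).
export SCALA_HOME=/opt/scala
export HADOOP_HOME=/opt/hadoop

# Run ant from the repository root, where build.xml lives.
ant
```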

To test in non-distributed mode:

scala -cp $CLASSPATH:classes/ -Xplugin:jars/seroverride.jar scripts/hadoop.scala

Note that the interpreter and the script runner cannot be used in a distributed way, because classes defined there are held only in memory, so other processes can’t access them. (Making those accessible would be a nice TODO.) Also, I have not tested a distributed setup, since I don’t have a cluster set up (yet).


Welcome to the smr wiki!

Examples are in the testscripts/ directory, but here’s an example. Suppose you want to add all the numbers from 1 to 100000, except for those divisible by 7, and you want to multiply them by 6 before adding (say). In normal scala you’d write something like this:


(1 to 100000).filter(_ % 7 != 0).map(x => BigInt(x * 6)).reduceLeft(_ + _);

With smr, to parallelize it, you say:


distributor.distribute(1 to 100000).filter(_ % 7 != 0).map(x => BigInt(x * 6)).reduce(_ + _);

And that’s it! Notice we use reduce instead of reduceLeft. Because the work is split across shards and each shard reduces its own slice before the partial results are combined, we need an operator that is fully associative, not just one applied strictly left-to-right as reduceLeft allows.
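A small illustration of why associativity matters when a reduction is sharded (the two-shard split here is my own toy stand-in for what the framework does):

```scala
val xs = (1 to 10).toList

// Addition is associative: reducing each "shard" and then combining
// the partial results agrees with a sequential left-to-right reduce.
val sequential = xs.reduceLeft(_ + _)                              // 55
val sharded    = xs.take(5).reduce(_ + _) + xs.drop(5).reduce(_ + _) // 55

// Subtraction is NOT associative, so the same split gives a
// different answer than the sequential reduceLeft would.
val seqSub   = xs.reduceLeft(_ - _)                                // -53
val shardSub = xs.take(5).reduce(_ - _) - xs.drop(5).reduce(_ - _) // 13
```

So any operator you hand to reduce must give the same answer no matter how the shards are grouped.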
