-
Notifications
You must be signed in to change notification settings - Fork 5
Home
In progress: port to Hadoop
SMR is now trying to position itself as a layer on top of hadoop. What’s available now is more or less a straight port of the original SMR to Hadoop. More Hadoop-like features will be created soon.
Using the the code is very similar:
import smr.hadoop._;
import org.apache.hadoop.fs.Path;
val h = Hadoop(Array(""),new Path("output"));
println(h.distribute(1 to 1000,3) reduce ( _+_));
println(h.distribute(1 to 1000,3) map (2*) reduce ( _+_));
where 3 is the number of shards, and 1 to 1000 is the the range.
Building is I hope fairly easy: you just need to specify a path to your scala install and your hadoop install in $SCALA_HOME and $HADOOP_HOME respectively, the run ant.
Testing in non-distributed modes.
scala -cp $CLASSPATH:classes/ -Xplugin:jars/seroverride.jar scripts/hadoop.scala
Note that the interpreter and the script runner cannot be used in a distributed way, because classes defined there are held in memory and so other processes can’t access them. (Getting access to those would be a nice TODO). Also, I have not tested a distributed setup, since I don’t have a cluster set up (yet).
Welcome to the smr wiki!
Examples are in the testscripts/ directory, but here’s an example. Suppose you want to add all the numbers from 1 to 100000, except for those divisible by 7, and you want to multiply them by 6 before adding (say). In normal scala you’d write something like this:
(1 to 100000).filter(_%7!=0).map(BigInt(_*6)).reduceLeft(_+_);
With smr, to parallelize it, you say:
distributor.distribute(1 to 100000).filter(_%7!=0).map(BigInt(_*6)).reduce(_+_);
And that’s it! Notice we use reduce instead of reduceLeft. This is because we need an operator that is fully associative, and not just left associative like Scala’s normal operators.