Implementation of the K-means algorithm on Spark for clustering spam emails
Requirements:

- Spark 0.7.0: an open-source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
- sbt 0.11.3: a build tool for Scala and Java projects; it requires Java 1.6 or later.
- Scala 2.9.2: a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way.
Run runKmeans.sh with these parameters:
- path to the SPARK root
- 0 to run only, or 1 to package into a jar and run (packaging is required to deploy on cluster nodes)
- Spark configuration file (see below for format)
- parameter file (see below for format)
- outLabel: an optional label added to the output folder name
For example:

    ./runKmeans.sh /user/spark/spark-0.7.0 0 spark.conf parameter.conf kmeansOn1000spams
The Spark configuration file has the following fields:

| Parameter | Meaning |
|---|---|
| host | master URL passed to Spark |
| appName | name of the job |
| inputFile | path to the input JSON dataset, either a local path on the machine or an hdfs:// URI |
| evalOutput | 1 to evaluate the result, 0 otherwise |
This is an example:

    {
      "host": "spark://10.10.14.3:7077",
      "appName": "SparkKmeans",
      "inputFile": "hdfs://10.10.14.3:8020/user/spark/spam/spam10_6.json",
      "evalOutput": 1
    }
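A driver can load these fields with the JSON parser that ships with Scala 2.9. This is only a sketch of one way to read the file, not necessarily how the project parses it; the file name `spark.conf` and the casts are assumptions:

```scala
import scala.io.Source
import scala.util.parsing.json.JSON

// Sketch: read spark.conf with the Scala 2.9 standard-library JSON parser.
// parseFull returns Option[Any]; JSON numbers come back as Double.
val raw  = Source.fromFile("spark.conf").mkString
val conf = JSON.parseFull(raw).get.asInstanceOf[Map[String, Any]]

val host       = conf("host").asInstanceOf[String]
val appName    = conf("appName").asInstanceOf[String]
val inputFile  = conf("inputFile").asInstanceOf[String]
val evalOutput = conf("evalOutput").asInstanceOf[Double].toInt  // 1 or 0
```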
Note: the master URL must follow one of these forms:
| Master URL | Meaning |
|---|---|
| local | Run Spark locally with one worker thread (i.e. no parallelism at all) |
| local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine) |
| spark://HOST:PORT | Connect to the given Spark standalone cluster master |
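In the Spark 0.7 API (package `spark`, not `org.apache.spark`), the master URL and job name are the first two arguments of the SparkContext constructor. A minimal sketch, assuming the `host`, `appName`, and `inputFile` values parsed above:

```scala
import spark.SparkContext

// Sketch: the master URL ("host") and job name ("appName") from spark.conf
// go straight into the SparkContext (Spark 0.7 API).
val sc = new SparkContext(host, appName)
val lines = sc.textFile(inputFile)  // accepts both local and hdfs:// paths
```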
The parameter file sets the algorithm options:

| Parameter | Meaning |
|---|---|
| mode | K-means method: either "PartialSums" or "Standard" (see Overview on algorithm) |
| initialCentroids | number of initial centroids; a first approximation is sqrt(inputSize/2) |
| convergeDist | stopping condition: the algorithm stops when centroids move less than this distance between iterations |
| maxIter | stopping condition: iterations of the algorithm are limited to this value |
| numSamplesForMedoid | number of elements per cluster used to update the centroid for the categorical part in the Standard K-means (see Overview on algorithm) |
| weights | weights for the features, each between 0 and 1; the weights must sum to 1 |
This is an example:

    {
      "mode": "PartialSums",
      "initialCentroids": 100,
      "convergeDist": 0.001,
      "maxIter": 2,
      "numSamplesForMedoid": 3,
      "weights": {
        "space": 0.20,
        "time": 0.20,
        "IP": 0.20,
        "uri": 0.20,
        "botname": 0.20
      }
    }
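To make the two stopping conditions concrete, here is a hedged sketch of an iteration driver in the style of the SparkKMeans example bundled with Spark 0.7. It is not the project's actual code: `points` (an RDD[Vector] of already-extracted numeric features), `initialCentroids`, `convergeDist`, and `maxIter` are assumed to be in scope, and the project's weighted, mixed numeric/categorical distance is simplified here to plain squared Euclidean distance:

```scala
import spark.SparkContext._
import spark.util.Vector

// Sketch of the driver loop: iterate until the centroids move less than
// convergeDist between iterations, or maxIter iterations are reached.
def closestPoint(p: Vector, centers: Array[Vector]): Int =
  centers.indices.minBy(i => p.squaredDist(centers(i)))

var centroids = points.takeSample(false, initialCentroids, 42)
var delta = Double.MaxValue
var iter = 0
while (delta > convergeDist && iter < maxIter) {
  // assign each point to its nearest centroid, then average per cluster
  val closest = points.map(p => (closestPoint(p, centroids), (p, 1)))
  val sums = closest.reduceByKey { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
  val newCentroids = sums.map { case (i, (sum, n)) => (i, sum / n) }.collectAsMap()

  // total movement of the centroids in this iteration
  delta = newCentroids.map { case (i, c) => centroids(i).squaredDist(c) }.sum
  for ((i, c) <- newCentroids) centroids(i) = c
  iter += 1
}
```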
The output files are written to the folder ./out. Each execution has its own output folder, ./out[outLabel_]dDATE, containing the following files:
| File | Content |
|---|---|
| out | resulting centroids |
| DATE.log | Spark log file of the executed job |
| parameter.conf | file with the algorithm parameters used to run the program |