# Spark Kmeans for Spam

Implementation of the k-means algorithm on the Spark system for clustering spam emails.

## Getting Started

### Requirements

- **Spark 0.7.0**: an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
- **sbt 0.11.3**: a build tool for Scala and Java projects. It requires Java 1.6 or later.
- **Scala 2.9.2**: a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way.

### Usage

Run runKmeans.sh with the following parameters:

- path to the SPARK root
- 0 to run only, or 1 to package into a jar and run (required to deploy on cluster nodes)
- Spark configuration file (see below for the format)
- parameter file (see below for the format)
- outLabel: an optional label appended to the output folder name

E.g.

```
./runKmeans.sh /user/spark/spark-0.7.0 0 spark.conf parameter.conf kmeansOn1000spams
```

## Configuration files

### spark.conf

| Parameter | Meaning |
| --- | --- |
| host | master URL passed to Spark |
| appName | name of the job |
| inputFile | path to the input JSON dataset, either local on the machine or an hdfs:// path |
| evalOutput | 1 to evaluate the result, 0 otherwise |

This is an example:

```json
{
  "host": "spark://10.10.14.3:7077",
  "appName": "SparkKmeans",
  "inputFile": "hdfs://10.10.14.3:8020/user/spark/spam/spam10_6.json",
  "evalOutput": 1
}
```

Note: the master URL must follow one of these forms:

| Master URL | Meaning |
| --- | --- |
| local | Run Spark locally with one worker thread (i.e. no parallelism at all) |
| local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine) |
| spark://HOST:PORT | Connect to the given Spark standalone cluster master |

### parameter.conf

| Parameter | Meaning |
| --- | --- |
| mode | k-means method, either "PartialSums" or "Standard" (see Overview on algorithm) |
| initialCentroids | number of initial centroids; a first approximation is sqrt(inputSize/2) |
| convergeDist | stopping condition: iterate until the centroids change by less than this value between iterations |
| maxIter | stopping condition: the number of iterations is limited to this value |
| numSamplesForMedoid | number of elements in a cluster used to update the centroid for the categorical part in the Standard k-means (see Overview on algorithm) |
| weights | feature weights, each between 0 and 1; the weights must sum to 1 |

This is an example:

```json
{
  "mode": "PartialSums",
  "initialCentroids": 100,
  "convergeDist": 0.001,
  "maxIter": 2,
  "numSamplesForMedoid": 3,
  "weights": {
    "space": 0.20,
    "time": 0.20,
    "IP": 0.20,
    "uri": 0.20,
    "botname": 0.20
  }
}
```
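To make these parameters concrete, here is a small self-contained Scala sketch (illustrative only, not the repository's actual code; all names are made up). It shows the sqrt(inputSize/2) heuristic for initialCentroids, how per-feature distances can be combined using the configured weights, and how convergeDist and maxIter jointly bound the iteration:

```scala
// Illustrative sketch -- NOT the project's actual implementation.
object ParameterSketch {

  // Heuristic from the table above: a first guess for the number of centroids.
  def initialCentroids(inputSize: Long): Int =
    math.sqrt(inputSize / 2.0).toInt

  // Combine per-feature dissimilarities with the configured weights.
  // The weights are assumed to sum to 1, as the table requires.
  def weightedDist(featureDists: Map[String, Double],
                   weights: Map[String, Double]): Double =
    featureDists.map { case (f, d) => weights(f) * d }.sum

  def main(args: Array[String]): Unit = {
    val weights = Map("space" -> 0.2, "time" -> 0.2, "IP" -> 0.2,
                      "uri" -> 0.2, "botname" -> 0.2)

    val convergeDist = 0.001
    val maxIter = 2
    var iter = 0
    var centroidShift = Double.MaxValue

    // Stop when the centroids move less than convergeDist between
    // iterations, or after maxIter iterations, whichever comes first.
    while (centroidShift > convergeDist && iter < maxIter) {
      // ... assignment and centroid-update steps would go here ...
      centroidShift = centroidShift / 10000 // stand-in for the real shift
      iter += 1
    }

    println(s"k guess for 1e6 points: ${initialCentroids(1000000L)}")
    println(s"stopped after $iter iteration(s)")
  }
}
```

In the real job the centroid shift is computed from the distance between old and new centroids on each iteration; the constant update above only stands in for that step.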

## Output File

The output files are written to the folder ./out.
Each execution gets its own output folder, ./out[outLabel_]dDATE, containing the following files:

| File | Content |
| --- | --- |
| out | the resulting centroids |
| DATE.log | Spark log file of the executed job |
| parameter.conf | the algorithm parameters used to run the program |

## About

This project deals with the implementation of k-means for multi-dimensional clustering.
