## Purpose
This is a repo for my blog post at the Pythian blog.
## Prerequisites
Have Docker, Python 2.7, JRE 1.7, and Scala installed, plus basic familiarity with Spark and the concept of RDDs.
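As a quick way to confirm the tools listed above are available, here is a minimal sketch that looks for them on your `PATH`. The command names `docker`, `java`, and `scala` are assumptions about how the prerequisites are exposed on your machine, and the helpers `find_executable` and `missing_tools` are hypothetical names used only for this illustration; they are not part of this repo.

```python
from __future__ import print_function

import os
import sys


def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if find_executable(name) is None]


if __name__ == "__main__":
    # Assumption: the prerequisites are installed under these command names.
    missing = missing_tools(["docker", "java", "scala"])
    print("Python version:", sys.version.split()[0])
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

This only checks that the commands exist; it does not verify their versions.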
- Clone this repo.
- Create a virtualenv for this project (optional): `mkvirtualenv streaming_percentile`
- Install requirements using `pip install -r requirements.pip`
- Install the Kafka Docker container. If you already have a Kafka cluster set up, you can skip this step.
  * On Mac: `docker run -p 2181:2181 -p 9092:9092 --env ADVERTISED_HOST=$(boot2docker ip) --env ADVERTISED_PORT=9092 spotify/kafka`
  * On Linux: `docker run -p 2181:2181 -p 9092:9092 --env ADVERTISED_HOST=127.0.0.1 --env ADVERTISED_PORT=9092 spotify/kafka`
  * More info on the container can be found in the `spotify/kafka` image documentation.
  * Download and extract the Kafka binaries. This location will now be referred to as `KAFKA_HOME`.
- Install Spark
  * This can be run either locally or in a Docker container.
  * To run locally, download and extract the binaries:
  * `wget http://apache.claz.org/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz`
  * `tar -xvf spark-1.4.0-bin-hadoop2.6.tgz`
  * Run the pyspark shell to confirm, using `./bin/pyspark`
  * If you have IPython installed, you can also run it with `IPYTHON=1 ./bin/pyspark`
  * You can also run this as a Docker container using `docker run -i -t -h sandbox -p 8888:8888 -v my_code:/app anantasty/ubuntu_spark_ipython:1.0 bash`
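Once the Kafka container from the earlier step is running, it can help to confirm that the two published ports (2181 for ZooKeeper, 9092 for Kafka) accept connections before moving on. The following is a minimal sketch using only the standard library. The helper name `port_open` is hypothetical, and the host `127.0.0.1` assumes the Linux setup; on a Mac, substitute the address printed by `boot2docker ip`.

```python
from __future__ import print_function

import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Assumption: the container publishes both services on this address.
    host = "127.0.0.1"
    for name, port in [("ZooKeeper", 2181), ("Kafka", 9092)]:
        status = "reachable" if port_open(host, port) else "NOT reachable"
        print("{0} on {1}:{2} is {3}".format(name, host, port, status))
```

A successful TCP connection only shows that something is listening on the port; it does not prove the broker is healthy or that the advertised host is set correctly.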