GetStarted_EC2

Andy Feng edited this page Feb 4, 2017 · 17 revisions

Running TensorFlowOnSpark on EC2

1. Set up a standalone Spark cluster on EC2

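Use scripts/spark-ec2 to launch a Spark cluster with 3 slaves on p2.xlarge (1 GPU, 4 vCPUs) instances with a TFoS AMI. We assume that TFoS_HOME refers to the home directory of the TensorFlowOnSpark source code, and that EC2_KEY and EC2_PEM_FILE are set to the name of your EC2 key pair and the path to its private key file.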

export AMI_IMAGE=ami-cf8b30af
export EC2_REGION=us-west-2
export EC2_ZONE=us-west-2a
export SPARK_WORKER_INSTANCES=3
export EC2_INSTANCE_TYPE=p2.xlarge  
#export EC2_INSTANCE_TYPE=p2.8xlarge
export EC2_MAX_PRICE=0.8
${TFoS_HOME}/scripts/spark-ec2 \
        --key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
        --region=${EC2_REGION} --zone=${EC2_ZONE} \
        --ebs-vol-size=50 \
        --instance-type=${EC2_INSTANCE_TYPE} \
        --master-instance-type=${EC2_INSTANCE_TYPE} \
        --ami=${AMI_IMAGE}  -s ${SPARK_WORKER_INSTANCES} \
        --spot-price ${EC2_MAX_PRICE} \
        --copy-aws-credentials \
        --hadoop-major-version=yarn --spark-version 1.6.0 \
        --no-ganglia \
        --user-data ${TFoS_HOME}/scripts/ec2-cloud-config.txt \
        launch TFoSdemo
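
The launch command above depends on several variables that must already be set. A minimal pre-flight sketch follows; the function name check_launch_env is hypothetical and not part of TFoS or spark-ec2. It only reports which variables are unset or empty and does not validate their values.

```shell
# Hypothetical helper (not part of TFoS): list the variables the launch
# command reads that are unset or empty. Returns non-zero if any are missing.
check_launch_env() {
  missing=0
  for name in TFoS_HOME EC2_KEY EC2_PEM_FILE EC2_REGION EC2_ZONE; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  return "${missing}"
}

check_launch_env || echo "Set the variables listed above before launching."
```

Running this before spark-ec2 avoids a launch that fails halfway because of an empty --key-pair or --identity-file argument.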

You should see the following line, which contains the host name of your Spark master.

Spark standalone cluster started at http://ec2-52-49-81-151.us-west-2.compute.amazonaws.com:8080
Done!
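
If you capture the launch output, the master host name can be pulled out of that line rather than copied by hand. The sketch below uses the sample line shown above; the variable name launch_line is an assumption for illustration, and the pattern assumes the line keeps the "started at http://HOST:PORT" shape.

```shell
# Sample line copied from the launch output shown above.
launch_line='Spark standalone cluster started at http://ec2-52-49-81-151.us-west-2.compute.amazonaws.com:8080'

# Keep only the host part between "http://" and the ":<port>" suffix.
SPARK_MASTER_HOST=$(printf '%s\n' "${launch_line}" |
  sed -n 's|.*started at http://\([^:/]*\):[0-9]*.*|\1|p')

echo "${SPARK_MASTER_HOST}"
```

The resulting value is what the ssh command in the next step expects in place of <SPARK_MASTER_HOST>.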

2. SSH into the Spark master

ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${EC2_PEM_FILE} root@<SPARK_MASTER_HOST>

3. Convert MNIST files into TensorFlow Record format

Execute the following Spark command to convert the MNIST data files into TensorFlow Record format and store them on HDFS.

pushd ${TFoS_HOME}
spark-submit --master local[4] \
--jars ${TFoS_HOME}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output /mnist/tfr \
--format tfr
popd
hadoop fs -ls /mnist/tfr

4. Train a MNIST model

Train a DNN model using the MNIST dataset located at /mnist/tfr/train. First set NUM_GPU and CORES_PER_WORKER according to your instance type.

Instance Type   Settings
p2.xlarge       export NUM_GPU=1; export CORES_PER_WORKER=4
p2.8xlarge      export NUM_GPU=8; export CORES_PER_WORKER=32

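
The settings above can also be selected with a small helper instead of typing the exports by hand. This is a sketch only: set_instance_settings is a hypothetical name, and it covers just the two instance types listed here.

```shell
# Hypothetical helper: map an instance type to the settings listed above.
set_instance_settings() {
  case "$1" in
    p2.xlarge)  NUM_GPU=1; CORES_PER_WORKER=4 ;;
    p2.8xlarge) NUM_GPU=8; CORES_PER_WORKER=32 ;;
    *) echo "no settings listed for instance type: $1" >&2; return 1 ;;
  esac
  export NUM_GPU CORES_PER_WORKER
}

set_instance_settings p2.xlarge
echo "NUM_GPU=${NUM_GPU} CORES_PER_WORKER=${CORES_PER_WORKER}"
```

Any other instance type is rejected rather than guessed, since this page gives no settings for it.
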
pushd ${TFoS_HOME}/src
zip -r ${TFoS_HOME}/tfspark.zip *
popd

export SPARK_WORKER_INSTANCES=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
export MASTER=spark://$(hostname):7077

spark-submit --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/tf/mnist_dist.py \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/lib/native" \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HOME" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/mnist/tf/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images /mnist/tfr/train --format tfr \
--mode train --model mnist_model --tensorboard
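
The core settings in the command above follow from simple arithmetic. The worked example below uses the p2.xlarge values and assumes Spark schedules at most spark.cores.max / spark.task.cpus tasks at a time; CONCURRENT_TASKS is a name introduced here for illustration only.

```shell
CORES_PER_WORKER=4        # p2.xlarge setting from the table above
SPARK_WORKER_INSTANCES=3  # number of slaves launched in step 1

# Value passed as spark.cores.max.
TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))

# With spark.task.cpus equal to CORES_PER_WORKER, each task claims a whole worker.
CONCURRENT_TASKS=$((TOTAL_CORES / CORES_PER_WORKER))

echo "spark.cores.max=${TOTAL_CORES} concurrent tasks=${CONCURRENT_TASKS}"
```

This prints spark.cores.max=12 concurrent tasks=3, so the number of simultaneously running tasks matches the value passed to --cluster_size, which appears to be the intent of these two settings.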

During training, TensorBoard should be reachable at a URL such as http://ec2-52-35-64-3.us-west-2.compute.amazonaws.com:43673/. You may need to adjust your EC2 security group settings to allow access to that port.

The trained model and its checkpoints should be stored on HDFS.

hadoop fs -ls  /user/root/mnist_model

4. Conduct Image Inference using a MNIST model

spark-submit --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/tf/mnist_dist.py \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/lib/native" \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HOME" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/mnist/tf/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images /mnist/tfr/test \
--mode inference \
--model mnist_model \
--output /user/root/predictions

You can now examine the prediction results via:

hadoop fs -cat  /user/root/predictions/* | less
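
To get a quick tally instead of paging through the output, the lines can be piped through sed and awk. This sketch ASSUMES each prediction line contains "Label: <digit>, Prediction: <digit>"; check the real output first and adjust the pattern if it differs. The sample lines below are made up for illustration and are not real results, and the function names are hypothetical.

```shell
# ASSUMPTION: prediction lines look like "... Label: 7, Prediction: 7".
# These sample lines are invented for illustration, not real output.
sample_predictions() {
  cat <<'EOF'
2017-02-04T00:00:00.000000 Label: 7, Prediction: 7
2017-02-04T00:00:00.000001 Label: 2, Prediction: 2
2017-02-04T00:00:00.000002 Label: 1, Prediction: 1
2017-02-04T00:00:00.000003 Label: 5, Prediction: 6
EOF
}

# Reduce each line to "<label> <prediction>", then count the matching pairs.
count_predictions() {
  sed -n 's/.*Label: \([0-9]*\), Prediction: \([0-9]*\).*/\1 \2/p' |
    awk '{ total++; if ($1 == $2) correct++ }
         END { printf "%d of %d correct\n", correct, total }'
}

sample_predictions | count_predictions
```

On the cluster, the same function could be fed from HDFS with: hadoop fs -cat /user/root/predictions/* | count_predictions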

5. Interactive Learning with IPython Notebook

Install the additional software required by IPython notebooks.

pip install ipython "ipython[notebook]"

Launch the IPython notebook on the master node.

pushd ${TFoS_HOME}/examples/mnist
export IPYTHON_OPTS="notebook --no-browser --ip=$(hostname)"
IPYTHON=1 pyspark  --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/tf/mnist_dist.py \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/lib/native" \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HOME" \
--driver-library-path="/usr/local/cuda/lib64"

6. Destroy the Spark cluster

${TFoS_HOME}/scripts/spark-ec2 \
        --key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
        --region=${EC2_REGION} --zone=${EC2_ZONE} \
        destroy TFoSdemo