GetStarted_EC2
Use scripts/spark-ec2 to launch a Spark cluster with 3 slaves on p2.xlarge (1 GPU, 4 vCPUs) instances with a TFoS AMI. We assume that TFoS_HOME refers to the home directory of the TensorFlowOnSpark source code, and that EC2_KEY and EC2_PEM_FILE name your EC2 key pair and its private key file.
export AMI_IMAGE=ami-cf8b30af
export EC2_REGION=us-west-2
export EC2_ZONE=us-west-2a
export SPARK_WORKER_INSTANCES=3
export EC2_INSTANCE_TYPE=p2.xlarge
#export EC2_INSTANCE_TYPE=p2.8xlarge
export EC2_MAX_PRICE=0.8
${TFoS_HOME}/scripts/spark-ec2 \
--key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
--region=${EC2_REGION} --zone=${EC2_ZONE} \
--ebs-vol-size=50 \
--instance-type=${EC2_INSTANCE_TYPE} \
--master-instance-type=${EC2_INSTANCE_TYPE} \
--ami=${AMI_IMAGE} -s ${SPARK_WORKER_INSTANCES} \
--spot-price ${EC2_MAX_PRICE} \
--copy-aws-credentials \
--hadoop-major-version=yarn --spark-version 1.6.0 \
--no-ganglia \
--user-data ${TFoS_HOME}/scripts/ec2-cloud-config.txt \
launch TFoSdemo
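The launch command above reads several environment variables, and two of them (EC2_KEY and EC2_PEM_FILE) are not set by the exports shown. A minimal pre-flight sketch, run before launching, could look like the following; the helper name `require_vars` is hypothetical and not part of TensorFlowOnSpark or spark-ec2.

```shell
# Hypothetical helper: return non-zero and name the first variable
# that is unset or empty.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}

# Variables referenced by the launch command in this guide.
require_vars TFoS_HOME EC2_KEY EC2_PEM_FILE EC2_REGION EC2_ZONE \
  EC2_INSTANCE_TYPE AMI_IMAGE SPARK_WORKER_INSTANCES EC2_MAX_PRICE \
  || echo "set the variables above before launching" >&2
```

The check only confirms that each variable is non-empty; it does not verify that the key pair or AMI actually exists in your account.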
You should see the following line, which contains the host name of your Spark master.
Spark standalone cluster started at http://ec2-52-49-81-151.us-west-2.compute.amazonaws.com:8080
Done!
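If you capture the launch output, the master host name can be pulled out of that line instead of being copied by hand. This is a small sketch assuming the line has exactly the shape shown above; `master_host_from_line` is a hypothetical helper name.

```shell
# Hypothetical helper: extract the host name from a
# "Spark standalone cluster started at http://HOST:PORT" line.
# Prints nothing if the line does not match.
master_host_from_line() {
  printf '%s\n' "$1" | sed -n -E 's#.*started at https?://([^:/]+).*#\1#p'
}

line='Spark standalone cluster started at http://ec2-52-49-81-151.us-west-2.compute.amazonaws.com:8080'
SPARK_MASTER_HOST=$(master_host_from_line "$line")
echo "$SPARK_MASTER_HOST"
```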
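Log in to the Spark master over SSH, replacing <SPARK_MASTER_HOST> with the host name from the line above.
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${EC2_PEM_FILE} root@<SPARK_MASTER_HOST>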
Run the following Spark command to convert the MNIST data files into TensorFlow Record (TFRecord) format and store them on HDFS.
pushd ${TFoS_HOME}
spark-submit --master local[4] \
--jars ${TFoS_HOME}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output /mnist/tfr \
--format tfr
popd
hadoop fs -ls /mnist/tfr
Train a DNN model on the MNIST data located at /mnist/tfr/train, then test it on the data at /mnist/tfr/test. Set the following variables according to your instance type.
Instance Type | Settings |
---|---|
p2.xlarge | export NUM_GPU=1; export CORES_PER_WORKER=4 |
p2.8xlarge | export NUM_GPU=8; export CORES_PER_WORKER=32 |
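The table can also be expressed as a small shell function so that the settings follow the instance type automatically. This is a sketch only: `tfos_instance_settings` is a hypothetical name, and the fallback to p2.xlarge when EC2_INSTANCE_TYPE is unset (for example in a fresh shell on the master) is an assumption.

```shell
# Hypothetical helper: set NUM_GPU and CORES_PER_WORKER from the table
# above; fail for instance types the table does not list.
tfos_instance_settings() {
  case "$1" in
    p2.xlarge)  NUM_GPU=1; CORES_PER_WORKER=4 ;;
    p2.8xlarge) NUM_GPU=8; CORES_PER_WORKER=32 ;;
    *) echo "no settings listed for instance type: $1" >&2; return 1 ;;
  esac
  export NUM_GPU CORES_PER_WORKER
}

# Assumption: default to p2.xlarge if the variable is not set in this shell.
tfos_instance_settings "${EC2_INSTANCE_TYPE:-p2.xlarge}"
echo "NUM_GPU=${NUM_GPU} CORES_PER_WORKER=${CORES_PER_WORKER}"
```

With 3 workers on p2.xlarge, the TOTAL_CORES value computed later in this guide is 4 * 3 = 12.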
pushd ${TFoS_HOME}/src
zip -r ${TFoS_HOME}/tfspark.zip *
popd
export SPARK_WORKER_INSTANCES=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
export MASTER=spark://$(hostname):7077
spark-submit --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/tf/mnist_dist.py \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/lib/native" \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HOME" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/mnist/tf/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images /mnist/tfr/train --format tfr \
--mode train --model mnist_model --tensorboard
During training, TensorBoard should be reachable at a URL such as http://ec2-52-35-64-3.us-west-2.compute.amazonaws.com:43673/. You may need to adjust your EC2 security settings to open that port.
The trained model and its checkpoints should be stored on HDFS.
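As an alternative to opening the port in the security group, an SSH tunnel through the Spark master may work. The sketch below only prints the command so you can inspect it before running it on your local machine. It assumes the master can reach the TensorBoard host on that port, and `tb_tunnel_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: print (do not run) an ssh command that forwards
# a local port to TensorBoard via the Spark master.
# $1: TensorBoard URL as reported during training
# $2: public host name of the Spark master
tb_tunnel_cmd() {
  hostport="${1#http://}"      # strip the scheme
  hostport="${hostport%%/*}"   # strip any trailing path
  host="${hostport%%:*}"
  port="${hostport##*:}"
  printf 'ssh -i "${EC2_PEM_FILE}" -N -L %s:%s:%s root@%s\n' \
    "$port" "$host" "$port" "$2"
}

tb_tunnel_cmd "http://ec2-52-35-64-3.us-west-2.compute.amazonaws.com:43673/" "<SPARK_MASTER_HOST>"
```

While the printed command is running, TensorBoard would be available at http://localhost:43673/ for the URL in this example.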
hadoop fs -ls /user/root/mnist_model
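In a script, it can be useful to stop before inference if the model directory is missing. A minimal sketch using `hadoop fs -test -d`, which exits with status 0 when the path is a directory; the wrapper name `hdfs_dir_exists` is hypothetical.

```shell
# Hypothetical helper: succeeds only if the given HDFS path exists
# and is a directory.
hdfs_dir_exists() {
  hadoop fs -test -d "$1"
}

# Guarded so the snippet does nothing on a machine without the Hadoop client.
if command -v hadoop >/dev/null 2>&1; then
  if hdfs_dir_exists /user/root/mnist_model; then
    echo "model directory found"
  else
    echo "model directory missing; check the training job output" >&2
  fi
fi
```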
spark-submit --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/tf/mnist_dist.py \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/lib/native" \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HOME" \
--driver-library-path="/usr/local/cuda/lib64" \
${TFoS_HOME}/examples/mnist/tf/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images /mnist/tfr/test \
--mode inference \
--model mnist_model \
--output /user/root/predictions
You can now examine the prediction results:
hadoop fs -cat /user/root/predictions/* | less
Install additional software required by IPython Notebooks.
pip install ipython ipython[notebook]
Launch the IPython notebook on the master node.
pushd ${TFoS_HOME}/examples/mnist
export IPYTHON_OPTS="notebook --no-browser --ip=`hostname`"
IPYTHON=1 pyspark --master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/tf/mnist_dist.py \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/local/cuda/lib64:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/lib/native" \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.HADOOP_HDFS_HOME="$HADOOP_HOME" \
--driver-library-path="/usr/local/cuda/lib64"
When you are finished, destroy the cluster from the machine you launched it from.
${TFoS_HOME}/scripts/spark-ec2 \
--key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} \
--region=${EC2_REGION} --zone=${EC2_ZONE} \
destroy TFoSdemo