
GetStarted_EC2

leewyang edited this page Aug 15, 2018 · 17 revisions

Running TensorFlowOnSpark on EC2

This guide describes the basic steps for running TensorFlowOnSpark on AWS EC2. The example creates a small Spark Standalone cluster with one master and two workers. When running TensorFlowOnSpark, one Spark worker will serve as the TensorFlow worker/master and the other as a TensorFlow PS (parameter server) node.

Note: for simplicity, we will use default (or fairly loose) security settings. However, it is HIGHLY RECOMMENDED that you properly secure your AWS services, which is beyond the scope of this tutorial.

Create a new S3 bucket and IAM role

Spark Standalone and TensorFlow clusters both require a distributed file system. Luckily, both also provide native support for S3, which we will use for data and model storage.

  • In the S3 Management Console, create a new S3 bucket using the default settings.
  • In the IAM Console, create a new IAM role called flintrock for EC2 and attach the AmazonS3FullAccess policy.
  • In the EC2 Console, if you haven't already done so, create a "Key Pair" to allow access to your instances. Note the key pair name and save the PEM file.

Set up a Spark Standalone cluster on EC2

Use flintrock to launch a Spark Standalone cluster on EC2. Note that we assume an Amazon Linux AMI (per flintrock's recommendation).

export KEY_NAME=<your_ec2_key_pair_name>
export KEY_PEM=<your_ec2_key_pem>
export AMI=<Amazon_Linux_AMI_id>
export REGION=<your_AWS_region>
export ROLE=flintrock

# Note: this is a fairly basic setup.  For more advanced options, type `flintrock launch --help`.
flintrock launch test-cluster \
    --num-slaves 2 \
    --spark-version 2.3.1 \
    --ec2-region ${REGION} \
    --ec2-key-name ${KEY_NAME} \
    --ec2-identity-file ${KEY_PEM} \
    --ec2-instance-profile-name ${ROLE} \
    --ec2-ami ${AMI} \
    --ec2-user ec2-user
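
The launch command above depends on five environment variables, and a placeholder left unfilled is easy to miss. As a minimal sketch, a small preflight check can confirm they are all set before calling flintrock. The `require_vars` helper is hypothetical (it is not part of flintrock) and only checks that each variable is non-empty, not that its value is valid.

```shell
# Hypothetical helper (not part of flintrock): returns non-zero if any of the
# named variables is unset or empty, printing the missing names to stderr.
require_vars() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use is `require_vars KEY_NAME KEY_PEM AMI REGION ROLE && flintrock launch test-cluster ...`, so that the launch only starts when every variable has a value.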

You should see output like:

Warning: Downloading Spark from an Apache mirror. Apache mirrors are often slow and unreliable, and typically only serve the most recent releases. We strongly recommend you specify a custom download source. For more background on this issue, please see: https://github.com/nchammas/flintrock/issues/238
Launching 3 instances...
[34.219.7.173] SSH online.
[34.219.7.173] Configuring ephemeral storage...
[54.200.216.113] SSH online.
[54.200.216.113] Configuring ephemeral storage...
[34.219.7.173] Installing Java 1.8...
[54.200.216.113] Installing Java 1.8...
[34.220.126.198] SSH online.
[34.220.126.198] Configuring ephemeral storage...
[34.220.126.198] Installing Java 1.8...
[34.219.7.173] Installing Spark...
[54.200.216.113] Installing Spark...
[34.220.126.198] Installing Spark...
[34.219.7.173] Configuring Spark master...
Spark online.
launch finished in 0:02:25.
Cluster master: ec2-34-219-7-173.us-west-2.compute.amazonaws.com
Login with: flintrock login test-cluster

At this point, you should be able to browse to the Spark UI by appending port 8080 to your "Cluster master" hostname, i.e. http://<cluster_master_hostname>:8080.
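
As an illustration, the Spark UI address can be derived from the "Cluster master" line of the launch output. This is a minimal sketch: the variable names are hypothetical, and the sample line is copied from the output shown above, so substitute the line from your own launch.

```shell
# Sample line copied from the launch output above; substitute your own.
launch_line='Cluster master: ec2-34-219-7-173.us-west-2.compute.amazonaws.com'

# Strip the "Cluster master: " prefix to leave only the hostname.
master_host="${launch_line#Cluster master: }"

# Append port 8080, as described in this guide, to form the Spark UI URL.
spark_ui_url="http://${master_host}:8080"
echo "${spark_ui_url}"
```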

Install TensorFlow and TensorFlowOnSpark

Next, we will use flintrock to install packages onto all of the nodes in your cluster:

# Fix https://github.com/tensorflow/tensorflow/issues/16397#issuecomment-360694501
flintrock run-command test-cluster 'sudo cp /etc/pki/tls/certs/ca-bundle.crt /etc/ssl/certs/ca-certificates.crt'

# Install TensorFlow and TensorFlowOnSpark
flintrock run-command test-cluster 'sudo pip install tensorflowonspark --ignore-installed'

Train a model

# Log in to the Spark master node
flintrock login test-cluster

# Download the mnist_estimator example
curl -O https://raw.githubusercontent.com/yahoo/TensorFlowOnSpark/master/examples/mnist/estimator/mnist_estimator.py

# Run training
export BUCKET=<your_s3_bucket>

export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
export AWS_REGION=<your_AWS_region>

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.task.maxFailures=1 \
--conf spark.stage.maxConsecutiveAttempts=1 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.AWS_REGION=${AWS_REGION} \
mnist_estimator.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--model s3://${BUCKET}/mnist_model \
--data_dir=s3://${BUCKET}/MNIST-data \
--steps 1000

You should be able to view the TensorFlow logs in the "Executors" tab of the Spark UI. When done training, both the downloaded MNIST data and the trained model should be available in your S3 bucket.
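
The core settings in the spark-submit command above can be worked through by hand. The sketch below repeats the arithmetic with the values from this guide and assumes the intent is for Spark to run one task per TensorFlow node at the same time. `CONCURRENT_TASKS` is a name introduced here for illustration only.

```shell
# Values used in this guide.
SPARK_WORKER_INSTANCES=2
CORES_PER_WORKER=1

# spark.cores.max: total cores the application may use across the cluster.
TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))

# spark.task.cpus is set to CORES_PER_WORKER, so the number of tasks that can
# run concurrently is spark.cores.max / spark.task.cpus.
CONCURRENT_TASKS=$((TOTAL_CORES / CORES_PER_WORKER))

echo "total cores: ${TOTAL_CORES}, concurrent tasks: ${CONCURRENT_TASKS}"
```

With these values the concurrent task count equals the `--cluster_size` passed to mnist_estimator.py, so if you change `CORES_PER_WORKER` or the number of workers, the same relationship should still hold.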

Destroy the Spark Standalone cluster

Shut down the EC2 instances:

flintrock destroy test-cluster

Then delete your S3 bucket via the S3 Management Console.
