Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters with at Scale
This repository contains the tensorflow implementation for reinforcement learning based Long Running Application scheduling with Hierarchical Reinforcement Learning (HRL).
- Python 3.5 or above
- Tensorflow 1.12.0
- scipy
- numpy
- pandas
- matplotlib
- sklearn
- Matlab 2019b or above
git clone https://github.com/Metis-RL-based-container-sche/Metis.git
pip install -r requirement.txt
Our project includes three parts:
- Cluster: Implementation of our seven real-world LRAs that exhibit inter-container interferences.
- Experiments: Metis scheduling workflow based on our real-world LRA setting.
-
Launch a cluster of tens of nodes on Amazon EC2 consisting of at least 1 manager node, several worker nodes, and some client nodes. Record the public DNS address or ip address of each node and make sure the manager and client nodes are accessible through the SSH key
~/.ssh/id_rsa
.export MANAGER=ec2-xxx-xxx-xxx-100.us-west-2.compute.amazonaws.com export WORKER1=ec2-xxx-xxx-xxx-101.us-west-2.compute.amazonaws.com # ...... export CLIENT1=ec2-xxx-xxx-xxx-201.us-west-2.compute.amazonaws.com # ...... $ ssh ubuntu@$MANAGER (manager)$ git clone https://github.com/Metis-RL-based-container-sche/Metis.git
-
If Docker has not been installed, install Docker on each manager, worker, and client node:
(manager)$ sudo Cluster/scripts/install_docker.sh (worker1)$ sudo Cluster/scripts/install_docker.sh # ......
-
Coordinating manager and worker nodes through Docker Swarm to form a Swarm cluster.
$ ssh $MANAGER docker swarm init # Output To add a worker to this swarm, run the following command: docker swarm join --token SWMTKN-1-1bu27zw7lzh6pnu1l0981bu1m6nqq2pcgk25kovuh565319cah-8480smxu3kp7cj5nfkck2itax 192.168.99.100:2377
Then execute the aforementioned command on each worker.
$ ssh $WORKER1 docker swarm join --token SWMTKN-1-1bu27zw7lzh6pnu1l0981bu1m6nqq2pcgk25kovuh565319cah-8480smxu3kp7cj5nfkck2itax 192.168.99.100:2377 # Output This node joined a swarm as a worker.
-
Build docker images of different workloads on the manager node.
(manager)$ cd Cluster/workloads (manager)$ ./build-all.sh
-
Launch certain workloads on certain worker node from the manager node.
(manager)$ cd Cluster/scripts # Launching 1 container for each workload on WORKER1: # '0' indicates idle; '1' ~ '7' indicate different workloads. (manager)$ ./service-launching.sh $WORKER1 0 1 2 3 4 5 6 7 # e.g., launch 3 workload-1, 2 workload-2, and 1 workload-3 container: # ./service-launching.sh $WORKER2 0 0 1 1 1 2 2 3
-
Sending requests from client node to the worker node
$ ssh ubuntu@$CLIENT1 (client1)$ git clone https://github.com/Metis-RL-based-container-sche/Metis.git # export WORKER1=ec2-xxx-xxx-xxx-101.us-west-2.compute.amazonaws.com # Send single request for testing (client1)$ cd Cluster/scripts (client1)$ curl $WORKER1:8081 # Or pressure the application with locust or other tools, e.g., (client1)$ cd Cluster/scripts (client1)$ python3 parallel_locust.py $WORKER1 # Default log path lies in Cluster/scripts/log
-
(Optional) Collecting performance benchmark datasets through automatically deploying, profiling, terminating, and re-deploying.
# Here the "0-1-2-3-4-5-6-7" or "0-0-1-1-1-2-2-3" means combination # of different workloads (as described step 5). (manager)$ ./profiling-go.sh $WORKER1 "0-1-2-3-4-5-6-7 0-0-1-1-1-2-2-3 0-0-0-0-0-0-0-7 0-0-0-0-1-1-1-7 2-4-4-6-6-6-6-7"
-
Terminate the swarm cluster on the manager node.
(manager)$ docker swarm leave --force
-
First check the data collected by Real-World LRA cluster is stored in the folder:
$ cd Experiments/ $ ls ./simulator/datasets/
***_sample_collected.npz
-
Train sub-schedulers in a 27-node sub-cluster:
$ cd Experiments/ $ ./shell/TrainSubScheduler.sh
Output: the well-trained sub-scheduler models, as well as corresponding log files will be store in the folder:
$ ls ./checkpoint/
subScheduler_*/
-
High-level model training based on previously well-trained sub-schedulers.
(0) Check the sub-scheduler models are stored in the folder:
$ cd Experiments/ $ ls ./checkpoint/
subScheduler_*/
Check the container batches data is stored in the folder or create your own batches:
$ ls ./data
batch_set_200.npz batch_set_300.npz batch_set_400.npz batch_set_1000.npz batch_set_2000.npz batch_set_3000.npz
(1) High-level training in a medium-sized cluster of 81 nodes:
$ ./shell/RunHighLevelTrainingMedium.sh 200 $ ./shell/RunHighLevelTrainingMedium.sh 300 $ ./shell/RunHighLevelTrainingMedium.sh 400
Output: the training log files including the RPS, placement matrix, training time duration .etc will be store in the folder:
$ ls ./checkpoint/
81nodes_*_*/
(2) High-level training in a large cluster of 729 nodes:
$ ./shell/RunHighLevelTrainingLarge.sh 1000 $ ./shell/RunHighLevelTrainingLarge.sh 2000 $ ./shell/RunHighLevelTrainingLarge.sh 3000
Output: the training log files including the RPS, placement matrix, training time duration .etc will be store in the folder:
$ ls ./checkpoint/
729nodes_*_*/
Vanilla RL is built directly upon Policy Gradient without our Hierarchical designs.
-
High-level training in a medium-sized cluster of 81 nodes:
$ ./shell/RunVanillaRLMedium.sh 200
Output: the training log files including the RPS, placement matrix, training time duration, etc will be store in the folder:
$ ls ./checkpoint/
Vanilla_81_*_*/
DC Method does not use sub-schedulers. Our code below shows its behaviors in a medium-sized cluster of 81 nodes. Each cluster is hierarchically divided into three subsets.
-
High-level training in a medium-sized cluster of 81 nodes:
$ ./shell/RunDCMedium.sh 200
Output: the training log files including the RPS, placement matrix, training time duration, etc will be store in the folder:
$ ls ./checkpoint/
DC_81_*_*/
Medea is implemented using Matlab, due to its outstanding performance in solving the Integer Linear Programming (ILP) Problem.
-
Generate the performance-constraints used in Medea:
$ cd Experiments $ ./shell/GenerateInterference.sh $ ls
interference_applist.csv interference_rpslist.csv
-
Run Medea in the folder:
$ cd testbed/Medea $ Matlab Medea.m
Output: the scheduling decision log files including the allocation matrix, constraint violations, time duration .etc will be store in the folder.
Paragon is re-implemented in Python. For the sake of fair comparison, we feed it with the full interference matrix information as Medea.
-
Make sure
interference_applist.csv
has been generated in former Medea setup:$ ls interference_applist.csv
Otherwise, generate the performance-constraints used in Medea:
$ cd Experiments $ ./shell/GenerateInterference.sh $ ls interference_applist.csv
-
Run Paragon of Medium size or Large size:
$ cd Experiments/shell $ # Medium size $ ./RunParagonMedium.sh 200 $ $ # Large size $ ./RunParagonLarge.sh 2000
Output: the default output shows the average throughput for each testing group as well as the scheduling latency.
For detailed output including container placement and per-container throughput breakdown for each node, please add
-v
after each python script:$ cd Experiments $ python3 ParagonExp.py --batch_set_size 200 --batch_choice 0 --size medium --verbose
[1] Medea: Panagiotis Garefalakis, Konstantinos Karanasos, Peter Pietzuch, Arun Suresh, and Sriram Rao. 2018. Medea: scheduling of long running applications in shared production clusters. In Proceedings of the Thirteenth EuroSys Conference (EuroSys ’18). Association for Computing Machinery, New York, NY, USA, Article 4, 1–13. DOI:https://doi.org/10.1145/3190508.3190549
[2] Paragon: Christina Delimitrou and Christos Kozyrakis. 2013. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (ASPLOS ’13). Association for Computing Machinery, New York, NY, USA, 77–88. DOI:https://doi.org/10.1145/2451116.2451125
[3] Redis: an open source, in-memory data structure store. https://redis.io
[4] Model Server for Apache MXNet. https://github.com/awslabs/mxnet-model-server
[5] Image Super Resolution. https://github.com/idealo/image-super-resolution
[6] Locust: an open source load testing tool. https://locust.io
[7] Yahoo! Cloud Streaming Benchmark: Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing (SoCC ’10). Association for Computing Machinery, New York, NY, USA, 143–154. DOI:https://doi.org/10.1145/1807128.1807152