Evaluation of Batch Workload Characterization Techniques for Performance Modeling of Distributed Data Processing Systems

Note: This repository is a fork of the original repository for reasons of archiving.

This repository contains all materials and code related to my master's thesis, which focuses on workload similarity evaluation from big data processing platforms such as Apache Hadoop or Apache Spark. It uses Docker, Ansible and Terraform to ensure a consistent and reproducible development environment.

Key Components

Ansible Scripts: Script for the HiBench deployment, configuration and workload submits. These scripts can be found in the hibench folder.
Deployment: Contains Terraform scripts for setting up the HDInsight Azure computing cluster. These resources are located in the deployment folder.
HiBench Config: A benchmarking suite for big data applications, including configuration files and scripts to run various benchmarks. These configurations are available in the hibench/conf folder.
Systematic Literature Review Results & Layer Analysis: Layerdata, Jupyter notebooks and scripts for analyzing the layers used in performance models. These analyses are contained in the layer-analysis folder.
Similarity Metrics: Scripts and notebooks for calculating similarity metrics between workloads. These resources are found in the hibench/results folder.
Experiments Results: Scripts and data for analyzing benchmark results. The relevant files are located in the hibench/results folder.

Prerequisites

Ensure you have Docker Desktop 4.13.0 or later installed on your system.

Getting Started

Setting Up the Development Environment

Create and Open Docker Development Environment

Click on this link, to create the dev environment and open the container shell with you ide

Inside the IDE Terminal:

Authenticate with Azure using the command below:

    az login

Specify the Azure subscription to use by replacing [SUBSCRIPTION ID] with your actual subscription ID:

    az account set --subscription [SUBSCRIPTION ID]

Verify the currently selected Azure subscription:

    az account list --query "[?isDefault]"

Initialize Terraform with:

    make init

Deploying the Cluster

To deploy the cluster, execute the following command:

    make deploy

To destroy the cluster and clean up resources, use:

    make destroy

Makefile Commands

init: Initialize the Terraform configuration.
deploy: Deploy the infrastructure using Terraform.
destroy: Destroy the Terraform-managed infrastructure.
output: Retrieve the details of the HDInsight cluster from Terraform output.
destroy-force: Force delete the Azure resource group and clean up Terraform state files.
4: Resize the HDInsight cluster to 4 worker nodes.
8: Resize the HDInsight cluster to 8 worker nodes.
12: Resize the HDInsight cluster to 12 worker nodes.
submit: Run a MapReduce job in the cluster.
setup: Setup HiBench environment using Ansible.
ping: Ping all nodes in the HiBench inventory.
ssh: Establish an SSH connection to the endpoint.
ssh-wn0: Establish an SSH connection to the worker node 0.
upload: Upload data to the remote server via SCP.

Acknowledgments

Special thanks to my supervisor and all those who supported me throughout my Master's journey.

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
deployment		deployment
hibench		hibench
layer-analysis		layer-analysis
Makefile		Makefile
README.md		README.md
ansible.cfg		ansible.cfg
compose-dev.yaml		compose-dev.yaml
dev.Dockerfile		dev.Dockerfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evaluation of Batch Workload Characterization Techniques for Performance Modeling of Distributed Data Processing Systems

Key Components

Prerequisites

Getting Started

Setting Up the Development Environment

Inside the IDE Terminal:

Deploying the Cluster

Makefile Commands

Acknowledgments

About

Releases

Packages

Languages

dos-group/batch-workload-characteristics-analysis

Folders and files

Latest commit

History

Repository files navigation

Evaluation of Batch Workload Characterization Techniques for Performance Modeling of Distributed Data Processing Systems

Key Components

Prerequisites

Getting Started

Setting Up the Development Environment

Inside the IDE Terminal:

Deploying the Cluster

Makefile Commands

Acknowledgments

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages