In a nutshell, ProtoVM is a tiny Parallel Java Bytecode Interpreter (BC) implemented in OpenCL capable of executing on heterogeneous architectures (multi-core CPUs, GPUs and FPGAs).
ProtonVM is a proof-of-concept for running a subset of the Java bytecodes in C++ and OpenCL. The bytecodes defined correspond to an extension of the small-subset of Java bytecode explained by Terence Parr, from the University of San Francisco of how to build a simple Virtual Machine. This project extends this simple bytecode interpreter to study the feasibility of running, as efficiently as possible, parallel bytecode interpreters on heterogeneous computer architectures.
ProtonVM currently runs integer arithmetic only. A possible extension is to include fp32
and fp64
math arithmetic and support complex math operations. This way ProtonVM could take
advantage of the compute-power of modern GPUs, and DSPs available on FPGAs.
ProtonVM has been presented at MoreVMs 2020 and it has been executed on NVIDIA GPUs, Intel HD Graphics and Xilinx FPGAs.
#define IADD 1
#define ISUB 2
#define IMUL 3
#define ILT 4
#define IEQ 5
#define BR 6
#define BRT 7 // branch if true
#define BRF 8 // branch if false
#define ICONST 9
#define LOAD 10
#define GLOAD 11
#define STORE 12
#define GSTORE 13
#define PRINT 14
#define POP 15
#define HALT 16
#define CALL 17
#define RET 18
#define DUP 19 // duplicate the top of the stack
#define IDIV 20 // integer division
#define LSHIFT 21 // shift to the left (multiply by two)
#define RSHIFT 22 // shift to the left (multiply by two)
#define ICONST1 23 // load constant 1 into the stack
#define GLOAD_INDEXED 24 // top_stack <- global[top-stack]
#define GSTORE_INDEXED 25
#define THREAD_ID 26 // load thread-id on top of the stack
#define PARALLEL_GLOAD_INDEXED 27 // load data by accessing device's heap using the thread-id (multi-heap configuration)
#define PARALLEL_GSTORE_INDEXED 28 // store data by accessing device's heap using the thread-id (multi-heap configuration)
ProtonVM provides different variations of the BC interpreter for testing and experimentation:
VM
: this is the baseline BC interpreter implemented in C++. It runs sequentially on the CPU.OCLVM
: Single-thread OpenCL BC interpreter. It is prepared for running a single device thread on the target device. The stack, data and code sections are stored on device's global memory.OCLVMPrivate
: Single-thread OpenCL BC interpreter. It is prepared for running a single device thread on the target device. The stack is stored in private memory, and data and code sections are stored on device's global memory.OCLVMParallelLoop
: This version of the interpreter is prepared for running with a multi-thread bytecode interpreter exploiting data parallelization. Each thread has its own stack and it performs exactly the same computation across device's threads. The OpenCL kernel is programmed to do the work per thread. Furthermore, this version uses a multi-heap (3 in this case), that allows accessing data in parallel. There are two heaps dedicated to read-only and one for write-only. The stack is stored in private memory, and the heaps are accessed using local memory.
// Run Vector Multiplication with ProtonVM on GPUs and FPGAs
void runParallelOpenCLInterpreter() {
int size = 1024;
int groupSize = 16;
// List of instructions
vector<int> vectorMul = {
THREAD_ID, // Push OpenCL Thread-ID into the stack
DUP, // Duplicate the Thread-ID on top of the stack
PARALLEL_GLOAD_INDEXED, 0, // Access global memory (e.g., GPU's global memory) using the thread-id and heap index 0
// and push the result (value) on top of the stack
THREAD_ID, // Push OpenCL Thread-ID on top of the stack
PARALLEL_GLOAD_INDEXED, 1, // Access global memory (e.g., GPU's global memory) using the thread-id and heap index 1
// and push the result (value) on top of the stack
IMUL, // Multiply the two elements on top of the stack and push the result
PARALLEL_GSTORE_INDEXED, 2, // Store the value on top of the stack into global memory (heap 2).
HALT // Stop the program (stack must be empty)
};
OCLVMParallelLoop oclVM(vectorMul, 0); // Create a parallel BC Interpreter and VM
oclVM.setVMConfig(100, 1024); // Set sizes for the stack and data sections
oclVM.setHeapSizes(1024); // Set a heap of 1024 int values
oclVM.setPlatform(0); // Select the OpenCL platform 0 for execution
oclVM.initOpenCL("src/interpreterParallelLoop.cl", false); // Compile the OpenCL BC Interpreter. The second parameter is for FPGA execution.
oclVM.initHeap(); // Init values on the heap
oclVM.runInterpreter(1024, groupSize); // Execute the BC interpreter using 1024 threads and groupsize of 16 threads.
}
cmake
GCC >= 9.0
OpenCL >= 1.2
mkdir build
cd build
cmake ..
make -j 12
$ bin/main
This work has been presented at MoreVMs 2020.
Link to the paper: ACM DL.
Video presentation: YouTube
Recommended citation:
@inproceedings{10.1145/3397537.3397563,
author = {Fumero, Juan and Stratikopoulos, Athanasios and Kotselidis, Christos},
title = {Running Parallel Bytecode Interpreters on Heterogeneous Hardware},
year = {2020},
isbn = {9781450375078},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3397537.3397563},
doi = {10.1145/3397537.3397563},
booktitle = {Conference Companion of the 4th International Conference on Art, Science, and Engineering of Programming},
pages = {31–35},
numpages = {5},
keywords = {Heterogeneous Architectures, Interpreters, Bytecode, FPGAs, GPUs},
location = {Porto, Portugal},
series = { '20}
}
This project is developed at The University of Manchester, and it is fully open source under the APACHE 2 license.
The work was partially funded by the EU Horizon 2020 Elegant 957286 and E2Data 780245 projects.