%Basic Linear Algebra Subprograms (BLAS) defines a small set of vector and
%matrix operations. There exist a few highly-optimized BLAS implementations,
%such as MKL \cite{mkl} and ATLAS \cite{atlas}.
%Distributed libraries \cite{trilinos, petsc, elemental}
%build on BLAS and distribute computation with MPI.
%BLAS provides a limited set of matrix operations and requires
%users to manually parallelize the remaining matrix operations.
Recent works on out-of-core linear algebra \cite{Toledo99, Quintana-Orti12}
redesign algorithms to achieve efficient I/O access and reduce I/O
complexity. These works are orthogonal to ours, and their techniques can be
adopted in FlashR. However, optimizing I/O alone is insufficient: to achieve
state-of-the-art in-memory performance, it is essential to move data
efficiently throughout the memory hierarchy.

Many distributed frameworks have been developed for large-scale data analysis
and machine learning. MapReduce \cite{mapreduce} is used for parallelizing
machine learning algorithms \cite{Chu06}.
%However, MapReduce still requires
%low-level programming. As such, Pig Latin \cite{pig} and FlumeJava \cite{flumejava}
%are built on MapReduce to reduce programming complexity.
However, MapReduce is inefficient
for matrix operations because its I/O streaming primitives do not match matrix
data access patterns. Spark \cite{spark} provides more primitives for efficient
computation and is used for distributed machine learning (MLlib \cite{mllib}).
SystemML \cite{systemml, systemml2} develops an R-like scripting language for
machine learning on MapReduce and Spark, and deploys optimizations
such as data compression \cite{Elgohary16} and hybrid parallelization
\cite{Boehm14}. These optimizations are orthogonal to the ones in FlashR
and can be adopted.

Distributed machine learning frameworks have been developed to train machine
learning models on large datasets. For example, GraphLab \cite{graphlab}
formulates machine learning algorithms as graph computation; Petuum \cite{petuum}
is designed for machine learning algorithms with certain properties such as
error tolerance; TensorFlow \cite{tensorflow} trains deep neural networks
with stochastic gradient descent and its variants.

There is a large body of literature on deploying lazy evaluation and operation
fusion in a programming framework to improve performance. The APL literature
contains a few early attempts at deferred operations. For example, Guibas et al.
\cite{Guibas78} defer operations to stream data between operations and to
reorder operations; Ching et al. \cite{Ching12} compile APL code
to fuse operations for better parallelization. Riposte \cite{riposte} uses
\textit{tracing} to collect vector operations in R and applies JIT compilation
to vectorize and fuse them, speeding up operations on R vectors.
Delite \cite{delite} is a system designed to parallelize domain-specific languages
(DSLs), such as OptiML \cite{optiml} for machine learning, in a heterogeneous
computing environment on a single machine. The system
defers operation execution to exploit both data and task parallelism.
DESOLA \cite{desola} is a linear algebra library that defers matrix operations
and deploys runtime code generation to fuse operations and perform array contraction.
All of the works above rely on compilation to achieve optimizations such as
operation fusion and operation reordering.
However, it is difficult to compile a dynamic programming language such as APL or R:
static compilation is inefficient or imposes constraints on the language,
while runtime compilation incurs large overhead.
FlashR adopts and enhances these techniques with a focus on large-scale
data analysis. Unlike most of these works, FlashR applies lazy evaluation
and operation fusion at runtime without compilation and focuses on reducing
data movement in the memory hierarchy.
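To make the idea concrete, the following is a minimal sketch in R of lazy
evaluation with runtime operation fusion; it is not FlashR's actual API or
implementation, and the function names are purely illustrative. Element-wise
operations are recorded as closures rather than executed, and at materialization
each chunk of data flows through the entire deferred expression before the next
chunk is read, which keeps intermediate results in fast memory.
\begin{verbatim}
# Illustrative sketch only, not FlashR's API: element-wise operations
# are deferred as closures and fused when the result is materialized.
lazy_wrap <- function(x) list(n = length(x), fn = function(lo, hi) x[lo:hi])

lazy_map2 <- function(a, b, op) {
  # Record the operation; compute nothing yet.
  list(n = a$n, fn = function(lo, hi) op(a$fn(lo, hi), b$fn(lo, hi)))
}

materialize <- function(v, chunk = 1024) {
  # Fused evaluation: each chunk flows through the whole deferred
  # expression before the next chunk is read, so intermediates stay small.
  out <- numeric(v$n)
  lo <- 1
  while (lo <= v$n) {
    hi <- min(lo + chunk - 1, v$n)
    out[lo:hi] <- v$fn(lo, hi)
    lo <- hi + 1
  }
  out
}

a <- lazy_wrap(0:7)
b <- lazy_wrap(rep(1, 8))
z <- lazy_map2(lazy_map2(a, b, `+`), a, `*`)  # builds (a + b) * a lazily
print(materialize(z, chunk = 4))              # 0 2 6 12 20 30 42 56
\end{verbatim}
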
%Efforts to parallelize array programming include Revolution R \cite{rro} and
%Matlab's parallel computing toolbox, which offer multicore parallelism and
%explicit distributed parallelism using MPI and MapReduce. Other works focus
%on implicit parallelization. Presto \cite{presto} extends R to sparse matrix
%operations in distributed memory for graph analysis. Accelerator
%\cite{accelerator} compiles
%data-parallel operations on the fly to execute programs on a GPU.
Sequoia \cite{sequoia} is a programming language designed to facilitate
memory-hierarchy-aware parallel programming on large arrays.
It exposes the memory hierarchy to the programming model and performs static
analysis at compile time. In contrast, FlashR enhances an existing, popular
programming language, hides the memory
hierarchy from R users, and optimizes data movement at runtime.

TileDB \cite{Papadopoulos16} designs an efficient strategy to support
array modification:
it manages data modifications as ``fragments''. This strategy could be adopted by
FlashR for large modifications on matrices.