% What problem are you going to solve.
The explosion of data and the increasing complexity of data analysis
generate a growing demand for parallel, scalable statistical analysis
and machine learning tools that are simple and efficient.
These tools need to be programmable, interactive, and extensible,
allowing scientists to encode and deploy complex algorithms.
Successful examples include R, SciPy, and MATLAB. Efficiency dictates that
tools should leverage modern computer architectures, including scalable
parallelism, high-speed networking, and fast I/O from memory and storage.
The current approach for utilizing the full
capacity of modern parallel systems often uses a low-level programming
language such as C and parallelizes computation with MPI or OpenMP.
This approach is time-consuming and error-prone, and requires machine learning
researchers to develop expertise in parallel programming models.
% How have others addressed the problem?
While conventional wisdom addresses large-scale data analysis and machine
learning with clusters
\cite{mapreduce,spark,systemml,tensorflow,petuum,graphlab}, recent works
\cite{flashgraph,gridgraph,Matveev17,hotos} demonstrate that a single
multicore machine can process large datasets efficiently. Advances in
solid-state drives (SSDs) allow us to tackle data analysis on a single machine
at a larger scale and more economically than previously possible.
Previous SSD-based graph analysis frameworks
\cite{flashgraph, gridgraph, graphene}
have demonstrated efficiency comparable to state-of-the-art in-memory graph
analysis while scaling to arbitrarily large datasets. This work extends
these findings to matrix operations for machine learning.
% Why is it hard?
% What is the nature of your solution?
To provide a simple programming framework for efficient and scalable machine
learning, we present FlashR, an interactive R programming framework that
executes R code in parallel and out-of-core automatically. For
generality and simplicity, FlashR implements
a set of generalized operations (GenOps) and uses them to override many
R functions in the R \textit{base} package to perform parallel computation on
large matrices stored on SSDs. As such, FlashR parallelizes and scales
existing R code with little or no modification.
Our evaluation shows that FlashR solves billion-row, Internet-scale
problems on a single large machine, avoiding the complexity,
expense, and power consumption of distributed systems when they are
not strictly necessary \cite{hotos}.
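To give a flavor of this programming model, the sketch below writes one
gradient step of least-squares regression in plain R. Only the calls that load
the data as FlashR matrices are framework-specific, and their names
(\texttt{fm.load.matrix}, \texttt{fm.load.vector}) are illustrative
placeholders rather than the exact FlashR API.
\begin{verbatim}
library(FlashR)
# Hypothetical loaders (placeholder names): materialize an n x p feature
# matrix and a length-n label vector stored on SSDs.
X <- fm.load.matrix("features.mat")
y <- fm.load.vector("labels.vec")

# One gradient step of least-squares regression, written in ordinary R.
# FlashR's overridden base functions (%*%, t, arithmetic) execute this
# in parallel and out-of-core, streaming X from SSDs.
beta <- rep(0, ncol(X))
grad <- t(X) %*% (X %*% beta - y) / nrow(X)
beta <- beta - 0.1 * grad
\end{verbatim}
The same code runs unchanged on an ordinary in-memory R matrix, which is the
point: the parallel, out-of-core behavior comes from the overridden operators,
not from changes to the algorithm.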
% Why is it new/different/special?
To utilize the full capacity of a large parallel machine, we overcome
many technical challenges in moving data from SSDs to the CPU efficiently for
matrix computations. Notably, there exist large performance disparities between
the CPU and memory and between memory and SSDs, at least an order of magnitude
between adjacent layers. The ``memory gap'' \cite{Wilkes01} continues to grow,
with the difference between CPU and DRAM performance increasing exponentially.
There are also performance differences between local and remote memory in
non-uniform memory access (NUMA) architectures, which are prevalent
in modern multiprocessor machines.
%Even though the I/O
%performance of SSDs has advanced to outperform hard drives by a large factor,
%FDespite advances in SSD I/O performance,
%they remain an order of magnitude slower than RAM.
%Most matrix computation engines increase data movement,
%because they perform an operation on an entire input matrix prior to moving
%to the next operation.
% RB --
%On the other hand, many analysis tasks are
%data-intensive. Matrix
%formulation further increases data movement between CPU and SSDs because
%a matrix computation framework typically performs an operation
%on the entire input matrices before moving to the next operation.
% As such,
%the performance of the data analysis tasks is usually limited by memory
%bandwidth instead of computing power.
FlashR implements a new runtime system that executes a sequence of matrix
operations in a memory-hierarchy-aware fashion and optimizes
data placement and movement in the memory hierarchy transparently to users.
To achieve this, FlashR evaluates expressions lazily and fuses operations
aggressively in a single parallel execution job. FlashR
builds a directed acyclic graph (DAG) to represent a sequence of matrix
operations. To increase the ratio of computation to I/O, FlashR requires
only one pass over the input matrices to perform all operations in a DAG.
It assigns corresponding partitions of different matrices to the same NUMA node
to reduce remote memory access, performs two levels of matrix partitioning,
and reorders computation on matrix partitions to reduce data movement
in the memory hierarchy. By default, FlashR keeps only the output
matrices (leaf nodes) of the DAG in memory to maintain a small memory footprint.
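As a simplified illustration of this execution model, consider computing
per-column variances of a large matrix \texttt{X}. The sketch below assumes
that FlashR overrides the base functions used here (\texttt{colSums} and
element-wise arithmetic) and that materialization is triggered when a small
in-memory result is requested; under lazy evaluation, each assignment only
adds nodes to the DAG, and the two column sums and the element-wise square are
fused into a single parallel pass over \texttt{X}.
\begin{verbatim}
# Each expression is recorded lazily as nodes in the DAG; no I/O happens yet.
n  <- nrow(X)
s1 <- colSums(X)        # column sums
s2 <- colSums(X * X)    # column sums of squares
col.var <- (s2 - s1^2 / n) / (n - 1)

# Requesting a small concrete result evaluates the whole DAG: FlashR fuses
# the operations and streams X from SSDs once.
print(head(col.var))
\end{verbatim}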
% What are its key features?
We implement multiple machine learning algorithms in FlashR. We demonstrate that
with today's fast commodity storage technology, the out-of-core execution of
these algorithms in FlashR achieves performance comparable to their in-memory
execution, both on a large parallel machine and in the cloud. Furthermore,
FlashR outperforms the same algorithms in H$_2$O \cite{h2o} and Spark MLlib
\cite{spark} by a factor of $3$--$20$ on a large parallel machine with 48 CPU
cores. In the Amazon cloud, FlashR, using only one fourth of the resources,
still matches or even outperforms H$_2$O and Spark MLlib. We argue that FlashR
is a much more cost-effective solution for large-scale data analysis in the cloud.
FlashR effortlessly scales to datasets with billions
of data points and its out-of-core execution uses a negligible amount of memory
compared with the dataset size. In addition, FlashR executes the R functions
in the R MASS \cite{mass} package with little modification and outperforms
the execution of the same functions in Revolution R Open \cite{rro} by more
than an order of magnitude.
Given its high-level array-oriented programming interface and superior performance,
we argue that FlashR significantly lowers the barrier to writing
parallel and scalable implementations of machine learning algorithms. It also
offers new design possibilities for data analysis clusters, replacing memory
with larger and cheaper SSDs and processing bigger problems on fewer nodes.
FlashR is released as an open-source project at \href{http://flashx.io}{http://flashx.io}.
Our key contributions include:
\begin{itemize}
\item We develop an R programming framework that parallelizes and
scales native R code automatically.
\item We design multiple techniques in our framework to move data from
I/O storage to the CPU cache efficiently and demonstrate that with today's I/O
technology, our SSD-based solution delivers performance approaching that of
in-memory solutions for many machine learning algorithms.
\item We demonstrate that with sufficient system-level optimizations, R code
can easily scale to terabytes of data on a single machine and significantly
outperform optimized parallel machine learning libraries.
\end{itemize}
%\begin{itemize}
%\item identify a set of generalized matrix operations that cover all matrix
%operations in NumPy.
%\item design a set of optimizations for these generalized matrix operations
%to maximize CPU cache hits and reduce I/O access. The optimizations are tailored
%specifically for machine learning.
%\item using SSDs, EM performance matches IM performance and scale to
%Internet-scale datasets.
%\end{itemize}