-
Notifications
You must be signed in to change notification settings - Fork 53
FlashX Quick Start Guide
Although FlashX is designed to run on top of a high-speed SSD array, it can also run on magnetic disks and completely in memory. Intuitively, FlashX runs slower on magnetic disks than on SSDs. The guide below walks you through the steps to run FlashX on a magnetic disk and completely in memory for small graphs. In the following steps, we denote the top FlashX directory as FG_TOP
and we assume we are in the top directory throughout this guide. The steps have been tested in Ubuntu 12.04 and Ubuntu 14.04.
FlashX requires the following libraries: libpthread, libaio, libnuma, librt, boost, libhwloc, BLAS, zlib
.
Users need to install these libraries before compiling the code of FlashX.
In Ubuntu, we install all tools and libraries for compiling FlashX as follows:
sudo apt-get install -y git cmake g++
sudo apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev libnuma-dev libaio-dev libhwloc-dev libatlas-base-dev zlib1g-dev
Users can download FlashX's source code from here or clone it from the github repository.
In the top directory of FlashX, run mkdir build; cd build; cmake ../; make
to create a new directory, enter the new directory and compile all code of FlashX. If you have multiple processing cores you may prefer to run make -j #procs
By default, FlashEigen is disabled because it requires users to install Trilinos in advance. FlashEigen was designed to compute eigenvalues of a graph with hundreds of millions of vertices or even billions of vertices. For smaller graphs, we can use FlashR with ARPACK to compute eigenvalues of graphs of multiple million vertices. As such, users who don't need to compute eigenvalues of very large graphs can just to Step 3.
To compile FlashX with FlashEigen, users need to install the Anasazi eigensolver in Trilinos first. The Trilinos packages require a Fortran compiler and a BLAS and LAPACK package.
sudo apt-get install gfortran libatlas-dev liblapack-dev
Follow the instructions in Trilinos' website to compile it. Here summaries the steps:
Inside the top directory of the Trilinos source code, create the following script named do-configure
:
#!/bin/sh
EXTRA_ARGS=$@
SOURCE_BASE=..
cmake \
-D CMAKE_BUILD_TYPE:STRING=RELEASE \
-D BUILD_SHARED_LIBS:BOOL=ON \
-D Trilinos_ENABLE_TESTS=OFF \
$EXTRA_ARGS \
${SOURCE_BASE}
Make the script executable.
chmod u+x do-configure
Create a directory named build
in the top directory and compile the Trilinos source code in the build directory.
mkdir build
cd build/
../do-configure -DTrilinos_ENABLE_Anasazi=ON
make -j32
make install
In the top directory of FlashX, run mkdir build; cd build; cmake -D ENABLE_TRILINOS:BOOL=ON ..; make
If users don't have R installed, they can follow the instruction to install R in Ubuntu.
sudo apt-get install r-base-dev
FlashR requires igraph and Rcpp packages. Before installing FlashR, a user needs to install igraph, which installs the Rcpp package automatically. Users should install the latest R in order to install igraph. In the R environment, run
> install.packages("igraph")
In the top directory of FlashX, run ./install_FlashR.sh
to install FlashR. The script tests if FlashEigen is compiled in Step 2. If FlashEigen is compiled, the script will compile FlashR with FlashEigen automatically.
We prepare a docker file for users to install FlashX in docker easily, by simply following the two steps below. The following docker file builds all components in FlashX but doesn't build FlashX with Trilinos.
git clone https://github.com/zheng-da/FlashxStuff.git
docker build -t flashx FlashxStuff/
For default, the docker file compiles FlashX source code with 4 processes. If users' platforms have many more CPU cores, they can utilize more CPU cores to accelerate the complication of FlashX. For example, if a user's platform has 32 CPU cores, he can modify line 40 in FlashxStuff/Dockerfile
to RUN make -j32
before building the docker image.
To run the docker image, users can simply run docker run -t -i flashx /bin/bash
.
FlashGraph can run both in memory and on SSDs.
To run FlashGraph, users need to construct a graph in the FlashGraph format. We demonstrate the process with graphs from Stanford's SNAP project. We currently require the input graph to be in edge list text format.
Go to the top directory of FlashX and download a graph from the SNAP website.
wget http://snap.stanford.edu/data/wiki-Vote.txt.gz
FlashX provides a tool el2fg
to convert an edge list file in text format to the graph format supported by FlashGraph. More information on el2fg
is shown here.
gunzip wiki-Vote.txt.gz
build/flash-graph/utils/el2fg flash-graph/conf/run_test.txt wiki-Vote.txt wiki-Vote
This tool takes a configuration file, a graph file in text edge list and a user-specified graph name. The command above creates two files wiki-Vote.adj
and wiki-Vote.index
.
For the purpose of demonstration, FlashGraph provides flash-graph/conf/run_test.txt
. The details on the configuration file can be found here.
The FlashGraph format allows one attribute for an edge. To construct a graph with edge attributes, users need to provide option -t
to indicate the attribute type ("I" for 32-bytes integers, "L" for 64-bits integers, "F" for single-precision floating point, "D" for double-precision floating point). The command below shows an example of converting a graph with integer edge attributes.
build/flash-graph/utils/el2fg flash-graph/conf/run_test.txt graph_edgelist.txt graph_name -t I
Although FlashGraph is designed to work with SAFS, it can also run in the standalone mode. In this mode, FlashGraph loads graphs in memory and perform computation on the in-memory graphs. This is the simplest setup for FlashGraph. This mode detects the number of CPU cores and runs in parallel automatically.
FlashGraph provides a few built-in graph algorithm implementations. Users can run these implementations in flash-graph/test-algs/test_algs. test_algs takes three arguments: a FlashGraph config file, a graph file of the FlashGraph format and an index file for the graph file from the local Linux filesystem. For example, the command below runs WCC (weakly connected components) on the wiki graph.
build/flash-graph/test-algs/test_algs flash-graph/conf/run_test.txt wiki-Vote.adj wiki-Vote.index wcc
...
There are 1183 empty vertices
There are 24 components (exclude empty vertices), and largest comp has 7066 vertices
FlashGraph is designed to run with SAFS, so that it can run on top of an SSD array. When running FlashGraph with SAFS, we get the full power of FlashGraph. It requires us to configure SAFS properly.
It is fairly simple to configure SAFS for a small SSD array in an SMP machine. Users only need to mount the SSDs, create data directories on SSDs and inform SAFS of the paths to the data directories. It becomes more complex to configure SAFS for a large SSD array in a NUMA machine, where we need to take processor affinity into account to maximize the performance of the SSD array. SAFS also provides a script to automate the process. To achieve the maximal performance from an SSD array, we refer users to SAFS configurations for more details.
To run SAFS, users need to provide a data config file that indicates the paths of the data directories of SAFS. Users need to set up the SAFS parameter root_conf
to specify the data config file. For example, if a user has two SSDs, mounted in /mnt/ssd1
and /mnt/ssd2
, respectively. The data config file contains two lines:
0:/mnt/ssd1
0:/mnt/ssd2
The number before ":" indicates the NUMA node Id where the SSD is attached to. For an SMP machine, the node Id is always 0.
To run FlashGraph with SAFS, we need to specify root_conf
in the FlashGraph config file.
For the purpose of demonstration, FlashGraph provides run_test.txt
(FlashGraph config file) and data_files.txt
(data config file) in flash-graph/conf/
. data_files.txt
specifies the directory where SAFS runs. Replace FG_TOP
in the two config files with the location of the top directory of FlashGraph.
Users have to make sure the directories in the data config file have been created. For example, if users use the example data config file flash-graph/conf/data_files.txt
, they have to create flash-graph/data
first.
mkdir flash-graph/data
NOTE: users should never manually create files or directories in the data directories where SAFS runs. Instead, users should always use SAFS-util
to operate on SAFS.
We need to load a graph to SAFS before we run FlashGraph.
If a user has a graph in the FlashGraph format, we load a file from a Linux filesystem to SAFS with SAFS-util
. The following commands load the wiki-Vote graph to SAFS. When loading graph files to SAFS, the recommended naming convention is that for a graph named graph_name
, the graph file should be named with graph_name.adj
and the index file should be named with graph_name.index
. This naming convention is important when users run FlashGraph in R.
build/utils/SAFS-util flash-graph/conf/run_test.txt load wiki-Vote.adj wiki-Vote.adj
build/utils/SAFS-util flash-graph/conf/run_test.txt load wiki-Vote.index wiki-Vote.index
If a user has a graph in the text format, we can construct a graph and store it on SAFS directly with el2al
.
build/flash-graph/utils/el2fg -e flash-graph/conf/run_test.txt wiki-Vote.txt wiki-Vote
The command below demonstrates how to run PageRank on wiki-Vote.
build/flash-graph/test-algs/test_algs flash-graph/conf/run_test.txt wiki-Vote.adj wiki-Vote.index pagerank2
...
The sum of pagerank of all vertices: 3146.831787
v5254: 6.384571
v7553: 6.444084
v4191: 6.734981
v2237: 7.413904
v2470: 7.496633
v2398: 7.745999
v2625: 9.750073
v6634: 10.646951
v15: 10.923159
v4037: 13.684291