Home
COntent ANalysis SYStem (CoAnSys) is a framework for mining scientific publications using Apache Hadoop. It is primarily developed by employees of the Centre for Open Science (CeON) at the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw (UW).
To run the application, you need to have Cloudera's Hadoop installed. The installation steps are given below. IMPORTANT: Because of a bug in the Oozie version provided with Cloudera's Hadoop (the bug is already fixed in the version of Oozie available in the source code repository), you need to have Oracle Java JDK 1.6 installed. The packaged Oozie does not work with JDK 1.7.
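Since the JDK version matters here, a quick check before installing can save time. The following is a minimal sketch, assuming `java` is on the `PATH`; the helper name `is_java_16` is hypothetical and introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given `java -version` text reports a 1.6 JVM.
is_java_16() {
    printf '%s\n' "$1" | grep -q 'version "1\.6\.'
}

# `java -version` writes to stderr, hence the 2>&1 redirect.
if command -v java >/dev/null 2>&1; then
    if is_java_16 "$(java -version 2>&1)"; then
        echo "Java 1.6 detected"
    else
        echo "Warning: the Oozie shipped with CDH4 needs JDK 1.6" >&2
    fi
fi
```

If several JDKs are installed, the check only reflects whichever `java` comes first on the `PATH`.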
The instructions below show how to install Cloudera Hadoop CDH4 with MRv1, following the Cloudera CDH4 installation guide.
It is important to know that Hadoop can be run in one of three modes:
- standalone mode - runs all of the Hadoop processes in a single JVM which makes it easy to debug the application.
- pseudo-distributed mode - runs a full-fledged Hadoop on your local computer.
- distributed mode - runs the application on a cluster consisting of many nodes/hosts.
Below we show how to install Hadoop in pseudo-distributed mode while keeping the ability to switch between standalone and pseudo-distributed mode.
To install Hadoop in pseudo-distributed mode on 64-bit Ubuntu 12.04 (based on the Cloudera CDH4 pseudo-distributed mode installation guide):
-
create a new file
/etc/apt/sources.list.d/cloudera.list
with contents:
deb http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
-
add a repository key:
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
-
update the package lists:
sudo apt-get update
-
install the packages:
sudo apt-get install hadoop-0.20-conf-pseudo
-
next, follow the steps in Cloudera's CDH4 pseudo-distributed mode installation guide, starting from "Step 1: Format the NameNode".
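The repository setup steps above can be sketched as a small script. This is only an illustration: the function name `cloudera_sources` is hypothetical, the script merely previews the file contents, and the commands that need root are left as comments.

```shell
#!/bin/sh
# Hypothetical helper: print the two APT source lines for the CDH4 repository
# (Ubuntu 12.04 "precise", amd64), exactly as listed in the steps above.
cloudera_sources() {
    repo="http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh"
    printf 'deb %s precise-cdh4 contrib\n' "$repo"
    printf 'deb-src %s precise-cdh4 contrib\n' "$repo"
}

# Preview only; nothing is written to /etc.
cloudera_sources

# To apply the steps for real (requires root):
#   cloudera_sources | sudo tee /etc/apt/sources.list.d/cloudera.list
#   curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
#   sudo apt-get update
#   sudo apt-get install hadoop-0.20-conf-pseudo
```

Generating the file from a function keeps the two lines in sync and makes it easy to review them before anything is written to the system.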
Once Hadoop is installed, you can switch between the standalone and pseudo-distributed configurations (or any other configuration) using the update-alternatives command, e.g.:
-
update-alternatives --display hadoop-conf
lists the available configurations and shows which one is currently active
-
sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.empty
sets the active configuration to /etc/hadoop/conf.empty, which corresponds to Hadoop standalone mode.
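To see at a glance which mode is active, one option is to resolve the configuration symlink directly. The sketch below assumes that `/etc/hadoop/conf` is the symlink managed by the `hadoop-conf` alternative, which is not stated above and may differ on your system; both function names are hypothetical.

```shell
#!/bin/sh
# Print the directory that a configuration symlink finally resolves to.
# The default /etc/hadoop/conf is an assumption about the CDH4 layout.
active_hadoop_conf() {
    readlink -f "${1:-/etc/hadoop/conf}"
}

# Translate a configuration directory into a mode name.
# Only conf.empty is named in the text above; anything else is reported as-is.
hadoop_mode() {
    case "$1" in
        */conf.empty) echo "standalone" ;;
        *)            echo "non-standalone ($1)" ;;
    esac
}

# Report the current mode if the link exists on this machine.
if [ -e /etc/hadoop/conf ]; then
    hadoop_mode "$(active_hadoop_conf)"
fi
```

The output of `update-alternatives --display hadoop-conf` remains the authoritative source; this is only a shortcut.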
The following services provide web interfaces at these addresses:
-
NameNode - provides a web console for viewing HDFS, number of Data Nodes, and logs - http://localhost:50070/
- In the pseudo-distributed configuration, you should see one live DataNode named "localhost".
- JobTracker - allows viewing the completed, currently running, and failed jobs along with their logs - http://localhost:50030/
## Oozie
Apache Oozie Workflow Scheduler for Hadoop is a workflow and coordination service for managing Apache Hadoop jobs. The description below is based on Cloudera CDH4 Oozie installation guide.
-
Install Oozie with
sudo apt-get install oozie oozie-client
-
Create Oozie database schema
sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run
-
this should produce output similar to the following:
Validate DB Connection DONE
Check DB schema does not exist DONE
Check OOZIE_SYS table does not exist DONE
Create SQL schema DONE
Create OOZIE_SYS table DONE
Oozie DB has been created for Oozie version '3.1.3-cdh4.0.1'
The SQL commands have been written to: /tmp/ooziedb-8221670220279408806.sql
-
Install version 2.2 of ExtJS library:
- download the zipped library from http://extjs.com/deploy/ext-2.2.zip
copy the zip file to
/var/lib/oozie
and extract it there
-
Install Oozie ShareLib:
mkdir /tmp/ooziesharelib
cd /tmp/ooziesharelib
tar -zxf /usr/lib/oozie/oozie-sharelib.tar.gz
sudo -u hdfs hadoop fs -mkdir /user/oozie
sudo -u hdfs hadoop fs -chown oozie /user/oozie
sudo -u oozie hadoop fs -put share /user/oozie/share
-
Start the Oozie server:
sudo service oozie start
-
Check the status of the server:
-
From command-line:
oozie admin -oozie http://localhost:11000/oozie -status
which should print:
System mode: NORMAL
- Through a webpage - use a web browser to open a webpage at the following address: http://localhost:11000/oozie/
-
If you want to check whether Oozie executes its workflows correctly, you can run some of the example workflows provided with Oozie, as described in Cloudera Oozie example workflows. Note that, contrary to what is written there, the Oozie server is not available at http://localhost:8080/oozie but at http://localhost:11000/oozie.
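The command-line status check above can be turned into a scripted health check. This is a minimal sketch: the function name `oozie_is_normal` is hypothetical, and the `OOZIE_URL` default simply repeats the address used above.

```shell
#!/bin/sh
# Address of the Oozie server, as used in the status check above.
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"

# Hypothetical helper: succeeds only if the status text read from stdin
# contains the line "System mode: NORMAL".
oozie_is_normal() {
    grep -q '^System mode: NORMAL$'
}

# Usage (requires a running Oozie server and the oozie client):
#   oozie admin -oozie "$OOZIE_URL" -status | oozie_is_normal && echo "Oozie is up"
```

Reading the status from stdin keeps the helper independent of the server, so it can be tried out on saved output as well.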
Citation matching creates links from bibliography entries to the publications they reference. Such links indicate topical similarity between the linked texts, are used in assessing the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. The citation matching module in CoAnSys scales to large amounts of data by using appropriate indexing and the MapReduce paradigm.
-
Fedoryszak, M., Tkaczyk, D. and Bolikowski, Ł. Large Scale Citation Matching Using Apache Hadoop, Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, 2013, 8092, 362-365
-
Dendek, P. J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P. and Bolikowski, Ł. Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop, arXiv, 2013
-
Dendek, P. J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P. and Bolikowski, Ł. How to perform research in Hadoop environment not losing mental equilibrium - case study, arXiv, 2013