Apache DataFu

Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.

It consists of two libraries:

Apache DataFu Pig: a collection of user-defined functions for Apache Pig
Apache DataFu Hourglass: an incremental processing framework for Apache Hadoop in MapReduce

For more information please visit the website:

http://datafu.incubator.apache.org/

If you'd like to jump in and get started, check out the corresponding guides for each library:

Blog Posts

Presentations

Papers

Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)

Getting Help

Bugs and feature requests can be filed here. For other help please see the website.

Developers

Source release

If you are starting from a source release, then you'll want to verify the release is valid and bootstrap the build environment.

To verify that the archive has the correct MD5 checksum, the following two commands can be run. These should produce the same output.

openssl md5 < apache-datafu-sources-x.y.z-incubating.tgz
cat apache-datafu-sources-x.y.z-incubating.tgz.MD5
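
The comparison can also be automated rather than checked by eye. The following is a minimal sketch, not part of DataFu: the function names extract_md5 and verify_md5 are hypothetical, and it assumes the .MD5 file contains the digest as 32 hexadecimal characters somewhere in its text.

```shell
# Hypothetical helpers (not shipped with DataFu).

# Read text on stdin and print the first 32-hex-digit run, in lowercase.
extract_md5() {
  grep -oiE '[0-9a-f]{32}' | head -n 1 | tr 'A-F' 'a-f'
}

# Compare the digest of an archive with the digest stored in a checksum file.
# Assumption: the checksum file holds the digest as 32 hex characters.
verify_md5() {
  archive="$1"
  checksum_file="$2"
  actual=$(openssl md5 < "$archive" | extract_md5)
  expected=$(extract_md5 < "$checksum_file")
  if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
    echo "MD5 OK: $archive"
  else
    echo "MD5 MISMATCH: $archive" >&2
    return 1
  fi
}
```

After sourcing these functions, a call such as verify_md5 apache-datafu-sources-x.y.z-incubating.tgz apache-datafu-sources-x.y.z-incubating.tgz.MD5 returns a non-zero status when the digests differ, which makes it usable in scripts.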

To verify the archive against its signature, you can run:

gpg2 --verify apache-datafu-sources-x.y.z-incubating.tgz.asc

The command above will assume you are verifying apache-datafu-sources-x.y.z-incubating.tgz and produce "Good signature" if the archive is valid.
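
For scripted use, the exit status of gpg is a more reliable signal than reading its output. The sketch below is an illustration only: verify_signature and the GPG variable are hypothetical names, and it assumes the signing key is already in your keyring (if it is not, it would first have to be imported, for example from the project's KEYS file).

```shell
# Hypothetical helper (not shipped with DataFu).
# GPG may be overridden, for example to "gpg"; it defaults to gpg2.
verify_signature() {
  signature="$1"   # e.g. apache-datafu-sources-x.y.z-incubating.tgz.asc
  # gpg exits with a non-zero status when the signature cannot be verified.
  if "${GPG:-gpg2}" --verify "$signature"; then
    echo "signature OK: $signature"
  else
    echo "signature FAILED: $signature" >&2
    return 1
  fi
}
```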

To build DataFu from a source release, you first need to download a Gradle wrapper script. This bootstrapping process requires Gradle to be installed on the source machine. Gradle is available through most package managers or directly from its website. Once you have installed Gradle and ensured that gradle is available in your path, you can bootstrap the wrapper with:

gradle -b bootstrap.gradle

After the bootstrap script has completed, you should find a gradlew script in the root of the project. The regular gradlew instructions below should then be available.
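
The bootstrap step only needs to run when the wrapper is missing. A minimal sketch of that check, meant to be run from the project root, is shown here; ensure_wrapper and the GRADLE variable are hypothetical names and not part of the DataFu build.

```shell
# Hypothetical helper (not shipped with DataFu).
# GRADLE may be overridden; it defaults to the gradle found on the PATH.
ensure_wrapper() {
  # Nothing to do when the wrapper script already exists.
  if [ -f ./gradlew ]; then
    echo "gradlew already present"
    return 0
  fi
  # Bootstrapping requires an installed Gradle.
  if ! command -v "${GRADLE:-gradle}" >/dev/null 2>&1; then
    echo "gradle not found on PATH; install it first" >&2
    return 1
  fi
  "${GRADLE:-gradle}" -b bootstrap.gradle
}
```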

When building from a source release, the version for all generated artifacts will be of the form x.y.z. If you were to clone the git repo and build, you would find -SNAPSHOT appended to the version. This distinguishes official releases from builds generated from the code repository for testing purposes.

Building the Code

To build DataFu from a git checkout or a bootstrapped source release, run:

./gradlew clean assemble

The datafu-pig JAR can be found under datafu-pig/build/libs. The artifact name will be of the form datafu-pig-incubating-x.y.z.jar if this is a source release and datafu-pig-incubating-x.y.z-SNAPSHOT.jar if this is being built from the code repository.

The datafu-hourglass JAR can be found under datafu-hourglass/build/libs.
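
The naming convention makes it possible to tell release and snapshot builds apart from the file name alone. As a small illustration, assuming only the -SNAPSHOT suffix rule described above, a hypothetical helper jar_kind could classify an artifact like this:

```shell
# Hypothetical helper (not shipped with DataFu): classify a JAR by its file name.
jar_kind() {
  case "$1" in
    *-SNAPSHOT.jar) echo "snapshot" ;;  # built from the code repository
    *.jar)          echo "release" ;;   # built from a source release
    *)              echo "unknown" ;;   # not a JAR file name
  esac
}
```

For example, jar_kind datafu-pig-incubating-x.y.z-SNAPSHOT.jar prints snapshot, while jar_kind datafu-pig-incubating-x.y.z.jar prints release.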

Generating Eclipse Files

This command generates the Eclipse project and classpath files:

./gradlew eclipse

To clean up the Eclipse files:

./gradlew cleanEclipse

Running the Tests

To run all the tests:

./gradlew test

To run only the DataFu Pig tests:

./gradlew :datafu-pig:test

To run only the DataFu Hourglass tests:

./gradlew :datafu-hourglass:test

To run tests for a single class, use the test.single property. For example, to run only the QuantileTests:

./gradlew :datafu-pig:test -Dtest.single=QuantileTests
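
The same pattern can be parameterized by module and test class. The sketch below only prints the command so it can be inspected before running; single_test_command is a hypothetical name, and it assumes the test.single property behaves the same way for each module.

```shell
# Hypothetical helper (not shipped with DataFu): print the Gradle command
# that runs a single test class in a single module.
single_test_command() {
  module="$1"      # e.g. datafu-pig
  test_class="$2"  # e.g. QuantileTests
  printf './gradlew :%s:test -Dtest.single=%s\n' "$module" "$test_class"
}
```

Calling single_test_command datafu-pig QuantileTests prints the command shown above.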

The tests can also be run from within Eclipse. You'll need to install the TestNG plugin for Eclipse. See: http://testng.org/doc/download.html.

Potential issues and workarounds:

You may run out of heap when executing tests in Eclipse. To fix this, adjust your heap settings for the TestNG plugin. Go to Eclipse->Preferences. Select TestNG->Run/Debug. Add "-Xmx1G" to the JVM args.
You may get a "broken pipe" error when running tests. If so, right-click on the project, open the TestNG settings, and uncheck "Use project TestNG jar".
