
Introduction

Oozie Maven Plugin (OMP) was created to simplify the creation of packages (tar.gz files) containing an Apache Oozie workflow definition together with all other files needed to run the workflow (configuration files, libraries, etc.). In addition, the plugin enables workflow reusability: generated packages can be uploaded to a Maven repository, added as dependencies to other workflows, and reused as subworkflows. The plugin defines a new type of Maven artifact called oozie, but it uses the standard build lifecycle.

How to add the plugin to your project

The sources of OMP are available at https://github.com/CeON/oozie-maven-plugin.

The binaries are available in the ICM Maven repository. To use the plugin, you need to add the following sections to your pom.xml:

    <build>
        <plugins>
            <plugin>
                <groupId>pl.edu.icm.maven</groupId>
                <artifactId>oozie-maven-plugin</artifactId>
                <version>current_version_number</version>
                <extensions>true</extensions>
            </plugin>
        </plugins>
    </build>

and

    <pluginRepositories>
        <pluginRepository>
            <id>icm</id>
            <name>ICM project repository</name>
            <url>http://maven.icm.edu.pl/artifactory/repo</url>
        </pluginRepository>
    </pluginRepositories>

Minimal project

A minimal project that uses OMP needs to contain the following files:

  • pom.xml
    <project>
        <modelVersion>4.0.0</modelVersion>

        <groupId>my.project.groupId</groupId>
        <artifactId>my-project-artifactId</artifactId>
        <version>VERSION_NUMBER</version>
        <packaging>oozie</packaging>

        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>

        <build>
            <plugins>
                <plugin>
                    <groupId>pl.edu.icm.maven</groupId>
                    <artifactId>oozie-maven-plugin</artifactId>
                    <version>1.1</version>
                    <extensions>true</extensions>
                </plugin>
            </plugins>
        </build>
        <dependencies>
            <dependency>
                <groupId>pl.edu.icm.oozie</groupId>
                <artifactId>oozie-runner</artifactId>
                <version>1.2-SNAPSHOT</version>
                <scope>test</scope>
            </dependency>
        </dependencies>
        <repositories>
            <repository>
                <id>icm</id>
                <name>ICM project repository</name>
                <url>http://maven.icm.edu.pl/artifactory/repo</url>
            </repository>
        </repositories>
        <pluginRepositories>
            <pluginRepository>
                <id>icm</id>
                <name>ICM project repository</name>
                <url>http://maven.icm.edu.pl/artifactory/repo</url>
            </pluginRepository>
        </pluginRepositories>
    </project>
  • src/main/oozie/workflow.xml
You can generate these files with the Oozie Maven Archetype:
 mvn archetype:generate -DarchetypeArtifactId=oozie-maven-archetype \
   -DarchetypeGroupId=pl.edu.icm.maven.archetypes -DarchetypeVersion=1.0-SNAPSHOT \
   -DinteractiveMode=false -DgroupId=my.project.groupId -DartifactId=my-project-artifactId \
   -Dversion=VERSION_NUMBER -DarchetypeRepository=http://maven.icm.edu.pl/artifactory/repo

Build

You can build the project by calling:

 mvn install

This call creates a package that can be uploaded to a Maven repository and used in other projects. The package does not contain any dependencies (subworkflows, libraries); those should also be stored in the Maven repository. The file created in this procedure is named artifactId-version-oozie-wf.tar.gz.
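Another workflow project could then declare the installed package as a dependency along these lines (a sketch: the coordinates are those of the minimal project above, and the <type>oozie</type> element is our assumption, based on the oozie artifact type the plugin defines):

    <dependency>
        <groupId>my.project.groupId</groupId>
        <artifactId>my-project-artifactId</artifactId>
        <version>VERSION_NUMBER</version>
        <type>oozie</type>
    </dependency>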

If you want to build a package intended to be run on an Oozie server, you need to call

 mvn install -DjobPackage

The file created (artifactId-version-oozie-job.tar.gz) contains everything that is necessary to run a given workflow.

Support for Apache Pig

Oozie Maven Plugin supports scripts written in Pig Latin.

Modification of a standard JAR file

OMP allows you to use Pig scripts from dependent modules. Such a module (containing Java classes, such as UDFs, used by Pig scripts) should be added to your project as a direct dependency. Proper resource management in the pom.xml file is necessary to ensure that a given dependent module contains the Pig scripts. For example, the following inset in pom.xml satisfies that requirement:

        <build>
                <resources>
                        <resource>
                                <directory>src/main/pig</directory>
                                <filtering>false</filtering>
                                <includes>
                                        <include>**/*.pig</include>
                                </includes>
                                <excludes>
                                        <exclude>**/AUXIL*.pig</exclude>
                                </excludes>
                                <targetPath>${project.build.directory}/classes/pig</targetPath>
                        </resource>
                </resources>
        </build>

Once the above inset is added to pom.xml, mvn install will add to the generated JAR a pig directory with the *.pig files copied from src/main/pig.

Example 1: the file src/main/pig/lorem/ipsum/dolor/sit.pig will appear in the JAR file as pig/lorem/ipsum/dolor/sit.pig.
Example 2: the file src/main/pig/lorem/ipsum/dolor/AUXIL_sit.pig will not be added to the JAR file, because it matches the exclusion pattern.

Working with Pig scripts

The generation of workflows that utilize Pig scripts is described in the Oozie documentation. In short: Pig scripts are indicated in <script></script> tags, while imported scripts (i.e. macros) should be included in <file></file> tags. This is NOT the way OMP works.
OMP allows you to work with Pig scripts in two ways:

  1. EASY
  2. COMPLEX

EASY way

  • In this strategy, the POM file needs to contain the resource description placing all used Pig scripts in the pig folder (see section "Modification of a standard JAR file").
  • This strategy is very easy and is the highly recommended way of working with OMP.
  • OMP automatically manages macro files (see the sketch after this list):
    • in the JAR file, macro files have to be placed in a path other than /pig itself, e.g. /pig/macros
    • it is recommended to put regular scripts in src/main/pig, while macros should go into src/main/pig/macros
    • you CANNOT use <file></file> tags to store macro files (doing so can cause errors).
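With this setup, a pig action in workflow.xml only names the script; OMP ships the macros itself. A minimal sketch following the standard Oozie pig action syntax (the action name, transitions and script path are placeholder assumptions):

    <action name="my-pig-action">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>pig/lorem/ipsum/dolor/sit.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

Note that no <file> tags appear, in line with the last rule above.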

COMPLEX way

  • If you have already implemented a module with a large number of scripts, then adjusting the code to the EASY way of working with Pig scripts may be tedious. In that case, one can use the COMPLEX strategy, which has two disadvantages:
    • the pom.xml file may be longer
    • additional descriptor files are necessary.
To use the COMPLEX strategy you need to:
  • (this step can be omitted) place executable scripts in pig-scripts and macros in pig-macros (see the example pom.xml)
  • place information about the location of a descriptor in the workflow's pom.xml (only the first descriptor will be used; see the example pom.xml)
  • add the descriptor itself (its format is defined by an XSD schema)
In a descriptor file, inside the <build><resources> tag, one needs to specify whether a given script is a main script or a macro. Rules for the inclusion/exclusion of scripts are described in the tags <main-project-pig><scripts>, <main-project-pig><macros> and <deps-project-pig><all-scripts>, each of which has the same fields to fill (a hypothetical descriptor assembled from these fields is sketched after this list):
  • srcProject - (used only for dependencies) a REGEX value describing the dependency JAR files from which scripts should be taken. By default, all projects are accepted.
  • root - the name of a folder in a JAR from which scripts should be taken
  • preserve - whether a script should be copied to the final destination (e.g. /lib/) with or without the root folder. By default, this option takes the value false:
    • IF preserve=true, root=macros-pig, script_localization=macros-pig/my.pig, target=/lib THEN final_path=/lib/macros-pig/my.pig
    • IF preserve=false, root=macros-pig, script_localization=macros-pig/my.pig, target=/lib THEN final_path=/lib/my.pig
    • IF root=macros-pig, script_localization=macros-pig/my.pig, target=/lib THEN final_path=/lib/my.pig
  • target - the path under the current build directory to which included scripts will be sent. The current build directory can be switched to the top directory by using the mainDirAsDst parameter. By default, target is the empty string.
  • mainDirAsDst - (used only for dependencies) an explicit request to calculate the target value against the top project, used for putting macro scripts from a subworkflow into the main workflow dir (e.g. to put subworkflow scripts in the main workflow lib directory). By default, mainDirAsDst is false.
  • includes/include - a REGEX pattern indicating scripts to be included
  • excludes/exclude - a REGEX pattern indicating scripts to be excluded. Exclusion patterns take precedence over inclusion patterns, i.e. if a file matches both an inclusion and an exclusion pattern, the file will be excluded.
Note that one can specify:
  • multiple inclusion/exclusion patterns in a scripts node
  • multiple scripts nodes in both the main-project-pig and deps-project-pig nodes
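For illustration, here is a hypothetical descriptor assembled from the fields described above (the element nesting and all pattern values are assumptions on our part; only the tag and field names come from the description):

    <main-project-pig>
        <scripts>
            <root>pig-scripts</root>
            <preserve>false</preserve>
            <includes>
                <include>.*\.pig</include>
            </includes>
            <excludes>
                <exclude>AUXIL.*\.pig</exclude>
            </excludes>
        </scripts>
        <macros>
            <root>pig-macros</root>
            <target>lib</target>
        </macros>
    </main-project-pig>
    <deps-project-pig>
        <all-scripts>
            <srcProject>my-subworkflow.*</srcProject>
            <root>pig-macros</root>
            <target>lib</target>
            <mainDirAsDst>true</mainDirAsDst>
        </all-scripts>
    </deps-project-pig>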

Integration tests

Oozie Maven Plugin defines the following integration test phases: pre-integration-test, integration-test and post-integration-test. Each integration test sends the following data to an Oozie server: the workflow definition, the required libraries and the test input data.

Configuration

Configuration files for integration tests are in the src/test/resources/configIT directory. The configuration is divided into two parts, described below.

Environment configuration

Environment configuration is stored in src/test/resources/configIT/env/IT-env-<PROFILE>.properties files, where <PROFILE> denotes the name of a profile. You can create several profiles and choose the current one with the -DIT.env=profile_name option. The default profile name is local, so, unless you indicate otherwise, OMP will use the src/test/resources/configIT/env/IT-env-local.properties file.

The environment configuration file should contain the following variables:

  • oozieServiceURI -- the address of the Oozie server, e.g. http://localhost:11000/oozie/
  • nameNode -- the address of the HDFS namenode, e.g. localhost:8020 or hdfs://localhost:8020
  • jobTracker -- the address of the Job Tracker, e.g. localhost:8021
  • queueName -- the name of a queue, usually "default"
  • hdfsUserName -- the name of a user in HDFS
  • hdfsWorkingDirURI -- the address of a working directory in HDFS, where you store the workflow, input and output data. During tests, everything should happen in that directory. This directory should not exist before a test is executed; it is created at the beginning of the test and removed when the test is finished. The hdfsWorkingDirURI variable should be of the form hdfs://server:port/directory/ (or webhdfs://...). Pay attention to doubled "/" characters, which can cause problems when they appear right after server:port.
  • wfDir -- a directory in which the workflow's definition will be placed; it will be created as a subdirectory of hdfsWorkingDirURI
This file can also contain auxiliary variables. In addition, variables can reference other variables (the order of variables is not important).
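For example, a minimal IT-env-local.properties might look as follows (a sketch: all hosts, ports, user and directory names are placeholder assumptions; note that hdfsWorkingDirURI references other variables, which is allowed):

oozieServiceURI=http://localhost:11000/oozie/
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
hdfsUserName=hdfsuser
hdfsWorkingDirURI=${nameNode}/user/${hdfsUserName}/it-workdir
wfDir=workflow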

Workflow configuration

Workflow configuration is stored in src/test/resources/configIT/wf/IT-wf.properties. Variables in that file can reference other variables defined there or in the environment properties file (see above).

The workflow configuration file should contain the following variables:

  • localDirInputData -- a local directory in your project that contains input data for tests (can contain subdirectories), e.g. src/test/resources/inputData
  • hdfsDirInputData -- a directory in HDFS to which the test data from the local directory will be copied; it should be a subdirectory of hdfsWorkingDirURI.
  • hdfsDirOutputData -- a directory in HDFS where the output data will be stored. When the workflow is finished, the content of this directory will be copied to a local temporary directory. hdfsDirOutputData should be a subdirectory of hdfsWorkingDirURI.
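A corresponding sketch of IT-wf.properties (the HDFS subdirectory names, built here on top of hdfsWorkingDirURI, are placeholder assumptions):

localDirInputData=src/test/resources/inputData
hdfsDirInputData=${hdfsWorkingDirURI}/inputData
hdfsDirOutputData=${hdfsWorkingDirURI}/outputData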
In addition, the workflow configuration file should contain all other variables that are required by a given workflow. For example, if the workflow requires the input data to be in a directory specified in a variable INPUT_DATA, then you need to add the following line to the IT-wf.properties file:

INPUT_DATA=${hdfsDirInputData}

In the workflow's definition file (workflow.xml) you can use all variables defined either in the environment or the workflow configuration files.
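For instance, a workflow.xml fragment could pass such a variable to an action (a hypothetical fragment; the property name is only an illustration):

        <property>
            <name>inputDir</name>
            <value>${INPUT_DATA}</value>
        </property>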

Test

Test class

Test files should be placed in the src/test/java directory. The name of each test file should end with IT, e.g. TestIT.java. Each test method should contain the following lines:

        OozieRunner or = new OozieRunner();
        File workflowOutputData = or.run();

The variable workflowOutputData points to a temporary directory to which the workflow's output data were copied. Each method should check the correctness of the output data. For example:

        assertTrue(workflowOutputData.exists());
        assertTrue(workflowOutputData.isDirectory());
        assertTrue(workflowOutputData.listFiles().length > 0);

Frequently, the names of the output files are of the form part-* (when they are the results of MapReduce jobs). In such cases, you may use the following code:

        for (File f : FileUtils.listFiles(workflowOutputData, null, true)) {
            if (f.isFile() && f.getName().startsWith("part-")) {
                // here you check the correctness of the data in the file f
            }
        }

The FileUtils class in the above example comes from the org.apache.commons.io package (Apache Commons IO).
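Putting the pieces together, a complete test class might look as follows. This is a sketch under assumptions: the JUnit 4 style and the package of OozieRunner are guesses on our part (adjust the import to match the pl.edu.icm.oozie:oozie-runner artifact), and the class name is only an example:

    import static org.junit.Assert.assertTrue;

    import java.io.File;

    import org.apache.commons.io.FileUtils;
    import org.junit.Test;
    // ASSUMPTION: the exact package of OozieRunner may differ in your
    // version of oozie-runner; adjust this import accordingly.
    import pl.edu.icm.oozie.runner.OozieRunner;

    public class MyWorkflowIT {

        @Test
        public void workflowShouldProduceOutput() throws Exception {
            // Runs the workflow on the Oozie server and downloads its output
            // into a local temporary directory.
            OozieRunner or = new OozieRunner();
            File workflowOutputData = or.run();

            // Basic sanity checks on the output directory.
            assertTrue(workflowOutputData.exists());
            assertTrue(workflowOutputData.isDirectory());
            assertTrue(workflowOutputData.listFiles().length > 0);

            // Inspect the MapReduce part-* files.
            for (File f : FileUtils.listFiles(workflowOutputData, null, true)) {
                if (f.isFile() && f.getName().startsWith("part-")) {
                    // here you check the correctness of the data in the file f
                }
            }
        }
    }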

OozieRunner

The OozieRunner class helps to run a test workflow and gather the output data. To use it, you have to add the following dependency to your pom.xml:

        <dependency>
            <groupId>pl.edu.icm.oozie</groupId>
            <artifactId>oozie-runner</artifactId>
            <version>1.2-SNAPSHOT</version>
            <scope>test</scope>
        </dependency>

In addition, you need to add the ICM repository to your pom.xml:

    <repositories>
        <repository>
            <id>icm</id>
            <name>ICM project repository</name>
            <url>http://maven.icm.edu.pl/artifactory/repo</url>
        </repository>
    </repositories>

Additional test options

  • -DskipCleanIT -- the working HDFS directory will not be removed when a test is finished
  • -DforceCleanOldData -- if the working HDFS directory already exists, this option forces its removal prior to the beginning of a new test.
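For example, to run the integration tests against a profile named cluster (a hypothetical profile name) while forcing removal of a leftover working directory, you could call:

 mvn install -DIT.env=cluster -DforceCleanOldData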

Skipping integration tests

Integration tests will be skipped if you don't have the configuration files (IT-env-*.properties and IT-wf.properties) in your project, or if the project was built with the -DskipITs or -DskipTests option.