Skip to content
/ Ivory Public
forked from sriksun/Ivory

Data Management + Feed Processing Platform over Hadoop

License

Notifications You must be signed in to change notification settings

cryptoe/Ivory

 
 

Repository files navigation

Ivory Overview

Ivory is a feed processing and feed management system aimed at making it
easier for end consumers to onboard their feed processing and feed
management on hadoop clusters.

Why?

* Dependencies across various data processing pipelines are not easy to
  establish. Gaps here typically leads to either incorrect/partial
  processing or expensive reprocessing. Repeated duplicate definition of
  a single feed multiple times can lead to inconsistencies / issues.

* Input data may not arrive always on time and it is required to kick off
  the processing without waiting for all data to arrive and accommodate
  late data separately

* Feed management services such as feed retention, replications across
  clusters, archival etc are tasks that are burdensome on individual
  pipeline owners and better offered as a service for all customers.

* It should be easy to onboard new workflows/pipelines

* Smoother integration with metastore/catalog

* Provide notification to end customer based on availability of feed
  groups (logical group of related feeds, which are likely to be used
  together)

Usage

a. Setup cluster definition
   $IVORY_HOME/bin/ivory entity -submit -type cluster -file /cluster/definition.xml -url http://ivory-server:ivory-port

b. Setup feed definition
   $IVORY_HOME/bin/ivory entity -submit -type feed -file /feed1/definition.xml -url http://ivory-server:ivory-port
   $IVORY_HOME/bin/ivory entity -submit -type feed -file /feed2/definition.xml -url http://ivory-server:ivory-port

c. Setup process definition
   $IVORY_HOME/bin/ivory entity -submit -type process -file /process/definition.xml -url http://ivory-server:ivory-port

d. Once submitted, entity definition, status and dependency can be queried.
   $IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -name <<name>> [-definition|-status|-dependency] -url http://ivory-server:ivory-port

   or entities for a particular type can be listed through
   $IVORY_HOME/bin/ivory entity -type [cluster|feed|process] -list

e. Schedule process
   $IVORY_HOME/bin/ivory entity  -type process -name process -schedule -url http://ivory-server:ivory-port

f. Once scheduled entities can be suspended, resumed or deleted (post submit)
   $IVORY_HOME/bin/ivory entity  -type [cluster|feed|process] -name <<name>> [-suspend|-delete|-resume] -url http://ivory-server:ivory-port

g. Once scheduled process instances can be managed through irovy CLI
   $IVORY_HOME/bin/ivory instance -processName <<name>> [-kill|-suspend|-resume|-re-run] -start "yyyy-MM-dd'T'HH:mm'Z'" -url http://ivory-server:ivory-port

Example configurations

Cluster:

<?xml version="1.0"?>
<!--
    Production cluster configuration
  -->

<cluster colo="ua2" description="" name="staging-red" xmlns="uri:ivory:cluster:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <interfaces>
        <interface type="readonly" endpoint="hftp://gsgw1001.red.ua2.inmobi.com:50070"
                   version="0.20.2-cdh3u0" />

        <interface type="write" endpoint="hdfs://gsgw1001.red.ua2.inmobi.com:54310"
                   version="0.20.2-cdh3u0" />

        <interface type="execute" endpoint="gsgw1001.red.ua2.inmobi.com:54311" version="0.20.2-cdh3u0" />

        <interface type="workflow" endpoint="http://gs1134.blue.ua2.inmobi.com:11000/oozie/"
                   version="3.1.4" />

        <interface type="messaging" endpoint="tcp://gs1134.blue.ua2.inmobi.com:61616?daemon=true"
                   version="5.1.6" />
    </interfaces>

    <locations>
        <location name="staging" path="/projects/ivory/staging" />
        <location name="temp" path="/tmp" />
        <location name="working" path="/projects/ivory/working" />
    </locations>

    <properties/>

</cluster>

Feed:

<?xml version="1.0" encoding="UTF-8"?>
<!--
    Hourly ad carrier summary. Generated by hourly processing of rr logs
  -->

<feed description="RRHourlyAdCarrierSummary" name="RRHourlyAdCarrierSummary" xmlns="uri:ivory:feed:0.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <partitions/>

    <groups>rmchourly</groups>

    <frequency>hours</frequency>
    <periodicity>1</periodicity>

    <late-arrival cut-off="hours(6)" />

    <clusters>
        <cluster name="staging-red" type="source">
            <validity start="2009-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC" />
            <retention limit="months(24)" action="delete" />
        </cluster>
    </clusters>

    <locations>
        <location type="data" path="/projects/bi/rmc/rr/${YEAR}-${MONTH}-${DAY}-${HOUR}.concat/HourlyAdCarrierSummary" />
        <location type="stats" path="/none" />
        <location type="meta" path="/none" />
    </locations>

    <ACL owner="rmcuser" group="users" permission="0755" />

    <schema location="/none" provider="none" />

    <properties/>

</feed>


Process:

<?xml version="1.0" encoding="UTF-8"?>
<!--
    RMC Daily process, produces 34 new feeds
 -->
<process name="rmc-daily">
	<cluster name="staging-red" />

	<frequency>days(1)</frequency>

	<validity start="2012-04-03T06:00Z" end="2022-12-30T00:00Z" timezone="UTC" />

	<inputs>
		<input name="WapAd" feed="WapAd" start="today(0,0)" end="today(0,0)" />
	</inputs>

	<outputs>
	        <output name="TrafficDailyAdSiteSummary" feed="TrafficDailyAdSiteSummary" instance="yesterday(0,0)" />
	</outputs>

	<properties>
		<property name="lastday" value="${formatTime(yesterday(0,0), 'yyyy-MM-dd')}" />
	</properties>

	<workflow engine="oozie" path="/projects/bi/rmc/pipelines/workflow/rmcdaily" />

	<retry policy="backoff" delay="5" delayUnit="minutes" attempts="3" />
</process>

About

Data Management + Feed Processing Platform over Hadoop

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Java 97.9%
  • Perl 1.9%
  • Shell 0.2%