
Components of DISL project

  • Model - The core of each DISL project is the data model, the data mapping model and the workflow, which together make up a generic model of the data integration (DI) process. The model is independent of the target runtime environment.
  • Patterns - Patterns are responsible for generating output or executing code on the target runtime environment based on DISL model elements.
  • Context - A context defines a named set of properties used to map the model to the topology of a target runtime environment. The same model may (and should) be used to deploy and execute the data integration process on multiple runtime environments. A context may also be used to provide parametrisation of the generic model of the DI process.
  • Launcher - The DISL launcher is the basic entry point of a DISL project. The launcher executes a given DISL model element (table, mapping or job) based on the passed parameters.
  • Scripts - DISL scripts are alternative entry points of a DISL project. Scripts may be used to execute, simulate, deploy or test parts of the data integration process. Scripts may also be used to generate documentation or reverse engineer database objects. Scripts may be launched from an IDE, so any IDE supporting Groovy may be used for development of DISL projects. DISL scripts may of course also be launched from the command line or the Groovy console.

Model

Elements of the model are Groovy classes implemented using the DISL API. DISL provides abstract parent class implementations for the standard types of model elements in the package org.disl.meta.

Data model

Data model describes data structures used as sources or targets of data transformations.

Table

The data model consists of DISL Tables. A DISL Table typically represents a table deployed in a relational database, but it may also represent the structure of a text, XML or JSON file, or a service interface. A DISL Table is defined as a class extending org.disl.meta.Table. The structure of the table is defined using instance properties of type org.disl.meta.Column. Annotations are used to specify additional metadata such as comments, data types or integrity constraints.

Sample table definition:

import org.disl.meta.*

class DEPARTMENT extends Table {
	
@DataType('VARCHAR(10)')
Column ID

@DataType('VARCHAR(50)')
Column NAME
}

Check the unit tests to get an overview of the available options for DISL Table metadata definitions.

Domains

Domains are used to define common properties for columns holding the same type of values. A domain is implemented as a class extending Column. One can define model-specific properties which can be used in custom patterns. When using domains, table columns are modelled as fields of the domain type (see the usage sketch after the sample definition below).

Sample domain definition:

import org.disl.meta.*

class AMOUNT extends Column {
	String dataType="NUMBER(38,2)"
	String xnaValue="0"
	String xunValue="0"
}
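
A table column is then declared with the domain type instead of Column. The following is a minimal sketch using the AMOUNT domain above; the table and column names are made up for illustration:

import org.disl.meta.*

class SALARY_PAYMENT extends Table {

	@Description("Amount paid to the employee.")
	AMOUNT PAID_AMOUNT

	@Description("Bonus part of the payment.")
	AMOUNT BONUS_AMOUNT
}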

Constraints and comments

Constraints and comments in the DISL data model are defined using annotations. Annotations may be specified at table level or at column level.

  • @Description - defines a table or column level comment.
  • @DefaultValue - defines a column default value.
  • @PrimaryKey - marks a primary key column.
  • @NotNull - marks a mandatory column.
  • @Check - defines a check constraint at column level. It may also be used at table level, enclosed in @CheckConstraints.
  • @Index - defines an index. It must be enclosed in @Indexes.
  • @ForeignKey - defines a foreign key. It can be used at column level or at table level enclosed in @ForeignKeys. A composite foreign key must be declared at table level.
  • @UniqueKey - defines a unique key. It may be specified at column level or at table level enclosed in @UniqueKeys.

Sample table with annotations used to declare comments and constraints:

@Indexes(@Index(columns=["A","B"]))
@CheckConstraints(
	@Check(name='CHECK_1',value='1=1'))
@ForeignKeys([
	@ForeignKey(name='PARENT1_FK',targetTable=TestingTable,sourceColumn='PARENT1_B,PARENT1_C'),
	@ForeignKey(name='PARENT2_FK',targetTable=TestingTable,sourceColumn='PARENT2_B,PARENT2_C',targetColumn='B,C')])
class TestingTable extends Table {

	CreateOrReplaceTablePattern pattern

@Description("Column A.")
@DataType("VARCHAR(255)")
@DefaultValue("'A'")
Column A

@PrimaryKey
@DataType("VARCHAR(255)")
Column B

@PrimaryKey
@DataType("VARCHAR(255)")
Column C

@DataType("VARCHAR(255)")
Column PARENT1_B

@DataType("VARCHAR(255)")
Column PARENT1_C

@Check("PARENT2_B<>'XNA'")
@DefaultValue("'XUN'")
@NotNull
@DataType("VARCHAR(255)")
Column PARENT2_B

@DataType("VARCHAR(255)")
Column PARENT2_C

@DataType("VARCHAR(255)")
@ForeignKey(targetTable=TestingTable,targetColumn='A')
Column PARENT3_A
}

Data mapping

A data mapping represents the data transformation model defined in DISL.

Mappings

A Mapping defines a single data transformation. Each mapping consists of mapping sources, column mappings and, optionally, a mapping target. Mapping sources are database tables, views or other mappings (subqueries). Column mappings are SQL expressions based on mapping source columns. The mapping target specifies the target table the transformed data is loaded into. Each mapping may be transformed into an SQL query. Patterns are responsible for generating and executing code specific to the target technology and design pattern.

If a DISL Table used as a mapping source represents a non-database object (text file, JSON ...), the corresponding data must be imported into a staging table before the SQL query is executed. The generated SQL query uses this staging table in its FROM clause. Importing data into the staging table may be handled by the corresponding table pattern.

A DISL Mapping is defined as a class extending org.disl.meta.Mapping. Source and target tables are also defined as instance properties of the corresponding type. Join conditions, set operations, filters, group by and having clauses may be defined by corresponding method calls in the initMapping() method.

Column mappings are defined as instance properties of type ColumnMapping. A ColumnMapping must be initialized by the values returned by the createExpressionMapping() or createAggregateColumnMapping() methods. Alternatively one can use the shorthand methods e() or a(). All of these methods accept one parameter containing an SQL expression, which may be a string or a column of a mapping source class. Groovy provides great string interpolation and multi-line string support (see http://www.groovy-lang.org/syntax.html#_double_quoted_string). When referring to columns in SQL expressions, string interpolation should be used whenever possible. For example, use "AVG(${emp.SALARY})" instead of "AVG(emp.SALARY)". This enables proper refactoring and source code impact analysis in the IDE. If a mapping contains one or more aggregate column mappings (createAggregateColumnMapping() or a() is used to initialize the column mapping), a group by clause is automatically generated as part of the SQL query.

At runtime, Mapping instances are created by the MetaFactory class, which ensures proper initialization of the Mapping instance including dependency injection.

Sample mapping:

class Load_TARGET_TABLE extends LoadTableMapping {
	DEPARTMENT dep
	EMPLOYEE emp
	
	TARGET_TABLE target
	
	ColumnMapping DEPARTMENT_ID=e dep.ID
	ColumnMapping DEPARTMENT_NAME=e dep.NAME
	ColumnMapping AVG_SALARY=a "AVG(${emp.SALARY})"
	ColumnMapping FEMALE_AVG_SALARY=a femaleAvg(emp.SEX, emp.SALARY)
	
	
	void initMapping() {
		from dep
		innerJoin emp on "$dep.ID=$emp.DEPARTMENT_ID"
	}
}

SQL query generated from sample mapping:

/*Mapping Load_TARGET_TABLE*/
	SELECT
		dep.ID as DEPARTMENT_ID,
		dep.NAME as DEPARTMENT_NAME,
		AVG(emp.SALARY) as AVG_SALARY,
		AVG(CASE WHEN emp.SEX='F' THEN
				emp.SALARY
		END
		) as FEMALE_AVG_SALARY
	FROM
		PUBLIC.DEPARTMENT dep
		INNER JOIN PUBLIC.EMPLOYEE emp  ON (dep.ID=emp.DEPARTMENT_ID)
	WHERE
		1=1
	GROUP BY
		dep.ID,dep.NAME
/*End of mapping Load_TARGET_TABLE*/
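
Because other mappings can serve as mapping sources, a mapping such as the one above may be reused as a subquery of another mapping. The following is an illustrative sketch only: REPORT_TABLE is a made-up target table, and it assumes that columns of a source mapping can be referenced the same way as columns of a source table.

class Load_REPORT_TABLE extends LoadTableMapping {
	//The previous mapping is used as a source; it is rendered as a subquery in the FROM clause of the generated SQL.
	Load_TARGET_TABLE src

	//Hypothetical target table.
	REPORT_TABLE target

	ColumnMapping DEPARTMENT_NAME=e src.DEPARTMENT_NAME
	ColumnMapping AVG_SALARY=e src.AVG_SALARY

	void initMapping() {
		from src
	}
}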

Libraries

A Library provides implementations of the data transformation functions used by mappings. Functions defined in libraries return strings containing a valid SQL expression based on the input parameters. In effect, library functions are SQL expression patterns which may be reused in multiple mappings. Arguments of library functions are typically placeholders for other SQL expressions - constants, columns or SQL expressions generated by other library function calls. By using library functions properly, one can avoid many of the pains of today's ETL / mapping tools:

  • avoid code redundancy
  • clean and concise coding - Avoid using complex SQL expressions in data mappings. Split the implementation of complex data transformation logic into multiple properly named library functions. Use self-descriptive constants instead of literals (codes, ids ...) with hidden meanings.
  • real unit testing - each library function may and should be covered by a fully independent unit test. DISL provides good support for implementing unit tests of library functions.

Column mapping without a library function:

ColumnMapping FEMALE_AVG_SALARY=a """\
AVG(CASE WHEN ${emp.SEX}='F' THEN
		${emp.SALARY}
END)"""

Column mapping using a library function:

ColumnMapping FEMALE_AVG_SALARY=a femaleAvg(emp.SEX, emp.SALARY)

Both mappings produce the same SQL expression at runtime. The second option is concise and can be covered by a unit test:

@Test
void testFemaleAvg() {
	def testData=[
		[GENDER:"'M'",AMOUNT:20000],
		[GENDER:"'F'",AMOUNT:25000],
		[GENDER:"'F'",AMOUNT:21000]]
	assertExpressionEquals( '23000',  femaleAvg('GENDER','AMOUNT'), testData)
}

Unit tests may be defined in separate files or as part of the library:

@Test
void testAmountForFemale() {
	assertExpressionEquals('10000',amountForFemale("'F'", 10000))
	assertExpressionEquals(null,amountForFemale("'M'", 10000))
}

static String amountForFemale(def sexFlag,def amount) {
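	//FEMALE_CODE is a constant defined in the library, e.g. static final String FEMALE_CODE="'F'"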
	"""\
	CASE WHEN $sexFlag=$FEMALE_CODE THEN
			$amount
	END
	"""
}

@Test
void testFemaleAvg() {
	def testData=[
		[GENDER:"'M'",AMOUNT:20000],
		[GENDER:"'F'",AMOUNT:25000],
		[GENDER:"'F'",AMOUNT:21000]]
	assertExpressionEquals( '23000',  femaleAvg('GENDER','AMOUNT'), testData)
}

static String femaleAvg(def sexFlag, def amount) {
	"AVG(${amountForFemale(sexFlag,amount)})"
}

Workflow

The DISL execution workflow is defined using DISL Jobs. A DISL job defines a logical set of executable elements which should be executed together. Executable elements are typically mappings and other jobs. Jobs may also be used to implement database deployment and patching tasks.
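
A minimal sketch of a job grouping executable elements follows. The class name is made up, and both the addSteps registration method and the org.disl.workflow.Job parent class are assumptions here; check the DISL sources for the actual job API.

import org.disl.workflow.Job

class Load_STAGE extends Job {
	{
		//Assumed helper registering executable model elements (mappings, other jobs) by type.
		addSteps(Load_TARGET_TABLE)
	}
}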

Pattern

A Pattern is composed of Steps. Each step represents a template used to generate executable code based on a DISL model element. The code template of each step is defined by the getCode() method. The simulate() method outputs the generated code to the console for testing purposes. The executeInternal() method implements the step execution. It typically takes the generated code from the getCode() method and executes it on the target runtime environment (ExecuteSqlScriptStep) or just writes it to the filesystem (FileOutputStep). Pattern steps are executed in serial order. Executable model elements have a pattern assigned and delegate their execution to it.
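
A rough sketch of a custom step based on the contract described above; the tableName property and the String return type of getCode() are assumptions made for illustration only.

class DropTargetTableStep extends ExecuteSqlScriptStep {
	//Hypothetical property set by the owning pattern.
	String tableName

	//getCode() provides the code template; the parent step executes it on the target database.
	String getCode() {
		"DROP TABLE ${tableName}"
	}
}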

Context

A Context defines a named set of properties used to map the model to the topology of a target runtime environment. The same model may (and should) be used to deploy and execute the data integration process on multiple runtime environments. A context may also be used to provide parametrisation of the generic model of the DI process. A context is identified by its name, which may be specified at DISL project runtime. If no context is specified, the default context named 'default' is used.

A context is defined by a property file [contextName].context.properties, which must be part of the DISL project. It should be located in the src/main/resources folder. It is good practice to use default.context.properties for the development environment (personal or shared), so that execution of unit tests and scripts from the IDE does not require explicit context configuration. Integration and production environment topology may be defined in separate non-default contexts.

For relational databases, the topology is defined by the configuration of logical schemas in a given context. A logical schema is just a name which is used in the DISL project to reference a database schema. At runtime, context properties are used to construct a PhysicalSchema for each logical schema. The PhysicalSchema defines the JDBC connection and the database schema name for the DISL runtime.

Sample schema definitions in context property files (in these samples the logical schema is named 'default'):

Oracle schema definition

default=Oracle
default.host=[host]
default.port=[port]
default.serviceName=[serviceName]
default.schema=[user]
default.user=[user]
default.password=[password]

MS SQL Server schema definition

default=Mssql
default.host=[host]
default.databaseName=[databaseName]
default.schema=[schema]
default.user=[user]
default.password=[password]

Vertica schema definition

default=Vertica
default.host=[host]
default.databaseName=[databaseName]
default.schema=[schema]
default.user=[user]
default.password=[password]
