Friday, March 14, 2014

A highly extendable ETL framework architecture to solve common challenges for data loading

Extraction, transformation, and load (ETL) is a common task in data analytics projects, and a few challenges come up again and again when developing an ETL framework:


  • Designing a common, reusable ETL framework that loads data from various data sources into a single repository, such as your own database, and integrates those loads into a batch program.
  • Mapping data attributes between the external sources and your canonical data schema.
  • Providing a user-friendly way to define rules for filtering data during the batch load process.

Recently I worked on a project to extract defects from various defect management tools, such as IBM RTC, HP MC, and IBM CMVC, and to build an analytics pipeline that runs machine learning algorithms such as k-means clustering and support vector machines (SVM). For this I created the D4Jazz framework: an automated defect load framework that loads data from different sources into a central repository, which is Rational Team Concert (RTC).

It has the following features:

1. It can be used to migrate defects from any external source into RTC.
2. It applies established software design patterns to this ETL tool, has a pluggable architecture, and can be extended to any defect repository.
3. The current implementation can load defects from IBM Configuration Management Version Control (CMVC), comma-separated values (CSV) text files, and another RTC instance.
4. It uses a custom RTC work item to store the external defect repository information, the defect attribute mapping, and the defect extraction rules.
5. It is developed in Java and leverages Java client libraries available from RTC and CMVC.


Here is the architecture diagram. It is a multi-tier application. The top layer is the WebSphere modern batch job scheduler, which runs the ETL batch job; that job invokes the job manager to extract data from the sources and load it into the central repository. The framework defines a set of interfaces for managing jobs (IJobManager), extracting data (IExtractor), loading data (ILoader), and connecting to a data source (IConnectionFactory). These interfaces are the extension points: application developers plug in their own logic to load data from a specific data source.
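To make the extension points more concrete, here is a minimal Java sketch of how the four interfaces could fit together. Only the interface names come from the framework; the method signatures, the generic connection type, and the SimpleJobManager, InMemoryLoader, and EtlSketch classes are assumptions made up for illustration, not the actual D4Jazz code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Interface names come from the post; every method signature here is an assumption.
interface IConnectionFactory<C> {
    C connect();
}

interface IExtractor<C> {
    // Each record is modeled as a simple attribute-name -> value map.
    List<Map<String, String>> extract(C connection);
}

interface ILoader {
    // Returns the number of records written to the target repository.
    int load(List<Map<String, String>> records);
}

interface IJobManager {
    int run();
}

// Hypothetical job manager that wires the three extension points together.
final class SimpleJobManager<C> implements IJobManager {
    private final IConnectionFactory<C> connectionFactory;
    private final IExtractor<C> extractor;
    private final ILoader loader;

    SimpleJobManager(IConnectionFactory<C> connectionFactory,
                     IExtractor<C> extractor,
                     ILoader loader) {
        this.connectionFactory = connectionFactory;
        this.extractor = extractor;
        this.loader = loader;
    }

    @Override
    public int run() {
        C connection = connectionFactory.connect();
        List<Map<String, String>> records = extractor.extract(connection);
        return loader.load(records);
    }
}

// Stand-in for the central repository: it just keeps records in memory.
final class InMemoryLoader implements ILoader {
    final List<Map<String, String>> stored = new ArrayList<>();

    @Override
    public int load(List<Map<String, String>> records) {
        stored.addAll(records);
        return records.size();
    }
}

final class EtlSketch {
    static int runDemo() {
        // A plain string stands in for a real connection to a defect repository.
        IConnectionFactory<String> factory = () -> "defects.csv";
        IExtractor<String> extractor = source -> List.of(
                Map.of("id", "1", "summary", "Crash on login"),
                Map.of("id", "2", "summary", "Typo in help text"));
        return new SimpleJobManager<>(factory, extractor, new InMemoryLoader()).run();
    }

    public static void main(String[] args) {
        System.out.println("Loaded " + runDemo() + " records");
    }
}
```

Supporting a new data source in this sketch means supplying a new connection factory and extractor; the job manager and loader stay unchanged.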

One of the most useful design patterns in an ETL framework is the decorator pattern. The decorator pattern lets an application customize existing interfaces and add features to them. For example, during the transformation phase of ETL there is usually a large set of rules for filtering the data, and the rules that apply can vary widely across data sources. The decorator pattern lets us build up a set of rules that all implement a Decorable interface; a pipeline builder can then pick any rules from that set and compose its own rule chain.
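As a rough illustration of that idea, the sketch below chains filter rules as decorators in Java. The Decorable name comes from the framework, but its accept method, the rule classes, the attribute names ("state", "severity"), and the convention that a higher severity number means a more severe defect are all assumptions for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Name from the post; the method shape is an assumption.
interface Decorable {
    boolean accept(Map<String, String> defect);
}

// Base component: lets every defect through.
final class AcceptAll implements Decorable {
    @Override
    public boolean accept(Map<String, String> defect) {
        return true;
    }
}

// Decorator base class: wraps another Decorable and adds one more check.
abstract class RuleDecorator implements Decorable {
    private final Decorable inner;

    RuleDecorator(Decorable inner) {
        this.inner = inner;
    }

    @Override
    public boolean accept(Map<String, String> defect) {
        return inner.accept(defect) && matches(defect);
    }

    protected abstract boolean matches(Map<String, String> defect);
}

// Hypothetical rule: keep only defects whose state is "open".
final class OpenOnlyRule extends RuleDecorator {
    OpenOnlyRule(Decorable inner) {
        super(inner);
    }

    @Override
    protected boolean matches(Map<String, String> defect) {
        return "open".equals(defect.get("state"));
    }
}

// Hypothetical rule: keep only defects at or above a minimum severity.
final class MinSeverityRule extends RuleDecorator {
    private final int minSeverity;

    MinSeverityRule(Decorable inner, int minSeverity) {
        super(inner);
        this.minSeverity = minSeverity;
    }

    @Override
    protected boolean matches(Map<String, String> defect) {
        return Integer.parseInt(defect.getOrDefault("severity", "0")) >= minSeverity;
    }
}

final class RulePipelineSketch {
    // Applies the composed rule chain and returns the ids of the surviving defects.
    static List<String> filterIds(Decorable rules, List<Map<String, String>> defects) {
        return defects.stream()
                .filter(rules::accept)
                .map(d -> d.get("id"))
                .collect(Collectors.toList());
    }

    static List<Map<String, String>> sampleDefects() {
        return List.of(
                Map.of("id", "A-1", "state", "open", "severity", "3"),
                Map.of("id", "A-2", "state", "closed", "severity", "3"),
                Map.of("id", "A-3", "state", "open", "severity", "1"));
    }

    public static void main(String[] args) {
        // A pipeline picks the rules it needs and stacks them.
        Decorable rules = new MinSeverityRule(new OpenOnlyRule(new AcceptAll()), 2);
        System.out.println(filterIds(rules, sampleDefects()));
    }
}
```

Each data source can stack a different combination of rules around the same base component, so adding a rule does not require touching the existing ones.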





