Tuesday, July 29, 2014

TDD in ETL Projects

How can we implement TDD (Test Driven Development) in Business Intelligence or Data Warehousing projects that involve ETL processes?  Is there a way to implement TDD when the project team is using a tool for ETL or developers are writing programs to extract, transform and load data into the data warehouse or a set of target tables?  Let us explore.
Whether you consider extract, transform and load as distinct steps or all of them together as a single step, one thing to keep in mind is the nature of TDD you can implement here.  It is not going to be the typical JUnit approach of using assertions.   All your data are going to be in schemas or tables.  You are not going to embed input and output data in your test scripts.   If you embed input and output data in your assertions and write pure play JUnits, you can test the ‘T’ of ETL – i.e., transform.  So, what can you do to include E and L – or extract and load?
Think of flight operations in airports. Before the departure of every flight there are several tests or checks to ensure that the aircraft satisfies a set of pre-conditions.  This will include the verification of a whole lot of technical parameters and things related to the number of passengers, and other inventory such as food and other supplies.  At the destination there will be a bunch of tests or checks to ensure that all passengers including the crew members reached the destination.  This will also include routine verifications or checks on the technical parameters of the flight depending on a set of post-conditions.
Now, apply this analogy to implementing tests in ETL context.  The first step is to articulate the pre-conditions and write tests to assert or verify all pre-conditions.  Obviously, if you know all pre-conditions and if you check them one by one by hand, the next wise step is to automate all of them.  That is going to save your time and eliminate manual errors.
Let us assume that your source tables satisfied all pre-conditions and you initiated the ETL process.  When the ETL process completes, or the flight has reached the destination, it is necessary to run data quality checks.   This is a high-level health check. For this, you need scripts to verify row count, data integrity, aggregates (sum, average) and so on.  You can call it a sanity test.
The next step is about verifying all post-conditions.  These are the next level, detailed tests.  Many of these will depend on the pre-conditions and the data values present in source tables.  One way to handle this is to have a meta-table or an intermediate table and scripts to compute and store expected results corresponding to post-conditions. Using this table, you need to run another set of scripts to verify if your ETL process populated the target tables with the right set of data.  The meta-table or intermediate table is what holds your expected results.  The target tables hold the actual results.   Depending on the test strategy of your project, design the meta-table or intermediate table to hold as many rows or pieces of data.  In case of complex projects, you may need multiple meta-tables or intermediate tables.
If you are using manual tests to verify post-conditions, convert them in to scripts.  Populate the results of your scripts in a table or create a log file.  Write a script to analyze errors. This can save you lot of time and money.
Doing all these in small steps – incrementally and iteratively will help you adopt test driven development.  If you decide to do all this as a one big chunk somewhere in the project –mostly at project end, it is going to be classical phase driven approach.  Try TDD!
Are there other ways to implement TDD in ETL projects?  Let us discuss.