Design aspects for ETL & Data Integration

Important design aspects for ETL


For a proper Business Intelligence system we need to have some sort of ETL in place. When extracting and filtering data from the source systems, the following aspects are important:

Completeness

Indicators and other types of management information need data from tables in the source system. Even when only one or a few attributes are required, it is highly recommended to copy ALL attributes of the table to the staging area (SA). Why? Firstly, it is simpler: instead of having to name each attribute explicitly, the table name will do*. Secondly, new indicators or dimensions – which should be based on data in tables that are already in the SA – can be produced faster, because there is no need to adjust the extraction first. The disadvantage is that more data needs to be processed, which may increase the loading time of the SA.
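As a rough illustration, the sketch below copies a complete source table into the SA rather than listing individual attributes. The connection strings, schema and table names are hypothetical, and the approach assumes a source that is reachable through SQLAlchemy and pandas.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings; adjust to your own source system and SA.
source = create_engine("postgresql://user:pass@source-host/erp")
staging = create_engine("postgresql://user:pass@dwh-host/staging")

def extract_full_table(table_name: str) -> None:
    """Copy ALL attributes of a source table into the staging area (SA)."""
    # 'select *' keeps the extraction simple: naming the table is enough,
    # and new indicators can later be built without touching the extraction.
    df = pd.read_sql(f"select * from {table_name}", source)
    df.to_sql(table_name, staging, schema="sa", if_exists="replace", index=False)

extract_full_table("sales_order")  # hypothetical source table
```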

Different formats, protocols and platforms

In most cases, the extraction process requires communication with various computer platforms – mainframes, Unix, Linux, iSeries, Windows – and with different file formats and interfaces such as RDBMS, ODBC, OLE DB, flat files, XML, VSAM and IMS. This certainly applies in a highly heterogeneous IT environment. However, some sources cannot be accessed this way. This may be the case with purchased software that does not use a standard database or common protocols. The organization is then dependent on the willingness of the software supplier to provide usable extracts at an acceptable price.
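A minimal sketch of reading three of these source types (an ODBC-accessible RDBMS, a flat file and an XML export) into a uniform structure is shown below. The DSN, file paths and table names are invented for the example, and reading XML this way assumes a reasonably recent pandas version.

```python
import pandas as pd
import pyodbc  # ODBC bindings, e.g. towards an iSeries or mainframe gateway

# Hypothetical DSN and paths; the real names depend on your IT landscape.
odbc_conn = pyodbc.connect("DSN=erp_on_iseries;UID=etl;PWD=secret")

orders_rdbms = pd.read_sql("select * from orders", odbc_conn)   # RDBMS via ODBC
orders_flat = pd.read_csv("/data/export/orders.csv", sep=";")   # flat file
orders_xml = pd.read_xml("/data/export/orders.xml")             # XML (pandas >= 1.3)

# Once the different formats are read, they can be landed in the SA uniformly.
for name, frame in [("orders_rdbms", orders_rdbms),
                    ("orders_flat", orders_flat),
                    ("orders_xml", orders_xml)]:
    frame.to_csv(f"/staging/{name}.csv", index=False)
```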

Source system documentation (metadata)

Quite often source systems are not very well documented. The actual meaning of certain tables and attributes is unknown, or even worse, the relationships between tables have not been documented. The first problem can be solved relatively easily: we can usually recover the meaning of a table by asking a few users and by querying its content. The latter problem is not so easily solved: the tables first need to be linked with each other in different ways, and even then there may still be ambiguity about the type of relationship, in which case we need to inspect both sides in order to figure out whether the relationship is mandatory or optional. In short: creating and designing extractions – and thus the data warehouse – is much easier with proper documentation at hand.
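One way to inspect both sides of a suspected relationship is to count NULL foreign keys and orphaned rows, as in the sketch below. The table and column names are hypothetical; in practice they come out of exactly the kind of experimentation with joins described above.

```python
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pass@source-host/erp")  # hypothetical

def inspect_relationship(child: str, fk: str, parent: str, pk: str) -> None:
    """Rough check whether a suspected relationship is mandatory or optional."""
    null_fks = pd.read_sql(
        f"select count(*) as n from {child} where {fk} is null", source)["n"].iloc[0]
    orphans = pd.read_sql(
        f"select count(*) as n from {child} c "
        f"left join {parent} p on c.{fk} = p.{pk} "
        f"where c.{fk} is not null and p.{pk} is null", source)["n"].iloc[0]
    print(f"{child}.{fk} -> {parent}.{pk}: "
          f"{null_fks} NULL foreign keys, {orphans} orphaned rows")
    # Many NULLs suggest an optional relationship; orphans point to weak integrity.

inspect_relationship("order_line", "order_id", "sales_order", "id")
```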

Frequency

As described earlier, the refresh rates of both the data warehouse and the reports should keep pace with the frequency of events in the business processes. The frequency of the extraction process depends on more aspects, though. It depends very much on how busy the source system is at specific times. During the day the source system is often rather busy, so transferring large amounts of data can often only take place at night. In most cases, real-time data transfer is not a big problem. Admittedly, it does stress the source system somewhat, because each change must be processed twice: once in the database itself and once when it is passed on to the SA or ODS.
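As a simple sketch of restricting heavy extractions to a quiet period, the snippet below checks whether the current time falls inside an assumed nightly batch window of 01:00 to 05:00; the window itself is an assumption and should follow the actual load profile of the source system.

```python
from datetime import datetime, time

# Hypothetical batch window: we assume the source system is quiet between 01:00 and 05:00.
WINDOW_START = time(1, 0)
WINDOW_END = time(5, 0)

def within_batch_window() -> bool:
    """Return True when heavy extractions may run against the source system."""
    now = datetime.now().time()
    return WINDOW_START <= now <= WINDOW_END

if within_batch_window():
    print("Source system is quiet: start the full extraction.")
else:
    print("Daytime load on the source system: postpone the extraction until tonight.")
```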

Changes only

The source system should preferably only supply data that has been changed or added since the previous upload. If this is not possible, data should be selected by modification date. Obviously, this does not apply to the initial load of the data warehouse, because then we need all data, including historical data. If the source system has no mechanism to specifically deliver changed data – called Change Data Capture (CDC) – and large amounts of data are involved, the data warehouse can best – or only – be updated during the weekend, simply because the time frame for processing is too tight. Another option is to update the data warehouse in real time, but this requires sophisticated tools that continuously detect changes in the source system and send these on to the data warehouse directly or with a minor delay.
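Where the source offers no CDC mechanism, selecting by modification date is the usual fallback. The sketch below assumes a hypothetical 'last_modified' column on the source table and hypothetical connection strings; a real CDC facility, where available, is preferable to filtering on a date column.

```python
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://user:pass@source-host/erp")    # hypothetical
staging = create_engine("postgresql://user:pass@dwh-host/staging")  # hypothetical

def extract_changes(table_name: str, since: str) -> None:
    """Extract only the rows changed since the previous upload."""
    # Assumes the source table carries a 'last_modified' timestamp column.
    query = text(f"select * from {table_name} where last_modified > :since")
    changed = pd.read_sql(query, source, params={"since": since})
    changed.to_sql(table_name, staging, schema="sa", if_exists="append", index=False)

# After the initial (full, historical) load, each run only picks up the delta.
extract_changes("sales_order", since="2024-06-01 00:00:00")
```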

Back-dated transactions

‘Back-dated’ transactions have their own specific issues. They contain data that is booked retroactively – at a date in the past – and therefore cause trouble, because once transactions have been loaded into the data warehouse they may not be changed. Solutions for this problem (including entering transactions with opposing values) do exist, but will not be discussed in this book.
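To make the idea of opposing values slightly more concrete, the toy records below show a correction booked as a reversing entry plus a new entry, so that nothing already loaded has to be changed; the field names and amounts are purely illustrative.

```python
from datetime import date

# Hypothetical transaction records. Instead of changing a loaded transaction,
# the back-dated correction is booked as an opposing entry plus a new entry.
original = {"id": 1, "booking_date": date(2024, 3, 1), "amount": 500.00}
reversal = {"id": 2, "booking_date": date(2024, 4, 1), "amount": -500.00,
            "reverses": 1}  # cancels the original without touching it
correction = {"id": 3, "booking_date": date(2024, 4, 1), "amount": 450.00}

# The data warehouse only ever receives new rows.
for txn in (reversal, correction):
    print("load into warehouse:", txn)
```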

* Some ETL tools translate the table name command (under the hood) into a series of attributes. This does mean working at attribute level during data warehouse maintenance. In other words: these tools translate the command ‘select * from [tablename]’ into ‘select [attribute1], [attribute2], etc. from [tablename]’.

Tip: avoid spaghetti! Make sure the extraction stays as clean as possible and do not put any functional intelligence, such as calculations for indicators, in it. Such calculations belong in the data warehouse and data marts, not in reports. Distributing functional intelligence across multiple steps complicates data warehouse maintenance and strongly reduces scalability.
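A small sketch of this separation is shown below: the extraction step is a clean copy of the source attributes, while the indicator calculation lives in a separate transformation step that would run in the data warehouse or data mart layer. The 'margin' indicator and the column names are hypothetical.

```python
import pandas as pd

# Extraction step: a clean 1-to-1 copy of the source attributes, no business logic.
def extract(order_lines: pd.DataFrame) -> pd.DataFrame:
    return order_lines.copy()

# Data warehouse / data mart step: this is where indicator calculations belong,
# for example a hypothetical 'margin' indicator derived from revenue and cost.
def transform(order_lines: pd.DataFrame) -> pd.DataFrame:
    result = order_lines.copy()
    result["margin"] = result["revenue"] - result["cost"]
    return result

lines = pd.DataFrame({"revenue": [100.0, 250.0], "cost": [60.0, 200.0]})
print(transform(extract(lines)))
```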
