ETL stands for Extract, Transform, and Load. The ETL process covers extracting data, usually from several different types of systems; transforming it into a structure that's more suitable for reporting and analysis; and finally loading it into a database and/or cube(s).
But these days, ETL is much more than that.
Most ETL software also covers:
- Data profiling and data quality control.
- Monitoring and cleansing of the data.
- Real-time and on-demand data integration.
- Extraction of Big Data using Hadoop.
- Master data management.
1. Extract from source
Extract data from different internal and external sources, structured and/or unstructured. Plain queries are sent to the source systems using native connections, message queuing, ODBC, or OLE-DB middleware. The data is put in a Staging Area (SA), usually with the same structure as the source. Sometimes we extract only the data that is new or has changed. Some ETL tools can do this automatically, providing a changed data capture (CDC) mechanism.
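A minimal sketch of such an incremental extract, using a simple "watermark" on a last-modified timestamp as a stand-in for CDC. The table and column names are illustrative, not from any particular source system:

```python
import sqlite3

def extract_changed_rows(conn, watermark):
    """Pull only rows changed since the last run (the watermark)."""
    cur = conn.execute(
        "SELECT id, customer, amount, last_modified FROM orders "
        "WHERE last_modified > ?", (watermark,))
    return cur.fetchall()

# Demo with an in-memory "source system"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(id INTEGER, customer TEXT, amount REAL, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "Acme", 100.0, "2024-01-01"),
    (2, "Beta", 250.0, "2024-03-15"),
])
conn.commit()

# Only the row modified after the watermark lands in the staging area
staging = extract_changed_rows(conn, "2024-02-01")
```

Real CDC mechanisms read the database transaction log instead of comparing timestamps, but the watermark pattern above is a common low-tech fallback.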
2. Transform the data
Once the data is available in the Staging Area, it's all on one platform and one database. There we can easily combine tables, filter and sort the data using specific attributes, pivot to another structure, and make calculations. We also check data quality and cleanse the data if necessary. After preparing the data we can implement slowly changing dimensions, which keep track of how attributes change over time for use in reports and analyses.
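The slowly-changing-dimension idea can be sketched as follows. This illustrates a Type 2 dimension, where a changed attribute closes the current row and adds a new one so history is preserved; the row structure and attribute names are assumptions for the example:

```python
from datetime import date

def apply_scd2(dimension, key, new_attrs, today):
    """Type 2 SCD: close the current row for `key`, then add a new current row."""
    for row in dimension:
        if row["key"] == key and row["current"]:
            if row["attrs"] == new_attrs:
                return dimension  # nothing changed, keep the row as-is
            row["current"] = False
            row["valid_to"] = today  # close the historical row
    dimension.append({"key": key, "attrs": new_attrs,
                      "valid_from": today, "valid_to": None, "current": True})
    return dimension

# A customer moves city: the old row is kept for history, a new row becomes current
dim = [{"key": 42, "attrs": {"city": "Utrecht"},
        "valid_from": date(2020, 1, 1), "valid_to": None, "current": True}]
dim = apply_scd2(dim, 42, {"city": "Amsterdam"}, date(2024, 6, 1))
```

After the change, `dim` holds two rows for the same customer: the closed historical row and the new current one, so older facts can still join to the city that was valid at the time.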
3. Load the data
Finally, data is loaded into a data warehouse, usually into fact and dimension tables. From there, the data can be combined, aggregated, and loaded into datamarts or cubes as is deemed necessary. The business user analyzes and uses the transformed data with BI tools and instruments like data visualization software, dashboards, OLAP tools, and reporting tools. The data can then be used to improve the quality of the organization’s decision-making.
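A typical detail of the load step is replacing business keys with the surrogate keys of the dimension tables before inserting fact rows. A hedged sketch, with hypothetical star-schema names:

```python
def load_facts(fact_table, staged_rows, customer_dim):
    """Insert staged rows as facts, swapping the business key for the surrogate key."""
    key_map = {d["business_key"]: d["surrogate_key"] for d in customer_dim}
    for row in staged_rows:
        fact_table.append({
            "customer_sk": key_map[row["customer"]],  # surrogate key lookup
            "amount": row["amount"],
        })
    return fact_table

customer_dim = [{"business_key": "Acme", "surrogate_key": 1}]
facts = load_facts([], [{"customer": "Acme", "amount": 100.0}], customer_dim)
```

In a real data warehouse the same lookup happens in SQL or in the ETL tool, but the principle is identical: facts reference dimensions only through surrogate keys.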
An ideal ETL architecture contains a data warehouse
Below you’ll find the ideal ETL architecture supporting the three major steps in ETL.
Read more about ETL & data
Passionned Group has written extensively about data and everything that comes with it. Take a look at the articles below or consult our knowledge base.
Data profiling and data quality control
Profiling the data will give direct insight into the data quality of the source systems. It can display how many rows have missing or invalid values, or what the distribution is of the values in a specific column. Based on this knowledge, one can specify business rules in order to cleanse the data, or keep really bad data out of the data warehouse. By profiling the data before designing your ETL process, you are better able to design a system that is robust and has a clear structure.
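The kind of profiling described above can be illustrated with a few lines of code: count rows, missing values, and the distribution of values in a column. The sample data is invented for the example:

```python
from collections import Counter

def profile(rows, column):
    """Report row count, missing values, and value distribution for one column."""
    values = [r.get(column) for r in rows]
    missing = sum(1 for v in values if v in (None, ""))
    distribution = Counter(v for v in values if v not in (None, ""))
    return {"rows": len(values), "missing": missing, "distribution": distribution}

rows = [{"country": "NL"}, {"country": "NL"}, {"country": ""}, {"country": "BE"}]
report = profile(rows, "country")
```

A report like this immediately suggests business rules, e.g. "reject rows with an empty country" or "flag countries that occur only once for manual review".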
Metadata management & ETL
Information about all the data that is processed, from sources through transformations to targets, is often put into a metadata repository: a database containing all the metadata. The entire ETL process can then be 'managed' through metadata management. For example, one can query how a specific target attribute is built up in the ETL process; this is called data lineage. Or one can perform impact analysis: if, say, the size of the order identifier (id) changes, in which ETL steps does this attribute play a role?
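Both questions amount to walking a dependency graph over the metadata repository. A toy sketch, where the repository structure (a mapping from target attribute to the source attributes that feed it) is an assumption for illustration:

```python
# Hypothetical metadata repository: target attribute -> source attributes
lineage = {
    "dwh.fact_orders.order_id": ["staging.orders.id"],
    "staging.orders.id": ["src.erp.orders.id"],
}

def trace_lineage(attribute, repo):
    """Data lineage: all upstream attributes that feed `attribute`."""
    upstream = []
    for src in repo.get(attribute, []):
        upstream.append(src)
        upstream.extend(trace_lineage(src, repo))
    return upstream

def impact_of(attribute, repo):
    """Impact analysis: all downstream attributes affected if `attribute` changes."""
    affected = [tgt for tgt, srcs in repo.items() if attribute in srcs]
    for tgt in list(affected):
        affected.extend(impact_of(tgt, repo))
    return affected
```

Commercial ETL tools answer these questions from the same kind of repository, usually with a graphical lineage view instead of a query.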
Learn more about ETL
Learn more about ETL & Analytics and download our ETL Tools & Data Integration Survey 2018.
Also consider taking our unique Data Warehousing & Data Governance training course, which can help you understand all aspects of ETL and Data Integration.