The gap between what ETL suppliers think and what users want seems to grow bigger every year. It starts with the name. Users still talk about ETL (Extract, Transform and Load), while suppliers consider the term passé. They talk about data integration and master data management, concepts that are broader than the original ETL but describe the same problem.
ETL or data integration?
If we assume that Google statistics give an indication of what people are interested in, we see that globally 240,000 people per month (in the Netherlands: 2,400) search for “ETL”. The search term “data integration” is “only” entered by 74,000 people (in the Netherlands: 720). The first Google page for “data integration” contains almost exclusively supplier information, while the first “ETL” page contains a lot of (neutral) information on the subject itself and is almost free of suppliers. As this example shows, good positioning of the products remains a problem.
Through their acquisitions in recent years, some of the major suppliers (IBM, Oracle) offer several products that seem to do pretty much the same thing. Companies that want to rationalize these products will wonder which product will still exist in the future and, more importantly, which will not. In addition, Open Source products are rapidly emerging. Two years ago these were little more than amateur productions, but now we are talking about products that are comparable with the market leaders in terms of functionality.
The role of the products
The role of the products has also changed over time. Four or five years ago, there were two typical uses for an ETL tool:
- It was used periodically to transfer data from one or more source systems into a data warehouse. During this process, the data was cleaned up to some extent, de-duplicated and validated as far as possible.
- It was deployed when migrating data. The source data from the old system (often a self-built system being replaced by a newly purchased ERP system) was transferred into the new system in one go, with much effort and great difficulty. The process could be so difficult that many companies decided to take no history with them at all.
These were very long procedures, often run overnight or, when that was not enough, over the weekend. They often went wrong and had to be retried the next day or the following weekend. Both of the above problems still exist. The data warehouse is now updated more frequently, and sometimes in a somewhat smarter way. This is what ETL and data integration tools have always done and will probably always continue to do.
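Such a periodic batch load can be sketched in a few lines. This is a minimal illustration, not a real ETL job: the table, column names and sample rows are invented for the example, and SQLite stands in for the warehouse.

```python
import sqlite3

def nightly_etl(source_rows):
    """Minimal batch ETL: extract rows, clean and de-duplicate them,
    then load them into a warehouse table in one go."""
    # Transform: trim whitespace, drop rows without a key, de-duplicate on id
    seen, cleaned = set(), []
    for row in source_rows:
        cust_id = str(row.get("id", "")).strip()
        name = str(row.get("name", "")).strip()
        if not cust_id or cust_id in seen:
            continue  # skip invalid rows and duplicates
        seen.add(cust_id)
        cleaned.append((cust_id, name))

    # Load: fill the warehouse table in a single transaction
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE dim_customer (id TEXT PRIMARY KEY, name TEXT)")
    wh.executemany("INSERT INTO dim_customer VALUES (?, ?)", cleaned)
    wh.commit()
    return wh.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]

rows = [{"id": " 1 ", "name": "Acme"}, {"id": "1", "name": "Acme BV"},
        {"id": "", "name": "orphan"}, {"id": "2", "name": "Globex"}]
print(nightly_etl(rows))  # 2: only the unique, valid customers survive
```

Real jobs of this kind ran against millions of rows, which is exactly why the overnight window was so easily blown.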
The boring image of ETL
The suppliers invented “data integration” to shed this rather boring image but, judging by the Google figures, have not marketed it well. Data integration is not a tool but an architecture: the foundation you need to achieve the Holy Grail of “a single source of truth”.
Data integration can ensure that there is only one enterprise-wide definition for each data element. No more discussions about, for example, what a customer is: somewhere in your integrated environment it is defined unambiguously. Data integration is a dream, and dreaming is allowed, but dreams can turn into nightmares.
Often we do not know exactly what data we have. Only at the moment we try to link data together do we find that it is different from what we had always thought. This is in itself a problem that definitely needs to be addressed, but it upsets the schedule that has been laid down.
Improvements in data integration products
Compared to the traditional ETL products, there are a number of significant improvements in the data integration products. The main areas to which the suppliers have paid attention are:
- Real-time access to data sources and Big Data
- Data Quality, origin of the data (lineage) and data profiles
- Cloud computing and SaaS
- Master Data Management
Essential difference between ETL and Data Integration
An essential difference between traditional ETL and data integration is the ability to tap into real-time transactions and immediately load this data into the data warehouse, or analyze it instantly with a Business Intelligence front-end tool. Not with a separate product with its own metadata and its own user interface, but with a single integrated product.
Many data warehouse initiatives failed in the past because the information that emerged was not current enough and therefore not interesting. It is difficult to create the daily schedule of a factory if you do not know how many staff have called in sick or reported back as recovered. If the information only arrives in the data warehouse tomorrow, you cannot make decisions about production capacity now.
Naturally, not all information needs to be available within five minutes, but certain decisions can only be made on the basis of data that is fully up-to-date. And that is a big advantage of data integration compared to the old ETL.
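The real-time tap described above can be pictured as a change-data-capture loop: instead of a nightly batch, the warehouse repeatedly picks up only the transactions that arrived since the last poll. This is a toy sketch under assumed names; the sequence numbers, event labels and in-memory "warehouse" are all illustrative.

```python
class WarehouseTap:
    """Toy change-data-capture loop: each poll loads only transactions
    that arrived after the previous poll (tracked by a watermark)."""
    def __init__(self):
        self.watermark = 0   # sequence number of the last loaded transaction
        self.warehouse = []  # stands in for the warehouse fact table

    def poll(self, transaction_log):
        fresh = [t for t in transaction_log if t["seq"] > self.watermark]
        for t in fresh:
            self.warehouse.append(t)   # load immediately, no nightly batch
            self.watermark = t["seq"]
        return len(fresh)

log = [{"seq": 1, "event": "sick"}, {"seq": 2, "event": "recovered"}]
tap = WarehouseTap()
print(tap.poll(log))   # 2: both existing transactions are loaded
log.append({"seq": 3, "event": "sick"})
print(tap.poll(log))   # 1: only the new transaction is picked up
```

A production implementation would read a database transaction log or message queue rather than a Python list, but the watermark idea is the same.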
Complexity of the ETL process
The complexity, and therefore the cost, of the ETL process has long been underestimated. Reading old “legacy” files, where both the design and the content are unclear, belongs to the “challenge” category if you are of American origin. A Dutchman who likes plain language usually calls it what it is: a problem.
There are a number of problems. The main problem is probably metadata: what is the content of a data element and where does it come from? Suppose, for example, that somebody sees “turnover” in one file and “revenue” in another, and neatly adds them together to arrive at a new data element called “total turnover”. The problem, however, is that the turnover is stated in euros and the revenue in dollars. “Total turnover” has thereby become a field whose content does not provide information that is fit to support decisions.
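The turnover/revenue pitfall can be made concrete with a small sketch. The amounts, field names and exchange rate here are invented assumptions purely for illustration.

```python
# Naive integration: add the two fields without looking at their metadata.
file_a = {"turnover": 1000.0}   # implicitly in euros
file_b = {"revenue": 1000.0}    # implicitly in dollars

total_naive = file_a["turnover"] + file_b["revenue"]  # 2000.0, but 2000 of what?

# Metadata-aware integration: record the currency with each element
# and convert to a common unit before adding (rate is an assumed example).
USD_TO_EUR = 0.90
elements = [("turnover", 1000.0, "EUR"), ("revenue", 1000.0, "USD")]

def to_eur(amount, currency):
    return amount if currency == "EUR" else amount * USD_TO_EUR

total_turnover_eur = sum(to_eur(amount, cur) for _, amount, cur in elements)
print(total_naive)          # 2000.0  (a meaningless mix of units)
print(total_turnover_eur)   # 1900.0  (euros, fit to support decisions)
```

The point is not the conversion itself but that the currency lives in the metadata; without it, the naive sum looks perfectly plausible.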
After many bad experiences in this field, most ETL tools are now equipped with extensive options to monitor the quality of the data, to record its origin (lineage), to identify strange data by means of profiling, and to apply corrections. The only thing that remains is to actually use these facilities!
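What such profiling amounts to can be shown with a minimal sketch: count nulls and distinct values per column and flag values whose pattern deviates from the rest. The sample postcodes below are made up for the example.

```python
from collections import Counter

def profile(values):
    """Minimal data-profiling pass: count nulls and distinct values,
    and tally value patterns so 'strange' data stands out."""
    non_null = [v for v in values if v not in (None, "")]
    patterns = Counter("digits" if str(v).isdigit() else "other"
                       for v in non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "patterns": dict(patterns),
    }

postcodes = ["1012", "2511", None, "35XX", "1012", ""]
print(profile(postcodes))
# {'rows': 6, 'nulls': 2, 'distinct': 3, 'patterns': {'digits': 3, 'other': 1}}
```

The single “other” pattern is exactly the kind of outlier a profiling report would surface for a human to inspect.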
Cloud computing and SaaS
Cloud computing and SaaS (Software as a Service) are two concepts that often go hand in hand. This combination is interesting primarily from a pricing perspective. In our ETL Tools & Data Integration Survey, we asked all the suppliers to submit a tender for two hardware configurations: a smallish one based on a Windows server and a medium-sized one based on a UNIX server.
The price differences between the suppliers, insofar as they were willing to publish prices, were very large (and that was before hardware, installation and configuration). SaaS is generally seen as a much fairer pricing model: you pay for what you use, and if it appears that you need more capacity, you can easily get it. The Cloud also gives much more flexibility than your own data centre, without penalties for outsourcing the software and without having to buy hardware, arrange backups, and so on. However, security must of course be taken into account. In many cases sensitive company information is involved, and it is often difficult to explain that you really have no idea where the data is located. Another consideration is that it is sometimes not easy to link software from different suppliers in the Cloud; this can significantly limit the choice of supplier.
Master Data Management
Master Data Management (MDM) is also a relatively new phenomenon within the ETL/Data Integration infrastructure. One of the biggest data quality problems comes from the fact that certain important information is maintained in more than one place. A commonly used example is the contact details of employees, suppliers and customers:
- The address must be accurate, that is to say that it exists;
- The address must have the correct postcode;
- The street name must be spelled correctly;
- The address must be up to date: the person has not moved.
Master Data Management can ensure that key data in different files is synchronized and that there is one source of the truth.
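A toy sketch of that synchronization, under assumed record shapes: each person's most recently updated address is elected the golden record and pushed back to every system. The field names and sample data are illustrative, not from any real MDM product.

```python
def merge_master(records):
    """Toy MDM merge: per person, keep the most recently updated
    address as the single source of truth, then push it back everywhere."""
    master = {}
    for rec in records:
        key = rec["person_id"]
        if key not in master or rec["updated"] > master[key]["updated"]:
            master[key] = rec  # survivorship rule: latest update wins
    # Synchronize: every system's copy now carries the golden address
    return [dict(rec, address=master[rec["person_id"]]["address"])
            for rec in records]

crm = {"person_id": "p1", "address": "Old Street 1", "updated": 1}
erp = {"person_id": "p1", "address": "New Street 2", "updated": 2}
synced = merge_master([crm, erp])
print([r["address"] for r in synced])  # ['New Street 2', 'New Street 2']
```

Real MDM hubs add matching (is this the same person?), validation against postcode tables, and richer survivorship rules, but the core idea is this single golden record.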