During our interviews with the different ETL vendors at the beginning of 2018, we saw a number of general trends and topics emerge. The vendors operating on the cutting edge, in particular, are very articulate about where they see ETL and Data Integration going.
Apart from the “hard”, technical developments, we can see that a small number of vendors have a keen eye for the “soft” side of ETL and Data Integration: data governance and the role of human intervention in maintaining data quality. This was perhaps the most surprising finding of our survey. We believe that this is also where vendors can truly distinguish themselves.
With the rising importance of Data Integration and maturing concepts like Big Data, a strong technical solution is a prerequisite to compete in the market. However, with increasing amounts of data and a greater reliance on data for operational processes and decision-making, data quality becomes a strategic issue.
Maintaining data quality is first and foremost a process that should be embedded in business processes, and the responsibility for its execution can never be delegated to computers. Some vendors, however, are doing a wonderful job developing the tools that can be used to support the employees responsible for these processes. This, and other developments, will be discussed below.
1. Data Governance
Some vendors are realizing that data quality management is an integral part of data integration. They recognize the need for policies on metadata, visibility, profiling, cleansing, and validation. As stated earlier, these are primarily business processes. These business processes for maintaining data quality are called Data Governance.
With the emergence of Data Governance, new roles are being defined in the corporation that deal specifically with overseeing data changes and data entry into the corporate data system. One example is the Data Steward, who is responsible for managing a set of data elements, both their content and their metadata.
Some vendors are providing software solutions to support these business processes around Data Governance.
Vendors have different names for this vision of data governance, ranging from “total information management” (SAP) to “data lake” and “data refinery” (IBM and Pentaho) to “Data Network” (SAS).
2. Dealing with sensitive data
A long-standing issue in the BI world is that BI and ETL developers usually have full access to all the data they work with. Restricting that access prevents developers from doing their job properly, but granting full access touches, or even exceeds, legal (and ethical) limits.
Some vendors recognize the need to obscure highly sensitive data from developers without frustrating their work. To let developers work on sensitive data without exposing it, vendors are building data masking features into their software.
Furthermore, data masking can be used when loading and analyzing data sets that contain personal information. Privacy legislation often prohibits exposing individual information without probable cause. By masking the data, analyses can be made without invading the privacy of individuals.
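To make the idea concrete, here is a minimal sketch of one common masking technique, deterministic pseudonymization with a keyed hash. All names here (the key, the field names, the record layout) are our own illustration, not any vendor's API. Because the same input always maps to the same token, joins and frequency analyses still work on the masked data, but the raw identifiers stay hidden.

```python
import hmac
import hashlib

# Hypothetical masking key; in practice this would live in a secrets
# store, not in source code.
MASKING_KEY = b"rotate-me-regularly"

def mask(value: str) -> str:
    """Deterministically pseudonymize a sensitive value.

    The same input always yields the same token, so group-bys and
    joins still work, but the original value is not recoverable
    without the key.
    """
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of the record with the sensitive fields masked."""
    return {
        k: mask(v) if k in sensitive_fields else v
        for k, v in record.items()
    }

customers = [
    {"id": "1", "email": "alice@example.com", "country": "NL"},
    {"id": "2", "email": "bob@example.com", "country": "BE"},
    {"id": "3", "email": "alice@example.com", "country": "NL"},
]

masked = [mask_record(c, {"email"}) for c in customers]
# Duplicate emails still collide after masking, so a developer can
# count distinct customers without ever seeing an address.
```

Note the trade-off: deterministic masking preserves analytical utility but is weaker than random tokenization, since repeated values remain linkable.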
3. Pushing workload from IT to business users
Some vendors recognize that the workload on IT departments is growing rapidly as the importance of data increases, and that these departments cannot staff up accordingly. So some vendors are developing ways to pull in business users to do work that would previously have been done by IT staff. On the one hand this means enabling business users to score, maintain, and improve the quality of “their” data. On the other hand it means that, through a more user-friendly interface, business users can create their own extracts from “Big Data” data sets. SAS, for example, calls this “transforming business users into information experts”. IBM is also trying to shift data integration from IT specialists to business users, with tooling like InfoSphere Data Click. SAP is likewise simplifying its solution to bring in business users.
4. Relational databases no longer king
Relational databases are no longer the default option for data storage, because of the limited scalability of “traditional” databases. Every vendor is working hard to connect to Hadoop file systems, implementing distributed processing models such as MapReduce, which borrows its map and reduce operations from functional programming.
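The MapReduce model the vendors are adopting can be sketched in a few lines. This is a local, single-process illustration of the three phases (map, shuffle, reduce) applied to a word count, not Hadoop itself; in a real cluster each phase runs distributed across many nodes.

```python
from collections import defaultdict
from functools import reduce

def map_phase(documents):
    """Map: emit (word, 1) pairs from each document, as Hadoop mappers do."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key (Hadoop does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: fold each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}

docs = ["big data is big", "data moves fast"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "is": 1, "moves": 1, "fast": 1}
```

The functional shape is the point: because map and reduce are side-effect-free per key, the framework can partition the work across thousands of machines without the programmer managing distribution.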
Vendors are also developing solutions to treat “unstructured” data on Hadoop as if it consisted of tables in a relational database. Most vendors are working hard on connectors and adapters for non-relational databases such as HBase, implementing the Pig language and supporting SQL-like dialects such as HiveQL. Some vendors are setting up marketplaces where third parties can develop adapters and connectors, to speed up development.
Since the sheer size of Big Data makes it impossible to move the data around with ETL tooling, we asked vendors how they see the relationship between their ETL software and Big Data. What is the use of ETL software when a data set is so large that it cannot be extracted, transformed, or loaded? Surprisingly, the vendors had no boilerplate answers to this question. Most improvised answers centered on the possibility of extracting subsets of data to speed up further analysis. A notable exception was SAS, which has developed the vision of transforming data where it resides, which means push-down query execution: SAS embeds SAS processes in, for example, Hadoop file systems. Pentaho also provides in-Hadoop execution. We see this as the beginning of a trend in which the “L” in ETL will lose significance.
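The difference between classic extract-then-transform and push-down execution can be illustrated with a small sketch. We use an in-memory SQLite database purely as a stand-in for the engine where the data lives (Hive, Impala, and so on); the table and column names are our own invention. In the push-down variant the engine does the aggregation and only the small result set crosses the wire.

```python
import sqlite3

# Stand-in for a large remote store; sqlite3 here just plays the role
# of "the engine where the data resides".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("EU", 10.0), ("EU", 5.0), ("US", 7.5)])

def pulled_aggregate(conn):
    """Classic ETL: extract every row, then transform locally."""
    rows = conn.execute("SELECT region, amount FROM events").fetchall()
    totals = {}
    for region, amount in rows:        # every row crossed the wire
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def pushed_down_aggregate(conn):
    """Push-down: the engine aggregates; only the result set moves."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM events GROUP BY region").fetchall()
    return dict(rows)

# Same answer either way; the difference is where the work happens
# and how much data has to move.
assert pulled_aggregate(conn) == pushed_down_aggregate(conn)
```

With three rows this distinction is cosmetic; with three billion rows it decides whether the job is feasible at all, which is exactly the SAS and Pentaho argument.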
5. Incremental loading
Increasing amounts of data make regular full loads less attractive: data traffic is increasing, load windows are shrinking, and in some cases (near) real-time updating is necessary.
Most vendors are working hard on their Change Data Capture (CDC) solutions to support incremental and real-time data processing. We see different approaches in the market, from using database triggers, to parsing database logs, to building complete proprietary CDC solutions that bypass the native database systems for more control over the capture process.
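The simplest incremental-loading approach, often used alongside full CDC, is a high-watermark extract on a modification timestamp. The sketch below is our own illustration (table, column, and watermark names are hypothetical), again using SQLite as a stand-in for the source system: each run fetches only rows changed since the previous run and persists a new watermark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2018-01-01T10:00:00", 20.0),
    (2, "2018-01-02T09:30:00", 35.0),
    (3, "2018-01-03T14:15:00", 12.5),
])

def incremental_extract(conn, last_watermark):
    """Fetch only rows changed since the previous run, and return the
    new watermark to store for the next run.

    ISO-8601 timestamps sort correctly as strings, so a plain
    comparison works here.
    """
    rows = conn.execute(
        "SELECT id, updated_at, total FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,)).fetchall()
    new_watermark = rows[-1][1] if rows else last_watermark
    return rows, new_watermark

# Watermark left behind by the previous (full) load:
changed, watermark = incremental_extract(conn, "2018-01-01T23:59:59")
# Only orders 2 and 3 are transferred; order 1 is skipped.
```

Note the limitation that makes vendors invest in log-based CDC instead: a watermark column cannot detect deletes, and it misses rows whose timestamp is not reliably maintained.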
Some vendors are developing concepts to deal with the Internet of Things. The Internet of Things generates constant streams of (telemetric) data from instruments. This makes low-latency data integration desirable, if the data is to be presented through Business Intelligence tooling for immediate action, or if the data is driving BPM (Business Process Management) software. It also means that ETL, which was previously confined to office automation, is now touching on SCADA (Supervisory Control And Data Acquisition) systems.
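What low-latency processing of telemetry streams means in practice can be sketched briefly: each reading is evaluated the moment it arrives, rather than waiting for a nightly batch. The class and threshold below are our own illustrative example, not any vendor's product feature.

```python
from collections import deque

class RollingAlert:
    """Flag when the moving average of the last `window` readings
    exceeds a threshold: a minimal per-event check, evaluated as each
    reading arrives instead of in a nightly batch."""

    def __init__(self, window: int, threshold: float):
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def ingest(self, value: float) -> bool:
        """Process one incoming reading; return True if it triggers an alert."""
        self.readings.append(value)
        avg = sum(self.readings) / len(self.readings)
        return avg > self.threshold

# A hypothetical temperature sensor feeding a BPM-style alert rule:
monitor = RollingAlert(window=3, threshold=80.0)
telemetry = [70.0, 75.0, 82.0, 90.0, 95.0]
alerts = [monitor.ingest(v) for v in telemetry]
# The rolling average crosses 80.0 only for the last two readings.
```

The windowing smooths out single-reading spikes, which matters when the downstream consumer is BPM software that triggers an actual business process.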
6. Cloud based data integration
Having all servers and software run on-premises is no longer the default option either. Many vendors are exploring the possibilities of cloud-based Data Integration. Some vendors are completely rewriting their software from scratch in Java so it can run in the cloud. Others use Amazon Elastic Compute Cloud (EC2) to host virtual servers running their software.
There is general agreement among vendors that most instances will continue to run on-premises for the foreseeable future, but especially for mid-market companies, cloud solutions will be an attractive and scalable option.
The ETL market is definitely maturing. Having a strong technical solution is now a prerequisite to compete in the market. Every vendor is working very hard to provide connectivity to all data sources, especially to the emerging non-traditional (big) data sources.
We also see that different vendors are developing different emphases. Some operate very much in the technical realm and their focus is providing the most connectivity and the utmost interchangeability of data. Others are developing a strong vision on the quality, maintenance and application of reliable data.
These are exciting times, and we at Passionned Group are eager to see where all these interesting developments will lead. Please refer to the full text of our ETL Tools & Data Integration survey to see which vendor is going where.