In this article I’m going to share some of my experiences from the SAS Analyst Conference, May 27 – 29, 2015, in Marbella. I had a one-on-one meeting with Scott Gidley, senior director Research and Development, and we had an interesting discussion about the future of data management, Hadoop, Big Data, data warehouses, and Data Lakes.
I see a clear trend in data management and integration: the role of the data warehouse is changing rapidly and a lot of analytics will be done without it. Data warehouse developers should be warned!
The changing role of the DWH
Scott Gidley: “We are seeing customers looking to limit investment in traditional data warehouses.” More and more analytics will be done without a data warehouse. The warehouse will remain mainly for corporate reporting and compliance purposes, but for advanced analytics, especially in real time, connections are made directly to the data sources where possible. The data warehouse remains the final destination for corporate data, and slowly changing dimensions (type 2) will stay for reasons of compliance and traceability. I think many (traditional) data warehouse developers need to keep a close eye on this or they will miss the Big Data boat.
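For readers less familiar with the term, here is a minimal sketch of what a type 2 slowly changing dimension does: instead of overwriting a changed attribute, it closes the old row and adds a new one, which is exactly why it serves compliance and traceability. The record layout, the function name `apply_scd2`, and the sample values are my own assumptions for illustration and have nothing to do with any SAS product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    # Hypothetical dimension row; field names are illustrative only.
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[CustomerVersion], customer_id: int,
               new_city: str, change_date: date) -> None:
    """Close the current version and append a new one if the city changed."""
    current = next(
        (v for v in history
         if v.customer_id == customer_id and v.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == new_city:
            return  # nothing changed, keep the current version open
        current.valid_to = change_date  # close the old row, never overwrite it
    history.append(CustomerVersion(customer_id, new_city, change_date))


history: List[CustomerVersion] = []
apply_scd2(history, 1, "Marbella", date(2014, 1, 1))
apply_scd2(history, 1, "Madrid", date(2015, 5, 27))

for version in history:
    print(version)
```

After the second call the history holds two rows for the same customer: the closed Marbella row and the open Madrid row, so any past report can be reproduced as it looked at the time.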
In addition I observed a lot of other significant trends:
Embedding analytics in Hadoop
Embedding SAS analytics in Hadoop is quite a unique approach. Most BI companies try to extract data from Hadoop but SAS injects intelligence and lets Hadoop do the processing. There is a philosophy behind it: Hadoop runs on every platform. “We are pretty excited about having now the ability to put data management and analytics where the big data resides (Hadoop) rather than having to pull it out of the data lake, process it and put it back,” says Scott. I think this approach and their tight integration with Hadoop can broaden their market significantly. “A lot of data is in Hadoop and people are not aware of it,” Scott continues, “but once data has landed in the Hadoop ecosystem it becomes difficult to govern and secure”.
Randy Guard, vice president Product Management: “We will be the analytic tool of choice in a Hadoop cluster”. He continues explaining his vision: “put the analytics where the data is”. Their goal is clear:
- Analytics workload of choice in Hadoop
- Data integration toolset of choice for Hadoop
- Visualization product of choice for Hadoop
These are nice principles, but sometimes you first need to combine data from multiple internal and external sources to reveal the million-dollar insights. That is something to think about. I am not sure whether that is possible within Hadoop, or whether it is a wise thing to do.
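The “put the analytics where the data is” idea quoted above can be illustrated with a toy example. This is plain Python, not Hadoop or SAS code, and the partitions and function names are hypothetical: each list stands for the rows stored on one node, the summary function runs "locally" on each node, and only the small summaries travel to the place where they are combined.

```python
from typing import List, Tuple

# Hypothetical data: each inner list stands for the rows held by one node.
partitions: List[List[float]] = [
    [10.0, 20.0, 30.0],
    [5.0, 15.0],
    [40.0],
]


def local_summary(rows: List[float]) -> Tuple[float, int]:
    """Runs where the data lives and returns only (sum, count)."""
    return sum(rows), len(rows)


def combine(summaries: List[Tuple[float, int]]) -> float:
    """Merges the small per-node summaries into one overall mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


# Push-down style: one tiny summary per node is moved, not the rows.
summaries = [local_summary(p) for p in partitions]
mean_pushed_down = combine(summaries)

# Extract-first style, for comparison: every row is moved before computing.
all_rows = [x for p in partitions for x in p]
mean_extracted = sum(all_rows) / len(all_rows)

print(mean_pushed_down, mean_extracted)
```

Both routes give the same mean, but the first moves three small tuples instead of all six rows. With real data volumes that difference is the whole argument for processing inside the cluster.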
The need for OLAP decreases quickly
Today a lot of data processing is done in memory, and computers (in a grid) are so fast that the need for OLAP cubes and dedicated OLAP tools is decreasing quickly.
Data Lakes and the Big Data Management challenge
The ratio of big data to regular data is changing significantly. I think that 10 years ago 98% of the data relevant for decision making was structured and of a manageable size. Today perhaps 50% of the relevant data is structured, and the other part is too big to be stored in a data warehouse.
Figure 1: SAS solving the Big Data management challenge with Hadoop
Companies, especially technology companies without a history of traditional data warehousing, are seriously considering storing all their data in Hadoop, which can act as a data lake, and skipping costly appliances such as Teradata or data warehouse processing. They are the early adopters of Hadoop data lakes, and Scott thinks other industries will embrace data lakes quickly because they provide value (cost advantages and flexibility).
Crowdsourcing of data
The importance of open data sets is increasing. Governments, hospitals, and also insurance companies are preparing very relevant data sets and publishing them on the internet so they can be used by everyone who needs them. They are open (accessible and described) and, most of the time, free. Users can “like” (open) data sources and data sets, and other people can see that and conclude: “Okay, I can trust this data source”. The one version of the truth we aim for will then be defined by the crowd and no longer by the controller.
Data models and tagging
With Big Data solutions being discussed and implemented all over the world, the need to rethink the role of data modeling is obvious. Big Data is often unstructured, and if it has a structure, the owner or publisher might alter it quickly. We have to move away from the idea that we can model everything; that is impossible nowadays. The solution SAS proposes for unstructured data is tagging the content and meaning of data elements (metadata). This could be done by the owner, the users, and the publishers, or even automatically with some kind of data profiling. SAS will start using data profiling to tag data as it arrives. This feature will be in the upcoming release of Data Loader.
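To make the idea of tagging through data profiling a bit more concrete, here is a minimal sketch of how incoming columns could be tagged automatically by looking at their values. It is my own illustration, not how Data Loader works: the patterns, the tag names, the 80% threshold, and the sample data are all assumptions.

```python
import re
from typing import Dict, List

# Hypothetical profiling rules: tag name -> pattern a value must match.
RULES = (
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("numeric", re.compile(r"^-?\d+(\.\d+)?$")),
)


def profile_column(values: List[str], threshold: float = 0.8) -> List[str]:
    """Return tags for one column based on the share of matching values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return ["empty"]
    tags = []
    for tag, pattern in RULES:
        share = sum(1 for v in non_empty if pattern.match(v)) / len(non_empty)
        if share >= threshold:
            tags.append(tag)
    if not tags:
        tags.append("free_text")  # no rule matched often enough
    if len(non_empty) < len(values):
        tags.append("has_missing")  # a simple data quality hint
    return tags


def tag_dataset(columns: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Profile every column of a data set as it arrives."""
    return {name: profile_column(vals) for name, vals in columns.items()}


sample = {
    "contact": ["a@example.com", "b@example.org", "c@example.net"],
    "signup": ["2015-05-27", "2015-05-28", ""],
    "amount": ["10", "12.5", "-3"],
    "note": ["called twice", "n/a", "prefers email"],
}
tags = tag_dataset(sample)
print(tags)
```

No model of the data is defined up front. The tags are derived from the content itself, so they can simply be recomputed when the publisher changes the structure.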
The big trend behind all this
Many organizations today operate in a very challenging and dynamic environment. Today’s standards will become obsolete tomorrow. We will see more and more unstructured data and very large data sets ready to be analyzed in real time. We have no time to model the data and no time to store the data, because the decisions can’t wait.
Organizations must act now, and adequately, on what they see, or they will be out of business tomorrow. With the rise of big data, data quality will be a bigger challenge than ever. “Got data? You’ll have a data quality problem. Got big data? You’ll have a big data quality problem” says Andy Bitterer on Twitter. I think this is 100% true for human-generated content, but I doubt whether it is true for machine- or device-generated content. Whoever or whatever generated the content, SAS addresses the big data quality challenge with its Data Quality for Hadoop solution.