SAS & Big Data management | Hadoop Embedded Analytics

SAS solving the Big Data management challenge with Hadoop


Passionned Group is a leading analyst and consultancy firm specialized in Business Analytics and Business Intelligence. Our passionate advisors assist many organizations in selecting the best Business Analytics Software and applications. Every two years we organize the election of the smartest company.

The future of data management

In this article I’m going to share some of my experiences from the SAS Analyst Conference, May 27 – 29, 2015, in Marbella. I had a one-on-one meeting with Scott Gidley, senior director Research and Development, and we had an interesting discussion about the future of data management, Hadoop, Big Data, data warehouses, and Data Lakes.

I see a clear trend in data management and integration: the role of the data warehouse is changing rapidly and a lot of analytics will be done without it. Data warehouse developers should be warned!

The changing role of the DWH

Scott Gidley: “We are seeing customers looking to limit investment in traditional data warehouses.” More and more analytics will be done without a data warehouse. The warehouse stays mainly for corporate reporting and compliance purposes, but for advanced analytics, especially real-time analytics, connections are made directly to the data sources where possible. The data warehouse remains the final destination for corporate data, and slowly changing dimensions (type 2) will stay for reasons of compliance and traceability. I think a lot of (traditional) data warehouse developers need to keep a close eye on this or they will miss the Big Data boat.
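
For readers less familiar with the term: a type 2 slowly changing dimension keeps every historical version of a record instead of overwriting it, which is what makes it useful for traceability. Below is a minimal sketch in Python, not taken from SAS or from any warehouse product. The class `CustomerVersion`, the functions `apply_change` and `city_on`, and the sample data are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    """One historical version of a customer row (hypothetical example table)."""
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[CustomerVersion], customer_id: int,
                 new_city: str, change_date: date) -> None:
    """Type 2 behaviour: close the current version and append a new one."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.city == new_city:
                return  # nothing changed, keep the current version
            row.valid_to = change_date  # close the old version, do not delete it
    history.append(CustomerVersion(customer_id, new_city, change_date))


def city_on(history: List[CustomerVersion], customer_id: int,
            day: date) -> Optional[str]:
    """Look up which city was valid for a customer on a given day."""
    for row in history:
        if row.customer_id != customer_id:
            continue
        if row.valid_from <= day and (row.valid_to is None or day < row.valid_to):
            return row.city
    return None


# Illustrative data only.
history: List[CustomerVersion] = []
apply_change(history, 1, "Utrecht", date(2014, 1, 1))
apply_change(history, 1, "Marbella", date(2015, 5, 27))
print(city_on(history, 1, date(2014, 6, 1)))
print(city_on(history, 1, date(2015, 6, 1)))
```

Because old versions are closed rather than overwritten, a report can always be reproduced as it looked on an earlier date, which is the compliance argument for keeping this pattern in the warehouse.
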
In addition I observed a lot of other significant trends:

Embedding analytics in Hadoop

Embedding SAS analytics in Hadoop is quite a unique approach. Most BI companies try to extract data from Hadoop but SAS injects intelligence and lets Hadoop do the processing. There is a philosophy behind it: Hadoop runs on every platform. “We are pretty excited about having now the ability to put data management and analytics where the big data resides (Hadoop) rather than having to pull it out of the data lake, process it and put it back,” says Scott. I think this approach and their tight integration with Hadoop can broaden their market significantly. “A lot of data is in Hadoop and people are not aware of it,” Scott continues, “but once data has landed in the Hadoop ecosystem it becomes difficult to govern and secure”.
Randy Guard, vice president Product Management: “We will be the analytic tool of choice in a Hadoop cluster”. He continues explaining his vision: “put the analytics where the data is”. Their goal is clear:

  • Analytics workload of choice in Hadoop
  • Data integration toolset of choice for Hadoop
  • Visualization product of choice for Hadoop

These are nice principles, but sometimes you first need to combine data from multiple internal and external sources to reveal the million-dollar insights. That is something to think about. We are not sure whether that is possible with Hadoop, or whether it is a wise thing to do.
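
The difference between pulling data out of the cluster and putting “the analytics where the data is” can be illustrated with a small sketch. This is plain Python and does not use Hadoop or any SAS API. The list `PARTITIONS` stands in for data blocks on three hypothetical nodes, and all function names are my own assumptions for the example. Both approaches produce the same totals, but the second one only moves small partial results.

```python
from collections import Counter

# Hypothetical cluster: each inner list is the partition held by one node,
# with rows of (region, amount). Illustrative data only.
PARTITIONS = [
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
    [("south", 7.5), ("east", 1.0)],
    [("north", 4.0), ("east", 3.0), ("east", 6.0)],
]


def pull_then_aggregate(partitions):
    """Extract every row to the client first, then aggregate there."""
    moved = [row for part in partitions for row in part]
    totals = Counter()
    for region, amount in moved:
        totals[region] += amount
    return dict(totals), len(moved)  # totals, number of records moved


def local_totals(partition):
    """The piece of analytics that would run on the node holding the data."""
    totals = Counter()
    for region, amount in partition:
        totals[region] += amount
    return totals


def aggregate_in_place(partitions):
    """Run the aggregation per node and move only the partial totals."""
    partials = [local_totals(part) for part in partitions]
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds values per key
    moved = sum(len(partial) for partial in partials)
    return dict(totals), moved


pulled_totals, pulled_moved = pull_then_aggregate(PARTITIONS)
pushed_totals, pushed_moved = aggregate_in_place(PARTITIONS)
print(pulled_totals == pushed_totals, pulled_moved, pushed_moved)
```

With only eight rows the saving is trivial, but the number of records moved in the second approach grows with the number of regions per node rather than with the number of rows, which is the point of the principle.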

The need for OLAP decreases quickly

Today a lot of data processing is done in-memory and computers (in a grid) are so fast that the need for OLAP cubes and dedicated OLAP tools decreases quickly.

Data Lakes and the Big Data Management challenge

The ratio of big data to regular data is changing significantly. I think that 10 years ago 98% of the relevant data for decision making was structured and had a manageable size; today perhaps 50% of the relevant data is structured and the other part is too big to be stored in a data warehouse.
Figure 1: SAS solving the Big Data management challenge with Hadoop
Companies, especially technology companies that don’t have a history of traditional data warehousing, are seriously considering storing all their data in Hadoop, which can act as a data lake, skipping costly appliances like Teradata or data warehouse processing. They are the early adopters of Hadoop data lakes, and Scott thinks that data lakes will quickly be embraced by other industries because they provide value (cost advantages and flexibility).

Crowdsourcing of data

The importance of open data sets is increasing. Governments, hospitals, and also insurance companies are preparing very relevant data sets and publishing them on the internet so they can be used by everyone who needs them. They are open (accessible and described) and, most of the time, free. Users can “like” (open) data sources and data sets, and other people can see that: “Okay, I can trust this data source”. The one version of the truth we aim for will then be defined by the crowd and no longer by the controller.

Data models and tagging

With Big Data solutions being discussed and implemented all over the world, the need to rethink the role of data modeling is obvious. Big Data is often unstructured, and if it has a structure, it might quickly be altered by the owner or publisher. We have to move away from the idea that we can model everything; that is impossible nowadays. The solution SAS proposes for unstructured data is tagging the content and meaning of data elements (metadata). This could be done by the owner, the users, and the publishers, or even automatically with some kind of data profiling. SAS will start using data profiling to tag data as it arrives; this feature will be in the upcoming release of Data Loader.
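
To give an idea of what tagging by data profiling could look like, here is a minimal sketch in Python. It is not how SAS Data Loader works, which I have not seen in detail. The rules, the tag names, the 80% threshold, and the functions `profile_column` and `tag_dataset` are all assumptions made for this example.

```python
import re

# Hypothetical profiling rules: a tag name and a pattern the values should match.
RULES = [
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("numeric", re.compile(r"^-?\d+(\.\d+)?$")),
]


def profile_column(values, threshold=0.8):
    """Derive tags for one column from simple statistics on its values."""
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return ["empty"]

    # Content tags: assigned when enough of the values match a rule.
    tags = []
    for tag, pattern in RULES:
        share = sum(1 for v in present if pattern.match(v)) / len(present)
        if share >= threshold:
            tags.append(tag)
    if not tags:
        tags.append("free_text")

    # Quality tags: missing values and uniqueness.
    if len(present) < len(values):
        tags.append("has_missing")
    if len(set(present)) == len(present):
        tags.append("unique")
    return tags


def tag_dataset(columns):
    """Tag every column of an incoming data set as it arrives."""
    return {name: profile_column(values) for name, values in columns.items()}


# Illustrative data only.
incoming = {
    "contact": ["a@example.com", "b@example.org", "c@example.net"],
    "signup": ["2015-05-27", "2015-05-28", None, "2015-05-29", "2015-05-29"],
    "comment": ["great talk", "too long", "great talk"],
}
tags = tag_dataset(incoming)
print(tags)
```

The tags describe what the data looks like without requiring anyone to model it up front, and owners or users could add their own tags next to the automatically derived ones.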

The big trend behind all this

Many organizations today operate in a very challenging and dynamic environment. Today’s standards will become obsolete tomorrow. We will see more and more unstructured data and very large data sets ready to be analyzed in real time. We have no time to model the data and no time to store the data; the decisions can’t wait.
Organizations must act (adequately) now on what they see or they will be out of business tomorrow. With the rise of big data, data quality will be a bigger challenge than ever. “Got data? You’ll have a data quality problem. Got big data? You’ll have a big data quality problem,” says Andy Bitterer on Twitter. I think this is 100% true for human-generated content, but I doubt whether it is true for machine- or device-generated content. Whoever or whatever generated the content, SAS addresses the Big Data quality challenge with its Data Quality for Hadoop solution.

Daan van Beek is the managing director of the Passionned Group, co-author of the ‘ETL Tools & Data Integration Survey’ and can be reached on LinkedIn.
