The term ‘Big Data’ appears in virtually every article about information technology and management information. The candidates for the Dutch BI Award have also shown in recent years that they are serious about Big Data. But what exactly is Big Data? Many, including myself, struggle with the term. There is a lot of confusion about Big Data because there is no generally accepted definition. We all know that it has to do with large volumes of data, fast processing, and data that comes in several different forms. But this doesn’t say much: what is a large volume? What is fast, and what counts as several different forms?
Different definitions in use
For some, Big Data is well-structured sensor data or other machine-generated data; for others, it is unstructured text originating from social media; still others describe it as the semi-structured data contained in, for example, weblogs.
“Big” is a relative concept
The fact that the word “big” is a relative concept does not make it any easier. Something that is a lot for a European company may be seen as average for an American company of the same size. And is it really about the volume of data? Or is it more about what we can do with the data, such as analysis, which has little to do with volume?
The Vs of Big Data
To describe Big Data, the market usually uses the Vs (Volume, Velocity, Variety, Variation, Visibility, Value). I’ve lost count, and the more Vs there are, the less clear the definition becomes. I’d like to say “enough is enough”, and this also applies to data volume.
The result of an analysis is not necessarily better with increased data volume. The quality of the data is usually much more important than the amount. Several definitions have now appeared, but I have yet to see an acceptable one, so the confusion about the Big Data concept remains.
What is a Big Data system?
I will try to look at various Big Data systems from different perspectives and to shed light on the Big Data issue. Processing large volumes of data is definitely the most common characteristic of Big Data systems.
But there is another: most such systems combine the features of production systems and BI systems. In essence, any Big Data system is a production system because it collects and stores new data, and it is simultaneously a BI system because the new data is intended not to support business processes but to analyze them.
Business process support
By “new data” I mean, primarily, data that has not been previously collected and stored within the organization and is mostly a new type of data. For example: a retailer’s Big Data system collects data from a camera system to find out how customers walk through their store.
Or a multinational corporation collects unstructured data from social media to see what people think and write about it. Traditionally, new data is registered and processed by production systems, such as general ledger, cash management, and claims-processing systems. However, these systems are not designed for analytics, but to support business processes.
In fact, when they were designed, no one thought about using them for analytics; they only had to make data entry possible. That’s why it is sometimes extremely difficult to build a BI system that extracts the appropriate data from the production databases for analytics or reporting: staging areas must be set up, ETL and replication processes must be designed, and so on.
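To give an idea of the extra plumbing involved, here is a minimal sketch in Python using the standard sqlite3 module. It is only an illustration under stated assumptions: the production table `claims`, the staging table `staging_claims`, the reporting table `report_claims_by_region`, and both function names are hypothetical, and a real environment would use separate databases and a proper ETL tool.

```python
import sqlite3


def demo_connection():
    """Create an in-memory stand-in for a production database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO claims (region, amount) VALUES (?, ?)",
        [("north", 100.0), ("south", 250.0), ("north", 50.0)],
    )
    return conn


def run_etl(conn):
    """Copy production rows to a staging table, then aggregate them for reporting."""
    cur = conn.cursor()
    # Extract: snapshot the production rows into a staging area, so that
    # reporting queries do not have to read the production table directly.
    cur.execute("DROP TABLE IF EXISTS staging_claims")
    cur.execute("CREATE TABLE staging_claims AS SELECT region, amount FROM claims")
    # Transform and load: aggregate per region into a reporting table.
    cur.execute("DROP TABLE IF EXISTS report_claims_by_region")
    cur.execute(
        "CREATE TABLE report_claims_by_region AS "
        "SELECT region, COUNT(*) AS n_claims, SUM(amount) AS total_amount "
        "FROM staging_claims GROUP BY region"
    )
    conn.commit()
    return cur.execute(
        "SELECT region, n_claims, total_amount "
        "FROM report_claims_by_region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    for row in run_etl(demo_connection()):
        print(row)
```

Even in this toy version, two extra tables and a copy step are needed before a single report can be produced, and none of it was part of the production system’s original design.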
Big Data systems are hybrid systems
It’s still going on: the developers of new production systems still fail to see that the organization may need to use the data for other purposes, such as analysis. What makes Big Data systems so special is that they are hybrid systems; that is, they are production systems and BI systems at the same time. Most of them collect enormous amounts of data in a form that is especially suitable for the required type of analytics.
Maybe we should just redefine the term Big Data. Let’s start by no longer associating the word “big” with a certain amount. Instead, in the tradition of IT, let’s make an acronym out of it: Business Intelligence Generated. That’s what BIG Data stands for: data that is specially generated and stored for analysis.
Hence, a BIG Data System is a system that generates, collects, stores, and processes data to support the primary goal of business intelligence. It follows that BIG Data is data that is managed by a BIG Data system. Redefining the term Big Data hopefully clarifies the meaning of this promising category of systems and ends the confusion.
Rick F. van der Lans is an independent consultant, author, and presenter in the fields of data warehousing, Business Intelligence, application integration, and database technology.