A recent study by one of the analyst firms tells us that “in 2011 the world will create a staggering 1.8 zettabytes” of data and that “by 2020 the world will generate 50 times the amount of information [now]”.
In this article we explore the three biggest challenges of using big data and what to do about them: intelligent filtering, outstanding performance and good data visualization. The main question is whether the business intelligence software we use today is capable of tackling these challenges.
Big data is becoming a serious business
Big data is, so to speak, hot, and it is becoming a serious business. The amount of data may double every two years, but does the amount of valuable information double at the same rate? According to IBM’s vice president Rod Smith, also known as “Mr. Big Data”: “maybe 90 percent of it is not very useful”. In general, we may assume that the more complex the data, the bigger the dataset and the more difficult it is to derive information from it. We need to manage big data, that’s for sure: the other 10 percent may be of very great value.
But which technologies and techniques can be used to be more successful with big data? Does business intelligence software offer a solution? Do we need special analytical tools on top of it? Or is it simply a matter of filtering the data properly before we load it into our data warehouse or tool?
“You ain’t seen nothing yet” is a saying that is especially true with regard to big data. It is very difficult to spot a single piece of information in a very large amount of data, especially if you are looking at the wrong items or using tools that are not capable of analyzing big data (everyone has heard of the ‘query from hell’ or reports that run for six hours).
From a technical point of view, processing and using big data poses three big challenges: intelligent filtering, outstanding performance and good visualization.
Intelligent filtering
Because big data is really big and there are many applications, you first need to know what you are looking for and what the purpose is. Is it your company name (including all variations and typos) on Twitter, Facebook or blogs? Is it the temperature of your medicines, reported by sensors, in an airplane flying to Japan? Are you looking for the weather conditions in a specific country? Do you need to know which path visitors to your website most likely took before they ordered your product? In some cases you don’t need all the big data, but only a subset; it depends on the type of application you want to build. So, the first step in building a big data application is to know exactly what you are looking for.
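The company-name example above can be sketched in a few lines. This is a minimal illustration, not a production filter; the company name “Acme” and its variants are assumptions made up for this example.

```python
import re

# Hypothetical example: reduce a raw message stream to the subset that
# mentions our company name, including common variations and typos.
# "acme", "acmee" and "ac me" are invented variants for illustration.
VARIANTS = ["acme", "acmee", "ac me"]
PATTERN = re.compile(r"\b(" + "|".join(VARIANTS) + r")\b", re.IGNORECASE)

def filter_mentions(messages):
    """Keep only the messages that mention the company."""
    return [m for m in messages if PATTERN.search(m)]

stream = [
    "Just bought a new Acme widget, love it!",
    "The weather in Japan is great today.",
    "acmee customer service was slow :(",
]
mentions = filter_mentions(stream)
```

Even this toy version shows the point: the earlier you can express “what you are looking for” as a precise filter, the smaller the data set the rest of the pipeline has to handle.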
Outstanding performance
The telcos of this world already had a clue how to process big sets of structured data. But today a whole new set of difficulties arises because of the lack of structure in today’s data, its size and the speed at which it is generated. Mobile devices, sensors in cars and planes, RFID chips and large-scale eCommerce all generate a lot of data within seconds, data that is not yet structured enough to store in a (single) relational database.
A few technologies that can handle big data should be considered as part of a business intelligence solution. To name a few: massively parallel processing (see the ETL tool criteria), cloud computing platforms, distributed databases, grid computing and the Apache Hadoop framework. Sure, filtering first is key, but you can only do that if the rest of the data is not relevant at all and the data set you filter on is stored in a way that supports filtering.
This looks like a ‘Catch-22’ situation: to be able to filter intelligently, you normally have to put the data in a database first, but to get the big data into a database you sometimes have to filter it first. That is where Hadoop comes into the picture, or any of its alternatives: GemFire, MarkLogic and neo4j. Because of its distributed nature and sophisticated technology, Hadoop can handle large data sets very well, for fast storage as well as quick retrieval, and it can work on pieces of the data in parallel. Today, the Hadoop community is also working on a data warehouse system (the Hive DWH) built on Hadoop technology.
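Hadoop’s ability to “work on pieces of data in parallel” follows the MapReduce model: independent map tasks each process one piece of the data, and reduce tasks aggregate the intermediate results. A minimal sketch of that model in plain Python (running sequentially here; Hadoop distributes the same two phases across a cluster):

```python
from collections import defaultdict

def map_phase(chunk):
    # Each map task sees one piece of the data and emits (key, value)
    # pairs; here: (word, 1) for a word count.
    for line in chunk:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # The shuffle step groups pairs by key; the reduce step aggregates
    # the values per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two "chunks" stand in for data blocks spread over a cluster.
chunks = [["big data is big"], ["data is hot"]]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(intermediate)
```

Because the map tasks never need to see each other’s chunks, the same program scales from two lists on one machine to thousands of blocks on a cluster.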
Good data visualization
If the big data is available in a form that can be queried, inside or outside the cloud, the last challenge is to visualize the data in a proper manner. The main purpose of visualization is to communicate the information clearly and effectively. That does not necessarily mean the data should always be presented in graphs: some information can very well be communicated in a sorted plain list or a pivot table. It depends on the purpose (which relations you are looking for in the data set) and the nature of the data.
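A pivot table is itself a simple aggregation. A minimal sketch of the idea in plain Python, using hypothetical sales records invented for this example (a BI tool or spreadsheet does the same grouping for you):

```python
from collections import defaultdict

# Hypothetical sales records; the regions, products and numbers are
# made up for illustration.
records = [
    {"region": "EMEA", "product": "A", "revenue": 100},
    {"region": "EMEA", "product": "B", "revenue": 50},
    {"region": "APAC", "product": "A", "revenue": 70},
    {"region": "APAC", "product": "A", "revenue": 30},
]

def pivot(rows, row_key, col_key, value_key):
    """Sum value_key grouped by (row_key, col_key), like a pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return {k: dict(v) for k, v in table.items()}

by_region = pivot(records, "region", "product", "revenue")
```

The resulting nested dictionary is exactly the rows-by-columns layout a pivot table presents, which is often all the “visualization” a question needs.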
According to research performed by Cleveland and McGill, perceptual tasks differ in accuracy for quantitative, ordinal and nominal data (Mackinlay, 1986). Higher-ranked tasks are accomplished more accurately than lower-ranked ones: perceiving differences in position ranks best, while perceiving differences in shape ranks worst. The following table (adapted from Cleveland and McGill) ranks which encoding is most effective for each data type (quantitative, ordinal and nominal).
Table 1: ranking of encodings by effectiveness; for quantitative data, position is the best visualization and shape the least effective.
But these basic rules are a bit too simplistic for big data, although they are useful as basic guidelines. The rankings also carry an implicit warning to be careful when using 3D graphs, because volume does not appear in the top five tasks for any of the data types.
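The guideline can be turned into a simple lookup when choosing chart types programmatically. A sketch of the ranking as a data structure, listing only the top entries per data type; the orderings follow the ranking reported by Mackinlay (1986), but should be checked against the original table before relying on them:

```python
# Top-ranked visual encodings per data type, after Cleveland & McGill
# as reported by Mackinlay (1986). Truncated to the leading entries;
# note that "volume" is absent from every top-five list.
ENCODING_RANKING = {
    "quantitative": ["position", "length", "angle", "slope", "area"],
    "ordinal": ["position", "density", "color saturation", "color hue"],
    "nominal": ["position", "color hue", "texture", "connection"],
}

def best_encoding(data_type):
    """Return the most effective visual encoding for a data type."""
    return ENCODING_RANKING[data_type][0]
```

Position tops the list for every data type, which is why scatter plots and bar charts (both position-based) are such safe defaults.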
Identifying complex relationships in large data sets should persuade you to combine different techniques on top of these guidelines. Simulation and animation (for example, showing the variation of a multidimensional indicator over time) is one of them. A good example is demonstrated in the following video, in which multiple dimensions are shown using color, size and position.
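The trick in such animations is that one record drives several visual channels at once: two dimensions map to position, one to bubble size and one to color. A minimal sketch of that mapping, with hypothetical country indicators invented for this example:

```python
# Hypothetical continent-to-color palette for illustration.
PALETTE = {"Asia": "red", "Europe": "blue", "Africa": "green"}

def to_visual_channels(record):
    """Map one multidimensional record onto visual channels."""
    return {
        "x": record["income"],                 # position encodes income
        "y": record["life_expectancy"],        # position encodes life expectancy
        "size": record["population"] ** 0.5,   # area-like encoding of population
        "color": PALETTE[record["continent"]], # hue encodes a nominal dimension
    }

point = to_visual_channels(
    {"income": 30000, "life_expectancy": 80.0,
     "population": 1000000, "continent": "Europe"}
)
```

Re-running the mapping for each year of data and redrawing the points is what turns a static scatter plot into the animated view shown in the video.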
But more concepts and methods can be helpful to visualize the data in the right way. An overview of visualization methods is given here.
To return to the main question, ‘Does business intelligence software provide enough tools and functions to tackle the challenges of big data?’, we must conclude that the business intelligence vendors are on their way, but may fall short in a few areas, depending on the BI software you are using.
- Large unstructured datasets: business intelligence software may be able to read them, but it does not do a great job because of the size and the lack of structure of the data. This can cause real performance trouble, despite the fact that some tools have advanced in-memory technology. Consider using Hadoop technology (or an alternative) and building a data warehouse first, for example with a data integration platform like Informatica or a less sophisticated ETL tool. Once the big data is in a cube or data warehouse, performance should no longer be a problem.
- Data visualization: although a lot of visualization methods can be used out of the box with business intelligence tools (on average they provide between 15 and 20 types of visualizations), the possibilities of most BI software are still quite limited compared to all the visualization methods that are available. If you need advanced visualizations, take a look at the list of data visualization software. Such software can sometimes be integrated with BI software, like Trendalyzer (a Google API), or used on top of BI software or as an alternative.
Last but not least, if your business intelligence platform can handle big data in a proper way, you have the technology that suits you. But worldwide there is a shortage of people with the deep analytical skills needed to make effective use of big data technology (McKinsey, 2011). So don’t focus only on technology; also consider the skills side of big data and business analytics.
Download our fully independent & in-depth evaluation of Business intelligence software in the Business Intelligence Tools Survey and see what software matches the requirements for big data best.