Unstructured Data Analysis | Building an analytical ETL system

After users become familiar with your BI project’s benefits, they’ll likely want more. Be prepared to provide analysis of unstructured data. We’ll show you how to start.

Building an analytical environment

You’ve spent the last five years defining, establishing, and building an analytical environment for your organization. You received accolades for finally providing access to structured information from your company’s transactional systems through a business intelligence (BI) tool with underlying data marts, a data warehouse, and a data integration tool. Now — all of a sudden, it seems — your colleagues are asking for access to other kinds of content such as email, documents, and audio-visual media through your analytical architecture so they can use this content for predictive analytics in the BI application. Where should you start?

Content Definitions

To help your coworkers access this “unstructured” content, you first need to understand the types of data they want to retrieve. You probably have a good handle on the traditional transactional data that is housed in your analytical environment, especially the information in your databases (stored using multidimensional, relational, and other legacy formats). What your colleagues are asking for is pretty much all of the other content in the organization, which could make up as much as 80 percent of all corporate information assets.

Content outside the firewall

Users also want to analyze content about the organization that has traditionally been available only “outside the firewall.” This unstructured content has a less ordered information hierarchy, but it is just as valuable for your performance management application. TDWI has described two types of data sources: semi-structured and unstructured. Semi-structured data includes spreadsheets, flat files, XML documents, and RSS feeds. Unstructured data inside your organization is everything else: email, word processing files, audio-visual content, Web pages, and the free-text fields in your organization’s applications. You may also be asked to provide access to, and analysis of, information from outside your organization. That content comes in similar formats but serves different business purposes.
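To make the "semi-structured" category concrete, here is a minimal sketch that turns an RSS feed into flat records using Python's standard library. The feed text, the field names, and the `rss_to_records` helper are illustrative assumptions, not part of any particular product. Note that the tags supply some structure (title, date), while the description remains free text that would still need text analytics.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be fetched from a URL.
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example industry news</title>
    <item>
      <title>Competitor opens new plant</title>
      <pubDate>Mon, 03 Mar 2025 09:00:00 GMT</pubDate>
      <description>Free-form text that still needs text analytics.</description>
    </item>
    <item>
      <title>Regulator updates reporting rules</title>
      <pubDate>Tue, 04 Mar 2025 14:30:00 GMT</pubDate>
      <description>Another unstructured summary paragraph.</description>
    </item>
  </channel>
</rss>"""


def rss_to_records(xml_text):
    """Flatten each <item> element into a dict with three string fields."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return records


records = rss_to_records(RSS_SAMPLE)
for record in records:
    print(record["published"], "|", record["title"])
```

Records in this shape could be staged in a table alongside your transactional data, with the description column handed to a text analytics step.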

Analysis Types

Your colleagues need access to this semi-structured and unstructured content to answer questions such as “What are our company’s contractual obligations across the enterprise?” and “Is our organization meeting its compliance reporting requirements?” Your coworkers may also want access to content created outside your company to address questions such as “What do our customers think of our products and services?”, “What are our competitors doing?”, and “What trends and buzz in the marketplace could influence our organization?”

Architectural Approaches

The most commonly accepted approach today is to use text analytics and/or extract, transform, and load (ETL) software to impose order on a data set that may comprise many different types of data. These tools deconstruct textual content (often using natural language processing) into data about specific, defined items such as customers or products. These items are then translated into a traditional data structure, such as rows in a database table or entities in a hierarchy. This approach provides some clear advantages. The most obvious is that the content can be integrated into your current BI environment for presentation and analysis, which makes it a good first step if you are trying to get a sense of what value you can derive from this kind of analysis. It is only an incremental step toward a more fully featured solution, however: it addresses only a subset of the semi-structured and unstructured data, and it does not provide new analytical tools for exploring combinations of structured, semi-structured, and unstructured content.
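The extract-and-load idea described above can be sketched in a few lines. This is a deliberately simplified illustration, assuming plain dictionary matching in place of real natural language processing. The customer and product names, the `entity_mention` table, and both helper functions are hypothetical.

```python
import re
import sqlite3

# Hypothetical lookup lists; a real system would use an NLP entity extractor.
KNOWN_ENTITIES = {
    "customer": ["Acme Corp", "Globex"],
    "product": ["Widget Pro", "Gadget Mini"],
}


def extract_mentions(doc_id, text):
    """Return (doc_id, entity_type, entity_name) tuples found in the text."""
    rows = []
    for entity_type, names in KNOWN_ENTITIES.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                rows.append((doc_id, entity_type, name))
    return rows


def load_mentions(conn, rows):
    """Load extracted mentions into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entity_mention ("
        "doc_id TEXT, entity_type TEXT, entity_name TEXT)"
    )
    conn.executemany("INSERT INTO entity_mention VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical email bodies standing in for unstructured content.
emails = {
    "email-001": "Acme Corp reported a fault in the Widget Pro last week.",
    "email-002": "Globex asked for a quote on the widget pro and the Gadget Mini.",
}

conn = sqlite3.connect(":memory:")
for doc_id, body in emails.items():
    load_mentions(conn, extract_mentions(doc_id, body))

# Once the text is in rows, ordinary SQL (and your BI tool) can query it.
product_counts = conn.execute(
    "SELECT entity_name, COUNT(*) FROM entity_mention "
    "WHERE entity_type = 'product' GROUP BY entity_name ORDER BY entity_name"
).fetchall()
print(product_counts)
```

The sketch also shows the limitation noted above: only entities you defined in advance make it into the table, so anything else in the text is lost to analysis.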

A search paradigm

A more radical and interesting approach to this problem is to apply techniques that are not commonly used in a more structured environment, primarily from the search software field. Today, we generally assume that a user who accesses the BI platform has a precise query in mind, as when your sales manager asks, “What is the most recent Region One sales forecast and how does it compare to actual sales?” In contrast, most search platform users are not totally sure what they are looking for. They may have a few parameters in mind (e.g., my new car should be red and have a high safety rating), and they need the search platform to help them find information that is as relevant as possible to their query. The sales manager using a search paradigm might ask, “I see that performance was off in Region One last quarter. What were the causes of the decline?” Such “fuzzy” search logic implies a different approach to integrating your semi-structured and unstructured content into your analytic platform. Rather than looking at text as your only data source, you need to provide access to all types of content.
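The difference between a precise query and relevance-ranked search can be illustrated with a small sketch. It assumes a basic TF-IDF keyword score, which is one common ranking technique and far simpler than what commercial search platforms do. The memo texts and the `rank` function are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (doc_id, score) pairs sorted from most to least relevant."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for counts in doc_terms.values():
        df.update(counts.keys())

    def idf(term):
        # Smoothed inverse document frequency; rarer terms weigh more.
        return math.log((1 + n_docs) / (1 + df[term])) + 1

    query_terms = set(tokenize(query))
    scores = []
    for doc_id, counts in doc_terms.items():
        length = sum(counts.values()) or 1
        score = sum((counts[term] / length) * idf(term) for term in query_terms)
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical internal memos.
memos = {
    "memo-1": "Region One lost two key accounts after the price increase last quarter.",
    "memo-2": "Region Two opened a new office and hired three sales staff.",
    "memo-3": "The cafeteria menu changes next week.",
}

ranked = rank("causes of performance decline in Region One last quarter", memos)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

No memo contains the words "causes" or "decline", yet the memo about lost accounts in Region One still ranks first. The user gets the most relevant content available, not an empty result set.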

Search by Data Visualization

This means that rather than folding this content into your data warehouse through a conventional ETL process, you may need to consider some of the newer content ETL products introduced for this type of initiative. Data visualization and presentation also evolve when you take a more “search”-focused approach to such content. In its simplest form, this means offering search inside your BI application, optimized for the content your users will be looking for. In a more complex scenario, you could offer analysis techniques such as a content terrain (or “heat” map), which resembles a regular topographical map: it shows clusters of content around particular areas of concentration within your enterprise.
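As a rough illustration of the content terrain idea, the sketch below counts documents per business unit and topic and renders the counts as a text "heat" grid. It assumes the documents have already been tagged; the units, topics, and function names are hypothetical, and a real product would cluster content automatically and draw a graphical map.

```python
from collections import Counter

# Hypothetical, already-tagged documents: (business_unit, topic).
tagged_docs = [
    ("Sales", "pricing"),
    ("Sales", "pricing"),
    ("Sales", "contracts"),
    ("Legal", "contracts"),
    ("Legal", "contracts"),
    ("Legal", "compliance"),
    ("Support", "pricing"),
]


def build_grid(docs):
    """Count documents for every (unit, topic) cell, including empty cells."""
    counts = Counter(docs)
    units = sorted({unit for unit, _ in docs})
    topics = sorted({topic for _, topic in docs})
    grid = [[counts[(unit, topic)] for topic in topics] for unit in units]
    return units, topics, grid


def render(units, grid):
    """Draw one row per unit; denser characters mean more documents."""
    shades = " .:#"
    peak = max(max(row) for row in grid) or 1
    lines = []
    for unit, row in zip(units, grid):
        cells = "".join(shades[value * (len(shades) - 1) // peak] for value in row)
        lines.append(f"{unit:<8}|{cells}|")
    return "\n".join(lines)


units, topics, grid = build_grid(tagged_docs)
print("columns:", ", ".join(topics))
print(render(units, grid))
```

Even this crude grid shows where content concentrates, for example contract documents in Legal and pricing documents in Sales, which is the kind of pattern a content terrain is meant to surface.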

Getting Started

To start down this path, you will need to take a more holistic view of your organization’s information and technology architecture to learn what data is available to your end users. You also need to spend time learning what is missing from the BI environment today. Don’t be surprised if people cannot articulate their needs in this arena at first: most do not believe current tools can support this kind of analysis.
Source and more information: Enterprise Systems Journal

Comment on this post by Rick Van der Linden
