Anyone who focuses on the practical uses of information technology, as I do, must consider the data aspects of adopting any new technology to achieve some business purpose. Reliable data must be readily available in the necessary form and format, or that shiny new IT bauble you want to deploy will fall short of expectations. Our research benchmarks cover a range of core business and IT processes, and they regularly demonstrate that data deficiencies are a root cause of issues organizations have in performing core functions; typically the larger the company, the more severe the data issues become.
Today, “big data” is a popular buzzword, connoting the ability of companies to tap into large amounts of structured and unstructured data using advanced data processing technologies and analytics to achieve insights not available with more conventional techniques. Big data is an important phenomenon, but in using it organizations must ensure that it doesn’t become a larger, faster way of experiencing the GIGO effect: Big Garbage In – Big Garbage Out. At IBM’s recent business analytics analyst summit, IBM Fellow and Chief Technology Officer Brenda Dietrich talked about the challenges of dealing with big data. In passing she provided insight into what may be one of the most important ways to address this challenge.
Today, handling data used for business analytics requires rigor to ensure the quality of the output. For this reason, most of today’s analytics are based on structured data, usually internally held. Our benchmark research confirms that today’s big-data efforts usually use existing internal structured data, not external and/or unstructured sources of information. The most commonly used source, customer data from CRM and other systems, is employed by two-thirds (65%) of companies, while 60 percent use data from transaction systems such as ERP and CRM. Only one-fifth (21%) use social media, 13 percent use multimedia and a mere 5 percent use data from smart meters. In the future, however, companies will need to exploit a greater amount of external, less structured and more complex data (multimedia, for example), because the most significant opportunities to profitably employ analytics in the future will come from expanding the scope of data applied to business issues.
One illustration of an application that combines structured internal and unstructured external data is an analytic process that devises a just-in-time right-of-way maintenance plan for an electric utility. Such maintenance is labor-intensive and expensive, but then so is dealing with the impact of downed power lines in severe weather. Rather than setting a fixed maintenance schedule, as is the usual practice, a utility can combine internal data about the right of way, customer density and past incidents with real-time weather sensor readings (rainfall, for example) and three-dimensional imaging to deploy maintenance crews more effectively. Effort can be diverted away from less pressing areas to those where vegetation density poses the greatest threat of damage in the near term.
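To make the idea concrete, here is a minimal sketch of the kind of prioritization such an application might perform. The field names, weights and thresholds are hypothetical illustrations, not a description of any actual utility system; a real model would be fitted to historical outage and cost data rather than hand-tuned.

```python
from dataclasses import dataclass

@dataclass
class RightOfWaySegment:
    segment_id: str
    vegetation_density: float    # 0.0-1.0, e.g. derived from 3-D imaging
    customers_served: int        # from internal customer systems
    past_incidents: int          # outages attributed to this segment
    forecast_rainfall_mm: float  # from real-time weather feeds

def maintenance_priority(seg: RightOfWaySegment) -> float:
    """Combine internal and external signals into a single priority score (illustrative weights)."""
    weather_risk = min(seg.forecast_rainfall_mm / 50.0, 1.0)   # saturate at heavy rain
    incident_rate = min(seg.past_incidents / 10.0, 1.0)
    exposure = min(seg.customers_served / 10_000.0, 1.0)
    return (0.4 * seg.vegetation_density
            + 0.3 * weather_risk
            + 0.2 * incident_rate
            + 0.1 * exposure)

segments = [
    RightOfWaySegment("A-12", 0.82, 6500, 4, 35.0),
    RightOfWaySegment("B-07", 0.35, 1200, 0, 10.0),
    RightOfWaySegment("C-03", 0.67, 9800, 2, 55.0),
]

# Dispatch crews to the highest-risk segments first.
for seg in sorted(segments, key=maintenance_priority, reverse=True):
    print(f"{seg.segment_id}: priority {maintenance_priority(seg):.2f}")
```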
Analysts have considerable scope to devise innovative approaches that use broad sets of data to address every aspect of managing a business, including operations, risk management, modeling and forecasting, to name just a few. However, broadening the sources of data used in business analytics poses difficult challenges. In today’s analytics, to ensure the quality of the answers, the specific data used in reporting and modeling must be selected and its values slotted into the appropriate pigeonholes. This approach will grow increasingly impractical as businesses try to use larger sets of information, especially when those sets include ever-changing unstructured information. Unstructured data exists outside of formal databases and includes text in documents, social media and the Web, as well as voice, video and other rich media. A major point of Dietrich’s presentation was that to make sense of this torrent of ever-changing data, it will be necessary to develop models that are far more adaptive and flexible than today’s at ingesting data. Analytic modeling must be iterative, to accommodate learning and the evolving sets of data available. The models also must be able to assess the trustworthiness of the information they use and communicate how much confidence consumers should place in the results. With so much of the information coming from sources outside the company, the system needs to learn how accurate that information is and how applicable it is to the analysis being performed.
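Here is a minimal sketch of what that might look like in code: each source carries a trust score, the combined answer is weighted by trust, and a confidence figure is returned alongside the result so its consumer knows how much to rely on it, with trust adjusted iteratively as actual outcomes become known. The sources, numbers and weighting scheme are hypothetical illustrations of the general point, not a description of IBM’s approach.

```python
from dataclasses import dataclass

@dataclass
class SourceEstimate:
    source: str
    value: float   # the estimate this source contributes (e.g. a demand forecast)
    trust: float   # 0.0-1.0, learned from how well the source has predicted outcomes

def combine(estimates: list[SourceEstimate]) -> tuple[float, float]:
    """Return a trust-weighted answer plus an overall confidence score."""
    total_trust = sum(e.trust for e in estimates)
    if total_trust == 0:
        return float("nan"), 0.0
    answer = sum(e.value * e.trust for e in estimates) / total_trust
    confidence = total_trust / len(estimates)   # crude proxy: average trust of the inputs
    return answer, confidence

def update_trust(est: SourceEstimate, actual: float, rate: float = 0.1) -> None:
    """Iteratively adjust a source's trust once the actual outcome is known."""
    error = abs(est.value - actual) / max(abs(actual), 1e-9)
    est.trust = max(0.0, min(1.0, est.trust + rate * (1.0 - 2.0 * error)))

sources = [
    SourceEstimate("internal ERP history", 102.0, 0.9),
    SourceEstimate("social media signal", 140.0, 0.3),
    SourceEstimate("third-party feed", 118.0, 0.6),
]

answer, confidence = combine(sources)
print(f"answer={answer:.1f}, confidence={confidence:.2f}")
if confidence < 0.5:
    print("Low confidence: treat this result as indicative, not decision-grade.")
```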
Over time, these broad data sets can become quite reliable and therefore extremely useful. In the meantime, the basic techniques for judging the quality of the data are valuable in themselves. Even the best available information may produce an answer that is “garbage” (that is, low-quality information you shouldn’t trust), but at least the system will tell you that’s what it is. Being able to discern the difference will enable organizations to look for answers much more freely, without having to spend time developing structure or doing rigorous testing to ensure quality.
I think today’s big-data efforts represent the first small step in what’s likely to be a long process of developing technologies, techniques and practices to use in designing and employing analytics that work with large sets of data of all types. The skeptic in me sees the danger of big data becoming big GIGO. The futurist in me sees the potential to apply information technology intelligently to develop better business practices in a way that is not feasible today. Both of us will be watching closely to see what happens.
Regards,
Robert Kugel – SVP Research