Data virtualization is not new, but it has changed over the years. The term describes a process of combining data on the fly from multiple sources rather than copying that data into a common repository such as a data warehouse or a data lake, which I have written about. There are many reasons for an organization concerned with managing its data to consider data virtualization, most stemming from the fact that the data does not have to be copied to a new location. It could, for instance, eliminate the cost of building and maintaining a copy of one of the organization’s big data sources. Recognizing these benefits, many database and data integration companies offer data virtualization products. Denodo, one of the few independent, best-of-breed vendors in this market today, brings these capabilities to big data sources and data lakes.
Google Trends presents a graphic representation of the decline in popularity of the term data federation and the corresponding rise in popularity of the term data virtualization.
Denodo takes a different approach. Its tools consider the cost of each part of an individual query and evaluate the trade-offs. As the saying goes, there’s more than one way to skin a cat; in this case there’s more than one way to execute a SQL statement. For example, suppose you wish to create a list of all sales of a certain set of products. Your company has 1,000 products (maintained in one system) and hundreds of millions of customer transactions (maintained in another system). The federated approach would bring both data sets to the federated system, join them there and then find the desired subset of products. An alternative would be to ship the table of 1,000 products to the system that holds the customer transactions, load it as a temporary table, join it to the transaction data there and return only the matching rows. Today’s data virtualization evaluates the time each alternative would take and selects the one that will produce the result set faster.
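To make that trade-off concrete, here is a minimal sketch of the kind of comparison a cost-based plan chooser makes. It is not Denodo’s implementation; the function names, row counts and transfer rates are all illustrative assumptions, and network transfer is treated as the dominant cost.

```python
# Minimal sketch (not Denodo's optimizer) of choosing between two ways to
# execute the same federated join. All figures below are illustrative.

def cost_federated_join(product_rows, txn_rows, bytes_per_row, net_bytes_per_sec):
    """Strategy 1: pull both tables to the virtualization layer and join there.
    Dominant cost: shipping hundreds of millions of transaction rows."""
    bytes_moved = (product_rows + txn_rows) * bytes_per_row
    return bytes_moved / net_bytes_per_sec

def cost_ship_small_table(product_rows, matching_txn_rows, bytes_per_row, net_bytes_per_sec):
    """Strategy 2: ship the small product table to the transaction system,
    join it there as a temporary table, and return only the matching rows."""
    bytes_moved = (product_rows + matching_txn_rows) * bytes_per_row
    return bytes_moved / net_bytes_per_sec

# Assumed figures: 1,000 products, 300M transactions, ~1% of which match.
plan1 = cost_federated_join(1_000, 300_000_000, 100, 100_000_000)
plan2 = cost_ship_small_table(1_000, 3_000_000, 100, 100_000_000)
print("federated join: %.1f s, ship small table: %.1f s" % (plan1, plan2))
print("chosen plan:", "ship small table" if plan2 < plan1 else "federated join")
```

Under these assumed numbers the second plan moves roughly one-hundredth of the data, which is why a cost-based chooser picks it; with a different data distribution the federated join could win instead.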
Data virtualization can make it easier, and less costly, to bring together data from multiple sources.
Not all the work is eliminated by data virtualization. You must still design the logical model for the data that you want to provide, such as which tables and which columns to include, but that’s all. Virtualization eliminates load processes and the need to update the data. In the case of big data, there are no extra clusters to set up and maintain. The logical data warehouse or data lake uses the security and governance system already in place. As a result, users can avoid some of the organizational battles about data access since the “owner” of the data continues to maintain the rights and restrictions on the data. Our research shows that organizations that have adequate data virtualization capabilities are more often satisfied with the way their organization manages big data than are organizations as a whole (88% vs. 58%) and are more confident in the data quality of their big data integration efforts (81% vs. 54%).
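To illustrate the design work that does remain, here is a hypothetical sketch of a logical model: a declaration of which source tables and columns to expose, with no load jobs and no copies. The names and structure below are illustrative only and are not Denodo syntax.

```python
# Hypothetical sketch of a logical view definition for data virtualization.
# Columns are mapped to the systems that own them; no data is copied, and the
# owners' access controls still apply when the view is queried.

from dataclasses import dataclass

@dataclass
class SourceColumn:
    source: str   # system that owns the data, e.g. "product_catalog" or "hadoop_sales"
    table: str
    column: str

# Logical view "sales_by_product" spanning two source systems.
sales_by_product = {
    "product_id":   SourceColumn("product_catalog", "products", "id"),
    "product_name": SourceColumn("product_catalog", "products", "name"),
    "sale_amount":  SourceColumn("hadoop_sales", "transactions", "amount"),
    "sale_date":    SourceColumn("hadoop_sales", "transactions", "sale_ts"),
}

for name, col in sales_by_product.items():
    print(f"{name} -> {col.source}.{col.table}.{col.column}")
```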
In its most recent release, version 6.0, Denodo enhanced its cost-based query optimizer for data virtualization. Many of the optimizer’s features would be found in any decent relational database management system, but the challenge becomes greater when the underlying resources are scattered among multiple systems. To address this issue, Denodo collects and maintains statistics about the various data sources, which are evaluated at run time to determine the optimal way to execute queries. The product offers connectivity to a variety of data sources, both structured and unstructured, including Hadoop, NoSQL, documents and websites. It can be deployed on premises, in the cloud using Amazon Web Services or in a hybrid configuration.
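Building on the earlier sketch, the snippet below shows, in hypothetical form, how per-source statistics can be consulted at query time to decide where a join should execute. Again, this is an assumption-laden illustration of the general technique, not Denodo’s actual statistics catalog or API.

```python
# Illustrative sketch (not Denodo's API) of per-source statistics consulted at
# query time to decide where to push a join down. All values are assumed.

source_stats = {
    "product_catalog": {"rows": {"products": 1_000}, "supports_join_pushdown": True},
    "hadoop_sales":    {"rows": {"transactions": 300_000_000}, "supports_join_pushdown": True},
}

def choose_join_site(left_source, left_table, right_source, right_table):
    """Prefer executing at the source holding the larger table (if it can join),
    so only the smaller table and the result cross the network."""
    left_rows = source_stats[left_source]["rows"][left_table]
    right_rows = source_stats[right_source]["rows"][right_table]
    if left_rows >= right_rows and source_stats[left_source]["supports_join_pushdown"]:
        return left_source
    if source_stats[right_source]["supports_join_pushdown"]:
        return right_source
    return "virtualization_layer"

print(choose_join_site("product_catalog", "products", "hadoop_sales", "transactions"))
# -> "hadoop_sales": ship the 1,000-row product table there rather than pull 300M rows back
```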
Performance can be a key factor in user acceptance of data virtualization; users will balk if access is too slow. Denodo has published some benchmarks showing that performance of its product can be nearly identical to accessing data loaded into an analytical database. I never place much emphasis on vendor benchmarks as they may or may not reflect an actual organization’s configuration and requirements. However, the fact that Denodo produces this type of benchmark indicates its focus on minimizing the performance overhead associated with data virtualization.
When I first looked at Denodo, prior to the 6.0 release, I expected to see more optimization techniques built into the product. There’s always room for improvement, but with the current release the company has made great strides and addressed many of these issues. To maximize the software’s value to customers, I’d like to see the company invest in developing more technology partnerships.
If your organization is considering data virtualization technology, I recommend you evaluate Denodo. The company won the 2015 Ventana Research Technology Innovation Award for Information Management, and its customer Autodesk won the 2015 Leadership Award in the Big Data Category. If your organization is deluged with big data but is not considering data virtualization, it probably should be. As our research shows, it can lead to greater satisfaction with and more confidence in the quality of your data.
Regards,
David Menninger
SVP & Research Director
Follow me on Twitter @dmenningerVR and connect with me on LinkedIn.