Enterprises face a bewildering range of choices when it comes to data platforms, as evidenced by the number of software providers and products assessed in our recent Data Platforms Buyers Guide. They must choose not only among numerous data platform providers and products, but also among a diverse array of functional and architectural options. Is the workload primarily operational or analytic? Will it be deployed on-premises or in the cloud? Should it be distributed or centralized? Data warehouse or data lakehouse? Data fabric or data grid? These are often presented as binary choices, but they very rarely are. Many enterprises are looking for data platforms that offer the flexibility to address a combination of functional and architectural options. Multi-product providers, such as IBM, are arguably at an advantage if they can clearly articulate how and when their different options work together. At its recent Think 2024 customer event, IBM announced several enhancements to its watsonx.data lakehouse offering and articulated how it can coexist with IBM Cloud Pak for Data as part of a data fabric architecture.
IBM unveiled its watsonx brand at Think 2023, delivering an artificial intelligence (AI) and data platform designed to address the AI development life cycle, data storage and processing, and AI governance. The watsonx platform
The concept of the data lake emerged over a decade ago in response to the demand for analytic data platforms that could economically store and process large volumes of raw data, either in the cloud or on premises. The
IBM watsonx.data is available on IBM Cloud, Amazon Web Services and Microsoft Azure, as well as on premises, and enables users to store large volumes of data in object storage using the Apache Iceberg table format for transactional consistency. The data can be processed using a choice of query engines, including Apache Spark and Presto, as well as IBM’s own Db2 and Netezza. At Think 2024, IBM announced that it had updated watsonx.data with support for Presto 2.0, including Presto C++, which runs Presto with the Velox open-source C++ acceleration library for improved performance and was co-developed by IBM employees as part of the open-source Linux Foundation project. The company also added IBM Data Gate for watsonx to enable access to data in IBM zSystems environments, facilitating the development of AI models using transactional mainframe data. In addition, IBM integrated a semantic layer into IBM Knowledge Catalog, embeddable within watsonx.data, to accelerate data discovery and enrichment via semantic search capabilities. Previously, IBM had announced the addition of vector database capabilities based on the open-source Milvus database, enabling the use of vector search to augment generative AI (GenAI) with context from enterprise content and data via retrieval-augmented generation.
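To make the retrieval-augmented generation idea concrete, the following is a minimal sketch of vector search feeding context into a prompt. It is not the watsonx.data or Milvus API: the function names, the three-dimensional toy embeddings and the sample passages are all assumptions made for illustration, and a real deployment would use an embedding model and a vector database rather than an in-memory list.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, documents, k=2):
    """Return the k passages whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question, context_passages):
    """Assemble a prompt that grounds the model in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical enterprise passages with hand-written toy embeddings.
DOCUMENTS = [
    ("Refund requests are processed within 14 days.", [0.9, 0.1, 0.0]),
    ("The Frankfurt data center hosts EU customer data.", [0.1, 0.9, 0.1]),
    ("Refunds over 500 EUR require manager approval.", [0.8, 0.2, 0.1]),
]

# Assumed embedding of the question; a real system would compute this.
query_vec = [1.0, 0.0, 0.0]
passages = top_k(query_vec, DOCUMENTS, k=2)
prompt = build_prompt("How long do refunds take?", passages)
print(prompt)
```

The design point is that the generative model never sees the whole corpus, only the handful of passages ranked closest to the question, which is what lets enterprise content supply context without retraining the model.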
As I previously noted, while some of the functionality delivered in watsonx is new, some is also available via other IBM products, such as IBM Cloud Pak for Data. It has subsequently become clearer how these products relate to each other, with the IBM watsonx.data license including entitlements to IBM Cloud Pak for Data platform software, as well as other prerequisites such as the Red Hat OpenShift Container Platform. IBM recently announced Cloud Pak for Data 5.0, including a new Immersive Experience feature that enables administrators to easily toggle between the dedicated user experiences for Cloud Pak for Data and watsonx. This facilitates the use of watsonx.data as a data lakehouse within a larger data fabric architecture supported by Cloud Pak for Data and its combination of data integration, data governance, data observability, master data management and data lineage functionality. The Immersive Experience feature also provides access to IBM Data Product Hub, which was unveiled at Think 2024, to facilitate the development and sharing of data products. Included as part of IBM Cloud Pak for Data 5.0 and integrated with IBM watsonx.data, IBM Data Product Hub is built on IBM Knowledge Catalog to enable and control data access based on metadata and governance rules, with additional functionality for defining and enforcing data contracts between data producers and consumers. As such, IBM Data Product Hub has the potential to support the holistic view of data production and consumption that we see as critical to data intelligence.
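For readers less familiar with data contracts, the sketch below shows the general idea of a producer and consumer agreeing on the shape of a data product and checking records against it. It is purely illustrative and does not represent how IBM Data Product Hub defines or enforces contracts: the `FieldRule` and `DataContract` classes, the `customer_orders` product and its fields are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    """One agreed field: its name, expected type and whether null is allowed."""
    name: str
    dtype: type
    nullable: bool = False


@dataclass(frozen=True)
class DataContract:
    """A hypothetical contract between a data producer and its consumers."""
    product: str
    fields: tuple

    def violations(self, record: dict) -> list:
        """Return a list of human-readable problems; empty means compliant."""
        problems = []
        for rule in self.fields:
            if rule.name not in record:
                problems.append(f"missing field: {rule.name}")
                continue
            value = record[rule.name]
            if value is None:
                if not rule.nullable:
                    problems.append(f"null not allowed: {rule.name}")
                continue
            if not isinstance(value, rule.dtype):
                problems.append(
                    f"wrong type for {rule.name}: expected {rule.dtype.__name__}"
                )
        return problems


# Hypothetical data product and sample records.
contract = DataContract(
    product="customer_orders",
    fields=(
        FieldRule("order_id", str),
        FieldRule("amount", float),
        FieldRule("coupon", str, nullable=True),
    ),
)

good = {"order_id": "A-1", "amount": 19.99, "coupon": None}
bad = {"order_id": 42, "coupon": None}

print(contract.violations(good))
print(contract.violations(bad))
```

A producer could run a check like this before publishing, and a consumer could run the same check on receipt, so that both sides test against one shared definition rather than their own assumptions.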
IBM has also boosted its data fabric capabilities with the recently closed acquisition of the StreamSets real-time data integration capabilities from Software AG, along with the webMethods Integration Platform as a Service. These investments complement IBM’s previous acquisitions of Databand.ai for data observability in 2022 and Manta for data lineage in 2023, as well as the company’s ongoing internal development of data processing and management capabilities. The breadth of functionality available from IBM has the potential to be overwhelming for would-be customers, but the product positioning is becoming clearer, with watsonx.data and Cloud Pak for Data as the delivery vehicles for data lakehouse and data fabric environments, respectively, and Cloud Pak for Data’s Immersive Experience feature facilitating coexistence. I recommend that any enterprise considering its data platform options include IBM in its evaluations. The company was rated Exemplary in our recent Data Platforms Buyers Guides for Operational Data Platforms, Analytic Data Platforms and overall Data Platforms.
Regards,
Matt Aslett