Maintaining data quality and trust is a perennial data management challenge, often preventing organizations from operating at the speed of business. Recent years have seen the emergence of data observability as a category of DataOps focused on monitoring the quality and reliability of data used for analytics and governance projects and associated data pipelines. There is clear overlap with data quality, which is more established as both a discipline and product category for improving trust in data. Organizations that have made investments in data quality might reasonably ask whether they need data observability, while those that have invested in data observability might wonder whether they can eschew traditional data quality tools.
So where does data quality end and data observability begin? Is one a subset of the other? It is worth exploring the relationship between data observability and data quality in greater detail to understand the value of each as well as how and when they intersect.
The first thing to state is that data quality predates data observability, having been an important aspect of data management for decades. However, its importance has arguably never been greater. As organizations aspire to be more data-driven, it is critical to trust the data used to make those decisions. Without data quality processes and tools, organizations may make decisions based on old, incomplete, incorrect or poorly organized data. Assessing the quality of data used to make business decisions is not only more important than ever but also increasingly difficult given the growing range of data sources and the volume of data that needs to be evaluated. Poor data quality processes can result in security and privacy risks as well as unnecessary data storage and processing costs due to data duplication.
One of the most time-consuming aspects of analytics initiatives is assessing the quality of data. Almost two-thirds (64%) of participants in Ventana Research’s Analytics and Data Benchmark Research cite assessing data quality as one of the most time-consuming aspects of their analytics initiatives.
It is also important to understand that data quality is both a discipline and a product category. As a discipline, data quality refers to the processes, methods and tools used to measure the suitability of a dataset for a specific purpose. The precise measure of suitability will depend on the individual use case, but important characteristics include accuracy, completeness, consistency, timeliness and validity.
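To make these characteristics concrete, the sketch below expresses a few of them as simple rule-based metrics over a tabular dataset. It is purely illustrative: the pandas DataFrame, the column names (customer_id, email, postal_code, updated_at) and the thresholds are hypothetical assumptions for the example, not drawn from any particular product.

```python
# Hypothetical rule-based data quality checks on a customer dataset.
# Column names, rules and thresholds are assumptions for illustration only.
import re
import pandas as pd

def quality_metrics(df: pd.DataFrame) -> dict:
    """Return simple scores for completeness, validity and timeliness."""
    # Completeness: share of rows with no missing values in required fields
    required = ["customer_id", "email", "postal_code", "updated_at"]
    completeness = float(df[required].notna().all(axis=1).mean())

    # Validity: share of email values matching a basic pattern
    pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    validity = float(df["email"].dropna().apply(lambda v: bool(pattern.match(str(v)))).mean())

    # Timeliness: share of records updated within the last 90 days
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=90)
    timeliness = float((pd.to_datetime(df["updated_at"]) >= cutoff).mean())

    return {"completeness": completeness, "validity": validity, "timeliness": timeliness}

if __name__ == "__main__":
    records = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "email": ["a@example.com", "not-an-email", None],
        "postal_code": ["10001", "94105", "60601"],
        "updated_at": ["2023-05-01", "2022-01-15", "2023-06-20"],
    })
    print(quality_metrics(records))
```

The precise rules will differ by use case; the point is that each dimension can be measured against the requirements of a specific purpose.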
The data quality product category comprises the tools used to evaluate data in relation to these characteristics. Data observability has emerged as a separate product category. It includes software focused on automating the monitoring of data to assess its health based on key attributes including freshness, distribution, volume, schema and lineage. Automation expands the volume of data that can be monitored and improves efficiency compared to manual data monitoring and management, with automated data quality checks and recommended remediation actions. As such, automation is often cited as a distinction between data observability and data quality software. Focusing on automation as a distinction, however, relies on an outdated view of data quality software.
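The sketch below illustrates the kind of automated health check a data observability tool might run against a table, covering freshness, volume and schema. The expectations are hypothetical and hard-coded for brevity; in practice such baselines are typically learned or configured rather than written by hand, which is what allows this style of monitoring to scale across many tables and pipelines.

```python
# Minimal sketch of automated observability-style health checks on a table.
# The expectations (freshness window, volume range, expected schema) are hypothetical.
from datetime import datetime, timedelta, timezone

EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "created_at": "datetime"}

def check_table_health(last_loaded_at: datetime, row_count: int, schema: dict) -> list[str]:
    """Return a list of alerts covering freshness, volume and schema drift."""
    alerts = []
    # Freshness: data should have landed within the last hour
    if datetime.now(timezone.utc) - last_loaded_at > timedelta(hours=1):
        alerts.append("freshness: table has not been updated in over an hour")
    # Volume: today's row count should fall within an expected range
    if not (10_000 <= row_count <= 100_000):
        alerts.append(f"volume: row count {row_count} outside expected range")
    # Schema: detect added, removed or retyped columns
    if schema != EXPECTED_SCHEMA:
        alerts.append("schema: table schema differs from expected definition")
    return alerts

alerts = check_table_health(
    last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=3),
    row_count=8_500,
    schema={"order_id": "int", "amount": "float"},
)
for alert in alerts:
    print(alert)
```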
Although data quality software has historically provided users with an environment to manually check and correct data quality issues, the use of machine learning to automate the detection, and in some cases the correction, of data quality issues is now increasingly common in data quality products as well.
A clearer distinction can be drawn from the scope and focus of the functionality. Data quality software is concerned with the suitability of the data to a given task. In comparison, data observability is concerned with the reliability and health of the overall data environment. Data observability tools monitor not just the data in an individual environment for a specific purpose at a given point in time, but also the associated upstream and downstream data pipelines. In doing so, data observability software ensures that data is available and up to date, avoiding downtime caused by lost or inaccurate data due to schema changes, system failures or broken data pipelines.
To put it another way, while data quality software is designed to help users identify and resolve data quality problems, data observability software is designed to automate the detection and identification of the causes of data quality problems, potentially enabling users to prevent data quality issues before they occur. For example, as long as the data being assessed remains consistent, data quality tools might not detect a failed pipeline until the data has become out of date. Data observability tools could detect the failure long before the data quality issue arose. In comparison, a change in address might not be identified by data observability tools if the new information adhered to the correct schema. It could be detected — and remediated — using data quality tools.
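The sketch below makes this contrast concrete, using hypothetical names and rules: a freshness check catches a pipeline that has stopped delivering data, while a record-level rule catches a postal code that fits the schema but fails a business check against reference data. Neither check is a substitute for the other.

```python
# Hypothetical contrast between a pipeline-level observability check and a
# record-level data quality rule; names, values and rules are illustrative only.
from datetime import datetime, timedelta, timezone

def pipeline_is_stale(last_run: datetime, max_lag: timedelta) -> bool:
    """Observability-style check: has the pipeline stopped delivering data?"""
    return datetime.now(timezone.utc) - last_run > max_lag

def address_rule_violations(record: dict, valid_postal_codes: set[str]) -> list[str]:
    """Quality-style check: the value fits the schema but fails a business rule."""
    issues = []
    if record["postal_code"] not in valid_postal_codes:
        issues.append(f"postal_code {record['postal_code']} not found in reference data")
    return issues

# A stalled pipeline is caught by the freshness monitor, not by record checks
print(pipeline_is_stale(datetime.now(timezone.utc) - timedelta(days=2), timedelta(hours=6)))

# An outdated address still matches the schema, so only the quality rule flags it
record = {"customer_id": 42, "postal_code": "99999"}
print(address_rule_violations(record, valid_postal_codes={"10001", "94105"}))
```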
Data quality and data observability software products are therefore largely complementary. This is supported by the fact that some vendors offer products in both categories, while others provide products with functionality associated with both data observability and data quality. Potential customers are advised to pay close attention and evaluate purchases carefully. Some data observability products do provide the resolution and remediation functionality traditionally associated with data quality software, albeit not to the same depth and breadth. Additionally, some vendors previously associated with data quality have adopted the term data observability but may lack the depth and breadth of pipeline monitoring and error detection capabilities. We will be taking these issues into account when assessing product capabilities as part of our forthcoming 2023 DataOps Value Index study, which includes an evaluation of data observability products.
Regards,
Matt Aslett