Data Pipelines Buyers Guide: Market Observations

Written by Matt Aslett | Oct 19, 2023 10:00:00 AM

The 2023 Ventana Research Buyers Guide for Data Pipelines research enables me to provide observations about how the market has advanced.

The development, testing and deployment of data pipelines is essential to generating intelligence from data. Just as a physical pipeline is used to transport water between turbines, generators and transformers in the generation of hydroelectric power, so data pipelines are used to transport data between the stages involved in data processing and analytics to generate business insight. Healthy data pipelines are necessary to ensure data is integrated and processed in the sequence required to generate business intelligence (BI).

The concept of the data pipeline is nothing new, but it is becoming increasingly important as organizations adapt data management processes to be more data driven. Data pipelines have traditionally involved batch extract, transform and load processes, but data-driven processes require more agile, continuous data processing as part of a DataOps approach to data management, with an increased focus on extract, load and transform (ELT) processes, as well as change data capture, automation and orchestration.

Data-driven organizations are increasingly thinking of the steps involved in extracting, integrating, aggregating, preparing, transforming and loading data as a continual process that is orchestrated to facilitate data-driven analytics. By 2026, three-quarters of organizations will adopt data engineering processes that span data integration, transformation and preparation, producing repeatable data pipelines that create more agile information architectures.

The need for more agile data pipelines is driven by the need for real-time data processing. Almost a quarter (22%) of organizations that participated in Ventana Research’s Analytics and Data Benchmark Research are currently analyzing data in real time, with an additional 10% analyzing data every hour. More frequent data analysis requires data to be integrated, cleansed, enriched, transformed and processed for analysis in a continuous and agile process.

Traditional batch extract, transform and load data pipelines are ill-suited to continuous and agile processes. These pipelines were designed to extract data from a source (typically a database supporting an operational application), transform it in a dedicated staging area, and then load it into a target environment (typically a data warehouse or data lake) for analysis.
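To make the three stages concrete, here is a minimal sketch of a batch ETL run. It assumes two in-memory SQLite databases standing in for the operational source and the warehouse target; the table names (orders, fact_orders), the columns and the helper functions are all hypothetical and chosen only for illustration.

```python
import sqlite3


def make_source():
    """Build a hypothetical operational database with a few sample orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1250, "us"), (2, 2000, "de")],
    )
    conn.commit()
    return conn


def extract(source):
    """Extract: read the full table from the source in one batch."""
    return source.execute("SELECT id, amount_cents, country FROM orders").fetchall()


def transform(rows):
    """Transform: reshape the rows in the staging tier, outside the target."""
    return [(oid, cents / 100.0, country.upper()) for oid, cents, country in rows]


def load(target, rows):
    """Load: write the already-transformed rows into the target table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    target.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", rows)
    target.commit()


def run_etl(source, target):
    """Run the three stages in their fixed order."""
    load(target, transform(extract(source)))
```

Because the transformation logic lives in the staging step, a change to the source columns or to the reporting requirement means editing and redeploying transform before any new data can be loaded.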

Extract, transform and load (ETL) pipelines can be automated and orchestrated to reduce manual intervention. However, since they are designed for a specific data transformation task, ETL pipelines are rigid and difficult to adapt. As data and business requirements change, ETL pipelines need to be rewritten accordingly.

The need for greater agility and flexibility to meet the demands of real-time data processing is one reason we have seen increased interest in ELT pipelines. These pipelines involve the use of a more lightweight staging tier, which is required simply to extract data from the source and load it into the target data platform. Rather than a separate transformation stage prior to loading as with an ETL pipeline, ELT pipelines make use of pushdown optimization, leveraging the data processing functionality and processing power of the target data platform to transform the data.
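The ELT pattern described here can be sketched as follows. This is an illustration only, assuming an in-memory SQLite database as a stand-in for the target data platform; the table names (raw_orders, orders_by_country) and function names are hypothetical.

```python
import sqlite3


def extract_and_load(source_rows, target):
    """Lightweight staging: copy source rows into the target unchanged."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(id INTEGER, amount_cents INTEGER, country TEXT)"
    )
    target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)
    target.commit()


def transform_in_target(target):
    """Pushdown: the transformation is SQL executed by the target engine."""
    target.execute("DROP TABLE IF EXISTS orders_by_country")
    target.execute(
        """
        CREATE TABLE orders_by_country AS
        SELECT UPPER(country) AS country,
               SUM(amount_cents) / 100.0 AS revenue
        FROM raw_orders
        GROUP BY UPPER(country)
        """
    )
    target.commit()
```

The raw rows land first, and the transformation can be rewritten and rerun later without repeating the extraction or touching the loading code.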

Pushing data transformation execution to the target data platform results in a more agile data extraction and loading phase, which is more adaptable to changing data sources. This approach is well aligned with the schema-on-read approach used in data lake environments, as opposed to the schema-on-write approach, in which a schema is applied to data as it is loaded into a data warehouse.
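The difference between the two schema approaches can be illustrated with a small sketch. It assumes plain Python lists standing in for a warehouse table and a data lake, and a hypothetical two-field schema; none of these names come from a real product.

```python
import json

# Hypothetical schema: field name -> type used to coerce the value.
SCHEMA = {"id": int, "amount": float}


def write_with_schema(store, record):
    """Schema-on-write: coerce and validate before the record is stored.

    A record that is missing a field raises KeyError and is not loaded.
    """
    store.append({name: cast(record[name]) for name, cast in SCHEMA.items()})


def write_raw(store, line):
    """Schema-on-read, write side: keep the raw JSON text exactly as received."""
    store.append(line)


def read_with_schema(store, schema):
    """Schema-on-read, read side: apply the schema only when the data is queried."""
    for line in store:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if record.get(name) is not None else None
            for name, cast in schema.items()
        }
```

With write_raw, a source that starts sending an extra field, or drops one, does not block loading; the reader decides later which fields it needs and how to interpret them.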

Since the data is not transformed before being loaded into the target data platform, data sources can change and evolve without delaying data loading. This potentially enables data analysts to transform data to meet their requirements rather than have dedicated data integration professionals perform the task.

As such, many ELT offerings are positioned for use by data analysts and developers rather than IT professionals. This can also result in reduced delays in deploying business intelligence projects by avoiding the need to wait for data transformation specialists to (re)configure pipelines in response to evolving BI requirements and new data sources.

Like ETL pipelines, ELT pipelines may also be batch processes. Both can be accelerated by using change data capture techniques. Change data capture (CDC) is similarly not new but has come into greater focus given the increasing need for real-time data processing. As the name suggests, CDC is the process of capturing data changes. Specifically, in the context of data pipelines, CDC identifies and tracks changes to tables in the source database as rows are inserted, updated or deleted. CDC reduces complexity and increases agility by only synchronizing changed data rather than the entire dataset. The data changes can be synchronized incrementally or in a continuous stream.
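As a rough illustration of incremental synchronization, the sketch below applies a list of change events to a replica and remembers how far it has read. The Change record, its sequence number and the dictionary replica are hypothetical simplifications; real CDC products typically read such events from the source database's transaction log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One hypothetical change event captured from a source table."""
    seq: int                    # position of the event in the change log
    op: str                     # "insert", "update" or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row contents; None for deletes


def apply_changes(target, changes, last_seq=0):
    """Apply only the events newer than last_seq and return the new watermark."""
    for change in sorted(changes, key=lambda c: c.seq):
        if change.seq <= last_seq:
            continue  # already synchronized in an earlier run
        if change.op == "delete":
            target.pop(change.key, None)
        elif change.op in ("insert", "update"):
            target[change.key] = dict(change.row)
        else:
            raise ValueError(f"unknown operation: {change.op}")
        last_seq = change.seq
    return last_seq
```

Calling apply_changes on a schedule with the stored watermark gives incremental synchronization; calling it for each event as it arrives approximates a continuous stream. In both cases only the changed rows are touched, not the entire dataset.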

The development, testing and deployment of both ETL and ELT pipelines can be automated and orchestrated to provide further agility by reducing the need for manual intervention. Specifically, the batch extraction of data can be scheduled to occur at regular intervals of a set number of minutes or hours, while the various stages in a data pipeline can be managed as orchestrated workflows using data engineering workflow management platforms.
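An orchestrated workflow can be thought of as a set of tasks plus the dependencies between them. The sketch below uses Python's standard-library graphlib module (Python 3.9 or later) to derive a valid run order; the task names and the run_workflow helper are hypothetical, and a workflow management platform would add scheduling, retries and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run each task after the tasks it depends on.

    tasks: task name -> callable that receives the results collected so far.
    dependencies: task name -> set of task names that must finish first.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical three-stage pipeline expressed as a workflow.
example_tasks = {
    "extract": lambda results: [1, 2, 3],
    "transform": lambda results: [value * 10 for value in results["extract"]],
    "load": lambda results: len(results["transform"]),
}
example_dependencies = {"transform": {"extract"}, "load": {"transform"}}
```

Declaring dependencies rather than hard-coding the call sequence means a new stage can be added by registering one task and its prerequisites, without editing the stages that already exist.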

Data observability also has a complementary role to play in monitoring the health of data pipelines and associated workflows as well as the quality of the data itself. Many products for data pipeline development, testing and deployment also offer functionality for monitoring and managing pipelines and are integrated with data orchestration and/or observability functionality.
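A few of the checks an observability layer might run against a pipeline's output can be sketched as below. The thresholds, field names and issue labels are assumptions made for illustration and do not reflect any particular product.

```python
from datetime import datetime, timedelta


def check_health(rows, last_loaded_at, now,
                 max_age=timedelta(hours=1), min_rows=1, required=("id",)):
    """Return a list of issue labels; an empty list means no issue was found."""
    issues = []
    # Freshness: has the pipeline delivered data recently enough?
    if now - last_loaded_at > max_age:
        issues.append("stale")
    # Volume: did the last run deliver at least the expected number of rows?
    if len(rows) < min_rows:
        issues.append("low_volume")
    # Quality: are required fields populated?
    for field in required:
        missing = sum(1 for row in rows if row.get(field) is None)
        if missing:
            issues.append(f"nulls:{field}={missing}")
    return issues
```

The first two checks concern the health of the pipeline itself, while the third concerns the quality of the data it carries, which mirrors the two roles described above.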

There remains a need for batch ETL pipelines, not least to support existing data integration and analytic processes. However, ELT and CDC approaches have a role to play alongside automation and orchestration in increasing data agility, and we recommend that all organizations consider the potential advantages of more agile data pipelines driving BI and transformational change.

This research evaluates the following vendors that offer products that address key elements of data pipelines as we define them: Alteryx, AWS, Astronomer, BMC, Databricks, DataKitchen, dbt Labs, Google, Hitachi Vantara, IBM, Infoworks.io, Matillion, Microsoft, Prefect, Rivery, SAP, StreamSets and Y42.

You can find more details on our site as well as in the Buyers Guide Market Report.

Regards,

Matt Aslett
