I have recently written about the organizational and cultural aspects of being data-driven, and the potential advantages data-driven organizations stand to gain by responding faster to worker and customer demands for more innovative, data-rich applications and personalized experiences. I have also explained that data-driven processes require more agile, continuous data processing, with an increased focus on extract, load and transform processes — as well as change data capture and automation and orchestration — as part of a DataOps approach to data management. Safeguarding the health of data pipelines is fundamental to ensuring data is integrated and processed in the sequence required to generate business intelligence. The significance of these data pipelines to delivering data-driven business strategies has led to the emergence of vendors, such as Astronomer, focused on enabling organizations to orchestrate data engineering pipelines and workflows.
Astronomer was founded in 2018, building on a background of providing data engineering services to create a business around the Apache Airflow open-source workflow monitoring and management project. Apache Airflow began as an internal development project within Airbnb in 2014 and became an Apache Software Foundation project in 2016. It enables data engineers to use Python to programmatically author, schedule and monitor workflows. Astronomer’s workforce has been heavily involved in the Apache Airflow development project since the company’s foundation, contributing to significant enhancements such as the scheduler performance and high-availability capabilities delivered in 2020 with Apache Airflow 2.0.
It was also in 2020 that Astronomer made the decision to focus its attention on building a cloud offering through which it could deliver Apache Airflow as a managed platform, rather than providing technical support services for Apache Airflow deployments. The resulting Astro service was launched in June 2022. Its development was financed by venture capital funding, including a $213 million Series C round announced in March 2022 that was led by Insight Partners, along with Meritech Capital, Salesforce Ventures, J.P. Morgan, K5 Global, Sutter Hill Ventures, Venrock and Sierra Ventures. The funding round also facilitated Astronomer’s acquisition of data pipeline observability and data lineage specialist Datakin, which was founded by the creators of the OpenLineage and Marquez open-source projects. The addition of data lineage functionality based on the OpenLineage project brings data observability to Astronomer’s Airflow-as-a-service offering, with the company positioning the combined product as a data orchestration cloud service.
The need for more agile data pipelines is driven by the demand for real-time data processing. More frequent data analysis requires data to be integrated, cleansed, enriched, transformed and processed for analysis in a continuous and agile process. As such, data-driven organizations are increasingly treating the steps involved in extracting, integrating, aggregating, preparing, transforming and loading data as a continual process. Data pipelines enable the flow of information through the organization and are increasingly scheduled, automated and orchestrated by data engineers without the need for constant manual intervention. I assert that by 2024, 6 in 10 organizations will adopt data engineering processes that span data integration, transformation and preparation, producing repeatable data pipelines that create more agile information architectures.
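To make the idea of an orchestrated pipeline more concrete, the following is a minimal sketch written with only the Python standard library. It is not Apache Airflow's or Astro's API. The task names, the sample records and the `run_pipeline` helper are all hypothetical, chosen only to show how declaring dependencies between steps lets software, rather than a person, decide the order in which they run.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical pipeline steps. Each receives a dict of upstream results.
def extract(_upstream):
    return ["3", "4", "oops", "5"]  # made-up raw records


def cleanse(upstream):
    # Drop records that are not whole numbers.
    return [int(v) for v in upstream["extract"] if v.isdigit()]


def transform(upstream):
    return [v * 10 for v in upstream["cleanse"]]


def load(upstream):
    rows = upstream["transform"]
    return {"rows_loaded": len(rows), "total": sum(rows)}


TASKS = {"extract": extract, "cleanse": cleanse,
         "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "cleanse": {"extract"},
    "transform": {"cleanse"},
    "load": {"transform"},
}


def run_pipeline(tasks, dependencies):
    """Run every task once, after all of its upstream tasks have finished."""
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](upstream)
    return order, results


order, results = run_pipeline(TASKS, DEPENDENCIES)
print(order)
print(results["load"])
```

A production orchestrator adds scheduling, retries, monitoring and distributed execution on top of this basic dependency-ordering idea, none of which the sketch attempts.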
While data engineers can deploy and run Apache Airflow on-premises or in the cloud, Astronomer’s Astro service delivers these capabilities as a managed service available on Amazon Web Services, Google Cloud or Microsoft Azure. In developing Astro, Astronomer reengineered Airflow for the cloud, with optimized configuration and auto-scaling capabilities. The managed service approach is also designed to reduce infrastructure consumption for long-term tasks and to relieve data engineers of security, upgrading and other management responsibilities, enabling them to focus on data pipelines. Astro also offers users the ability to visually monitor activity and data pipeline dependencies and, thanks to the acquisition of Datakin, the ability to collect lineage metadata as well as identify and monitor data quality metrics to improve trust in data. As I noted earlier this year, monitoring the quality and reliability of data is a key component of data observability's role in ensuring healthy data pipelines.
While Apache Airflow and OpenLineage provide the core building blocks of the Astro data orchestration cloud service, there are opportunities for Astronomer to expand on this functionality with capabilities such as automated anomaly detection, alerting and root cause analysis. That said, I recommend that all organizations currently managing Apache Airflow deployments, or considering Apache Airflow as they evaluate platforms to orchestrate data engineering pipelines and workflows, include Astronomer in their evaluations.
Regards,
Matt Aslett