ISG Software Research Analyst Perspectives

Dremio Enables Self-Service Analytics for the Data Lakehouse

Written by Matt Aslett | May 15, 2024 10:00:00 AM

I previously wrote about the potential for rapid adoption of the data lakehouse concept as enterprises combined the benefits of data lakes based on low-cost cloud object storage with the structured data processing functionality normally associated with data warehousing. By layering support for table formats, metadata management and transactional updates and deletes as well as query engine and data orchestration functionality on top of low-cost storage of both structured and unstructured data, the data lakehouse enables enterprises not only to store and process data from multiple applications, but also to make it available for analysis by multiple users in multiple departments for many purposes, including business intelligence and artificial intelligence. Vendors such as Dremio have added capabilities to the core concept to better equip enterprises to rely on data lakehouses for self-service analytics and AI.

Dremio was founded in 2015 to build a business around the Apache Arrow in-memory columnar data format, which was developed to enable high-performance analysis of large volumes of data. Apache Arrow underpins the company’s SQL Query Engine, which is designed to deliver high-performance BI and interactive analytics directly on the data stored in a data lake or other data platforms across cloud, on-premises or hybrid environments. The SQL Query Engine is one of three core components of Dremio’s Unified Lakehouse Platform, alongside governed self-service analytics and data lakehouse management based on Dremio’s data catalog for the Apache Iceberg table format. In combination, these capabilities are designed to enable enterprises to connect and govern data in on-premises and cloud data lakes as well as other data sources across the database estate and make it available to data analysts and business users to access and analyze on a self-service basis.

With customers in a variety of industries, including financial services, healthcare, retail, manufacturing and consumer packaged goods, Dremio has raised more than $400 million in funding from the likes of Adams Street Partners, Cisco Investments, Insight Partners, Lightspeed Venture Partners, Norwest Venture Partners and Sapphire Ventures. Most recently, Dremio raised a $160 million Series E funding round in January 2022, which valued the company at over $2 billion.

It is common for enterprises to create data lake environments to persist structured and unstructured data in object storage, either on-premises or in the cloud. More than one-half (53%) of participants in Ventana Research’s Analytics and Data Benchmark Research currently use object stores in their analytics efforts and an additional 18% plan to do so within the next two years. Data lakes provide a relatively low-cost alternative to data persistence in traditional full-stack data warehouse environments, which combine compute and storage. Data warehousing providers have responded by adapting their products to scale compute and storage independently, deploying on or alongside a data lake.

Meanwhile, data lakehouse vendors have integrated the functionality associated with data warehousing into the data lake itself. This includes distributed SQL query engines; support for atomic, consistent, isolated and durable transactions; updates and deletes; concurrency control; metadata management; data indexing; data caching; schema enforcement and evolution; query acceleration; semantic models; data governance; version control; access control and auditing.

Dremio’s Unified Lakehouse Platform is available as software for deployment on-premises and in the cloud as well as a cloud service. It is made up of three core sets of capabilities addressing SQL query processing and acceleration, lakehouse management and unified analytics. For SQL query processing and acceleration, the platform’s SQL Query Engine enables the processing and transformation of data in cloud data lakes as well as federated querying of metastores and databases on-premises and in the cloud. SQL Query Engine enables users to create virtual tables known as Views from the source data for query acceleration. It also offers pre-computed data summaries, known as Reflections, that accelerate complex aggregations and other operations, and it uses Columnar Cloud Cache for in-memory data processing.
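
The difference between a virtual table and a pre-computed summary can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in and is not Dremio's SQL or API; the table and relation names (sales, sales_by_region, sales_by_region_summary) are hypothetical, and how Dremio builds or refreshes Reflections is not modeled here.

```python
import sqlite3

# Stand-in database; this is NOT Dremio, only an illustration of the concepts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 100.0), ('east', 50.0), ('west', 75.0);

    -- A view is a stored query: no data is copied, and the source
    -- table is read every time the view is queried.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

    -- A pre-computed summary stores the aggregate result once, so later
    -- queries read the small result instead of scanning the source.
    CREATE TABLE sales_by_region_summary AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

def totals(relation):
    """Return {region: total} from the named view or summary table."""
    rows = conn.execute(f"SELECT region, total FROM {relation} ORDER BY region")
    return dict(rows.fetchall())

# Both relations give the same answer while the source is unchanged.
view_totals = totals("sales_by_region")
summary_totals = totals("sales_by_region_summary")

# After new data arrives, the view reflects it immediately, while the
# summary in this toy example stays as computed until it is rebuilt.
conn.execute("INSERT INTO sales VALUES ('west', 25.0)")
fresh = totals("sales_by_region")
stale = totals("sales_by_region_summary")
```

In this sketch the trade-off is visible directly: the view is always current but recomputes the aggregation on every query, while the summary is cheap to read but must be refreshed when the source changes.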

Dremio’s lakehouse management capabilities provide a data catalog based on the Apache Iceberg table format that can be accessed using SQL Query Engine as well as other query engines such as Apache Spark or Apache Flink. The lakehouse management capabilities include automated optimization of Apache Iceberg tables, centralized data governance, and Git-like data branching and version control as well as isolated and consistent data transformations based on Dremio’s Nessie open-source project. The unified analytics capabilities take advantage of Dremio’s Universal Semantic layer to provide self-service access to discover data in the data catalog and take advantage of the SQL Query Engine to accelerate analysis for data and business analysts using a variety of analytics tools and applications from the likes of Alteryx, Domo, Google Cloud, IBM, Microsoft, MicroStrategy, Qlik, SAP and Salesforce’s Tableau. I assert that through 2027, almost all enterprises using data catalog products will increase business user access, facilitating self-service data discovery and accelerating data intelligence and democratization initiatives. Dremio has added generative AI-based capabilities to lower the barriers to accessing and working with data, including auto-generated descriptions and labeling as well as the conversion of natural language questions to SQL queries.

While Dremio has always offered features and functionality of value to data engineers, the recent addition of lakehouse management capabilities enables the company to articulate a larger value proposition for technology decision-makers that addresses the advantages of self-service analytics and AI. I anticipate further investment in generative AI capabilities, such as vector search and automated semantic data modeling. I recommend that any organization considering the data lakehouse approach evaluate Dremio’s Unified Lakehouse Platform for its combination of query acceleration and data management.

Regards,

Matt Aslett