Reflections on BARC’s recent briefing with Dremio by Timm Grosser, BARC’s Senior Analyst for Data Management
What is a data lake engine?
In short, it should help users find data in a (cloud) data lake quickly and easily, and query it with high performance. Technically speaking, it is an SQL-based query engine with a semantic layer that enables queries across different data storage systems (on-premises or cloud-based). It acts as a central access point for JDBC/ODBC-compatible user tools.
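From a client tool's perspective, such a central access point behaves like any other SQL database. The following minimal Python sketch shows what querying an engine like Dremio through its ODBC interface could look like using pyodbc; the DSN name, credentials and dataset path are illustrative assumptions, not details from the briefing.

```python
# Minimal sketch: a BI or analysis tool querying a data lake engine
# through its ODBC interface. Assumes an ODBC driver is installed and
# a DSN named "Dremio" is configured; DSN, credentials and the dataset
# path are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=analyst;PWD=secret", autocommit=True)
cursor = conn.cursor()

# The engine exposes heterogeneous sources (S3, ADLS, RDBMS, ...) behind
# one SQL namespace, so the query looks like ordinary SQL to the client.
cursor.execute('SELECT region, SUM(revenue) FROM lake."sales" GROUP BY region')
for row in cursor.fetchall():
    print(row)

conn.close()
```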
About Dremio
Dremio was established in 2015 with headquarters in Santa Clara, USA. Around 120 employees currently work for the vendor. Customers include companies such as Diageo, Microsoft, NCR, PayPal, Standard Chartered and TransUnion. In the DACH region, DATEV, DB Cargo, Henkel and Software AG (Cumulocity IoT) already use Dremio; DATEV, DB Cargo and Henkel are among its showcase customers there. Dremio is suitable for companies from all industry sectors. A dedicated team was established in early 2020 to focus on the German-speaking market, and the company plans to expand this team in the future.
In 2018, Dremio Enterprise Edition was launched as a supplement to the open source Dremio Community Edition. The Enterprise Edition primarily adds enterprise features for data protection and security, as well as services. Dremio can be used on-premises and/or in the customer's own cloud account (AWS, Azure).
Dremio is available in the AWS and Azure marketplaces and is a co-sell partner of both these providers.
Another strong global partner is Tableau, which uses Dremio primarily for SQL data access to distributed file systems and has already won several customers over to working with Dremio.
Dremio is venture-backed and recently received a US$70 million cash injection.
The mission
With its technology, Dremio aims to simplify and accelerate access to data for analytical workflows, and to do so more cost-effectively than other players in the market. This cost-effectiveness extends beyond technology license fees: by not moving and duplicating data across the architecture, Dremio's approach also saves costs. The company focuses on providing fast, flexible access to (distributed) data via a user-friendly interface. The aim is to avoid additional persistence layers, such as pre-built aggregates, and to give users a platform for ad hoc analyses.
Its main users are business analysts, data scientists and data engineers. Dremio considers it important to be seen as an agnostic tool: it enables querying of different data storage technologies in different locations (cross-cloud, on-premises/cloud, etc.).
The technology
Dremio has its roots in Apache Drill, an SQL engine for Hadoop that never really caught on for analytical workloads, mainly due to performance issues and complex handling. Apache Arrow – a technology that Dremio makes use of – was co-developed by Dremio co-founder and CTO Jacques Nadeau and is still under active development. Apache Arrow provides a cross-language development platform for in-memory data and specifies a standardized, language-independent columnar memory format for flat and hierarchical data. This becomes interesting when linking different data stores with different formats for analytical queries while still delivering good performance.
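To make the columnar, language-independent format more concrete, here is a minimal sketch using pyarrow, Arrow's Python bindings; the table contents are made up for illustration.

```python
# Minimal sketch of Apache Arrow's columnar in-memory format: the same
# memory layout can be handed between processes and languages (Java,
# C++, Rust, ...) without row-by-row serialization. Data is made up.
import pyarrow as pa

# Build a typed, column-oriented in-memory table (flat schema).
table = pa.table({
    "customer": ["a", "b", "c"],
    "revenue": [10.0, 20.5, 7.25],
})
print(table.schema)              # language-independent schema
print(table.column("revenue"))   # values stored column-wise, not row-wise

# Serialize to the Arrow IPC stream format, which a consumer in another
# language can read without converting the columnar layout.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
print(f"IPC stream: {sink.getvalue().size} bytes")
```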
Dremio provides connectors to various relational and non-relational storage technologies and distributed file systems. Queries on the connected systems are defined as virtual datasets and executed live at query time. Each query benefits from acceleration mechanisms such as massively parallel processing (MPP), query optimization and push-down options, where push-down delegates parts of the workload to the source systems. One of Dremio's main performance features is called "Reflections". These resemble materialized views and, if desired, persist a physically optimized representation of the data in column-based Parquet files. With each query, Dremio checks whether a persisted (precalculated) intermediate result is available, thus saving computing time.

An internal catalog is available for searching the technical metadata, which can be enriched with wikis and tags to make data easier for business users to find. The solution does not replace an enterprise data catalog, but can be integrated with one.
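The idea behind Reflections can be pictured with a small sketch – a conceptual illustration, not Dremio's actual implementation: an aggregate is precomputed once and persisted as a columnar Parquet file, and later queries that match it read the small precomputed result instead of rescanning the raw data. File names, data and the matching logic below are illustrative assumptions.

```python
# Conceptual sketch of the Reflection idea (not Dremio's implementation):
# persist a precomputed aggregate as Parquet, then answer matching
# queries from it instead of rescanning raw data. Paths and data are
# made up for illustration.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# "Raw" source data, standing in for a large data lake table.
raw = pa.table({"region": ["eu", "eu", "us"], "revenue": [10.0, 5.0, 8.0]})
pq.write_table(raw, "raw_sales.parquet")

# The "reflection": a physically optimized, precomputed aggregate.
agg = raw.group_by("region").aggregate([("revenue", "sum")])
pq.write_table(agg, "reflection_sales_by_region.parquet")

def total_revenue(region: str) -> float:
    # A query that matches the reflection reads the tiny precomputed
    # file instead of the raw data, saving computing time.
    refl = pq.read_table("reflection_sales_by_region.parquet")
    match = refl.filter(pc.equal(refl.column("region"), region))
    return match.column("revenue_sum")[0].as_py()

print(total_revenue("eu"))  # 15.0
```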
A particularly exciting aspect of Dremio's roadmap is the further expansion of Reflections. Today, these still have to be set up manually; in the future, the system is set to provide "intelligent" support based on data about query behavior.
Analyst opinion
Dremio is a query engine for analytical workloads, preferably on (cloud) data lakes. The technology offers an approach to virtually merging today's complex, heterogeneous system landscapes. The idea of an access layer across different cloud offerings is especially attractive: it opens up the possibility of operating on more than one cloud platform, gives analysts a lot of flexibility in data delivery, and avoids technology or vendor lock-in.
Dremio calls itself a query engine with a semantic layer and leaves the data processing to the specialists. This is how it successfully differentiates itself from providers such as Databricks and Denodo. The promise: high performance at low cost.
It remains to be seen to what extent its performance convinces customers. We are looking forward to finding out more in our upcoming reference customer meetings and technological deep dives.