Lakehouse

A lakehouse combines the flexibility of a data lake with the structure and performance of a data warehouse. You store raw and processed data in open formats and run SQL analytics and machine learning on the same platform, without managing two separate systems.

What is a lakehouse?

A lakehouse is a data platform that combines the flexibility of a data lake with the structure and performance of a data warehouse. You keep your data (raw and processed) in open file formats on cheap object storage, but you query it with the SQL performance and ACID guarantees you'd expect from a warehouse.

The concept was popularised by Databricks around 2020 and has since been picked up by most major vendors. Microsoft Fabric, Snowflake, and Amazon Redshift all offer lakehouse-style features today, with differences in the details.

The goal is simple: stop running two separate systems (a data lake for ML and raw data, a data warehouse for BI) and use one platform that serves both worlds. That removes duplicated storage, duplicated ETL, and endless arguments about which system holds the truth.

How did the lakehouse concept emerge?

Through the 2010s many organisations built a data lake alongside their existing data warehouse. The lake had to handle everything the warehouse wouldn't: unstructured documents, logs, IoT data, experiments for machine learning. The assumption was that you could dump everything cheaply and add structure later.

In practice the opposite often happened. Data lakes turned into data swamps: a chaos of files without schema, without governance, and without reliable query capabilities. Reporting stayed in the warehouse, while expensive ETL pipelines shuttled data between the two.

The lakehouse emerged as an answer to that split. Three technical breakthroughs made it possible:

  • Open table formats like Delta Lake, Apache Iceberg, and Apache Hudi brought ACID transactions, schema evolution, and time travel to files on object storage.

  • Columnar file formats like Parquet made fast analytical queries on raw files feasible.

  • Decoupled compute and storage: compute scales independently of the data it reads, so Spark, SQL endpoints, and Python notebooks can all work on the same files.
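
The decoupling in that last point can be made concrete with a stdlib-only Python sketch (the "engines" and file layout here are invented for illustration): two independent pieces of compute read the same single copy of the data, with no export or synchronisation between them.

```python
import json
import tempfile
from pathlib import Path

# One copy of the data on "object storage" (a temp directory stands in for it).
storage = Path(tempfile.mkdtemp())
(storage / "events.json").write_text(json.dumps(
    [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 7}]
))

def bi_engine(path: Path) -> int:
    """A 'SQL endpoint': computes an aggregate for a dashboard."""
    events = json.loads((path / "events.json").read_text())
    return sum(e["clicks"] for e in events)

def ml_engine(path: Path) -> list[list[int]]:
    """A 'notebook': builds feature vectors from the very same files."""
    events = json.loads((path / "events.json").read_text())
    return [[e["clicks"]] for e in events]

print(bi_engine(storage))  # 10
print(ml_engine(storage))  # [[3], [7]]
```

Both functions could scale (or be swapped for real engines) without touching the storage side; that independence is the point.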

How does a lakehouse work?

A lakehouse is built around shared storage of files in open formats. Different compute layers run on top of that storage, each serving a different use case.

Storage layer
Cheap object storage such as Azure Data Lake Storage, Amazon S3, or Google Cloud Storage. All data lives here in open formats so multiple engines can read it without copies.

Table layer
An open table format (Delta, Iceberg, Hudi) adds a schema definition and transaction log on top of the files. That lets you run updates, deletes, and merges as if it were a classic database, with ACID guarantees.
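
A toy version of such a transaction log, in stdlib Python, shows the mechanism (this is a deliberately simplified layout, not the actual Delta or Iceberg protocol): each commit is a single file of actions, and the current table state is whatever a replay of those commits yields.

```python
import json
import tempfile
from pathlib import Path

def commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    """Write one commit file. Because a commit is a single file appearing
    atomically, readers see all of its actions or none of them."""
    (log_dir / f"{version:020d}.json").write_text(
        "\n".join(json.dumps(a) for a in actions)
    )

def live_files(log_dir: Path) -> set[str]:
    """Replay commits in order: 'add' brings a data file in, 'remove' retires it."""
    files: set[str] = set()
    for commit_file in sorted(log_dir.glob("*.json")):
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files

log = Path(tempfile.mkdtemp())
commit(log, 0, [{"add": "part-000.parquet"}])
# An "update" rewrites a data file: remove the old one and add the new one
# in the same commit, so no reader ever sees a half-updated table.
commit(log, 1, [{"remove": "part-000.parquet"}, {"add": "part-001.parquet"}])
print(live_files(log))  # {'part-001.parquet'}
```

Replaying only the commits up to an earlier version is, in essence, how time travel works too.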

Compute layer
Different engines for different tasks: Spark for large batch jobs and ML, SQL endpoints for BI tools like Power BI, Python notebooks for data science. Every engine sees the same data.

Governance layer
Central metadata management and access control. In Microsoft Fabric that role is played by OneLake and Microsoft Purview. In Databricks it's Unity Catalog.

Many lakehouses follow a medallion architecture with three tiers:

  1. Bronze: raw data as it arrives.

  2. Silver: cleaned, validated, integrated data.

  3. Gold: business-specific aggregations ready for reporting.
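
A minimal sketch of those three tiers in plain Python (the field names and cleaning rules are invented for illustration; a real pipeline would run these steps in Spark or SQL over tables, not in-memory lists):

```python
# Bronze: raw events exactly as they arrived, flaws and all.
bronze = [
    {"order_id": "A1", "amount_cents": "1999", "country": "NL"},
    {"order_id": "A1", "amount_cents": "1999", "country": "NL"},  # duplicate delivery
    {"order_id": "A2", "amount_cents": "oops", "country": "BE"},  # broken record
    {"order_id": "A3", "amount_cents": "500",  "country": "NL"},
]

# Silver: validated, typed, deduplicated.
silver, seen = [], set()
for row in bronze:
    try:
        cents = int(row["amount_cents"])
    except ValueError:
        continue  # a real pipeline would quarantine bad rows, not drop them silently
    if row["order_id"] in seen:
        continue
    seen.add(row["order_id"])
    silver.append({"order_id": row["order_id"], "cents": cents, "country": row["country"]})

# Gold: a business-specific aggregate, ready for a report.
gold: dict[str, int] = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0) + row["cents"]

print(gold)  # {'NL': 2499}
```

Each tier is a separate table in practice, so downstream consumers can pick the level of refinement they need.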

Open file formats (Delta, Iceberg, Parquet)

The power of a lakehouse stands or falls with its open formats.

Parquet
Columnar file format optimised for analytical queries. Most lakehouse table formats store their data as Parquet files under the hood.
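
The columnar idea itself fits in a few lines of Python (a conceptual sketch, not the actual Parquet encoding): storing values column by column means an analytical query only has to read the columns it touches.

```python
# Row layout: every query drags whole records through memory.
rows = [
    {"user": "a", "country": "NL", "ms": 120},
    {"user": "b", "country": "BE", "ms": 340},
    {"user": "c", "country": "NL", "ms": 95},
]

# Columnar layout (roughly what Parquet does on disk): one contiguous
# block per column, which also compresses far better than mixed rows.
columns = {
    "user": ["a", "b", "c"],
    "country": ["NL", "BE", "NL"],
    "ms": [120, 340, 95],
}

# A query like AVG(ms) reads a single column and skips the rest entirely.
avg_ms = sum(columns["ms"]) / len(columns["ms"])
print(avg_ms)  # 185.0
```

At lakehouse scale, skipping unneeded columns (plus per-column statistics) is where most of the query speed comes from.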

Delta Lake
Developed by Databricks, now open source. Adds a transaction log to Parquet, giving you ACID transactions, schema evolution, and time travel. The default format inside Microsoft Fabric and Databricks.

Apache Iceberg
Similar concept, developed at Netflix, widely adopted by Snowflake, AWS, and Google. Strong in environments where multiple engines need to work on the same data.

Apache Hudi
Strong for upsert-heavy workloads and streaming. Less popular than Delta or Iceberg, but still present in some stacks.

In 2024 Databricks and Snowflake announced interoperability between Delta and Iceberg. For many organisations that means the choice between them is no longer a lock-in decision.

Lakehouse versus data lake versus data warehouse

Lakehouse versus data lake

A pure data lake has no table structure, no ACID transactions, no guarantees about data consistency. You can run SQL through external tools, but performance and reliability are limited.

A lakehouse adds all of that through an open table format. You get lake flexibility with warehouse discipline.

Lakehouse versus data warehouse

A classic data warehouse uses a proprietary storage format and a strict schema. Excellent for BI, less suitable for unstructured data or heavy ML workloads.

A lakehouse uses open formats and is more flexible across workloads, but needs more engineering to reach warehouse-grade governance and performance. For pure BI workloads, a classic warehouse is sometimes still the simpler choice.

Platforms that offer a lakehouse

  • Microsoft Fabric
    Lakehouse around OneLake, with Delta as the default table format and deep integration with Power BI. The strongest option for Microsoft-oriented organisations.

  • Databricks
    The originator of the term and still the benchmark for data engineering and ML. Strong on large, complex data workloads.

  • Snowflake
    Classic warehouse with a lakehouse layer (Polaris Catalog, Iceberg support) on top. Interesting for teams already on Snowflake who want to move gradually to open formats.

  • AWS
    The combination of S3, Glue, Athena, and Redshift Spectrum forms a lakehouse architecture, but you integrate more yourself than in Fabric or Databricks.

The choice comes down to the data and tools you already run, how much data engineering capacity you have, and whether BI or data science is the centre of your roadmap. For organisations already on Microsoft 365 and Power BI, Fabric is often the shortest path to a working lakehouse.

Last Updated: April 18, 2026