Delta Lake

Delta Lake is an open storage format that extends plain Parquet files with transactions, schema enforcement, and time travel. It is the default table format of Microsoft Fabric and Databricks and underpins many other lakehouse platforms.

What is Delta Lake?

Delta Lake is an open storage format for large datasets in a lakehouse. It sits on top of Parquet files and adds a transaction log. That log delivers properties you usually only find in a classic database: ACID transactions, schema evolution, time travel, and efficient updates.

Delta Lake originated at Databricks and has been open source under the Linux Foundation since 2019. It is the default format of Microsoft Fabric and the storage base of OneLake. Anyone building a lakehouse in the Microsoft stack today almost always writes Delta tables.

Picture Delta Lake as a ledger sitting next to your storage cupboard. The Parquet files in the cupboard do not really change, but a strict log tracks which files belong to which version. That lets you prevent two teams overwriting the same table, or roll back to last week's version with a single command.

What does Delta Lake fix?

A classic data lake with loose Parquet or CSV files has three fundamental pain points.

No transactions
If a write process crashes halfway through, half-written files stay behind and readers see inconsistent data. A database would roll back; a plain lake leaves you to clean up the mess yourself.

No schema guarantee
Nothing stops someone from adding a file with a different set of columns tomorrow. Downstream queries break without warning.

Updates and deletes are painful
Removing a row from thousands of Parquet files requires full rewrites or complex partition logic. That does not work for GDPR requests or CDC-driven ingestion.

Delta Lake solves each of these with one elegant idea: a JSON-based transaction log that is updated on every write. The log records which files are currently valid, which have been marked as removed, and which schema version is in force.

How does Delta Lake work?

Parquet as base
The data itself is still stored as Parquet files. Any tool that can read Parquet (Spark, Trino, Python libraries, SQL engines) can therefore read Delta data as well, at least in read-only mode.
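
As a minimal sketch of how that looks in practice, the snippet below uses the open-source deltalake Python package (delta-rs), one of several Delta-aware clients; the /tmp/sales path and the column names are made up for the example.

    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    # Hypothetical example data; any pandas DataFrame or Arrow table works.
    orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 5.00]})

    # Writing creates ordinary Parquet files plus a _delta_log folder next to them.
    write_deltalake("/tmp/sales", orders, mode="overwrite")

    # Reading goes through the log, so only files of the current version are used.
    print(DeltaTable("/tmp/sales").to_pandas())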

The _delta_log folder
Next to the Parquet files sits a _delta_log folder full of JSON files. Each JSON file describes one commit: which files were added, which were removed, which schema changes happened. Periodically these logs are rolled up into checkpoint files so readers stay fast.
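
To make the log concrete, here is a small sketch (reusing the hypothetical /tmp/sales table from the earlier example) that prints the actions recorded in the first commit; the exact fields vary by writer and protocol version.

    import json
    from pathlib import Path

    # Commit files are named after a zero-padded version number,
    # e.g. 00000000000000000000.json for the very first commit.
    first_commit = Path("/tmp/sales/_delta_log/00000000000000000000.json")

    # Each line is one JSON action: commitInfo, protocol, metaData, add, remove, ...
    for line in first_commit.read_text().splitlines():
        action = json.loads(line)
        print(list(action.keys())[0])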

ACID transactions
Concurrent writes are ordered through optimistic concurrency control. Each writer validates its commit against the transaction log, and conflicting commits are rejected rather than applied. Readers always see a consistent version.

Time travel
Because every version of the table is in the log, you can query an older version: SELECT * FROM sales VERSION AS OF 42 or TIMESTAMP AS OF '2026-01-01'. Handy for audit, reproducibility, and undoing mistakes.
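
Delta-aware clients outside SQL expose the same capability. A sketch with the deltalake package, again against the hypothetical /tmp/sales table from the earlier examples:

    from deltalake import DeltaTable

    dt = DeltaTable("/tmp/sales")
    print(dt.version())   # current version number
    print(dt.history())   # list of commits with timestamps and operations

    # Load the table as it looked at an older version.
    old = DeltaTable("/tmp/sales", version=0)
    print(old.to_pandas())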

Schema enforcement and evolution
On every write the schema is checked against the existing one. A mismatch is rejected by default. With explicit options you can allow evolution: adding columns or widening data types.
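
A sketch of both behaviours with the deltalake package; schema_mode="merge" is how recent versions of that client spell "allow evolution", and other engines use a similar switch (for example mergeSchema in Spark), so treat the exact flag as version-dependent.

    import pandas as pd
    from deltalake import write_deltalake

    # A batch with an extra column the existing /tmp/sales table does not have.
    new_rows = pd.DataFrame({"order_id": [4], "amount": [12.0], "channel": ["web"]})

    # Default behaviour: the schema check rejects the mismatching append.
    try:
        write_deltalake("/tmp/sales", new_rows, mode="append")
    except Exception as err:
        print("rejected:", err)

    # Explicitly allow evolution: the new column is added to the table schema.
    write_deltalake("/tmp/sales", new_rows, mode="append", schema_mode="merge")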

Delta Lake versus Iceberg versus Hudi

Delta Lake is not alone. Three open table formats compete for the same role.

Delta Lake
Built by Databricks, now open source. Deepest integration with Spark, Fabric, and Databricks. Simple concept, broad tooling, large ecosystem.

Apache Iceberg
Started at Netflix, broadly adopted by Snowflake, AWS, and Google. Strong in environments where multiple engines need to write the same table. Richer catalog semantics.

Apache Hudi
Born at Uber, strong on streaming use cases and upsert-heavy workloads. Smaller community than Delta or Iceberg.

In 2024 Databricks and Snowflake announced interoperability between Delta and Iceberg. For many organisations the choice is less permanent than it used to be.

When do you pick Delta Lake?

  1. Microsoft Fabric or Databricks as your platform. Both lean heavily on Delta; choosing anything else costs you integration effort and performance.

  2. BI workloads on top of a lake. Delta gives you the performance and consistency to run Power BI directly against lakehouse tables.

  3. GDPR and right-to-be-forgotten duties. Delta's DELETE and MERGE statements remove rows surgically instead of rewriting entire partitions (see the sketch after this list).

  4. Change data capture. Delta supports Change Data Feed, which lets downstream consumers read only the changed rows.
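
A sketch of points 3 and 4 with the deltalake package. The customer_id column is hypothetical, delete() is broadly available, and load_cdf() assumes the table property delta.enableChangeDataFeed was set to "true" before the changes and a reasonably recent client version.

    from deltalake import DeltaTable

    dt = DeltaTable("/tmp/sales")

    # Point 3: remove one customer's rows without rewriting the whole table;
    # Delta rewrites only the files that actually contain matching rows.
    dt.delete("customer_id = 42")

    # Point 4: read only what changed since a given version (requires the
    # change data feed to have been enabled on the table beforehand).
    changes = dt.load_cdf(starting_version=0).read_all()
    print(changes.to_pandas())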

Pitfalls

Small files
Streaming writers sometimes produce thousands of tiny Parquet files. Queries slow to a crawl. Run OPTIMIZE (Fabric, Databricks) or a compaction job periodically to merge them.
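
Standalone clients offer the same compaction outside Fabric and Databricks; a sketch with the deltalake package:

    from deltalake import DeltaTable

    dt = DeltaTable("/tmp/sales")

    # Rewrite many small files into fewer larger ones. The old files remain
    # referenced by older versions until VACUUM cleans them up.
    print(dt.optimize.compact())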

Transaction log growth
The log grows with every commit and slows down query planning over time. Periodic checkpoints keep readers from replaying thousands of JSON commits, and VACUUM removes data files that no longer belong to any retained version.
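
A sketch of VACUUM with the deltalake package, using the common seven-day retention window; the dry run lists what would be deleted before anything is removed.

    from deltalake import DeltaTable

    dt = DeltaTable("/tmp/sales")

    # Dry run: list data files no longer referenced by any retained version.
    print(dt.vacuum(retention_hours=168))

    # Actually delete them (168 hours = 7 days of retained history).
    dt.vacuum(retention_hours=168, dry_run=False)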

Wrong partitioning
Too fine and you hit the small-file problem; too coarse and queries scan too much data. Choose partition keys based on how users query, not on what looks logical in the source.
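
For illustration, a sketch of a partitioned write with the deltalake package; the /tmp/events table and its columns are hypothetical.

    import pandas as pd
    from deltalake import write_deltalake

    # Each distinct order_date value becomes its own folder, so pick a key
    # that users actually filter on and that is not too high-cardinality.
    events = pd.DataFrame({
        "order_date": ["2026-01-01", "2026-01-01", "2026-01-02"],
        "amount": [10.0, 20.0, 30.0],
    })
    write_deltalake("/tmp/events", events, partition_by=["order_date"])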

Not every reader is Delta-aware
Pure Parquet readers see only the files, not the log. They can still show rows you deleted. Always use a Delta-aware engine for transactional correctness.

Last Updated: April 23, 2026