Delta Lake is an open storage format that extends plain Parquet files with transactions, schema enforcement, and time travel. It forms the foundation of Microsoft Fabric, Databricks, and many other lakehouses.
Delta Lake is an open storage format for large datasets in a lakehouse. It sits on top of Parquet files and adds a transaction log. That log delivers properties you usually only find in a classic database: ACID transactions, schema evolution, time travel, and efficient updates.
Delta Lake originated at Databricks and has been open source under the Linux Foundation since 2019. It is the default format of Microsoft Fabric and the storage base of OneLake. Anyone building a lakehouse in the Microsoft stack today almost always writes Delta tables.
Picture Delta Lake as a ledger sitting next to your storage cupboard. The Parquet files in the cupboard do not really change, but a strict log tracks which files belong to which version. That lets you prevent two teams from overwriting the same table, or roll back to last week's version with a single command.
A classic data lake with loose Parquet or CSV files has three fundamental pain points.
No transactions
If a write process crashes halfway, half-written files stay behind and readers see inconsistent data. A database would roll back; a pure lake makes you fix it yourself.
No schema guarantee
Nothing stops someone from adding a file with a different set of columns tomorrow. Downstream queries break without warning.
Updates and deletes are painful
Removing a row from thousands of Parquet files requires full rewrites or complex partition logic. That does not work for GDPR requests or CDC-driven ingestion.
Delta Lake solves each of these with one elegant idea: a JSON-based transaction log that is updated on every write. The log describes which files are currently valid, which have been marked removed, and which schema version is in force.
Parquet as base
Data itself is still stored as Parquet files. Any tool that can read Parquet (Spark, Trino, Python, SQL engines) can read Delta too, at least in read-only mode.
The _delta_log folder
Next to the Parquet files sits a _delta_log folder full of JSON files. Each JSON file describes one commit: which files were added, which were removed, which schema changes happened. Periodically these logs are rolled up into checkpoint files so readers stay fast.
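To make the mechanism concrete, here is a minimal sketch in plain Python of how a reader turns that log into the current table state. The commit contents are a toy imitation of real Delta log lines ("add" and "remove" actions, one JSON object per line); schema and protocol actions are omitted.

```python
import json

# Toy _delta_log: each commit is one or more JSON lines, as in a real
# Delta log file. Only "add" and "remove" actions are modelled here.
commits = [
    '{"add": {"path": "part-000.parquet"}}',
    '{"add": {"path": "part-001.parquet"}}',
    '{"remove": {"path": "part-000.parquet"}}\n{"add": {"path": "part-002.parquet"}}',
]

def active_files(commits):
    """Replay the log in order: the current table is the set of files
    that were added and never subsequently removed."""
    files = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(sorted(active_files(commits)))  # part-001 and part-002 remain
```

A real engine does the same replay, but starts from the latest checkpoint file instead of commit zero, which is why checkpoints keep readers fast.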
ACID transactions
Concurrent writes are ordered through optimistic concurrency: each writer checks the transaction log for conflicts before committing, and a losing commit is rejected and retried. Readers always see a consistent version.
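The core trick is that a commit only succeeds if its log file does not exist yet. This illustrative sketch mimics that with exclusive file creation; real Delta relies on an atomic put-if-absent of the next numbered log file in object storage.

```python
import os
import tempfile

def try_commit(log_dir, version, payload):
    """Attempt to write commit file <version>.json. Exclusive-create
    ('x' mode) fails if another writer already claimed that version,
    which is the optimistic concurrency check in miniature."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:
            f.write(payload)
        return True   # this writer won the commit
    except FileExistsError:
        return False  # conflict: re-read the log, re-validate, retry

log_dir = tempfile.mkdtemp()
print(try_commit(log_dir, 0, '{"add": {"path": "a.parquet"}}'))  # True
print(try_commit(log_dir, 0, '{"add": {"path": "b.parquet"}}'))  # False
```

The losing writer then re-reads the newest log entries, checks whether its changes still make sense against them, and retries at the next version number.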
Time travel
Because every version of the table is in the log, you can query an older version: SELECT * FROM sales VERSION AS OF 42 or TIMESTAMP AS OF '2026-01-01'. Handy for audit, reproducibility, and undoing mistakes.
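Under the hood, `VERSION AS OF` is nothing more than replaying fewer commits. A minimal sketch, using one toy JSON action per commit:

```python
import json

# One commit per version, as in _delta_log/00000000000000000000.json etc.
log = [
    '{"add": {"path": "v0.parquet"}}',
    '{"add": {"path": "v1.parquet"}}',
    '{"remove": {"path": "v0.parquet"}}',
]

def files_as_of(log, version):
    """VERSION AS OF n: replay commits 0..n and return the live files."""
    files = set()
    for commit in log[: version + 1]:
        action = json.loads(commit)
        if "add" in action:
            files.add(action["add"]["path"])
        else:
            files.discard(action["remove"]["path"])
    return files

print(sorted(files_as_of(log, 1)))  # both files, before the remove
print(sorted(files_as_of(log, 2)))  # only v1.parquet survives
```

`TIMESTAMP AS OF` works the same way: the engine maps the timestamp to the last commit at or before it, then replays up to that version.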
Schema enforcement and evolution
On every write the schema is checked against the existing one. A mismatch is rejected by default. With explicit options you can allow evolution: adding columns or widening data types.
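The check itself is simple in concept. This is an illustrative sketch, not the real engine logic: the schemas are plain dicts and `merge_schema` stands in for Delta's `mergeSchema` write option.

```python
table = {"id": "int", "amount": "double"}

def check_write(table_schema, incoming, merge_schema=False):
    """Reject writes with unknown columns; with merge_schema, evolve
    the table schema by appending the new columns instead."""
    extra = set(incoming) - set(table_schema)
    if extra and not merge_schema:
        raise ValueError(f"schema mismatch: unexpected columns {sorted(extra)}")
    return {**table_schema, **{c: incoming[c] for c in extra}}

try:
    check_write(table, {"id": "int", "amount": "double", "region": "string"})
except ValueError as e:
    print(e)  # rejected by default

evolved = check_write(table, {"id": "int", "region": "string"}, merge_schema=True)
print(evolved)  # table schema now includes region
```

Missing columns are tolerated here, mirroring the common append behaviour where absent columns are filled with nulls; type widening would need an extra compatibility check per column.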
Delta Lake is not alone. Three open table formats compete for the same role.
Delta Lake
Built by Databricks, now open source. Deepest integration with Spark, Fabric, and Databricks. Simple concept, broad tooling, large ecosystem.
Apache Iceberg
Started at Netflix, broadly adopted by Snowflake, AWS, and Google. Strong in environments where multiple engines need to write the same table. Richer catalog semantics.
Apache Hudi
Born at Uber, strong on streaming use cases and upsert-heavy workloads. Smaller community than Delta or Iceberg.
In 2024 Databricks and Snowflake announced interoperability between Delta and Iceberg. For many organisations the choice is less permanent than it used to be.
Delta Lake is a natural fit in situations like these.
Microsoft Fabric or Databricks as your platform. Both lean heavily on Delta. Anything else costs you integration and performance.
BI workloads on top of a lake. Delta gives you the performance and consistency to run Power BI directly against lakehouse tables.
GDPR and right-to-be-forgotten duties. Delta's DELETE and MERGE statements remove rows surgically instead of rewriting entire partitions.
Change data capture. Delta supports Change Data Feed, which lets downstream consumers read only the changed rows.
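Conceptually, a change feed tags each changed row with what happened to it. This toy sketch diffs two keyed snapshots and labels the rows the way Delta's Change Data Feed does with its `_change_type` column; the real feature records these rows at write time rather than by diffing.

```python
def change_feed(before, after):
    """Diff two keyed snapshots into change rows tagged with
    _change_type, mimicking Delta's Change Data Feed output."""
    changes = []
    for key, row in after.items():
        if key not in before:
            changes.append({**row, "_change_type": "insert"})
        elif row != before[key]:
            changes.append({**before[key], "_change_type": "update_preimage"})
            changes.append({**row, "_change_type": "update_postimage"})
    for key, row in before.items():
        if key not in after:
            changes.append({**row, "_change_type": "delete"})
    return changes

before = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 20}}
after  = {1: {"id": 1, "amount": 15}, 3: {"id": 3, "amount": 30}}
for change in change_feed(before, after):
    print(change)
```

Downstream consumers read only these change rows instead of rescanning the whole table, which is what makes incremental pipelines cheap.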
Small files
Streaming writers sometimes produce thousands of tiny Parquet files. Queries slow to a crawl. Run OPTIMIZE (Fabric, Databricks) or a compaction job periodically to merge them.
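What compaction does is easy to see in the log model: merge several small files into one larger file and record the swap as a single commit. A toy sketch (files are dicts of row lists here, and the threshold and output name are made up for illustration):

```python
def compact(files, log, min_rows=2):
    """Toy OPTIMIZE: merge files smaller than min_rows into one larger
    file and log the swap atomically (one add, several removes)."""
    small = {path: rows for path, rows in files.items() if len(rows) < min_rows}
    if len(small) < 2:
        return files  # nothing worth compacting
    merged = [row for rows in small.values() for row in rows]
    new_path = "part-compacted.parquet"
    commit = [{"add": {"path": new_path}}]
    for path in small:
        del files[path]
        commit.append({"remove": {"path": path}})
    files[new_path] = merged
    log.append(commit)
    return files

files = {"a.parquet": [{"id": 1}], "b.parquet": [{"id": 2}],
         "big.parquet": [{"id": 3}, {"id": 4}]}
log = []
compact(files, log)
print(sorted(files))  # two tiny files replaced by one compacted file
```

Note that the small files are only marked removed in the log; the bytes stay on disk until VACUUM cleans them up, which is also what keeps time travel working.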
Transaction log drift
The log grows over time and slows query startup. Periodic checkpoints compact the log so readers can skip old commits, and VACUUM removes data files that no current version references any more.
Wrong partitioning
Too fine and you hit the small-file problem, too coarse and queries scan too much. Choose partition keys based on how users query, not on what looks logical in the source.
Not every reader is Delta-aware
Pure Parquet readers see only the files, not the log. They can still show rows you deleted. Always use a Delta-aware engine for transactional correctness.