Apache Parquet is a columnar file format designed for analytical queries on large datasets. It is the storage layer under almost every modern lakehouse, from Microsoft Fabric to Databricks and Snowflake.
Apache Parquet is an open file format for structured data, optimised for analytical queries. Instead of storing rows one after another (as CSV or JSON do), Parquet stores values column by column. That may sound like a minor detail, but on large datasets it delivers orders-of-magnitude faster queries and much smaller files.
Parquet was launched in 2013 by Twitter and Cloudera and is today the de facto standard in the Hadoop and lakehouse ecosystem. It is the underlying format in Microsoft Fabric and Databricks, is read by Snowflake for external tables, fills data lakes on Amazon S3 and Azure Data Lake Storage, and sits in practically every data engineer's toolbox.
Think of the difference between a filing cabinet full of paper folders and a ledger with columns. To total all the salaries in the ledger, you simply run down the salary column. In a row-based system you open every folder and hunt for the right page.
Analytical queries usually touch few columns but many rows: sum of revenue by month, customers by region. With row-based storage you must read every whole record from disk, including columns you do not use. With columnar storage you only read the columns you need.
Three benefits flow from this:
I/O savings
A query that touches three out of seventy columns reads only a small fraction of the data from disk or blob storage. On cloud storage where you pay per byte, that shows up directly on the bill.
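A minimal sketch of what that column pruning looks like in practice, assuming a hypothetical sales.parquet with dozens of columns; pandas (via pyarrow) reads only the columns you name:

```python
import pandas as pd

# Hypothetical wide sales table; only the three named columns
# are actually read from disk or blob storage.
df = pd.read_parquet(
    "sales.parquet",
    columns=["order_date", "region", "revenue"],
)

# Typical analytical query: revenue per month.
monthly = df.groupby(df["order_date"].dt.to_period("M"))["revenue"].sum()
print(monthly)
```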
Better compression
Values within one column look alike (same data type, often similar scale). Compression algorithms like Snappy, Zstd, or GZIP extract much more from them than from mixed row data. Parquet files are typically two to ten times smaller than a CSV with the same content.
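A rough way to see the difference yourself; the data here is synthetic, but the codecs are the standard ones pandas and pyarrow support:

```python
import os
import numpy as np
import pandas as pd

# Synthetic data: one million rows whose columns contain similar values.
df = pd.DataFrame({
    "customer_id": np.random.randint(1, 50_000, size=1_000_000),
    "country": np.random.choice(["NL", "BE", "DE", "FR"], size=1_000_000),
    "amount": np.round(np.random.uniform(5, 500, size=1_000_000), 2),
})

df.to_csv("example.csv", index=False)
df.to_parquet("example_snappy.parquet", compression="snappy")  # fast, the common default
df.to_parquet("example_zstd.parquet", compression="zstd")      # usually smaller

for path in ["example.csv", "example_snappy.parquet", "example_zstd.parquet"]:
    print(path, round(os.path.getsize(path) / 1_000_000, 1), "MB")
```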
Vectorised execution
Modern query engines process column values in batches on CPU or GPU. That is far faster than interpreting one row at a time.
A Parquet file is more than a flat list. It has several layers that together allow efficient scanning.
Row groups
A file is split into blocks of typically 128 MB, each a row group. Inside a row group, every column is stored separately. That enables parallel processing per row group.
Column chunks and pages
Inside a row group, values per column are grouped into column chunks, which in turn consist of pages. Compression happens at page level.
Statistics and indexes
Every column chunk carries min and max values, null counts, and sometimes a Bloom filter. Query engines use these to skip whole chunks when the requested value cannot occur in them. That technique is called predicate pushdown and it delivers dramatic speed-ups.
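A sketch of predicate pushdown from the reader's side, assuming a hypothetical sales.parquet with a year column; pyarrow evaluates the filter against the row group statistics before reading the data:

```python
import pyarrow.parquet as pq

# Row groups whose min/max statistics show that 'year' can never
# equal 2024 are skipped entirely; the rest are read and filtered.
table = pq.read_table(
    "sales.parquet",
    columns=["order_id", "revenue"],
    filters=[("year", "=", 2024)],
)
print(table.num_rows)
```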
Schema in the footer
At the end of every file sits a footer with the full schema, data types, encoding, and statistics. That lets any reader interpret a Parquet file without extra metadata.
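You can inspect that footer directly; a sketch using pyarrow against the same hypothetical file:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")

print(pf.schema_arrow)                            # full schema from the footer
print("row groups:", pf.metadata.num_row_groups)

# Min/max statistics of the first column chunk in the first row group.
stats = pf.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```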
Parquet versus CSV
CSV is row-based, text, without schema. Fine for small exchanges between systems. Anything bigger than a few million rows is faster, smaller, and safer (because typed) in Parquet.
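Converting an existing CSV once is usually the quickest win; a sketch assuming a hypothetical orders.csv:

```python
import pandas as pd

# CSV carries no schema, so the typing has to happen at read time.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Parquet stores the dtypes in the file itself, so every later reader
# gets dates, integers and floats back without re-parsing strings.
orders.to_parquet("orders.parquet", index=False)

print(pd.read_parquet("orders.parquet").dtypes)
```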
Parquet versus JSON
JSON is flexible for nested structures but wastes space with repeated keys. Parquet supports nested structures too (lists, maps, structs) and does so far more efficiently.
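A small sketch of nested data in Parquet: pyarrow infers struct and list types from plain Python dicts and lists:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A struct column and a list column, the kind of shape JSON handles naturally.
table = pa.table({
    "order_id": [1, 2],
    "customer": [{"name": "Ada", "country": "NL"},
                 {"name": "Linus", "country": "BE"}],
    "items": [["keyboard", "mouse"], ["monitor"]],
})

pq.write_table(table, "orders_nested.parquet")
print(pq.read_table("orders_nested.parquet").schema)
```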
Parquet versus ORC
ORC is a similar columnar format from the Hive ecosystem. Technically close, but Parquet won in practice because it is better supported outside Hadoop, including by Python, Spark, and all major lakehouse vendors.
Delta Lake, Apache Iceberg, and Apache Hudi all store their data as Parquet files under the hood. What those formats add is a transaction log and metadata, not a different storage format. A lakehouse table is essentially a folder of Parquet files plus a log that describes which files belong to which version.
That is exactly why open formats matter: your data is never locked in a proprietary binary. Any tool that reads Parquet (Python, Spark, Trino, DuckDB) can work directly against your lakehouse, albeit without transactional guarantees.
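A sketch with DuckDB, assuming a hypothetical lake/sales/ folder of Parquet files; any other Parquet reader would work the same way:

```python
import duckdb

# Query the folder of Parquet files directly, without Spark or a warehouse.
result = duckdb.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM read_parquet('lake/sales/*.parquet')
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()

print(result)
```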
Too many small files
Parquet is built for large files. If your ETL writes a new 2 MB Parquet file every minute, you lose the columnar advantage and queries crawl. Plan for compaction, or use a format with built-in optimisation such as Delta.
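A minimal compaction pass with pyarrow, assuming a hypothetical landing/events/ folder full of tiny files; real pipelines would schedule this or lean on something like Delta's OPTIMIZE:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Read the accumulated small files as one logical dataset...
small_files = ds.dataset("landing/events/", format="parquet")

# ...and rewrite them as a single large file.
pq.write_table(
    small_files.to_table(),
    "compacted/events.parquet",
)
```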
Wrong row group size
The default sits at 128 MB. For very narrow tables (few columns) or very wide ones (hundreds of columns), tuning is worthwhile.
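Note that row_group_size in pyarrow is counted in rows, not bytes, so the right number depends on how wide your rows are; a toy sketch:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table; in practice this would be your real data.
table = pa.table({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
})

# 500,000 rows per group -> two row groups in this file.
pq.write_table(table, "tuned.parquet", row_group_size=500_000)

print(pq.ParquetFile("tuned.parquet").metadata.num_row_groups)
```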
Changes are expensive
Parquet is immutable. Updating a single row means rewriting the whole file. That is why ETL/ELT pipelines rely on append-only writes, or on Delta/Iceberg when updates are needed.
No schema guarantee at folder level
Each Parquet file carries its own schema. When different runs write different schemas, you end up with a folder of incompatible files. Schema enforcement needs to live in a layer above, such as Delta or Iceberg.
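A quick way to spot schema drift in a plain Parquet folder, assuming a hypothetical lake/customers/ directory; Delta or Iceberg would reject the mismatching write instead:

```python
import glob
import pyarrow.parquet as pq

# Compare every file's footer schema against the first file found.
paths = sorted(glob.glob("lake/customers/*.parquet"))
reference = pq.read_schema(paths[0])

for path in paths[1:]:
    if not pq.read_schema(path).equals(reference):
        print("schema drift in", path)
```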