Apache Parquet is a columnar file format designed for analytical queries on large datasets. It is the storage layer under almost every modern lakehouse, from Microsoft Fabric to Databricks and Snowflake.
Apache Parquet is an open file format for structured data, optimised for analytical queries. Instead of storing rows one after another (as CSV or JSON do), Parquet stores values column by column. That may sound like a minor detail, but on large datasets it delivers orders-of-magnitude faster queries and much smaller files.
Parquet was launched in 2013 by Twitter and Cloudera and is today the de facto standard in the Hadoop and lakehouse ecosystem. It is the underlying format in Microsoft Fabric and Databricks, is read by Snowflake for external tables, fills data lakes on Amazon S3 and Azure Data Lake Storage, and sits in practically every data engineer's toolbox.
Think of the difference between a filing cabinet full of paper folders and a ledger with columns. To total all the salaries in the ledger, you simply run down the salary column. In a row-based system you open every folder and hunt for the right page.
Analytical queries usually touch few columns but many rows: sum of revenue by month, customers by region. With row-based storage you must read every whole record from disk, including columns you do not use. With columnar storage you only read the columns you need.
Three benefits flow from this:
I/O savings
A query that touches three out of seventy columns reads only a small fraction of the data from disk or blob storage. On cloud storage where you pay per byte, that shows up directly on the bill.
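A minimal sketch of what that column pruning looks like in practice, assuming a hypothetical sales.parquet with dozens of columns; pandas (via pyarrow) reads only the columns you name:

```python
import pandas as pd

# Hypothetical wide sales table; only the three named columns
# are actually read from disk or blob storage.
df = pd.read_parquet(
    "sales.parquet",
    columns=["order_date", "region", "revenue"],
)

# Typical analytical query: revenue per month.
monthly = df.groupby(df["order_date"].dt.to_period("M"))["revenue"].sum()
print(monthly)
```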
Better compression
Values within one column look alike (same data type, often similar scale). Compression algorithms like Snappy, Zstd, or GZIP extract much more from them than from mixed row data. Parquet files are typically two to ten times smaller than a CSV with the same content.
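A rough way to see the difference yourself; the data here is synthetic, but the codecs are the standard ones pandas and pyarrow support:

```python
import os
import numpy as np
import pandas as pd

# Synthetic data: one million rows whose columns contain similar values.
df = pd.DataFrame({
    "customer_id": np.random.randint(1, 50_000, size=1_000_000),
    "country": np.random.choice(["NL", "BE", "DE", "FR"], size=1_000_000),
    "amount": np.round(np.random.uniform(5, 500, size=1_000_000), 2),
})

df.to_csv("example.csv", index=False)
df.to_parquet("example_snappy.parquet", compression="snappy")  # fast, the common default
df.to_parquet("example_zstd.parquet", compression="zstd")      # usually smaller

for path in ["example.csv", "example_snappy.parquet", "example_zstd.parquet"]:
    print(path, round(os.path.getsize(path) / 1_000_000, 1), "MB")
```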
Vectorised execution
Modern query engines process column values in batches on CPU or GPU. That is far faster than interpreting one row at a time.
A Parquet file is more than a flat list. It has several layers that together allow efficient scanning.
Row groups
A file is split into blocks of typically 128 MB, each a row group. Inside a row group, every column is stored separately. That enables parallel processing per row group.
Column chunks and pages
Inside a row group, values per column are grouped into column chunks, which in turn consist of pages. Compression happens at page level.
Statistics and indexes
Every column chunk carries min and max values, null counts, and sometimes a Bloom filter. Query engines use these to skip whole chunks when the requested value cannot occur in them. That technique is called predicate pushdown and it delivers dramatic speed-ups.
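A sketch of predicate pushdown from the reader's side, assuming a hypothetical sales.parquet with a year column; pyarrow evaluates the filter against the row group statistics before reading the data:

```python
import pyarrow.parquet as pq

# Row groups whose min/max statistics show that 'year' can never
# equal 2024 are skipped entirely; the rest are read and filtered.
table = pq.read_table(
    "sales.parquet",
    columns=["order_id", "revenue"],
    filters=[("year", "=", 2024)],
)
print(table.num_rows)
```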
Schema in the footer
At the end of every file sits a footer with the full schema, data types, encoding, and statistics. That lets any reader interpret a Parquet file without extra metadata.
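You can inspect that footer directly; a sketch using pyarrow against the same hypothetical file:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")

print(pf.schema_arrow)                            # full schema from the footer
print("row groups:", pf.metadata.num_row_groups)

# Min/max statistics of the first column chunk in the first row group.
stats = pf.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```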
Parquet versus CSV
CSV is row-based, text, without schema. Fine for small exchanges between systems. Anything bigger than a few million rows is faster, smaller, and safer (because typed) in Parquet.
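Converting an existing CSV once is usually the quickest win; a sketch assuming a hypothetical orders.csv:

```python
import pandas as pd

# CSV carries no schema, so the typing has to happen at read time.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Parquet stores the dtypes in the file itself, so every later reader
# gets dates, integers and floats back without re-parsing strings.
orders.to_parquet("orders.parquet", index=False)

print(pd.read_parquet("orders.parquet").dtypes)
```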
Parquet versus JSON
JSON is flexible for nested structures but wastes space with repeated keys. Parquet supports nested structures too (lists, maps, structs) and does so far more efficiently.
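A small sketch of nested data in Parquet: pyarrow infers struct and list types from plain Python dicts and lists:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A struct column and a list column, the kind of shape JSON handles naturally.
table = pa.table({
    "order_id": [1, 2],
    "customer": [{"name": "Ada", "country": "NL"},
                 {"name": "Linus", "country": "BE"}],
    "items": [["keyboard", "mouse"], ["monitor"]],
})

pq.write_table(table, "orders_nested.parquet")
print(pq.read_table("orders_nested.parquet").schema)
```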
Parquet versus ORC
ORC is a similar columnar format from the Hive ecosystem. Technically close, but Parquet won in practice because it is better supported outside Hadoop, including by Python, Spark, and all major lakehouse vendors.
Delta Lake, Apache Iceberg, and Apache Hudi all store their data as Parquet files under the hood. What those formats add is a transaction log and metadata, not a different storage format. A lakehouse table is essentially a folder of Parquet files plus a log that describes which files belong to which version.
That is exactly why open formats matter: your data is never locked in a proprietary binary. Any tool that reads Parquet (Python, Spark, Trino, DuckDB) can work directly against your lakehouse, albeit without transactional guarantees.
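A sketch with DuckDB, assuming a hypothetical lake/sales/ folder of Parquet files; any other Parquet reader would work the same way:

```python
import duckdb

# Query the folder of Parquet files directly, without Spark or a warehouse.
result = duckdb.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM read_parquet('lake/sales/*.parquet')
    GROUP BY region
    ORDER BY total_revenue DESC
""").df()

print(result)
```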
Too many small files
Parquet is built for large files. If your ETL writes a new 2 MB Parquet file every minute, you lose the columnar advantage and queries crawl. Plan for compaction, or use a format with built-in optimisation such as Delta.
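A minimal compaction pass with pyarrow, assuming a hypothetical landing/events/ folder full of tiny files; real pipelines would schedule this or lean on something like Delta's OPTIMIZE:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Read the accumulated small files as one logical dataset...
small_files = ds.dataset("landing/events/", format="parquet")

# ...and rewrite them as a single large file.
pq.write_table(
    small_files.to_table(),
    "compacted/events.parquet",
)
```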
Wrong row group size
The default sits at 128 MB. For very narrow tables (few columns) or very wide ones (hundreds of columns), tuning is worthwhile.
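Note that row_group_size in pyarrow is counted in rows, not bytes, so the right number depends on how wide your rows are; a toy sketch:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table; in practice this would be your real data.
table = pa.table({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
})

# 500,000 rows per group -> two row groups in this file.
pq.write_table(table, "tuned.parquet", row_group_size=500_000)

print(pq.ParquetFile("tuned.parquet").metadata.num_row_groups)
```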
Changes are expensive
Parquet is immutable. Updating a single row means rewriting the whole file. That is why ETL/ELT pipelines rely on append-only writes, or on Delta/Iceberg when updates are needed.
No schema guarantee at folder level
Each Parquet file carries its own schema. When different runs write different schemas, you end up with a folder of incompatible files. Schema enforcement needs to live in a layer above, such as Delta or Iceberg.
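A quick way to spot schema drift in a plain Parquet folder, assuming a hypothetical lake/customers/ directory; Delta or Iceberg would reject the mismatching write instead:

```python
import glob
import pyarrow.parquet as pq

# Compare every file's footer schema against the first file found.
paths = sorted(glob.glob("lake/customers/*.parquet"))
reference = pq.read_schema(paths[0])

for path in paths[1:]:
    if not pq.read_schema(path).equals(reference):
        print("schema drift in", path)
```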