Dictionary

Data warehouse

A data warehouse is a central database that collects data from many source systems and structures it for reporting and analysis. It's optimised for fast reads and aggregation across large volumes of historical data.

What is a data warehouse?

A data warehouse is a central database purpose-built for reporting and analysis. It collects data from different source systems (ERP, CRM, webshop, accounting) and structures it into a consistent model, keeping the full history. Analysts, dashboards, and AI applications pull their data from here without loading the underlying operational systems.

You can think of a data warehouse as the memory of an organisation. Operational systems live in the here and now: how much stock do you have today, which invoice do you send today. A warehouse answers questions about evolution: how has our revenue trended over three years, which customers churn after which interaction, how does this quarter compare with the previous one.

The classic definition comes from Bill Inmon: a subject-oriented, integrated, time-variant, non-volatile collection of data that supports decision-making. Those four properties still apply today, even though warehouses now look very different technically from those of the 1990s.

What do you use a data warehouse for?

  1. Management reporting
    Monthly revenue reports, quarterly analyses, board dashboards. Classic, and still the biggest user.

  2. Self-service analytics
    Business users build their own reports in tools like Power BI without loading the operational systems or writing SQL against ten different databases themselves.

  3. Historical analysis and trends
    Operational systems overwrite old values (customer address, product price). A warehouse preserves the history so you can see how things evolved.

  4. Data integration
    One place where customer id from CRM and customer id from billing merge into a single customer view. Without a warehouse, every analyst reinvents that integration each time.

  5. Foundation for AI and ML
    Machine learning models need clean, historical, structured data. The warehouse is usually the starting point.

How is a data warehouse structured?

A classic warehouse has several layers, each with its own role.

Staging layer
Raw data from the source systems lands here with minimal processing. Its purpose is to limit load on the sources and make the rest of the pipeline restartable.

Integration layer or data vault
The data is cleaned, de-duplicated, and aligned around business entities (customer, product, order). This is where the truth shared by every report lives.

Presentation layer or data marts
This layer holds star schemas (fact tables surrounded by dimension tables) optimised for fast reporting. The Kimball methodology is the classic approach.

Two modelling philosophies dominate:

  • Kimball (dimensional modelling)
    Pragmatic, business-friendly, star schemas per subject. Fast to build, good for reporting.

  • Inmon (normalised)
    One large normalised enterprise model first, then data marts. More rigorous, but slower to deliver first value.

Technically, modern warehouses rely on columnar storage. Column-based storage keeps all values of a column next to each other. This is ideal for aggregations (sums, averages) over billions of rows and far more efficient than the row-based storage of classic transactional databases.

Data warehouse versus operational database

Both are databases, but their purpose and setup are fundamentally different.

An operational database (OLTP) supports daily transactions. Many small reads and writes, heavy normalisation to avoid redundancy, focus on current state. Think of the database under a webshop handling hundreds of orders a second.

A data warehouse (OLAP) supports analysis. Few but large reads, denormalisation to limit joins, focus on historical and aggregated values. Think of a report summarising three years of sales data by region and product category.

The two aren't interchangeable. Running reports directly against an operational database loads that system and slows the business down. Running transactions against a warehouse fails or is painfully slow. ETL or ELT pipelines therefore move operational data daily or continuously into the warehouse.

Cloud data warehouses

Over the last decade cloud warehouses have become the default. The main players:

  • Snowflake
    Pure cloud warehouse, multi-cloud, decoupled storage and compute, easy data sharing between organisations.

  • Google BigQuery
    Serverless, pay per query, deeply woven into the Google Cloud ecosystem.

  • Amazon Redshift
    AWS-native warehouse, a natural fit if you're already on AWS, with Redshift Spectrum for querying S3 data.

  • Microsoft Fabric Warehouse
    Part of Microsoft Fabric, built on OneLake, deeply integrated with Power BI and Azure.

  • Databricks SQL Warehouse
    A SQL layer on top of a lakehouse, strong for teams that want to combine analytics and ML.

The choice between these platforms depends less on raw performance (they're all fast enough) and more on which ecosystem you already use, which tools your team knows, and how you want to organise data governance.

Data warehouse versus lakehouse

The rise of the lakehouse concept raised the question whether a classic warehouse is still needed.

A classic warehouse excels at structured data, predictable performance, strong governance, and straightforward SQL access. Less suited for unstructured data (documents, images) or heavy-duty machine learning workloads.

A lakehouse combines a data lake (flexible storage of all kinds of data in open formats) with warehouse-like features (ACID transactions, SQL, performance). A better fit for organisations that want one platform for BI and data science.

In modern architectures both often coexist. The lakehouse stores raw and semi-processed data for experiments and ML. The warehouse (or a structured layer inside the lakehouse) serves self-service reporting with strict definitions and strong security. Microsoft Fabric is a good example of a platform that brings both together in one environment.

Last Updated: October 17, 2025 Back to Dictionary
Keywords
data warehouse datawarehouse dwh lakehouse microsoft fabric etl elt business intelligence star schema snowflake power bi