Data contract
A data contract is an explicit agreement between the producer and the consumers of a dataset: which schema, which quality, which frequency, which owner. Without a contract, data lives on goodwill, and goodwill does not scale.
A data contract is an explicit, machine-readable agreement between the producer of a dataset and its consumers. It describes the schema (columns, types, required fields), quality expectations (no nulls on a key, no duplicates), update frequency, ownership, and the process for changes. Once you produce data for others, you also become accountable for what you promise.
Data contracts became popular around 2022 inside the data mesh conversation, but they stand on their own: classic data teams benefit from them too. They bring the same discipline that APIs have had in software to the world of tables and streams.
Compare a data contract to a commercial agreement between two companies. Verbal deals work while volumes are small and everyone knows each other. Once multiple parties get involved or the stakes rise, you want your commitments in writing.
Without a contract, data pipelines typically run into three problems.
Breaking changes without warning
A database team renames a column or changes a type. Downstream systems break. The producer did not know your team relied on that column.
Confusion over semantics
Is revenue inclusive of VAT or not? Does order date start at booking or at delivery? Without an explicit definition, every consumer ends up with a different truth in their reports.
Unclear accountability
When data is wrong, who fixes it? Without a contract, everyone points down the chain.
A data contract pins every agreement down and makes it testable: a new release of the source is blocked if it breaks the contract, just as with a public API.
What a data contract contains
Schema
All fields with their types, whether they are required or optional, and the meaning of each field. Usually expressed in JSON Schema, Avro, Protobuf, or a YAML variant.
Semantic definition
What does each field mean in business terms? OrderDate is the date the order was created in the source system, not the shipping date.
Quality expectations
Unique keys, expected value ranges, referential integrity, null percentages tolerated, expected volumes per run.
SLA
When is the data available (every day by 07:00), how fresh (no older than 15 minutes), how stable (99.5 percent uptime).
Owner and contact
Which team is accountable, and how consumers can ask questions, report incidents, or request changes.
Version control
How are changes rolled out? Which backward-compatibility guarantees apply, how long old versions remain available, and how deprecations are communicated.
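The components above can be pulled together in one machine-readable document. A minimal sketch in Python follows; all field and key names are hypothetical, and real standards such as the Open Data Contract Standard use YAML with their own key names:

```python
# Hypothetical data contract as a plain Python dict, mirroring the six
# components above. Real standards (e.g. the Open Data Contract Standard)
# use YAML with their own schema; this is only an illustration.
orders_contract = {
    "name": "orders",
    "version": "1.2.0",
    "owner": {"team": "sales-data", "contact": "#sales-data-support"},
    "schema": {
        "order_id": {"type": "string", "required": True,
                     "description": "Unique order key"},
        "order_date": {"type": "date", "required": True,
                       "description": "Date the order was created in the "
                                      "source system, not the shipping date"},
        "revenue": {"type": "decimal", "required": True,
                    "description": "Order total excluding VAT, in EUR"},
    },
    "quality": {
        "unique_key": ["order_id"],
        "max_null_pct": {"revenue": 0.0},
        "expected_rows_per_run": {"min": 1_000, "max": 1_000_000},
    },
    "sla": {
        "available_by": "07:00",
        "max_staleness_minutes": 15,
        "uptime_pct": 99.5,
    },
    "changes": {
        "process": "pull request",
        "breaking_changes": "require consumer sign-off",
        "deprecation_notice_days": 90,
    },
}
```

Note that the semantic definitions live in the `description` fields, next to the types they describe, so a schema change and its meaning are reviewed together.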
A healthy workflow around data contracts usually looks like this:
Design the contract. Producer and consumers collaborate on a first version. That conversation alone surfaces misunderstandings that had been quiet for years.
Publish the contract. The contract lives in a repository (Git, a catalog, a dedicated data contract tool). It is version-controlled.
Generate tests. The contract produces automatic tests in your ETL pipeline. On every run, schema and quality are validated. Failing tests block the next step.
Propose changes through a pull request. Whoever wants to change the source proposes it in the contract. Breaking changes require explicit consent from consumers or a compatibility period.
Monitoring and incidents. When a consumer reports an SLA breach, it is against the contract. No debates over what should have been, only over what was actually agreed.
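Step 3, turning the contract into blocking tests, can be sketched in a few lines. This is a deliberately simplified validator with illustrative field names; real pipelines delegate this to tools such as Great Expectations, Soda Core, or dbt:

```python
# Simplified sketch: validate a batch of rows against a contract's
# schema and quality rules, and raise (blocking the next pipeline step)
# on any violation. Contract shape and field names are illustrative.
def validate_batch(rows, contract):
    errors = []
    schema = contract["schema"]
    required = [f for f, spec in schema.items() if spec.get("required")]

    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                errors.append(f"row {i}: required field '{field}' is null")

    key_fields = contract["quality"]["unique_key"]
    keys = [tuple(row.get(f) for f in key_fields) for row in rows]
    if len(keys) != len(set(keys)):
        errors.append(f"duplicate values for key {key_fields}")

    if errors:
        # A failing contract check blocks the run, like a failing build.
        raise ValueError("contract violations: " + "; ".join(errors))
    return True

contract = {
    "schema": {"order_id": {"type": "string", "required": True}},
    "quality": {"unique_key": ["order_id"]},
}
validate_batch([{"order_id": "A1"}, {"order_id": "A2"}], contract)
```

Wiring this into CI/CD is what gives the contract teeth: a batch with a null or duplicate key never reaches the next step.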
Tooling
Open Data Contract Standard
A YAML-based standard under the Bitol community. A solid starting point to formalise schema, SLA, and ownership.
dbt contracts
dbt supports contracts on models: column types and required columns are checked at build time.
Great Expectations, Soda Core
Data quality tools that can enforce contract rules inside the pipeline.
Schema registries
For streaming data (Kafka), the Confluent Schema Registry has been around longer and plays a similar role at message level.
Microsoft Purview
In Microsoft stacks, Purview records definitions, classifications, and ownership, tied to data lineage. Not yet a full data contract tool, but a useful building block.
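The compatibility guarantees a schema registry enforces can be approximated in a few lines. Below is a sketch of a backward-compatibility check under two common simplified rules (an existing field may not disappear or change type, and a new field may not be required); the dict-shaped schema is an assumption for illustration, not any registry's actual API:

```python
# Sketch of a backward-compatibility check, in the spirit of what a
# schema registry enforces. Schema shape (field -> {"type", "required"})
# is assumed for illustration only.
def is_backward_compatible(old_schema, new_schema):
    for field, spec in old_schema.items():
        if field not in new_schema:
            return False, f"field '{field}' was removed"
        if new_schema[field]["type"] != spec["type"]:
            return False, f"field '{field}' changed type"
    for field, spec in new_schema.items():
        if field not in old_schema and spec.get("required"):
            return False, f"new field '{field}' may not be required"
    return True, "ok"

old = {"order_id": {"type": "string", "required": True}}
new = {
    "order_id": {"type": "string", "required": True},
    "channel": {"type": "string", "required": False},
}
ok, reason = is_backward_compatible(old, new)
# Adding an optional field passes; dropping or retyping order_id would not.
```

A check like this is what lets a producer evolve a schema without explicit consumer sign-off, as long as the change stays within the agreed compatibility rules.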
When do you need a data contract?
Multiple teams produce and consume each other's data. From three teams onwards, informal agreements stop scaling.
Data feeds customer-facing products. A wrong figure in an internal report is annoying. A wrong figure on a customer dashboard or inside an AI agent is a crisis.
Compliance-sensitive data. Financial reporting, medical data, GDPR-covered personal data. The contract also supports the audit trail.
Decentralised data landscape. Data mesh or any federated model does not work without contracts. Domain autonomy requires clear interfaces.
Common pitfalls
Contract as a dictate
A contract imposed unilaterally by the producer gets ignored. Treat it as a conversation, not a decree.
Too much overhead
For internal, stable datasets with one consumer, a full contract is overkill. Reserve it for datasets with multiple consumers or external impact.
Schema only, no semantics
A contract that only captures JSON Schema catches syntactic breaks, not semantic ones. A price suddenly expressed in euros instead of dollars slips straight through.
No consequences
If contract breakage does not break the build, it remains a nice idea without teeth. Wire it into CI/CD.
Owner moved teams
Teams reorganise and the owner in the contract no longer matches. Periodic review is part of good governance.
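The schema-only pitfall above is easy to demonstrate: a type check happily accepts a record whose meaning has silently changed. A small sketch with hypothetical field names:

```python
# A schema-level check only looks at types, so a price that silently
# switched from dollars to euros still passes. Catching that requires
# an explicit semantic expectation, e.g. a declared currency.
def schema_check(row):
    return isinstance(row.get("price"), float)

def semantic_check(row, expected_currency="USD"):
    return schema_check(row) and row.get("currency") == expected_currency

row = {"price": 19.99, "currency": "EUR"}  # producer switched currency
assert schema_check(row)        # syntactic check: passes
assert not semantic_check(row)  # semantic check: caught
```

The same idea extends to value ranges and reference data: every semantic expectation worth arguing about belongs in the contract, not in tribal knowledge.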