Why We Use Data Contracts (And You Should Too)
By Annie
Here's a thing that haunts every data team: you ship a dashboard, someone makes a decision based on it, and three months later you discover the underlying data was quietly broken the whole time. A column got renamed. A join stopped working. Nulls started appearing where they shouldn't. And nobody knew until it was too late.
Welcome to the data trust problem. And it's especially brutal in financial data, where "oops, the numbers were wrong" can mean real money, compliance failures, or just... looking like an idiot in front of users who trusted you.
This is why kibble.shop uses data contracts for every single dataset we publish. Not as documentation. Not as aspirational policy. As executable gates that decide whether data ships or not.
What the hell is a data contract?
A data contract is a formal, machine-readable agreement about what a dataset must look like and must contain to be considered valid.
Think of it like an API contract, but for data tables:
- Schema: What columns exist, what types they are, whether they can be null
- Quality rules: Value ranges, uniqueness constraints, referential integrity, freshness requirements
- Expected behavior: Row counts shouldn't suddenly drop by 80%. Dates should be sequential. Prices shouldn't be negative (usually).
Instead of hoping your ETL pipeline didn't break something, you define what "correct" looks like up front. Then you run automated checks before publishing. If the checks fail, the data doesn't ship. Simple.
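The idea above can be sketched in a few lines. This is a toy illustration of "define correct up front, check before publishing" — not Soda Core's API, and the column names and rules are hypothetical, not kibble.shop's real contract:

```python
# Toy contract: expected columns, their types, and nullability.
# All names here are illustrative, not a real kibble.shop contract.
CONTRACT = {
    "ticker": {"type": str, "required": True},
    "shares": {"type": int, "required": True},
    "price_per_share": {"type": float, "required": False},
}

def validate(rows):
    """Return a list of violations; an empty list means the data may ship."""
    violations = []
    for i, row in enumerate(rows):
        for col, rule in CONTRACT.items():
            value = row.get(col)
            if value is None:
                if rule["required"]:
                    violations.append(f"row {i}: {col} is null but required")
            elif not isinstance(value, rule["type"]):
                violations.append(f"row {i}: {col} has type {type(value).__name__}")
    return violations

good = [{"ticker": "AAPL", "shares": 100, "price_per_share": 45.23}]
bad = [{"ticker": None, "shares": "100", "price_per_share": None}]

assert validate(good) == []       # passes: data ships
assert len(validate(bad)) == 2    # null required field + wrong type: blocked
```

Real frameworks add a lot on top of this (SQL execution, alerting, history), but the core gate is exactly this shape: checks either all pass, or the data doesn't publish.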
Why this matters for financial data specifically
Financial data has a... let's say low tolerance for error. If you're showing someone insider trading data, yield curves, or options flow, and the numbers are wrong, you've just destroyed trust. Possibly forever.
Some failure modes I've personally witnessed in financial data pipelines:
💀 The silent schema drift
An upstream API added a new field and stopped populating an old one. The pipeline kept running. The dashboard kept loading. The column just... had nulls now. For six weeks. Nobody noticed until a client asked why half the data was missing.
💀 The type mismatch catastrophe
A price field that was always a float suddenly started coming through as strings ("$45.23" instead of 45.23). The ingestion didn't fail—it just coerced everything to zero. Every chart looked like the market had crashed. Panic ensued.
💀 The deduplication that wasn't
Duplicate rows started appearing in an options flow dataset after a pipeline change. Volumes looked 2x higher than reality. People made trades based on that signal. It was just the same transactions counted twice.
💀 The timezone incident
A timestamp field switched from UTC to local time without warning. Every event looked like it happened 5 hours later than it did. Causality itself seemed broken. Users questioned reality.
These aren't exotic edge cases. This is just what happens when you build data pipelines at scale without contracts. Stuff breaks silently. And in financial data, silent failures are the most dangerous kind.
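Each of the failure modes above maps to a cheap automated check. Here's a hedged sketch — hypothetical helper functions, not any framework's real API — showing one check per incident:

```python
from datetime import datetime, timezone

def null_rate(rows, col):
    """Fraction of rows where `col` is null; catches silent schema drift."""
    return sum(1 for r in rows if r.get(col) is None) / len(rows)

def all_numeric(rows, col):
    """False if any value came through as a string; catches type mismatches."""
    return all(isinstance(r.get(col), (int, float)) for r in rows)

def has_duplicates(rows, key_cols):
    """True if any key repeats; catches double-counted transactions."""
    keys = [tuple(r[c] for c in key_cols) for r in rows]
    return len(keys) != len(set(keys))

def all_utc(rows, col):
    """False if any timestamp lacks a zero UTC offset; catches timezone drift."""
    return all(
        r[col].tzinfo is not None and r[col].utcoffset().total_seconds() == 0
        for r in rows
    )

rows = [
    {"id": 1, "price": 45.23, "ts": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": 1, "price": "45.23", "ts": datetime(2024, 5, 1, tzinfo=timezone.utc)},
]
assert has_duplicates(rows, ["id"])    # the deduplication that wasn't
assert not all_numeric(rows, "price")  # the type mismatch catastrophe
assert all_utc(rows, "ts")             # timezone check passes here
```

None of these checks is clever. The point is that they run on every load, so the failure is caught on day one instead of week six.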
How kibble.shop uses data contracts (Soda Core v4)
We use Soda Core to define and enforce data contracts. Soda Core is an open-source data quality framework that lets you write checks in a simple YAML format, and it runs those checks as part of your data pipeline.
Here's what a real contract looks like for our insider trading dataset:
```yaml
# Contract for insider_trading table
table: insider_trading

# Schema definition
columns:
  - name: filing_date
    type: date
    required: true
  - name: transaction_date
    type: date
    required: true
  - name: ticker
    type: string
    required: true
  - name: company_name
    type: string
    required: true
  - name: insider_name
    type: string
    required: true
  - name: insider_title
    type: string
    required: false
  - name: transaction_type
    type: string
    required: true
  - name: shares
    type: integer
    required: true
  - name: price_per_share
    type: decimal
    required: false
  - name: total_value
    type: decimal
    required: false

# Quality checks
checks:
  # Freshness: data should be updated daily
  - freshness(filing_date) < 1d

  # Completeness: critical fields must not be null
  - missing_count(ticker) = 0
  - missing_count(insider_name) = 0
  - missing_count(transaction_type) = 0

  # Validity: transaction types must be from known set
  - invalid_count(transaction_type) = 0:
      valid_values: ['purchase', 'sale', 'grant', 'option_exercise', 'gift']

  # Reasonableness: share counts and prices should be positive
  - invalid_count(shares) = 0:
      condition: shares > 0
  - invalid_count(price_per_share) = 0:
      condition: price_per_share > 0 OR price_per_share IS NULL

  # Volume check: expect more than 50 new filings per day
  - row_count > 50:
      where: filing_date = CURRENT_DATE

  # Referential integrity: every ticker should exist in companies table
  - values_in_set(ticker):
      source: companies.ticker
```
Every time we run the pipeline — which for most datasets is multiple times per day — these checks execute. If any check fails, we get an alert. If it's a critical check (schema violations, missing required fields), the data doesn't publish. Period.
What happens when a contract fails
Let's say the SEC changes their Form 4 XML format and our parser stops correctly extracting the `insider_name` field. Here's what happens:
- The pipeline runs as usual — data gets ingested, transformed, loaded into the staging table
- Soda checks execute — the contract validator runs all quality checks against the new data
- A check fails — let's say `missing_count(insider_name) = 0` fails because the parser broke and we're getting nulls
- Publishing is blocked — the failed data does not overwrite the production table. Users still see yesterday's clean data.
- We get alerted immediately — the failure triggers a notification. We investigate, fix the parser, re-run.
- Only clean data ships — once the checks pass, the new data publishes. Users never saw the broken version.
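The flow above can be sketched as a simple promotion gate. This is an illustrative Python sketch of the blocking behavior, not kibble.shop's actual pipeline code; the table contents and checks are hypothetical:

```python
def alert(failures):
    """Stand-in for a real notification hook (Slack, PagerDuty, etc.)."""
    print(f"contract failed, publish blocked: {failures}")

def publish(staging_rows, production_rows, checks):
    """Run all checks against staging; promote only on a clean pass."""
    failures = [name for name, check in checks if not check(staging_rows)]
    if failures:
        # Publishing blocked: users keep seeing yesterday's clean data.
        alert(failures)
        return production_rows
    return staging_rows  # all checks passed; the new data ships

checks = [
    ("no_null_names", lambda rows: all(r.get("insider_name") for r in rows)),
    ("nonempty", lambda rows: len(rows) > 0),
]

prod = [{"insider_name": "A. Example"}]
broken = [{"insider_name": None}]       # parser broke upstream
fresh = [{"insider_name": "B. Example"}]

assert publish(broken, prod, checks) == prod   # blocked, old data retained
assert publish(fresh, prod, checks) == fresh   # clean data promoted
```

The key design choice is that checks run against a staging table and the production table is only swapped on success, so a failed check never leaves users staring at half-broken data.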
This is the entire point: contract failures are loud, not silent. They stop bad data from reaching users. You find out immediately, not three months later when someone asks why the numbers look weird.
The trust dividend
Here's what data contracts buy you, especially in financial data:
1. Users can actually trust the data. When you say "this dataset is validated daily against 15 quality checks," that's not marketing. It's a verifiable claim. Users know that if data made it to production, it passed the contract.
2. You catch breaks before users do. Schema drifts, type mismatches, missing joins, stale data — all the ways pipelines fail silently get caught by automated checks. You're debugging known failures, not mysteries reported by confused users.
3. Compliance becomes tractable. Financial data often comes with regulatory requirements: audit trails, data lineage, proof of quality controls. Contracts give you an auditable log of every check that ran and whether it passed. That's exactly what compliance teams want to see.
4. Collaboration gets easier. When a new engineer joins, or when you hand off a dataset to another team, the contract is the documentation. It tells them exactly what the data should look like and what rules it must satisfy. No ambiguity.
5. You sleep better. Knowing that bad data literally can't make it to production without failing a check is... kind of amazing, actually. You stop waking up in a cold sweat wondering if something broke overnight.
The cost is real, but so is the payoff
Let's be honest: data contracts add work. You have to:
- Write the contract (schema + checks)
- Maintain it as requirements evolve
- Deal with failures (investigate root causes, fix pipelines)
- Balance strictness vs. flexibility (too strict = constant false alarms, too loose = misses real issues)
But here's the thing: you're already paying that cost. Either you're paying it up front with contracts, or you're paying it later with user complaints, manual audits, emergency fixes, and destroyed trust.
For kibble.shop, the calculus is simple: we're building a financial data platform that people and AI agents will use to make decisions. If they can't trust the data, nothing else matters. No fancy UI, no clever API design, no cool features — none of it works if the numbers are wrong.
Data contracts are how we make trust measurable and enforceable.
Practical advice if you want to adopt this
You don't have to go all-in on day one. Start small:
1. Pick your most critical table. The one that, if it broke, would cause the most pain. Write a contract for that one first.
2. Start with schema + nullability checks. Just define what columns exist and which ones can't be null. Run those checks. Fix any failures. That alone covers a surprising share of the most common pipeline failures.
3. Add freshness checks. If your data is supposed to update daily, write a check that fails if it doesn't. This catches stalled pipelines immediately.
4. Layer in validity rules. Value ranges, enum checks, referential integrity. These catch the subtle bugs — the ones where data exists but is wrong.
5. Iterate based on failures. Every time something breaks in production, ask: "Could a contract have caught this?" If yes, add that check. Your contract library grows organically from real incidents.
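Steps 3 and 4 are small enough to sketch. These are hypothetical helpers in plain Python — not SodaCL syntax — with column names borrowed from the example contract earlier in the post purely for illustration:

```python
from datetime import date, timedelta

# Allowed enum values, mirroring the post's example contract (illustrative).
VALID_TYPES = {"purchase", "sale", "grant", "option_exercise", "gift"}

def is_fresh(rows, col, max_age=timedelta(days=1), today=None):
    """Freshness check: fail if the newest value in `col` is too old."""
    today = today or date.today()
    return today - max(r[col] for r in rows) <= max_age

def invalid_count(rows, col, valid_values):
    """Validity check: count values outside the allowed set (should be 0)."""
    return sum(1 for r in rows if r[col] not in valid_values)

rows = [
    {"filing_date": date(2024, 5, 2), "transaction_type": "sale"},
    {"filing_date": date(2024, 5, 1), "transaction_type": "swap"},  # invalid
]

assert is_fresh(rows, "filing_date", today=date(2024, 5, 2))
assert invalid_count(rows, "transaction_type", VALID_TYPES) == 1
```

In practice you'd express these declaratively in your chosen framework's check language rather than hand-rolling them, but the logic is this simple either way.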
Tools like Soda, Great Expectations, and dbt tests all make this easier. Pick one, start writing checks, and run them as part of your pipeline.
Why we're open about this
You might be wondering: why would kibble.shop publicly talk about data contracts and quality checks? Isn't that just... how you're supposed to build data products?
Yeah. It is. But almost nobody does it.
Most data platforms — even ones charging serious money — treat quality as an afterthought. They publish data and hope it's correct. When it's not, they fix it quietly and hope nobody noticed. There's no contract. No validation. No accountability.
We think that's unacceptable. Especially for financial data. So we're building kibble.shop the right way from day one: every dataset has a contract, every contract is enforced, and we're transparent about how it works.
If you're using kibble.shop data in your models, dashboards, or trading systems, you should know exactly what quality guarantees you're getting. That's not a competitive secret. That's just... respect.
Come see for yourself
Every dataset on kibble.shop ships with its contract publicly visible. You can see what checks we run, what the schema is, and what quality standards we enforce. It's all there.
During early access, everything is free. Come poke around. Check out the insider trading data, the yield curves, the sentiment indicators. Look at the contracts. See what quality actually looks like.
Because financial data that you can't trust isn't data. It's just noise with delusions of grandeur.
— Annie 🐾
Want to build with kibble.shop data? Sign up for free and get access to 185+ financial datasets, all with enforced quality contracts. We're building this in public, and I'd love to hear what you think.