Your AI Agent Project Is Really a Data Project: The Data-Prep Tax
AI agent projects are really data projects. Why data preparation and upkeep, not the model, decides whether an agent ships, and how to scope the data work first.
Ask a CRM for one customer’s open pipeline when that customer is stored under four spellings, and a retrieval-based agent answers with a confident wrong number at every setting: $120,000 if it reads a few records, $1,345,000 if it reads a hundred, never the real $275,000. The model did nothing wrong. The customer’s identity was never resolved in the data, and no amount of orchestration recovers it. That example is a runnable demo; the wrong total reproduces at every retrieval cutoff.
That is the shape of the most expensive surprise in AI agent projects, and it is a data problem wearing an agent costume. A team scopes the agent and budgets the model: tokens, a vector store, some orchestration. The agent demos well on three clean documents, then meets the real data, and six weeks later the project is behind, not because of the model but because of the data nobody put on the plan. This is the data-prep tax. Here is what it is, why it recurs instead of ending, and how to scope it before you commit to building an agent at all.
Two kinds of data, two kinds of pain
“Prepare the data” sounds like one task. It is two, they fail differently, and conflating them is how estimates go wrong. This split is the part most agent plans get wrong, so it is worth getting precise before anything else.
Knowledge data is the unstructured material an agent reads to answer questions: policies, procedures, product docs, FAQs, the deck someone made in 2023. The pain is format and consistency. The same policy lives in a PDF, a wiki page, and a recorded all-hands. Three departments use three words for the same concept. Half the documents are stale and nobody knows which half. Before any of it is useful to retrieval, it has to be converted to a consistent format, de-duplicated, reconciled in terminology, and dated. None of that is AI work. It is editorial and information-architecture work, and it is most of the effort.
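As a rough illustration of that editorial work, here is a minimal Python sketch that reconciles terminology, drops duplicate copies, and separates out documents past a review window. Everything in it is an assumption made for illustration: the `GLOSSARY` entries, the `Doc` fields, the `prepare` helper, and the 365-day default. It only catches copies that are identical after normalization; near-duplicate detection and format conversion are out of scope.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical glossary: department-specific variants mapped to one canonical term.
GLOSSARY = {
    "pto": "paid time off",
    "vacation leave": "paid time off",
    "annual leave": "paid time off",
}

@dataclass
class Doc:
    source: str          # where the copy lives (wiki, PDF, deck, ...)
    text: str
    last_reviewed: date  # the "dated" part: when someone last confirmed it

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and swap variant terms for the canonical one."""
    out = re.sub(r"\s+", " ", text.lower()).strip()
    # Replace longer variants first so multi-word terms are matched whole.
    for variant in sorted(GLOSSARY, key=len, reverse=True):
        out = re.sub(rf"\b{re.escape(variant)}\b", GLOSSARY[variant], out)
    return out

def prepare(docs, today, max_age_days=365):
    """Return (kept, stale): one copy per distinct normalized text, split by review age."""
    newest = {}
    for doc in docs:
        key = hashlib.sha256(normalize(doc.text).encode("utf-8")).hexdigest()
        # Among duplicates, keep the most recently reviewed copy.
        if key not in newest or doc.last_reviewed > newest[key].last_reviewed:
            newest[key] = doc
    kept, stale = [], []
    for doc in newest.values():
        age = (today - doc.last_reviewed).days
        (stale if age > max_age_days else kept).append(doc)
    return kept, stale
```

The code is the easy part. The glossary itself, and the decision about which copy is authoritative, are the editorial decisions that take the time.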
Operational data is the structured material an agent acts on: customer records, employee records, approvals, entitlements, who-can-see-what. The pain is not format. It is identity and authority. Which system is the source of truth when two disagree? Which of the four spellings of “Acme” is the real customer, and which is a different company entirely? Who is allowed to see this record, and does the agent honor that? These are data-modeling and governance questions, and an agent that guesses them returns the believable wrong number from the opening of this article.
| Knowledge data | Operational data |
|---|---|
| Policies, docs, FAQs, transcripts | Records, approvals, entitlements, source-of-truth |
| Fails on format and terminology drift | Fails on identity resolution and access authority |
| Fixed with editorial and information-architecture work | Fixed with data modeling, a resolution key, and governance |
| Bad version: a plausible answer from a stale document | Bad version: a confident total over the wrong or unauthorized records |
The operational side produces the most expensive surprises because the failure is invisible. The pipeline answers, the number looks reasonable, and it is wrong in a way no one catches until a forecast moves. The fix is not a better model. It is a resolved identity the data carries before retrieval runs, which is exactly what the relational-RAG demo isolates: change the model and the wrong total persists; resolve the identity and it disappears.
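The effect can be sketched in a few lines of Python. This is not the demo referenced above, and the figures are not its figures: the rows, the account keys `A1` and `B7`, and the trigram `similarity` function (a toy stand-in for embedding similarity) are all hypothetical. The point is only the shape of the failure: summing the top-k look-alike records gives a different wrong total at each cutoff, while summing by a resolved key gives one answer.

```python
# Hypothetical CRM rows: (name as typed, resolved account key, open pipeline in USD).
# Three spellings belong to one customer (A1); "Acme Foods" is a different company.
OPPORTUNITIES = [
    ("Acme Corp",        "A1", 50_000),
    ("ACME Corporation", "A1", 30_000),
    ("Acme Foods",       "B7", 70_000),
    ("A.C.M.E. Inc",     "A1", 20_000),
    ("Acme Foods Ltd",   "B7", 40_000),
]

def similarity(query: str, name: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of character trigrams."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}
    a, b = grams(query), grams(name)
    return len(a & b) / len(a | b)

def pipeline_by_retrieval(query: str, k: int) -> int:
    """Sum the k most similar-looking records, as a top-k retrieval step would."""
    ranked = sorted(OPPORTUNITIES, key=lambda r: similarity(query, r[0]), reverse=True)
    return sum(amount for _, _, amount in ranked[:k])

def pipeline_by_key(account_key: str) -> int:
    """Sum every record carrying the resolved key; no similarity guess involved."""
    return sum(amount for _, key, amount in OPPORTUNITIES if key == account_key)

if __name__ == "__main__":
    for k in range(1, len(OPPORTUNITIES) + 1):
        print(k, pipeline_by_retrieval("Acme Corp", k))
    print("resolved:", pipeline_by_key("A1"))
```

With these rows the retrieval totals are 50,000, 80,000, 150,000, 190,000, and 210,000 for k = 1 through 5, and the resolved total is 100,000. No cutoff lands on the right number, because the dotted spelling ranks below the unrelated company. Swapping in a better similarity function changes the ranking but not the underlying problem: the cutoff is still a guess about identity.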
The tax recurs; it is not a one-time setup
The most common estimating error is treating data preparation as a phase that ends. It does not end, because the business does not hold still. New products ship, policies change, the org reorganizes, a migration renames half the accounts. Every one of those events degrades the prepared data unless something keeps it current.
So the cost has two parts. There is the upfront cost of getting the data ready, underestimated because the demo ran on the clean subset. And there is the ongoing cost of keeping it ready, which is usually not budgeted at all. Upkeep is a standing, staffed responsibility whose size tracks how fast the underlying data churns, and the failure mode is not getting that estimate wrong. It is never naming an owner, so the data drifts and the agent degrades with nobody accountable for noticing. A demo is correct on the day it is built. A system stays correct, and staying correct is an operating cost, not a capital one.
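One way to make the upkeep visible is a recurring check that lists every source with no named owner or past its freshness window. The sketch below is a hypothetical illustration, not a prescribed tool: the `Source` fields, the `upkeep_report` helper, and the idea of a per-source `max_age_days` are assumptions, and the window for each source would come from how fast that source actually churns.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Source:
    name: str
    owner: Optional[str]   # accountable person or team; None means nobody was named
    last_verified: date    # last time someone confirmed the content is current
    max_age_days: int      # freshness window chosen for this source

def upkeep_report(sources, today):
    """Return {source name: [problems]} for every source that needs attention."""
    report = {}
    for src in sources:
        problems = []
        if not src.owner:
            problems.append("no owner")
        age = (today - src.last_verified).days
        if age > src.max_age_days:
            problems.append(f"stale by {age - src.max_age_days} days")
        if problems:
            report[src.name] = problems
    return report
```

A report like this does not do the upkeep. It only makes the unowned and drifting sources visible to whoever is accountable for them.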
How do you know if your AI agent is really a data project?
Ask whether a competent person could answer the agent’s intended questions correctly, using only your current data, by hand. If the answer is no, the agent will not do better. It does not repair missing source-of-truth, unresolved duplicates, or undocumented access rules. It inherits them and presents them with more confidence than a spreadsheet would.
When you hit a “no,” you have found a data project wearing an agent costume. Fix the data first. Sometimes, once the data is clean and queryable, the thing you needed was a report or a query, and the agent was never the point. That is not a failure. That is the cheapest possible outcome, found before you paid for the expensive one. This is the same discipline as knowing when not to build an agent at all: the official frameworks will help you design one, but they will not tell you the problem in front of you is a data-quality problem. That call is yours, and it is one of the most valuable an architect makes. I have written separately about where the official agent guidance stops and your judgment starts.
How to scope it, in order
A short sequence that puts the data work where it belongs, before the build commitment:
- Inventory both data types separately. List the knowledge sources and the operational sources, and estimate them apart, because they fail differently and the operational side is the one that gets underestimated.
- Resolve identity as a materialized key. Decide how duplicate and variant entities become one queryable identity before retrieval, not inside a prompt. In Dataverse terms that is an alternate key or a resolved-account table maintained by a dataflow, not a `name LIKE` match the model guesses at runtime. This is the single highest-leverage data decision.
- Name the source-of-truth and the freshness policy. For every fact the agent will state, decide which system owns it and how current it has to be. Conflicts unresolved in the data become conflicts the agent invents an answer to.
- Model access before you model answers. The agent must honor the same row-level and role-level permissions the source systems do. Honoring them also means preserving data residency and producing an audit trail of which records answered each query. Retrofitting this is painful, and the failure mode is a compliance incident.
- Then, and only then, decide agent versus flow versus query. With the data scoped, the right tool is often obvious, and it is sometimes not an agent.
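The access step in that sequence can be sketched as a filter that runs before any record reaches the model, with a log of which records answered each query. This is a minimal Python illustration under stated assumptions: `Record`, `User`, `AuditedStore`, and the region-based rule are hypothetical names standing in for whatever row-level and role-level security the source systems already enforce. In practice the enforcement belongs in the source platform's own security model, not in application code like this.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Record:
    record_id: str
    account_key: str  # the resolved identity key, not a typed name
    region: str       # stand-in for a row-level rule such as territory or residency
    amount: int

@dataclass
class User:
    name: str
    regions: frozenset  # regions this user is entitled to see

@dataclass
class AuditedStore:
    records: list
    audit_log: list = field(default_factory=list)

    def query(self, user: User, account_key: str):
        """Return only the rows this user may see, and log which rows answered."""
        visible = [
            r for r in self.records
            if r.account_key == account_key and r.region in user.regions
        ]
        self.audit_log.append({
            "user": user.name,
            "account_key": account_key,
            "record_ids": [r.record_id for r in visible],
        })
        return visible

def open_pipeline(store: AuditedStore, user: User, account_key: str) -> int:
    """Total computed strictly from rows that passed the permission filter."""
    return sum(r.amount for r in store.query(user, account_key))
```

The design choice is that the agent never holds a record the caller could not have opened directly, so the total it reports can always be traced back to a logged set of record IDs.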
What clean data actually compounds into
The reframe that makes the tax worth paying: a resolved customer identity, a defined source-of-truth, and a current knowledge base are not a cost you absorb for one agent. The same resolved Acme identity that gives the agent the right pipeline number also gives your analytics the right revenue rollup and your integrations the right account to write back to. The data work serves every downstream consumer at once.
That is why the spend is better understood as building a foundation than feeding a feature. A team that treats data preparation as overhead to minimize ships one fragile agent. A team that treats it as the product builds the resolved, governed layer that makes the next agent, the next dashboard, and the next integration cheap. Model choice is the more reversible decision: swapping models carries prompt re-tuning, evaluation re-baselining, and output drift, but it is bounded work. The resolved, governed data is the part that compounds, and it is the part a competitor cannot copy by switching vendors.
Put it on the plan first
The data-prep tax keeps surprising people because it is invisible in the part of the project everyone sees. The demo runs on the clean subset, the slide says “AI agent,” and the work that decides the outcome happens off-screen and gets budgeted last. Scope it first: the resolved identity key, the source-of-truth decisions, the access model, and the owner who keeps it current. That foundation is what determines whether the agent is right, and unlike the model, it is the part you cannot unwind later.
Read next
- Relational records need a key, not an embedding - the runnable demonstration that retrieval returns a confident wrong number when identity is not resolved in the data.
- Dataverse MCP, Business Skills, and Coding Agents: The 2026 Decode - the platform-side counterpart: source-of-truth and identity resolution as a Dataverse data-platform concern.
- Microsoft’s Agents Hub Decoded - what the official agent guidance formalizes, and the cost and restraint calls it leaves to you.