Your AI Agent Project Is Really a Data Project: The Data-Prep Tax

AI agent projects are really data projects. Why data preparation and upkeep, not the model, decides whether an agent ships, and how to scope the data work first.

Alex Pechenizkiy 7 min read
Ask a CRM for one customer’s open pipeline when that customer is stored under four spellings, and a retrieval-based agent answers with a confident wrong number at every setting: $120,000 if it reads a few records, $1,345,000 if it reads a hundred, never the real $275,000. The model did nothing wrong. The customer’s identity was never resolved in the data, and no amount of orchestration recovers it. That example is a runnable demo; the wrong total reproduces at every retrieval cutoff.

That is the shape of the most expensive surprise in AI agent projects, and it is a data problem wearing an agent costume. A team scopes the agent and budgets the model: tokens, a vector store, some orchestration. The agent demos well on three clean documents, then meets the real data, and six weeks later the project is behind, not because of the model but because of the data work nobody put on the plan. This is the data-prep tax. Here is what it is, why it recurs instead of ending, and how to scope it before you commit to building an agent at all.

Two kinds of data, two kinds of pain

“Prepare the data” sounds like one task. It is two, they fail differently, and conflating them is how estimates go wrong. This split is the part most agent plans get wrong, so it is worth getting precise before anything else.

Knowledge data is the unstructured material an agent reads to answer questions: policies, procedures, product docs, FAQs, the deck someone made in 2023. The pain is format and consistency. The same policy lives in a PDF, a wiki page, and a recorded all-hands. Three departments use three words for the same concept. Half the documents are stale and nobody knows which half. Before any of it is useful to retrieval, it has to be converted to a consistent format, de-duplicated, reconciled in terminology, and dated. None of that is AI work. It is editorial and information-architecture work, and it is most of the effort.
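To make the knowledge-side steps concrete, here is a minimal sketch of three of them: reconciling terminology, removing duplicates, and flagging undated or old material. Everything named here is an assumption for illustration: the `GLOSSARY` mapping, the `Doc` fields, and the 365-day cutoff are hypothetical, and the de-duplication only catches documents whose text is identical after normalization. Real corpora need near-duplicate detection and a human to decide which copy wins.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical glossary: variant term -> canonical term.
GLOSSARY = {"pto": "paid time off", "vacation leave": "paid time off"}


@dataclass
class Doc:
    source: str          # where the copy lives (wiki, PDF, deck, ...)
    text: str
    last_reviewed: date  # assumes someone recorded a review date


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and map variant terms to canonical ones."""
    out = re.sub(r"\s+", " ", text.lower()).strip()
    for variant, canonical in GLOSSARY.items():
        out = re.sub(rf"\b{re.escape(variant)}\b", canonical, out)
    return out


def prepare(docs, today, max_age_days=365):
    """Keep the newest copy of each distinct normalized text, then split by age.

    Returns (kept, stale). The 365-day default is an illustrative policy only.
    """
    newest = {}
    for doc in docs:
        key = hashlib.sha256(normalize(doc.text).encode()).hexdigest()
        if key not in newest or doc.last_reviewed > newest[key].last_reviewed:
            newest[key] = doc
    kept, stale = [], []
    for doc in newest.values():
        if (today - doc.last_reviewed).days > max_age_days:
            stale.append(doc)
        else:
            kept.append(doc)
    return kept, stale
```

The code is the easy part. Deciding what belongs in the glossary and which stale documents to retire is the editorial work the paragraph above describes.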

Operational data is the structured material an agent acts on: customer records, employee records, approvals, entitlements, who-can-see-what. The pain is not format. It is identity and authority. Which system is the source of truth when two disagree? Which of the four spellings of “Acme” is the real customer, and which is a different company entirely? Who is allowed to see this record, and does the agent honor that? These are data-modeling and governance questions, and an agent that guesses them returns the believable wrong number from the opening of this article.

| Knowledge data | Operational data |
| --- | --- |
| Policies, docs, FAQs, transcripts | Records, approvals, entitlements, source-of-truth |
| Fails on format and terminology drift | Fails on identity resolution and access authority |
| Fixed with editorial and information-architecture work | Fixed with data modeling, a resolution key, and governance |
| Bad version: a plausible answer from a stale document | Bad version: a confident total over the wrong or unauthorized records |

The operational side produces the most expensive surprises because the failure is invisible. The pipeline answers, the number looks reasonable, and it is wrong in a way no one catches until a forecast moves. The fix is not a better model. It is a resolved identity the data carries before retrieval runs, which is exactly what the relational-RAG demo isolates: change the model and the wrong total persists; resolve the identity and it disappears.
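A toy version of that failure, as a sketch only. The rows, amounts, and the `account_key` field below are invented for illustration and are not the data or code from the demo. The name-based function stands in for retrieval that matches on text and stops at a cutoff; the key-based function assumes identity was resolved upstream and stored on each row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opportunity:
    account_name: str  # free-text name as typed into the CRM
    account_key: str   # hypothetical resolved identity, assigned before retrieval
    amount: int
    is_open: bool


# Illustrative rows only.
ROWS = [
    Opportunity("Acme Corp", "ACC-001", 50_000, True),
    Opportunity("ACME Corporation", "ACC-001", 30_000, True),
    Opportunity("Acme Co.", "ACC-001", 20_000, True),
    Opportunity("Akme Corp", "ACC-001", 10_000, True),           # typo, same customer
    Opportunity("Acme Plumbing LLC", "ACC-002", 500_000, True),  # different company
    Opportunity("Acme Corp", "ACC-001", 90_000, False),          # closed, not pipeline
]


def pipeline_by_name(rows, needle, top_k):
    """Sum the first top_k open rows whose name contains the search text."""
    hits = [r for r in rows if r.is_open and needle.lower() in r.account_name.lower()]
    return sum(r.amount for r in hits[:top_k])


def pipeline_by_key(rows, key):
    """Sum every open row that carries the resolved account key."""
    return sum(r.amount for r in rows if r.is_open and r.account_key == key)


print(pipeline_by_name(ROWS, "acme", top_k=2))   # 80000: too few rows
print(pipeline_by_name(ROWS, "acme", top_k=10))  # 600000: wrong company included, typo row missed
print(pipeline_by_key(ROWS, "ACC-001"))          # 110000: the resolved total
```

No cutoff makes the name match correct, because the error is in which rows count as the customer, not in how many are read.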

The tax recurs; it is not a one-time setup

The most common estimating error is treating data preparation as a phase that ends. It does not end, because the business does not hold still. New products ship, policies change, the org reorganizes, a migration renames half the accounts. Every one of those events degrades the prepared data unless something keeps it current.

So the cost has two parts. There is the upfront cost of getting the data ready, underestimated because the demo ran on the clean subset. And there is the ongoing cost of keeping it ready, which is usually not budgeted at all. Upkeep is a standing, staffed responsibility whose size tracks how fast the underlying data churns, and the failure mode is not getting that estimate wrong. It is never naming an owner, so the data drifts and the agent degrades with nobody accountable for noticing. A demo is correct on the day it is built. A system stays correct, and staying correct is an operating cost, not a capital one.
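One way to make the ongoing half visible is a recurring check that reports two things per dataset: whether anyone owns it, and whether it has aged past its freshness policy. This is a sketch under assumptions. The `Dataset` fields and the idea of a per-dataset `max_age` are hypothetical, and in practice the refresh timestamp would come from whatever pipeline maintains the data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Dataset:
    name: str
    owner: Optional[str]      # who is accountable for keeping it current
    last_refreshed: datetime
    max_age: timedelta        # hypothetical freshness policy for this dataset


def upkeep_findings(datasets, now):
    """Return (dataset name, problem) pairs for unowned or stale datasets."""
    findings = []
    for ds in datasets:
        if ds.owner is None:
            findings.append((ds.name, "no owner"))
        if now - ds.last_refreshed > ds.max_age:
            findings.append((ds.name, "stale"))
    return findings
```

The "no owner" finding matters more than the "stale" one: a stale dataset with an owner gets fixed, and one without an owner drifts quietly.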

How do you know if your AI agent is really a data project?

Ask whether a competent person could answer the agent’s intended questions correctly, using only your current data, by hand. If the answer is no, the agent will not do better. It does not repair missing source-of-truth, unresolved duplicates, or undocumented access rules. It inherits them and presents them with more confidence than a spreadsheet would.

When you hit a “no,” you have found a data project wearing an agent costume. Fix the data first. Sometimes, once the data is clean and queryable, the thing you needed was a report or a query, and the agent was never the point. That is not a failure. That is the cheapest possible outcome, found before you paid for the expensive one. This is the same discipline as knowing when not to build an agent at all: the official frameworks will help you design one, but they will not tell you the problem in front of you is a data-quality problem. That call is yours, and it is one of the most valuable an architect makes. I have written separately about where the official agent guidance stops and your judgment starts.

How to scope it, in order

A short sequence that puts the data work where it belongs, before the build commitment:

  1. Inventory both data types separately. List the knowledge sources and the operational sources, and estimate them apart, because they fail differently and the operational side is the one that gets underestimated.
  2. Resolve identity as a materialized key. Decide how duplicate and variant entities become one queryable identity before retrieval, not inside a prompt. In Dataverse terms that is an alternate key or a resolved-account table maintained by a dataflow, not a LIKE match on the account name that the model guesses at runtime. This is the single highest-leverage data decision.
  3. Name the source-of-truth and the freshness policy. For every fact the agent will state, decide which system owns it and how current it has to be. Conflicts unresolved in the data become conflicts the agent invents an answer to.
  4. Model access before you model answers. The agent must honor the same row-level and role-level permissions the source systems do. Honoring them also means preserving data residency and producing an audit trail of which records answered each query. Retrofitting this is painful, and the failure mode is a compliance incident.
  5. Then, and only then, decide agent versus flow versus query. With the data scoped, the right tool is often obvious, and it is sometimes not an agent.
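The access step in the sequence above can be sketched as a filter that runs before any record reaches the model, plus a log of which records fed each answer. This is a minimal illustration, not a security design. The `owner_team` field, the `AuditLog` class, and the team-membership check are all hypothetical, and it does not address data residency. In a real system, delegate the permission check to the source system rather than reimplementing it next to the agent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    record_id: str
    owner_team: str  # hypothetical stand-in for a row-level permission attribute
    amount: int


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, user, question, record_ids):
        """Store which records were used to answer which question, for whom."""
        self.entries.append(
            {"user": user, "question": question, "record_ids": list(record_ids)}
        )


def visible_records(records, user_teams):
    """Row-level filter: keep only records owned by one of the user's teams."""
    return [r for r in records if r.owner_team in user_teams]


def answer_total(records, user, user_teams, question, audit):
    """Filter first, log the records used, then compute the answer from them."""
    allowed = visible_records(records, user_teams)
    audit.record(user, question, [r.record_id for r in allowed])
    return sum(r.amount for r in allowed)
```

The ordering is the point: the filter runs before the answer is computed, so a record the user cannot see never contributes to the total and never appears in the log as a source.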

What clean data actually compounds into

The reframe that makes the tax worth paying: a resolved customer identity, a defined source-of-truth, and a current knowledge base are not a cost you absorb for one agent. The same resolved Acme identity that gives the agent the right pipeline number also gives your analytics the right revenue rollup and your integrations the right account to write back to. The data work serves every downstream consumer at once.

That is why the spend is better understood as building a foundation than feeding a feature. A team that treats data preparation as overhead to minimize ships one fragile agent. A team that treats it as the product builds the resolved, governed layer that makes the next agent, the next dashboard, and the next integration cheap. Model choice is the more reversible decision: swapping one carries prompt re-tuning, evaluation re-baselining, and output drift, but it is bounded work. The resolved, governed data is the part that compounds, and it is the part a competitor cannot copy by switching vendors.

Put it on the plan first

The data-prep tax keeps surprising people because it is invisible in the part of the project everyone sees. The demo runs on the clean subset, the slide says “AI agent,” and the work that decides the outcome happens off-screen and gets budgeted last. Scope it first: the resolved identity key, the source-of-truth decisions, the access model, and the owner who keeps it current. That foundation is what determines whether the agent is right, and unlike the model, it is the part you cannot unwind later.
