AgentOps on Microsoft Foundry: A Practitioner Decode of the New CI/CD Reference Architecture (2026)
Practitioner read of Microsoft's new Foundry CI/CD reference architecture: the 5-layer pipeline, evaluation-driven release gates, and where the architecture understates the operational work.
Microsoft just published an end-to-end CI/CD reference architecture for Foundry agents. It is the first official Microsoft document that treats agent deployment as a first-class engineering discipline, not as a side effect of agent authoring. This is the practitioner read on what it gets right, where it understates the operational work, and what to do with it on Monday.
The CI/CD for AI Agents on Microsoft Foundry reference came out 2026-05-22 from Lee Stott on the Microsoft Tech Community Educator Developer Blog. Two reference repos accompany it: leestott/foundry-cicd for the pipeline implementation, and ericchansen/foundry-agents-lifecycle (Eric Chansen) for the agent demo. The framework’s one-line thesis is the reframing on which the rest of the document hangs:
“Agents follow a standard CI/CD pattern, but with a critical shift: promotion happens at the agent version level, and release gates are driven by evaluation outcomes, not just test results.”
Treat the document as v1 of a still-refining artefact. It is one week old at writing, sits on a tech-community blog rather than Microsoft Learn, and has not yet been pressure-tested by a real enterprise deployment cycle. The architecture itself is sound. The understated parts are where the practitioner work begins.
The Assist-to-Execute Shift Demands AgentOps
The companion piece on the Microsoft Agentic Transformation Patterns Playbook opens with the shift from Assist mode (agent supports a human decision; human accountable) to Execute mode (agent performs work across systems; human oversees outcome). Execute mode demands four operating-model commitments: ownership, risk, lifecycle, and governance.
The CI/CD reference architecture is the lifecycle commitment made concrete. It is not optional. The moment any agent crosses from Assist to Execute, the operational surface area grows enough that ad-hoc deployment becomes a liability.
What the reference architecture adds beyond classic software CI/CD is the recognition that probabilistic systems need probabilistic release gates. A unit test that passes today on input X will still pass tomorrow on input X. An agent that answers correctly today on prompt X may answer differently tomorrow because the model weights drifted, the upstream retrieval changed, a tool dependency updated, or the eval dataset became stale. The pipeline has to assume non-determinism and gate accordingly.
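One way to read "probabilistic release gate" is to run each evaluation case several times and gate on the pass rate rather than on a single run. The sketch below is a minimal illustration of that idea, not code from the reference repos: `run_case`, `pass_rate`, `release_gate`, and the 0.9 default are all assumptions chosen for the example.

```python
from typing import Callable, Iterable


def pass_rate(run_case: Callable[[str], bool], prompt: str, trials: int = 5) -> float:
    """Run the same prompt several times and return the fraction of passing runs.

    `run_case` is a hypothetical callable that invokes the agent once and
    returns True when the answer meets the case's acceptance criteria.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if run_case(prompt))
    return passes / trials


def release_gate(rates: Iterable[float], min_rate: float = 0.9) -> bool:
    """Pass only if every case met the minimum pass rate (an empty run fails)."""
    rates = list(rates)
    return bool(rates) and all(rate >= min_rate for rate in rates)
```

Repeating each case multiplies the token cost of an eval run, so the trial count is itself a budget decision.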
The reframing the reference puts at the top is, in my view, the single most useful sentence in the document. The deployment unit is no longer a container or a tag. It is an agent version: an immutable artefact bundling model selection, system instructions, tool definitions, and configuration. Promotion happens at that version level. Rollback is a pointer-swap, not a re-deployment.
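To make the "immutable version plus pointer-swap rollback" idea concrete, here is a toy in-memory model. It is a sketch under stated assumptions: `AgentVersion` and `AgentRegistry` are hypothetical names for illustration and are not the Foundry Agent Service API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentVersion:
    """Immutable bundle of everything that defines one agent version."""
    version: str
    model: str
    instructions: str
    tools: tuple[str, ...] = ()


class AgentRegistry:
    """Hypothetical registry: versions never change, only the active pointer moves."""

    def __init__(self) -> None:
        self._versions: dict[str, AgentVersion] = {}
        self._history: list[str] = []  # active-version history, newest last

    def register(self, agent: AgentVersion) -> None:
        if agent.version in self._versions:
            raise ValueError(f"{agent.version} already exists; publish a new version instead")
        self._versions[agent.version] = agent

    def promote(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)  # pointer moves; nothing is rebuilt

    def rollback(self) -> AgentVersion:
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()  # pointer-swap back to the prior version
        return self.active

    @property
    def active(self) -> AgentVersion:
        if not self._history:
            raise RuntimeError("no version has been promoted yet")
        return self._versions[self._history[-1]]
```

Because the dataclass is frozen, any change to model, instructions, or tools forces a new version string, which is what keeps the rollback target trustworthy.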
The 5-Layer Architecture at a Glance
The reference flow is five layers from developer commit to production observability.
| Layer | Responsibility | Key components |
|---|---|---|
| 1. Developer | Source-controlled repo for agent code, configuration, and infrastructure | Python or .NET agent code, agent.yaml or prompt definitions, MCP/REST tool configurations, Bicep or ARM IaC for Foundry project provisioning |
| 2. CI Pipeline | Build, validate, evaluate every push or PR before producing a versioned agent artefact | Docker build (hosted) or YAML schema validation (prompt), ruff + bandit static checks, pytest unit + tool tests, evaluation gate against golden dataset, ACR push |
| 3. CD Pipeline | Promote one agent version through three Foundry project environments with eval and human-approval gates | Stage 1 Dev (smoke + eval), Stage 2 Test (scenario + HITL + safety), Stage 3 Prod (versioned promote + endpoint enable) |
| 4. Foundry Agent Service | Runtime for the deployed agent versions, with built-in lifecycle and identity | Hosted + prompt-based runtimes, versioned deployments, Entra Agent Identity per version, RBAC per project, distributed traces + structured logs |
| 5. Monitoring + Governance + Control Plane | Platform-level governance, observability, and continuous evaluation across the deployed fleet | Agent registry + version history, OpenTelemetry forwarded to Azure Monitor + Application Insights, continuous evaluation on sampled production traffic, Azure Policy + RBAC enforcement |
Honest observation, and this is my read rather than a number from the source: Layers 2 and 3 are where most of the operational discipline lives, call it roughly four-fifths of the new engineering work. The developer layer is what most teams already have. Foundry Agent Service is what Microsoft provides. The control plane is what the platform team configures once. The CI and CD pipelines are where I expect most teams to invest most heavily over roughly the next 6 to 12 months.
Evaluation Gates: the Real Differentiator
Classic software pipelines gate on tests. Agent pipelines add a category Microsoft calls evaluation-driven quality gates. These run at two points: pre-merge during CI, and pre-promotion during CD. The published threshold table is the most directly useful artefact in the document.
| Category | Metric | CI threshold | Prod threshold |
|---|---|---|---|
| Quality | Hallucination rate | < 5% | < 3% |
| Quality | Task completion rate | > 90% | > 95% |
| Safety | Grounded response rate | > 95% | > 98% |
| Safety | Policy violations | 0 | 0 |
| Performance | p95 latency | < 4000 ms | < 3000 ms |
| Cost | Token usage per query | Track only | Alert on > 20% regression |
The threshold values themselves are reasonable starters but they are not universal. A claims-processing agent can tolerate far less hallucination than a meeting-summary agent. A code-review agent has a different latency tolerance from a customer-service agent. Microsoft frames these as “thresholds” without saying “these are illustrative starters.” Treat them as illustrative starters. Calibrate to your specific use case, your specific user expectations, and your specific liability exposure.
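The CI column of the table above can be expressed as data plus one comparison function, which makes per-agent calibration a configuration change rather than a code change. This is a minimal sketch: the metric key names and the `evaluate_gate` helper are assumptions for illustration, not the script shipped in the reference repos.

```python
import operator
from typing import Callable, Mapping, NamedTuple


class Threshold(NamedTuple):
    compare: Callable[[float, float], bool]
    limit: float


# CI column of the published table, with rates as fractions.
# Key names are hypothetical; align them with your eval tool's output.
CI_THRESHOLDS = {
    "hallucination_rate": Threshold(operator.lt, 0.05),
    "task_completion_rate": Threshold(operator.gt, 0.90),
    "grounded_response_rate": Threshold(operator.gt, 0.95),
    "policy_violations": Threshold(operator.eq, 0),
    "p95_latency_ms": Threshold(operator.lt, 4000),
}


def evaluate_gate(metrics: Mapping[str, float],
                  thresholds: Mapping[str, Threshold]) -> list[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, threshold in thresholds.items():
        if name not in metrics:
            failures.append(f"{name}: metric missing from eval output")
        elif not threshold.compare(metrics[name], threshold.limit):
            failures.append(f"{name}: {metrics[name]} does not satisfy limit {threshold.limit}")
    return failures
```

A pipeline step would typically exit non-zero when the returned list is non-empty. A missing metric is treated as a failure so that a broken eval run cannot pass silently.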
The hidden cost the document acknowledges briefly but does not quantify is dataset maintenance. The article does say “stale datasets produce misleading pass/fail signals” and “treat your golden evaluation set as a first-class engineering artefact alongside the agent code itself.” That is correct but understates the work. A real golden dataset needs:
- A named owner with authority over what goes in and what comes out
- A change-review process that mirrors code review
- Quarterly coverage audits against real production traffic patterns
- A drift detection process that flags when production cases diverge from dataset cases
- Versioning that ties each eval run to a specific dataset version
Most teams do none of this. The result is a dataset that was representative on day one and unrepresentative by month six, producing eval scores that no longer reflect real-world performance. The eval gate becomes theatre.
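The versioning requirement in the list above is the cheapest one to automate. One possible approach, sketched here with hypothetical names, is to fingerprint the dataset content and stamp that fingerprint on every eval result. The sketch assumes each case is a JSON-serialisable dict with a unique `id` field.

```python
import hashlib
import json


def dataset_fingerprint(cases: list[dict]) -> str:
    """Content hash of the golden dataset, independent of case ordering.

    Assumes every case is JSON-serialisable and has a unique "id" key.
    """
    canonical = json.dumps(
        sorted(cases, key=lambda case: case["id"]),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def eval_record(agent_version: str, cases: list[dict], metrics: dict) -> dict:
    """Tie one eval run to the exact dataset content it was scored against."""
    return {
        "agent_version": agent_version,
        "dataset_version": dataset_fingerprint(cases),
        "dataset_size": len(cases),
        "metrics": metrics,
    }
```

With the fingerprint stored next to each score, a jump in task completion can be traced to either an agent change or a dataset change, never an ambiguous mix of both.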
A second silent cost is token spend. Every PR triggers an eval run. Every eval run costs tokens at the rate of your model provider. At fleet scale (say, on the order of 100+ PRs per week across 10+ agents) the eval budget becomes a non-trivial line item. The reference architecture says nothing about cost-control patterns: sampled eval runs on draft PRs, full eval runs only on PRs marked ready-for-review, or per-team token budgets enforced at the pipeline level. These are decisions every team will face once the pipeline is running at scale.
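A back-of-envelope estimate makes the line item visible before the invoice does. The sketch below multiplies out weekly eval tokens and shows one of the cost-control patterns named above, a sampled run for draft PRs. Every number in the usage note is an illustrative assumption, not a figure from the source or from any provider's price list.

```python
import random


def weekly_eval_cost(prs_per_week: int, runs_per_pr: int, cases_per_run: int,
                     tokens_per_case: int, usd_per_million_tokens: float) -> float:
    """Estimated weekly eval spend in USD for one pipeline."""
    tokens = prs_per_week * runs_per_pr * cases_per_run * tokens_per_case
    return tokens / 1_000_000 * usd_per_million_tokens


def select_cases(cases: list, is_draft: bool,
                 sample_fraction: float = 0.2, seed: int = 0) -> list:
    """Full dataset for ready-for-review PRs, a fixed-seed sample for drafts."""
    if not is_draft or not cases:
        return list(cases)
    k = max(1, round(len(cases) * sample_fraction))
    return random.Random(seed).sample(list(cases), k)
```

For example, 100 PRs per week, 3 eval runs per PR, 200 cases per run, and 2,000 tokens per case come to 120 million tokens a week. At a hypothetical 5 USD per million tokens that is 600 USD per week, before any repeated trials per case.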
Hosted vs Prompt-Based Agents: Pipeline Differences in Practice
The reference architecture handles two deployment models. The split is real and the operational implications differ.
| Capability | Hosted agents | Prompt-based agents |
|---|---|---|
| Deployment unit | Container image plus agent definition | YAML / prompt configuration bundle |
| Build step required | Yes: Docker build plus ACR push | No: YAML validation only |
| Supported frameworks | Agent Framework, LangGraph, Semantic Kernel, custom code | Foundry declarative runtime |
| Promotion artefact | Versioned agent with container image reference | Versioned prompt / config bundle |
| CI focus | Code quality, tool tests, evaluation | Prompt schema validation, evaluation |
| Rollback mechanism | Switch active agent version (pointer) | Switch active agent version (pointer) |
| Runtime management | Foundry manages container lifecycle | Foundry manages declarative runtime |
Hosted agents inherit all the Docker supply-chain problems (base-image pinning, dependency version drift, SDK churn) on top of the AI-specific evaluation problems. A team running hosted agents needs both standard container-security discipline (CVE scanning, SBOM tracking, base-image refresh cadence) and AI-specific evaluation discipline. The reference architecture covers the AI side cleanly; the container-supply-chain side is treated as standard practice and left to the team.
Prompt-based agents look simpler because there is no Docker build. The complexity moves to the prompt-version lifecycle. A small change to a prompt can shift behaviour across thousands of test cases. The golden dataset needs coverage of prompt-level failure modes (jailbreaks, prompt injection, unusual phrasings) that has no counterpart in the hosted-agent code path. Prompt-based deployment is operationally lighter on day one and operationally similar on day ninety once the prompt-version testing is mature.
Multi-Environment Topology
Microsoft recommends a three-environment topology with separate Foundry projects per environment.
| Option | Structure | Best for | Trade-off |
|---|---|---|---|
| A (Recommended) | Dev Project → Test Project → Prod Project (separate Foundry projects) | Enterprise workloads | Full RBAC isolation, distinct connection strings, separate evaluation signals, easier governance audit |
| B (Lightweight) | Single Foundry project with agent version tags (dev/test/prod) | Small teams, prototyping, internal demos | Simpler initial setup, weaker environment separation, RBAC has to be enforced at agent-version granularity |
Option A is the right call for any team running agents in production. The cost is operational: three Azure subscriptions or three resource groups (the boundary depends on your landing zone), three sets of connection strings, three sets of policy assignments. Teams without a clean dev/test/prod Azure landing zone will spend the first month of adoption fixing landing-zone problems before the CI/CD pipeline can be set up cleanly. Plan for this.
OIDC Workload Identity Federation between the pipeline and Azure is the right authentication pattern. It avoids storing long-lived Azure credentials in pipeline secrets, rotates the trust automatically, and produces audit logs for every deployment action. Adopt OIDC from day one even on the Dev environment; the cost of retrofitting it later is high.
Where the Reference Architecture Understates the Work
The architecture itself is correct. The framework is sound. There are six places where the document understates the engineering work the patterns demand.
Multi-model evaluation is unaddressed. Many real enterprise agents route across Foundry, Claude, and OpenAI by workload type. Eval suites that work for one provider do not automatically work for another (prompt formatting differs, tool-call semantics differ, refusal patterns differ). The reference architecture assumes a single-model deployment per agent. Multi-model deployments need per-model eval suites and a routing-layer eval test that the architecture does not name. The companion piece on LLM-agnostic AI agents covers the routing-layer side of this.
Model deprecation breaks the reproducible triple. The (code, prompt, model) triple is auditable today. When Microsoft retires gpt-4o or upgrades a model behind the same name, the old triple is no longer reproducible. The article does not address provider-side model lifecycle. A real AgentOps practice needs: explicit model-version pinning, a deprecation-monitoring process, a re-evaluation campaign on every forced model upgrade, and a rollback plan that anticipates the model itself being unavailable.
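Pinning and deprecation monitoring can start as a small CI check long before they become a formal process. The sketch below is one possible shape: `check_model_pin` is a hypothetical helper, and the retirement date is something the team would maintain from the provider's published lifecycle notices, not something this code looks up.

```python
from datetime import date
from typing import Optional

# Version strings that float with the provider instead of pinning (assumed list).
UNPINNED = {"", "latest", "auto"}


def check_model_pin(model: str, version: Optional[str],
                    retirement: Optional[date], today: date,
                    warn_days: int = 90) -> list[str]:
    """Return problems with an agent's model reference; empty means healthy."""
    problems = []
    if version is None or version.strip().lower() in UNPINNED:
        problems.append(f"{model}: version is not pinned")
    if retirement is not None:
        days_left = (retirement - today).days
        if days_left < 0:
            problems.append(f"{model}: retired {-days_left} days ago")
        elif days_left <= warn_days:
            problems.append(f"{model}: retires in {days_left} days; schedule re-evaluation")
    return problems
```

Running this on every pipeline execution turns a forced model upgrade from a surprise into a scheduled re-evaluation campaign.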
The HITL approver authority gap. Required human approval before Production deploy is the right gate. The assumption is that the approver has both the technical literacy to read eval results and the product authority to block a promotion or defund a stalled one. Most enterprises have one or the other, rarely both. The product manager who has authority does not always have the technical literacy; the engineer who has the literacy does not always have the authority. This is the same political-maturity gap that the Agentic Patterns Playbook decode names in its critique of the CoE framework.
Continuous evaluation on production traffic raises PII and audit concerns. The article mentions sampled production traffic but does not address sampling-rate decisions, PII redaction before eval, audit logging for sampled traffic, or how to handle eval failures that would themselves expose PII in incident reports. In regulated industries (finance, healthcare, public sector) these are gating decisions, not implementation details.
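Two of those decisions, the sampling rate and redaction before eval, can be sketched in a few lines. This is a deliberately naive illustration: the helper names are hypothetical, and the two regular expressions catch only simple email and phone patterns. They are not adequate PII handling for a regulated workload, where a dedicated redaction service and a reviewed policy would sit in this position.

```python
import hashlib
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace id always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def redact(text: str) -> str:
    """Mask simple email and phone patterns before the text reaches the evaluator."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

Hash-based sampling keeps the decision reproducible for audit: given a trace id and the configured rate, anyone can verify after the fact whether that trace should have been evaluated.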
Pipeline parity is operational debt. Microsoft publishes both a GitHub Actions workflow and an Azure DevOps YAML pipeline as parallel reference implementations. Both are well-designed. In practice teams pick one and stick with it. Maintaining both is operational debt the article does not name. The decision criteria are real: the tooling your shop already runs, identity model fit, RBAC and approval-gate ergonomics, and how evaluation results integrate with your existing developer surfaces.
Cost regression alerting needs an owner and authority. “Alert on > 20% token regression” is a correct control. Who receives the alert and what authority they have to act on it (roll back, defund, downgrade model selection, require redesign) is not specified. Without a named owner and a defined escalation path, the alert becomes a Slack notification that no one acts on.
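The control and the ownership requirement can live in the same function, so that an alert without an owner is a configuration error rather than a silent notification. A minimal sketch, assuming a hypothetical `owners` mapping from agent name to an accountable person or team:

```python
from typing import Mapping, Optional


def token_regression(baseline: float, current: float) -> float:
    """Relative change in tokens per query versus the baseline (0.25 means +25%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def cost_alert(agent: str, baseline: float, current: float,
               owners: Mapping[str, str], threshold: float = 0.20) -> Optional[dict]:
    """Return an alert routed to a named owner, or None if within tolerance."""
    change = token_regression(baseline, current)
    if change <= threshold:
        return None
    owner = owners.get(agent)
    if owner is None:
        # Refuse to emit an alert nobody is accountable for.
        raise LookupError(f"no cost owner registered for agent '{agent}'")
    return {"agent": agent, "owner": owner, "regression_pct": round(change * 100, 1)}
```

What the owner is allowed to do with the alert (roll back, downgrade the model, require redesign) still has to be written down outside the code.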
The Cross-Stack Test: AgentOps Beyond Microsoft
The same vendor-portability question we asked of the Patterns Playbook applies here. Are these patterns Microsoft-stack-implicit, or do they transfer cleanly to non-Microsoft operating models?
The CI/CD shape transfers cleanly. The eval-gate-as-release-gate principle is universal. The (code, prompt, model) version-pinning principle is universal. The Dev → Test (with HITL) → Prod promotion sequence is universal. The OIDC + workload identity pattern is universal across major identity providers.
The implementation surfaces differ. A team running Pega-based agent orchestration over Pega case management can adopt the same release-gate discipline with Pega’s own deployment tooling and a custom evaluation script. A team running Salesforce Agentforce can wire eval gates into Salesforce DX. A team running n8n or LangChain or LangGraph outside Microsoft can adopt the same CI sequence (lint, test, eval, build) using GitHub Actions or GitLab CI without touching Foundry.
What does not transfer cleanly is the agent registry. Foundry’s built-in agent registry, version history, and Entra Agent Identity per deployed version are platform features that have no direct cross-stack equivalent today. Teams running agents outside Foundry will need to build or buy the registry layer. This is one of the strongest practical reasons to standardise on Foundry for new enterprise agent workloads if the rest of your stack is Microsoft-aligned.
The Monday Move for Different Team Types
Adopt the reference architecture selectively. Your team’s current posture determines the right starting move.
| Team posture | Where to start | Where to skip-read |
|---|---|---|
| Already running Power Platform ALM with release gates | This maps cleanly onto your existing discipline. Adopt the eval-gate Python script as a new gate type in your existing pipeline. The CI structure plugs into Azure DevOps cleanly if you are already on ADO. | Skip nothing; this is additive. |
| One Foundry agent in Dev, no pipeline yet | Stand up the Dev project plus the eval-gate Python script first. Run it on every PR. Do not try to set up Test or Prod environments until the eval-gate is producing trustworthy signal for at least one full release cycle. | Skip the multi-environment topology decision until you have eval-gate confidence. |
| Running 10+ agents across Foundry plus Claude plus OpenAI | Build the per-model evaluation suite first. The routing layer is a separate concern; treat the AgentOps pipeline as per-model and converge at the routing layer. The LLM-agnostic agents companion (linked above) covers the routing side. | Skip the single-pipeline assumption; you need one CI pipeline per backing model. |
| No release discipline today | Start with eval-gate-on-PR for one agent. Add the Test environment after one quarter of running the eval-gate cleanly. Add Prod after another quarter. The temptation to set up all three environments on day one is the most common failure mode. | Skip the full reference architecture; treat it as the 12-month target, not the Monday move. |
Budget anchor. The reference architecture is silent on the engineering cost of standing up the pipeline. Order-of-magnitude estimate from our experience: a team adopting AgentOps for the first time typically spends 4-8 engineer-weeks on the initial pipeline (developer environment standardisation, CI eval-gate Python script, single-environment CD, OIDC trust setup). A team running 10+ agents at scale typically allocates 1-2 dedicated FTEs to AgentOps platform engineering plus shared eval-dataset curation across product teams. Treat these as illustrative ranges; your specific use cases, your team’s prior CI/CD maturity, and your regulated-data posture move them.
The Real Deployment Artefact Is the Agent Version
Microsoft’s whole reframing in one line:
“Treat the agent version as your deployment artefact, and evaluation outcomes as your release gate.”
Internalise that and the rest of the architecture organises itself. The container image is supporting infrastructure. The pipeline is the production line. The agent registry is the audit surface. The runtime is the execution surface. The thing that ships, the thing that promotes, the thing that rolls back, is the agent version.
The strategic implication for hiring, funding, and platform-engineering investment over the next 12 months is direct. Hiring: the Agent Product Owner role the Patterns Playbook formalises becomes the role that owns the agent version across its lifecycle. Funding: the eval-dataset curation that the reference architecture treats as a sidebar is in fact a budget line and a named engineering responsibility. Platform engineering: the AgentOps pipeline becomes a platform-team deliverable, not a per-product-team improvisation.
Microsoft has done the harder half of the work by publishing the reference. The remaining half is operational discipline: eval-dataset ownership, multi-model evaluation, model-deprecation rollback plans, HITL approver authority, continuous-evaluation PII handling, and pipeline-choice consolidation. Those are the practitioner gaps, and they are where the next 12 months of enterprise AgentOps work actually happens.
Read Next
- The Six Agentic Adoption Patterns: A Practitioner Decode of Microsoft’s New Playbook (2026). The operating-model layer above the CI/CD layer. The Agent Product Owner role and the CoE structures that own the AgentOps pipeline live in that document.
- AI Orchestration for Legacy Systems: The Operational Front Door Pattern (2026). The reference architecture for the agents themselves, beneath the deployment pipeline. Same Microsoft surface stack viewed from the runtime side.
- Six Rules for LLM-Agnostic AI Agents on Microsoft Foundry. The multi-model routing layer. When the AgentOps pipeline has to ship the same agent against three different models, this is the companion read.
- Source: CI/CD for AI Agents on Microsoft Foundry (Microsoft Tech Community, 2026-05-22). The primary reference.
- Reference repo: leestott/foundry-cicd. The GitHub Actions + Azure DevOps YAML reference implementations.
- Reference repo: ericchansen/foundry-agents-lifecycle. The hosted + prompt-based agent demos.