AgentOps on Microsoft Foundry: A Practitioner Decode of the New CI/CD Reference Architecture (2026)

Practitioner read of Microsoft's new Foundry CI/CD reference architecture: the 5-layer pipeline, evaluation-driven release gates, and where the architecture understates the operational work.

Alex Pechenizkiy

Microsoft just published an end-to-end CI/CD reference architecture for Foundry agents. It is the first official Microsoft document that treats agent deployment as a first-class engineering discipline, not as a side effect of agent authoring. This is the practitioner read on what it gets right, where it understates the operational work, and what to do with it on Monday.

The CI/CD for AI Agents on Microsoft Foundry reference came out 2026-05-22 from Lee Stott on the Microsoft Tech Community Educator Developer Blog. Two reference repos accompany it: leestott/foundry-cicd for the pipeline implementation, ericchansen/foundry-agents-lifecycle (Eric Chansen) for the agent demo. The framework’s one-line thesis is the reframing on which the rest of the document hangs:

“Agents follow a standard CI/CD pattern, but with a critical shift: promotion happens at the agent version level, and release gates are driven by evaluation outcomes, not just test results.”

Treat the document as v1 of a still-refining artefact. It is one week old at the time of writing, sits on a tech-community blog rather than Microsoft Learn, and has not yet been pressure-tested by a real enterprise deployment cycle. The architecture itself is sound. The understated parts are where the practitioner work begins.

The Assist-to-Execute Shift Demands AgentOps

The companion piece on the Microsoft Agentic Transformation Patterns Playbook opens with the shift from Assist mode (agent supports a human decision; human accountable) to Execute mode (agent performs work across systems; human oversees outcome). Execute mode demands four operating-model commitments: ownership, risk, lifecycle, and governance.

The CI/CD reference architecture is the lifecycle commitment made concrete. It is not optional. The moment any agent crosses from Assist to Execute, the operational surface area grows enough that ad-hoc deployment becomes a liability.

What the reference architecture adds beyond classic software CI/CD is the recognition that probabilistic systems need probabilistic release gates. A unit test that passes today on input X will still pass tomorrow on input X. An agent that answers correctly today on prompt X may answer differently tomorrow because the provider updated the model behind the same name, the upstream retrieval changed, a tool dependency updated, or the eval dataset became stale. The pipeline has to assume non-determinism and gate accordingly.
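One way to read "probabilistic release gate" is a gate on a pass rate over repeated runs instead of a single assertion. The sketch below is my own illustration, not code from the reference repos: `pass_rate`, `gate`, and the stand-in `flaky_agent_case` are hypothetical names, and the 0.85 minimum is an arbitrary example value.

```python
import random
from typing import Callable


def pass_rate(run_case: Callable[[], bool], trials: int) -> float:
    """Run the same case `trials` times and return the fraction that passed."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if run_case())
    return passes / trials


def gate(rate: float, minimum: float) -> bool:
    """A release gate on a rate, not on a single pass/fail result."""
    return rate >= minimum


# Stand-in for a real agent call (assumption): succeeds about 90% of the time.
rng = random.Random(0)


def flaky_agent_case() -> bool:
    return rng.random() < 0.9


rate = pass_rate(flaky_agent_case, trials=200)
print(f"pass rate {rate:.2%}, gate passed: {gate(rate, 0.85)}")
```

In a real pipeline `run_case` would call the agent and score the answer; the point is only that the gate compares an aggregate to a threshold.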

The reframing the reference puts at the top is, in my view, the single most useful sentence in the document. The deployment unit is no longer a container or a tag. It is an agent version: an immutable artefact bundling model selection, system instructions, tool definitions, and configuration. Promotion happens at that version level. Rollback is a pointer-swap, not a re-deployment.
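To make "immutable artefact" and "pointer-swap" concrete, here is a toy in-memory model. It is a sketch under my own assumptions: `AgentVersion`, `Registry`, and every field name are hypothetical and do not mirror the Foundry SDK or its registry API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AgentVersion:
    """Immutable bundle; field names are illustrative only."""
    version: str
    model: str
    instructions: str
    tools: Tuple[str, ...]


@dataclass
class Registry:
    """Versions are append-only; 'active' is just a pointer to one of them."""
    versions: Dict[str, AgentVersion] = field(default_factory=dict)
    active: Optional[str] = None
    history: List[str] = field(default_factory=list)

    def register(self, v: AgentVersion) -> None:
        if v.version in self.versions:
            raise ValueError(f"version {v.version} already exists and is immutable")
        self.versions[v.version] = v

    def promote(self, version: str) -> None:
        if version not in self.versions:
            raise KeyError(version)
        if self.active is not None:
            self.history.append(self.active)
        self.active = version

    def rollback(self) -> str:
        """Swap the pointer back; nothing is rebuilt or redeployed."""
        if not self.history:
            raise RuntimeError("no previous version to roll back to")
        self.active = self.history.pop()
        return self.active


registry = Registry()
registry.register(AgentVersion("1", "model-a", "Be concise.", ("search",)))
registry.register(AgentVersion("2", "model-a", "Be concise and cite sources.", ("search",)))
registry.promote("1")
registry.promote("2")
rolled_back_to = registry.rollback()
print(rolled_back_to)  # prints 1
```

The design choice worth noticing is that `rollback` never touches the version objects. That is what makes it cheap and auditable.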

The 5-Layer Architecture at a Glance

The reference flow is five layers from developer commit to production observability.

| Layer | Responsibility | Key components |
|---|---|---|
| 1. Developer | Source-controlled repo for agent code, configuration, and infrastructure | Python or .NET agent code, agent.yaml or prompt definitions, MCP/REST tool configurations, Bicep or ARM IaC for Foundry project provisioning |
| 2. CI Pipeline | Build, validate, evaluate every push or PR before producing a versioned agent artefact | Docker build (hosted) or YAML schema validation (prompt), ruff + bandit static checks, pytest unit + tool tests, evaluation gate against golden dataset, ACR push |
| 3. CD Pipeline | Promote one agent version through three Foundry project environments with eval and human-approval gates | Stage 1 Dev (smoke + eval), Stage 2 Test (scenario + HITL + safety), Stage 3 Prod (versioned promote + endpoint enable) |
| 4. Foundry Agent Service | Runtime for the deployed agent versions, with built-in lifecycle and identity | Hosted + prompt-based runtimes, versioned deployments, Entra Agent Identity per version, RBAC per project, distributed traces + structured logs |
| 5. Monitoring + Governance + Control Plane | Platform-level governance, observability, and continuous evaluation across the deployed fleet | Agent registry + version history, OpenTelemetry forwarded to Azure Monitor + Application Insights, continuous evaluation on sampled production traffic, Azure Policy + RBAC enforcement |

Figure: Five-layer reference architecture for Foundry agent CI/CD. Arrows connect each layer to the next: the developer repo (Layer 1), the CI flow from Docker build through the evaluation gate to ACR push (Layer 2), three sequential Foundry project environments with eval gates between each and a human-approval gate before Prod (Layer 3), the Foundry Agent Service runtime (Layer 4), and the control plane spanning the bottom of the diagram (Layer 5). The evaluation gate is highlighted as the critical new control point.

My read, not a number from the source: Layers 2 and 3 carry most of the operational discipline, roughly four-fifths of the new engineering work. The developer layer is what most teams already have. Foundry Agent Service is what Microsoft provides. The control plane is what the platform team configures once. The CI and CD pipelines are where the new engineering effort goes, and where I expect most teams to invest most heavily over roughly the next 6 to 12 months.
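The Layer 3 promotion sequence is small enough to sketch as a state machine: one agent version moves Dev, Test, Prod in order, every step needs a passing eval, and the step into Prod also needs a named human approver. Everything here (`PromotionState`, `promote`, the stage names as strings) is a hypothetical illustration, not the reference pipeline's code.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ("dev", "test", "prod")


@dataclass(frozen=True)
class PromotionState:
    agent_version: str
    stage: Optional[str] = None  # None means built by CI, not yet deployed


def next_stage(current: Optional[str]) -> str:
    """Return the stage that follows `current`; stages cannot be skipped."""
    if current is None:
        return STAGES[0]
    idx = STAGES.index(current)
    if idx == len(STAGES) - 1:
        raise ValueError("version is already in prod")
    return STAGES[idx + 1]


def promote(state: PromotionState, eval_passed: bool,
            approved_by: Optional[str] = None) -> PromotionState:
    """Move the same agent version one stage forward if its gates are satisfied."""
    target = next_stage(state.stage)
    if not eval_passed:
        raise PermissionError(f"eval gate failed; cannot promote to {target}")
    if target == "prod" and not approved_by:
        raise PermissionError("human approval is required before prod")
    return PromotionState(state.agent_version, target)


state = PromotionState("agent-v7")
state = promote(state, eval_passed=True)                       # -> dev
state = promote(state, eval_passed=True)                       # -> test
state = promote(state, eval_passed=True, approved_by="alice")  # -> prod
print(state)
```

Note that the version string never changes across stages. The same artefact is promoted; nothing is rebuilt per environment.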

Evaluation Gates: the Real Differentiator

Classic software pipelines gate on tests. Agent pipelines add a category Microsoft calls evaluation-driven quality gates. These run at two points: pre-merge during CI, and pre-promotion during CD. The published threshold table is the most directly useful artefact in the document.

| Category | Metric | CI threshold | Prod threshold |
|---|---|---|---|
| Quality | Hallucination rate | < 5% | < 3% |
| Quality | Task completion rate | > 90% | > 95% |
| Safety | Grounded response rate | > 95% | > 98% |
| Safety | Policy violations | 0 | 0 |
| Performance | p95 latency | < 4000 ms | < 3000 ms |
| Cost | Token usage per query | Track only | Alert on > 20% regression |

The threshold values are reasonable starting points, but they are not universal. A claims-processing agent has a very different hallucination tolerance than a meeting-summary agent. A code-review agent has a different latency tolerance than a customer-service agent. Microsoft frames these as “thresholds” without saying that they are illustrative. Treat them as illustrative starters and calibrate to your specific use case, your specific user expectations, and your specific liability exposure.
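The published thresholds translate directly into a gate function. This is a minimal sketch, not the eval-gate script from the reference repos: the metric key names and the `evaluate_gate` helper are my own, rates are expressed as fractions, and the cost row is left out because it is track-only in CI.

```python
import operator
from typing import Callable, Dict, List, Tuple

Threshold = Tuple[Callable[[float, float], bool], float]

# Limits copied from the published table; key names are hypothetical.
CI_THRESHOLDS: Dict[str, Threshold] = {
    "hallucination_rate": (operator.lt, 0.05),
    "task_completion_rate": (operator.gt, 0.90),
    "grounded_response_rate": (operator.gt, 0.95),
    "policy_violations": (operator.eq, 0),
    "p95_latency_ms": (operator.lt, 4000),
}
PROD_THRESHOLDS: Dict[str, Threshold] = {
    "hallucination_rate": (operator.lt, 0.03),
    "task_completion_rate": (operator.gt, 0.95),
    "grounded_response_rate": (operator.gt, 0.98),
    "policy_violations": (operator.eq, 0),
    "p95_latency_ms": (operator.lt, 3000),
}


def evaluate_gate(metrics: Dict[str, float],
                  thresholds: Dict[str, Threshold]) -> List[str]:
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, (compare, limit) in thresholds.items():
        if name not in metrics:
            failures.append(f"{name}: missing from eval results")
        elif not compare(metrics[name], limit):
            failures.append(f"{name}: {metrics[name]} violates limit {limit}")
    return failures


# Example eval output (made-up numbers) that clears CI but not Prod.
metrics = {
    "hallucination_rate": 0.04,
    "task_completion_rate": 0.93,
    "grounded_response_rate": 0.97,
    "policy_violations": 0,
    "p95_latency_ms": 3500,
}
ci_failures = evaluate_gate(metrics, CI_THRESHOLDS)
prod_failures = evaluate_gate(metrics, PROD_THRESHOLDS)
print("CI:", ci_failures or "pass")
print("Prod:", prod_failures or "pass")
```

Keeping the thresholds in data rather than in code is what makes per-agent calibration a reviewable change instead of a script edit.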

The hidden cost the document acknowledges briefly but does not quantify is dataset maintenance. The article does say “stale datasets produce misleading pass/fail signals” and “treat your golden evaluation set as a first-class engineering artefact alongside the agent code itself.” That is correct but understates the work. A real golden dataset needs:

  • A named owner with authority over what goes in and what comes out
  • A change-review process that mirrors code review
  • Quarterly coverage audits against real production traffic patterns
  • A drift detection process that flags when production cases diverge from dataset cases
  • Versioning that ties each eval run to a specific dataset version

Most teams do none of this. The result is a dataset that was representative on day one and unrepresentative by month six, producing eval scores that no longer reflect real-world performance. The eval gate becomes theatre.
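The versioning requirement is the cheapest one to start with. One possible approach, which is my assumption rather than anything the reference prescribes, is to fingerprint the dataset content and stamp that fingerprint on every eval run. `dataset_fingerprint` and `eval_run_record` are hypothetical helpers.

```python
import hashlib
import json
from typing import Any, Dict, List


def dataset_fingerprint(cases: List[Dict[str, Any]]) -> str:
    """Content hash of the dataset, independent of case order and key order."""
    canonical = sorted(json.dumps(c, sort_keys=True, ensure_ascii=False) for c in cases)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8"))
    return digest.hexdigest()[:12]


def eval_run_record(agent_version: str, cases: List[Dict[str, Any]],
                    metrics: Dict[str, float]) -> Dict[str, Any]:
    """Tie one eval run to the exact dataset content it was scored against."""
    return {
        "agent_version": agent_version,
        "dataset_version": dataset_fingerprint(cases),
        "case_count": len(cases),
        "metrics": metrics,
    }


golden = [
    {"id": "c1", "prompt": "Summarise the claim.", "expected": "summary"},
    {"id": "c2", "prompt": "What is the policy limit?", "expected": "limit"},
]
record = eval_run_record("agent-v7", golden, {"task_completion_rate": 0.93})
print(record["dataset_version"], record["case_count"])
```

With the fingerprint in the record, two eval scores are only comparable when their `dataset_version` values match, which is the check most teams skip.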

A second silent cost is token spend. Every PR triggers an eval run. Every eval run costs tokens at the rate of your model provider. At fleet scale (say, on the order of 100+ PRs per week across 10+ agents) the eval budget becomes a non-trivial line item. The reference architecture says nothing about cost-control patterns: sampled eval runs on draft PRs, full eval runs only on PRs marked ready-for-review, or per-team token budgets enforced at the pipeline level. These are decisions every team will face once the pipeline is running at scale.
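A back-of-envelope sketch of the arithmetic and of one of the cost-control patterns named above (sampled evals on draft PRs). The 100 PRs per week comes from the paragraph; the 500 cases per run, 2,000 tokens per case, and 10% draft sample are illustrative assumptions, and both function names are hypothetical.

```python
import random
from typing import List, Sequence


def select_eval_cases(case_ids: Sequence[str], is_draft: bool,
                      draft_fraction: float = 0.1, seed: int = 0) -> List[str]:
    """Full dataset for ready-for-review PRs, a fixed-seed sample for drafts."""
    if not is_draft or not case_ids:
        return list(case_ids)
    k = max(1, round(len(case_ids) * draft_fraction))
    return random.Random(seed).sample(list(case_ids), k)


def weekly_eval_tokens(prs_per_week: int, cases_per_run: int,
                       tokens_per_case: int) -> int:
    """Tokens spent on evals per week if every PR triggers one run."""
    return prs_per_week * cases_per_run * tokens_per_case


case_ids = [f"case-{i}" for i in range(500)]
full_run = select_eval_cases(case_ids, is_draft=False)
draft_run = select_eval_cases(case_ids, is_draft=True)

full_cost = weekly_eval_tokens(100, len(full_run), 2000)
draft_cost = weekly_eval_tokens(100, len(draft_run), 2000)
print(f"all PRs full eval:    {full_cost:,} tokens/week")   # 100,000,000
print(f"all PRs sampled eval: {draft_cost:,} tokens/week")  # 10,000,000
```

The fixed seed is deliberate: a draft PR re-run scores the same subset, so a score change reflects the agent and not the sample.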

Hosted vs Prompt-Based Agents: Pipeline Differences in Practice

The reference architecture handles two deployment models. The split is real and the operational implications differ.

| Capability | Hosted agents | Prompt-based agents |
|---|---|---|
| Deployment unit | Container image plus agent definition | YAML / prompt configuration bundle |
| Build step required | Yes: Docker build plus ACR push | No: YAML validation only |
| Supported frameworks | Agent Framework, LangGraph, Semantic Kernel, custom code | Foundry declarative runtime |
| Promotion artefact | Versioned agent with container image reference | Versioned prompt / config bundle |
| CI focus | Code quality, tool tests, evaluation | Prompt schema validation, evaluation |
| Rollback mechanism | Switch active agent version (pointer) | Switch active agent version (pointer) |
| Runtime management | Foundry manages container lifecycle | Foundry manages declarative runtime |

Hosted agents inherit all the Docker supply-chain problems (base-image pinning, dependency version drift, SDK churn) on top of the AI-specific evaluation problems. A team running hosted agents needs both standard container-security discipline (CVE scanning, SBOM tracking, base-image refresh cadence) and AI-specific evaluation discipline. The reference architecture covers the AI side cleanly; the container-supply-chain side is treated as standard practice and left to the team.

Prompt-based agents look simpler because there is no Docker build. The complexity moves to the prompt-version lifecycle. A small change to a prompt can shift behaviour across thousands of test cases. The golden dataset needs coverage of prompt edge cases (jailbreaks, prompt injection, edge-case phrasings) that does not exist for the hosted-agent code path. Prompt-based deployment is operationally lighter on day one and operationally similar on day ninety once the prompt-version testing is mature.

Multi-Environment Topology

Microsoft recommends a three-environment topology with separate Foundry projects per environment.

| Option | Structure | Best for | Trade-off |
|---|---|---|---|
| A (Recommended) | Dev Project → Test Project → Prod Project (separate Foundry projects) | Enterprise workloads | Full RBAC isolation, distinct connection strings, separate evaluation signals, easier governance audit |
| B (Lightweight) | Single Foundry project with agent version tags (dev/test/prod) | Small teams, prototyping, internal demos | Simpler initial setup, weaker environment separation, RBAC has to be enforced at agent-version granularity |

Option A is the right call for any team running agents in production. The cost is operational: three Azure subscriptions or three resource groups (the boundary depends on your landing zone), three sets of connection strings, three sets of policy assignments. Teams without a clean dev/test/prod Azure landing zone will spend the first month of adoption fixing landing-zone problems before the CI/CD pipeline can be set up cleanly. Plan for this.

OIDC Workload Identity Federation between the pipeline and Azure is the right authentication pattern. It avoids storing long-lived Azure credentials in pipeline secrets, replaces secret rotation with short-lived tokens issued per pipeline run, and produces audit logs for every deployment action. Adopt OIDC from day one, even in the Dev environment; retrofitting it later is expensive.

Where the Reference Architecture Understates the Work

The architecture itself is correct. The framework is sound. There are six places where the document understates the engineering work the patterns demand.

Multi-model evaluation is unaddressed. Many real enterprise agents route across Foundry, Claude, and OpenAI by workload type. Eval suites that work for one provider do not automatically work for another (prompt formatting differs, tool-call semantics differ, refusal patterns differ). The reference architecture assumes a single-model deployment per agent. Multi-model deployments need per-model eval suites and a routing-layer eval test that the architecture does not name. The companion piece on LLM-agnostic AI agents covers the routing-layer side of this.

Model deprecation breaks the reproducible triple. The (code, prompt, model) triple is auditable today. When Microsoft retires gpt-4o or upgrades a model behind the same name, the old triple is no longer reproducible. The article does not address provider-side model lifecycle. A real AgentOps practice needs: explicit model-version pinning, a deprecation-monitoring process, a re-evaluation campaign on every forced model upgrade, and a rollback plan that anticipates the model itself being unavailable.
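A pinning check can run as an ordinary CI step. The sketch below assumes the team maintains its own retirement calendar. `ReleaseTriple`, `pinning_issues`, the model names, and the dates are all invented for illustration and do not describe any real provider schedule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ReleaseTriple:
    code_sha: str
    prompt_sha: str
    model_name: str
    model_version: Optional[str]  # must be an explicit version, never an alias


def pinning_issues(triple: ReleaseTriple,
                   retirements: Dict[Tuple[str, str], date],
                   today: date, warn_days: int = 90) -> List[str]:
    """Flag unpinned models and models that are retired or close to retirement."""
    if not triple.model_version or triple.model_version.lower() in {"latest", "auto"}:
        return ["model version is not pinned"]
    issues = []
    retire_on = retirements.get((triple.model_name, triple.model_version))
    if retire_on is not None:
        days_left = (retire_on - today).days
        if days_left < 0:
            issues.append("pinned model is already retired; triple is not reproducible")
        elif days_left <= warn_days:
            issues.append(f"pinned model retires in {days_left} days; schedule re-evaluation")
    return issues


# Hypothetical calendar maintained by the platform team.
calendar = {("example-model", "2025-01-01"): date(2026, 9, 1)}
today = date(2026, 7, 1)

pinned = ReleaseTriple("abc123", "def456", "example-model", "2025-01-01")
floating = ReleaseTriple("abc123", "def456", "example-model", "latest")
print(pinning_issues(pinned, calendar, today))
print(pinning_issues(floating, calendar, today))
```

Failing the build on an unpinned version and warning on an approaching retirement gives the re-evaluation campaign a trigger that does not depend on someone reading a provider announcement.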

The HITL approver authority gap. Required human approval before Production deploy is the right gate. The assumption is that the approver has both the technical literacy to read eval results and the product authority to stop a promotion. Most enterprises have one or the other, rarely both. The product manager who has authority does not always have the technical literacy; the engineer who has the literacy does not always have the authority. This is the same political-maturity gap that the Agentic Patterns Playbook decode names in its critique of the CoE framework.

Continuous evaluation on production traffic raises PII and audit concerns. The article mentions sampled production traffic but does not address sampling-rate decisions, PII redaction before eval, audit logging for sampled traffic, or how to handle eval failures that would themselves expose PII in incident reports. In regulated industries (finance, healthcare, public sector) these are gating decisions, not implementation details.
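Two of those decisions, sampling rate and redaction before eval, can at least be made explicit in code. The sketch below is deliberately crude and is my own illustration: the two regexes are stand-ins for a proper PII-detection service and will miss many real cases, and `redact`, `is_sampled`, and `prepare_for_eval` are hypothetical names.

```python
import hashlib
import re
from typing import Dict, Optional

# Crude patterns for illustration only; not adequate for regulated data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def is_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace id always gets the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16) / 2**32
    return bucket < rate


def prepare_for_eval(trace: Dict[str, str], rate: float) -> Optional[Dict[str, str]]:
    """Return a redacted copy of a sampled trace, or None if it is not sampled."""
    if not is_sampled(trace["trace_id"], rate):
        return None
    return {
        "trace_id": trace["trace_id"],
        "prompt": redact(trace["prompt"]),
        "response": redact(trace["response"]),
    }


trace = {
    "trace_id": "t-001",
    "prompt": "Contact jane@example.com or +1 (555) 010-9999",
    "response": "I have noted the contact details.",
}
print(prepare_for_eval(trace, rate=1.0))
```

Hashing the trace id instead of calling a random generator means an auditor can later reproduce exactly which traces were sampled at a given rate.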

Pipeline parity is operational debt. Microsoft publishes both a GitHub Actions workflow and an Azure DevOps YAML pipeline as parallel reference implementations. Both are well-designed. In practice teams pick one and stick with it. Maintaining both is operational debt the article does not name. The decision criteria are real: the tooling your organisation already runs, identity model fit, RBAC and approval-gate ergonomics, and how well evaluation-result reporting integrates with your existing developer surfaces.

Cost regression alerting needs an owner and authority. “Alert on > 20% token regression” is a correct control. Who receives the alert and what authority they have to act on it (roll back, defund, downgrade model selection, require redesign) is not specified. Without a named owner and a defined escalation path, the alert becomes a Slack notification that no one acts on.
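The control itself is one line of arithmetic; the missing part is routing. A minimal sketch, assuming a team-maintained owner map: `cost_alert`, the owner table, and the action list are hypothetical, with the actions taken from the options listed above.

```python
from typing import Any, Dict, Optional


def token_regression(baseline: float, current: float) -> float:
    """Relative change in tokens per query versus the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def cost_alert(agent: str, baseline: float, current: float,
               owners: Dict[str, str],
               threshold: float = 0.20) -> Optional[Dict[str, Any]]:
    """Return an alert addressed to a named owner, or None if within budget."""
    change = token_regression(baseline, current)
    if change <= threshold:
        return None
    owner = owners.get(agent)
    if owner is None:
        # An alert nobody owns is the failure mode; fail loudly instead.
        raise LookupError(f"no cost owner registered for {agent}")
    return {
        "agent": agent,
        "owner": owner,
        "regression": round(change, 3),
        "allowed_actions": ["roll back", "downgrade model", "require redesign"],
    }


owners = {"claims-agent": "claims-product-owner"}
alert = cost_alert("claims-agent", baseline=1000, current=1250, owners=owners)
print(alert)
```

Raising on a missing owner turns "nobody acts on the alert" from a silent outcome into a pipeline failure that someone has to fix once.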

The Cross-Stack Test: AgentOps Beyond Microsoft

The same vendor-portability question we asked of the Patterns Playbook applies here. Are these patterns Microsoft-stack-implicit, or do they transfer cleanly to non-Microsoft operating models?

The CI/CD shape transfers cleanly. The eval-gate-as-release-gate principle is universal. The (code, prompt, model) version-pinning principle is universal. The Dev → Test (with HITL) → Prod promotion sequence is universal. The OIDC + workload identity pattern is universal across major identity providers.

The implementation surfaces differ. A team running Pega-based agent orchestration over Pega case management can adopt the same release-gate discipline with Pega’s own deployment tooling and a custom evaluation script. A team running Salesforce Agentforce can wire eval gates into Salesforce DX. A team running n8n or LangChain or LangGraph outside Microsoft can adopt the same CI sequence (lint, test, eval, build) using GitHub Actions or GitLab CI without touching Foundry.

What does not transfer cleanly is the agent registry. Foundry’s built-in agent registry, version history, and Entra Agent Identity per deployed version are platform features that have no direct cross-stack equivalent today. Teams running agents outside Foundry will need to build or buy the registry layer. This is one of the strongest practical reasons to standardise on Foundry for new enterprise agent workloads if the rest of your stack is Microsoft-aligned.

The Monday Move for Different Team Types

Adopt the reference architecture selectively. Your team’s current posture determines the right starting move.

| Team posture | Where to start | Where to skip-read |
|---|---|---|
| Already running Power Platform ALM with release gates | This maps cleanly onto your existing discipline. Adopt the eval-gate Python script as a new gate type in your existing pipeline. The CI structure plugs into Azure DevOps cleanly if you are already on ADO. | Skip nothing; this is additive. |
| One Foundry agent in Dev, no pipeline yet | Stand up the Dev project plus the eval-gate Python script first. Run it on every PR. Do not try to set up Test or Prod environments until the eval-gate is producing trustworthy signal for at least one full release cycle. | Skip the multi-environment topology decision until you have eval-gate confidence. |
| Running 10+ agents across Foundry plus Claude plus OpenAI | Build the per-model evaluation suite first. The routing layer is a separate concern; treat the AgentOps pipeline as per-model and converge at the routing layer. The LLM-agnostic agents companion (linked above) covers the routing side. | Skip the single-pipeline assumption; you need one CI pipeline per backing model. |
| No release discipline today | Start with eval-gate-on-PR for one agent. Add the Test environment after one quarter of running the eval-gate cleanly. Add Prod after another quarter. The temptation to set up all three environments on day one is the most common failure mode. | Skip the full reference architecture; treat it as the 12-month target, not the Monday move. |

Budget anchor. The reference architecture is silent on the engineering cost of standing up the pipeline. Order-of-magnitude estimate from our experience: a team adopting AgentOps for the first time typically spends 4-8 engineer-weeks on the initial pipeline (developer environment standardisation, CI eval-gate Python script, single-environment CD, OIDC trust setup). A team running 10+ agents at scale typically allocates 1-2 dedicated FTEs to AgentOps platform engineering plus shared eval-dataset curation across product teams. Treat these as illustrative ranges; your specific use cases, your team’s prior CI/CD maturity, and your regulated-data posture move them.

The Real Deployment Artefact Is the Agent Version

Microsoft’s whole reframing in one line:

“Treat the agent version as your deployment artefact, and evaluation outcomes as your release gate.”

Internalise that and the rest of the architecture organises itself. The container image is supporting infrastructure. The pipeline is the production line. The agent registry is the audit surface. The runtime is the execution surface. The thing that ships, the thing that promotes, the thing that rolls back, is the agent version.

The strategic implication for hiring, funding, and platform-engineering investment over the next 12 months is direct. Hiring: the Agent Product Owner role the Patterns Playbook formalises becomes the role that owns the agent version across its lifecycle. Funding: the eval-dataset curation that the reference architecture treats as a sidebar is in fact a budget line and a named engineering responsibility. Platform engineering: the AgentOps pipeline becomes a platform-team deliverable, not a per-product-team improvisation.

Microsoft has done the harder half of the work by publishing the reference. The remaining half is operational discipline: eval-dataset ownership, multi-model evaluation, model-deprecation rollback plans, HITL approver authority, continuous-evaluation PII handling, and pipeline-choice consolidation. Those are the practitioner gaps, and they are where the next 12 months of enterprise AgentOps work actually happens.