AgentOps on Microsoft Foundry: A Practitioner Decode of the New CI/CD Reference Architecture (2026)

Microsoft just published an end-to-end CI/CD reference architecture for Foundry agents. It is the first official Microsoft document that treats agent deployment as a first-class engineering discipline, not as a side effect of agent authoring. This is the practitioner read on what it gets right, where it understates the operational work, and what to do with it on Monday.

The CI/CD for AI Agents on Microsoft Foundry reference came out 2026-05-22 from Lee Stott on the Microsoft Tech Community Educator Developer Blog. Two reference repos accompany it: leestott/foundry-cicd for the pipeline implementation and ericchansen/foundry-agents-lifecycle (Eric Chansen) for the agent demo. The framework’s one-line thesis is the reframing on which the rest of the document hangs:

“Agents follow a standard CI/CD pattern, but with a critical shift: promotion happens at the agent version level, and release gates are driven by evaluation outcomes, not just test results.”

Treat the document as v1 of a still-refining artefact. It is one week old at writing, sits on a tech-community blog rather than Microsoft Learn, and has not yet been pressure-tested by a real enterprise deployment cycle. The architecture itself is sound. The understated parts are where the practitioner work begins.

💡 Microsoft surface maturity (mid-2026)

This GA-vs-Preview split is my read as of mid-2026, not a list from the source; verify against the latest Foundry release notes before you plan around it.

GA today: Foundry Agent Service (hosted + prompt-based runtimes), Azure AI Projects SDK, Azure DevOps + GitHub Actions, Azure Container Registry, Azure Monitor, Application Insights, Entra Agent Identity, OIDC Workload Identity Federation.

Preview / hardening as of mid-2026: azure-ai-evaluation SDK at its current API surface, continuous-evaluation pipelines on production traffic, Foundry control-plane agent registry features.

The Layer 1, 2, and 3 pieces (developer + CI + CD) ship today on stable Microsoft surfaces. Layer 4 and Layer 5 (Foundry runtime + control plane) depend on surfaces that are mostly GA, though the operational tooling around continuous evaluation is still maturing.

✅ TL;DR + the Monday move

Microsoft’s reference architecture has 5 layers (Developer → CI → CD → Foundry Agent Service → Monitoring + Governance). The single most important reframing, to my read: your deployment artefact is no longer a container image; it is an agent version (an immutable bundle of code, prompt, model, and tool config). Promotion gates fire on evaluation outcomes, not just test pass/fail.

Monday move: stand up the CI evaluation gate first. A failing eval blocks the PR before anything else moves. Once that gate is reliable on one agent, replicate the pattern across the fleet. Do not try to set up the full 3-environment topology until the eval gate is producing trustworthy signal.

Where the framework understates the work: evaluation dataset maintenance is treated as a sidebar. In practice it is a named engineering role and a budget line. A stale golden dataset turns the entire pipeline into theatre.

The Assist-to-Execute Shift Demands AgentOps

The companion piece on the Microsoft Agentic Transformation Patterns Playbook opens with the shift from Assist mode (agent supports a human decision; human accountable) to Execute mode (agent performs work across systems; human oversees outcome). Execute mode demands four operating-model commitments: ownership, risk, lifecycle, and governance.

The CI/CD reference architecture is the lifecycle commitment made concrete. It is not optional. The moment any agent crosses from Assist to Execute, the operational surface area grows enough that ad-hoc deployment becomes a liability.

What the reference architecture adds beyond classic software CI/CD is the recognition that probabilistic systems need probabilistic release gates. A unit test that passes today on input X will still pass tomorrow on input X. An agent that answers correctly today on prompt X may answer differently tomorrow because the provider updated the model behind the same name, the upstream retrieval changed, or a tool dependency updated. Even when the agent itself is unchanged, a stale eval dataset can stop measuring what production traffic actually looks like. The pipeline has to assume non-determinism and gate accordingly.

The reframing the reference puts at the top is, in my view, the single most useful sentence in the document. The deployment unit is no longer a container or a tag; it is an agent version: an immutable artefact bundling model selection, system instructions, tool definitions, and configuration. Promotion happens at that version level. Rollback is a pointer-swap, not a re-deployment.
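To make the "immutable version, pointer-swap rollback" idea concrete, here is a minimal sketch. It is an illustration only: `AgentVersion` and `AgentRegistry` are hypothetical names, the registry is in-memory, and none of this is the Foundry API.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentVersion:
    """Immutable bundle: changing any field means minting a new version."""
    model: str
    instructions: str
    tools: tuple[str, ...]
    config: tuple[tuple[str, str], ...]

    @property
    def version_id(self) -> str:
        # Content-addressed ID: identical bundles always produce the same ID.
        payload = json.dumps(
            [self.model, self.instructions, list(self.tools), list(self.config)]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class AgentRegistry:
    """Hypothetical in-memory registry (assumption, not a Foundry client)."""

    def __init__(self) -> None:
        self._versions: dict[str, AgentVersion] = {}
        self._history: list[str] = []  # activation order, newest last

    def register(self, version: AgentVersion) -> str:
        self._versions[version.version_id] = version
        return version.version_id

    def promote(self, version_id: str) -> None:
        if version_id not in self._versions:
            raise KeyError(f"unknown agent version: {version_id}")
        self._history.append(version_id)

    def rollback(self) -> str:
        # Pointer swap: the previous version becomes active; nothing is rebuilt.
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active(self) -> AgentVersion:
        return self._versions[self._history[-1]]
```

The design point is that a prompt tweak produces a new `version_id` just as a code change would, so the audit trail and the rollback target are always a complete bundle rather than a partial diff.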

The 5-Layer Architecture at a Glance

The reference flow is five layers from developer commit to production observability.

Layer	Responsibility	Key components
1. Developer	Source-controlled repo for agent code, configuration, and infrastructure	Python or .NET agent code, agent.yaml or prompt definitions, MCP/REST tool configurations, Bicep or ARM IaC for Foundry project provisioning
2. CI Pipeline	Build, validate, evaluate every push or PR before producing a versioned agent artefact	Docker build (hosted) or YAML schema validation (prompt), ruff + bandit static checks, pytest unit + tool tests, evaluation gate against golden dataset, ACR push
3. CD Pipeline	Promote one agent version through three Foundry project environments with eval and human-approval gates	Stage 1 Dev (smoke + eval), Stage 2 Test (scenario + HITL + safety), Stage 3 Prod (versioned promote + endpoint enable)
4. Foundry Agent Service	Runtime for the deployed agent versions, with built-in lifecycle and identity	Hosted + prompt-based runtimes, versioned deployments, Entra Agent Identity per version, RBAC per project, distributed traces + structured logs
5. Monitoring + Governance + Control Plane	Platform-level governance, observability, and continuous evaluation across the deployed fleet	Agent registry + version history, OpenTelemetry forwarded to Azure Monitor + Application Insights, continuous evaluation on sampled production traffic, Azure Policy + RBAC enforcement

[Figure: Five-layer reference architecture for Foundry agent CI/CD. Layer 1 Developer (source-controlled repo), Layer 2 CI (build, static checks, tests, evaluation gate, ACR push), Layer 3 CD (Dev, Test, and Prod Foundry projects with eval gates between each and a human-approval gate before Prod), Layer 4 Foundry Agent Service (runtime, versioned deployments, identity, observability), and Layer 5 Control Plane (agent registry, OpenTelemetry to Azure Monitor, continuous evaluation, Azure Policy). The evaluation gate is highlighted as the critical new control point.]

Honest observation, and this is my read rather than a number from the source: Layer 2 and Layer 3 are where most of the operational discipline lives, call it roughly four-fifths of the new engineering work. The developer layer is what most teams already have. Foundry Agent Service is what Microsoft provides. The control plane is what the platform team configures once. The CI + CD pipeline is where teams will need to invest most heavily over roughly the next 6 to 12 months.

Evaluation Gates: the Real Differentiator

Classic software pipelines gate on tests. Agent pipelines add a category Microsoft calls evaluation-driven quality gates. These run at two points: pre-merge during CI, and pre-promotion during CD. The published threshold table is the most directly useful artefact in the document.

Category	Metric	CI threshold	Prod threshold
Quality	Hallucination rate	< 5%	< 3%
Quality	Task completion rate	> 90%	> 95%
Safety	Grounded response rate	> 95%	> 98%
Safety	Policy violations	0	0
Performance	p95 latency	< 4000 ms	< 3000 ms
Cost	Token usage per query	Track only	Alert on > 20% regression

The threshold values are reasonable starters, but they are not universal. A claims-processing agent can tolerate far less hallucination than a meeting-summary agent. A code-review agent has a different latency tolerance than a customer-service agent. Microsoft frames these as “thresholds” without calling them illustrative starters. Treat them that way: calibrate to your specific use case, your specific user expectations, and your specific liability exposure.
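A gate of this kind can be sketched in a few lines. The numbers below are copied from the table above; the metric key names and the `evaluate_gate` function are my own assumptions for illustration, not the script from the reference repos.

```python
import operator

# Thresholds from the published table; token usage is "track only" in CI,
# so it is deliberately left out of the blocking checks.
THRESHOLDS = {
    "ci": {
        "hallucination_rate": ("<", 0.05),
        "task_completion_rate": (">", 0.90),
        "grounded_response_rate": (">", 0.95),
        "policy_violations": ("==", 0),
        "p95_latency_ms": ("<", 4000),
    },
    "prod": {
        "hallucination_rate": ("<", 0.03),
        "task_completion_rate": (">", 0.95),
        "grounded_response_rate": (">", 0.98),
        "policy_violations": ("==", 0),
        "p95_latency_ms": ("<", 3000),
    },
}

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def evaluate_gate(metrics: dict[str, float], stage: str) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, (op, limit) in THRESHOLDS[stage].items():
        if name not in metrics:
            # A missing metric is a failure, never a silent pass.
            failures.append(f"{name}: missing from eval results")
        elif not OPS[op](metrics[name], limit):
            failures.append(f"{name}: {metrics[name]} is not {op} {limit}")
    return failures
```

In a pipeline step, a non-empty result would translate into a non-zero exit code so the PR or promotion is blocked. Keeping the thresholds in one per-stage table also makes per-agent calibration a reviewed data change rather than a code change.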

The hidden cost the document acknowledges briefly but does not quantify is dataset maintenance. The article does say “stale datasets produce misleading pass/fail signals” and “treat your golden evaluation set as a first-class engineering artefact alongside the agent code itself.” That is correct but understates the work. A real golden dataset needs:

A named owner with authority over what goes in and what comes out
A change-review process that mirrors code review
Quarterly coverage audits against real production traffic patterns
A drift detection process that flags when production cases diverge from dataset cases
Versioning that ties each eval run to a specific dataset version

Most teams do none of this. The result is a dataset that was representative on day one and unrepresentative by month six, producing eval scores that no longer reflect real-world performance. The eval gate becomes theatre.
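Two of the items in that list, dataset versioning and coverage audits, are cheap to automate. The sketch below assumes the golden dataset is a list of JSON-serialisable cases and that production traffic has already been labelled with an intent; both function names and the 2% traffic-share cutoff are assumptions.

```python
import hashlib
import json


def dataset_version(cases: list[dict]) -> str:
    """Content hash of the golden dataset.

    Any edit, addition, removal, or reordering of cases changes the hash,
    so storing it with each eval run ties the score to an exact dataset.
    """
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def coverage_gap(
    dataset_intents: set[str],
    production_counts: dict[str, int],
    min_share: float = 0.02,
) -> list[str]:
    """Intents that carry at least min_share of production traffic
    but have no case in the golden dataset."""
    total = sum(production_counts.values())
    if total == 0:
        return []
    return sorted(
        intent
        for intent, count in production_counts.items()
        if count / total >= min_share and intent not in dataset_intents
    )
```

A non-empty `coverage_gap` result is a reasonable trigger for the dataset owner to review, which gives the quarterly audit a concrete starting list.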

A second silent cost is token spend. Every PR triggers an eval run. Every eval run costs tokens at the rate of your model provider. At fleet scale (say, on the order of 100+ PRs per week across 10+ agents) the eval budget becomes a non-trivial line item. The reference architecture says nothing about cost-control patterns: sampled eval runs on draft PRs, full eval runs only on PRs marked ready-for-review, or per-team token budgets enforced at the pipeline level. These are decisions every team will face once the pipeline is running at scale.
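The arithmetic is worth doing before the bill arrives. This is a back-of-envelope sketch; the function names, the 20% draft sample rate, and every figure in the usage note are assumptions, not numbers from the reference.

```python
def eval_cost_usd(cases: int, tokens_per_case: int, usd_per_1k_tokens: float) -> float:
    """Cost of one eval run: cases x tokens per case x price per 1,000 tokens."""
    return cases * tokens_per_case * usd_per_1k_tokens / 1000


def cases_for_pr(total_cases: int, is_draft: bool, draft_sample_rate: float = 0.2) -> int:
    """Draft PRs run a sample of the golden dataset; ready-for-review PRs run all of it."""
    if not is_draft:
        return total_cases
    return max(1, round(total_cases * draft_sample_rate))
```

With assumed figures of 500 cases, 2,000 tokens per case, and $0.01 per 1,000 tokens, a full run costs about $10, so 100 PRs per week is roughly $1,000 per week before any sampling. Running drafts at a 20% sample cuts each draft run to about $2, which is why the draft-versus-ready split is usually the first cost control worth adding.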

Hosted vs Prompt-Based Agents: Pipeline Differences in Practice

The reference architecture handles two deployment models. The split is real and the operational implications differ.

Capability	Hosted agents	Prompt-based agents
Deployment unit	Container image plus agent definition	YAML / prompt configuration bundle
Build step required	Yes: Docker build plus ACR push	No: YAML validation only
Supported frameworks	Agent Framework, LangGraph, Semantic Kernel, custom code	Foundry declarative runtime
Promotion artefact	Versioned agent with container image reference	Versioned prompt / config bundle
CI focus	Code quality, tool tests, evaluation	Prompt schema validation, evaluation
Rollback mechanism	Switch active agent version (pointer)	Switch active agent version (pointer)
Runtime management	Foundry manages container lifecycle	Foundry manages declarative runtime

Hosted agents inherit all the Docker supply-chain problems (base-image pinning, dependency version drift, SDK churn) on top of the AI-specific evaluation problems. A team running hosted agents needs both standard container-security discipline (CVE scanning, SBOM tracking, base-image refresh cadence) and AI-specific evaluation discipline. The reference architecture covers the AI side cleanly; the container-supply-chain side is treated as standard practice and left to the team.

Prompt-based agents look simpler because there is no Docker build. The complexity moves to the prompt-version lifecycle. A small change to a prompt can shift behaviour across thousands of test cases. The golden dataset needs coverage of prompt edge cases (jailbreaks, prompt injection, edge-case phrasings) that does not exist for the hosted-agent code path. Prompt-based deployment is operationally lighter on day one and operationally similar on day ninety once the prompt-version testing is mature.
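One way to keep that adversarial coverage honest is to check it in CI alongside the eval itself. The sketch below assumes each golden-dataset case carries a `tags` list; the tag names and minimum counts are hypothetical and should be set by whoever owns the dataset.

```python
from collections import Counter

# Assumed minimum number of cases per adversarial category.
REQUIRED_TAGS = {"jailbreak": 5, "prompt_injection": 5, "edge_phrasing": 10}


def missing_adversarial_coverage(
    cases: list[dict], required: dict[str, int]
) -> dict[str, int]:
    """Return how many cases each required category is short by.

    An empty dict means every category meets its minimum.
    """
    counts = Counter(tag for case in cases for tag in case.get("tags", []))
    return {
        tag: minimum - counts[tag]
        for tag, minimum in required.items()
        if counts[tag] < minimum
    }
```

Failing the build on a non-empty result stops a prompt change from shipping against a dataset that never exercised the risky inputs.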

Multi-Environment Topology

Microsoft recommends a three-environment topology with separate Foundry projects per environment.

Option	Structure	Best for	Trade-off
A (Recommended)	Dev Project → Test Project → Prod Project (separate Foundry projects)	Enterprise workloads	Full RBAC isolation, distinct connection strings, separate evaluation signals, easier governance audit
B (Lightweight)	Single Foundry project with agent version tags (dev/test/prod)	Small teams, prototyping, internal demos	Simpler initial setup, weaker environment separation, RBAC has to be enforced at agent-version granularity

Option A is the right call for any team running agents in production. The cost is operational: three Azure subscriptions or three resource groups (the boundary depends on your landing zone), three sets of connection strings, three sets of policy assignments. Teams without a clean dev/test/prod Azure landing zone will spend the first month of adoption fixing landing-zone problems before the CI/CD pipeline can be set up cleanly. Plan for this.

OIDC Workload Identity Federation between the pipeline and Azure is the right authentication pattern. It avoids storing long-lived Azure credentials in pipeline secrets, issues short-lived tokens for each run so there is no secret to rotate, and produces audit logs for every deployment action. Adopt OIDC from day one even on the Dev environment; the cost of retrofitting it later is high.

Where the Reference Architecture Understates the Work

The architecture itself is correct. The framework is sound. There are six places where the document understates the engineering work the patterns demand.

Multi-model evaluation is unaddressed. Many real enterprise agents route across Foundry, Claude, and OpenAI by workload type. Eval suites that work for one provider do not automatically work for another (prompt formatting differs, tool-call semantics differ, refusal patterns differ). The reference architecture assumes a single-model deployment per agent. Multi-model deployments need per-model eval suites and a routing-layer eval test that the architecture does not name. The companion piece on LLM-agnostic AI agents covers the routing-layer side of this.
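A minimal sketch of the two checks this implies: every model the router can reach has its own eval suite, and the router sends each workload where the team expects. The routing table shape, the placeholder model names, and both function names are assumptions for illustration.

```python
def unevaluated_models(
    routing_table: dict[str, str], default_model: str, suites: dict[str, list]
) -> set[str]:
    """Models reachable through the router that have no eval suite of their own."""
    reachable = set(routing_table.values()) | {default_model}
    return reachable - set(suites)


def routing_mismatches(
    expectations: dict[str, str], routing_table: dict[str, str], default_model: str
) -> list[str]:
    """Routing-layer test: workloads that resolve to a model other than the expected one."""
    problems = []
    for workload, expected in sorted(expectations.items()):
        actual = routing_table.get(workload, default_model)
        if actual != expected:
            problems.append(f"{workload}: expected {expected}, routed to {actual}")
    return problems
```

Both results can feed the same blocking gate as the per-model eval scores: a model without a suite, or a workload routed to the wrong model, is a release problem even if every individual suite passes.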

Model deprecation breaks the reproducible triple. The (code, prompt, model) triple is auditable today. When Microsoft retires gpt-4o or upgrades a model behind the same name, the old triple is no longer reproducible. The article does not address provider-side model lifecycle. A real AgentOps practice needs: explicit model-version pinning, a deprecation-monitoring process, a re-evaluation campaign on every forced model upgrade, and a rollback plan that anticipates the model itself being unavailable.
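The deprecation-monitoring piece can start as a scheduled check. In this sketch the retirement dates are supplied by the caller (for example, maintained by hand from provider announcements); the function name, the 90-day warning window, and the data shapes are assumptions.

```python
from datetime import date, timedelta


def models_needing_reeval(
    pinned: dict[str, str],
    retirements: dict[str, date],
    today: date,
    warn_days: int = 90,
) -> list[tuple[str, str, int]]:
    """Flag agents whose pinned model version retires within warn_days.

    pinned maps agent name -> pinned model version.
    retirements maps model version -> announced retirement date.
    Returns (agent, model, days_remaining) tuples, sorted by agent name.
    """
    horizon = today + timedelta(days=warn_days)
    flagged = []
    for agent, model in sorted(pinned.items()):
        retire_on = retirements.get(model)
        if retire_on is not None and retire_on <= horizon:
            flagged.append((agent, model, (retire_on - today).days))
    return flagged
```

Each flagged agent is a candidate for a re-evaluation campaign against the replacement model while the old one is still available for side-by-side comparison.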

The HITL approver authority gap. Required human approval before Production deploy is the right gate. The assumption is that the approver has both the technical literacy to read eval results and the product authority to defund a stalled promotion. Most enterprises have one or the other, rarely both. The product manager who has authority does not always have the technical literacy; the engineer who has the literacy does not always have the authority. This is the same political-maturity gap that the Agentic Patterns Playbook decode names in its critique of the CoE framework.

Continuous evaluation on production traffic raises PII and audit concerns. The article mentions sampled production traffic but does not address sampling-rate decisions, PII redaction before eval, audit logging for sampled traffic, or how to handle eval failures that would themselves expose PII in incident reports. In regulated industries (finance, healthcare, public sector) these are gating decisions, not implementation details.
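To show the shape of the sampling and redaction steps, here is a deliberately simple sketch. The regexes cover only obvious email and phone patterns and are an assumption for illustration; regulated workloads would need a proper PII detection service, and the function names are hypothetical.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision,
    so an auditor can reproduce exactly which traffic was evaluated."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


def redact(text: str) -> str:
    """Replace obvious emails and phone numbers before text reaches the evaluator."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

For example, `redact("Contact jane.doe@example.com or +1 (555) 010-2030 about claim 42.")` returns `"Contact [EMAIL] or [PHONE] about claim 42."`. Redacting before evaluation, rather than after, also keeps raw PII out of eval logs and any incident report built from them.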

Pipeline parity is operational debt. Microsoft publishes both a GitHub Actions workflow and an Azure DevOps YAML pipeline as parallel reference implementations. Both are well-designed. In practice teams pick one and stick with it. Maintaining both is operational debt the article does not name. The decision criteria are real: the tooling your shop already runs, identity-model fit, RBAC and approval-gate ergonomics, and how evaluation results integrate with your existing developer surfaces.

Cost regression alerting needs an owner and authority. “Alert on > 20% token regression” is a correct control. Who receives the alert and what authority they have to act on it (roll back, defund, downgrade model selection, require redesign) is not specified. Without a named owner and a defined escalation path, the alert becomes a Slack notification that no one acts on.
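One way to force the ownership question is to make the alert itself refuse to fire without an owner. A minimal sketch, assuming a per-agent baseline of tokens per query and an owner map maintained by the platform team; the function name and data shapes are hypothetical.

```python
from typing import Optional


def cost_alert(
    agent: str,
    baseline_tokens: float,
    current_tokens: float,
    owners: dict[str, str],
    threshold: float = 0.20,
) -> Optional[dict]:
    """Return an alert routed to a named owner when token usage per query
    regresses by more than threshold, otherwise None."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    change = (current_tokens - baseline_tokens) / baseline_tokens
    if change <= threshold:
        return None
    owner = owners.get(agent)
    if owner is None:
        # An alert nobody owns is a pipeline defect, so fail loudly.
        raise LookupError(f"no cost owner registered for agent '{agent}'")
    return {"agent": agent, "owner": owner, "regression_pct": round(change * 100, 1)}
```

The code cannot decide what the owner is allowed to do (roll back, downgrade the model, require redesign); that still needs a written escalation path. It only guarantees the alert lands with a named person.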

The Cross-Stack Test: AgentOps Beyond Microsoft

The same vendor-portability question we asked of the Patterns Playbook applies here. Are these patterns Microsoft-stack-implicit, or do they transfer cleanly to non-Microsoft operating models?

The CI/CD shape transfers cleanly. The eval-gate-as-release-gate principle is universal. The (code, prompt, model) version-pinning principle is universal. The Dev → Test (with HITL) → Prod promotion sequence is universal. The OIDC + workload identity pattern is universal across major identity providers.

The implementation surfaces differ. A team running Pega-based agent orchestration over Pega case management can adopt the same release-gate discipline with Pega’s own deployment tooling and a custom evaluation script. A team running Salesforce Agentforce can wire eval gates into Salesforce DX. A team running n8n or LangChain or LangGraph outside Microsoft can adopt the same CI sequence (lint, test, eval, build) using GitHub Actions or GitLab CI without touching Foundry.

What does not transfer cleanly is the agent registry. Foundry’s built-in agent registry, version history, and Entra Agent Identity per deployed version are platform features that have no direct cross-stack equivalent today. Teams running agents outside Foundry will need to build or buy the registry layer. This is one of the strongest practical reasons to standardise on Foundry for new enterprise agent workloads if the rest of your stack is Microsoft-aligned.

The Monday Move for Different Team Types

Adopt the reference architecture selectively. Your team’s current posture determines the right starting move.

Team posture	Where to start	Where to skip-read
Already running Power Platform ALM with release gates	This maps cleanly onto your existing discipline. Adopt the eval-gate Python script as a new gate type in your existing pipeline. The CI structure plugs into Azure DevOps cleanly if you are already on ADO.	Skip nothing; this is additive.
One Foundry agent in Dev, no pipeline yet	Stand up the Dev project plus the eval-gate Python script first. Run it on every PR. Do not try to set up Test or Prod environments until the eval-gate is producing trustworthy signal for at least one full release cycle.	Skip the multi-environment topology decision until you have eval-gate confidence.
Running 10+ agents across Foundry plus Claude plus OpenAI	Build the per-model evaluation suite first. The routing layer is a separate concern; treat the AgentOps pipeline as per-model and converge at the routing layer. The LLM-agnostic agents companion (linked above) covers the routing side.	Skip the single-pipeline assumption; you need one CI pipeline per backing model.
No release discipline today	Start with eval-gate-on-PR for one agent. Add the Test environment after one quarter of running the eval-gate cleanly. Add Prod after another quarter. The temptation to set up all three environments on day one is the most common failure mode.	Skip the full reference architecture; treat it as the 12-month target, not the Monday move.

Budget anchor. The reference architecture is silent on the engineering cost of standing up the pipeline. Order-of-magnitude estimate from our experience: a team adopting AgentOps for the first time typically spends 4-8 engineer-weeks on the initial pipeline (developer environment standardisation, CI eval-gate Python script, single-environment CD, OIDC trust setup). A team running 10+ agents at scale typically allocates 1-2 dedicated FTEs to AgentOps platform engineering plus shared eval-dataset curation across product teams. Treat these as illustrative ranges; your specific use cases, your team’s prior CI/CD maturity, and your regulated-data posture move them.

The Real Deployment Artefact Is the Agent Version

Microsoft’s whole reframing in one line:

“Treat the agent version as your deployment artefact, and evaluation outcomes as your release gate.”

Internalise that and the rest of the architecture organises itself. The container image is supporting infrastructure. The pipeline is the production line. The agent registry is the audit surface. The runtime is the execution surface. The thing that ships, the thing that promotes, the thing that rolls back, is the agent version.

The strategic implication for hiring, funding, and platform-engineering investment over the next 12 months is direct. Hiring: the Agent Product Owner role the Patterns Playbook formalises becomes the role that owns the agent version across its lifecycle. Funding: the eval-dataset curation that the reference architecture treats as a sidebar is in fact a budget line and a named engineering responsibility. Platform engineering: the AgentOps pipeline becomes a platform-team deliverable, not a per-product-team improvisation.

Microsoft has done the harder half of the work by publishing the reference. The remaining half is operational discipline: eval-dataset ownership, multi-model evaluation, model-deprecation rollback plans, HITL approver authority, continuous-evaluation PII handling, and pipeline-choice consolidation. Those are the practitioner gaps, and they are where the next 12 months of enterprise AgentOps work actually happens.
