
Six Rules for LLM-Agnostic AI Agents on Microsoft Foundry

Foundry Model Router routes 18 models across providers. Six rules for AI agents that survive the next vendor rotation, with a proposal-writing example.

Alex Pechenizkiy · 11 min read

The April 2026 calendar pretty much settled the multi-model debate.

On April 23, GPT-5.5 went generally available in Microsoft Foundry. Four days later, Microsoft and OpenAI revised the exclusivity terms that had defined their partnership since 2019. The week before, Foundry’s Model Router shipped an updated version that routes across 18 models, including Claude, DeepSeek, Llama, and Grok. Satya Nadella’s earnings note put a number on it: 10,000 Foundry customers using more than one model, 5,000 leveraging open weights.

If your enterprise AI agent is still hardcoded to a single provider’s SDK, you are now operating against the platform’s grain. The rest of this article is what to do instead: six rules for agents that route, tier, swap, and survive the next provider rotation. There is a worked proposal-writing example with four different models at the bottom.

Multi-model agent architecture: a single gateway routes prompts across frontier, balanced, and fast tiers based on task complexity

What Is Foundry Model Router?

Foundry Model Router is a deployable Azure model that picks the best underlying LLM for each prompt at request time. One deployment, three routing modes (Balanced, Quality, Cost), and 18 models in the routing pool, including GPT, Claude, DeepSeek, Llama, and Grok. Calling code uses the standard Chat Completions API. The router decides which model handles each request.

Why This Matters Now

Four dated events, one from last November and three from the past week, change the architecture math:

| Date | Event | Architectural implication |
| --- | --- | --- |
| Nov 18, 2025 | Foundry Model Router went GA with 18-model support, including Claude, DeepSeek, Llama, and Grok. | One Azure deployment can route across providers without code changes. |
| Apr 23, 2026 | GPT-5.5 launched in Microsoft Foundry on day one. | Frontier OpenAI models now land in Foundry first; Foundry is the platform layer. |
| Apr 27, 2026 | OpenAI and Microsoft revised partnership exclusivity terms. | OpenAI is one provider on Azure, not THE provider. |
| Apr 30, 2026 | Nadella reported 10,000 Foundry customers using more than one model. | Multi-model is the default deployment shape, not a niche pattern. |

The combined message from Microsoft is unmistakable. The Azure AI roadmap is a model marketplace with routing intelligence on top, not a single-vendor stack with bolt-ons. Anyone who recently watched a startup get a $1,600 Claude bill on Azure Marketplace already understands that “vendor on Azure” is a different relationship than “vendor’s official platform”. The lesson generalizes: design for swappable providers, not for the one that signed the headline deal.

The Six Rules

Rule 1: Abstract Behind a Gateway

Every model call in production goes through a gateway, not directly to a vendor SDK.

On Azure, the Foundry Model Router is the obvious choice. It is one deployment that exposes the standard Chat Completions API and routes requests to whichever underlying model best fits each prompt, with three modes (Balanced, Quality, Cost) and custom-subset selection so you control which models are eligible.

If you need cross-cloud, LiteLLM is the open-source equivalent. It speaks the OpenAI format and proxies to over 100 providers (Bedrock, Vertex AI, Azure, Anthropic, etc.) with a unified retries-and-fallback layer.

Either way, the code change is the same: agent code stops importing anthropic or openai and starts pointing at one HTTP endpoint with one auth pattern. When a provider deprecates, you change a deployment config, not your codebase. When a new model lands, you add it to the routing subset, not your repository.
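
In practice the gateway call is a plain HTTP POST. A minimal sketch, assuming a Chat Completions-compatible router endpoint; the URL shape and environment-variable names are placeholders for whatever your deployment exposes:

# One HTTP endpoint, one auth pattern, no vendor SDK imports.
# Endpoint URL and env-var names are illustrative; substitute your own deployment values.
import os
import requests

GATEWAY_URL = os.environ["GATEWAY_CHAT_URL"]   # the router deployment's chat/completions URL
GATEWAY_KEY = os.environ["GATEWAY_API_KEY"]

def chat(messages: list[dict], **params) -> dict:
    """Send a Chat Completions request through the gateway, whatever model sits behind it."""
    resp = requests.post(
        GATEWAY_URL,
        headers={"api-key": GATEWAY_KEY, "Content-Type": "application/json"},
        json={"messages": messages, **params},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

answer = chat([{"role": "user", "content": "Is this a billing question? Answer yes or no."}])
print(answer["choices"][0]["message"]["content"])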

Rule 2: Route Per Task Complexity, Not Per Agent

The first instinct after reading Rule 1 is to assign each agent a model. Don’t do that. Route the prompts inside each agent.

A single agent in production typically handles a mix of work: short classification prompts (“is this a billing question?”), summarization, structured extraction, multi-step reasoning, format checks. The token economics across those tasks span an order of magnitude. Sending all of it to one frontier model is the most common reason agentic-AI bills come back at 4-8 times what the architect estimated.

Foundry’s Model Router classifies the prompt at request time and picks. In Balanced mode it looks for the cheapest model within roughly 1-2% quality of the best candidate; in Cost mode it widens that quality band to 5-6%; in Quality mode it ignores cost. The routing logic page has the details, but the practical point is that a single agent can run hundreds of trivial prompts on a small model and reserve the frontier model for the requests that actually need it, all without you writing any routing code.

Rule 3: Keep Prompts Portable

Vendor-specific affordances are tempting and dangerous. OpenAI’s structured outputs schema, Anthropic’s tool-use blocks, Gemini’s function-calling format. Each one looks like a small ergonomic win. Each one welds your prompt to the vendor.

The discipline: write prompts that work on any of the three. That means JSON-schema-prompted output (every modern model can produce a JSON object that matches a schema you describe in plain text) instead of response_format on OpenAI. Tool-call descriptions written into the system prompt instead of relying on the vendor’s tool-call envelope. Few-shot examples in the prompt body instead of metadata fields.

You give up some structured-output guarantee. You buy the ability to swap models in five minutes instead of five days. For an agent under active production load, the second matters more.
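
Here is what the portable version looks like, sketched against the gateway helper from Rule 1. The schema lives in the prompt text and the validation lives in your code, so nothing depends on any one vendor's structured-output feature; the field names are made up for illustration:

# Portable structured output: the schema is described in plain text in the prompt,
# not in a vendor-specific response_format parameter. Field names are illustrative.
import json

SYSTEM = (
    "You are an RFP requirement extractor. "
    "Reply with a single JSON object and nothing else, matching this shape: "
    '{"requirement_id": string, "text": string, "page": integer, "is_mandatory": boolean}'
)

def extract_requirement(passage: str) -> dict:
    raw = chat(
        [{"role": "system", "content": SYSTEM},
         {"role": "user", "content": passage}],
        temperature=0,
    )["choices"][0]["message"]["content"]
    parsed = json.loads(raw)   # may raise on malformed output; retry or fall back in real code
    missing = {"requirement_id", "text", "page", "is_mandatory"} - parsed.keys()
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return parsed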

Rule 4: Tier Your Traffic

The healthy production shape is roughly 60/30/10:

| Tier | Share of traffic | Models | Use cases |
| --- | --- | --- | --- |
| Tier 1: fast and cheap | 60-70% | DeepSeek V3, Llama 3.3 70B, Mistral Small, GPT-5-mini | Classification, intent detection, format checks, simple summarization, routing decisions |
| Tier 2: balanced | 20-30% | GPT-4.1, Mistral Large 123B, Claude Sonnet 4.6 | Section drafting, structured extraction, code generation |
| Tier 3: frontier | 10-15% | Claude Opus 4.7, GPT-5.5, Gemini 2.5 Pro | Multi-step reasoning, ambiguous-input judgment, customer-facing complex outputs |

This is the shape that the AIThority analysis of multi-model routing in 2026 reports as cost-effective, and it lines up with what I see when I model token economics for proposal-writing pipelines. The blended cost-per-task drop versus running everything through Tier 3 is significant; the worked example below shows the math.

The point of this rule is not the specific 60/30/10 split. The point is that your traffic should be tiered. If 100% of your requests go to one frontier model, you are paying frontier prices for classification work that a 7B-parameter model would handle.
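
If you route through Foundry's Model Router, the tiering largely happens for you. If you instead run separate deployments per tier, even a hard-coded tier map captures most of the benefit. A sketch with hypothetical deployment names and the 60/30/10 targets:

# Hypothetical tier map: task labels -> the deployment that serves that tier.
# Deployment names and share targets are illustrative, not real Foundry resources.
TIERS = {
    "fast":     {"deployment": "router-tier1", "share_target": 0.65},
    "balanced": {"deployment": "router-tier2", "share_target": 0.25},
    "frontier": {"deployment": "router-tier3", "share_target": 0.10},
}

TASK_TIER = {
    "classification": "fast",
    "format_check":   "fast",
    "summarization":  "fast",
    "extraction":     "balanced",
    "drafting":       "balanced",
    "reasoning":      "frontier",
}

def deployment_for(task_type: str) -> str:
    """Unknown task types default to the frontier tier rather than silently degrading."""
    return TIERS[TASK_TIER.get(task_type, "frontier")]["deployment"]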

Rule 5: Pin Model Versions in Production

Both OpenAI and Anthropic deprecate models. Microsoft’s policy is explicit: GA models in Foundry get a retirement date set programmatically to 18 months from launch (12 months for some partner-catalog models), per the Foundry Models lifecycle policy. Each Model Router version is associated with a specific set of underlying models, and only newer router versions expose new ones. You opt into auto-update or you stay pinned.

Production agents should pin. The reason is not nostalgia for old models; it is that prompt regression is real. A prompt tuned against gpt-4o-2024-08-06 does not always produce equivalent output on gpt-4o-2024-11-20, even when the model card describes the changes as minor. If your eval suite catches the regression, great; if not, you silently ship degraded responses to production.

The discipline: production deployments name explicit model versions in their config; staging tracks latest; CI runs an eval-suite diff between the two on every change. The cost is a tiny amount of operational complexity. The benefit is that you decide when to migrate, not the vendor’s deprecation calendar.
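
Sketched as config plus a CI gate; the model names, versions, and the run_eval harness are stand-ins for whatever you already have:

# Illustrative pinning discipline: prod names exact versions, staging tracks latest,
# CI fails when staging's eval score drops too far below prod's.
PROD_MODELS = {
    "extraction": {"name": "gpt-4.1",       "version": "2025-04-14"},   # pinned
    "drafting":   {"name": "mistral-large", "version": "2024-11-18"},   # pinned
}
STAGING_MODELS = {
    "extraction": {"name": "gpt-4.1",       "version": "latest"},       # tracks latest
    "drafting":   {"name": "mistral-large", "version": "latest"},
}

def eval_gate(run_eval, threshold: float = 0.02) -> None:
    """run_eval is your existing eval harness; it returns one aggregate score per config."""
    prod_score = run_eval(PROD_MODELS)
    staging_score = run_eval(STAGING_MODELS)
    if prod_score - staging_score > threshold:
        raise SystemExit(f"Eval regression: prod={prod_score:.3f} staging={staging_score:.3f}")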

Rule 6: Run a Quarterly Model-Swap Drill

This is the rule most teams skip and the one that actually proves the architecture works.

Once a quarter, in staging, replace your default model with the next-best fallback. Run your full eval suite. Document the deltas: tasks that improved, tasks that regressed, latency change, cost change, format-stability change.

Two outcomes, both useful:

  1. The swap works smoothly, deltas are small, you have a migration plan ready for the day the vendor announces deprecation.
  2. The swap exposes vendor-specific dependencies you forgot about (a JSON schema that only OpenAI honors strictly; a tool-call format that only Claude handles well; a system-prompt convention that lifts Mistral’s quality). You fix them now, when there is no deadline pressure, instead of during a deprecation scramble.

I argue this drill is the single highest-ROI exercise in the multi-model playbook. It costs maybe a day of engineering time per quarter. It catches the problems that turn into incidents.
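
The drill itself is small enough to sketch in one function. Here run_eval_suite is assumed to exist and return per-task scores plus latency and cost totals; the report format is just a suggestion:

# Quarterly swap drill, sketched: run the same eval suite against the default model and
# the designated fallback, then record the deltas for the migration file.
import datetime
import json

def swap_drill(run_eval_suite, default_model: str, fallback_model: str) -> dict:
    baseline = run_eval_suite(model=default_model)
    candidate = run_eval_suite(model=fallback_model)
    report = {
        "date": datetime.date.today().isoformat(),
        "default": default_model,
        "fallback": fallback_model,
        "task_deltas": {
            task: candidate["scores"][task] - baseline["scores"][task]
            for task in baseline["scores"]
        },
        "latency_delta_s": candidate["latency_s"] - baseline["latency_s"],
        "cost_delta_usd": candidate["cost_usd"] - baseline["cost_usd"],
    }
    with open(f"swap-drill-{report['date']}.json", "w") as f:
        json.dump(report, f, indent=2)
    return report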

Worked Example: Proposal Writing With Four Models

A request that prompted this article: how would you architect an AI proposal-writing system across multiple models? Here is the shape I would build.

Proposal writing has four distinct phases, each with different cognitive demands and different cost profiles per request. Sending all four to the same frontier model is the canonical example of paying too much for too little.

| Phase | What the model does | Best-fit model class | Why |
| --- | --- | --- | --- |
| RFP requirement extraction | Read 100-400 page RFP. Output structured JSON of every requirement, evaluation criterion, page reference. | Frontier reasoning + long context (Claude Opus 4.7) | Long-context handling and structured JSON output across hundreds of pages is where reasoning models earn their keep. Errors here cascade through the entire proposal. |
| Win-themes brainstorm | Given the requirement set + company context, generate 8-12 candidate win themes with rationale. | Frontier creative (GPT-5.5) | Divergent generation. Quality of ideation correlates with model size for this task type. |
| Section drafting | Draft each proposal section against assigned win themes + requirements. High volume: 30-60 sections per RFP. | Balanced (GPT-4.1 or Mistral Large 123B) | High request volume. Frontier model adds marginal quality at multiples of the cost; the balanced tier is sufficient for first drafts that humans will edit anyway. |
| Compliance + format check | Verify every required element appears, page limits are honored, font and margin rules are followed, all references resolve. | Fast classifier (DeepSeek V3 or Llama 3.3 70B) | Pure pattern-matching. Each section gets checked, sometimes multiple times. A frontier model here is wasted spend. |

Illustrative cost math

The numbers below are illustrative arithmetic from public per-token pricing as of 2026-04-28. Substitute your own volumes. Treat these as a calculator, not as a measured production result.

Assume 100 RFPs per month, average 250 pages per RFP, 40 sections per response, 3 compliance-check passes per section.

  • All-frontier baseline: every phase on Claude Opus 4.7 or GPT-5.5. Token volume is dominated by section drafts (40 sections x 100 RFPs x ~3K output tokens each = 12M output tokens). At frontier output pricing, this is the single largest line item on the bill.
  • Tiered architecture: extraction stays on frontier (necessary for accuracy across long documents). Win themes stay on frontier (low volume, high impact). Section drafting moves to balanced tier (cuts the drafting line item by roughly 70-80% per token at comparable subjective quality on first drafts). Compliance check moves to fast tier (cuts that line item by 90%+).

The blended cost reduction lands in the 4-8x range against an all-frontier baseline depending on the exact volume mix and the price points on the day you do the math. Your number will differ. The architectural point is that the tiering does most of the work; the routing gateway just makes it operationally clean.
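
The same arithmetic as a calculator you can rerun with your own volumes and the prices on the day you read this. It covers only the two line items that change tiers (drafting and compliance checks); the per-million-token prices and the compliance-check token count are placeholder assumptions, not quotes:

# Illustrative cost comparison for the two line items that move tiers.
# All prices and the check-output token count are placeholder assumptions.
RFPS_PER_MONTH = 100
SECTIONS_PER_RFP = 40
CHECK_PASSES_PER_SECTION = 3
DRAFT_OUTPUT_TOKENS = 3_000        # per section, as in the example above
CHECK_OUTPUT_TOKENS = 300          # assumption: compliance-check outputs are short

PRICE_PER_M_OUTPUT = {"frontier": 60.0, "balanced": 10.0, "fast": 1.0}   # USD, placeholders

def monthly_output_cost(drafting_tier: str, checking_tier: str) -> float:
    draft_tokens = RFPS_PER_MONTH * SECTIONS_PER_RFP * DRAFT_OUTPUT_TOKENS
    check_tokens = (RFPS_PER_MONTH * SECTIONS_PER_RFP *
                    CHECK_PASSES_PER_SECTION * CHECK_OUTPUT_TOKENS)
    return (draft_tokens / 1e6 * PRICE_PER_M_OUTPUT[drafting_tier] +
            check_tokens / 1e6 * PRICE_PER_M_OUTPUT[checking_tier])

all_frontier = monthly_output_cost("frontier", "frontier")
tiered = monthly_output_cost("balanced", "fast")
print(f"all-frontier ${all_frontier:,.0f}/mo vs tiered ${tiered:,.0f}/mo "
      f"({all_frontier / tiered:.1f}x)")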

I will write the prompt-level mechanics for this proposal pipeline in a follow-up article. The architecture rule is general: tier the traffic, route at request time, pin the models, and treat the prompt-portability discipline (Rule 3) as non-negotiable.

What This Looks Like in Code

A Foundry Model Router deployment with custom subset selection. The example below uses the Azure Management REST API path and deploys a router that mixes one OpenAI model, one Anthropic model, and one open-weight model.

# Deploy a Model Router with three eligible models, Balanced mode.
# Anthropic Claude must be deployed separately first to the same Foundry resource.

URI="https://management.azure.com/subscriptions/$SUB_ID"
URI="$URI/resourceGroups/$RG/providers/Microsoft.CognitiveServices"
URI="$URI/accounts/$FOUNDRY_ACCOUNT/deployments/router-prod"
URI="$URI?api-version=2024-10-01"

az rest --method PUT --uri "$URI" --body '{
  "sku": { "name": "Standard", "capacity": 100 },
  "properties": {
    "model": {
      "format": "OpenAI",
      "name": "model-router",
      "version": "2025-11-18"
    },
    "routing": {
      "mode": "Balanced",
      "models": [
        { "format": "OpenAI",    "name": "gpt-4.1",         "version": "2025-04-14" },
        { "format": "Anthropic", "name": "claude-opus-4-7", "version": "2025-10-01" },
        { "format": "Mistral",   "name": "mistral-large",   "version": "2024-11-18" }
      ]
    }
  }
}'

Calling the router from application code is unchanged from a single-model deployment. Same Chat Completions API, same SDKs, one endpoint. The "model" field in the response payload reveals which underlying model handled each request, useful for logging and for cost attribution.
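
A sketch of the logging side, reusing the chat helper from Rule 1; the per-model prices are placeholders, not published rates:

# Log which underlying model the router picked for each request, for cost attribution.
import logging

PRICE_PER_M_TOKENS = {                             # USD per million tokens, illustrative only
    "gpt-4.1":         {"in": 2.0,  "out": 8.0},
    "claude-opus-4-7": {"in": 15.0, "out": 75.0},
    "mistral-large":   {"in": 2.0,  "out": 6.0},
}

def attribute_cost(response: dict, task_label: str) -> float:
    model = response["model"]                      # which model the router actually used
    usage = response.get("usage", {})
    price = PRICE_PER_M_TOKENS.get(model, {"in": 0.0, "out": 0.0})
    cost = (usage.get("prompt_tokens", 0) / 1e6 * price["in"] +
            usage.get("completion_tokens", 0) / 1e6 * price["out"])
    logging.info("task=%s model=%s cost_usd=%.5f", task_label, model, cost)
    return cost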

For cross-cloud or on-prem hybrid deployments where Foundry is not the substrate, the LiteLLM proxy gives you the same shape with a different operational profile (you run the proxy yourself, you own the routing logic explicitly). For Azure-native deployments where the workload lives in Azure already, Foundry Model Router is significantly less work.
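
If you do go the LiteLLM route, the call shape is the same OpenAI format. A sketch using LiteLLM's Python completion() call and its provider-prefix model naming; credentials come from each provider's usual environment variables, and the model strings here are examples, not recommendations:

# Cross-cloud gateway call via LiteLLM: OpenAI-format request, provider chosen by prefix.
from litellm import completion

resp = completion(
    model="azure/gpt-4.1",        # or e.g. a "bedrock/..." or "vertex_ai/..." model string
    messages=[{"role": "user", "content": "Classify this ticket: billing or technical?"}],
)
print(resp.choices[0].message.content)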

When NOT To Do This

Multi-model architecture has a fixed cost: routing logic, eval-suite expansion, prompt-portability discipline, the quarterly drill. It is overhead that pays back at scale and is just overhead at small scale.

Skip the multi-model architecture when:

  • The agent serves fewer than ~10,000 requests per month and the bill is already small. The savings are pennies; the operational complexity is not.
  • The agent does one kind of task and that task fits one model class well. A simple FAQ chatbot does not need a router.
  • You have one frontier model that is materially better than alternatives at your specific task and you have evaluated this rigorously. The discipline is to confirm with an eval suite, not to assume.
  • Compliance or data-residency rules pin you to one provider. Tier 3 might still work within that single provider; tiers 1 and 2 might be impossible to source cross-cloud. Limiting, but real.

The rule of thumb: if your monthly model bill is over $5,000 and growing, the architecture investment will pay back within a quarter. If it is under $1,000 and stable, you are over-engineering.

What I Would Actually Do

Pick one production agent in your portfolio. The one with the highest token volume, ideally one that mixes prompt complexity (some classification, some drafting, some reasoning).

Deploy a Foundry Model Router in front of it with three models in the routing subset: your current default, one tier-1 model, one tier-3 model. Run for a month. Measure: cost change, latency change, eval-suite delta, format-stability incidents.

If the result is good, generalize the pattern across the rest of your agent fleet. Use the proposal-writing pipeline above as a reference architecture for cases that need explicit phase-based routing.

If the result is mixed, you have learned something concrete about which of your prompts depend on which vendor’s specific behavior. Fix those next. The discipline survives whatever comes after GPT-5.5 and Claude Opus 4.7.

Microsoft just told you, with five separate announcements in three weeks, that the platform is going multi-model whether your application is ready or not. The rules above are the cheapest way to be ready before the next rotation lands.


Cluster: AI Architecture. Related reading: Azure AI Foundry vs Azure OpenAI for the platform-layer decision; Claude on Azure Marketplace billing trap for vendor lock-in mechanics in the wild; Azure OpenAI PTU vs PAYG break-even for the underlying cost math; AI engineering productivity ROI framework for an end-to-end ROI calculator. The vendor-neutral peer reading: Kai Waehner on the Enterprise Agentic AI Landscape 2026.
