
AI Proposal Writing on Foundry: Multi-Model Patterns That Ship

A four-phase proposal pipeline running on four different LLMs. Prompts, evaluator metrics, and the eval-suite that lets you swap models cleanly.

Alex Pechenizkiy 12 min read

Proposal writing is the canonical multi-model use case. Not because it is glamorous, but because the work decomposes into four phases that each demand different cognitive work and each have different cost profiles. Sending all four phases to one frontier model is the most common reason an AI-assisted proposal pipeline costs four to eight times what the architect estimated.

The companion article to this one, Six Rules for LLM-Agnostic AI Agents on Microsoft Foundry, covers the architecture rules at the framework level. This piece goes deeper on the proposal use case specifically: the prompt pattern per phase, the evaluator metric that tells you whether the phase is working, and the eval-suite shape that lets you swap models in production without breaking outputs.

Four-phase proposal pipeline with model class per phase

Why Does Proposal Writing Decompose Cleanly Across Models?

A proposal response is not one task. It is at least four tasks chained together with different success criteria. Each task demands a different cognitive workload (long-context comprehension, divergent generation, constrained drafting, pattern matching) and each has a different cost profile per request. Routing each task to the model class that fits its profile cuts the bill significantly while keeping output quality flat or better.

| Phase | Cognitive demand | Volume profile | Where it fails |
| --- | --- | --- | --- |
| RFP requirement extraction | Long-context comprehension, structured output | Low (1 pass per RFP) | Missed requirements compound through every later phase |
| Win-themes brainstorm | Divergent generation, creative reasoning | Low (1 pass per RFP) | Generic themes that match any vendor |
| Section drafting | Constrained generation against requirements | High (30-60 sections per RFP) | Tone drift across sections, format instability |
| Compliance check | Pattern matching, completeness scoring | Very high (every section, multiple passes) | False negatives on hidden requirements |

The phases also have different latency tolerances. Extraction can take minutes; the user is not waiting on it. Compliance checks run in CI-style pipelines and can be parallelized. Section drafting is where users notice latency, and it is also the phase where token volume dominates the bill.

The architectural opportunity: route each phase to the model class that fits best, then wire the phases together with an orchestrator that owns the contract between them.
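As a concrete shape, the routing is often nothing more than a config table the orchestrator reads. A minimal sketch, assuming hypothetical deployment names (substitute the deployments in your own Foundry project):

```python
# Hypothetical routing table: phase name -> model class and deployment.
# Deployment names are placeholders, not real Foundry deployments.
PHASE_ROUTING = {
    "extraction": {"model_class": "frontier-reasoning", "deployment": "claude-opus-extraction"},
    "win_themes": {"model_class": "frontier-creative",  "deployment": "gpt-brainstorm"},
    "drafting":   {"model_class": "balanced",           "deployment": "gpt-4.1-drafting"},
    "compliance": {"model_class": "fast-classifier",    "deployment": "llama-70b-compliance"},
}

def deployment_for(phase: str) -> str:
    """Resolve the deployment for a phase; the orchestrator is the only caller."""
    return PHASE_ROUTING[phase]["deployment"]
```

Keeping this table in one place is what makes a later model swap a config change rather than a code change.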

Phase 1: RFP Requirement Extraction

This is the highest-leverage phase. Errors here propagate through every later phase. A missed requirement becomes a missed section, becomes a noncompliance flag, becomes a lost proposal.

Best-fit model class: frontier reasoning with long context. Claude Opus 4.7 is the current favorite for this phase; GPT-5.5 is competitive on shorter RFPs but Claude’s context window and structured-output stability make it the safer default for documents over 200 pages.

The prompt pattern

SYSTEM:
You are a proposal-response analyst extracting requirements from RFP documents.
Output strict JSON matching the schema below.

USER:
RFP_DOCUMENT:
<full RFP text, up to ~200K tokens>

EVALUATION_CRITERIA_SECTION_REFERENCES:
<page numbers of evaluation sections>

INSTRUCTIONS:
1. Extract every numbered requirement.
2. For each requirement: assign category (technical, management, past performance, price),
   compliance type (mandatory, optional, recommended), and source page reference.
3. Capture every evaluation factor with its weight if stated.
4. Capture every page-limit, font, margin, and submission-format rule.
5. Output JSON matching this schema: {requirements: [...], evaluation_factors: [...],
   format_rules: {...}, deadlines: [...]}

Do not summarize. Extract literally. If a requirement is ambiguous, capture it as-is and
flag with "ambiguous": true. If you cannot find a section, do not invent it.

Why this prompt is portable

The schema is described in the user message body, not as a response_format parameter. Every modern frontier model can produce JSON that conforms to a schema described in plain text. This means the same prompt runs on Claude, GPT-5.5, or Gemini 2.5 Pro without modification.

If you used Anthropic’s tool-use blocks or OpenAI’s response_format, you would have to maintain two prompt versions. That is the vendor lock-in the multi-model architecture is designed to avoid.
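A minimal sketch of what that portability looks like in code, assuming an OpenAI-compatible gateway endpoint in front of your Foundry deployments; the environment variables and deployment names are placeholders, and the prompt bodies are the Phase 1 prompts above.

```python
import os
from openai import OpenAI

# Assumes an OpenAI-compatible gateway in front of your deployments;
# GATEWAY_URL, GATEWAY_KEY, and the deployment names are placeholders.
client = OpenAI(base_url=os.environ["GATEWAY_URL"], api_key=os.environ["GATEWAY_KEY"])

SYSTEM_PROMPT = ("You are a proposal-response analyst extracting requirements from RFP documents. "
                 "Output strict JSON matching the schema below.")

def extract_requirements(rfp_text: str, eval_refs: str, deployment: str) -> str:
    """Run the Phase 1 prompt against any deployment behind the gateway.
    The schema lives in the prompt body, so no vendor-specific response_format is needed."""
    user_prompt = (f"RFP_DOCUMENT:\n{rfp_text}\n\n"
                   f"EVALUATION_CRITERIA_SECTION_REFERENCES:\n{eval_refs}\n\n"
                   "INSTRUCTIONS:\n<the Phase 1 instructions and JSON schema shown above>")
    response = client.chat.completions.create(
        model=deployment,
        temperature=0,
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_prompt}],
    )
    return response.choices[0].message.content  # strict JSON per the prompt's schema

# Same prompt, different deployment name, no prompt changes:
# extract_requirements(rfp, refs, "claude-opus-extraction")
# extract_requirements(rfp, refs, "gpt-extraction")
```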

Evaluator metric

For extraction, the metric that matters is recall against a ground-truth requirement set, not generic quality. You build the eval suite by manually annotating five to ten historical RFPs, capturing every requirement and rule. Then for every model swap, you measure:

  1. Recall: percentage of ground-truth requirements the model captured. Target above 95% for frontier; anything below that makes a swap risky.
  2. Precision: percentage of captured items that are real (not hallucinated). Target 100%; one fabricated requirement on a federal proposal is a disqualification risk.
  3. Schema validity: percentage of outputs that parse cleanly. Target 100%.

A model that hits 99% recall but 95% precision is dangerous in this phase. Score precision more strictly.
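A minimal scoring sketch for these three numbers, assuming each extracted requirement carries an identifier that can be matched against the annotated ground truth; real pipelines often need fuzzy text matching on top of this.

```python
import json

def extraction_metrics(model_outputs: list[str], ground_truth: list[set[str]]) -> dict:
    """Score one model's extraction runs against manually annotated RFPs.
    ground_truth[i] is the set of requirement IDs annotated for RFP i; IDs are
    assumed to be normalized (e.g. 'C.3.1.2') so set arithmetic is meaningful."""
    parsed, recalls, precisions = 0, [], []
    for raw, truth in zip(model_outputs, ground_truth):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue                                   # counts against schema validity
        parsed += 1
        captured = {r["id"] for r in data["requirements"]}
        recalls.append(len(captured & truth) / len(truth))
        precisions.append(len(captured & truth) / len(captured) if captured else 0.0)
    return {
        "schema_validity": parsed / len(model_outputs),
        "recall": sum(recalls) / len(recalls) if recalls else 0.0,
        "precision": sum(precisions) / len(precisions) if precisions else 0.0,
    }
```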

The percentage targets above and throughout this article (95%+ recall, 70%+ specificity, 90%+ recall on compliance, etc.) are practitioner heuristics calibrated against the kind of federal RFP where a single fabricated requirement is a disqualification risk. They are not benchmark-derived. Calibrate against your own annotated eval set; the right threshold for a marketing-RFP pipeline is different from a federal-IT-services pipeline.

Phase 2: Win-Themes Brainstorm

After extraction, the proposal team needs three to five win themes that will be threaded through every section. This is where divergent generation matters more than long-context analysis.

Best-fit model class: frontier creative. In my experience, GPT-5.5 produces stronger candidate themes than Claude on this phase, with broader range and less tendency toward generic “proven track record” framings. Claude Opus 4.7 is competitive when given a strong rubric, but then the rubric is doing most of the work. The pattern: GPT for divergence, Claude for refinement.

The prompt pattern

SYSTEM:
You are a proposal strategist. Generate candidate win themes for an RFP response.

USER:
RFP_REQUIREMENTS_SUMMARY:
<top 30 requirements + evaluation factors from Phase 1, summarized to ~3K tokens>

COMPANY_CONTEXT:
<company differentiators, past performance, key personnel - 1-2K tokens>

INSTRUCTIONS:
Generate 8 to 12 candidate win themes. For each:
1. Headline (max 10 words)
2. Underlying differentiator (specific capability or experience, not generic claim)
3. Which 3-5 RFP requirements this theme directly addresses
4. Which evaluation factor it strengthens
5. One paragraph rationale

Avoid generic themes (e.g. "proven track record", "client-focused approach"). If a theme could
fit any vendor, drop it. Themes must be specific enough to be defensible in a proposal review.

Why this needs frontier creative

Lower-tier models tend to output a fixed catalog of generic themes regardless of input. The differentiation is in the divergence: a frontier model with the right prompt produces themes that surprise the proposal team with angles they had not considered. That is the entire point of the phase.

Evaluator metric

Theme quality is harder to measure than extraction recall, but there are usable proxies:

  1. Diversity score: semantic distance between candidate themes (higher is better; if two themes paraphrase each other, the model failed). A scoring sketch follows this list.
  2. Specificity test: how many themes pass the “could this be any vendor” filter. Target 70%+ specific.
  3. Requirement coverage: every theme should map to at least 3 RFP requirements, and the theme set should collectively cover at least 60% of evaluation-weighted requirements.
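A sketch of the diversity and coverage proxies, assuming the Phase 2 output has been parsed into dicts with headline, differentiator, and requirement_ids fields (those names are illustrative) and using an off-the-shelf sentence-embedding model; the 0.8 near-duplicate threshold is an arbitrary starting point.

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

_embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence-embedding model works here

def theme_diversity(themes: list[dict]) -> dict:
    """Pairwise semantic similarity between candidate themes (headline + differentiator).
    Flags near-duplicate pairs above a similarity threshold."""
    texts = [f"{t['headline']}. {t['differentiator']}" for t in themes]
    sims = cosine_similarity(_embedder.encode(texts))
    pairs = list(combinations(range(len(themes)), 2))
    pair_sims = [sims[i][j] for i, j in pairs]
    return {
        "mean_pairwise_similarity": sum(pair_sims) / len(pair_sims),
        "near_duplicates": [(i, j) for i, j in pairs if sims[i][j] > 0.8],
    }

def requirement_coverage(themes: list[dict], min_reqs: int = 3) -> float:
    """Fraction of themes that map to at least min_reqs requirements."""
    return sum(len(t["requirement_ids"]) >= min_reqs for t in themes) / len(themes)
```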

In practice, the proposal team also rates themes 1-5 on usefulness during review. Capture this; it is the closest thing to a real-world quality signal.

Phase 3: Section Drafting

This is where token volume dominates the bill. A typical federal proposal has 30 to 60 sections. Each section gets drafted, often multiple times. If you run this phase on Tier 3 frontier models, you are paying frontier prices for first drafts that humans will edit anyway.

Best-fit model class: balanced. GPT-4.1 and Mistral Large 123B are both reasonable defaults. The choice between them often comes down to which produces more consistent tone across sections; in my experience GPT-4.1 has the edge on tone stability, Mistral has the edge on cost per token. Test both with your actual prompts and your eval suite.

The prompt pattern

SYSTEM:
You are drafting Section <ID> of a proposal response. Match the company voice. Keep within
the page limit. Address every requirement listed below. Use plain English; avoid jargon
unless the RFP uses it.

USER:
SECTION_ID: <e.g. L-1.2.3 Technical Approach>
PAGE_LIMIT: <e.g. 4 pages>
REQUIREMENTS_TO_ADDRESS:
<3-8 requirements from Phase 1 extraction, full text>

WIN_THEMES_TO_THREAD:
<2-3 themes from Phase 2, headline + rationale>

COMPANY_VOICE_REFERENCE:
<2-3 paragraphs from a prior winning proposal in the same domain, as voice example>

EVALUATION_FACTORS_AT_STAKE:
<which factors this section will be scored against>

INSTRUCTIONS:
Draft this section in the company voice. Address every requirement. Thread the assigned
win themes naturally. Do not exceed the page limit (assume 280 words per page).
End with a short summary box if the page allows.

The tone-drift problem

The single biggest problem with this phase is tone drift across sections. If section L-1.2 sounds like a confident senior consultant and section L-1.3 sounds like a marketing brochure, reviewers notice and proposal scores suffer.

Two mitigations:

  1. Voice reference in every prompt. Include 2-3 paragraphs from a prior winning proposal as the voice example. Update the reference when company voice evolves.
  2. Stitch-pass at the end of the phase. Run a separate prompt that takes the full draft set and produces a tone-adjustment diff (a sketch follows this list). This is one Tier 3 frontier call against the entire draft, not section-by-section. The cost is negligible compared to the drafting itself; the quality lift is consistent.
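A minimal stitch-pass sketch; call_model is a stand-in for whatever gateway call your pipeline uses, the deployment name is a placeholder, and the prompt wording is illustrative rather than tuned.

```python
# Minimal stitch-pass sketch. call_model() is a placeholder for your gateway call
# (one Tier 3 frontier request); section_drafts maps section ID -> draft text.
STITCH_PROMPT = """You are a proposal editor. Below are all drafted sections of one proposal.
Identify tone inconsistencies between sections and output a JSON list of edits:
[{{"section_id": ..., "original_sentence": ..., "revised_sentence": ..., "reason": ...}}]
Change wording only; do not add or remove claims, and do not touch requirement language.

VOICE_REFERENCE:
{voice_reference}

DRAFTS:
{drafts}"""

def tone_stitch_pass(section_drafts: dict[str, str], voice_reference: str) -> str:
    drafts = "\n\n".join(f"## {sid}\n{text}" for sid, text in section_drafts.items())
    return call_model(
        deployment="frontier-stitch",      # placeholder deployment name
        prompt=STITCH_PROMPT.format(voice_reference=voice_reference, drafts=drafts),
    )
```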

Evaluator metric

For drafting, three measurements:

  1. Rubric score: human or LLM-as-judge scoring against a 1-5 rubric. Capture the rubric explicitly so it is reproducible across model swaps.
  2. Format stability: percentage of drafts that respect page limit, section structure, no markdown leakage. Target 95%+.
  3. Coverage: percentage of assigned requirements actually addressed. Target 100%; missed requirements are bugs.

The rubric scoring is the labor-intensive part. Build it once with annotated historical drafts, then automate via LLM-as-judge using a frontier model for the judging only. The judge runs once per evaluation cycle, not on every draft.
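A minimal LLM-as-judge sketch along those lines; call_model is again a stand-in for your gateway call, the deployment name is a placeholder, and the rubric text is whatever you built from the annotated historical drafts.

```python
import json

# LLM-as-judge sketch for the drafting rubric. call_model() is a placeholder for a
# frontier-model call through your gateway.
JUDGE_PROMPT = """You are scoring a proposal section draft against a 1-5 rubric.

RUBRIC:
{rubric}

REQUIREMENTS_ASSIGNED:
{requirements}

DRAFT:
{draft}

Output JSON: {{"rubric_score": <1-5>, "requirements_addressed": [...], "rationale": "..."}}"""

def judge_draft(draft: str, requirements: list[str], rubric: str) -> dict:
    raw = call_model(deployment="frontier-judge",        # placeholder deployment name
                     prompt=JUDGE_PROMPT.format(rubric=rubric,
                                                requirements="\n".join(requirements),
                                                draft=draft))
    return json.loads(raw)

def coverage(judged: dict, assigned: list[str]) -> float:
    """Share of assigned requirements the judge marked as addressed; target is 1.0."""
    return len(set(judged["requirements_addressed"]) & set(assigned)) / len(assigned)
```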

Phase 4: Compliance Check

Compliance checking is the highest-volume, lowest-margin work in the pipeline. Every section gets checked against requirements, often multiple times. The model class that wins here is the cheapest one that hits acceptable precision and recall.

Best-fit model class: fast classifier. DeepSeek V3 and Llama 3.3 70B are both viable defaults. Mistral Small handles this work too. The criterion is not raw quality; it is precision-recall on injected violations from your eval suite.

What compliance checks actually catch

Naive compliance checks count words, verify required headings, and flag forbidden phrases. These are necessary but rarely sufficient. The hidden compliance failures live in:

  • Requirements that are addressed in spirit but not by letter (the RFP wants “demonstrated experience”; the section says “we have experience”)
  • Conditional requirements that only fire if certain technical claims are made
  • Page-limit violations that survive word counting (a 4-page section with a 3.9-page text block plus a half-page diagram)
  • Format inconsistencies between sections that pass individually but fail collectively

A capable Tier 1 model handles all of these with the right prompt structure. The trick is the prompt, not the model size.

The prompt pattern

SYSTEM:
You are a compliance auditor. Score the section against the requirements below.

USER:
SECTION_DRAFT:
<full section text>

REQUIREMENTS_THIS_SECTION_MUST_ADDRESS:
<list from Phase 1>

CHECK_LIST:
1. For each requirement: is it addressed? (yes/partial/no/ambiguous)
2. For partial/no/ambiguous: which sentence in the draft is
   closest to addressing it?
3. Does the section exceed the page limit? Estimate at 280 words
   per page, minus 10% for diagrams.
4. Are there forbidden patterns? (list of company-specific
   forbiddens, e.g. "world-class" or competitor names)
5. Is the tone consistent with the voice reference?

OUTPUT_FORMAT:
JSON: {requirement_coverage: [...], page_limit: {...}, forbidden_patterns: [...],
       tone_consistency: {score, rationale}, recommended_actions: [...]}

Evaluator metric

The eval suite for compliance is constructed from injected violations. You take 20-30 known-good sections, then create a violation set: each section gets 5-10 deliberate violations injected (a missed requirement, a forbidden phrase, a page-limit overrun, an ambiguous claim). The model has to find them.

Then you measure:

  1. Precision: percentage of flagged violations that are real
  2. Recall: percentage of injected violations the model caught
  3. False positive rate: flagged items in known-good sections

For a federal proposal, you want recall above 90% on the eval set. Below that, the cheap-tier model is not viable for this phase regardless of cost savings.
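A scoring sketch for the injected-violation suite, assuming each planted violation carries a stable ID that the checker's flags can be matched back to; in practice the matching often runs on page and sentence references instead.

```python
def compliance_eval(flagged: dict[str, set[str]], injected: dict[str, set[str]],
                    clean_sections: dict[str, set[str]]) -> dict:
    """Score a compliance-check model on the injected-violation eval set.
    flagged[sid] is the set of violation IDs the model raised for section sid;
    injected[sid] is what was deliberately planted; clean_sections holds flags
    raised on known-good sections (every one of those is a false positive)."""
    tp = sum(len(flagged[s] & injected[s]) for s in injected)
    fp = sum(len(flagged[s] - injected[s]) for s in injected)
    fn = sum(len(injected[s] - flagged[s]) for s in injected)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,    # target above 0.90 for federal work
        "clean_section_false_positives": sum(len(v) for v in clean_sections.values()),
    }
```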

The Orchestration Layer

The four phases need glue. On Microsoft Foundry, the natural pattern is Foundry Agent Service with phases as agents and an orchestrator handling the handoffs. The Microsoft Agent Framework gives you the SDK shape if you want explicit control over orchestration.

The contracts between phases matter more than the orchestrator choice:

  1. Phase 1 to Phase 2: the Phase 1 JSON output is summarized (top 30 requirements, evaluation weights) before going into Phase 2’s prompt. Do not pass the entire extraction; the brainstorm phase needs signal density, not volume.
  2. Phase 2 to Phase 3: themes are mapped to sections explicitly. A separate routing pass (which themes thread through which sections) lives between Phase 2 output and Phase 3 input.
  3. Phase 3 to Phase 4: drafts are checked one at a time, in parallel. Phase 4 results feed back into Phase 3 if violations exceed a threshold (re-draft with violations as additional requirements).

The orchestrator is responsible for these contracts. The model classes per phase are implementation details that can swap without the orchestrator changing.
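A structural sketch of what owning those contracts looks like; every run_* helper is a hypothetical wrapper around the per-phase agent call, and the dataclass fields mirror the prompt schemas above only illustratively.

```python
from dataclasses import dataclass

@dataclass
class Extraction:            # Phase 1 output (illustrative fields)
    requirements: list[dict]
    evaluation_factors: list[dict]
    format_rules: dict

@dataclass
class ThemeAssignment:       # Phase 2 output after the theme-to-section routing pass
    section_id: str
    theme_ids: list[str]

MAX_REDRAFTS = 2             # arbitrary cap on the Phase 4 -> Phase 3 feedback loop

def run_pipeline(rfp_text: str) -> dict[str, str]:
    extraction = run_extraction(rfp_text)                        # Phase 1, frontier reasoning
    summary = summarize_top_requirements(extraction, top_n=30)   # contract 1: signal density, not volume
    themes = run_win_themes(summary)                             # Phase 2, frontier creative
    assignments = route_themes_to_sections(themes, extraction)   # contract 2: explicit mapping
    finals = {}
    for a in assignments:                                        # contract 3: per-section check + feedback
        draft = run_drafting(a, extraction)                      # Phase 3, balanced tier
        for _ in range(MAX_REDRAFTS):
            report = run_compliance(draft, a, extraction)        # Phase 4, cheap tier
            if not report["violations_over_threshold"]:
                break
            draft = run_drafting(a, extraction,
                                 extra_requirements=report["recommended_actions"])
        finals[a.section_id] = draft
    return finals
```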

Honest Cost Math

The numbers below are illustrative arithmetic from public per-token pricing as of 2026-04-28. Substitute your own. Treat as a calculator, not a measured production result.

Assume 100 RFPs per month, average 250 pages, 40 sections per response, 3 compliance-check passes per section.

| Phase | Per-RFP token volume (input / output) | All-frontier line item | Tiered line item | Savings driver |
| --- | --- | --- | --- | --- |
| 1. Extraction | ~250K in / ~30K out | Tier 3 | Tier 3 (unchanged) | No savings; recall matters more than cost |
| 2. Win themes | ~5K in / ~3K out | Tier 3 | Tier 3 (unchanged) | Low volume; quality lift worth the cost |
| 3. Drafting | 40 sections x ~5K in / ~3K out = ~200K in / ~120K out | Tier 3 (dominant line item) | Tier 2 (60-80% per-token reduction) | Volume x balanced-tier pricing |
| 4. Compliance | 120 checks x ~3K in / ~1K out = ~360K in / ~120K out | Tier 3 | Tier 1 (90%+ per-token reduction) | Volume x cheap-tier pricing |

Across the full 100-RFP month, the blended cost of the tiered architecture lands at roughly a quarter to an eighth of the all-frontier baseline, depending on the exact volume mix and the price points on the day you run the math. The architectural point is not the specific multiple; it is that the tiering does most of the work, and the routing gateway just makes the tiering operationally clean.
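The arithmetic itself fits in a few lines; the per-million-token prices below are placeholders, not quotes, so substitute the published prices for your chosen deployments before drawing any conclusion.

```python
# Illustrative cost calculator. Prices are placeholders ($ per 1M input tokens, $ per 1M output tokens).
PRICES = {
    "tier3_frontier": (15.00, 75.00),
    "tier2_balanced": (2.00, 8.00),
    "tier1_cheap":    (0.30, 0.60),
}

# Per-RFP token volumes from the table above (input, output), in tokens.
PHASE_VOLUMES = {
    "extraction": (250_000, 30_000),
    "win_themes": (5_000, 3_000),
    "drafting":   (200_000, 120_000),
    "compliance": (360_000, 120_000),
}

def monthly_cost(tier_per_phase: dict[str, str], rfps_per_month: int = 100) -> float:
    total = 0.0
    for phase, (tok_in, tok_out) in PHASE_VOLUMES.items():
        price_in, price_out = PRICES[tier_per_phase[phase]]
        total += rfps_per_month * (tok_in / 1e6 * price_in + tok_out / 1e6 * price_out)
    return total

all_frontier = monthly_cost({p: "tier3_frontier" for p in PHASE_VOLUMES})
tiered = monthly_cost({"extraction": "tier3_frontier", "win_themes": "tier3_frontier",
                       "drafting": "tier2_balanced", "compliance": "tier1_cheap"})
print(f"all-frontier: ${all_frontier:,.0f}/mo, tiered: ${tiered:,.0f}/mo, "
      f"ratio: {all_frontier / tiered:.1f}x")
```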

Cited cost-pattern reference: AIThority on multi-model routing in 2026 reports comparable 4-8x reductions across multi-model agentic pipelines.

What I Would Actually Do

Build this incrementally. Do not start with all four phases on four different models on day one.

Sequence:

  1. Week 1: Get extraction working on a single frontier model. Build the eval suite first, then optimize the prompt against it. Goal: 95%+ recall, 100% precision on a 5-RFP eval set.
  2. Week 2: Add the win-themes phase. Same model. Get the prompt and rubric right.
  3. Week 3: Add drafting on the same frontier model. Build the section-level rubric and eval suite. Capture baseline cost.
  4. Week 4: Swap drafting to Tier 2 and run the eval comparison. Document deltas. If recall and rubric score hold, ship.
  5. Week 5: Add compliance check on Tier 1. Same eval discipline.
  6. Week 6: Run the first quarterly model-swap drill. Document what breaks.

Skip the multi-model architecture entirely if your volume is under 10 RFPs per month. The complexity overhead does not pay back at that scale; one frontier model handling everything is fine.

For everyone else, the four-phase decomposition is the most concrete payoff of the six rules. It is also the easiest one to demonstrate to skeptical stakeholders, because the bill drops visibly within one billing cycle and the eval scores stay flat or improve.


Cluster: AI Architecture. Companion piece to: Six Rules for LLM-Agnostic AI Agents on Microsoft Foundry. Related: Azure AI Foundry vs Azure OpenAI: 2026 Decision for the platform-layer choice; Claude on Azure Marketplace billing trap for vendor lock-in mechanics; AI engineering productivity ROI framework for the cost-of-AI-development side. External reference architecture: Foundry Agent Service overview and Microsoft Agent Framework.
