A design analysis of Proceda's architecture and how structured execution enables state-of-the-art on SOP-Bench with models that cost a fraction of the baselines.
There is a phrase gaining traction in the AI engineering community: harness engineering. Prompt engineering concerned itself with crafting the right input to an LLM. Harness engineering concerns itself with crafting the right execution environment around it — shifting emphasis from “what you say to the model” to “what you let the model do, when, and how you constrain it.”
Proceda, run against Amazon’s SOP-Bench — a benchmark with 2,411 tasks across 14 business domains — achieved state-of-the-art on 4 of 10 runnable domains (8 of 10 when scored on SOP-consistent tasks). It did so with Gemini 2.5 Flash and Gemini 3 Flash, beating baselines built on Claude 4 Opus, Claude 4.1 Opus, and Claude 4 Sonnet.
This post is a technical design analysis of how. It is a companion to the results report (which covers the numbers) and the 48-hour journey post (which covers the story). This piece covers the architecture.
Procedural tasks have defined steps, tool dependencies, and deterministic outcomes — Patient Intake, KYC compliance, Dangerous Goods classification, Content Moderation, Order Fulfillment. These are not open-ended reasoning problems. They are procedures: sequences of steps with tool calls, decision gates, and branching logic that must be executed faithfully.
SOP-Bench tests exactly this. Each domain provides a multi-step Standard Operating Procedure, a set of mock tools, and ground truth labels. The question: can an AI agent follow the procedure, call the right tools, and produce the correct output?
The baselines in the SOP-Bench paper use two standard agent architectures:
Both treat the SOP as prompt context. Proceda treats it as a state machine.
**Monolithic prompt (baseline):**
- Full SOP (30 steps) loaded into the system prompt
- All tool schemas injected at once
- Task variables in prompt context
- LLM reasons about the entire procedure in one context window
- Flow: tool calls → final answer
**State machine (Proceda):**
- SOP parsed into discrete SkillStep objects
- Each step receives only its instructions + relevant tool schemas
- LLM calls tools, then calls complete_step() to advance
- Executor manages step transitions, context, and guard rails
- Flow: Step 1 → Step 2 → … → Step N
This represents a category difference, not a degree difference. The baseline architectures require the model to function as both planner and executor. Proceda separates those concerns: the SOP defines the plan, the harness manages the execution lifecycle, and the model handles one step at a time.
Each decision below is connected to a specific benchmark outcome. The pattern is consistent: the harness absorbs complexity that would otherwise fall on the model.
A SKILL.md file is parsed into a sequence of SkillStep objects. The executor walks through them with a simple loop:
```python
# Executor main loop (simplified)
while current_step <= total_steps:
    prompt_llm_with_step_instructions(current_step)
    run_llm_loop_until_complete_step_called()
    handle_approval_gates_if_needed(current_step)
    current_step += 1  # advance to the next step
```
The model never receives a prompt like “you are on step 3 of 12, here is what you did in steps 1 and 2, now figure out what to do next.” Instead, it receives: “Execute Step 3: Validate insurance. Here are the instructions. Here are the tools. Call complete_step when done.”
Evidence: Patient Intake. This domain requires a strict 6-tool dependency chain where the final tool needs outputs from all 5 previous tools. Both baseline architectures score 0% with Claude 3.5 Sonnet v2 — the model cannot manage the sequencing. The paper shows that Claude 4.1 Opus (the most expensive model in the comparison) is required to reach 100%. Proceda reaches 97% with Gemini 2.5 Flash by decomposing the chain into 6 steps, each with one tool call.
Principle: Reduce the reasoning surface per LLM call. Rather than asking the model to plan, provide a plan and ask it to execute one piece.
This design decision is not purely technical. Stepwise execution also matches the mental model of the business users who author and oversee SOPs. Operations teams think in steps — they want to see which step the agent is on, review progress at each stage, and designate specific steps as requiring human sign-off before proceeding. A monolithic prompt that produces a single final answer offers no such visibility. The state machine architecture serves both the LLM (by reducing cognitive load per call) and the human operator (by providing the step-level observability and control they expect from any process automation system).
Two “control tools” are always injected into every LLM call alongside the domain’s actual tools:
- `complete_step(summary)` — the LLM signals that the current step is done
- `request_clarification(question, options)` — the LLM asks the human for input

The model must call `complete_step` to advance. There is no implicit step detection — no pattern matching on the model’s text output to infer completion. The model makes an explicit, structured declaration.
This prevents the model from silently skipping steps, rambling past step boundaries, or confusing “thinking about the step” with “completing the step.”
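As a concrete illustration, the two control tools can be written as ordinary function-calling schemas and appended to each step’s tool list. This is a sketch in a generic OpenAI/Gemini-style schema shape, not Proceda’s exact definitions:

```python
# Sketch: the two always-injected control tools as function-calling schemas.
# The schema shape is a generic function-calling format, not Proceda's exact one.
CONTROL_TOOLS = [
    {
        "name": "complete_step",
        "description": "Declare the current step finished and advance the executor.",
        "parameters": {
            "type": "object",
            "properties": {"summary": {"type": "string"}},
            "required": ["summary"],
        },
    },
    {
        "name": "request_clarification",
        "description": "Pause and ask the human operator for input.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "options": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["question"],
        },
    },
]

def tools_for_step(step_tools):
    """Every LLM call sees the current step's domain tools plus both control tools."""
    return list(step_tools) + CONTROL_TOOLS
```

Because the control tools ride alongside the domain tools in every call, "I am done" is always one structured action away, never something the harness must infer from prose.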
The harness includes a two-tier fallback for models that produce text without calling tools:

- **Nudge (soft):** after every 5 consecutive text-only responses, the harness injects a reminder: “Call complete_step if done, or use a tool to make progress.”
- **Force-complete (hard):** at 15 consecutive text-only responses, the harness completes the step on the model’s behalf.

Evidence: 100% Execution Completion Rate. Across all 10 domains, every single task ran to completion. No crashes, no infinite loops, no stalls. The nudge/force-complete mechanism catches any model that loses track, without failing the entire run.
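A minimal sketch of that two-tier fallback as a counter over consecutive text-only responses — thresholds from the guard-rail table later in this post; the function shape itself is illustrative:

```python
NUDGE_EVERY = 5   # soft tier: remind the model every 5 text-only responses
FORCE_AT = 15     # hard tier: force-complete the step at 15

def handle_text_only(consecutive_text_only, messages):
    """Two-tier fallback for a model that talks instead of calling tools.

    Returns True if the step should be force-completed. Thresholds match the
    guard-rail table in this post; the function itself is an illustrative sketch.
    """
    if consecutive_text_only >= FORCE_AT:
        return True  # harness completes the step on the model's behalf
    if consecutive_text_only and consecutive_text_only % NUDGE_EVERY == 0:
        messages.append({
            "role": "user",
            "content": "Call complete_step if done, or use a tool to make progress.",
        })
    return False
```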
Principle: Make progress legible. State transitions should be required as structured actions, not inferred from prose.
The context manager applies aggressive token budgeting. The step prompt for the current step is marked as critical at creation time, so it is never trimmed; old tool results from previous steps are the first to go.
Evidence: Dangerous Goods. This domain has 274 tasks, many requiring multiple tool calls per step across a multi-step procedure. Without trimming, later steps would have their context window dominated by tool results from earlier steps — results the model no longer needs. The trimming ensures the model always has the current step’s instructions and recent tool results in view.
Principle: Context is a resource to be managed, not accumulated. The right information at the right time outperforms all information all the time. A 1M token window becomes unnecessary when the harness keeps the active window focused.
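One way to realize that budget is priority-tagged trimming: the current step prompt is pinned, and the oldest tool results are dropped first. An illustrative sketch, not Proceda’s actual context manager:

```python
def trim_context(messages, budget, count_tokens):
    """Drop oldest non-critical tool results until the messages fit `budget`.

    `messages` are dicts with "role", "content", and an optional "critical"
    flag; the current step prompt is created with critical=True so it is
    never trimmed. Illustrative sketch, not Proceda's exact implementation.
    """
    total = sum(count_tokens(m["content"]) for m in messages)
    kept = list(messages)
    for m in messages:  # oldest first
        if total <= budget:
            break
        if m.get("critical") or m["role"] != "tool":
            continue  # never trim critical messages; only tool results go
        kept.remove(m)
        total -= count_tokens(m["content"])
    return kept
```

The key design choice is that trimming priority is assigned at message creation time, so the trimmer never has to reason about which content matters.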
SKILL.md supports two approval markers:
- `[APPROVAL REQUIRED]` — human reviews after step completion, before advancing
- `[PRE-APPROVAL REQUIRED]` — human must approve before step execution begins

These are parsed from the markdown by the parser (regex-based marker extraction) and enforced by the executor’s step loop. The LLM never decides whether to pause for approval. The harness makes that decision based on the SOP’s structural markers.
Evidence: This did not directly affect benchmark scores (SOP-Bench runs with auto-approve). However, it is load-bearing for the design philosophy: every decision removed from the model is a decision it cannot get wrong.
For production SOPs — the actual use case Proceda is built for — this is table stakes. More broadly, it illustrates the principle of separating policy from execution.
Principle: Policy enforcement belongs in the harness, not the model.
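Marker extraction of this kind is a small regex pass per step. A sketch — the marker strings come from the SOP format above, but the function itself is illustrative:

```python
import re

# The two approval markers from SKILL.md. PRE-APPROVAL is checked first
# so the more specific marker wins.
PRE_APPROVAL = re.compile(r"\[PRE-APPROVAL REQUIRED\]")
APPROVAL = re.compile(r"\[APPROVAL REQUIRED\]")

def approval_mode(step_markdown):
    """Return the approval gate for a step: 'pre', 'post', or None.

    The harness, not the model, reads this and decides when to pause.
    """
    if PRE_APPROVAL.search(step_markdown):
        return "pre"
    if APPROVAL.search(step_markdown):
        return "post"
    return None
```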
The SKILL.md frontmatter can declare output fields:
```yaml
output_fields:
  - final_resolution_status
  - escalation_required
```
When present, the system prompt instructs the model to emit XML tags in its final complete_step summary:
```xml
<final_resolution_status>RESOLVED</final_resolution_status>
<escalation_required>NO</escalation_required>
```
The extractor then parses these deterministically — no fuzzy matching, no regex on prose, no ambiguity between “RESOLVED” and “the status should be resolved.”
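That deterministic extraction amounts to tag parsing over the final complete_step summary. A sketch assuming the XML-tag convention described above:

```python
import re

def extract_output_fields(summary, field_names):
    """Pull declared output fields from <tag>value</tag> pairs in the final
    complete_step summary. No fuzzy matching: a field is either present as
    an explicit tag or reported as missing. Illustrative sketch.
    """
    results = {}
    for name in field_names:
        m = re.search(rf"<{name}>(.*?)</{name}>", summary, re.DOTALL)
        results[name] = m.group(1).strip() if m else None
    return results
```

Because the field names come from the SKILL.md frontmatter, the extractor only ever looks for tags the SOP author declared — prose in the summary, and values in earlier tool results, cannot collide with them.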
Evidence: Customer Service — 30.1% to 81.4%. Before output_fields, the model was producing correct answers but expressing them in free-form prose. The output extractor would match “account status is ACTIVE” (from a tool result) instead of “final resolution status is RESOLVED” (the agent’s actual answer). The model had the answer; the harness could not extract it.
This was the single most impactful change in the entire benchmark run. A 51-point improvement from changing how the model formats its output — with no change to how it reasons.
Principle: Structure the interface between model output and downstream systems. Deterministic extraction of structured output, rather than parsing prose, costs zero model capability and delivers massive gains.
The executor enforces hard limits on every step:
| Guard Rail | Threshold | What Happens |
|---|---|---|
| Text-only responses (soft) | Every 5 | Nudge: “call complete_step or use a tool” |
| Text-only responses (hard) | 15 | Force-complete the step |
| Total iterations per step | 50 | Raise execution error |
| App tool calls per step | 20 | Trigger error recovery (retry/skip/cancel) |
The tool call circuit breaker merits attention. If a step exceeds 20 app tool calls, execution does not crash — it pauses and asks the human (or auto-approve in benchmarks) whether to retry (reset the counter), skip the step, or cancel the run.
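The circuit breaker reduces to a per-step counter with a human decision point. A sketch with the threshold from the guard-rail table; the names are illustrative:

```python
MAX_TOOL_CALLS = 20  # per-step app tool call budget from the guard-rail table

def on_tool_call(step_state, ask_human):
    """Increment the step's tool-call counter; past the limit, pause and ask
    whether to retry (reset the counter), skip the step, or cancel the run.
    Illustrative sketch, not Proceda's exact recovery flow.
    """
    step_state["tool_calls"] += 1
    if step_state["tool_calls"] <= MAX_TOOL_CALLS:
        return "continue"
    choice = ask_human("Step exceeded 20 tool calls.", ["retry", "skip", "cancel"])
    if choice == "retry":
        step_state["tool_calls"] = 0  # reset and keep going
    return choice
```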
Evidence: 100% ECR again. These limits are generous — 50 iterations is substantial, and most steps complete in 2–5. But they are finite. Unbounded loops are the enemy of reliability, and the cost of a slightly-too-early force-complete is far lower than the cost of a runaway token spend or infinite loop.
Principle: Bounded execution is a feature, not a limitation.
Every runtime transition emits a structured event: step started, tool called, tool completed, LLM usage, approval requested, status changed, and 20+ other types. These are written to a JSONL trace file for every task.
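Structured tracing like this can be as simple as one JSON line per event. A minimal sketch — the event names follow the post, but the writer itself is illustrative:

```python
import json, time

def emit_event(trace_file, event_type, **payload):
    """Append one structured runtime event (step_started, tool_called,
    llm_usage, ...) as a JSON line to the task's trace file. Sketch only."""
    record = {"ts": time.time(), "type": event_type, **payload}
    trace_file.write(json.dumps(record) + "\n")
```

One line per event means traces can be grepped, diffed, and aggregated with ordinary tools — which is exactly how the extraction failures described below were found.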
This may appear to be a nice-to-have. In practice, it was the mechanism that made every other improvement possible.
The output_fields breakthrough came from trace analysis. The traces showed 100% ECR but 30% TSR on Customer Service — and revealed exactly where extraction was failing: the model’s answer was in the complete_step summary, but the extractor was picking up a field from an earlier tool result.

Principle: Instrument everything. Improvement requires observation. Traces are the foundation of eval-driven development — the feedback loop that makes systematic improvement possible.
The most counterintuitive result from the benchmark run was the model comparison on Referral Abuse v2:
| Model | Type | TSR |
|---|---|---|
| Gemini 2.5 Pro | Thinking (extended reasoning) | 74.5% |
| Gemini 3 Flash | Non-thinking, cheap | 88.5% |
| Gemini 3.1 Pro | Non-thinking, mid-tier | 99.0% |
The thinking model — the one that spends extra compute on chain-of-thought reasoning — performed worst. The cheap non-thinking model beat it by 14 points.
The explanation: Referral Abuse requires calculating penalty scores from a table, comparing them, and selecting the right violation. The procedure is arithmetic and table lookups, fully specified by the SOP. The thinking model treated each task as a reasoning puzzle, sometimes second-guessing the SOP’s instructions or over-analyzing edge cases that were not edges.
Thinking models solve a different problem: they help when the task requires figuring out what to do. Structured SOP execution already specifies what to do. The model need only execute.
| Domain | Proceda Model | Proceda TSR | Baseline Model | Baseline TSR |
|---|---|---|---|---|
| Dangerous Goods | Gemini 2.5 Flash | 94.2% | Claude 4 Sonnet | 87% |
| Customer Service | Gemini 2.5 Flash | 81.4% | Llama 3.3 70B | 79% |
| Aircraft Inspection | Gemini 2.5 Flash | 100% | Claude 3.7 Sonnet | 99% |
| Patient Intake | Gemini 2.5 Flash | 97.0% | Claude 4.1 Opus | 100% |
| Referral Abuse v2 | Gemini 3.1 Pro | 99.0% | Claude 4 Opus | 98% |
Gemini 2.5 Flash is roughly an order of magnitude cheaper per token than Claude 4 Opus. On Dangerous Goods, the cheaper model wins by 7.2 percentage points. On Aircraft Inspection, it ties (100% vs 99%). On Patient Intake, it trails by 3 points — but the baseline required Claude 4.1 Opus, the most expensive model in the comparison.
The pattern suggests a boundary. For domains with deterministic decision logic — even complex multi-step logic like Dangerous Goods classification — cheap models with structured execution match or beat expensive models with unstructured execution.
For domains requiring subjective judgment — like Know Your Business, where the SOP says to escalate on risk indicators but the ground truth follows unstated rules — stronger models retain an advantage. Proceda scored 42.2% on Know Your Business with Gemini 3.1 Pro, versus 58% for the Claude 4.5 Opus ReAct baseline.
Proceda is open source at github.com/vivekhaldar/proceda. Full benchmark results are in the SOP-Bench results report. The 48-hour benchmarking journey is in How We Achieved SOTA on SOP-Bench in 48 Hours. Built by Enchiridion Labs.
Proceda converts Standard Operating Procedures into step-by-step AI agents with human oversight, tool integration, and full audit trails.