Proceda, a terminal-first SDK for turning Standard Operating Procedures into runnable AI agents, achieves state-of-the-art results* on SOP-Bench, an Amazon Science benchmark with 2,411 tasks across 14 business domains.

*On 4 of 10 runnable domains by raw TSR; 8 of 10 when excluding benchmark labeling issues we identified and reported. Four additional domains were not runnable due to benchmark bugs. See Benchmark Quality for details.
On the 10 domains with functioning tools, Proceda sets new SOTA on 4 domains by raw Task Success Rate and beats the best published baseline on 8 of 10 domains when measured on tasks where the benchmark's ground truth is consistent with its own SOP rules. Proceda also achieves 100% Execution Completion Rate across every domain — no crashes, no stalls, every task runs to completion.
Most results were achieved using Gemini 2.5 Flash and Gemini 3 Flash — lightweight, inexpensive models — beating baselines set by Claude 4 Opus, Claude 4.1 Opus, and Claude 4 Sonnet.
| Acronym | Full Name | Definition |
|---|---|---|
| SOP | Standard Operating Procedure | A documented, step-by-step business process |
| TSR | Task Success Rate | Fraction of tasks where all output fields match ground truth |
| ECR | Execution Completion Rate | Fraction of tasks that run to completion without crashing |
| C-TSR | Conditional TSR | TSR measured only on tasks that completed (TSR / ECR) |
| SOTA | State of the Art | Best published result on a benchmark |
| FC | Function Calling | Agent architecture that uses tool-calling in a single LLM prompt |
| ReAct | Reasoning + Acting | Agent architecture using a thought-action-observation loop |
| MCP | Model Context Protocol | Open standard for connecting AI models to external tools |
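As an illustrative example of the C-TSR formula (the numbers here are made up): a domain where 95% of tasks run to completion (ECR 0.95) and 76% pass end-to-end (raw TSR 0.76) has C-TSR = 0.76 / 0.95 = 0.80.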
Standard Operating Procedures are how enterprises encode critical business processes — from patient intake to KYC compliance to content moderation. Unlike informational knowledge bases (where benchmarks like SkillsBench measure factual retrieval), SOPs are procedural: they define a sequence of steps, tool calls, decision gates, and branching logic that must be executed faithfully.
Automating SOPs with AI is a key enterprise use case. But “dump the SOP into a prompt and hope for the best” doesn't work — as SOP-Bench demonstrates, even frontier models fail when the procedure is complex. The patient intake domain, rated the easiest by human experts, scores 0% TSR with both baseline agent architectures when using Claude 3.5 Sonnet v2 (the model shipped with the SOP-Bench evaluation code). The paper’s Table 5 shows that scaling to Claude 4.1 Opus achieves 100% — but that requires a frontier model on the simplest domain.
SOP-Bench is a welcome benchmark in this space. It provides 14 real-world business domains with mock tools, ground truth labels, and standardized evaluation. It tests what matters: can an AI agent reliably follow a multi-step procedure, call the right tools with the right parameters, and produce the correct output?
| Domain | Model | Raw TSR | SOP-consistent | Best Baseline | Delta |
|---|---|---|---|---|---|
| Aircraft Inspection | Gemini 2.5 Flash | 100% | 100% | 99% — Claude 3.7 Sonnet | +1pt SOTA |
| Referral Abuse v2 | Gemini 3.1 Pro | 99.0% | 99.0% | 98% — Claude 4 Opus | +1pt SOTA |
| Patient Intake | Gemini 2.5 Flash | 97.0% | 97.0% | 100% — Claude 4.1 Opus | -3pt |
| Referral Abuse v1 | Gemini 3 Flash | 95.5% | 100%* | 98% — Claude 3.5 v2 | +2pt* SOTA |
| Dangerous Goods | Gemini 2.5 Flash | 94.2% | 94.2% | 87% — Claude 4 Sonnet | +7.2pt SOTA |
| Order Fulfillment | Gemini 3 Flash | 86.7% | 100%* | — | no baseline |
| Video Classification | Gemini 3 Flash | 83.2% | ~100%* | 95% — Claude 4 Sonnet | +5pt* SOTA |
| Customer Service | Gemini 2.5 Flash | 81.4% | 81.4% | 79% — Llama 3.3 70B | +2.4pt SOTA |
| Traffic Spoofing | Gemini 3 Flash | 79.5% | 98.8%* | 86% — Claude 4.5 Sonnet | +12.8pt* SOTA |
| Know Your Business | Gemini 3.1 Pro | 42.2% | 64.4%* | 58% — Claude 4.5 Opus | +6.4pt* SOTA |
* SOP-consistent TSR excludes tasks where the benchmark's CSV ground truth contradicts the SOP's explicit rules. Baseline TSR is measured on all tasks, so the comparison favors the baseline. Rows are sorted by Proceda raw TSR. Baselines are best-across-all-models from SOP-Bench paper v2 Table 5.
- 4 domains SOTA by raw TSR, no caveats: Aircraft Inspection, Referral Abuse v2, Dangerous Goods, Customer Service.
- 4 more domains SOTA on SOP-consistent tasks: Referral Abuse v1, Video Classification, Traffic Spoofing, Know Your Business.
Patient Intake: 97% is arguably 100% in practice. The 2 “failures” (out of 66) are both the same edge case: when primary insurance returns “invalid,” the LLM reasons that prescription benefit validation is unnecessary and skips the tool call. This is a clinically reasonable judgment — a human would likely do the same. The benchmark scores it wrong because the tool would have returned “valid” if called. All 5 other output fields are correct in both tasks.
100% ECR across all domains. Every task completes. Failures are reasoning or extraction issues, never execution crashes.
One of the most striking findings is the model cost story. The SOP-Bench paper's best baselines require frontier models — Claude 4.1 Opus, Claude 4.5 Opus, Claude 4 Sonnet. Proceda beats these with much lighter models:
| Proceda Model | Cost Tier | Domains | Baselines Beaten |
|---|---|---|---|
| Gemini 2.5 Flash | Lowest | Dangerous Goods, Customer Service, Patient Intake, Aircraft Insp. | Claude 4 Sonnet, Llama 3.3 70B, Claude 3.7 Sonnet |
| Gemini 3 Flash | Low | Referral v1, Traffic Spoofing, Video Classif., Order Fulfil. | Claude 3.5 v2, Claude 4.5 Sonnet, Claude 4 Sonnet |
| Gemini 3.1 Pro | Medium | Referral v2, KYB | Claude 4 Opus, Claude 4.5 Opus |
Know Your Business is the only domain that genuinely required Gemini 3.1 Pro: its SOP explicitly asks the agent to exercise subjective judgment, saying "use your experience" to distinguish typos from fraud and noting that "risk scores are not reliable." Referral Abuse v2 also used 3.1 Pro, but primarily because it happened to be the model that achieved SOTA during our model-comparison experiments.
For domains with clear, formulaic decision logic — even complex ones like Dangerous Goods (274 tasks, weighted scoring with imputation) — cheap models suffice when execution is structured.
The implication: a purpose-built SOP execution harness substitutes for raw model capability on procedural tasks. You don’t need a $75/MTok model to follow a procedure; you need a harness that decomposes the procedure into manageable steps and drives the model through them one at a time. See How Proceda Works for the architecture that makes this possible.
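A minimal sketch of the idea, assuming a hypothetical `call_llm` callable that returns the model's `complete_step` summary (or `None` while the step is still in progress). None of these names are Proceda's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    title: str
    instructions: str

@dataclass
class RunState:
    current: int = 0
    summaries: list[str] = field(default_factory=list)

def run_sop(steps: list[Step], call_llm, max_turns_per_step: int = 8) -> RunState:
    """Drive the model through the SOP one step at a time.

    The model only ever sees the current step plus summaries of prior steps,
    never the whole procedure; a per-step turn budget acts as a circuit breaker.
    """
    state = RunState()
    while state.current < len(steps):
        step = steps[state.current]
        prompt = (
            f"Step {state.current + 1}/{len(steps)}: {step.title}\n"
            f"{step.instructions}\n"
            f"Summaries of prior steps: {state.summaries}\n"
            "Call complete_step(summary=...) when this step is done."
        )
        for _ in range(max_turns_per_step):  # circuit breaker
            summary = call_llm(prompt)       # hypothetical: returns the summary once complete_step is called
            if summary is not None:
                state.summaries.append(summary)
                state.current += 1           # advance the state machine
                break
        else:
            raise RuntimeError(f"step {state.current + 1} exceeded its turn budget")
    return state
```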
During this evaluation, we identified and reported systematic issues in the SOP-Bench data. We frame these as contributions to the benchmark's quality — SOP-Bench is a valuable new benchmark, and these findings help improve it.
| Domain | Issue | Filed |
|---|---|---|
| Content Flagging | random.random() in tool implementations | #2 |
| Warehouse Inspection | Hardcoded po_number % 3 mock logic ignores CSV | #3 |
| Video Annotation | 20 of 26 tool methods are pass stubs returning None | #6 |
| Email Intent | Unresolved git merge conflicts in 3 source files | #7 |
These domains have functioning tools and were run successfully, but a portion of tasks have ground truth labels that contradict the SOP's explicit rules. An agent faithfully following the SOP gets penalized.
| Domain | Disagreements | Filed | Pattern |
|---|---|---|---|
| Referral Abuse v1 | 9 / 200 | #4 | CSV follows a closure-priority rule not in the SOP |
| Traffic Spoofing | 39 / 200 | #5 | Medium-risk tasks labeled “Warning Issued” when SOP says “Temporary Suspension” |
| Know Your Business | 31 / 90 | #8 | All “awaiting info” tasks have stronger escalation signals than “escalate” tasks |
| Video Classification | 9 / 196 | — | Implicit rules not derivable from SOP; stub tools suppress signals |
The SOP-Bench paper lists “implicit knowledge that humans learn but rarely document” as a benchmark challenge. However, each agent evaluates tasks in isolation with no access to ground truth labels — there is no mechanism to learn unstated rules from data patterns.
The SOP-Bench paper evaluates two agent architectures: Function Calling (tool use in a single prompt) and ReAct (thought-action-observation loop). Both dump the entire SOP as raw text into a single prompt and say “follow this.”
This works for short procedures. It fails for complex ones. Patient intake — rated the easiest domain by human experts — requires a 6-tool dependency chain where the final tool needs outputs from all 5 previous tools as input parameters. With Claude 3.5 Sonnet v2, both baseline architectures score 0% TSR. Scaling to Claude 4.1 Opus achieves 100% — but requiring a frontier model for the simplest procedure highlights the brittleness of the unstructured approach.
Proceda treats SOPs as first-class executable artifacts, not prompt context.
- **Convert.** `proceda convert` uses an LLM to transform unstructured SOP text into a structured SKILL.md file — a markdown document with YAML frontmatter and `### Step N:` headings. The `--tools` flag passes tool schemas so the converter generates steps referencing exact tool names and parameter names. The `--output-fields` flag declares expected outputs so the final step emits structured XML tags. No hand-editing.
- **Execute.** The runtime advances one step at a time, requiring the LLM to call `complete_step` with a summary before moving on. Circuit breakers catch infinite loops.
- **Extract.** When `output_fields` are declared, the system prompt instructs the LLM to emit `<field_name>value</field_name>` XML tags. The output extractor parses these deterministically — no fragile regex on free-form text.

Structured execution reduces the cognitive load on the LLM at each decision point. Instead of reasoning about a 30-step procedure in a single prompt, the model handles one step at a time with clear instructions, available tools, and context from prior results.
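To make this concrete, here is a sketch of what a converted SKILL.md could look like. The overall shape (YAML frontmatter, `### Step N:` headings, an approval marker, XML output tags) follows the description above; the skill name, tool names, frontmatter keys, and field names are invented for illustration:

```markdown
---
name: patient-intake
output_fields: [eligibility_status, copay_amount]
---

### Step 1: Verify primary insurance
Call `verify_primary_insurance(patient_id=...)` and record the returned policy status.

### Step 2: Validate prescription benefits [APPROVAL REQUIRED]
Call `validate_prescription_benefits(policy_id=...)` using the policy ID from Step 1.

### Step 3: Emit outputs
Report the declared output fields as XML tags, e.g.
<eligibility_status>eligible</eligibility_status>
<copay_amount>25.00</copay_amount>
```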
This is why Gemini 2.5 Flash beats Claude 4 Sonnet on Dangerous Goods: the procedure is formulaic — weighted scoring with imputation rules — and Proceda's step decomposition makes each step tractable. The model doesn't need to “be smart”; it needs to follow instructions and call the right tool with the right parameters.
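For intuition, "formulaic" means decision logic like the sketch below: a made-up weighted-scoring rule with imputation, not the actual Dangerous Goods SOP:

```python
# Hypothetical weighted-scoring rule with imputation (illustrative only).
# Deterministic logic like this is easy for a cheap model to execute when
# the harness presents it one step at a time.

WEIGHTS = {"flammability": 0.5, "toxicity": 0.3, "reactivity": 0.2}
DEFAULTS = {"flammability": 2.0, "toxicity": 1.0, "reactivity": 1.0}  # imputed when a reading is missing

def hazard_score(readings: dict[str, float | None]) -> float:
    # Impute missing readings with the SOP's default values, then take the weighted sum.
    return sum(
        weight * (readings[key] if readings.get(key) is not None else DEFAULTS[key])
        for key, weight in WEIGHTS.items()
    )

# 0.5 * 4.0 + 0.3 * 1.0 (imputed) + 0.2 * 3.0 = 2.9
print(hazard_score({"flammability": 4.0, "toxicity": None, "reactivity": 3.0}))
```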
For a deeper technical analysis of the seven design decisions that enable this — state machine execution, control tools, context trimming, approval gates, structured extraction, guard rails, and event-driven observability — see Anatomy of a SOTA Agentic SOP-Execution Engine.
Steps marked [APPROVAL REQUIRED] or [PRE-APPROVAL REQUIRED] halt execution until a human approves, rejects, or skips. Decisions are logged with timestamps for audit trails.
Every runtime transition emits a structured RunEvent — 20+ event types covering step lifecycle, tool calls, LLM usage, approvals, and errors. Persisted as append-only JSONL logs.
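As a sketch of what consecutive log lines might look like, assuming invented event and field names (the real RunEvent schema will differ):

```jsonl
{"event": "step_started",   "run_id": "r-1842", "step": 3, "ts": "2025-06-12T14:03:07Z"}
{"event": "tool_called",    "run_id": "r-1842", "step": 3, "tool": "verify_primary_insurance", "ts": "2025-06-12T14:03:09Z"}
{"event": "step_completed", "run_id": "r-1842", "step": 3, "summary": "Insurance verified: active", "ts": "2025-06-12T14:03:11Z"}
```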
Complete execution state — current step, conversation history, approval records — can be paused and resumed across hours, days, or weeks. Essential for SOPs spanning multiple review cycles.
Tools connect via the Model Context Protocol (open standard), supporting stdio and HTTP transports. No vendor lock-in. Access control via denylists and allowlists.
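For illustration, a tool-server configuration might look like the following. The `mcpServers` layout with `command`/`args` (stdio) and `url` (HTTP) entries is the common community convention for MCP clients; the allowlist/denylist keys are hypothetical, and Proceda's actual config format may differ:

```json
{
  "mcpServers": {
    "inventory": { "command": "python", "args": ["tools/inventory_server.py"] },
    "crm": { "url": "https://tools.example.com/mcp" }
  },
  "tool_allowlist": ["inventory.*", "crm.lookup_customer"],
  "tool_denylist": ["crm.delete_*"]
}
```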
SOP-Bench demonstrates that faithfully executing complex procedures is a hard problem for AI agents. Proceda's structured approach — automated SOP conversion, step-by-step execution with tool integration, and human oversight — achieves state-of-the-art results on 4 domains outright and 8 of 10 on SOP-consistent tasks, often with significantly cheaper models than the published baselines.
The benchmark also has room to grow. We've filed 7 issues identifying broken tools and labeling inconsistencies across 7 domains, and hope these contributions help strengthen SOP-Bench as a standard for evaluating procedural AI.
Proceda converts Standard Operating Procedures into step-by-step AI agents with human oversight, tool integration, and full audit trails.