Proceda, a terminal-first SDK for turning Standard Operating Procedures into runnable AI agents, achieves state-of-the-art results* on SOP-Bench, an Amazon Science benchmark with 2,411 tasks across 14 business domains.

*On 4 of 10 runnable domains by raw TSR; 8 of 10 when excluding benchmark labeling issues we identified and reported. Four additional domains were not runnable due to benchmark bugs. See Benchmark Quality for details.
On the 10 domains with functioning tools, Proceda sets new SOTA on 4 domains by raw Task Success Rate and beats the best published baseline on 8 of 10 domains when measured on tasks where the benchmark's ground truth is consistent with its own SOP rules. Proceda also achieves 100% Execution Completion Rate across every domain — no crashes, no stalls, every task runs to completion.
Most results were achieved using Gemini 2.5 Flash and Gemini 3 Flash — lightweight, inexpensive models — beating baselines set by Claude 4 Opus, Claude 4.1 Opus, and Claude 4 Sonnet.
| Acronym | Full Name | Definition |
|---|---|---|
| SOP | Standard Operating Procedure | A documented, step-by-step business process |
| TSR | Task Success Rate | Fraction of tasks where all output fields match ground truth |
| ECR | Execution Completion Rate | Fraction of tasks that run to completion without crashing |
| C-TSR | Conditional TSR | TSR measured only on tasks that completed (TSR / ECR) |
| SOTA | State of the Art | Best published result on a benchmark |
| FC | Function Calling | Agent architecture that uses tool-calling in a single LLM prompt |
| ReAct | Reasoning + Acting | Agent architecture using a thought-action-observation loop |
| MCP | Model Context Protocol | Open standard for connecting AI models to external tools |
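As an illustrative example of the C-TSR formula (the numbers here are made up): a domain where 95% of tasks run to completion (ECR 0.95) and 76% pass end-to-end (raw TSR 0.76) has C-TSR = 0.76 / 0.95 = 0.80.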
Standard Operating Procedures are how enterprises encode critical business processes — from patient intake to KYC compliance to content moderation. Unlike informational knowledge bases (where benchmarks like SkillsBench measure factual retrieval), SOPs are procedural: they define a sequence of steps, tool calls, decision gates, and branching logic that must be executed faithfully.
Automating SOPs with AI is a key enterprise use case. But “dump the SOP into a prompt and hope for the best” doesn't work — as SOP-Bench demonstrates, even frontier models fail when the procedure is complex. The patient intake domain, rated the easiest by human experts, scores 0% TSR with both baseline agent architectures when using Claude 3.5 Sonnet v2 (the model shipped with the SOP-Bench evaluation code). The paper’s Table 5 shows that scaling to Claude 4.1 Opus achieves 100% — but that requires a frontier model on the simplest domain.
SOP-Bench is a welcome benchmark in this space. It provides 14 real-world business domains with mock tools, ground truth labels, and standardized evaluation. It tests what matters: can an AI agent reliably follow a multi-step procedure, call the right tools with the right parameters, and produce the correct output?
| Domain | Model | Raw TSR | SOP-consistent | Best Baseline | Delta |
|---|---|---|---|---|---|
| Aircraft Inspection | Gemini 2.5 Flash | 100% | 100% | 99% — Claude 3.7 Sonnet | +1pt SOTA |
| Referral Abuse v2 | Gemini 3.1 Pro | 99.0% | 99.0% | 98% — Claude 4 Opus | +1pt SOTA |
| Patient Intake | Gemini 2.5 Flash | 97.0% | 97.0% | 100% — Claude 4.1 Opus | -3pt |
| Referral Abuse v1 | Gemini 3 Flash | 95.5% | 100%* | 98% — Claude 3.5 v2 | +2pt* SOTA |
| Dangerous Goods | Gemini 2.5 Flash | 94.2% | 94.2% | 87% — Claude 4 Sonnet | +7.2pt SOTA |
| Order Fulfillment | Gemini 3 Flash | 86.7% | 100%* | — | no baseline |
| Video Classification | Gemini 3 Flash | 83.2% | ~100%* | 95% — Claude 4 Sonnet | +5pt* SOTA |
| Customer Service | Gemini 2.5 Flash | 81.4% | 81.4% | 79% — Llama 3.3 70B | +2.4pt SOTA |
| Traffic Spoofing | Gemini 3 Flash | 79.5% | 98.8%* | 86% — Claude 4.5 Sonnet | +12.8pt* SOTA |
| Know Your Business | Gemini 3.1 Pro | 42.2% | 64.4%* | 58% — Claude 4.5 Opus | +6.4pt* SOTA |
* SOP-consistent TSR excludes tasks where the benchmark's CSV ground truth contradicts the SOP's explicit rules. Baseline TSR is measured on all tasks, so the comparison favors the baseline. Rows are sorted by Proceda raw TSR. Baselines are best-across-all-models from SOP-Bench paper v2 Table 5.
- 4 domains SOTA by raw TSR, no caveats: Aircraft Inspection, Referral Abuse v2, Dangerous Goods, Customer Service.
- 4 more domains SOTA on SOP-consistent tasks: Referral Abuse v1, Video Classification, Traffic Spoofing, Know Your Business.
Patient Intake: 97% is arguably 100% in practice. The 2 “failures” (out of 66) are both the same edge case: when primary insurance returns “invalid,” the LLM reasons that prescription benefit validation is unnecessary and skips the tool call. This is a clinically reasonable judgment — a human would likely do the same. The benchmark scores it wrong because the tool would have returned “valid” if called. All 5 other output fields are correct in both tasks.
100% ECR across all domains. Every task completes. Failures are reasoning or extraction issues, never execution crashes.
One of the most striking findings is the model cost story. The SOP-Bench paper's best baselines require frontier models — Claude 4.1 Opus, Claude 4.5 Opus, Claude 4 Sonnet. Proceda beats these with much lighter models:
| Proceda Model | Cost Tier | Domains | Baselines Beaten |
|---|---|---|---|
| Gemini 2.5 Flash | Lowest | Dangerous Goods, Customer Service, Patient Intake, Aircraft Insp. | Claude 4 Sonnet, Llama 3.3 70B, Claude 3.7 Sonnet |
| Gemini 3 Flash | Low | Referral v1, Traffic Spoofing, Video Classif., Order Fulfil. | Claude 3.5 v2, Claude 4.5 Sonnet, Claude 4 Sonnet |
| Gemini 3.1 Pro | Medium | Referral v2, KYB | Claude 4 Opus, Claude 4.5 Opus |
Know Your Business is the only domain that genuinely required Gemini 3.1 Pro: its SOP explicitly asks the agent to exercise subjective judgment, saying "use your experience" to distinguish typos from fraud and noting that "risk scores are not reliable." Referral Abuse v2 also used 3.1 Pro, but primarily because it happened to be the model that achieved SOTA during our model-comparison experiments.
For domains with clear, formulaic decision logic — even complex ones like Dangerous Goods (274 tasks, weighted scoring with imputation) — cheap models suffice when execution is structured.
The implication: a purpose-built SOP execution harness substitutes for raw model capability on procedural tasks. You don’t need a $75/MTok model to follow a procedure; you need a harness that decomposes the procedure into manageable steps and drives the model through them one at a time. See How Proceda Works for the architecture that makes this possible.
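A minimal sketch of the idea, assuming a hypothetical `call_llm` callable that returns the model's `complete_step` summary (or `None` while the step is still in progress). None of these names are Proceda's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    title: str
    instructions: str

@dataclass
class RunState:
    current: int = 0
    summaries: list[str] = field(default_factory=list)

def run_sop(steps: list[Step], call_llm, max_turns_per_step: int = 8) -> RunState:
    """Drive the model through the SOP one step at a time.

    The model only ever sees the current step plus summaries of prior steps,
    never the whole procedure; a per-step turn budget acts as a circuit breaker.
    """
    state = RunState()
    while state.current < len(steps):
        step = steps[state.current]
        prompt = (
            f"Step {state.current + 1}/{len(steps)}: {step.title}\n"
            f"{step.instructions}\n"
            f"Summaries of prior steps: {state.summaries}\n"
            "Call complete_step(summary=...) when this step is done."
        )
        for _ in range(max_turns_per_step):  # circuit breaker
            summary = call_llm(prompt)       # hypothetical: returns the summary once complete_step is called
            if summary is not None:
                state.summaries.append(summary)
                state.current += 1           # advance the state machine
                break
        else:
            raise RuntimeError(f"step {state.current + 1} exceeded its turn budget")
    return state
```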
During this evaluation, we identified and reported systematic issues in the SOP-Bench data. We frame these as contributions to the benchmark's quality — SOP-Bench is a valuable new benchmark, and these findings help improve it.
| Domain | Issue | Filed |
|---|---|---|
| Content Flagging | random.random() in tool implementations | #2 |
| Warehouse Inspection | Hardcoded po_number % 3 mock logic ignores CSV | #3 |
| Video Annotation | 20 of 26 tool methods are pass stubs returning None | #6 |
| Email Intent | Unresolved git merge conflicts in 3 source files | #7 |
These domains have functioning tools and were run successfully, but a portion of tasks have ground truth labels that contradict the SOP's explicit rules. An agent faithfully following the SOP gets penalized.
| Domain | Disagreements | Filed | Pattern |
|---|---|---|---|
| Referral Abuse v1 | 9 / 200 | #4 | CSV follows a closure-priority rule not in the SOP |
| Traffic Spoofing | 39 / 200 | #5 | Medium-risk tasks labeled “Warning Issued” when SOP says “Temporary Suspension” |
| Know Your Business | 31 / 90 | #8 | All “awaiting info” tasks have stronger escalation signals than “escalate” tasks |
| Video Classification | 9 / 196 | — | Implicit rules not derivable from SOP; stub tools suppress signals |
The SOP-Bench paper lists “implicit knowledge that humans learn but rarely document” as a benchmark challenge. However, each agent evaluates tasks in isolation with no access to ground truth labels — there is no mechanism to learn unstated rules from data patterns.
The SOP-Bench paper evaluates two agent architectures: Function Calling (tool use in a single prompt) and ReAct (thought-action-observation loop). Both dump the entire SOP as raw text into a single prompt and say “follow this.”
This works for short procedures. It fails for complex ones. Patient intake — rated the easiest domain by human experts — requires a 6-tool dependency chain where the final tool needs outputs from all 5 previous tools as input parameters. With Claude 3.5 Sonnet v2, both baseline architectures score 0% TSR. Scaling to Claude 4.1 Opus achieves 100% — but requiring a frontier model for the simplest procedure highlights the brittleness of the unstructured approach.
Proceda treats SOPs as first-class executable artifacts, not prompt context.
- **Convert.** `proceda convert` uses an LLM to transform unstructured SOP text into a structured SKILL.md file — a markdown document with YAML frontmatter and `### Step N:` headings. The `--tools` flag passes tool schemas so the converter generates steps referencing exact tool names and parameter names. The `--output-fields` flag declares expected outputs so the final step emits structured XML tags. No hand-editing.
- **Execute.** The runtime advances one step at a time, requiring the LLM to call `complete_step` with a summary before moving on. Circuit breakers catch infinite loops.
- **Extract.** When `output_fields` are declared, the system prompt instructs the LLM to emit `<field_name>value</field_name>` XML tags. The output extractor parses these deterministically — no fragile regex on free-form text.

Structured execution reduces the cognitive load on the LLM at each decision point. Instead of reasoning about a 30-step procedure in a single prompt, the model handles one step at a time with clear instructions, available tools, and context from prior results.
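To make this concrete, here is a sketch of what a converted SKILL.md could look like. The overall shape (YAML frontmatter, `### Step N:` headings, an approval marker, XML output tags) follows the description above; the skill name, tool names, frontmatter keys, and field names are invented for illustration:

```markdown
---
name: patient-intake
output_fields: [eligibility_status, copay_amount]
---

### Step 1: Verify primary insurance
Call `verify_primary_insurance(patient_id=...)` and record the returned policy status.

### Step 2: Validate prescription benefits [APPROVAL REQUIRED]
Call `validate_prescription_benefits(policy_id=...)` using the policy ID from Step 1.

### Step 3: Emit outputs
Report the declared output fields as XML tags, e.g.
<eligibility_status>eligible</eligibility_status>
<copay_amount>25.00</copay_amount>
```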
This is why Gemini 2.5 Flash beats Claude 4 Sonnet on Dangerous Goods: the procedure is formulaic — weighted scoring with imputation rules — and Proceda's step decomposition makes each step tractable. The model doesn't need to “be smart”; it needs to follow instructions and call the right tool with the right parameters.
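For intuition, "formulaic" means decision logic like the sketch below: a made-up weighted-scoring rule with imputation, not the actual Dangerous Goods SOP:

```python
# Hypothetical weighted-scoring rule with imputation (illustrative only).
# Deterministic logic like this is easy for a cheap model to execute when
# the harness presents it one step at a time.

WEIGHTS = {"flammability": 0.5, "toxicity": 0.3, "reactivity": 0.2}
DEFAULTS = {"flammability": 2.0, "toxicity": 1.0, "reactivity": 1.0}  # imputed when a reading is missing

def hazard_score(readings: dict[str, float | None]) -> float:
    # Impute missing readings with the SOP's default values, then take the weighted sum.
    return sum(
        weight * (readings[key] if readings.get(key) is not None else DEFAULTS[key])
        for key, weight in WEIGHTS.items()
    )

# 0.5 * 4.0 + 0.3 * 1.0 (imputed) + 0.2 * 3.0 = 2.9
print(hazard_score({"flammability": 4.0, "toxicity": None, "reactivity": 3.0}))
```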
For a deeper technical analysis of the seven design decisions that enable this — state machine execution, control tools, context trimming, approval gates, structured extraction, guard rails, and event-driven observability — see Anatomy of a SOTA Agentic SOP-Execution Engine.
Steps marked [APPROVAL REQUIRED] or [PRE-APPROVAL REQUIRED] halt execution until a human approves, rejects, or skips. Decisions are logged with timestamps for audit trails.
Every runtime transition emits a structured RunEvent — 20+ event types covering step lifecycle, tool calls, LLM usage, approvals, and errors. Persisted as append-only JSONL logs.
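As a sketch of what consecutive log lines might look like, assuming invented event and field names (the real RunEvent schema will differ):

```jsonl
{"event": "step_started",   "run_id": "r-1842", "step": 3, "ts": "2025-06-12T14:03:07Z"}
{"event": "tool_called",    "run_id": "r-1842", "step": 3, "tool": "verify_primary_insurance", "ts": "2025-06-12T14:03:09Z"}
{"event": "step_completed", "run_id": "r-1842", "step": 3, "summary": "Insurance verified: active", "ts": "2025-06-12T14:03:11Z"}
```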
Complete execution state — current step, conversation history, approval records — can be paused and resumed across hours, days, or weeks. Essential for SOPs spanning multiple review cycles.
Tools connect via the Model Context Protocol (open standard), supporting stdio and HTTP transports. No vendor lock-in. Access control via denylists and allowlists.
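For illustration, a tool-server configuration might look like the following. The `mcpServers` layout with `command`/`args` (stdio) and `url` (HTTP) entries is the common community convention for MCP clients; the allowlist/denylist keys are hypothetical, and Proceda's actual config format may differ:

```json
{
  "mcpServers": {
    "inventory": { "command": "python", "args": ["tools/inventory_server.py"] },
    "crm": { "url": "https://tools.example.com/mcp" }
  },
  "tool_allowlist": ["inventory.*", "crm.lookup_customer"],
  "tool_denylist": ["crm.delete_*"]
}
```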
SOP-Bench demonstrates that faithfully executing complex procedures is a hard problem for AI agents. Proceda's structured approach — automated SOP conversion, step-by-step execution with tool integration, and human oversight — achieves state-of-the-art results on 4 domains outright and 8 of 10 on SOP-consistent tasks, often with significantly cheaper models than the published baselines.
The benchmark also has room to grow. We've filed 7 issues identifying broken tools and labeling inconsistencies across 7 domains, and hope these contributions help strengthen SOP-Bench as a standard for evaluating procedural AI.
Proceda converts Standard Operating Procedures into step-by-step AI agents with human oversight, tool integration, and full audit trails.