Proceda Achieves SOTA on SOP-Bench*

*On 4 of 10 runnable domains by raw TSR; 8 of 10 when excluding benchmark labeling issues we identified and reported. Four additional domains were not runnable due to benchmark bugs. See Benchmark Quality for details.

Proceda, a terminal-first SDK for turning Standard Operating Procedures into runnable AI agents, achieves state-of-the-art results on SOP-Bench, an Amazon Science benchmark with 2,411 tasks across 14 business domains.

On the 10 domains with functioning tools, Proceda sets new SOTA on 4 domains by raw Task Success Rate and beats the best published baseline on 8 of 10 domains when measured on tasks where the benchmark's ground truth is consistent with its own SOP rules. Proceda also achieves 100% Execution Completion Rate across every domain — no crashes, no stalls, every task runs to completion.

Most results were achieved using Gemini 2.5 Flash and Gemini 3 Flash — lightweight, inexpensive models — beating baselines set by Claude 4 Opus, Claude 4.1 Opus, and Claude 4 Sonnet.

Terminology

AcronymFull NameDefinition
SOPStandard Operating ProcedureA documented, step-by-step business process
TSRTask Success RateFraction of tasks where all output fields match ground truth
ECRExecution Completion RateFraction of tasks that run to completion without crashing
C-TSRConditional TSRTSR measured only on tasks that completed (TSR / ECR)
SOTAState of the ArtBest published result on a benchmark
FCFunction CallingAgent architecture that uses tool-calling in a single LLM prompt
ReActReasoning + ActingAgent architecture using a thought-action-observation loop
MCPModel Context ProtocolOpen standard for connecting AI models to external tools

Why SOPs Matter

Standard Operating Procedures are how enterprises encode critical business processes — from patient intake to KYC compliance to content moderation. Unlike informational knowledge bases (where benchmarks like SkillsBench measure factual retrieval), SOPs are procedural: they define a sequence of steps, tool calls, decision gates, and branching logic that must be executed faithfully.

Automating SOPs with AI is a key enterprise use case. But “dump the SOP into a prompt and hope for the best” doesn't work — as SOP-Bench demonstrates, even frontier models fail when the procedure is complex. The patient intake domain, rated the easiest by human experts, scores 0% TSR with both baseline agent architectures when using Claude 3.5 Sonnet v2 (the model shipped with the SOP-Bench evaluation code). The paper’s Table 5 shows that scaling to Claude 4.1 Opus achieves 100% — but that requires a frontier model on the simplest domain.

SOP-Bench is a welcome benchmark in this space. It provides 14 real-world business domains with mock tools, ground truth labels, and standardized evaluation. It tests what matters: can an AI agent reliably follow a multi-step procedure, call the right tools with the right parameters, and produce the correct output?


Results

Performance Summary

DomainModelRaw TSRSOP-consistentBest BaselineDelta
Aircraft InspectionGemini 2.5 Flash100%100%99% — Claude 3.7 Sonnet+1pt SOTA
Referral Abuse v2Gemini 3.1 Pro99.0%99.0%98% — Claude 4 Opus+1pt SOTA
Patient IntakeGemini 2.5 Flash97.0%97.0%100% — Claude 4.1 Opus-3pt
Referral Abuse v1Gemini 3 Flash95.5%100%*98% — Claude 3.5 v2+2pt* SOTA
Dangerous GoodsGemini 2.5 Flash94.2%94.2%87% — Claude 4 Sonnet+7.2pt SOTA
Order FulfillmentGemini 3 Flash86.7%100%*no baseline
Video ClassificationGemini 3 Flash83.2%~100%*95% — Claude 4 Sonnet+5pt* SOTA
Customer ServiceGemini 2.5 Flash81.4%81.4%79% — Llama 3.3 70B+2.4pt SOTA
Traffic SpoofingGemini 3 Flash79.5%98.8%*86% — Claude 4.5 Sonnet+12.8pt* SOTA
Know Your BusinessGemini 3.1 Pro42.2%64.4%*58% — Claude 4.5 Opus+6.4pt* SOTA

* SOP-consistent TSR excludes tasks where the benchmark's CSV ground truth contradicts the SOP's explicit rules. Baseline TSR is measured on all tasks, so the comparison favors the baseline. Baselines are best-across-all-models from SOP-Bench paper v2 Table 5.

Not Run (4 domains — benchmark bugs)

DomainIssueFiled
Content Flaggingrandom.random() in tool implementations#2
Warehouse InspectionHardcoded po_number % 3 mock logic ignores CSV#3
Video Annotation20 of 26 tool methods are pass stubs returning None#6
Email IntentUnresolved git merge conflicts in 3 source files#7

Visualizations

Proceda SOP-Consistent TSR vs Best Published Baseline Proceda (SOP-consistent) Best Baseline (paper) 0% 25% 50% 75% 100% 100 Aircraft Insp. SOTA 99 Referral v2 SOTA 100* Referral v1 SOTA* ~100* Video Classif. SOTA* 98.8* Traffic Spoof. SOTA* 97 Patient Intake 94.2 Dangerous Gds. SOTA 81.4 Customer Svc. SOTA 64.4* KYB SOTA* 100* Order Fulfil. no baseline

* SOP-consistent TSR (excludes tasks where CSV ground truth contradicts SOP rules). Baseline measured on all tasks.

Proceda Raw TSR vs Best Published Baseline Proceda (raw TSR) Best Baseline (paper) 0% 25% 50% 75% 100% 100 Aircraft Insp. SOTA 99 Referral v2 SOTA 97 Patient Intake 95.5 Referral v1 94.2 Dangerous Gds. SOTA 86.7 Order Fulfil. no baseline 83.2 Video Classif. 81.4 Customer Svc. SOTA 79.5 Traffic Spoof. 42.2 KYB

Sorted by Proceda TSR. Baselines are best-across-all-models from SOP-Bench paper v2 Table 5.

Key Takeaways

4 domains SOTA by raw TSR — no caveats:

4 more domains SOTA on SOP-consistent tasks:

Patient Intake: 97% is arguably 100% in practice. The 2 “failures” (out of 66) are both the same edge case: when primary insurance returns “invalid,” the LLM reasons that prescription benefit validation is unnecessary and skips the tool call. This is a clinically reasonable judgment — a human would likely do the same. The benchmark scores it wrong because the tool would have returned “valid” if called. All 5 other output fields are correct in both tasks.

100% ECR across all domains. Every task completes. Failures are reasoning or extraction issues, never execution crashes.


Cheaper Models, Better Results

One of the most striking findings is the model cost story. The SOP-Bench paper's best baselines require frontier models — Claude 4.1 Opus, Claude 4.5 Opus, Claude 4 Sonnet. Proceda beats these with much lighter models:

Proceda ModelCost TierDomainsBaselines Beaten
Gemini 2.5 FlashLowestDangerous Goods, Customer Service, Patient Intake, Aircraft Insp.Claude 4 Sonnet, Llama 3.3 70B, Claude 3.7 Sonnet
Gemini 3 FlashLowReferral v1, Traffic Spoofing, Video Classif., Order Fulfil.Claude 3.5 v2, Claude 4.5 Sonnet, Claude 4 Sonnet
Gemini 3.1 ProMediumReferral v2, KYBClaude 4 Opus, Claude 4.5 Opus

The only domain that required Gemini 3.1 Pro where the SOP explicitly asks the agent to exercise subjective judgment is Know Your Business — the SOP says “use your experience” to distinguish typos from fraud, and notes that “risk scores are not reliable.” Referral Abuse v2 also used 3.1 Pro, but primarily because it was the model that achieved SOTA during our model comparison experiments.

For domains with clear, formulaic decision logic — even complex ones like Dangerous Goods (274 tasks, weighted scoring with imputation) — cheap models suffice when execution is structured.

The implication: a purpose-built SOP execution harness substitutes for raw model capability on procedural tasks. You don’t need a $75/MTok model to follow a procedure; you need a harness that decomposes the procedure into manageable steps and drives the model through them one at a time. See How Proceda Works for the architecture that makes this possible.


Benchmark Quality Contributions

During this evaluation, we identified and reported systematic issues in the SOP-Bench data. We frame these as contributions to the benchmark's quality — SOP-Bench is a valuable new benchmark, and these findings help improve it.

Domains with broken tools (4 domains, not runnable)

DomainIssueFiled
Content Flaggingrandom.random() in tool implementations#2
Warehouse InspectionHardcoded po_number % 3 mock logic ignores CSV#3
Video Annotation20 of 26 tool methods are pass stubs returning None#6
Email IntentUnresolved git merge conflicts in 3 source files#7

SOP/CSV labeling disagreements (4 domains)

These domains have functioning tools and were run successfully, but a portion of tasks have ground truth labels that contradict the SOP's explicit rules. An agent faithfully following the SOP gets penalized.

DomainDisagreementsFiledPattern
Referral Abuse v19 / 200#4CSV follows a closure-priority rule not in the SOP
Traffic Spoofing39 / 200#5Medium-risk tasks labeled “Warning Issued” when SOP says “Temporary Suspension”
Know Your Business31 / 90#8All “awaiting info” tasks have stronger escalation signals than “escalate” tasks
Video Classification9 / 196Implicit rules not derivable from SOP; stub tools suppress signals

The SOP-Bench paper lists “implicit knowledge that humans learn but rarely document” as a benchmark challenge. However, each agent evaluates tasks in isolation with no access to ground truth labels — there is no mechanism to learn unstated rules from data patterns.


How Proceda Works

The Problem with Existing Approaches

The SOP-Bench paper evaluates two agent architectures: Function Calling (tool use in a single prompt) and ReAct (thought-action-observation loop). Both dump the entire SOP as raw text into a single prompt and say “follow this.”

This works for short procedures. It fails for complex ones. Patient intake — rated the easiest domain by human experts — requires a 6-tool dependency chain where the final tool needs outputs from all 5 previous tools as input parameters. With Claude 3.5 Sonnet v2, both baseline architectures score 0% TSR. Scaling to Claude 4.1 Opus achieves 100% — but requiring a frontier model for the simplest procedure highlights the brittleness of the unstructured approach.

Convert, Structure, Execute

Proceda treats SOPs as first-class executable artifacts, not prompt context.

  1. Automated Conversion. proceda convert uses an LLM to transform unstructured SOP text into a structured SKILL.md file — a markdown document with YAML frontmatter and ### Step N: headings. The --tools flag passes tool schemas so the converter generates steps referencing exact tool names and parameter names. The --output-fields flag declares expected outputs so the final step emits structured XML tags. No hand-editing.
  2. Structured Execution. The runtime executes each step sequentially. The LLM sees only the current step's instructions plus context from previous steps. Tool schemas are injected with exact parameter names. The LLM must call complete_step with a summary before advancing. Circuit breakers catch infinite loops.
  3. Structured Output Extraction. When output_fields are declared, the system prompt instructs the LLM to emit <field_name>value</field_name> XML tags. The output extractor parses these deterministically — no fragile regex on free-form text.

Why This Enables SOTA with Cheaper Models

Structured execution reduces the cognitive load on the LLM at each decision point. Instead of reasoning about a 30-step procedure in a single prompt, the model handles one step at a time with clear instructions, available tools, and context from prior results.

This is why Gemini 2.5 Flash beats Claude 4 Sonnet on Dangerous Goods: the procedure is formulaic — weighted scoring with imputation rules — and Proceda's step decomposition makes each step tractable. The model doesn't need to “be smart”; it needs to follow instructions and call the right tool with the right parameters.

For a deeper technical analysis of the seven design decisions that enable this — state machine execution, control tools, context trimming, approval gates, structured extraction, guard rails, and event-driven observability — see Anatomy of a SOTA Agentic SOP-Execution Engine.

Operational Features for Production SOPs

Approval Gates

Steps marked [APPROVAL REQUIRED] or [PRE-APPROVAL REQUIRED] halt execution until a human approves, rejects, or skips. Decisions are logged with timestamps for audit trails.

Execution Observability

Every runtime transition emits a structured RunEvent — 20+ event types covering step lifecycle, tool calls, LLM usage, approvals, and errors. Persisted as append-only JSONL logs.

Session State Capture

Complete execution state — current step, conversation history, approval records — can be paused and resumed across hours, days, or weeks. Essential for SOPs spanning multiple review cycles.

MCP-Native Tools

Tools connect via the Model Context Protocol (open standard), supporting stdio and HTTP transports. No vendor lock-in. Access control via denylists and allowlists.


Conclusion

SOP-Bench demonstrates that faithfully executing complex procedures is a hard problem for AI agents. Proceda's structured approach — automated SOP conversion, step-by-step execution with tool integration, and human oversight — achieves state-of-the-art results on 4 domains outright and 8 of 10 on SOP-consistent tasks, often with significantly cheaper models than the published baselines.

The benchmark also has room to grow. We've filed 6 issues identifying broken tools and labeling inconsistencies across 7 domains, and hope these contributions help strengthen SOP-Bench as a standard for evaluating procedural AI.

Turn your SOPs into reliable AI workflows

Proceda converts Standard Operating Procedures into step-by-step AI agents with human oversight, tool integration, and full audit trails.