Technical Article

Proceda Voice: SOP-Driven Voice Agents

Voice agents are not just speech wrapped around a chatbot. They are low-latency procedural systems that have to keep state, enforce policy, and survive messy human conversation.

Source video: Proceda Voice: SOP-Driven Voice Agents, 17:49, published May 25, 2026.

The Voice-Agent Constraint

Voice agents are hard because the system has to optimize for two things that pull against each other: low-latency conversation and high-quality procedural reasoning.

A caller is on the line. The agent cannot spend seconds thinking. But it also cannot drift, skip gates, hallucinate policy, or forget why the call exists.

Text agents can often hide behind latency and backtracking. If a tool call takes a few seconds, the user can wait. If the model needs to inspect a long document, it can do that before answering. Voice is less forgiving. Dead air feels broken, interruptions are normal, and the human on the other end is often distracted, emotional, adversarial, or simply trying to get through a customer-service task quickly.

That makes the core challenge architectural. The system needs strong reasoning and tool use, but it needs them inside a loop fast enough for speech. It has to preserve intent across a complex call without handing the entire call transcript, every policy, every tool, and every branch to the model on every turn.

Latency

Keep the call moving

The agent needs a quick next utterance, even while state, tools, and validation are evolving underneath.

Procedure

Hold the intent

The caller may jump ahead, correct themselves, or ask side questions. The workflow still has to converge.

Policy

Refuse unsafe shortcuts

Authentication, confirmations, compliance language, and required side effects cannot be optional.

From SOPs To Calls

Proceda started as a text-in, text-out system for turning natural-language SOPs into executable agents. The author writes a procedure in a `SKILL.md`-style document: steps, fields, tools, gates, confirmations, and policy. The runtime handles the execution mechanics around that procedure.

The central insight is that the model should not be asked to understand the entire workflow at once. If the runtime narrows each model call to the current local job, the model has less to remember, fewer tools to choose among, and fewer ways to drift. That narrowing is part of why Proceda can make smaller models perform well on bounded procedural work.

Proceda Voice asks whether the same thesis can survive a phone call. The answer is yes, but with a change in the unit of narrowing. Text SOP execution can often advance step by step. Voice cannot assume that clean sequence, because callers do not speak in step order.

Positioning

Proceda Voice is SOP-native, not canvas-native. The human-authored artifact is still the procedure document, not a hand-drawn graph of every possible conversational branch.

The runtime can derive state, eligible facts, gates, prompts, tool eligibility, and audit evidence from that document.

Existing Solutions

The broader voice-agent market appears to be converging on the same lesson: large monolithic prompts are too brittle for complex, multi-turn calls. The common response is to add structure around the conversation so the model has a smaller local job.

Decagon

AOPs

Agent Operating Procedures combine natural-language workflow instructions with code, tools, metadata, and nodes that keep complex CX conversations on track.

Vapi

Squads

Squads split a complex voice workflow across specialized assistants, each with its own role, tool surface, and handoff conditions.

Pipecat

Flows

Pipecat Flows model structured conversations as a graph of nodes, focusing the model on one local task and the tools needed at that point.

These systems are similar to Proceda Voice in the most important way: they all reject the idea that one giant prompt should carry an entire production call. They add decomposition, state, tool scoping, and local instructions so the end-to-end conversation stays on rails.

The difference is the primary control surface. Flow builders ask the author to model a graph. Squad-style systems ask the author to split the call across assistants and handoffs. Proceda Voice asks the author to bring a natural-language SOP. The runtime then derives the active state, hard gates, eligible facts, tool permissions, and audit trace from that SOP.

Proceda Voice Difference

The new piece is not simply "natural language." The new piece is a structured natural-language procedure executed by a deterministic conversation engine.

Instead of asking the model to pick the next node or carry the whole procedure, Proceda Voice exposes a bounded slot frontier: the facts the caller is allowed to establish now, plus a small lookahead window for nearby future facts.

The Windowed Slot Frontier

The main design move is to stop narrowing to a single current step and instead narrow to a bounded set of candidate slots. A slot is a concrete fact the procedure needs: a policy number, a date, a location, an employee ID, a confirmation code, a maximum price, a preferred return-to-work time.

On each caller turn, Proceda Voice computes which slots are eligible now and which nearby future slots are safe to listen for. The model is allowed to extract facts only from that set. It can hear ahead, but only within the part of the SOP the runtime has made available.

Allowed: policy_number, incident_date, incident_location, injuries_reported, vehicle_damage

Closed: claim_payout, rental_reimbursement, fraud_escalation, settlement_offer, payment_method

This is bounded procedural attention. The model receives enough freedom to handle natural caller behavior, but not enough freedom to fill arbitrary future facts or decide that policy no longer applies.

The window solves the false choice between two weak designs. A rigid form asks for exactly one field at a time and misses useful out-of-order answers. A giant prompt asks the model to track every step, tool, policy, and branch while also holding a live conversation. Proceda Voice sits between them: a small allowed frontier, recomputed deterministically after each turn.
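One way to picture the deterministic recomputation is the minimal sketch below. It is not Proceda's implementation. It assumes each slot declares hard dependencies, that a slot is eligible once those dependencies are filled, and that the lookahead window is every open slot one dependency hop away, capped at a fixed size. The `Slot` type, the `compute_window` function, and the dependency edges in `CLAIM_SOP` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    requires: tuple[str, ...] = ()  # hypothetical hard dependencies


def compute_window(
    sop: list[Slot], filled: set[str], lookahead: int = 2
) -> tuple[list[str], list[str]]:
    """Return (eligible_now, listen_ahead) slot names in SOP order."""
    open_slots = [s for s in sop if s.name not in filled]
    # Eligible now: every dependency is already established.
    eligible = [s.name for s in open_slots if all(r in filled for r in s.requires)]
    # Listen-ahead: open slots one dependency hop away, capped at `lookahead`.
    reachable = filled | set(eligible)
    ahead = [
        s.name
        for s in open_slots
        if s.name not in eligible and all(r in reachable for r in s.requires)
    ][:lookahead]
    return eligible, ahead


# The dependency edges below are invented for illustration only.
CLAIM_SOP = [
    Slot("policy_number"),
    Slot("incident_date", ("policy_number",)),
    Slot("incident_location", ("policy_number",)),
    Slot("injuries_reported", ("incident_date",)),
    Slot("vehicle_damage", ("incident_date",)),
    Slot("settlement_offer", ("injuries_reported", "vehicle_damage")),
    Slot("payment_method", ("settlement_offer",)),
]

eligible, ahead = compute_window(CLAIM_SOP, filled={"policy_number"})
print(eligible)  # ['incident_date', 'incident_location']
print(ahead)     # ['injuries_reported', 'vehicle_damage']
```

Because the function depends only on the SOP and the set of filled slots, the same state always yields the same window, which is what makes the frontier auditable and replayable.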

If the caller volunteers a fact outside the current window, the system can safely ignore it and ask later. Missing a far-future fact is usually better than accepting a fact in the wrong procedural context.
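A hedged sketch of that rule: whatever the extraction model proposes, the runtime keeps only updates for slots inside the current window and records the rest as ignored. The function name `apply_extraction` and the shape of the proposed updates (a plain name-to-value mapping) are assumptions made for illustration.

```python
from __future__ import annotations


def apply_extraction(
    proposed: dict[str, str], window: set[str]
) -> tuple[dict[str, str], list[str]]:
    """Split model-proposed slot updates into accepted and ignored."""
    accepted = {name: value for name, value in proposed.items() if name in window}
    ignored = sorted(name for name in proposed if name not in window)
    return accepted, ignored


window = {"policy_number", "incident_date", "incident_location"}
proposed = {"incident_date": "last Tuesday", "payment_method": "direct deposit"}

accepted, ignored = apply_extraction(proposed, window)
print(accepted)  # {'incident_date': 'last Tuesday'}
print(ignored)   # ['payment_method']
```

Logging the ignored names, rather than silently dropping them, would let a later turn or a conformance review see that the caller mentioned the fact early.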

Architecture

Proceda Voice is deliberately not a full audio stack. The speech plane can be supplied by systems such as Vapi, Pipecat, or another realtime voice substrate. Proceda owns the procedural brain: state, extraction, gates, tools, response planning, and audit.

That separation matters because it keeps the voice infrastructure swappable. STT, TTS, endpointing, and interruption handling can be improved or replaced in the speech plane without moving policy logic into a vendor-specific flow graph.
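The boundary can be sketched as a narrow interface between the two planes. Everything below is hypothetical: `CallerTurn`, `TurnEngine`, and `ScriptedVoicePlane` are illustrative names, not part of Proceda, Vapi, or Pipecat. The scripted plane is a test double that feeds finalized utterances to whatever engine it is given.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CallerTurn:
    call_id: str
    text: str         # finalized transcript produced by the voice plane
    ended_at_ms: int  # timing metadata supplied by the voice plane


class TurnEngine(Protocol):
    def handle_turn(self, turn: CallerTurn) -> str:
        """Return the text the voice plane should speak next."""
        ...


class ScriptedVoicePlane:
    """Test double standing in for a real voice substrate."""

    def __init__(self, call_id: str, utterances: list[str]) -> None:
        self.call_id = call_id
        self.utterances = utterances

    def run(self, engine: TurnEngine) -> list[str]:
        spoken = []
        for i, text in enumerate(self.utterances):
            turn = CallerTurn(self.call_id, text, ended_at_ms=i * 1000)
            spoken.append(engine.handle_turn(turn))
        return spoken


class EchoEngine:
    """Trivial engine used only to show that the plane is engine-agnostic."""

    def handle_turn(self, turn: CallerTurn) -> str:
        return f"You said: {turn.text}"


print(ScriptedVoicePlane("call-1", ["hello"]).run(EchoEngine()))
# ['You said: hello']
```

Under this assumption, swapping the voice substrate means writing a different adapter that produces `CallerTurn` objects, while the engine and the SOP stay untouched.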

Voice Plane

Turn detection, STT, TTS, telephony, WebRTC, interruptions, recordings, and realtime transport.

Proceda Turn Engine

Computes the slot window, extracts structured facts, advances state, fires eligible tools, plans the response, and writes the audit trail.

SOP Artifact

The natural-language procedure: slots, steps, hard dependencies, prompts, confirmations, tools, policies, and final criteria.

Business Systems

CRMs, claims systems, HR systems, airline reservations, calendars, facilities tools, and other side-effecting APIs.

The Per-Turn Loop

The important property is that the model does not own the call state. The transcript is evidence. The canonical state is a typed object maintained by the runtime.
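As an illustration of "transcript as evidence, state as a typed object", the sketch below keeps the two apart and records which turn established each fact. The class and field names are assumptions, not Proceda's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SlotValue:
    value: str
    source_turn: int  # index of the transcript turn that established the fact


@dataclass
class CallState:
    transcript: list[str] = field(default_factory=list)        # evidence only
    slots: dict[str, SlotValue] = field(default_factory=dict)  # canonical facts

    def add_turn(self, text: str) -> int:
        self.transcript.append(text)
        return len(self.transcript) - 1

    def set_slot(self, name: str, value: str, source_turn: int) -> None:
        # A correction overwrites the value and updates its provenance.
        self.slots[name] = SlotValue(value, source_turn)


state = CallState()
t0 = state.add_turn("It happened on May 3rd.")
state.set_slot("incident_date", "2026-05-03", t0)
t1 = state.add_turn("Sorry, I meant May 4th.")
state.set_slot("incident_date", "2026-05-04", t1)

print(state.slots["incident_date"])
# SlotValue(value='2026-05-04', source_turn=1)
```

The model never edits `CallState` directly in this sketch. It can only propose updates, and runtime code decides whether to call `set_slot`.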

  1. Receive a finalized caller utterance. The voice plane provides text plus timing and call metadata.
  2. Compute the candidate window. Deterministic code derives the currently eligible slots and the bounded future frontier from SOP state.
  3. Extract only allowed facts. A model maps the utterance into structured slot updates, corrections, or no-op results.
  4. Advance state in code. The runtime validates gates, records provenance, fires eligible tools, and decides what remains unresolved.
  5. Plan the next utterance. The response can be verbatim for compliance, templated for confirmations, or generated for flexible conversational glue.
  6. Emit auditable output. The system logs the state delta, source turn, tool calls, emitted text, and latency for replay and conformance review.

This is a specialized harness for voice. The model still does language work, but the harness supplies the shape of the work: allowed facts, state transitions, policy gates, and stopping conditions.
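The six steps above can be sketched end to end in a few dozen lines. This is a toy under stated assumptions: the regex-based `stub_extract` stands in for the model call, the window is simplified to the next few unfilled slots in SOP order, and the reply is a fixed template. All names here are hypothetical.

```python
from __future__ import annotations

import re
import time
from dataclasses import dataclass

# Hypothetical patterns standing in for a model-based extractor.
PATTERNS = {
    "policy_number": r"\b(P\d{4})\b",
    "incident_date": r"\b(\d{4}-\d{2}-\d{2})\b",
}


@dataclass
class TurnRecord:
    utterance: str
    window: list[str]
    state_delta: dict[str, str]
    reply: str
    latency_ms: float


def stub_extract(utterance: str, allowed: list[str]) -> dict[str, str]:
    """Step 3 stand-in: only slots named in `allowed` can be returned."""
    found = {}
    for name in allowed:
        pattern = PATTERNS.get(name)
        match = re.search(pattern, utterance) if pattern else None
        if match:
            found[name] = match.group(1)
    return found


def run_turn(
    utterance: str,
    state: dict[str, str],
    sop_order: list[str],
    audit: list[TurnRecord],
    window_size: int = 2,
) -> TurnRecord:
    started = time.perf_counter()
    # Step 2: simplified window = next few unfilled slots in SOP order.
    window = [s for s in sop_order if s not in state][:window_size]
    # Step 3: extract only allowed facts.
    delta = stub_extract(utterance, window)
    # Step 4: advance state in code.
    state.update(delta)
    remaining = [s for s in sop_order if s not in state]
    # Step 5: plan the next utterance (templated here).
    if remaining:
        reply = f"Thanks. What is the {remaining[0].replace('_', ' ')}?"
    else:
        reply = "Thanks, I have everything I need."
    # Step 6: emit an auditable record.
    latency_ms = (time.perf_counter() - started) * 1000
    record = TurnRecord(utterance, window, delta, reply, latency_ms)
    audit.append(record)
    return record


state: dict[str, str] = {}
audit: list[TurnRecord] = []
sop = ["policy_number", "incident_date", "incident_location"]
record = run_turn("Policy P1234, it happened on 2026-05-01", state, sop, audit)
print(record.reply)  # Thanks. What is the incident location?
```

Gate validation and tool firing are omitted from step 4 to keep the sketch short. In a fuller version they would sit between the state update and the response plan.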

What The Demo Traces Show

The video ends with three animated call traces. They are useful because they stress the runtime in the ways ordinary demo scripts often avoid: side questions, corrections, missing fields, caller frustration, and attempts to bypass mandatory steps.

IT Service Desk

Conference room booking

The caller tries to skip identity verification and later asks to skip a required calendar invite. The agent keeps the workflow moving while enforcing both gates.

Medical HR

FMLA leave case

The caller is scattered, asks side questions, and supplies partial identity data. The runtime captures leave category, dates, confirmation, and return-to-work scheduling.

Airline Rebooking

Flight change

The caller is in a noisy airport and keeps checking constraints. The agent preserves route, date, arrival deadline, cost cap, seat preference, and final authorization.

In each trace, the interesting behavior is not that the agent sounds polite. It is that the runtime knows which facts have been established, which facts remain open, which side effects are allowed, and which policy constraints cannot be waived by caller pressure.
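The "cannot be waived by caller pressure" property can be sketched as a gate check that takes no conversational input at all. The example loosely mirrors the conference-room trace, but the tool names, gate names, and the `missing_gates` helper are invented for illustration and are not taken from the demo.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    required_gates: tuple[str, ...]


def missing_gates(tool: Tool, gates: dict[str, bool]) -> list[str]:
    """Return the gates still blocking `tool`; an empty list means it may fire."""
    return [g for g in tool.required_gates if not gates.get(g, False)]


book_room = Tool("book_room", ("identity_verified",))
close_call = Tool("close_call", ("identity_verified", "calendar_invite_sent"))

gates = {"identity_verified": False, "calendar_invite_sent": False}
print(missing_gates(book_room, gates))  # ['identity_verified']

gates["identity_verified"] = True
print(missing_gates(book_room, gates))   # []
print(missing_gates(close_call, gates))  # ['calendar_invite_sent']
```

Since the caller's words are never an argument to `missing_gates`, a request to skip verification can change the agent's reply but not the outcome of the check.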