Specialized Harness Engineering

The Need For Specialized Harnesses

Harness engineering is the design of the runtime around the model: what it sees, what it can do, when it can act, how outputs are interpreted, how errors are handled, and how humans supervise the work.

Specialized harness engineering narrows that idea to a task family. It encodes the stable structure of the work into the runtime, then reserves the model for the fuzzy parts where language and judgment actually matter.

There is a spectrum of automation. On one end are completely deterministic workflows: ordinary code, business rules, workflow engines, and classic Robotic Process Automation. These work well when every branch is known in advance, the inputs are structured, and the right action can be specified precisely.

On the other end are fully dynamic, general-purpose harnesses: Claude Code, OpenAI Codex, and similar agents that can take almost any task, gather context, use tools, and try to solve it with a powerful frontier model. These are valuable when the work is exploratory, ambiguous, or genuinely open-ended.

Left Edge

Deterministic Workflows

Code, rules engines, workflow software, and RPA. Reliable and cheap when the process can be fully specified.

Broad Middle

Specialized Harnesses

A known business process with local reasoning, natural language understanding, tool calls, and controlled branching.

Right Edge

General-Purpose Harnesses

Open-ended agents with broad context, broad tools, and strong models. Powerful, but expensive and less constrained.

In practice, most business use cases live in the broad middle. There is usually a high-level process to follow, a rough flowchart, a set of allowed actions, and a desired final state. But inside that process, the system still needs judgment: reading messy natural language, matching records, deciding which tool to call, handling exceptions, asking for clarification, or choosing the next branch.

Specialized harnesses shine in this middle zone. They give the model enough freedom to handle ambiguity, but enough structure to make the work cheaper, more reliable, easier to audit, and easier to improve.

The economics improve because the harness narrows context, scopes tools, reduces unnecessary tokens, and often lets cheaper models complete the job. Performance improves because the runtime carries the process, state, policy, verification, and failure handling instead of asking the model to invent all of that from scratch.

This is the lesson behind Proceda's SOP execution architecture: the win does not come from asking a model to be generally smarter. It comes from reshaping the job into smaller, typed, observable steps.

General Harness

Task interpretation is open.
Plans are improvised.
Tool choice is broad.
Context is accumulated.
Stopping conditions are loose.
Outputs are often prose-first.

Specialized Harness

The task is typed.
State is explicit.
Tools are scoped.
Context is routed.
Transitions are controlled.
Outputs are structured.

What The Harness Controls

A specialized harness is not just a prompt with tools attached. It is the execution environment that turns a language model into a reliable worker for one narrow task family.

State

Current step, completed work, extracted fields, approvals, errors, retries, and final outputs.

Context

The exact instructions, facts, tool results, memories, examples, and policies relevant now.

Tools

The scoped action space, typed inputs, normalized outputs, permissions, and side effects.

Policy

Approval gates, access boundaries, spend limits, audit rules, and write constraints.

Verification

Schemas, deterministic checks, tests, citations, reconciliation, and critic passes.

Human UX

The human surfaces for authoring, review, approval, correction, explanation, and audit.

The Principles

The principles below are a practical checklist for designing a harness around a narrow but valuable category of work.

Separate structure from judgment

Move stable parts of the job into the runtime: steps, rules, schemas, required fields, tool dependencies, approvals, and output formats. Use the model for interpretation, ambiguity, mapping, summarization, and exceptions.

Reduce cognitive load on the model

Each model call should have a local objective, relevant facts, and the smallest useful tool surface. The harness should make the next action obvious.

Make state the source of truth

The transcript is not the system of record. Keep typed task state: lifecycle, fields, unresolved questions, evidence, approvals, validation status, and final artifacts.

Use synthetic tools for control

The harness can create tools that are not underlying APIs: complete, clarify, escalate, approve, retry, or report_blocked. These tools make control flow explicit.

First-class output contracts

Prefer structured outputs for both intermediate and final steps: JSON, XML tags, typed events, patches, records, forms, or document templates. Free-form prose should not be the control surface.

Enforce policy and verification in the harness

Approval thresholds, permissions, business rules, schema checks, tests, citation requirements, and source-data reconciliation should live outside the model wherever possible.

Make failure a first-class state

Blocked, invalid, needs clarification, failed validation, needs approval, partial, escalated, canceled, and retrying are product states. The harness should degrade into known states, not open-ended loops.

Context management should be done in a domain-specific way

The task should decide what gets preserved, compacted, retrieved, or biased toward recency. This is not generic message trimming.

Instrument everything

Trace prompts, exposed tools, tool calls, arguments, responses, state changes, validations, costs, failures, and final outputs. Failed traces are the raw material for improving the harness.

Use the cheapest sufficient model

Once the task is decomposed, route local operations to smaller models where possible. A well-designed harness can improve correctness while lowering model cost.

Design the human role deliberately

Humans may author procedures, approve sensitive actions, resolve exceptions, teach the system, audit decisions, or receive the final result. Show task state, evidence, proposed action, confidence, and blocked reasons.

Encode a domain language

The best harnesses develop small domain languages: steps, approvals, clauses, risks, citations, tickets, patches, migrations, accounts, evidence, or disposition codes. The runtime compiles that language into model calls, tool calls, checks, and outputs.

Evaluate the task, not the model

Measure completion rate, correctness, path accuracy, extraction accuracy, false escalations, cost per completed case, latency, human touches, audit pass rate, and regression rate after changes.

Detailed Explanations

Separate structure from judgment

The first move is to decompose the work into what is structurally known and what actually requires intelligence. The known parts are steps, required fields, allowed branches, approval rules, tool dependencies, output schemas, and side-effect boundaries. Those should move into the harness.

The model should handle the irreducibly fuzzy parts: interpreting messy input, matching ambiguous records, choosing among plausible categories, summarizing evidence, or deciding whether an exception applies. The model should not be the operating system for the workflow. The harness should be.
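As a minimal sketch of that split, assume a hypothetical refund workflow. The step order and the approval threshold live in ordinary code, and only the category decision is delegated to a model. The names `run_case`, `fake_classifier`, and `APPROVAL_LIMIT` are illustrative assumptions, and the classifier is a stub standing in for a real model call.

```python
from typing import Callable

# Hypothetical refund workflow: the step order and the approval rule are
# fixed in code; only the category decision is delegated to a model.
STEPS = ["classify_request", "check_policy", "draft_reply"]
APPROVAL_LIMIT = 100.0  # assumed business rule, for illustration only


def run_case(message: str, amount: float, classify: Callable[[str], str]) -> dict:
    """Run the fixed skeleton; `classify` stands in for a model call."""
    result = {"steps_run": [], "category": None, "needs_approval": False}
    for step in STEPS:
        if step == "classify_request":
            # Judgment: interpreting messy text is the model's job.
            result["category"] = classify(message)
        elif step == "check_policy":
            # Structure: a threshold comparison needs no model.
            result["needs_approval"] = amount > APPROVAL_LIMIT
        result["steps_run"].append(step)
    return result


def fake_classifier(message: str) -> str:
    """Stub in place of a model; a real harness would call an LLM here."""
    return "refund" if "refund" in message.lower() else "other"


outcome = run_case("I want a refund for my order", 250.0, fake_classifier)
```

The model never sees the approval rule and cannot skip a step, because neither is expressed as something it is asked to decide.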

Reduce cognitive load on the model

The crude approach is to give the model the whole task, all the context, all the tools, and all the rules. The harness-engineered approach is to give it the local objective, the relevant facts, and the narrow action surface for the current point in the workflow.

This is not only a token-saving trick. It changes the difficulty of the model's job. Smaller local calls reduce planning errors, tool-selection errors, context distraction, and unnecessary deliberation. A specialized harness behaves like a manager placing the right file, instruction, and tool in front of the worker at the right moment.
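One way to picture this is a step table that decides, per step, which objective, facts, and tools the model receives. The sketch below assumes a hypothetical invoice workflow; `StepSpec`, `STEP_SPECS`, and `build_call` are invented names, and the returned dictionary is only a stand-in for whatever request format a real model API expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSpec:
    objective: str
    fact_keys: tuple   # which case facts this step is allowed to see
    tools: tuple       # which tools are exposed at this step


# Hypothetical step table for an invoice workflow.
STEP_SPECS = {
    "match_vendor": StepSpec(
        objective="Pick the vendor record that matches the invoice.",
        fact_keys=("vendor_name", "candidate_vendors"),
        tools=("select_vendor", "request_clarification"),
    ),
    "check_total": StepSpec(
        objective="Confirm the invoice total matches the purchase order.",
        fact_keys=("invoice_total", "po_total"),
        tools=("complete_step", "report_blocked"),
    ),
}


def build_call(step: str, case_facts: dict) -> dict:
    """Assemble only the local objective, relevant facts, and scoped tools."""
    spec = STEP_SPECS[step]
    return {
        "objective": spec.objective,
        "facts": {k: case_facts[k] for k in spec.fact_keys if k in case_facts},
        "tools": list(spec.tools),
    }


facts = {
    "vendor_name": "Acme Ltd",
    "candidate_vendors": ["ACME Limited", "Acme LLC"],
    "invoice_total": 1200.0,
    "po_total": 1200.0,
    "internal_notes": "not relevant to either step",
}
call = build_call("check_total", facts)
```

Here the `check_total` call carries two numbers and two tools, regardless of how much else the case has accumulated.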

Make state the source of truth

Many agent systems are secretly chat logs with tool calls attached. That can work for general exploration, but specialized work needs explicit state: current step, completed steps, extracted fields, unresolved questions, validation status, approvals, errors, retries, final outputs, and audit records.

The conversation with the model is only one way of updating that state. A contract-review harness should track documents reviewed, clauses extracted, risks identified, source citations, reviewer comments, and final recommendations. A support harness should track customer identity, issue category, account status, attempted remedies, escalation reason, and final disposition.
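The support-harness fields listed above could be held in a typed record like the following sketch. The class name, the choice of which fields count as required, and the `apply_update` helper are assumptions made for illustration; the point is only that updates land in typed state rather than in a transcript.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class SupportCaseState:
    customer_id: Optional[str] = None
    issue_category: Optional[str] = None
    account_status: Optional[str] = None
    attempted_remedies: list = field(default_factory=list)
    escalation_reason: Optional[str] = None
    final_disposition: Optional[str] = None

    def unresolved(self) -> list:
        """Required fields that are still empty (assumed requirement set)."""
        required = ("customer_id", "issue_category", "account_status")
        return [name for name in required if getattr(self, name) is None]


_FIELD_NAMES = {f.name for f in fields(SupportCaseState)}


def apply_update(state: SupportCaseState, update: dict) -> SupportCaseState:
    """Apply a model- or tool-produced update, rejecting unknown fields."""
    for key, value in update.items():
        if key not in _FIELD_NAMES:
            raise KeyError(f"unknown state field: {key}")
        setattr(state, key, value)
    return state


state = SupportCaseState()
apply_update(state, {"customer_id": "C-1001"})
```

Because the record knows which fields are unresolved, the controller can decide the next step from state alone, without rereading the conversation.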

Use synthetic tools for control

A harness is not limited to the external tools it connects to. It can create synthetic tools that exist only to shape the interaction between the model and the runtime. These are not domain APIs like search CRM or update ticket. They are control tools like complete_step, request_clarification, escalate_to_human, propose_retry, defer_action, report_blocked, or submit_structured_fields.

This turns prose into protocol. The runtime no longer has to guess whether the model is done, blocked, confused, asking permission, or ready to advance. The model uses a synthetic tool, the harness updates state, and the next step follows from a typed transition.
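A minimal sketch of that typed transition follows. It reuses four of the control tool names from the paragraph above; the status strings, `RunState`, and `handle_control_call` are hypothetical, and a real harness would carry richer arguments and state.

```python
from dataclasses import dataclass

# Hypothetical mapping from control tool name to the transition it triggers.
CONTROL_TOOLS = {
    "complete_step": "advance",
    "request_clarification": "needs_clarification",
    "escalate_to_human": "escalated",
    "report_blocked": "blocked",
}


@dataclass
class RunState:
    steps: list
    index: int = 0
    status: str = "running"
    note: str = ""


def handle_control_call(state: RunState, tool: str, args: dict) -> RunState:
    """Translate a control tool call into an explicit state transition."""
    if tool not in CONTROL_TOOLS:
        raise ValueError(f"not a control tool: {tool}")
    transition = CONTROL_TOOLS[tool]
    if transition == "advance":
        state.index += 1
        state.status = "done" if state.index >= len(state.steps) else "running"
    else:
        # Every non-advance outcome becomes a named status with a reason.
        state.status = transition
        state.note = args.get("reason", "")
    return state


run = RunState(steps=["verify_identity", "issue_refund"])
handle_control_call(run, "complete_step", {})
handle_control_call(run, "report_blocked", {"reason": "refund API unavailable"})
```

The runtime never parses prose to learn that the run is blocked; the tool name carries the meaning and the reason is recorded alongside it.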

First-class output contracts

Specialized harnesses should prefer structured outputs wherever the runtime needs to act on the result. This applies to intermediate steps as much as final answers. If a step extracts fields, classifies a case, selects an account, records evidence, or proposes an action, the output should be JSON, XML tags, a typed event, a patch, a database row, or another explicit contract.

Natural language is still useful for explanation and human review. It should not be the control surface. Downstream systems need determinism, and the harness should not rely on fragile parsing of free-form prose to know what happened.
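To make the idea concrete, here is a sketch of a contract check for a classification step, using only the standard library. The field names, the allowed categories, and `parse_step_output` are assumptions; a production harness might use a schema library instead.

```python
import json
from typing import Optional, Tuple

# Hypothetical contract for a classification step: field name -> expected type.
CLASSIFY_CONTRACT = {"category": str, "confidence": float, "evidence": list}
ALLOWED_CATEGORIES = {"billing", "technical", "account"}


def parse_step_output(raw: str) -> Tuple[Optional[dict], list]:
    """Return (data, errors); data is None whenever the contract is violated."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for name, expected in CLASSIFY_CONTRACT.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"wrong type for {name}")
    category = data.get("category")
    if isinstance(category, str) and category not in ALLOWED_CATEGORIES:
        errors.append("category not in allowed set")
    return (data if not errors else None), errors


good, good_errors = parse_step_output(
    '{"category": "billing", "confidence": 0.92, "evidence": ["invoice line 3"]}'
)
bad, bad_errors = parse_step_output("The category is probably billing.")
```

A prose answer is rejected outright rather than parsed heuristically, which gives the harness a clean signal to retry or move to a failure state.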

Enforce policy and verification in the harness

The model can interpret policy, but it should not be the only thing enforcing policy. Approval thresholds, access control, spend limits, production writes, PII handling, rate limits, retry rules, and audit requirements should be represented in the harness.

The same is true for verification. Use schema validation, deterministic business rules, tests, API precondition checks, citation coverage, duplicate detection, source-data reconciliation, or independent critic passes. Prompting says "be careful." Harness engineering says "here is the checker that catches mistakes."
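The sketch below shows one possible shape for such a checker, assuming a hypothetical refund action. The limit, the role names, and both function names are invented for illustration. Neither function consults the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    kind: str        # e.g. "issue_refund"
    amount: float
    actor_role: str


# Assumed policy values, for illustration only.
AUTO_APPROVE_LIMIT = 50.0
ROLES_ALLOWED_TO_REFUND = {"support_agent", "supervisor"}


def check_policy(action: ProposedAction) -> str:
    """Return 'allow', 'needs_approval', or 'deny' from fixed rules."""
    if action.kind == "issue_refund":
        if action.actor_role not in ROLES_ALLOWED_TO_REFUND:
            return "deny"
        if action.amount > AUTO_APPROVE_LIMIT:
            return "needs_approval"
    return "allow"


def verify_refund(action: ProposedAction, order_total: float) -> list:
    """Deterministic reconciliation of the proposal against source data."""
    problems = []
    if action.amount <= 0:
        problems.append("refund amount must be positive")
    if action.amount > order_total:
        problems.append("refund exceeds order total")
    return problems


proposal = ProposedAction(kind="issue_refund", amount=80.0, actor_role="support_agent")
decision = check_policy(proposal)
problems = verify_refund(proposal, order_total=60.0)
```

In this example the model's proposal is both gated (it needs approval) and caught (it exceeds the order total), whatever the prompt said about being careful.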

Make failure a first-class state

A production harness has to represent more than the happy path. Missing information, ambiguous evidence, unavailable tools, permissions errors, low-confidence classifications, validation failures, contradictory inputs, and nonresponsive humans are all normal workflow states.

The harness should degrade into known, named conditions rather than open-ended loops: blocked, needs clarification, failed validation, needs approval, partially complete, escalated, or retrying. Naming these conditions is what makes the system operable.
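As a sketch, the named conditions can be an enumeration, with a small function that maps each error to one of them. The error strings, the retry limit, and `next_status` are assumptions; what matters is that every path, including an unrecognized error, ends in a named status instead of another loop iteration.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    COMPLETE = "complete"
    NEEDS_CLARIFICATION = "needs_clarification"
    FAILED_VALIDATION = "failed_validation"
    NEEDS_APPROVAL = "needs_approval"
    BLOCKED = "blocked"
    RETRYING = "retrying"
    ESCALATED = "escalated"


MAX_RETRIES = 2  # assumed limit, for illustration only


def next_status(error: Optional[str], retries: int) -> Status:
    """Map a step outcome to a named status; never loop without a bound."""
    if error is None:
        return Status.COMPLETE
    if error == "missing_info":
        return Status.NEEDS_CLARIFICATION
    if error == "approval_required":
        return Status.NEEDS_APPROVAL
    if error == "permission_denied":
        return Status.BLOCKED
    if error == "tool_unavailable":
        return Status.RETRYING if retries < MAX_RETRIES else Status.ESCALATED
    if error == "validation_failed":
        return Status.RETRYING if retries < MAX_RETRIES else Status.FAILED_VALIDATION
    # Anything unrecognized goes to a human rather than back to the model.
    return Status.ESCALATED
```

Operators can then count, alert on, and build queues around these statuses, which is not possible when failure only shows up as a long transcript.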

Context management should be done in a domain-specific way

Context management is often treated as generic compaction: summarize older messages, keep recent messages, and hope the important details survive. In a specialized harness, context management should be guided by the task itself.

Some domains should preserve specific fields, citations, approvals, or tool outputs no matter how old they are. Others may bias toward recency because the latest state supersedes earlier evidence. Some information should be retrieved from source systems when needed; some should live in structured state rather than conversational memory. A large context window is not a substitute for domain-specific relevance.
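A minimal sketch of one such rule: pin certain kinds of items regardless of age, and keep only the most recent of everything else. Which kinds are pinned is a domain decision; the set used here, along with `ContextItem` and `select_context`, is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    kind: str   # e.g. "approval", "citation", "tool_output", "chat"
    turn: int   # when the item was produced
    text: str


PINNED_KINDS = {"approval", "citation"}  # assumed domain rule


def select_context(items: list, recent_limit: int) -> list:
    """Keep every pinned item plus the most recent unpinned ones, in turn order."""
    pinned = [i for i in items if i.kind in PINNED_KINDS]
    others = sorted(
        (i for i in items if i.kind not in PINNED_KINDS), key=lambda i: i.turn
    )
    # Guard the zero case: a slice of [-0:] would return the whole list.
    recent = others[-recent_limit:] if recent_limit > 0 else []
    return sorted(pinned + recent, key=lambda i: i.turn)


history = [
    ContextItem("approval", 1, "Manager approved refunds up to 200"),
    ContextItem("chat", 2, "Customer greeting"),
    ContextItem("chat", 3, "Customer describes issue"),
    ContextItem("tool_output", 4, "Order lookup result"),
    ContextItem("chat", 5, "Customer confirms address"),
]
selected = select_context(history, recent_limit=2)
```

Generic recency trimming with the same budget would have dropped the approval from turn 1, which is exactly the item the later steps depend on.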

Instrument everything

Traces should be a product primitive, not an afterthought. A run should show what step was active, what prompt was sent, what tools were exposed, what tool was called, what arguments were used, what output came back, what state changed, what validation passed, what validation failed, what the model cost, where a human intervened, and what final artifact was produced.

These traces are how the harness improves over time. Failed traces show exactly where the runtime, prompt, state model, tool surface, or verifier needs to change. That learning loop can even be automated: Karpathy's autoresearch is a useful reference point for the pattern of letting agents run experiments, evaluate outcomes, keep improvements, discard failures, and accumulate progress from the trace.
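A trace recorder does not need to be elaborate to be useful. The sketch below keeps an append-only list of events per run and can export it as JSON Lines for replay. The event kinds, the payload shapes, and the class names are assumptions rather than a prescribed format.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    step: str
    kind: str        # e.g. "prompt", "tool_call", "validation", "state_change"
    payload: dict
    cost_usd: float = 0.0
    ts: float = field(default_factory=time.time)


class Trace:
    """Append-only record of one run."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events = []

    def record(self, step: str, kind: str, payload: dict, cost_usd: float = 0.0) -> None:
        self.events.append(TraceEvent(step, kind, payload, cost_usd))

    def failed_validations(self) -> list:
        return [
            e for e in self.events
            if e.kind == "validation" and not e.payload.get("passed", False)
        ]

    def total_cost(self) -> float:
        return sum(e.cost_usd for e in self.events)

    def to_jsonl(self) -> str:
        """One JSON object per event, suitable for storage and replay."""
        return "\n".join(
            json.dumps({"run_id": self.run_id, **asdict(e)}) for e in self.events
        )


trace = Trace("run-1")
trace.record("extract_total", "prompt", {"tools_exposed": ["submit_fields"]}, cost_usd=0.25)
trace.record("extract_total", "validation", {"passed": False, "error": "missing total"})
```

Filtering for failed validations across many runs is then a query over stored events, which is where failure clustering and regression suites start.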

Use the cheapest sufficient model

The economic promise of a specialized harness is that structure can substitute for raw model capability. Once the task is decomposed, many local operations can be handled by smaller, cheaper, faster models because the harness has already reduced the degrees of freedom.

That does not have to sacrifice task performance. Proceda's SOP-Bench results show the stronger claim: with the right specialized harness, cheaper models can match or beat stronger frontier-model baselines on bounded procedural work. The harness can improve correctness while lowering cost.

This implies model routing. Extraction from a known template may use a cheap model. Clear classification may use a cheap model. Ambiguous exception handling may use a stronger model. Verification may be deterministic code. The goal is not to worship small models; it is to spend frontier-model intelligence only where it actually creates leverage.
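The routing described above can be as simple as a lookup table. In this sketch the model names are placeholders rather than real model identifiers, and the choice to send unknown operations to the stronger model is an assumption, not a rule from the source.

```python
from typing import Optional

# Placeholder names; substitute whatever models are actually available.
SMALL_MODEL = "small-model"
STRONG_MODEL = "frontier-model"

ROUTES = {
    "template_extraction": SMALL_MODEL,
    "clear_classification": SMALL_MODEL,
    "ambiguous_exception": STRONG_MODEL,
    "verification": None,  # handled by deterministic code, no model call
}


def route(operation: str) -> Optional[str]:
    """Return the model for an operation, or None when plain code handles it."""
    if operation not in ROUTES:
        # Assumed default: unknown work goes to the stronger model.
        return STRONG_MODEL
    return ROUTES[operation]
```

Keeping the table in the harness, not in a prompt, means the routing can be changed and measured per operation as task-level evaluation results come in.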

Design the human role deliberately

Humans are not merely emergency fallbacks. They can be authors, approvers, reviewers, exception resolvers, teachers, auditors, or customers. A good harness knows which role is needed at which point in the workflow.

The interface should expose task state rather than raw transcript: current step, evidence, proposed action, confidence, blocked reason, validation result, and approval options. This is how agentic work becomes legible and trustworthy.
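As an illustration, a review surface could be fed by a small record built from task state instead of from the transcript. The field names follow the list above; `ReviewCard`, `build_review_card`, the dictionary keys, and the option labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewCard:
    current_step: str
    evidence: tuple
    proposed_action: str
    confidence: float
    blocked_reason: Optional[str]
    validation_result: str
    approval_options: tuple


def build_review_card(state: dict) -> ReviewCard:
    """Project task state into what a human reviewer needs to decide."""
    blocked = state.get("blocked_reason")
    # Assumed option sets: a blocked case cannot simply be approved.
    options = ("resolve", "cancel") if blocked else ("approve", "reject", "edit")
    return ReviewCard(
        current_step=state["current_step"],
        evidence=tuple(state.get("evidence", [])),
        proposed_action=state.get("proposed_action", ""),
        confidence=float(state.get("confidence", 0.0)),
        blocked_reason=blocked,
        validation_result=state.get("validation_result", "not_run"),
        approval_options=options,
    )


card = build_review_card({
    "current_step": "issue_refund",
    "evidence": ["Order 123 delivered damaged", "Photo attached"],
    "proposed_action": "Refund 80.00 to original payment method",
    "confidence": 0.87,
    "validation_result": "passed",
})
```

The reviewer sees a decision to make, with its evidence, instead of a conversation to read.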

Encode a domain language

The best specialized harnesses develop a small language for the domain. In an SOP harness, that language might include steps, approval markers, tool names, output fields, completion signals, and blocked states. In a contract harness, it might include clauses, obligations, counterparties, risks, citations, and reviewer dispositions.

This is where harness engineering starts to look like compiler design. The input may be natural language, markdown, examples, templates, or business rules. The harness compiles that into model calls, tool calls, checks, state transitions, and artifacts.
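To show the compiler analogy in miniature, the sketch below parses a toy SOP notation into typed steps. The notation itself (numbered lines with `[tool: name]` and `[approval]` tags) is invented for this example and is not Proceda's format; `compile_sop` and `Step` are likewise hypothetical.

```python
import re
from dataclasses import dataclass

STEP_RE = re.compile(r"^\s*(\d+)\.\s+(.*)$")
TAG_RE = re.compile(r"\[(\w+)(?::\s*([\w-]+))?\]")


@dataclass(frozen=True)
class Step:
    number: int
    instruction: str
    tools: tuple
    requires_approval: bool


def compile_sop(text: str) -> list:
    """Turn numbered SOP lines into typed steps the runtime can execute."""
    steps = []
    for line in text.splitlines():
        match = STEP_RE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not a step
        number, body = int(match.group(1)), match.group(2)
        tags = TAG_RE.findall(body)  # list of (name, value) pairs
        tools = tuple(value for name, value in tags if name == "tool" and value)
        approval = any(name == "approval" for name, _ in tags)
        instruction = TAG_RE.sub("", body).strip()
        steps.append(Step(number, instruction, tools, approval))
    return steps


SOP_TEXT = """
1. Look up the customer [tool: crm_lookup]
2. Issue the refund [tool: issue_refund] [approval]
3. Send confirmation email [tool: send_email]
"""
compiled = compile_sop(SOP_TEXT)
```

Each compiled step already says which tools to expose and whether an approval gate applies, so the controller can enforce both without asking the model.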

Evaluate the task, not the model

Generic model benchmarks do not tell you whether your invoice harness, support harness, diligence harness, migration harness, or SOP harness works. You need task-level measurements: completion rate, correctness, tool-call accuracy, path correctness, extraction accuracy, false escalation rate, cost per completed case, latency, human touches, audit-pass rate, and regression rate.

The unit of evaluation is the whole task family. The question is not whether the model sounded plausible. The question is whether the harness completed the work correctly, cheaply, safely, and with an acceptable amount of human intervention.
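A few of these measurements can be computed directly from per-case results, as in the sketch below. The metric definitions are assumptions: for example, correctness is taken over completed cases, and total cost, including failed cases, is divided by completed cases. A real evaluation should state its own definitions explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    completed: bool
    correct: bool
    escalated: bool
    escalation_needed: bool   # ground-truth label from review
    cost_usd: float
    human_touches: int


def summarize(results: list) -> dict:
    """Task-level metrics over a batch of cases (assumed definitions)."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    completed = [r for r in results if r.completed]
    escalated = [r for r in results if r.escalated]
    false_escalations = [r for r in escalated if not r.escalation_needed]
    total_cost = sum(r.cost_usd for r in results)
    return {
        "completion_rate": len(completed) / n,
        "correctness": (
            sum(r.correct for r in completed) / len(completed) if completed else 0.0
        ),
        "false_escalation_rate": (
            len(false_escalations) / len(escalated) if escalated else 0.0
        ),
        # Failed cases still cost money, so they count in the numerator.
        "cost_per_completed_case": (
            total_cost / len(completed) if completed else float("inf")
        ),
        "human_touches_per_case": sum(r.human_touches for r in results) / n,
    }


batch = [
    CaseResult(True, True, False, False, 0.25, 0),
    CaseResult(True, False, False, False, 0.25, 1),
    CaseResult(False, False, True, True, 0.5, 1),
    CaseResult(True, True, True, False, 0.5, 2),
]
metrics = summarize(batch)
```

Running the same batch through two harness versions, or two model routings, turns these numbers into a regression check.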

Architecture Template

For a new specialized task, the harness usually falls into these layers. The model is important, but it is not the whole product.

Human UX

Authoring, review, approval, correction, explanation, audit, and customer-facing surfaces.

Observability & Evals

Traces, metrics, replay, failure clustering, model comparisons, regression suites, and audit trails.

Policy & Verification

Validators, permissions, approval gates, circuit breakers, tests, schema checks, and business rules.

Model Interaction

Prompts, retrieval, model routing, structured outputs, local instructions, and control tools.

Controller

State machine, DAG, workflow engine, step executor, planner, or queue processor.

Domain State

Typed objects representing cases, claims, invoices, SOP runs, documents, tickets, repos, approvals, and outputs.

Tools & Integrations

APIs, databases, browsers, SaaS systems, files, sandboxes, search, email, calendars, CRMs, and internal services.

Design Process

Start from concrete work, not architecture diagrams. The structure of the harness should be discovered from real examples.

Collect real cases. Capture inputs, actions, tools, decisions, outputs, mistakes, and human checks.
Find the stable skeleton. Identify what always happens, what usually happens, branches, final states, deterministic rules, and judgment-heavy moments.
Define canonical state. Create typed lifecycle states and fields. Keep the transcript secondary.
Locate the model slots. Decide where language understanding, fuzzy matching, exception handling, or generation is actually needed.
Constrain the action space. Set tool availability, permissions, preconditions, reversibility, idempotency, and approval requirements.
Specify output contracts. Define required fields, citations, explanations, patches, records, or form submissions.
Enumerate failure states. Design blocked, invalid, low-confidence, missing-info, tool-failed, and validation-failed paths deliberately.
Instrument from the beginning. Make every run replayable and every failure inspectable.
Optimize the model portfolio. Try smaller, cheaper, faster models once the runtime has reduced the task.

From Prompt To Runtime

Proceda is one example of this philosophy applied to standard operating procedures: bounded state, scoped tools, approval gates, structured extraction, and observable execution.
