
There's a captivating simplicity to the way most people talk about deploying AI in regulated industries. You find the right model, you fine-tune it on the right data, you wrap it in a compliance layer, and you ship. The model is the product. The model is the moat.
We've spent a long time building in regulated environments, and we believe that framing is wrong, or at least incomplete in a way that becomes catastrophically apparent the moment you move from prototype to production. The model is table stakes. The hard problem is what sits beneath it: the infrastructure that governs how AI systems decompose complex goals, assign work across agents and humans, maintain accountability through every step, and remain safe and auditable at scale.
The hard problem is delegation.
This post is our attempt to articulate what we believe that means, why most current approaches get it wrong, and what a serious architecture really looks like.
What Delegation Actually Is
When engineers talk about multi-agent systems, delegation usually means task routing: one agent passes work to another. That's necessary, but it's not sufficient. Real delegation, the kind that holds up in production, under regulatory scrutiny, with real consequences for failure, involves something considerably more demanding:
Transfer of authority. What is this agent permitted to decide on its own? Under what circumstances must it escalate?
Transfer of responsibility. If this agent fails, who is accountable? Where does liability sit?
Trust calibration. How much confidence should the system place in this agent's judgment, and based on what evidence? How does that confidence update over time?
Monitoring. What is actually happening inside this agent's execution, and how does the delegating system know in real time?
Verifiability. When the agent reports completion, how does the system confirm that what was claimed actually occurred?
In human organizations, these questions are answered through institutional structure: approval hierarchies, audit requirements, professional licensing, job titles, all combined with trust built through repeated interactions. None of that scaffolding exists by default in an AI system. You have to engineer it explicitly. And in regulated industries, where the consequences of failure are not just operational but legal, reputational, and human, engineering it correctly is the entire game.
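One way to make the scaffolding explicit is to treat each delegation as a structured contract rather than an implicit handoff. The sketch below is illustrative, not a prescribed schema; every field name is a hypothetical stand-in for one of the five questions above.

```python
from dataclasses import dataclass

@dataclass
class DelegationContract:
    """Illustrative record of what a single delegation transfers.

    Each of the five questions (authority, responsibility, trust,
    monitoring, verifiability) becomes an explicit, machine-checkable
    field rather than an implicit assumption.
    """
    task: str
    delegatee: str
    authority: set[str]            # decisions the agent may make alone
    escalation_triggers: set[str]  # conditions that force escalation
    accountable_party: str         # where responsibility sits on failure
    trust_score: float             # calibrated confidence, 0.0-1.0
    telemetry_channel: str         # where real-time monitoring lands
    verification: str              # how completion claims are checked

    def may_decide(self, decision: str) -> bool:
        return decision in self.authority

    def must_escalate(self, condition: str) -> bool:
        return condition in self.escalation_triggers


contract = DelegationContract(
    task="summarize-claims-file",
    delegatee="agent-17",
    authority={"choose-summary-format"},
    escalation_triggers={"pii-detected"},
    accountable_party="claims-ops-team",
    trust_score=0.8,
    telemetry_channel="telemetry/agent-17",
    verification="human-spot-check",
)
```

The point of the structure is that "may this agent decide X?" becomes a lookup against a recorded grant, not a judgment call made at runtime.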
Why Heuristics Fail at Scale
The current generation of agentic AI systems handles delegation through what can charitably be called educated guesses. An orchestrator agent breaks a complex goal into sub-tasks. It routes those sub-tasks based on static capability descriptions. If something goes wrong, it retries or escalates. The whole thing is held together with prompt engineering and good intentions.
This works in demos. It works for low-stakes workflows with high error tolerance. It does not work when the tasks in question are consequential, irreversible, or subject to external audit.
The failure modes are specific:
Static capability models. An agent that performed well last week may be overloaded, misconfigured, or operating against changed data today. A delegation system that doesn't continuously assess the real-time state of its delegates is operating on stale information. In a regulated context, a decision made by a degraded agent under stale capability assumptions is still a decision that the organization owns.
Opacity in execution chains. When agent A delegates to agent B, which sub-delegates to agent C, the original delegator loses visibility into what's actually happening. The paper trail — to the extent one exists — is often too coarse to support post-hoc audit. This isn't just a governance problem; it's a diagnosis problem. When something goes wrong in an opaque chain, you can't tell whether you're dealing with a capability gap, a misspecified task, a corrupted intermediate result, or something adversarial. The inability to distinguish incompetence from malice is a serious systemic risk.
Undifferentiated failure handling. Not all failures are equivalent. A reversible failure — an agent produced a draft that failed quality review — is a retry event. An irreversible failure — a decision has been communicated to an external party, an action has been taken in the real world — is an escalation event that may have no recovery path. Systems that treat these the same way are not ready for production in regulated environments.
Accountability diffusion. As delegation chains lengthen, the distance between original intent and ultimate execution grows. The human who authorized the initial task may be five or six hops removed from the agent that caused the failure. Without explicit liability architecture, with defined contractual stop-gaps at each delegation boundary, accountability evaporates into the gaps between components.
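The distinction between retry events and escalation events, described above, can be encoded directly in the failure handler rather than left to prompt-level heuristics. A minimal sketch, with an assumed policy (retry budgets and escalation rules are illustrative):

```python
from enum import Enum, auto

class FailureKind(Enum):
    REVERSIBLE = auto()    # e.g. a draft failed quality review
    IRREVERSIBLE = auto()  # e.g. a decision reached an external party

def handle_failure(kind: FailureKind, retries_left: int) -> str:
    """Route a failure by its reversibility, not by a generic retry loop.

    Hypothetical policy: reversible failures retry until the budget is
    exhausted; irreversible failures always escalate to a human, because
    there may be no recovery path at all.
    """
    if kind is FailureKind.IRREVERSIBLE:
        return "escalate"
    return "retry" if retries_left > 0 else "escalate"
```

A system that collapses both kinds into one retry path is exactly the "undifferentiated failure handling" failure mode above.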
What Serious Architecture Looks Like
We think about agentic infrastructure across five dimensions. These aren't the only dimensions that matter, but they're the ones that separate systems that are ready for high-stakes regulated environments from systems that aren't.
1. Dynamic Assessment
Delegation decisions must be based on the current state of your agents, informed by historical interactions. This requires continuous telemetry: real-time data on computational throughput, resource consumption, current load, the sub-delegation chains already in progress, and past performance. An orchestration layer that can't answer "is this agent in a position to reliably execute this task right now?" is operating blind.
Assessment also needs to be task-specific. The same agent may be highly capable for one class of task and unreliable for another. A trust model that doesn't distinguish capability by task type will systematically misallocate work.
Assessment should also be deliberately stateless: each delegation decision evaluates current telemetry and task-specific performance history, not cached assumptions about the agent's general reliability. Without this discipline, orchestration layers default to static capability labels and end up delegating work blindly, an unacceptable risk in regulated environments.
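A stateless assessor of this kind can be sketched in a few lines. The thresholds and telemetry fields below are assumptions for illustration, not recommended values:

```python
from dataclasses import dataclass

@dataclass
class Telemetry:
    load: float            # 0.0 (idle) to 1.0 (saturated)
    error_rate: float      # recent fraction of failed tasks
    active_delegations: int

def assess(telemetry: Telemetry,
           task_success_rate: float,
           max_load: float = 0.8,
           min_success: float = 0.9) -> bool:
    """Answer: can this agent reliably execute this task right now?

    Deliberately stateless: the function sees only current telemetry and
    the agent's history for *this* task type, never a static capability
    label. Thresholds are illustrative assumptions.
    """
    if telemetry.load > max_load:
        return False                  # overloaded right now
    if task_success_rate < min_success:
        return False                  # weak on this class of task
    return telemetry.error_rate < (1 - min_success)
```

Note that a strong success rate on one task type never carries over: the caller must pass the rate for the task type being delegated.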
2. Adaptive Execution
Delegation decisions made at the start of a task should not be treated as permanent. Environments change. Resources degrade. External dependencies fail. A delegation framework that can't adapt mid-execution is brittle by design.
The key architectural requirement is a continuous feedback loop: monitoring signals feed back into delegation decisions, triggering reallocation when conditions shift below acceptable thresholds. This loop needs to be fast enough to catch failures before they propagate downstream, and intelligent enough to distinguish between transient degradation (wait and retry) and structural failure (reallocate immediately).
Critically, adaptive coordination must be coupled with stability mechanisms. Without them, a single failure can trigger a cascade of reallocations that is more disruptive than the original failure. Cooldown periods, dampening factors on reputation updates, and increasing costs on repeated reallocation all serve to absorb shock rather than amplify it.
3. Structural Transparency
Auditability is not a feature to be added later. It's an architectural constraint that shapes how every component is designed.
Process-level monitoring — tracking intermediate states, resource consumption, and the methodology used by an agent, not just its final output — is the foundation of auditability. For tasks where the how is as important as the what, outcome-only monitoring is insufficient. You need visibility into the execution path.
In practice, every delegation or tool invocation should emit a structured audit event that captures who performed the action, what was delegated, and the context under which it occurred. The important property is that the execution path is recorded, not just the final output. Scaled across an agent system, these events form a verifiable trail that lets auditors reconstruct how decisions were made, which components were involved, and whether the delegation chain respected the constraints defined by the system's architecture.
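A minimal sketch of such an audit recorder, with illustrative field names rather than a prescribed schema:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class AuditEvent:
    actor: str       # who performed the action
    action: str      # what was delegated or invoked
    target: str      # which agent or tool received it
    context: dict    # constraints in force at the time
    timestamp: float

class AuditLog:
    """Append-only audit log: entries are serialized at write time and
    never mutated afterward. Durable, tamper-evident storage is out of
    scope for this sketch."""
    def __init__(self) -> None:
        self._entries: list[str] = []

    def record(self, actor: str, action: str, target: str, context: dict) -> None:
        event = AuditEvent(actor, action, target, context, time.time())
        # Serialize immediately so later code cannot alter the entry.
        self._entries.append(json.dumps(asdict(event), sort_keys=True))

    def entries(self) -> list[str]:
        return list(self._entries)  # defensive copy
```

Emitting the event at the point of delegation, rather than reconstructing it from outputs later, is what makes the trail an execution-path record instead of an outcome summary.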
Privacy must also be handled correctly. In regulated environments, agent systems often process sensitive data, so full transparency to a human auditor is not always possible. The architecture should allow verification without exposing the underlying data, using cryptographic techniques that confirm computations were performed correctly.
Transparency also applies to the delegation chain itself. Every handoff should be logged so that an audit can reconstruct what was delegated, to whom, under what constraints, and what the outcome was. Immutable records of this chain form the technical foundation of accountability.
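One common way to make such handoff records tamper-evident is a hash chain, where each record's hash covers its content plus the previous record's hash, so altering any historical handoff invalidates everything after it. This is only a sketch of the chaining idea; a production system would add signatures and durable storage:

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first link

def chain_records(records: list[dict]) -> list[dict]:
    """Link delegation records into a tamper-evident hash chain."""
    prev = GENESIS
    chained = []
    for record in records:
        payload = json.dumps(record, sort_keys=True) + prev
        digest = hashlib.sha256(payload.encode()).hexdigest()
        chained.append({"record": record, "prev": prev, "hash": digest})
        prev = digest
    return chained

def verify_chain(chained: list[dict]) -> bool:
    """Recompute every link; any edited record breaks verification."""
    prev = GENESIS
    for entry in chained:
        if entry["prev"] != prev:
            return False
        payload = json.dumps(entry["record"], sort_keys=True) + prev
        if hashlib.sha256(payload.encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

An auditor holding only the final hash can detect any retroactive edit to the chain, which is the property "immutable records" needs to actually deliver.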
4. Trust Architecture
Trust in an agentic system needs to be both earned and enforced. An agent builds trust through a verifiable history of reliable task completion. That trust translates into expanded authority and reduced oversight overhead. An agent that underperforms has its authority narrowed and its actions subjected to tighter verification — automatically, not through manual intervention.
The distinction between reputation (public, verifiable history) and trust (context-specific threshold set by the delegating system) is important. An agent may have a strong overall reputation and still not meet the trust threshold for a particular class of task that demands a level of certainty or domain expertise it hasn't demonstrated. Reputation is an input to trust, not a substitute for it.
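The reputation-versus-trust distinction can be made concrete in the threshold check itself. The weighting below is a hypothetical policy chosen for illustration:

```python
def meets_trust_threshold(reputation: float,
                          task_history: dict[str, float],
                          task_type: str,
                          threshold: float) -> bool:
    """Reputation is an input to trust, not a substitute for it.

    Hypothetical policy: effective trust for a task type is dominated by
    the agent's demonstrated success rate on that type, lightly anchored
    by overall reputation. With no task-specific history, the check
    fails regardless of how strong the general reputation is.
    """
    specific = task_history.get(task_type)
    if specific is None:
        return False  # a strong reputation alone does not clear the bar
    trust = 0.8 * specific + 0.2 * reputation
    return trust >= threshold
```

The asymmetry is deliberate: reputation can nudge a borderline task-specific record over the line, but it can never substitute for the absence of one.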
Trust architecture also needs to handle the authority gradient problem. When a more capable system delegates to a less capable one, there's a risk of systematic under-specification: the delegator assumes knowledge the delegatee doesn't have, or the delegatee fails to raise legitimate concerns because the authority gradient creates implicit pressure to comply. Intelligent delegation systems need to build in mechanisms for delegatees to push back — to request clarification, flag ambiguity, or reject tasks that exceed their verified capability — without that pushback being systematically suppressed.
5. Permission Boundaries
Every delegation boundary is a permission boundary. When a system delegates a sub-task, it should transmit only the authority required for that specific sub-task — not its full permission set. This is privilege attenuation, and it's a core security property.
Without it, a compromise at the edge of a delegation network can escalate inward. An agent that was only supposed to read a specific dataset and somehow obtains write access becomes a liability that extends far beyond its intended scope. The blast radius of any individual agent failure should be strictly bounded by the permissions it was issued at the time of delegation.
Permission lifecycles also need to be dynamic. Permissions should persist only as long as the agent maintains the trust metrics that justified issuing them. When an agent's performance degrades below threshold, or when anomalous behavior is detected, active permissions should be invalidated automatically — not flagged for manual review that may arrive too late.
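Both properties, privilege attenuation at the delegation boundary and automatic revocation when trust degrades, can be sketched in one small grant object. Scope names are illustrative:

```python
class PermissionGrant:
    """Illustrative privilege-attenuated, revocable permission grant."""

    def __init__(self, scopes: set[str]):
        self.scopes = set(scopes)
        self.revoked = False

    def attenuate(self, requested: set[str]) -> "PermissionGrant":
        # A delegate receives at most the intersection of what it asks
        # for and what the delegator holds -- never the full set, and
        # never anything the delegator itself lacks.
        return PermissionGrant(self.scopes & requested)

    def revoke(self) -> None:
        # Intended to be triggered automatically when trust metrics fall
        # below threshold or anomalous behavior is detected.
        self.revoked = True

    def allows(self, scope: str) -> bool:
        return not self.revoked and scope in self.scopes
```

Because every check passes through `allows`, revocation takes effect immediately on the next action, bounding the blast radius without waiting for manual review.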
The Human Question
We want to be direct about where humans belong in this architecture, because we think the industry has a tendency to answer this question at the extremes — either "AI handles everything" or "humans must review everything" — when the real answer is considerably more precise.
Humans should be where their judgment is actually irreplaceable: in decisions that are genuinely ambiguous, genuinely high-stakes, and genuinely require the kind of contextual moral reasoning that no current AI system reliably supplies. That is a real and important set of decisions. It is not, however, most decisions.
The goal of intelligent delegation architecture is not to minimize human involvement. It's to place human judgment where it compounds most effectively — while building the infrastructure that lets everything else run safely without it.
Two failure modes to avoid:
The moral crumple zone. This is what happens when humans are inserted into a delegation chain not because their judgment is needed, but to absorb liability. They lack meaningful visibility into what the system is doing, lack the context to evaluate what they're reviewing, and lack the authority to intervene effectively. They're nominally "in the loop" while the loop runs around them. This is worse than either genuine human oversight or genuine autonomy, because it creates the appearance of accountability without the substance.
De-skilling. As agentic systems take over routine tasks, the humans in the oversight role are increasingly exposed only to edge cases and exceptions. Expertise in edge cases is built through repeated exposure to routine cases. If you automate away the routine, you gradually erode the human judgment capacity you're relying on for escalation. The architecture needs to deliberately route some tasks to humans for developmental reasons — not because AI can't handle them, but because the system's long-term reliability depends on maintaining human capability within the loop.
The right mental model is dynamic cognitive friction: low friction for tasks that are routine and high-confidence; escalating friction as complexity, uncertainty, and irreversibility increase. The system should always be asking whether the human reviewing a decision has enough context, enough authority, and enough time to actually evaluate it. When the answer is no, the system has failed at oversight design — regardless of whether a human technically touched the decision.
Why Regulated Industries Are Different
Everything above applies to any serious agentic deployment. In regulated industries, the stakes are higher across every dimension, and several additional constraints apply.
The external audit requirement is real. In highly regulated sectors, the question is not whether your system will be audited — it's when, and by whom. The audit will not care about your architecture's elegance. It will care about whether you can reconstruct, with precision, what decisions were made, on what basis, by what agent or human, under what authority. Systems that can't answer those questions are not compliant systems, regardless of how sophisticated the underlying AI is.
Irreversibility is the norm, not the exception. In many regulated contexts, consequential decisions cannot be undone once communicated to external parties. The irreversibility taxonomy described earlier — distinguishing retry events from escalation events — isn't an engineering nicety. It's the difference between a recoverable failure and a regulatory incident.
The accountability chain extends to the organization. In consumer-facing regulated industries, the organization is accountable for the behavior of its AI systems. An agent making a wrong decision doesn't insulate the organization from responsibility; it raises questions about whether appropriate oversight was in place. This means the delegation architecture needs to be defensible not just technically but institutionally — designed and documented in a way that demonstrates the organization exercised reasonable care.
Safety cannot be a luxury good. In competitive markets, there's pressure to reduce verification overhead for the sake of speed and cost. In regulated industries, this creates a systemic fragility problem: if the economics of safe operation are only viable for well-resourced actors, under-resourced actors will cut corners, and the regulatory environment will eventually tighten for everyone. The architecture should be designed to make high-assurance operation cost-effective at scale, not a premium feature layered on top.
The Infrastructure Thesis
We started Soris with a conviction that the most important companies of the next decade in large regulated industries will not be the ones with the best models. They will be the ones that build the best infrastructure for operating AI systems safely, accountably, and at scale.
Models are commoditizing. The quality gap between frontier models and capable open models is compressing. Fine-tuning on domain data is increasingly accessible. Any company with sufficient resources can acquire roughly comparable modeling capabilities.
Delegation infrastructure is not a commodity. It's an engineering discipline that requires deep understanding of both AI system design and the specific operational, legal, and ethical constraints of the industry it serves. It requires getting the trust architecture right, the monitoring topology right, the permission model right, the human-AI interface right — and it requires doing that in an environment where the cost of getting it wrong is not just a bad user experience but a serious organizational consequence.
That's the company we're building. Not a faster version of how things are done today. A different operational architecture — one where the safety and accountability properties of the system are not bolt-on compliance features but foundational design constraints that shape everything else.
The model matters. The delegation layer matters more.