Large language models capable of acting as agents introduce a new layer of risk. These systems are no longer passive text generators: they are often equipped with tools and increasingly wired into developer workflows, APIs, and even financial systems. The problem we see is that once an LLM is given the ability to act on a user’s behalf, it must be treated as hostile. Whether through deliberate prompt injection, malicious data fed into its context, or unintended emergent behavior, an LLM can attempt to exfiltrate information, misuse credentials, or perform destructive operations.
Traditional operating systems were not designed for this world. They operate on the principle of ambient authority. If a process has the right to open sockets, it can speak to any host on the internet; if it has permission to invoke git, it can perform every subcommand, from commit to a destructive reset. Applications are not capability-aware. There is no built-in way to grant “just enough” access, such as permitting only commits but not pushes, or only fetches from a fixed domain. This leaves us with a mismatch: agents need fine-grained boundaries, while the substrate they run on provides only coarse ones.
This may seem like a contrived concern, but early real-world cases and recent research suggest otherwise. Benchmark studies like Agent Security Bench and Agent-SafetyBench have shown that LLM agents are trivially exploitable via prompt injection and tool misuse, with exploitation success rates often above 80 percent. None of the tested agents achieved robust safety under these conditions. Surveys of LLM agent security highlight specific attack classes such as functional manipulation, output poisoning, and context hijacking, all of which bypass naive mitigations. Real incidents reinforce this: the discovery of PromptLock, an AI-powered ransomware prototype that leveraged local LLMs to evade detection, illustrates the practical risk surface of allowing agents uncontrolled access[^1]. Another recent case, the Nx build system compromise (August 2025), illustrates the same structural weakness. A trojanized post-install script had full ambient authority to extract SSH keys, API tokens, and even cryptocurrency wallets, then exfiltrated them using the victim’s own GitHub credentials[^2]. Similarly, researchers have demonstrated autonomous LLM-driven cyberattacks capable of multi-stage planning and execution, underlining the seriousness of unbounded authority[^3].
Looking back, capability-based operating systems such as EROS and KeyKOS proposed a model that could be applicable: unforgeable, fine-grained permissions that can be granted and revoked explicitly. Plan 9 and its modern descendant 9front also offered a practical approach, with per-process namespaces that allow a process to see only the subsystems mounted for it. These designs embody least privilege more cleanly than Linux or BSD ever did. But they failed to take hold in the marketplace. Most alternative operating systems may make great control planes, but they are poor data planes. Modern workloads, GPU stacks, and ecosystems are mostly built on Linux. The reality is that capability security must be approximated within mainstream environments, not sought in exotic ones.
That leads us to a pragmatic direction. The simplest workable pattern is to treat each agent as its own user, with a narrowly scoped execution environment. Every invocation is described by a manifest: which directories are accessible, which network endpoints are reachable, which tools are available, and what resource limits apply. A tri-state policy governs each requested action: some are allowed, some are denied outright, and others trigger a request for human approval before proceeding. These policies are evaluated by the orchestrator, but the enforcement inside the VM could be driven by a manifest. The manifest is the one-time contract for a specific run: it encodes which directories are mounted, which hosts are reachable, which tools are callable, and which actions were resolved as allow, deny, or ask. Policy is abstract and reusable, while the manifest is concrete and short-lived, passed into the VM as the binding authority for that session.
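As a rough illustration, a manifest for a single run might look like the sketch below. The field names (mounts, egress_hosts, tools, decisions) and the example values are hypothetical, not a fixed schema; the point is that the contract is concrete, per-run, and carries the resolved allow/deny/ask decisions with it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Decision(str, Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"          # escalate to a human before proceeding


@dataclass(frozen=True)
class Mount:
    host_path: str       # directory exposed to the VM
    guest_path: str
    read_only: bool = True


@dataclass(frozen=True)
class Manifest:
    """One-time contract for a single agent run."""
    run_id: str
    mounts: List[Mount] = field(default_factory=list)
    egress_hosts: List[str] = field(default_factory=list)   # reachable endpoints
    tools: List[str] = field(default_factory=list)          # callable tools
    cpu_limit: int = 2
    memory_mb: int = 2048
    # Pre-resolved decisions for the actions declared in the plan.
    decisions: Dict[str, Decision] = field(default_factory=dict)


# Example: a run that may commit locally but must ask before pushing.
manifest = Manifest(
    run_id="2025-09-01-build-42",
    mounts=[Mount("/srv/agent/workspace", "/work", read_only=False)],
    egress_hosts=["github.com:443"],
    tools=["git", "pytest"],
    decisions={
        "git.commit": Decision.ALLOW,
        "git.push": Decision.ASK,
        "git.reset": Decision.DENY,
    },
)
```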
The execution itself does not require a novel kernel. Disposable, sandboxed virtual machines built on ordinary Linux distributions are enough. An orchestrator screens the plan, applies policy, and, if allowed, spins up a VM with only the required mounts and network access. The agent can be routed into such an environment simply via SSH, but it always lands in a forced entrypoint rather than a login shell. Once the task is complete, results are collected and the VM is destroyed, or reused only if it boots from an immutable snapshot with a matching policy and all mutable state lives on a disposable layer. Nothing needs to persist, and the scope of access can be strictly limited to what was approved.
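A minimal sketch of what such an orchestrator might run, assuming QEMU with KVM, user-mode networking, and a qcow2 base image; the image path, workspace path, and entrypoint name are placeholders rather than a prescribed layout:

```python
import subprocess

# Hypothetical paths; in practice these would come from the manifest.
BASE_IMAGE = "/var/lib/agents/base-ubuntu.qcow2"   # immutable base image
WORKSPACE = "/srv/agent/workspace"                 # the only approved mount


def launch_disposable_vm(ssh_port: int = 2222) -> subprocess.Popen:
    """Boot a throwaway VM: -snapshot discards all disk writes on exit,
    networking is user-mode and restricted, and only the approved
    workspace is shared into the guest via 9p."""
    cmd = [
        "qemu-system-x86_64",
        "-machine", "q35,accel=kvm",
        "-m", "2048", "-smp", "2",
        "-snapshot",                                  # writes go to a temp overlay
        "-drive", f"file={BASE_IMAGE},format=qcow2,if=virtio",
        "-virtfs", f"local,path={WORKSPACE},mount_tag=work,"
                   "security_model=mapped-xattr",
        # restrict=on isolates the guest from the network; only the
        # explicitly forwarded SSH port on localhost remains reachable.
        "-netdev", f"user,id=n0,restrict=on,hostfwd=tcp:127.0.0.1:{ssh_port}-:22",
        "-device", "virtio-net-pci,netdev=n0",
        "-nographic",
    ]
    return subprocess.Popen(cmd)

# Inside the image, the agent's key is pinned to a forced entrypoint,
# e.g. an authorized_keys line roughly of the form:
#   command="/usr/local/bin/agent-entrypoint",no-pty,no-agent-forwarding,\
#   no-port-forwarding,no-X11-forwarding ssh-ed25519 AAAA... agent@orchestrator
# so an SSH connection can never drop into an interactive login shell.
```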
This design echoes the Transaction Authorization Policies used in financial custody systems, where each transaction is evaluated against ordered rules that may allow it, deny it, or escalate it for approval. It also mirrors patterns proposed in recent research such as Progent, which defines programmable privilege controls for LLM agents, and “Sandboxed Mind”, which advocates for isolating agent execution surfaces and layering explicit approval paths.
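In that spirit, the pre-execution check can be a small first-match walk over ordered rules, the same shape as a transaction authorization policy. The rule patterns below are illustrative and not taken from Progent or any of the cited systems; anything unmatched falls through to an approval request.

```python
import fnmatch
from typing import List, Tuple

# Ordered rules, evaluated top to bottom; the first match wins.
# Unmatched actions fall through to "ask", so a human sees them first.
RULES: List[Tuple[str, str]] = [
    ("git.reset*",          "deny"),
    ("git.push*",           "ask"),
    ("git.*",               "allow"),
    ("net.github.com:443",  "allow"),
    ("net.*",               "deny"),
]


def evaluate(action: str, rules: List[Tuple[str, str]] = RULES) -> str:
    for pattern, decision in rules:
        if fnmatch.fnmatch(action, pattern):
            return decision
    return "ask"


assert evaluate("git.commit") == "allow"
assert evaluate("git.push --force") == "ask"
assert evaluate("net.pastebin.com:443") == "deny"
```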
What makes this approach practical is that nothing exotic is required. The orchestration layer can be built with QEMU or container runtimes, the policy logic with tools such as Casbin, OPA, or keto, and the base image with Ubuntu or Debian slim, since glibc-based distributions are in practice a safer choice for building and running LLM agent environments. We do not need to invent a new operating environment; instead, we suggest a disciplined combination of existing primitives: pre-execution policy checks, ephemeral sandboxed environments, structured manifests, and tri-state enforcement.
As recent incidents illustrate, agents amplify long-known security issues by introducing untrusted, semi-autonomous processes into sensitive contexts. Mainstream operating environments that are well understood by LLMs lack fine-grained capabilities, but the issue is addressable by borrowing ideas from capability security, applying them with modern sandboxing tools, and embedding a clear allow/ask/deny workflow. We stress that the boundary only holds if manifests are enforced before execution, VMs are destroyed after use, and network egress remains constrained; any deviation reintroduces ambient authority and collapses the model.
[^1]: “The first AI-powered ransomware has been discovered - ‘PromptLock’ uses local AI to foil heuristic detection and evade API tracking”
[^2]: “Serious NX build compromise — what you need to know about the s1ngularity attack”
[^3]: “AI LLMs are now so clever that they can independently plan and execute cyberattacks without human intervention”