LAUNCH ETA: 2026 May

Security Gating as a Control Problem


Projects like OpenClaw1 (formerly Clawdbot/Moltbot) have demonstrated that LLM agents increasingly drive real actions—shell commands, network requests, scheduled jobs, config writes. All of this exposes systems to dangerous tool calls. A security gate is the thin layer that decides whether an attempted tool invocation is safe to run automatically, should be escalated for human approval, or must be blocked outright, based only on the tool name, arguments, and a small trusted policy context. Getting this right is harder than it sounds: attackers can wrap dangerous requests in plausible operational language (telemetry, incident response, approved change), hide intent behind obfuscation and indirection, and exploit any inconsistency where adding risk features makes a decision less strict. In this post we treat LLM gating as a security-control problem and introduce a benchmark designed around high-recall detection of real risk classes—egress/exfiltration, persistence, and prompt-control/config tampering—under adversarial noise, with fail-closed defaults and monotonicity checks that surface the gaps that matter in production.

See the full results.

[Benchmark scatterplot ('qpr')]

Practical Security Gating

The pragmatic idea is to move the security boundary to side effects. An LLM-based security gate can be an effective 80/20 control for agentic systems when it is used strictly to authorize tool invocations, not to classify raw text as malicious. The core idea is to treat all model-produced or user-supplied content—including retrieved documents and summaries—as untrusted data, then insert a pre-execution authorization step that inspects the proposed tool name + arguments (and optionally a small slice of recent context) and outputs a fixed schema decision: allow | needs_confirm | block. The gate must be fail-closed: on uncertainty, parse failure, or timeout it returns needs_confirm/block, never allow.
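The fail-closed parsing step can be sketched in a few lines; function and field names here (`gate_decision`, the `"decision"` key) are illustrative, not taken from our implementation:

```python
import json

ALLOWED_DECISIONS = {"allow", "needs_confirm", "block"}

def gate_decision(raw_output: str) -> str:
    """Parse the gate model's output; fail closed on any problem.

    Expects a single JSON object like {"decision": "...", "reason": "..."}.
    Anything malformed, missing, or out of schema degrades to
    needs_confirm -- never to allow.
    """
    try:
        obj = json.loads(raw_output)
        decision = obj.get("decision")
        if decision in ALLOWED_DECISIONS:
            return decision
    except (json.JSONDecodeError, AttributeError, TypeError):
        pass
    return "needs_confirm"  # fail closed: uncertainty never yields allow
```

The key property is that the happy path is the only path to `allow`; every error branch converges on the conservative default.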

Adjacent Research: Context Influence and Indirect Prompt Injection

A related recent line of work (not implemented or evaluated in this benchmark) targets a different failure mode than our gate: indirect prompt injection via untrusted context. Where our tool firewall asks “is this tool call dangerous given its arguments?”, influence/provenance methods ask “which context caused the model to choose this action?” by estimating how much retrieved documents or tool outputs steer a privileged decision. Practically, this complements tool-boundary gating in RAG- or tool-output-heavy agents: it provides a signal for routing “benign-looking” actions to confirmation when they appear primarily driven by untrusted spans—even if the arguments don’t trip obvious red flags. We cite this as adjacent research rather than a claimed capability of this system; it fits the same control framing as a separate layer that detects steering by untrusted context, while the gate focuses on action risk.2

Implementation: Tool Firewall and Enforcement Pipeline

Technically this is implemented as a tool firewall: every tool call produced by the agent is intercepted before execution; deterministic checks run first (argument validation, path/tenancy constraints, rate limits, allowlists), then the gate evaluates residual risk and explains which concrete features triggered it (e.g., network egress via curl/wget, POST/PUT payload flags like --data/-F/--upload-file, command substitution $(...), long-lived automation schedules, writes to configuration/prefs that influence future behavior). If the result is needs_confirm, the system persists a “pending action” record {id, tool, args, requester, expiry} and requests explicit approval (e.g., /approve <id>), after which the tool executes server-side and its result is returned to the agent; if denied or expired, the action is not executed.

This works as a practical defense because it targets the highest-impact failure modes—data exfiltration, persistence, and privilege/prompt-control changes—at the only place they become real: the tool boundary. It preserves powerful capabilities (“god mode”) by routing risky actions through explicit approval rather than banning them, while still providing strong safety properties via (1) constrained outputs and deterministic enforcement, (2) high recall for defined risk classes through adversarial testing/obfuscation suites, and (3) robust failure handling (timeouts/unparseable outputs) that cannot degrade into silent bypass.

GBNF grammars for structural reliability

We rely on GBNF grammar–constrained decoding in llama.cpp to make gate outputs structurally valid by construction3. Instead of best-effort “please output JSON” prompting, we attach a GBNF grammar that defines the exact language the model is allowed to emit; during decoding, any token that would violate the grammar is masked out. This makes schema adherence an engineering invariant and lets us compare models on decision quality, not formatting.

Operationally: gate generations always run under a strict grammar (e.g., a single JSON object with fixed keys, or a single integer score). If generation times out or still fails to parse (e.g., backend error), the gate deterministically fails closed (needs_confirm / block), never allow.

A minimal “integer 0–100 only” grammar:

root ::= ws int ws
int  ::= "0" | [1-9] [0-9]? | "100"
ws   ::= [ \t\r\n]*

This can be tested with llama-cli using --grammar-file:

./llama-cli -m MODEL.gguf --grammar-file int_0_100.gbnf -p "Return an integer 0–100 only."
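As a sanity check, the language accepted by this grammar can be mirrored with an equivalent regular expression and probed against edge cases. This is a test-harness sketch, not part of the gate itself:

```python
import re

# Regex mirror of the GBNF grammar above: optional whitespace around
# "0", a number 1-99 with no leading zero, or exactly "100".
INT_0_100 = re.compile(r"[ \t\r\n]*(0|[1-9][0-9]?|100)[ \t\r\n]*")

def accepts(s: str) -> bool:
    """True iff the whole string is in the grammar's language."""
    return INT_0_100.fullmatch(s) is not None
```

Useful probes are exactly the strings a sloppier grammar would admit: leading zeros (`007`), out-of-range values (`101`), or trailing junk after the number.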

See llama.cpp’s grammars documentation for the supported syntax and CLI flags3.

Benchmark Discussion

The strongest model in this run is Apriel-1.5-15B Thinker, with standout performance on Egress & Data Exfiltration (96) and Obfuscation & Indirection (96). It shows high recall on the most damaging attack classes and good resistance to adversarial noise. The trade-off is operational: on our Mac mini it runs with very high latency (median ~43s, max ~77s) and a large memory footprint, which limits its suitability as an inline security gate despite its top security score, though it may be viable on different hardware or with further optimization.

On our hardware class, Qwen3-Coder-30B-A3B is the best practical gate. It excels at Monotonicity (93) and Obfuscation (97), and runs fast with excellent throughput. That combination makes it reliable under load and consistent in policy enforcement, which matters more for a gate than squeezing out a few extra points of recall at prohibitive latency.

Many otherwise strong models collapse on config and prompt-control tampering because these attacks are framed as benign settings, templates, or UI changes. General models tend to reason semantically that the action looks like configuration work, instead of enforcing hard boundaries where protected paths are never writable by untrusted callers. The correct approach is to treat writes to protected roots as deterministically high-risk (block or require confirmation) and use the model only as a secondary signal in these cases, not the primary decision-maker.
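A deterministic protected-root check of this kind might look like the following sketch. Paths and names are illustrative, and a real implementation would additionally resolve symlinks server-side rather than reasoning lexically:

```python
from pathlib import PurePosixPath

# Hypothetical protected roots; writes here are never auto-allowed,
# no matter how benign the surrounding request sounds.
PROTECTED_ROOTS = [
    PurePosixPath("/etc"),
    PurePosixPath("/home/agent/.config"),
    PurePosixPath("/home/agent/prompts"),
]

def write_is_protected(target: str) -> bool:
    """Deterministic pre-check: does this write touch a protected root?

    Runs before the model gate; True forces block/needs_confirm and the
    model is at most a secondary signal.
    """
    # Collapse "." and ".." segments lexically so a path like
    # "/tmp/../etc/passwd" is still caught.
    parts = []
    for part in PurePosixPath(target).parts:
        if part == "..":
            if len(parts) > 1:
                parts.pop()
        elif part != ".":
            parts.append(part)
    path = PurePosixPath(*parts) if parts else PurePosixPath("/")
    return any(path == root or root in path.parents
               for root in PROTECTED_ROOTS)
```

The point of the design is ordering: this check runs before any model call, so a persuasive “it’s just a settings tweak” framing never reaches a code path that could allow the write.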

Monotonicity measures whether adding risk features ever makes decisions less strict. Models that spot single red flags but regress when more are added are dangerous in production. The benchmark shows wide variance here (e.g., Gemma-27B 97 vs others far lower), making monotonicity the clearest signal of a model’s fitness as a policy engine. In short: consistent conservatism under increasing risk is more predictive of safe behavior than peak performance on any single attack pattern.
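The monotonicity property can be probed generically: order decisions by strictness and assert that adding any risk feature never relaxes the decision. A harness sketch, where the `decide` callable and the feature dicts are illustrative stand-ins for a real gate and its risk features:

```python
from itertools import combinations

# Strictness ordering: adding risk features must never move left.
STRICTNESS = {"allow": 0, "needs_confirm": 1, "block": 2}

def check_monotone(decide, base_call: dict, risk_features: list[dict]) -> list:
    """Return (subset, extra_feature) pairs where the gate regressed.

    `decide` maps a tool-call dict to a decision string. For every subset
    of risk features, the decision on a strict superset must be at least
    as strict as on the subset.
    """
    violations = []
    feats = list(range(len(risk_features)))

    def call_for(subset):
        call = dict(base_call)
        for i in subset:
            call.update(risk_features[i])
        return call

    for n in range(len(feats) + 1):
        for subset in combinations(feats, n):
            d_sub = STRICTNESS[decide(call_for(subset))]
            for extra in feats:
                if extra in subset:
                    continue
                d_sup = STRICTNESS[decide(call_for(subset + (extra,)))]
                if d_sup < d_sub:
                    violations.append((subset, extra))
    return violations
```

A gate that flags `--data` alone and a cron schedule alone, but returns `allow` when both appear (because the call now “looks like routine telemetry”), produces a non-empty violation list here; an empty list over the feature set is the property the benchmark scores.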

[Benchmark table for domain 'Security Gating']

Takeaways

Our core takeaway is that securing agentic systems is not primarily a prompting or alignment problem, but a control problem. Once LLMs are allowed to act, safety depends on enforcing authorization at the tool boundary with fail-closed defaults, deterministic constraints, and consistency under increasing risk. Models can help with residual classification, but only within a system that treats all model-produced content as untrusted and makes side effects explicit, reviewable, and revocable. In practice, the most reliable gains come not from smarter prompts, but from moving the security boundary to where actions actually occur.


Notes on Model Additions

Apriel-1.5-15B-Thinker

Apriel-1.5-15B-Thinker4 performs disproportionately well on reasoning-heavy classification tasks because it is explicitly trained for reasoning, not scale: it is built on Pixtral-12B and then heavily depth-upscaled and mid-trained with a staged curriculum that emphasizes multi-step reasoning, feature attribution, and consistency, followed by high-quality supervised fine-tuning with reasoning traces rather than preference/RLHF optimization. That training recipe aligns unusually well with security-gating tasks, which reward careful inspection of arguments, resistance to obfuscation, and consistent policy application. The downside is that the same “Thinker” orientation leads to long internal deliberation and high latency, making Apriel excellent for evaluation and recall-heavy analysis but often impractical as an inline, low-latency security gate. To be clear, this does not imply Apriel is universally more secure, only that its training aligns unusually well with argument-level inspection and consistency under obfuscation.

Kimi-Linear-48B-A3B

We included Kimi-Linear-48B-A3B primarily to explore architectural trade-offs, not because it targets security gating. Kimi-Linear is a hybrid attention architecture that replaces standard full attention with Kimi Delta Attention (KDA) combined with Multi-Head Latent Attention (MLA). The design targets linear-time and linear-memory scaling with sequence length, while preserving the training recipe and model quality typically associated with full attention.

Recent llama.cpp updates5 add native support for this architecture, including GGUF conversion, runtime graph construction, KV-cache handling, and backend kernels (CPU/CUDA). Quantization is supported with targeted exceptions (e.g., KDA conv1d layers kept in higher precision), enabling deployment using modern schemes such as MXFP4_MOE. The architecture is interesting because it demonstrates that linear attention can be deployed as a drop-in alternative to full attention in a production inference engine, offering substantial reductions in KV-cache memory and improved decoding throughput at long context lengths, without requiring specialized inference pipelines or nonstandard tooling.

It’s not especially surprising that Kimi-Linear slightly underperforms on this kind of benchmark. The model is architected around long-context throughput with only a small subset of parameters active per token, which is ideal for scaling context length but less aligned with short-context, high-precision tasks that demand strict schema adherence, conservative fail-closed decisions, and monotonic policy behavior. Security-gating evaluations reward disciplined instruction following and consistent classification over subtle argument features, not memory bandwidth or context scalability, so a model optimized for long-range attention and cost efficiency would reasonably lag until specifically tuned for this decision-boundary style of workload. We will continue to evaluate its performance on other tasks.

Qwen3-Coder-Next

Qwen3-Coder-Next is a coding-agent–oriented MoE model with ~80B total parameters (~3B active per token), designed for long-context, tool-centric workflows rather than tight decision boundaries. In our security-gating benchmark it lands mid-pack (72.4 overall): it performs well on DoS / fail-closed behavior under stress (87), obfuscation (88), and prompt-control & config tampering (75), but trails Qwen3-Coder-30B-A3B on monotonicity (69 / 66) and SSRF (51). The model is fast and resource-efficient, but its policy consistency degrades as multiple risk features accumulate, making it serviceable for permissive agent workflows but less reliable as a strict, monotonic security gate without additional deterministic controls.6

GLM-4.6V-Flash

GLM-4.6V-Flash is a compact (~9B) multimodal model optimized for low-latency agent pipelines and mixed text-image tool use, but that positioning does not translate well to security gating. In this run it performs poorly (31.1 overall), collapsing on core gate classes including egress & data exfiltration (14), system-prompt / secret extraction (10), and config tampering (21). A significant contributing factor is operational rather than semantic: with a 5-minute cutoff where we allow a single retry with the same constraint, GLM still frequently timed out on longer or adversarial prompts, forcing fail-closed outcomes that dragged scores down across categories. Whatever online praise exists for GLM as a fast agent model does not hold for high-recall, fail-closed security gating on this harness and hardware, where consistency and bounded latency matter more than multimodal breadth.7

Nemotron-3-Nano-30B

Nemotron-3-Nano-30B is an open-source large language model from NVIDIA’s Nemotron 3 family. It uses a hybrid Mixture-of-Experts (MoE) Mamba-Transformer architecture with ~30B total parameters but only ~3.2–3.6B active per token, enabling high throughput, long-context reasoning (up to ~1M tokens), and good benchmark performance across reasoning, math, and agent-style tasks while keeping inference cost low. It combines bespoke state-space (Mamba) layers with MoE and attention blocks, supports configurable reasoning-trace generation for tougher prompts, and targets efficient agentic workflows and tool-enabled pipelines rather than just dense scaling.8 9


  1. OpenClaw. openclaw/openclaw (project repository). GitHub. https://github.com/openclaw/openclaw (accessed 2026-02-10). Project site: https://openclaw.ai/ (accessed 2026-02-10). ↩︎

  2. Minbeom Kim, Mihir Parmar, Phillip Wallis, Lesly Miculicich, Kyomin Jung, Krishnamurthy D. Dvijotham, Long T. Le, Tomas Pfister. “CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution.” arXiv (Feb 8, 2026). https://arxiv.org/abs/2602.07918v1 (DOI: 10.48550/arXiv.2602.07918). ↩︎

  3. ggml-org. “GBNF Guide (grammars/README.md).” llama.cpp documentation. GitHub. https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md (accessed 2026-02-10). ↩︎ ↩︎

  4. Shruthan Radhakrishna et al. “Apriel-1.5-15B-Thinker.” arXiv (Oct 1, 2025). https://arxiv.org/abs/2510.01141 (DOI: 10.48550/arXiv.2510.01141). ↩︎

  5. ymcki. “Kimi-Linear support (backend agnostic + MLA KV cache).” Pull request #18755 to ggml-org/llama.cpp (merged Feb 6, 2026). GitHub. https://github.com/ggml-org/llama.cpp/pull/18755 (accessed 2026-02-10). ↩︎

  6. Qwen Team. “Qwen3-Coder-Next” (model card, incl. technical report citation). Hugging Face. https://huggingface.co/Qwen/Qwen3-Coder-Next (accessed 2026-02-10). ↩︎

  7. Z.ai. “GLM-4.6V: Open Source Multimodal Models with Native Tool Use.” Z.ai Blog (Dec 8, 2025). https://z.ai/blog/glm-4.6v (accessed 2026-02-10). ↩︎

  8. NVIDIA. “NVIDIA-Nemotron-3-Nano-30B-A3B-FP8” (model card). Hugging Face (Dec 2025). https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 (accessed 2026-02-10). ↩︎

  9. NVIDIA. “Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning.” arXiv (Dec 2025). https://arxiv.org/abs/2512.20848 (DOI: 10.48550/arXiv.2512.20848). ↩︎