2025-12-17 · Authensor

Myth: The LLM Provider Handles AI Agent Safety

LLM providers like OpenAI and Anthropic control model behavior — content filtering, refusal of harmful requests, and output safety. They do not control what your agent does with tool calls after the model responds. SafeClaw by Authensor fills this gap by gating every tool execution through deny-by-default policies. The model layer and the agent layer have different safety responsibilities, and only you are responsible for the agent layer.

Why People Believe This Myth

LLM providers invest heavily in safety. OpenAI's usage policies, Anthropic's Constitutional AI, and Google's safety filters create a perception that the provider is handling safety comprehensively. If the model refuses to generate harmful content, shouldn't it also refuse to execute harmful actions?

The answer is no — because the model doesn't execute actions. Your agent framework does.

Where Provider Safety Ends and Your Responsibility Begins

What LLM Providers Control

What LLM Providers Do NOT Control

The model generates a tool call. Your agent framework executes it. The provider has no visibility into or control over that execution.

The Responsibility Gap

User Request → LLM Model → Tool Call Response → Agent Executes Tool
                  ↑                                    ↑
          Provider controls this            YOU control this
                                            SafeClaw gates this

The LLM might return: { "tool": "file.delete", "path": "/important/data" }

The provider's safety filtered the model's text output. But the tool call is structurally valid JSON. The provider's safety layer sees it as a normal response. Your agent framework is about to execute it. Only SafeClaw stands between the tool call and the action:

# .safeclaw.yaml
version: "1"
defaultAction: deny

rules:
- action: file.read
path: "./src/**"
decision: allow

- action: file.write
path: "./src/**"
decision: allow

- action: file.delete
decision: deny
reason: "File deletion blocked by policy"

- action: shell.execute
command: "npm test"
decision: allow

- action: shell.execute
decision: deny
reason: "Unapproved shell commands blocked"

- action: network.request
decision: deny
reason: "Network access requires explicit approval"

Provider Safety Does Not Cover Prompt Injection

LLM providers are improving at detecting prompt injection, but no provider claims to prevent it completely. When an agent reads a document containing injected instructions, the model may follow those instructions and generate tool calls that the provider's safety layer considers valid. Your agent executes them.

This is not the provider's failure. The model generated a structurally valid tool call. The safety gap is at the execution layer — which is your responsibility.

Quick Start

Take responsibility for your agent's actions:

npx @authensor/safeclaw

SafeClaw works with Claude, OpenAI, and any other provider. One policy file, universal enforcement.

Why SafeClaw

FAQ

Q: If I use Anthropic's Claude, doesn't it refuse dangerous actions?
A: Claude may refuse to generate certain responses, but tool calls are structured data returned by the API. The model's refusal mechanisms apply to content generation, not to the programmatic tool calls your agent framework executes.

Q: What about OpenAI's function calling safety?
A: OpenAI applies content safety to generated text. Function call parameters are generated as structured JSON. The provider does not know what file.delete with a specific path will do in your environment.

Q: Should I also use provider-level guardrails like Bedrock Guardrails?
A: Yes, as an additional layer. Provider guardrails protect the model layer. SafeClaw protects the action layer. Defense in depth means securing both.


Related Pages

Try SafeClaw

Action-level gating for AI agents. Set it up in your browser in 60 seconds.

$ npx @authensor/safeclaw