2025-12-17 · Authensor

Myth: The LLM Provider Handles AI Agent Safety

LLM providers like OpenAI and Anthropic control model behavior — content filtering, refusal of harmful requests, and output safety. They do not control what your agent does with tool calls after the model responds. SafeClaw by Authensor fills this gap by gating every tool execution through deny-by-default policies. The model layer and the agent layer have different safety responsibilities, and only you are responsible for the agent layer.

Why People Believe This Myth

LLM providers invest heavily in safety. OpenAI's usage policies, Anthropic's Constitutional AI, and Google's safety filters create a perception that the provider is handling safety comprehensively. If the model refuses to generate harmful content, shouldn't it also refuse to execute harmful actions?

The answer is no — because the model doesn't execute actions. Your agent framework does.

Where Provider Safety Ends and Your Responsibility Begins

What LLM Providers Control

Content generation (refusing to write malware, harmful content)
Token-level safety (output filtering, content moderation)
Rate limiting and abuse detection
Model-level refusals for harmful requests

What LLM Providers Do NOT Control

Whether your agent executes the file.write tool call the model returned
Which directories your agent can access
What shell commands your agent runs
Which network endpoints your agent contacts
How much money your agent spends on API calls
Whether deleted files can be recovered

The model generates a tool call. Your agent framework executes it. The provider has no visibility into or control over that execution.

The Responsibility Gap

User Request → LLM Model → Tool Call Response → Agent Executes Tool
                  ↑                                    ↑
          Provider controls this            YOU control this
                                            SafeClaw gates this

The LLM might return: { "tool": "file.delete", "path": "/important/data" }

The provider's safety filtered the model's text output. But the tool call is structurally valid JSON. The provider's safety layer sees it as a normal response. Your agent framework is about to execute it. Only SafeClaw stands between the tool call and the action:

# .safeclaw.yaml version: "1" defaultAction: deny rules: - action: file.read path: "./src/**" decision: allow - action: file.write path: "./src/**" decision: allow - action: file.delete decision: deny reason: "File deletion blocked by policy" - action: shell.execute command: "npm test" decision: allow - action: shell.execute decision: deny reason: "Unapproved shell commands blocked"

- action: network.request decision: deny reason: "Network access requires explicit approval"

Provider Safety Does Not Cover Prompt Injection

LLM providers are improving at detecting prompt injection, but no provider claims to prevent it completely. When an agent reads a document containing injected instructions, the model may follow those instructions and generate tool calls that the provider's safety layer considers valid. Your agent executes them.

This is not the provider's failure. The model generated a structurally valid tool call. The safety gap is at the execution layer — which is your responsibility.

Quick Start

Take responsibility for your agent's actions:

npx @authensor/safeclaw

SafeClaw works with Claude, OpenAI, and any other provider. One policy file, universal enforcement.

Why SafeClaw

446 tests covering the execution safety layer that providers don't
Deny-by-default on all tool executions, not just model outputs
Sub-millisecond evaluation — adds no latency to the provider's response time
Hash-chained audit trail for the actions providers don't log
Works with Claude AND OpenAI — provider-agnostic protection
MIT licensed — open source, zero lock-in, zero dependency on any provider

FAQ

Q: If I use Anthropic's Claude, doesn't it refuse dangerous actions?
A: Claude may refuse to generate certain responses, but tool calls are structured data returned by the API. The model's refusal mechanisms apply to content generation, not to the programmatic tool calls your agent framework executes.

Q: What about OpenAI's function calling safety?
A: OpenAI applies content safety to generated text. Function call parameters are generated as structured JSON. The provider does not know what file.delete with a specific path will do in your environment.

Q: Should I also use provider-level guardrails like Bedrock Guardrails?
A: Yes, as an additional layer. Provider guardrails protect the model layer. SafeClaw protects the action layer. Defense in depth means securing both.

Try SafeClaw

Action-level gating for AI agents. Set it up in your browser in 60 seconds.

$ npx @authensor/safeclaw