Myth: The LLM Provider Handles AI Agent Safety
LLM providers like OpenAI and Anthropic control model behavior — content filtering, refusal of harmful requests, and output safety. They do not control what your agent does with tool calls after the model responds. SafeClaw by Authensor fills this gap by gating every tool execution through deny-by-default policies. The model layer and the agent layer have different safety responsibilities, and only you are responsible for the agent layer.
Why People Believe This Myth
LLM providers invest heavily in safety. OpenAI's usage policies, Anthropic's Constitutional AI, and Google's safety filters create a perception that the provider is handling safety comprehensively. If the model refuses to generate harmful content, shouldn't it also refuse to execute harmful actions?
The answer is no — because the model doesn't execute actions. Your agent framework does.
Where Provider Safety Ends and Your Responsibility Begins
What LLM Providers Control
- Content generation (refusing to write malware, harmful content)
- Token-level safety (output filtering, content moderation)
- Rate limiting and abuse detection
- Model-level refusals for harmful requests
What LLM Providers Do NOT Control
- Whether your agent executes the file.write tool call the model returned
- Which directories your agent can access
- What shell commands your agent runs
- Which network endpoints your agent contacts
- How much money your agent spends on API calls
- Whether deleted files can be recovered
The Responsibility Gap
User Request → LLM Model → Tool Call Response → Agent Executes Tool
↑ ↑
Provider controls this YOU control this
SafeClaw gates this
The LLM might return: { "tool": "file.delete", "path": "/important/data" }
The provider's safety filtered the model's text output. But the tool call is structurally valid JSON. The provider's safety layer sees it as a normal response. Your agent framework is about to execute it. Only SafeClaw stands between the tool call and the action:
# .safeclaw.yaml
version: "1"
defaultAction: deny
rules:
- action: file.read
path: "./src/**"
decision: allow
- action: file.write
path: "./src/**"
decision: allow
- action: file.delete
decision: deny
reason: "File deletion blocked by policy"
- action: shell.execute
command: "npm test"
decision: allow
- action: shell.execute
decision: deny
reason: "Unapproved shell commands blocked"
- action: network.request
decision: deny
reason: "Network access requires explicit approval"
Provider Safety Does Not Cover Prompt Injection
LLM providers are improving at detecting prompt injection, but no provider claims to prevent it completely. When an agent reads a document containing injected instructions, the model may follow those instructions and generate tool calls that the provider's safety layer considers valid. Your agent executes them.
This is not the provider's failure. The model generated a structurally valid tool call. The safety gap is at the execution layer — which is your responsibility.
Quick Start
Take responsibility for your agent's actions:
npx @authensor/safeclaw
SafeClaw works with Claude, OpenAI, and any other provider. One policy file, universal enforcement.
Why SafeClaw
- 446 tests covering the execution safety layer that providers don't
- Deny-by-default on all tool executions, not just model outputs
- Sub-millisecond evaluation — adds no latency to the provider's response time
- Hash-chained audit trail for the actions providers don't log
- Works with Claude AND OpenAI — provider-agnostic protection
- MIT licensed — open source, zero lock-in, zero dependency on any provider
FAQ
Q: If I use Anthropic's Claude, doesn't it refuse dangerous actions?
A: Claude may refuse to generate certain responses, but tool calls are structured data returned by the API. The model's refusal mechanisms apply to content generation, not to the programmatic tool calls your agent framework executes.
Q: What about OpenAI's function calling safety?
A: OpenAI applies content safety to generated text. Function call parameters are generated as structured JSON. The provider does not know what file.delete with a specific path will do in your environment.
Q: Should I also use provider-level guardrails like Bedrock Guardrails?
A: Yes, as an additional layer. Provider guardrails protect the model layer. SafeClaw protects the action layer. Defense in depth means securing both.
Related Pages
- SafeClaw vs AWS Bedrock Guardrails
- SafeClaw vs Prompt Engineering for AI Agent Safety
- Myth: AI Agents Always Follow Instructions
- Running AI Agents Without Safety Controls
Try SafeClaw
Action-level gating for AI agents. Set it up in your browser in 60 seconds.
$ npx @authensor/safeclaw