How to Make Your AI Agent Safe
To make your AI agent safe, install SafeClaw (npx @authensor/safeclaw), define a deny-by-default policy, and run in simulation mode before enforcing. SafeClaw evaluates every action — file writes, shell commands, network requests — before the agent executes it. This is action-level gating: the agent proposes an action, SafeClaw checks it against your policy, and the action only runs if explicitly allowed.
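Conceptually, the gate is a policy check that runs before any side effect. The sketch below shows that control flow in TypeScript; it is illustrative only and does not use SafeClaw's actual API.

type Action =
  | { kind: "file_write"; path: string }
  | { kind: "shell_exec"; command: string };

interface Rule {
  matches: (action: Action) => boolean;
  decision: "allow" | "deny";
}

// Evaluate the proposed action against the policy; with no matching rule,
// the default is deny, so the action never runs.
function evaluate(action: Action, rules: Rule[]): "allow" | "deny" {
  for (const rule of rules) {
    if (rule.matches(action)) return rule.decision;
  }
  return "deny";
}

const rules: Rule[] = [
  {
    matches: (a) => a.kind === "file_write" && a.path.startsWith("./output/"),
    decision: "allow",
  },
];

console.log(evaluate({ kind: "file_write", path: "./output/report.json" }, rules)); // "allow"
console.log(evaluate({ kind: "shell_exec", command: "rm -rf /" }, rules));          // "deny"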
Why This Matters
AI agents operate with real system access. They write files, execute shell commands, make network requests, and read sensitive directories. In 2025, the Clawdbot incident exposed 1.5 million API keys because the agent had unrestricted access to credential files and outbound network calls. Without action-level gating, your agent is one prompt injection away from exfiltrating your .env file or running rm -rf /.
Step-by-Step Instructions
Step 1: Install SafeClaw
npx @authensor/safeclaw
This runs the setup wizard. No dependencies are installed into your project — SafeClaw has zero third-party dependencies and runs as a standalone process. The client is 100% open source under the MIT license.
Step 2: Get Your API Key
Visit safeclaw.onrender.com to create your account. The free tier provides a 7-day renewable key with no credit card required. The browser dashboard includes a setup wizard that generates your initial policy.
Step 3: Define Your Policy
Create a safeclaw.yaml file in your project root. Start with deny-by-default — every action is blocked unless you explicitly allow it.
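A minimal starting point uses the same fields as the full Example Policy later in this guide. With an empty rules list (shown here only for illustration), every action the agent proposes is denied until you add allow rules:

version: "1.0"
default: deny
rules: []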
Step 4: Run in Simulation Mode
SAFECLAW_MODE=simulation npx @authensor/safeclaw
Simulation mode logs what would be blocked or allowed without actually preventing any actions. Run your agent through its normal workflows and review the logs. This shows you exactly which actions your agent takes and whether your policy handles them correctly.
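For example, if your agent writes somewhere your policy does not yet allow, you would expect to see a denied decision in the simulation log. The entry below is illustrative; the exact field names and log format are not specified here, so check your own output:

{
  "mode": "simulation",
  "action": "file_write",
  "path": "./reports/summary.json",
  "decision": "DENY",
  "rule": "default deny",
  "timestamp": "2026-02-13T09:55:12Z"
}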
Step 5: Switch to Enforce Mode
Once your simulation logs show correct allow/deny decisions, switch to enforce mode:
SAFECLAW_MODE=enforce npx @authensor/safeclaw
Now SafeClaw actively blocks any action not permitted by your policy. Policy evaluation is sub-millisecond, so your agent won't slow down.
Step 6: Review Your Audit Trail
Every action — allowed, denied, or escalated — is recorded in a tamper-proof audit trail using a SHA-256 hash chain. Each entry links cryptographically to the previous one, so no record can be altered or deleted without detection.
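The mechanism is straightforward to reason about: each entry's hash covers its own fields plus the previous entry's hash, so changing any record breaks every link after it. The sketch below shows how such a chain can be verified in principle; the field names are illustrative, not SafeClaw's actual audit schema.

import { createHash } from "node:crypto";

interface AuditEntry {
  action: string;
  decision: string;
  timestamp: string;
  prevHash: string; // hash of the previous entry ("0"-filled for the first one)
  hash: string;     // SHA-256 over this entry's fields plus prevHash
}

function entryHash(e: Omit<AuditEntry, "hash">): string {
  return createHash("sha256")
    .update(JSON.stringify([e.action, e.decision, e.timestamp, e.prevHash]))
    .digest("hex");
}

// Recompute every hash and confirm each entry points at the one before it.
// Any altered or deleted record makes verification fail from that point on.
function verifyChain(entries: AuditEntry[]): boolean {
  let prev = "0".repeat(64);
  for (const e of entries) {
    if (e.prevHash !== prev || entryHash(e) !== e.hash) return false;
    prev = e.hash;
  }
  return true;
}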
Example Policy
version: "1.0"
default: deny
rules:
- action: file_write
path: "./output/**"
decision: allow
reason: "Agent can write to output directory"
- action: file_read
path: "./data/**"
decision: allow
reason: "Agent can read project data"
- action: file_read
path: "~/.ssh/**"
decision: deny
reason: "Never read SSH keys"
- action: file_read
path: "**/.env"
decision: deny
reason: "Never read environment files"
- action: shell_exec
command: "npm test"
decision: allow
reason: "Agent can run tests"
- action: shell_exec
command: "rm *"
decision: deny
reason: "Never allow destructive commands"
- action: network
domain: "api.openai.com"
decision: allow
reason: "Agent can call OpenAI API"
- action: network
domain: "*"
decision: deny
reason: "Block all other outbound traffic"
What Happens When It Works
ALLOW — Agent writes a file to the permitted output directory:
{
  "action": "file_write",
  "path": "./output/report.json",
  "decision": "ALLOW",
  "rule": "Agent can write to output directory",
  "timestamp": "2026-02-13T10:00:01Z",
  "hash": "a1b2c3d4..."
}
DENY — Agent attempts to read your .env file:
{
  "action": "file_read",
  "path": "/Users/dev/project/.env",
  "decision": "DENY",
  "rule": "Never read environment files",
  "timestamp": "2026-02-13T10:00:02Z",
  "hash": "e5f6a7b8..."
}
REQUIRE_APPROVAL — Agent wants to install an npm package:
{
  "action": "shell_exec",
  "command": "npm install lodash",
  "decision": "REQUIRE_APPROVAL",
  "rule": "Package installs require human review",
  "timestamp": "2026-02-13T10:00:03Z",
  "hash": "c9d0e1f2..."
}
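Since each audit entry is a JSON object, you can review a run by filtering for denials. The command below assumes entries are written as JSON lines to a local file; the filename is a placeholder, so substitute wherever your installation writes its audit log:

jq 'select(.decision == "DENY")' safeclaw-audit.jsonl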
Common Mistakes
- Using allow-by-default policies. Many teams start by blocking a few known-bad actions and allowing everything else. This fails because you cannot enumerate every dangerous action an agent might take. Always start with deny-by-default and explicitly allow only what the agent needs; the sketch after this list shows the difference.
- Relying on prompt instructions for safety. Telling an agent "don't delete files" in its system prompt provides zero enforcement. Prompt instructions can be overridden by prompt injection, user manipulation, or model hallucination. Safety must be enforced at the action layer, not the prompt layer.
- Skipping simulation mode. Deploying an enforce-mode policy without first running simulation mode leads to either over-blocking (agent can't do its job) or under-blocking (policy has gaps you didn't anticipate). Always simulate first, review the logs, then enforce.
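To make the first mistake concrete, here is what a blocklist-style policy looks like next to the deny-by-default shape used throughout this guide. The snippet is illustrative only; whether SafeClaw even accepts default: allow is an assumption, and nothing outside the two listed rules would be stopped.

# Illustrative blocklist-style policy. Anything not listed below is allowed,
# so a destructive command you did not anticipate runs unchecked.
version: "1.0"
default: allow   # assumption, shown only to contrast with default: deny
rules:
  - action: shell_exec
    command: "rm *"
    decision: deny
    reason: "Known-bad command"
  - action: file_read
    path: "**/.env"
    decision: deny
    reason: "Known-sensitive file"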
Cross-References
- What Is Action-Level Gating?
- Deny-by-Default Explained
- SafeClaw vs Docker Sandboxing
- Simulation Mode Reference
- Tamper-Proof Audit Trail Specification
Try SafeClaw
Action-level gating for AI agents. Set it up in your browser in 60 seconds.
$ npx @authensor/safeclaw