2026-02-03 · Authensor

How to Set Up Guardrails for AI Agents

To set up guardrails for AI agents, install SafeClaw (npx @authensor/safeclaw) and define a deny-by-default policy that gates every action at the execution layer. The term "guardrails" is commonly used to mean content filters or prompt-level restrictions, but these only control what the model says — not what the agent does. Effective guardrails for agents must operate at the action layer, evaluating file writes, shell commands, and network requests before they execute.

Why This Matters

Most "guardrails" products filter model outputs for harmful content — profanity, bias, prompt injection. These are important but insufficient for agents. An agent that passes every content filter can still delete your files, read your SSH keys, or send your API keys to an external server. The Clawdbot incident leaked 1.5 million API keys despite the agent producing perfectly polite, professional output. The agent's text was fine; its actions were catastrophic. Guardrails for agents must control actions, not just words.

Step-by-Step Instructions

Step 1: Understand the Guardrails Spectrum

There are three layers where guardrails can operate:

| Layer | What It Controls | Example | Limitation |
|---|---|---|---|
| Prompt-level | What the model is told to do | System prompt instructions | Bypassed by prompt injection |
| Output-level | What the model says | Content filters, toxicity checks | Doesn't prevent actions |
| Action-level | What the agent does | SafeClaw policy evaluation | None at this layer; enforces actual behavior |

Action-level gating is the only layer that prevents the agent from performing dangerous operations on your system.
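
The difference is easiest to see in code. The sketch below is purely illustrative and uses made-up types and a stand-in policy check; it is not SafeClaw's API. The point is the shape: every proposed action passes through an evaluation step before it executes, and anything unmatched is denied.

// Minimal illustration of action-level gating (hypothetical types, not SafeClaw's API).

type Decision = "allow" | "deny" | "require_approval";

interface ProposedAction {
  action: "file_read" | "file_write" | "shell_exec" | "network";
  target: string; // path, command, or domain
}

// A stand-in policy: deny unless the target matches an allowed prefix.
function evaluate(a: ProposedAction): Decision {
  if (a.action === "file_read" && a.target.startsWith("./project/")) return "allow";
  if (a.action === "shell_exec" && a.target.startsWith("npm test")) return "allow";
  return "deny"; // deny-by-default
}

// The gate sits between the model's tool call and the real execution.
async function gated(a: ProposedAction, execute: () => Promise<string>): Promise<string> {
  const decision = evaluate(a);
  if (decision !== "allow") {
    throw new Error(`${decision}: ${a.action} on ${a.target}`);
  }
  return execute();
}

// Example: the shell tool only runs after the gate says allow.
gated({ action: "shell_exec", target: "npm test" }, async () => "tests passed")
  .then(console.log)
  .catch((e) => console.error(String(e)));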

Step 2: Install SafeClaw

npx @authensor/safeclaw

SafeClaw provides action-level gating with zero third-party dependencies, sub-millisecond policy evaluation, and a fully open-source client (MIT license). It works with Claude, OpenAI, LangChain, CrewAI, AutoGen, MCP, and Cursor.

Step 3: Get Your API Key

Visit safeclaw.onrender.com. The free tier includes a 7-day renewable key with no credit card required, and the browser dashboard walks you through policy creation.

Step 4: Define Action-Level Guardrails

Instead of writing rules about what the model should say, write rules about what the agent can do. Each rule specifies an action type, a target (path, command, or domain), and a decision (allow, deny, or require_approval).
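
As a rough mental model, evaluation can be thought of as first-match-wins over an ordered rule list, falling through to the policy default. The sketch below assumes that semantics and a simplified glob matcher; the rule fields mirror this guide, but the matcher is an illustration, not SafeClaw's implementation.

// Illustrative rule evaluation: first match wins, unmatched actions hit the default deny.

interface Rule {
  action: string;
  target: string;              // path, command, or domain pattern
  decision: "allow" | "deny" | "require_approval";
  reason: string;
}

// Convert a simple glob ("*" = any chars except "/", "**" = any chars) to a RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000")
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

function decide(rules: Rule[], action: string, target: string) {
  for (const rule of rules) {
    if (rule.action === action && globToRegExp(rule.target).test(target)) {
      return { decision: rule.decision, reason: rule.reason };
    }
  }
  return { decision: "deny" as const, reason: "No matching rule (deny-by-default)" };
}

// Example: a write outside the allowed output directory falls through to the default deny.
const rules: Rule[] = [
  { action: "file_write", target: "./project/output/**", decision: "allow", reason: "Agent can write to output" },
];
console.log(decide(rules, "file_write", "./project/output/report.md")); // allow
console.log(decide(rules, "file_write", "/etc/passwd"));                // deny-by-default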

Step 5: Layer Your Guardrails

Effective agent guardrails use defense in depth:

  1. Deny-by-default policy — Blocks everything not explicitly allowed
  2. Explicit deny rules — Named blocks for known-dangerous patterns (even though default-deny would catch them, explicit rules provide clear audit trail entries)
  3. Require-approval rules — Human-in-the-loop for ambiguous actions
  4. Tamper-proof audit trail — SHA-256 hash chain records every decision
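
Item 4 is worth a concrete illustration. In a hash chain, each audit entry's hash covers the previous entry's hash, so editing, reordering, or deleting an entry breaks every hash after it. The sketch below shows the general technique with Node's built-in crypto module; the entry fields and chaining scheme here are assumptions for illustration, not SafeClaw's documented format.

import { createHash } from "node:crypto";

// A simplified audit entry, loosely shaped like the examples later in this guide.
interface AuditEntry {
  action: string;
  target: string;
  decision: string;
  rule: string;
  timestamp: string;
  hash?: string;
}

// Chain each entry to its predecessor: hash = SHA-256(previous hash + entry contents).
// Tampering with any entry invalidates every hash that follows it.
function appendToChain(log: AuditEntry[], entry: AuditEntry): AuditEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash! : "genesis";
  const payload = JSON.stringify({ ...entry, hash: undefined });
  entry.hash = createHash("sha256").update(prevHash + payload).digest("hex");
  return [...log, entry];
}

let log: AuditEntry[] = [];
log = appendToChain(log, {
  action: "shell_exec",
  target: "npm test",
  decision: "ALLOW",
  rule: "Test execution allowed",
  timestamp: new Date().toISOString(),
});
console.log(log[0].hash);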

Step 6: Simulate, Then Enforce

# Test your guardrails
SAFECLAW_MODE=simulation npx @authensor/safeclaw

# Then activate them

SAFECLAW_MODE=enforce npx @authensor/safeclaw
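
The same two-phase habit is worth copying in your own tooling: log what would be blocked before you start blocking. A minimal sketch of the pattern, assuming only that SAFECLAW_MODE carries the values shown above; the handler itself is illustrative.

// Illustrative mode switch: log-only in simulation, block in enforce.
const mode = process.env.SAFECLAW_MODE === "enforce" ? "enforce" : "simulation";

function handleDecision(decision: "allow" | "deny" | "require_approval", description: string): boolean {
  if (decision === "allow") return true;
  if (mode === "simulation") {
    console.warn(`[simulation] would have blocked: ${description} (${decision})`);
    return true; // action proceeds, but you learn what enforcement will block
  }
  console.error(`[enforce] blocked: ${description} (${decision})`);
  return false;
}

console.log(handleDecision("deny", "shell_exec: rm -rf ./project"));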

Example Policy

version: "1.0"
default: deny

rules:
  # ---- GUARDRAIL: File system boundaries ----
  - action: file_read
    path: "./project/**"
    decision: allow
    reason: "Agent can read project files"

  - action: file_write
    path: "./project/output/**"
    decision: allow
    reason: "Agent can write to output"

  - action: file_write
    path: "./project/src/**"
    decision: require_approval
    reason: "Source changes need human review"

  # ---- GUARDRAIL: Credential protection ----
  - action: file_read
    path: "*/.env"
    decision: deny
    reason: "Credential files blocked"

  - action: file_read
    path: "~/.ssh/**"
    decision: deny
    reason: "SSH keys blocked"

  - action: file_read
    path: "~/.aws/**"
    decision: deny
    reason: "AWS credentials blocked"

  # ---- GUARDRAIL: Shell command control ----
  - action: shell_exec
    command: "npm test*"
    decision: allow
    reason: "Test execution allowed"

  - action: shell_exec
    command: "npm run lint*"
    decision: allow
    reason: "Lint execution allowed"

  - action: shell_exec
    command: "rm *"
    decision: deny
    reason: "Destructive commands blocked"

  - action: shell_exec
    command: "chmod *"
    decision: deny
    reason: "Permission changes blocked"

  - action: shell_exec
    command: "sudo *"
    decision: deny
    reason: "Privilege escalation blocked"

  # ---- GUARDRAIL: Network boundaries ----
  - action: network
    domain: "api.anthropic.com"
    decision: allow
    reason: "LLM API allowed"

  - action: network
    domain: "api.openai.com"
    decision: allow
    reason: "LLM API allowed"

  - action: network
    domain: "*"
    decision: deny
    reason: "All other outbound blocked"

What Happens When It Works

ALLOW — Agent runs tests within permitted guardrails:

{
  "action": "shell_exec",
  "command": "npm test -- --coverage",
  "decision": "ALLOW",
  "rule": "Test execution allowed",
  "timestamp": "2026-02-13T12:00:01Z",
  "hash": "f2g3h4i5..."
}

DENY — Agent attempts privilege escalation:

{
  "action": "shell_exec",
  "command": "sudo chmod 777 /etc/passwd",
  "decision": "DENY",
  "rule": "Privilege escalation blocked",
  "timestamp": "2026-02-13T12:00:03Z",
  "hash": "j6k7l8m9..."
}

REQUIRE_APPROVAL — Agent wants to modify source code:

{
  "action": "file_write",
  "path": "./project/src/auth.ts",
  "decision": "REQUIRE_APPROVAL",
  "rule": "Source changes need human review",
  "timestamp": "2026-02-13T12:00:05Z",
  "hash": "n0o1p2q3..."
}
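
Because each hash covers the entry before it, anyone holding the log can recheck it end to end. The sketch below assumes a JSON-lines export and the chaining scheme sketched in Step 5; both are assumptions for illustration rather than documented SafeClaw behavior.

import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Recompute each hash from its predecessor (assumed scheme: SHA-256(prevHash + entry-without-hash)).
function verifyChain(path: string): boolean {
  const entries = readFileSync(path, "utf8").trim().split("\n").map((line) => JSON.parse(line));
  let prevHash = "genesis";
  for (const entry of entries) {
    const { hash, ...rest } = entry;
    const expected = createHash("sha256").update(prevHash + JSON.stringify(rest)).digest("hex");
    if (hash !== expected) return false; // entry was altered, reordered, or removed
    prevHash = hash;
  }
  return true;
}

console.log(verifyChain("./audit-log.jsonl") ? "chain intact" : "chain broken");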

Common Mistakes

  1. Treating prompt-level guardrails as security. Telling the agent "never delete files" in the system prompt is a suggestion, not an enforcement mechanism. Prompt injection can override these instructions, and model hallucination can cause the agent to ignore them. Prompt-level guardrails are useful for steering behavior but are not a security boundary.
  2. Using allow-by-default with a blocklist. Many guardrail implementations start by blocking known-bad patterns and allowing everything else. This fails because you cannot enumerate every dangerous action. A new attack vector, an unexpected shell command, or a novel file path will bypass your blocklist. Always use deny-by-default.
  3. Not differentiating between output guardrails and action guardrails. Output guardrails (content filters) and action guardrails (execution gating) solve different problems. You need both: content filters for what the agent says to users, action gating for what the agent does to your system. SafeClaw handles the action layer; use your existing content safety tools for the output layer.
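
Mistake 2 is easy to demonstrate with a toy comparison. The snippet below is illustrative only: a blocklist misses a destructive command it never anticipated, while an allowlist with deny-by-default blocks it.

// Allow-by-default with a blocklist: anything not on the list slips through.
const blocklist = ["rm ", "sudo ", "chmod "];
const blocklistAllows = (cmd: string) => !blocklist.some((bad) => cmd.startsWith(bad));

// Deny-by-default with an allowlist: anything not on the list is blocked.
const allowlist = ["npm test", "npm run lint"];
const allowlistAllows = (cmd: string) => allowlist.some((ok) => cmd.startsWith(ok));

const novel = "find . -name '*.ts' -delete"; // destructive, but not on the blocklist
console.log(blocklistAllows(novel)); // true  -- slips through
console.log(allowlistAllows(novel)); // false -- caught by deny-by-default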

Try SafeClaw

Action-level gating for AI agents. Set it up in your browser in 60 seconds.

$ npx @authensor/safeclaw