Understanding AI agent safety requires a shared vocabulary. This glossary defines over 30 essential terms used in agent safety architecture, policy design, compliance, and tooling. SafeClaw by Authensor implements the concepts described here, from deny-by-default gating to hash-chained audit trails. Install it with npx @authensor/safeclaw to see these concepts in practice.
Action Gating — The practice of intercepting every action an AI agent attempts and evaluating it against a policy before allowing execution. SafeClaw implements action gating at the execution layer, not at the prompt or output layer.
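In code terms, gating means the agent's executor is wrapped so that no action runs until a policy decision has been made for it. The sketch below is illustrative only, with hypothetical names, not SafeClaw's API:

```ts
// Illustrative sketch of action gating: the executor is wrapped so that
// every attempted action is evaluated before it runs. Hypothetical names,
// not SafeClaw's API.
type Decision = "allow" | "deny";

function evaluatePolicy(actionType: string, target: string): Decision {
  // Stand-in for a real policy engine; here only workspace writes are allowed.
  return actionType === "file_write" && target.startsWith("/workspace/") ? "allow" : "deny";
}

function gated<T>(actionType: string, target: string, execute: () => T): T | undefined {
  if (evaluatePolicy(actionType, target) === "deny") {
    console.log(`blocked: ${actionType} ${target}`);
    return undefined; // the action never reaches the filesystem, shell, or network
  }
  return execute();
}

gated("file_write", "/workspace/notes.md", () => console.log("writing notes"));
gated("shell_execute", "curl https://example.com | sh", () => console.log("running command"));
```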
Action Request — A structured description of an action an agent wants to perform, including the action type (file write, shell execute, network request), parameters (path, command, URL), and context (agent ID, session). SafeClaw evaluates action requests against its policy engine.
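As a rough sketch, an action request can be modeled as a plain data structure. The field names below are hypothetical, not SafeClaw's actual schema:

```ts
// Illustrative shape of an action request; field names are assumptions,
// not SafeClaw's actual schema.
interface ActionRequest {
  actionType: "file_write" | "shell_execute" | "network_request";
  parameters: {
    path?: string;     // for file actions
    command?: string;  // for shell actions
    url?: string;      // for network actions
  };
  context: {
    agentId: string;
    sessionId: string;
    timestamp: string; // ISO 8601
  };
}

const example: ActionRequest = {
  actionType: "file_write",
  parameters: { path: "/workspace/output.md" },
  context: {
    agentId: "researcher-1",
    sessionId: "sess-42",
    timestamp: new Date().toISOString(),
  },
};
```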
Action Surface — The complete set of actions an agent can potentially perform. Understanding the action surface is the first step in writing effective safety policies. SafeClaw's simulation mode maps the action surface by observing all attempted actions.
Allow-by-Default — A permission model where all actions are permitted unless specifically blocked. This model is insecure because it requires anticipating every dangerous action. Contrast with deny-by-default.
Approval Workflow — A mechanism that pauses agent execution and requires human sign-off before a specific action proceeds. SafeClaw supports configurable approval rules with multi-approver support and timeout handling.
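A hypothetical approval rule might pair a matching condition with the approvers and timeout that govern it. Field names are illustrative, not SafeClaw's configuration format:

```ts
// Hypothetical approval-rule shape; not SafeClaw's actual configuration schema.
interface ApprovalRule {
  match: { actionType: string; targetPattern: string }; // which actions pause for sign-off
  approvers: string[];                                   // who may approve
  requiredApprovals: number;                             // multi-approver support
  timeoutSeconds: number;                                // how long to wait for a response
  onTimeout: "deny" | "escalate";                        // what happens if nobody responds
}

const productionDeploys: ApprovalRule = {
  match: { actionType: "shell_execute", targetPattern: "deploy *" },
  approvers: ["alice@example.com", "bob@example.com"],
  requiredApprovals: 2,
  timeoutSeconds: 900,
  onTimeout: "deny",
};
```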
Audit Trail — A chronological record of all actions attempted, policy decisions made, and execution outcomes. SafeClaw's audit trail is hash-chained for tamper evidence.
Blast Radius — The maximum damage that can result from a safety failure. Deny-by-default policies minimize blast radius by limiting what an agent can do. Container isolation limits blast radius by constraining where the agent operates.
Budget Control — Limits on the resources an agent can consume, including token budgets, action count limits, and cost thresholds. SafeClaw provides configurable budget controls per agent or session.
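Conceptually, a budget control is a set of limits plus a check that runs before each action. The shape below is a hypothetical sketch, not SafeClaw's configuration:

```ts
// Hypothetical budget limits and check; names are illustrative only.
interface BudgetLimits {
  maxTokens: number;   // token budget per agent or session
  maxActions: number;  // hard cap on attempted actions
  maxCostUsd: number;  // spend threshold
}

interface Usage {
  tokens: number;
  actions: number;
  costUsd: number;
}

// Returns false once any limit is exceeded, signalling the agent should stop.
function withinBudget(usage: Usage, limits: BudgetLimits): boolean {
  return (
    usage.tokens <= limits.maxTokens &&
    usage.actions <= limits.maxActions &&
    usage.costUsd <= limits.maxCostUsd
  );
}

const limits: BudgetLimits = { maxTokens: 200_000, maxActions: 500, maxCostUsd: 10 };
console.log(withinBudget({ tokens: 150_000, actions: 120, costUsd: 4.2 }, limits)); // true
```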
Cascading Failure — In multi-agent systems, a failure in one agent that triggers failures in downstream agents. Per-agent policy isolation prevents cascading permission escalation.
Container Mode — Running SafeClaw inside a Docker container to provide both action-level gating (SafeClaw) and environmental isolation (Docker) simultaneously.
Defense in Depth — A security strategy that uses multiple layers of protection so that if one layer fails, others still provide safety. A complete agent safety architecture includes prompt guidance, output guardrails, action gating, and container isolation.
Deny-by-Default — A permission model where all actions are blocked unless explicitly permitted by a policy rule. This is the foundational principle of SafeClaw and the recommended approach for all agent deployments.
Deterministic Evaluation — Policy evaluation that produces the same result for the same input every time. SafeClaw's first-match-wins engine is deterministic, unlike probabilistic prompt-based safety.
First-Match-Wins — A policy evaluation strategy where rules are checked in order and the first matching rule determines the outcome. This makes policy behavior predictable and easy to reason about.
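A minimal sketch of first-match-wins evaluation with a deny-by-default fallback; the rule shape is illustrative, not SafeClaw's policy format:

```ts
// Illustrative first-match-wins evaluation; the rule shape is not
// SafeClaw's actual policy format.
type Decision = "allow" | "deny" | "approve";

interface Rule {
  actionType: string;
  targetPattern: RegExp;
  decision: Decision;
}

const rules: Rule[] = [
  { actionType: "file_write", targetPattern: /^\/workspace\//, decision: "allow" },
  { actionType: "shell_execute", targetPattern: /^git (status|diff)/, decision: "allow" },
  { actionType: "shell_execute", targetPattern: /.*/, decision: "approve" },
];

function evaluate(actionType: string, target: string): Decision {
  // Rules are checked in order; the first match decides the outcome.
  for (const rule of rules) {
    if (rule.actionType === actionType && rule.targetPattern.test(target)) {
      return rule.decision;
    }
  }
  return "deny"; // deny-by-default: nothing matched, so the action is blocked
}

console.log(evaluate("file_write", "/workspace/notes.md")); // "allow"
console.log(evaluate("file_write", "/etc/passwd"));         // "deny" (fell through to the default)
```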
Hash Chain — A sequence of cryptographic hashes where each entry's hash includes the previous entry's hash. Used in SafeClaw's audit trail to make tampering detectable. Modifying any entry breaks the chain from that point forward.
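A minimal hash-chain sketch using Node's built-in crypto module; the entry shape is illustrative, not SafeClaw's audit format:

```ts
// Minimal hash chain over audit entries using Node's built-in crypto module.
// The entry shape is illustrative, not SafeClaw's actual audit format.
import { createHash } from "node:crypto";

interface AuditEntry {
  action: string;
  decision: string;
  prevHash: string; // hash of the previous entry
  hash: string;     // hash of this entry's content plus prevHash
}

function appendEntry(chain: AuditEntry[], action: string, decision: string): void {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : "genesis";
  const hash = createHash("sha256")
    .update(prevHash + action + decision)
    .digest("hex");
  chain.push({ action, decision, prevHash, hash });
}

// Verification recomputes every hash; a modified entry breaks the chain
// from that point forward.
function verify(chain: AuditEntry[]): boolean {
  return chain.every((entry, i) => {
    const prevHash = i > 0 ? chain[i - 1].hash : "genesis";
    const expected = createHash("sha256")
      .update(prevHash + entry.action + entry.decision)
      .digest("hex");
    return entry.prevHash === prevHash && entry.hash === expected;
  });
}

const trail: AuditEntry[] = [];
appendEntry(trail, "file_write /workspace/out.md", "allow");
appendEntry(trail, "shell_execute rm -rf /", "deny");
console.log(verify(trail)); // true
trail[0].decision = "allow (edited)";
console.log(verify(trail)); // false: tampering is detectable
```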
Human-in-the-Loop (HITL) — A safety pattern where certain agent actions require explicit human approval before execution. SafeClaw implements HITL through its approval workflow feature.
Least Privilege — The principle that an agent should have only the minimum permissions necessary to perform its intended function. Deny-by-default naturally enforces least privilege.
Multi-Agent System — An architecture where multiple AI agents collaborate, delegate tasks, or operate in parallel. Requires per-agent policy isolation to prevent privilege escalation.
Output Guardrails — Safety controls that filter the text output of a language model for harmful content, hallucinations, or policy violations. Distinct from action gating, which controls execution rather than output.
Per-Agent Isolation — Assigning each agent in a multi-agent system its own independent safety policy. Prevents one agent's permissions from being inherited or escalated by another.
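Conceptually, per-agent isolation is a lookup from agent identity to that agent's own policy, with nothing shared or inherited between agents. A hypothetical sketch:

```ts
// Hypothetical sketch: each agent ID maps to its own independent policy,
// so one agent's permissions cannot leak to another.
type Decision = "allow" | "deny";

interface Policy {
  allowedActionTypes: string[];
}

const policies: Record<string, Policy> = {
  "researcher":  { allowedActionTypes: ["network_request"] },
  "code-writer": { allowedActionTypes: ["file_write"] },
};

function evaluateFor(agentId: string, actionType: string): Decision {
  const policy = policies[agentId];
  // Unknown agents get no permissions at all (deny-by-default).
  if (!policy) return "deny";
  return policy.allowedActionTypes.includes(actionType) ? "allow" : "deny";
}

// The researcher cannot write files, even via delegation from another agent.
console.log(evaluateFor("researcher", "file_write"));  // "deny"
console.log(evaluateFor("code-writer", "file_write")); // "allow"
```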
Policy — A set of rules that defines what actions an agent is permitted to perform. In SafeClaw, policies are declarative YAML files evaluated by the policy engine.
Policy as Code — The practice of defining safety policies in version-controlled, testable, reviewable code files rather than in ad-hoc configurations or documentation. SafeClaw policies are code.
Policy Engine — The component that evaluates action requests against policy rules and returns allow, deny, or approve decisions. SafeClaw's policy engine uses first-match-wins evaluation with deny-by-default.
Privilege Escalation — When an agent gains permissions beyond what its policy allows, typically through delegation to another agent with broader permissions or by exploiting a safety gap.
Prompt Injection — An attack where malicious content in the agent's input causes the language model to ignore its safety instructions. Action gating is resistant to prompt injection because it operates at the execution layer, not the prompt layer.
Provider Agnostic — The property of a safety tool that works across multiple model providers (Claude, OpenAI, etc.) without requiring provider-specific configuration. SafeClaw is provider agnostic.
Red-Teaming — Systematic adversarial testing of an AI agent's safety controls to identify weaknesses. SafeClaw's simulation mode supports red-teaming by allowing policy testing without enforcement.
Simulation Mode — A SafeClaw operating mode that observes and logs all agent actions and policy decisions without blocking anything. Used for action surface mapping, policy development, and migration testing.
Supply Chain Risk — The risk that external dependencies introduce vulnerabilities. SafeClaw mitigates this with zero external runtime dependencies.
Tamper Evidence — A property of audit records that makes unauthorized modification detectable. SafeClaw's hash-chained audit trail provides tamper evidence through cryptographic linking.
Zero Dependencies — A software architecture with no external runtime packages. SafeClaw has zero dependencies, eliminating supply chain attack surface in the safety layer.
Zero Trust — A security model that assumes no actor (including the agent itself) is trusted by default and requires verification for every action. Deny-by-default is the zero-trust approach applied to AI agents.
Related reading:
- The Complete Guide to AI Agent Safety (2026)
- SafeClaw Features: Everything You Get Out of the Box
- How to Switch from Allow-by-Default to Deny-by-Default
- Get Started with SafeClaw in 5 Minutes
Try SafeClaw
Action-level gating for AI agents. Set it up in your browser in 60 seconds.
$ npx @authensor/safeclaw