Prompt Injection Leading to Unauthorized File Access
Threat Description
Prompt injection is an attack where adversarial text embedded in user input, retrieved documents, or tool output causes an AI agent to perform actions its operator did not intend. When the injected instruction targets file operations, the agent may read sensitive files (credentials, private keys, proprietary source code) or write malicious content to system files (cron jobs, shell profiles, config files). The agent follows the injected instruction because it cannot reliably distinguish adversarial input from legitimate instructions.
Attack Vector
- An attacker embeds a hidden instruction in a file, webpage, or database record that the agent will process. Example: a README file containing
. - The agent retrieves or reads the file containing the injected prompt as part of its normal task.
- The injected instruction enters the agent's context window and is treated as a directive.
- The agent issues a
file_readaction targeting the sensitive file specified in the injection. - The agent issues a
file_writeaction to persist the sensitive data to an attacker-accessible location, or anetworkaction to transmit it externally.
{
"action": "file_read",
"params": {
"path": "/etc/passwd"
},
"agentId": "rag-agent-02",
"timestamp": "2026-02-13T09:15:00Z"
}
The injected file write action:
{
"action": "file_write",
"params": {
"path": "/tmp/exfil.txt",
"content": "root:x:0:0:root:/root:/bin/bash\n..."
},
"agentId": "rag-agent-02",
"timestamp": "2026-02-13T09:15:02Z"
}
Real-World Context
Prompt injection is a well-documented attack class against LLM-based systems. Research from multiple security teams has demonstrated successful prompt injection against agents using LangChain, AutoGPT, and similar frameworks. In the Clawdbot incident (1.5M API keys leaked), the agent's unrestricted file access meant that even without intentional prompt injection, the agent read credential files as a side effect of normal operation. Prompt injection amplifies this risk by directing the agent to read files it would not normally access.
Indirect prompt injection — where the malicious instruction is embedded in data the agent retrieves rather than in direct user input — is particularly dangerous because the operator has no visibility into the content before the agent processes it.
Why Existing Defenses Fail
Prompt-level defenses (system prompts saying "do not read sensitive files") are the exact target of prompt injection attacks. The attack works by overriding or circumventing these instructions. Defending against prompt injection with more prompts is a circular defense.
Input sanitization can catch known injection patterns in direct user input, but indirect prompt injection arrives through tool outputs, API responses, and retrieved documents. Sanitizing all possible input sources is infeasible in practice.
Sandboxing limits the available filesystem but cannot distinguish between a legitimate file read (agent reading a source file) and an injected file read (agent reading /etc/passwd) because both are file_read operations from the same process.
File permissions are process-level. If the agent process has read access to a directory, every file in that directory is readable regardless of whether the read was initiated by a legitimate instruction or an injected one.
How Action-Level Gating Prevents This
SafeClaw by Authensor evaluates every action at the point of execution, independent of how the action was triggered. The policy engine does not inspect the prompt or the agent's reasoning — it inspects the action itself. This makes action-level gating immune to prompt injection because the defense operates below the prompt layer.
- Path-based DENY rules block
file_readandfile_writeactions targeting paths outside the agent's permitted workspace, regardless of why the agent is attempting the action. - Deny-by-default architecture ensures that any file path not explicitly permitted is automatically blocked. An injected instruction to read
/etc/passwdfails because/etc/passwdis not in the allowlist. - Action-type restrictions can limit an agent to only
file_read(nofile_write), preventing injected write operations entirely. - Sub-millisecond evaluation means the gating adds no perceptible latency. The policy engine, built in TypeScript strict mode with zero third-party dependencies, evaluates each action in under a millisecond.
/etc/passwd but lacks the policy permission to do so cannot execute the read.
Example Policy
{
"rules": [
{
"action": "file_read",
"match": { "pathPattern": "/project/src/" },
"effect": "ALLOW",
"reason": "Agent may read source files in the project"
},
{
"action": "file_read",
"match": { "pathPattern": "/project/docs/" },
"effect": "ALLOW",
"reason": "Agent may read documentation files"
},
{
"action": "file_write",
"match": { "pathPattern": "/project/src/" },
"effect": "ALLOW",
"reason": "Agent may write source files in the project"
},
{
"action": "file_write",
"match": { "pathPattern": "**" },
"effect": "DENY",
"reason": "All writes outside project src are denied"
},
{
"action": "file_read",
"match": { "pathPattern": "**" },
"effect": "DENY",
"reason": "All reads outside permitted directories are denied"
}
]
}
First-match-wins evaluation means the ALLOW rules for project directories fire first. Any file_read or file_write targeting a path outside those directories hits the trailing DENY rules. An injected instruction to read /etc/passwd or write to /tmp/exfil.txt is denied regardless of the prompt content.
Detection in Audit Trail
SafeClaw's tamper-proof audit trail (SHA-256 hash chain) records every denied action with full context:
[2026-02-13T09:15:00Z] action=file_read path=/etc/passwd agent=rag-agent-02 verdict=DENY rule="All reads outside permitted directories are denied" hash=c4a9f2...
[2026-02-13T09:15:02Z] action=file_write path=/tmp/exfil.txt agent=rag-agent-02 verdict=DENY rule="All writes outside project src are denied" hash=d8b3e7...
A spike in DENY entries from a single agent session is a strong indicator of prompt injection. The audit trail preserves the full sequence of attempted actions, enabling forensic reconstruction of the injection chain. Each hash references the previous entry, making retroactive tampering detectable. The control plane receives only action metadata — never file contents or prompt text.
Cross-References
- AI Agent Security Risks FAQ — Prompt injection listed as a primary agent threat
- SafeClaw vs Prompt Guardrails Comparison — Why prompt-level defenses fail against injection
- Deny-by-Default Definition — Architecture that blocks unpermitted actions regardless of trigger
- Policy Rule Syntax Reference — Path pattern matching and first-match-wins evaluation
- Audit Trail Specification — SHA-256 hash chain format and forensic use
Try SafeClaw
Action-level gating for AI agents. Set it up in your browser in 60 seconds.
$ npx @authensor/safeclaw