A verifiable boundary between your agents and the actions they take.
Wardproof is a small, local-first framework that sits in front of your AI systems and screens what flows through them: prompt injection, dangerous tool calls, agent payments, memory poisoning. Independent agents cross-check every decision, and each verdict is written to a tamper-evident audit trail.
x402 agent payment, screened and blocked, then logged.
Defence in depth, not a single check.
Most AI security tooling is either a hosted black box or one LLM-as-a-judge call that can be talked out of its job. Wardproof treats the defensive model as untrusted and leans on plain, inspectable code first.
Event
An input, a proposed tool call, or a memory write arrives at a chokepoint.
Guardrails
Deterministic rules (regex plus logic) screen it first, with no model required.
Detector + Verifier
Two agents assess independently. The verifier also audits the detector for compromise.
Verdict
Allow, sanitize, escalate, or block. When the two disagree, the stricter verdict wins.
Sandbox
The responder acts only through a default-deny, permissioned, audited set of tools.
Ledger
The decision is appended to a hash-chained, optionally signed, verifiable log.
Transparent parts you can read, fork, and verify.
The security core has zero third-party dependencies and runs fully offline. For most custom variants you touch one file.
Prompt-injection guardrail
Transparent, weighted pattern detection across encodings and languages, plus a sanitizer for SANITIZE verdicts.
Tool-misuse guardrail
Flags destructive commands, exfiltration, and high-value actions inside proposed tool calls before they run.
Memory-poisoning guardrail
Catches durable "always do X, never tell anyone" writes to long-term memory and vector stores.
x402 payment guard
Screens agent payments over the x402 standard: recipient allowlist, spend thresholds that escalate for sign-off, replayed-nonce checks, and injection hidden in the 402 body. Chain-agnostic via CAIP-2.
MCP tool-call guard
Catches tool poisoning, hidden-Unicode descriptions, rug-pull manifest changes, and rogue servers, and audits every MCP tool call against an allowlist.
Capability sandbox
Default-deny permission broker with per-agent grants, rate limits, and argument validators, plus audited dispatch.
Verifiable audit ledger
A stdlib hash chain with optional Ed25519 signatures. Verify independently with wardproof verify-ledger.
Local-first core
Run with no model, a local model via Ollama, or any OpenAI-compatible API. No network calls in the core.
Framework integrations
Drop-in guards for OpenAI and Anthropic tool calls, CrewAI, LangGraph, MCP, Coinbase AgentKit, Venice, and Swarms. Screen every proposed tool call before it runs, with the same verdict and audit log.
Agent-to-agent transfer guard
Screens value transfers between agents (recipient allowlist, amount thresholds, and injection hidden in the instruction) before any funds move.
Skill and tool scanner
Scans skills and tool descriptions for poisoned instructions and hidden-Unicode payloads before they are installed or trusted.
Detection is measured, not asserted.
A labelled corpus of attacks and benign inputs ships with the code, so anyone can reproduce the numbers.
On the default configuration with no model, Wardproof flags all 89 attacks at a 0% false-positive rate. Treat that near-perfect number as a coverage and regression signal on known patterns, not a security guarantee: the corpus is small and partly self-authored, so novel attacks (other languages, fresh encodings, or pure-semantic paraphrase) can still slip past a deterministic denylist. Closing that gap is the job of the optional LLM second opinion. These patterns are the floor, not the ceiling. Full breakdown, including the one benign input the guardrails deliberately flag, is in the benchmark README.
Running in a few lines, offline.
Requires Python 3.11+. The core installs with zero third-party dependencies.
# install from PyPI pip install wardproof # optional extras pip install "wardproof[crypto]" # signed ledgers pip install "wardproof[ollama]" # local model
# screen one tool call or input from any shell # exits 0 only on ALLOW, so you can gate a step wardproof check "run_command" --args '{"cmd":"rm -rf /"}' # or run it as a local service: any agent, any # language, gates over HTTP with no process spawn wardproof serve curl -s localhost:8787/check -d '{"content":"get_weather"}' # production: optional bearer token, per-client # rate limit, and request body cap (all stdlib) wardproof serve --token $WARDPROOF_TOKEN --rate-limit 20
# worked examples, no model needed python examples/protect_x402_payments.py python examples/protect_defi_agent.py # verify an exported ledger wardproof verify-ledger ./audit.jsonl \ --pubkey <hex_public_key>
Built to be forked.
Add a domain guardrail, change thresholds, swap the model, or register your own mitigations. No need to touch the engine, the ledger, or the agent base classes.