open-source · MIT

A verifiable boundary between your agents and the actions they take.

Wardproof is a small, local-first framework that sits in front of your AI systems and screens what flows through them for prompt injection, dangerous tool calls, unsafe agent payments, and memory poisoning. Independent agents cross-check every decision, and each verdict is written to a tamper-evident audit trail.

View on GitHub · Read the threat model

$ pip install wardproof

x402 agent payment, screened and blocked, then logged.

How it works

Defence in depth, not a single check.

Most AI security tooling is either a hosted black box or one LLM-as-a-judge call that can be talked out of its job. Wardproof treats the defensive model as untrusted and leans on plain, inspectable code first.

Event

An input, a proposed tool call, or a memory write arrives at a chokepoint.

Guardrails

Deterministic rules (regex plus logic) screen it first, with no model required.

Detector + Verifier

Two agents assess independently. The verifier also audits the detector for compromise.

Verdict

Allow, sanitize, escalate, or block. When the two disagree, the stricter verdict wins.

Sandbox

The responder acts only through a default-deny, permissioned, audited set of tools.

Ledger

The decision is appended to a hash-chained, optionally signed, verifiable log.

Deterministic first. Rules cannot be social-engineered. The model may only raise concern, never lower a hard signal. Fail closed. When alerts spike, a circuit breaker pulls a human in.
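The "stricter verdict wins" and "fail closed" rules above can be sketched in a few lines. This is an illustration, not Wardproof's own code: the `Verdict` enum, the `decide` function, and the severity order (assumed from the order the verdicts are listed in) are all hypothetical names and choices.

```python
from enum import IntEnum
from typing import Callable, Optional


class Verdict(IntEnum):
    # Assumed severity order: later members are stricter.
    ALLOW = 0
    SANITIZE = 1
    ESCALATE = 2
    BLOCK = 3


def strictest(*verdicts: Verdict) -> Verdict:
    """When assessments disagree, the stricter verdict wins."""
    return max(verdicts)


def decide(
    rule_verdict: Verdict,
    model_check: Optional[Callable[[], Verdict]] = None,
) -> Verdict:
    """Combine a deterministic verdict with an optional model opinion."""
    if model_check is None:
        return rule_verdict  # no model configured: the rules decide alone
    try:
        model_verdict = model_check()
    except Exception:
        # Assumption: "fail closed" means a broken model call blocks.
        return Verdict.BLOCK
    # max() lets the model raise the verdict but never lower it.
    return strictest(rule_verdict, model_verdict)
```

Because the combination is a plain `max`, a model that has been talked into answering ALLOW cannot override a BLOCK from the rules.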

Features

Transparent parts you can read, fork, and verify.

The security core has zero third-party dependencies and runs fully offline. For most custom variants you touch one file.

Prompt-injection guardrail

Transparent, weighted pattern detection across encodings and languages, plus a sanitizer for SANITIZE verdicts.
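To make "weighted pattern detection" concrete, here is a minimal sketch. The patterns, weights, and threshold are invented for illustration and are not Wardproof's shipped rule set; the Unicode folding step stands in for the broader encoding handling described above.

```python
import re
import unicodedata

# Hypothetical (pattern, weight) pairs, for illustration only.
PATTERNS = [
    (re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I), 0.8),
    (re.compile(r"reveal (your |the )?system prompt", re.I), 0.6),
    (re.compile(r"you are now\b", re.I), 0.3),
]
THRESHOLD = 0.5  # assumed cut-off above which an input is flagged


def normalize(text: str) -> str:
    # NFKC folds full-width and other compatibility forms into plain characters,
    # then runs of whitespace are collapsed to single spaces.
    folded = unicodedata.normalize("NFKC", text)
    return " ".join(folded.split())


def score(text: str) -> tuple[float, list[str]]:
    """Return a capped score and the patterns that fired, so the result is explainable."""
    clean = normalize(text)
    hits = [(p.pattern, w) for p, w in PATTERNS if p.search(clean)]
    total = min(1.0, sum(w for _, w in hits))
    return total, [pattern for pattern, _ in hits]
```

Returning the list of matched patterns alongside the score keeps the decision inspectable: a reviewer can see exactly which rule fired and why.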

Tool-misuse guardrail

Flags destructive commands, exfiltration, and high-value actions inside proposed tool calls before they run.
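As a rough sketch of screening a proposed tool call before it runs, the snippet below walks every string in the call's arguments and checks it against a small denylist. The rule labels and regular expressions are hypothetical examples, far narrower than a real rule set.

```python
import re
from typing import Any, Iterator

# Hypothetical rules for illustration; each pairs a label with a pattern.
RULES = [
    ("recursive delete", re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*\s+/")),
    ("schema drop", re.compile(r"\bdrop\s+(table|database)\b", re.I)),
    ("pipe to shell", re.compile(r"\b(curl|wget)\b.*\|\s*(sh|bash)\b")),
]


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested anywhere inside the arguments."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def flag_tool_call(args: dict) -> list[str]:
    """Return the labels of all rules that match any argument string."""
    reasons = []
    for text in iter_strings(args):
        for label, pattern in RULES:
            if pattern.search(text):
                reasons.append(label)
    return reasons
```

An empty list means no rule fired; any non-empty result would be grounds for a stricter verdict before the call is dispatched.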

Memory-poisoning guardrail

Catches durable "always do X, never tell anyone" writes to long-term memory and vector stores.

x402 payment guard

Screens agent payments over the x402 standard: recipient allowlist, spend thresholds that escalate for sign-off, replayed-nonce checks, and injection hidden in the 402 body. Chain-agnostic via CAIP-2.
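The three payment checks named above (allowlist, threshold, replayed nonce) might be combined roughly as follows. `PaymentPolicy` and its fields are hypothetical names for this sketch, and the verdict strings are placeholders; the only detail taken from the text is that chains are identified by CAIP-2 ids such as `eip155:8453`.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class PaymentPolicy:
    # All fields are hypothetical policy knobs for this sketch.
    allowed_recipients: set[tuple[str, str]]  # (CAIP-2 chain id, address)
    escalate_above: Decimal                   # amounts above this need sign-off
    seen_nonces: set[str] = field(default_factory=set)

    def screen(self, chain: str, recipient: str, amount: Decimal, nonce: str) -> str:
        if nonce in self.seen_nonces:
            return "block"      # replayed nonce
        if (chain, recipient) not in self.allowed_recipients:
            return "block"      # recipient is not on the allowlist
        # Simplification: the nonce is consumed once the hard checks pass.
        self.seen_nonces.add(nonce)
        if amount > self.escalate_above:
            return "escalate"   # over threshold: hand to a human
        return "allow"
```

Keying the allowlist on the (chain, address) pair means an address approved on one network is not implicitly trusted on another.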

MCP tool-call guard

Catches tool poisoning, hidden-Unicode descriptions, rug-pull manifest changes, and rogue servers, and audits every MCP tool call against an allowlist.
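Two of these checks are easy to illustrate with the standard library alone: spotting invisible characters in a tool description, and detecting that a tool manifest changed after it was approved. The function names are hypothetical, and treating Unicode category `Cf` as "hidden" is a simplifying assumption rather than Wardproof's actual rule.

```python
import hashlib
import json
import unicodedata


def hidden_characters(text: str) -> list[str]:
    """Return code points of invisible format characters (Unicode category Cf)."""
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Cf"]


def manifest_fingerprint(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def manifest_changed(pinned: str, manifest: dict) -> bool:
    """True if the manifest no longer matches the fingerprint pinned at approval."""
    return manifest_fingerprint(manifest) != pinned
```

Pinning the fingerprint when a tool is first approved turns a later silent edit to its description (the "rug pull") into a detectable mismatch.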

Capability sandbox

Default-deny permission broker with per-agent grants, rate limits, and argument validators, plus audited dispatch.

Verifiable audit ledger

A stdlib hash chain with optional Ed25519 signatures. Verify independently with wardproof verify-ledger.
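For readers unfamiliar with hash chains, the idea fits in a few lines of stdlib Python. This is a generic sketch, not Wardproof's ledger format: the entry layout, the `GENESIS` placeholder, and the function names are assumptions, and signatures are left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(ledger: list[dict], record: dict) -> None:
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify(ledger: list[dict]) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

Since each hash covers the one before it, changing an old record invalidates every later entry. A signature over the latest hash would additionally tie the chain to a key holder.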

Local-first core

Run with no model, a local model via Ollama, or any OpenAI-compatible API. No network calls in the core.

Framework integrations

Drop-in guards for OpenAI and Anthropic tool calls, CrewAI, LangGraph, MCP, Coinbase AgentKit, Venice, and Swarms. Screen every proposed tool call before it runs, with the same verdict and audit log.

Agent-to-agent transfer guard

Screens value transfers between agents (recipient allowlist, amount thresholds, and injection hidden in the instruction) before any funds move.

Skill and tool scanner

Scans skills and tool descriptions for poisoned instructions and hidden-Unicode payloads before they are installed or trusted.

Benchmark

Detection is measured, not asserted.

A labelled corpus of attacks and benign inputs ships with the code, so anyone can reproduce the numbers.

89/89 attacks flagged

0% false-positive rate

136 cases, incl. red-team bypasses

On the default configuration with no model, Wardproof flags all 89 attacks at a 0% false-positive rate. Treat that result as a coverage and regression signal on known patterns, not a security guarantee: the corpus is small and partly self-authored, so novel attacks (other languages, fresh encodings, or pure-semantic paraphrase) can still slip past a deterministic denylist. Closing that gap is the job of the optional LLM second opinion. These patterns are the floor, not the ceiling. The full breakdown, including the one benign input the guardrails deliberately flag, is in the benchmark README.

Quickstart

Running in a few lines, offline.

Requires Python 3.11+. The core installs with zero third-party dependencies.

install.sh

# install from PyPI
pip install wardproof

# optional extras
pip install "wardproof[crypto]"   # signed ledgers
pip install "wardproof[ollama]"   # local model

gate.sh

# screen one tool call or input from any shell
# exits 0 only on ALLOW, so you can gate a step
wardproof check "run_command" --args '{"cmd":"rm -rf /"}'

# or run it as a local service: any agent, any
# language, gates over HTTP with no process spawn
wardproof serve
curl -s localhost:8787/check -d '{"content":"get_weather"}'

# production: optional bearer token, per-client
# rate limit, and request body cap (all stdlib)
wardproof serve --token "$WARDPROOF_TOKEN" --rate-limit 20
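Since `wardproof check` exits 0 only on ALLOW, a Python agent can gate a step on the exit code alone. The wrapper below is a sketch: the `gate` function and its `cli` parameter are hypothetical, and the only things taken from the commands above are the `check` subcommand and the `--args` flag.

```python
import json
import subprocess


def gate(tool: str, args: dict, cli: tuple[str, ...] = ("wardproof", "check")) -> bool:
    """Return True only if the check command exits 0 (ALLOW)."""
    cmd = [*cli, tool, "--args", json.dumps(args)]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False  # fail closed if the CLI is missing or hangs
    return result.returncode == 0
```

A caller would then run the tool only when `gate("run_command", {"cmd": "..."})` returns True. Treating a missing binary or a timeout as a refusal keeps the gate fail-closed.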

run.sh

# worked examples, no model needed
python examples/protect_x402_payments.py
python examples/protect_defi_agent.py

# verify an exported ledger
wardproof verify-ledger ./audit.jsonl \
    --pubkey <hex_public_key>

Forking? For most custom variants you touch one file: factory.py.

Built to be forked.

Add a domain guardrail, change thresholds, swap the model, or register your own mitigations. No need to touch the engine, the ledger, or the agent base classes.

Read CONTRIBUTING