📚 Agent Stack Datasets
13 small, hand-curated datasets for testing the boring parts of AI agents: token counting, prompt injection, RAG eval, tool validation, hallucination, PII detection, and more. All are MIT-licensed.
Test fixtures + benchmarks
jailbreak-corpus-mini
15 curated jailbreak fixtures across attack families.
prompt-injection-patterns-extended
30 prompt-injection patterns across 10 categories.
pii-detection-fixtures
25 strings labeled for PII / secrets with span offsets.
tool-arg-validation-cases
20 (tool, schema, args) tuples — valid + invalid.
mcp-tool-test-fixtures
22 MCP tool-call args across 8 categories.
llm-output-extraction-cases
20 messy LLM outputs with expected JSON.
hallucination-risk-cases
20 prompt → response pairs rated for hallucination risk.
rag-quality-benchmarks-mini
15 RAG eval queries with ground-truth answers.
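As an illustration of how span-labeled fixtures such as pii-detection-fixtures might be checked, here is a minimal sketch. The record layout (`text`, `spans`, `start`, `end`, `label`) and the sample record are hypothetical, and the sketch assumes character offsets with an exclusive end; the actual dataset may use different field names or offset conventions.

```python
from typing import List, Tuple, TypedDict


class Span(TypedDict):
    # Hypothetical field names; the real dataset schema may differ.
    start: int  # assumed: character offset, inclusive
    end: int    # assumed: character offset, exclusive
    label: str


def extract_spans(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (label, substring) pairs so labeled offsets can be eyeballed or asserted."""
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


# Made-up sample record, not taken from the dataset.
record = {
    "text": "Contact jane@example.com today",
    "spans": [{"start": 8, "end": 24, "label": "EMAIL"}],
}

print(extract_spans(record["text"], record["spans"]))
```

Slicing the text with the labeled offsets is a quick sanity check that a detector's predicted spans and the fixture's expected spans refer to the same substring before comparing them.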
Reference data
model-pricing-table
20 LLMs with input/output cost per 1k tokens and context window.
token-counting-edge-cases
20 short strings with token counts across 3 tokenizers.
mcp-config-examples
15 MCP client configs (Claude Desktop, Cursor, Cline, Windsurf, Zed).
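Per-1k-token prices like those in model-pricing-table translate into a request cost with simple arithmetic. The sketch below assumes separate input and output prices quoted per 1,000 tokens; the function name, parameter names, and the prices in the example are placeholders, not values from the dataset.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate the cost of one request from token counts and per-1k-token prices."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices for illustration only.
cost = request_cost(
    input_tokens=2000,
    output_tokens=500,
    input_price_per_1k=0.5,
    output_price_per_1k=1.5,
)
print(cost)  # 2 * 0.5 + 0.5 * 1.5 = 1.75
```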
Observability + traces
agent-trace-samples
10 agentsnap-format tool-call traces (good + regressed pairs).
agent-budget-violations
15 agent runs with budget caps + actual usage + root cause.
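A record pairing budget caps with actual usage, as agent-budget-violations is described, could be checked along these lines. This is a rough sketch: the dictionary keys (`tokens`, `tool_calls`) and the sample numbers are hypothetical, and the dataset's own field names are not shown here.

```python
from typing import Dict


def find_violations(caps: Dict[str, float], usage: Dict[str, float]) -> Dict[str, float]:
    """Return, for each capped resource that was exceeded, the amount over the cap."""
    return {
        name: usage.get(name, 0) - cap
        for name, cap in caps.items()
        if usage.get(name, 0) > cap
    }


# Made-up run for illustration; keys and values are assumptions.
caps = {"tokens": 10_000, "tool_calls": 20}
usage = {"tokens": 12_500, "tool_calls": 12}

print(find_violations(caps, usage))
```

Resources missing from `usage` are treated as zero, so an unreported metric never counts as a violation in this sketch.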