📚 Agent Stack Datasets

13 small, hand-curated datasets for testing the boring parts of AI agents: token counting, prompt injection, RAG eval, tool validation, hallucination, PII detection, and more. All MIT licensed.

Test fixtures + benchmarks

jailbreak-corpus-mini

15 curated jailbreak fixtures across attack families.

15 rows · MIT · open

prompt-injection-patterns-extended

30 prompt-injection patterns across 10 categories.

30 rows · MIT · open

pii-detection-fixtures

25 strings labeled for PII / secrets with span offsets.

25 rows · MIT · open

tool-arg-validation-cases

20 (tool, schema, args) tuples — valid + invalid.

20 rows · MIT · open

mcp-tool-test-fixtures

22 MCP tool-call args across 8 categories.

22 rows · MIT · open

llm-output-extraction-cases

20 messy LLM outputs with expected JSON.

20 rows · MIT · open

hallucination-risk-cases

20 prompt → response pairs rated for hallucination risk.

20 rows · MIT · open

rag-quality-benchmarks-mini

15 RAG eval queries with ground-truth answers.

15 rows · MIT · open
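
As an illustration of how span-labeled fixtures such as pii-detection-fixtures might be used, here is a minimal sketch that scores a detector against expected span offsets. The row shape (`text`, `spans`, `start`, `end`, `label`) and the sample row are assumptions made for this example, not the dataset's documented schema, and the regex detector is only a stand-in for whatever detector is under test.

```python
import re

# Hypothetical row; the real column names in pii-detection-fixtures may differ.
ROW = {
    "text": "Contact jane@example.com for access.",
    "spans": [{"start": 8, "end": 24, "label": "EMAIL"}],
}

# Deliberately simple email pattern, used only as a stand-in detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect_emails(text):
    """Return (start, end, label) tuples for every email-like match."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def score(row, detector):
    """Compare predicted spans with labeled spans using exact offset matching."""
    expected = {(s["start"], s["end"], s["label"]) for s in row["spans"]}
    predicted = set(detector(row["text"]))
    return {
        "hits": len(expected & predicted),
        "missed": len(expected - predicted),
        "spurious": len(predicted - expected),
    }


result = score(ROW, detect_emails)
print(result)
```

Exact offset matching is the strictest choice; an overlap-based match would be more forgiving of detectors that include or drop surrounding punctuation.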

Reference data

model-pricing-table

20 LLMs with input/output cost per 1k tokens and context window size.

20 rows · MIT · open

token-counting-edge-cases

20 short strings with token counts across 3 tokenizers.

20 rows · MIT · open

mcp-config-examples

15 MCP client configs (Claude Desktop, Cursor, Cline, Windsurf, Zed).

15 rows · MIT · open
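
Since model-pricing-table lists cost per 1k tokens, a per-request cost estimate is a short calculation. The sketch below assumes separate input and output prices per 1k tokens; the parameter names and the prices in the example are placeholders, not values from the table.

```python
def request_cost(input_tokens, output_tokens, input_per_1k, output_per_1k):
    """Estimate the cost of one request from per-1k-token prices."""
    return (input_tokens / 1000) * input_per_1k + (output_tokens / 1000) * output_per_1k


def fits_context(input_tokens, output_tokens, context_window):
    """Check that prompt plus completion stays within the context window."""
    return input_tokens + output_tokens <= context_window


# Placeholder prices for illustration only.
cost = request_cost(input_tokens=2000, output_tokens=500,
                    input_per_1k=0.5, output_per_1k=1.5)
print(cost)  # 1.75
```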

Observability + traces

agent-trace-samples

10 agentsnap-format tool-call traces (good + regressed pairs).

10 rows · MIT · open

agent-budget-violations

15 agent runs with budget caps + actual usage + root cause.

15 rows · MIT · open
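
A record that pairs budget caps with actual usage can be checked mechanically. The following sketch assumes each run exposes caps and usage as flat dictionaries keyed by budget dimension; the key names (`tokens`, `tool_calls`, `usd`) are hypothetical, and the root-cause field is not modeled here.

```python
def exceeded_budgets(caps, usage):
    """Return each budget dimension whose usage exceeds its cap, with the overage."""
    return {
        name: usage.get(name, 0) - cap
        for name, cap in caps.items()
        if usage.get(name, 0) > cap
    }


# Hypothetical run; key names are assumptions for this example.
caps = {"tokens": 10000, "tool_calls": 20, "usd": 0.50}
usage = {"tokens": 14200, "tool_calls": 12, "usd": 0.50}

violations = exceeded_budgets(caps, usage)
print(violations)  # {'tokens': 4200}
```

Usage equal to the cap is treated as within budget in this sketch; a stricter harness could flag it instead.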