How to test AI agents before production
The demo works. Production is another story.
You shipped an agent that aced every example you threw at it. Then someone jailbroke it into ignoring its instructions, or it called the wrong tool, or it confidently invented an answer, and you found out from another team. In production.
There was no test suite, because nobody had written the security and reliability tool for AI agents yet.
Introducing Flint AI. It's an AI agent reliability tool built for developers who'd rather find out their agent is broken before someone else does. Check out the tool on GitHub.
What Flint AI is

Flint AI is a free CLI that tells you whether your AI agent actually works before you ship it. FlintAI scan analyzes your agent's Python source code for security vulnerabilities, misconfigurations, and quality issues. FlintAI eval sends adversarial and functional prompts to your running agent and scores how it responds: prompt injection resistance, factual accuracy, instruction adherence, and more. Scan analyzes the code statically; eval evaluates the behavior at runtime. Both feed into a single reliability score per agent.
pip install flintai-cli
flintai init
flintai scan /path/to/agent/code
flintai eval run --model my-agent
How it works
Testing AI agents requires two types of analysis: code review to catch vulnerabilities before deployment, and runtime evaluation to measure how agents respond under pressure. Flint AI handles both in two commands. Flint AI scan analyzes your agent's Python source code; Flint AI eval tests your running agent's behavior against adversarial and functional prompts. Analysis catches what's in the code; runtime evaluation catches what happens when users interact with it. Together, scan and eval answer the question every team asks before shipping: is this agent production-ready?
Scan: multi-layer code analysis
Flint AI scan analyzes your agent's Python source in three passes.
First, analysis tools run locally, identifying security patterns, custom rule matching, exposed credentials, and dependency vulnerabilities. This catches the surface layer: hardcoded keys, known CVEs, and insecure configurations.
Next, an AI reasoning layer reads the codebase and follows import chains with code evidence. It profiles the agent (role, tools, memory, delegation patterns) and identifies issues that regex can't catch, such as fragile prompt construction, overly permissive tool access, and missing validation on agentic workflows. An LLM reads the code with full call-graph context, identifying issues that require understanding data flow across functions and files.
Then, an LLM triage pass dismisses findings that are expected behavior for the agent's stated purpose, downgrades disproportionate severities, and deduplicates CVEs. So you end up reading real findings, not every theoretical edge case a scanner flags.
Every finding gets a severity rating (Critical through Low), a CVSS v4 base score with vector, and a mapping to the OWASP Top 10 for Agentic Applications (ASI01 through ASI10). Anything outside the Top 10 lands in beyond_asi. Raw pre-triage findings stay available if you need an audit trail.
Eval: adversarial runtime testing
Flint AI eval puts your running agent under pressure. It works with any framework: Claude Agents SDK, LangGraph, CrewAI, AutoGen, Anthropic SDK, OpenAI SDK, MCP servers, and more. Point it at an OpenAI-compatible endpoint, a LangServe deployment, or a raw LLM API.
The methodology is LLM-as-judge with binary and numeric scoring. Every test scores 0.0 to 1.0, where 1.0 is best. For adversarial probes, 1.0 means the agent resisted the attack and 0.0 means it was exploited. For quality metrics, 1.0 means high quality and 0.0 means poor. A refusal or safety-block is an automatic 1.0. Otherwise, a detector scores the response: LLM-as-judge for numeric grading, PII/secret detectors for binary pass/fail, garak modules that invert the worst hit, and adversarial probes that escalate turn by turn until the agent breaks below 0.5.
The agent's overall score is its achieved score divided by the max across all tests. A score of 0.85 means it achieved 85% of the maximum possible score. That's the number you show stakeholders when they ask if the agent is production-ready.
What it catches
Flint AI detects security vulnerabilities, quality regressions, and reliability failures across your agent's code and runtime behavior. The OWASP Top 10 for Agentic Applications (OWASP ASI) is a security framework that names the most critical risks for AI agents, from prompt injection (ASI01) to excessive agency (ASI10). All Flint AI findings map to it, and eval scores map to the OWASP LLM Top 10 plus a large garak library. Here are a few examples of what that looks like in practice.
- Prompt injection and goal hijack (ASI01). Someone overrides your agent's instructions through user input and jailbreaks it into executing commands outside its intended scope.
- Flint AI sends prompt injection attacks (encoded, latent, multi-turn) and scores whether the agent holds.
- Tool misuse (ASI02). The agent calls tools it shouldn't, or passes malicious parameters to tools it should.
scanflags overly permissive tool bindings in the code.evaltests whether the agent respects tool boundaries when users push back at runtime.
- Hallucination and output faithfulness. The agent fabricates citations, invents data, or drifts from its instructions over multi-turn conversations.
evalscores factual accuracy, instruction adherence, and output consistency. Those metrics decide whether your agent is actually useful.
- Supply chain and credential exposure (ASI03, ASI04). Dependencies with known CVEs, hardcoded API keys, excessive privileges.
scancatches these with pip-audit, detect-secrets, and static analysis before the code ships.
The full evaluation library includes 35+ built-in evaluations, 20+ garak modules, 100+ probes, and 200+ adversarial prompts covering all 10 OWASP ASI categories plus quality metrics like conciseness, tone, and toxicity. You can add your own tests with custom evaluations, detectors, and message collections in CSV, JSON, or inline config. Read the docs to learn more.
How Flint AI is different
Most tools in this space make you choose. Observability platforms log everything and test nothing; you get dashboards full of traces and no answer to "will this agent break in production." Evaluation tools check runtime behavior and skip the source code. Static analysis catches code-level issues but never sees what the agent does when a user talks to it. And the tools that try to do everything wrap your framework in layers of abstraction that make your codebase harder to work with.
Flint AI does security scanning and quality evaluation in one pass, because they're two sides of the same question: is this agent reliable enough to ship?
- No complexity tax. Flint AI is a CLI with two commands. There's no SDK to integrate, no abstraction layer wrapping your framework, no new vocabulary to learn. pip install, point it at your code, run it.
- Fully local; no data leaves unless you say so. Your source code stays on your machine. The only data that goes out is what you send to the LLM provider whose key you supply, and only for the AI layers (scan's reasoning/triage and eval's probe generation and scoring). Analysis and local detectors run entirely offline. If you want to keep it 100% local, point it at a local model via litellm or ollama.
- A reliability score you can actually defend. Flint AI gives you a concrete number: achieved score divided by max, backed by evidence mapped to OWASP ASI and CVSS v4. When someone asks whether the agent is production-ready, you have findings and scores to point to.
- Standards-based. Findings map to OWASP ASI and the OWASP LLM Top 10, and severities use CVSS v4. These are frameworks your security team already understands; you don't need to translate proprietary severity labels into something they recognize.
- It's free. No usage limits, no trial clock. Just bring your own LLM key.
pip install flintai-cli, add a key, run your first scan.
Flint AI's Trust model
Setup is local. Here's exactly what leaves your machine and what doesn't.
- Scan's AI reasoning layer sends source-code snippets to the LLM provider you configure so it can analyze the codebase contextually.
- Eval's probe generation and LLM-as-judge scoring send prompts and responses to your configured provider to score them. Eval prompts also go to the target agent you're testing.
- Everything else stays local: file discovery, static analysis (Bandit, OpenGrep, detect-secrets, pip-audit), PII/secret/toxicity detectors, and garak modules all run on your machine with zero network calls.
If you want zero external calls, point GENERATOR_MODEL at a local model via litellm or ollama. Message-collection evals with local-only detectors need no provider at all. The code is open source under an MIT license.
Make your agents earn it
Run flintai scan on your agent's code and flintai eval against its runtime behavior. Get a reliability score backed by findings you can read and act on, and find out what breaks before another team does.
pip install flintai-cli
flintai init
flintai scan /path/to/agent/code
flintai eval run --model my-agent
What's next
The CLI proves one agent at a time. The upcoming platform makes proof continuous: discovery across your whole stack, reliability scores tracked over time, and CI/CD integration so every commit gets tested. Same standard, across the full agent lifecycle.
Get on the waitlist for access.
Frequently asked questions
Is Flint AI free? Yes. No usage limits, no trial clock, no account required. Bring your own LLM API key.
What frameworks can Flint AI test? Flint AI works with any major agentic framework, including the Claude Agents SDK, LangGraph, CrewAI, AutoGen, the Anthropic SDK, the OpenAI SDK, and MCP servers. flintai eval points at any OpenAI-compatible endpoint or LangServe deployment, and flintai scan analyzes Python source code regardless of framework.
Does Flint AI send my code to a server? Flint AI has no backend. Your code stays on your machine. The only data that leaves is what you send to the LLM provider you configure for scan's reasoning layer and eval's scoring. Static analysis, secret and PII detection, and garak modules run fully local.
What's the difference between Flint AI scan and Flint AI eval? flintai scan analyzes your agent's Python source code for security vulnerabilities, misconfigurations, and quality issues. flintai eval sends adversarial and functional prompts to your running agent and scores the responses. Run them separately or together; both feed a single reliability score.