INSIGHTS | March 20, 2026

The Evolution of AI-Powered Security Consultants

In my fourteen years of security assessments with IOActive, our shared mission has always been defined by a single commitment: stay ahead. Stay ahead of the threats clients face today, and stay ahead of the techniques that will define how we find those threats tomorrow.

That responsibility has driven every meaningful evolution in how our consultants work.

When fuzzing was still a research curiosity, the consultants who built their own frameworks and integrated them into live engagements found entire vulnerability classes that manual reviews missed. When static analysis tools were primitive and noisy, the teams that wrote custom rule sets and built triage pipelines around them separated signal from noise faster than anyone relying on defaults. In each case, the competitive edge went to the consultants who built the infrastructure to use new capabilities correctly before the rest of the industry caught up.

I’ve watched that pattern repeat with every major shift in the offensive landscape. However, none of those shifts compares to what’s happening now with the capabilities of large language models (LLMs).

What follows are the decisions shaping how security assessments are built today: from the architectural constraints that dictate where AI can and can’t be used, to the strategies that determine whether it actually finds what matters.

The Model as Both Tool and Risk

The question is already coming up in kickoff calls. On recent engagements, clients have asked directly whether our AI assessment tooling sends their information to external infrastructure. One CISO put it clearly: if it leaves the engagement perimeter, it’s a breach of our NDA.

That posture is what’s driving consultants to build local LLM infrastructure. The reason is simple: the AI tools gaining traction today require sending client artifacts, such as source code, firmware, configuration files, and protocol captures, to infrastructure you don’t control.

Hold that thought for a second – there’s an asymmetry there worth mentioning: the codebase and architecture we assess have almost certainly already traveled through cloud AI infrastructure during development. Copilot suggestions, ChatGPT-assisted debugging, Cursor completions, and online forum questions that include function signatures are all common today. Developers share code with online AI constantly, and at scale. That’s the reality of how software is written and architecture designed today.

That asymmetry cuts both ways, but while developers share code to build software, security assessments produce something far more sensitive: a complete map of how that software breaks. Vulnerability findings, exploitation chains, and architectural weaknesses – these are not debugging prompts. Losing control of that output is an entirely different category of risk.

The Model Is the Least Interesting Part

The open-weight code model landscape changes on a quarterly basis (or faster). But in an effective security testing infrastructure, the model should be the most swappable component.

What matters when selecting a model for a new engagement cycle is straightforward:

  • A context window sufficient for the artifact sizes we work with (code, Nmap scan results, etc.)
  • Demonstrated performance on the analysis tasks relevant to the engagement surface
  • Evaluation against the specific requirements the targets demand
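The criteria above reduce to a gate that any candidate model must pass before it enters an engagement cycle. Here is a minimal sketch of that gate; the class, field names, and the 25% context-headroom factor are illustrative assumptions, not a description of any specific tooling:

```python
from dataclasses import dataclass

@dataclass
class CandidateModel:
    name: str
    context_tokens: int      # advertised context window
    vuln_eval_score: float   # score on the engagement-relevant analysis eval, 0..1 (hypothetical metric)

def fits_engagement(model: CandidateModel,
                    largest_artifact_tokens: int,
                    min_eval_score: float) -> bool:
    """Gate a candidate on the two hard requirements: enough context
    for the largest artifact, and demonstrated performance on the
    relevant analysis tasks."""
    # Reserve headroom for prompt scaffolding and retrieved context.
    usable_context = model.context_tokens * 0.75
    return (usable_context >= largest_artifact_tokens
            and model.vuln_eval_score >= min_eval_score)
```

Because the check is a pure function of published specs plus internal eval results, swapping the model each quarter is a data change, not an architecture change.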

Public benchmarks already exist for evaluating LLMs on vulnerability detection. Most of them use a curated set of functions with known vulnerabilities, where the model must identify the bug, locate the exact source, and explain the exploitation path, resulting in a binary pass/fail. That binary scoring hides meaningful differences: a model that finds the bug but misidentifies the root cause usually scores the same as one that misses it entirely.
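A partial-credit rubric makes that difference visible. The sketch below grades the three sub-tasks separately; the weights are illustrative assumptions, not taken from any published benchmark:

```python
def grade_finding(found_bug: bool, correct_location: bool,
                  correct_root_cause: bool, correct_exploit_path: bool) -> float:
    """Partial-credit rubric: a model that finds the bug but
    misattributes the root cause should not score the same as a
    total miss. Weights (4/2/2/2 out of 10) are illustrative."""
    if not found_bug:
        return 0.0
    points = 4  # detection is the prerequisite and carries the most weight
    points += 2 if correct_location else 0
    points += 2 if correct_root_cause else 0
    points += 2 if correct_exploit_path else 0
    return points / 10
```

Under this rubric the detect-but-misattribute case scores 0.6 rather than collapsing to the same value as a miss, which is what matters when comparing candidate models for an engagement surface.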

In addition, the largest model is rarely the best model for a specific security analysis task. A general-purpose frontier model with hundreds of billions of parameters will score well on generic coding benchmarks, but a focused model a fraction of that size, fine-tuned on vulnerability data from domains we actually assess, will outperform it on the task that matters. Domain specialization beats raw scale when the domain is narrow enough and the training data is good enough.

Models rotate on someone else’s schedule; they are shipped, improved, and deprecated by labs we don’t control. That’s fine, as long as the model is the part that’s designed to be swapped.

The Input Changes. The Scaffolding Doesn’t.

If the model is swappable, the scaffolding is the force multiplier across engagement types. What changes between an application security assessment and a firmware audit is the ingestion format, not the pipeline architecture. Source code, decompiled binaries, ICS configurations, protocol captures, Active Directory enumeration output – each is a different artifact, but they all flow through the same stages: ingest, prioritize, analyze, triage. The scaffolding adapts at the ingestion layer. Everything downstream stays the same.
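Structurally, that means one pipeline with per-surface ingestion adapters. The sketch below shows the shape; the adapter logic, the size-based prioritization heuristic, and all function names are illustrative placeholders, not actual engagement tooling:

```python
# Per-surface adapters: the only part of the pipeline that changes.
def ingest_source_code(artifact: str) -> dict:
    return {"kind": "source", "units": artifact.split("\n\n")}

def ingest_protocol_capture(artifact: str) -> dict:
    return {"kind": "pcap", "units": [artifact]}

INGESTORS = {"source": ingest_source_code, "pcap": ingest_protocol_capture}

def run_pipeline(surface: str, artifact: str, analyze, triage) -> list:
    """Shared stages: ingest (surface-specific), prioritize, analyze,
    triage. `analyze` is the model-backed stage; `triage` is
    deterministic filtering of its output."""
    normalized = INGESTORS[surface](artifact)
    # Placeholder prioritization: largest units first.
    ranked = sorted(normalized["units"], key=len, reverse=True)
    findings = [analyze(unit) for unit in ranked]
    return [f for f in findings if triage(f)]
```

Adding a new engagement surface means writing one adapter and registering it; prioritize, analyze, and triage are untouched.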

The critical architectural decision is where inference begins and where deterministic processing ends. Not every stage in the pipeline benefits from a language model. Parsing an AST to extract function signatures is deterministic. Matching a known CVE pattern against a dependency manifest is a lookup. Normalizing binary sections for analysis is a format transformation. These operations are faster, cheaper, and more reliable as traditional code. Sending them through an LLM adds latency and unpredictability for zero gain.
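To make the deterministic side concrete, here is what one of those stages looks like in practice: extracting function signatures by walking an AST with Python’s standard `ast` module. No model is involved, and the output is exact and repeatable:

```python
import ast

def extract_function_signatures(source: str) -> list[str]:
    """Deterministic stage: parse the source and pull every function
    signature from the AST. This is plumbing -- fast, cheap, and
    reliable as traditional code."""
    tree = ast.parse(source)
    signatures = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            signatures.append(f"{node.name}({args})")
    return signatures
```

Routing this through an LLM would add latency and the possibility of a hallucinated signature, for zero gain over a parser that is correct by construction.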

The model earns its place at the points where human-like reasoning is required: tracing a data path across service boundaries to determine reachability, assessing whether a configuration combination that appears safe creates an exploitable condition in context, or deciding if a code pattern that resembles a known vulnerability class is actually exploitable given the surrounding control flow. Those are judgment calls. Everything else is plumbing, and plumbing should be deterministic.

The scaffolding bridges both sides. It runs the deterministic stages as code, assembles the structured context the model needs to reason effectively, provides tool access so the model can query additional information during analysis, and routes the model’s output back into the deterministic pipeline for triage and reporting. The result is a system where the LLM operates within a well-defined scope, reasoning over prepared context, rather than being asked to do everything, including the work that doesn’t require intelligence.
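That contract can be sketched as a single function: deterministic code assembles the context, the model (an arbitrary callable, i.e. the swappable component) reasons over it, and the output flows straight back into deterministic triage. Every name here is a hypothetical placeholder:

```python
def analyze_with_scaffolding(unit: dict, context_store: dict, model, triage):
    """Scaffolding contract: prepared context in, model reasoning in a
    well-defined scope, output routed back into deterministic triage.
    `model` is any prompt -> text callable."""
    # Deterministic: retrieve the methodology notes for this surface.
    context = context_store.get(unit["surface"], "")
    prompt = f"{context}\n\nAnalyze the following:\n{unit['body']}"
    raw_finding = model(prompt)   # inference happens only here
    # Deterministic again: triage decides what reaches the report queue.
    return raw_finding if triage(raw_finding) else None
```

The model never sees an unprepared artifact and never writes directly to the report; both boundaries are owned by code.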

A recent firmware engagement made this concrete. The deterministic layer extracted and normalized the binary, identified function boundaries, and reconstructed the call graph. Standard reversing infrastructure. The model received the decompiled output with full cross-reference context and flagged a conditional path in the bootloader’s recovery mode that bypassed signature verification before loading an update image. The logic was split across multiple functions; no single function looked wrong in isolation. An analyst would have reached it eventually, but the overnight batch surfaced it before anyone had started manually reversing that code path. That’s the division of labor working as designed: deterministic tooling prepares the context, the model reasons over it, and the analyst’s first morning starts with the right target.

Teaching the Model What We Know

A base model is a generalist. It knows code. It doesn’t know how our team hunts, what we’ve found before, or what the current target looks like from the inside. Closing that gap requires two complementary strategies: baking knowledge into the model’s weights, and feeding it context at inference time.

Fine-Tuning: What Vulnerable Code Looks Like

Base instruction-tuned models are not optimized for security analysis. The gap between a capable general-purpose coder model and one that has internalized what vulnerable code actually looks like across thousands of real examples is measurable in engagement quality.

There are many approaches to closing that gap: full fine-tuning, reinforcement learning from human feedback on security-specific tasks, synthetic data generation, distillation from larger models, and others. LoRA-based fine-tuning on curated real-world CVE data offers the fastest path to measurable improvement with the lowest infrastructure overhead. The critical distinction is what we train on: not just vulnerable code paired with its fix, but the analytical context around how the vulnerability is identified and exploited – root cause, affected data flow, and exploitation conditions. That context-rich approach produces measurably better results than training on code pairs alone.
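The difference shows up directly in the shape of a training record. A minimal sketch of one context-rich example, serialized for a JSONL fine-tuning set; the field names and instruction wording are illustrative assumptions, and any instruction-tuning format would do:

```python
import json

def make_training_record(vuln_code: str, fixed_code: str,
                         root_cause: str, data_flow: str,
                         exploit_conditions: str) -> str:
    """One context-rich fine-tuning example: the vulnerable code plus
    the analytical context around it, not just a vulnerable/fixed pair."""
    return json.dumps({
        "instruction": ("Identify the vulnerability, its root cause, "
                        "and the conditions under which it is exploitable."),
        "input": vuln_code,
        "output": {
            "root_cause": root_cause,
            "affected_data_flow": data_flow,
            "exploitation_conditions": exploit_conditions,
            "fix": fixed_code,
        },
    })
```

A code-pairs-only dataset would keep just `input` and `fix`; the three extra fields are what teach the model to reason about a vulnerability rather than pattern-match its surface form.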

The Engagement Wiki: What We Know That Training Data Doesn’t

Fine-tuning teaches the model what vulnerable code looks like. What it doesn’t teach is how our team hunts for it, what we’ve found manually, or what the current target looks like from the inside. IOActive has spent years building and refining internal methodologies across every engagement surface we assess. That accumulated knowledge is exactly what a retrieval layer can operationalize.

A structured internal wiki that feeds directly into the LLM’s context window via retrieval-augmented generation can operate at two layers:

Methodology and technique library:

This includes attack trees, enumeration checklists, and exploitation chains organized by engagement surface. When the model analyzes a Modbus configuration, it retrieves our team’s documented approach for ICS protocol assessment, not a generic prompt. When it reviews Android IPC handlers, it pulls the methodology our mobile team has refined over a decade of engagements. The library grows with every assessment cycle.

Per-engagement context:

At the start of every assessment, we populate a target-specific context layer: technology stack, architecture constraints, scope boundaries, threat model assumptions. The model doesn’t analyze in a vacuum; it reasons with the same situational awareness a human analyst builds during the first days of an engagement.
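The two layers compose at retrieval time: rank methodology pages against the artifact under analysis, then prepend the per-engagement context so the model always reasons inside scope. The sketch below uses naive keyword overlap purely for illustration; a production retrieval layer would use embeddings, but the layering is the point:

```python
def retrieve_context(query: str, wiki: dict[str, str],
                     engagement: dict[str, str], k: int = 2) -> str:
    """Two-layer RAG assembly: per-engagement context first, then the
    top-k methodology pages ranked by (naive) keyword overlap with the
    artifact being analyzed."""
    terms = set(query.lower().split())
    ranked = sorted(wiki.items(),
                    key=lambda kv: len(terms & set(kv[1].lower().split())),
                    reverse=True)
    methodology = "\n".join(body for _, body in ranked[:k])
    scope = "\n".join(f"{key}: {val}" for key, val in engagement.items())
    return f"[ENGAGEMENT]\n{scope}\n\n[METHODOLOGY]\n{methodology}"
```

With this assembly, an analyst reviewing a Modbus configuration and an analyst reviewing Android IPC handlers get the same pipeline but different retrieved methodology, which is exactly how the library compounds across assessment cycles.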

Reporting Is Also a Technical Challenge

Finding a vulnerability is half the job. The other half is communicating it in a way that produces a response proportional to its severity – and the audience for that communication is rarely homogeneous.

A typical engagement report lands in front of at least three different readers with fundamentally different needs. The development team wants precise reproduction steps and code-level context. The CISO wants risk framing, business impact, and remediation priority. The board or executive sponsor wants to understand exposure in terms they can act on without a security background. Until recently, producing documentation that served all three required either multiple document versions – expensive in consultant time – or a dedicated graphics and communications team that most consultancies can’t staff on every engagement.

Before LLM infrastructure, the practical ceiling on report comprehensiveness was analyst time. With a local LLM pipeline producing the first draft of each finding’s documentation, that ceiling lifts. Attack path diagrams that would have required a dedicated graphics pass are generated programmatically. The report becomes a more complete representation of what the assessment actually found, not a triage-filtered subset of it.

The same infrastructure can generate audience-specific derivatives of each finding. An attack path showing how an externally controllable input reaches a deserialization sink three service hops downstream is significantly more persuasive to an engineering leadership team than a paragraph describing the same chain. The diagrams are not just decorative. They’re structural.
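Generating those derivatives is mechanical once the audiences are enumerated: one finding in, one framed prompt per reader out. The framing fragments below are hypothetical examples, not production prompt text:

```python
# Hypothetical per-audience framing fragments: one finding, three renderings.
AUDIENCE_FRAMES = {
    "developers": "Give exact reproduction steps and the code-level fix.",
    "ciso": "Frame the risk, business impact, and remediation priority.",
    "executives": "Explain the exposure without assuming a security background.",
}

def derivative_prompts(finding: str) -> dict[str, str]:
    """Produce one model prompt per audience from a single canonical
    finding, so the report pipeline writes each derivative once."""
    return {audience: f"{frame}\n\nFinding:\n{finding}"
            for audience, frame in AUDIENCE_FRAMES.items()}
```

Each derivative is then drafted by the same local pipeline and reviewed by the consultant, so the three versions stay consistent with one canonical finding rather than drifting apart across documents.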

This matters because impact is a function of communication, not just discovery. A finding that doesn’t produce a remediation response is a finding that didn’t land. Consultants who can close that gap, and those who can make a complex technical issue comprehensible to the people who control the budget to fix it, deliver more value in the same assessment. AI doesn’t just make reports faster to produce. It makes them more thorough, more detailed, and more actionable than what was previously feasible within the engagement window.

The Commitment

Security assessments are almost always time-boxed. The consultant makes depth tradeoffs every day: which functions to reverse manually, which attack paths to pursue, which custom tooling to build or skip. With LLM-augmented analysis, those tradeoffs change. Binary analysis that previously consumed a full day of manual effort compresses into hours. Custom tooling that would have been out of scope for a two-week engagement becomes feasible to build and deploy within the assessment window. The constraint hasn’t disappeared, but the frontier of what’s reachable within it has moved significantly, enabling you to go deeper on the same clock.

The AI landscape moves faster than any fine-tuning cycle, but security consulting has always rewarded teams that built the infrastructure to use new capabilities correctly rather than waiting for a packaged solution. The teams that figured out fuzzing before it was mainstream, or built custom static analysis pipelines when off-the-shelf options didn’t exist: those teams found more vulnerabilities, found them faster, and became harder to replace.

There’s a dimension to this that goes beyond tooling. Every security consultant brings a distinct set of methodology and analytical instinct to an engagement. A local AI infrastructure should reflect that. Rather than standardizing every analyst’s workflow through a single cloud tool with fixed behavior, a sovereign setup lets each consultant’s accumulated knowledge and methodological preferences shape how the model operates, amplifying individual expertise rather than flattening it.

That’s the commitment: not just to use the latest technology, but to evaluate it seriously, understand its limits, and operationalize it when it earns its place in the workflow. Local LLMs have earned theirs.