
Building Sentinel: Teaching an AI Agent to Validate Vulnerabilities So Our Engineers Don't Have To

  • Writer: Trevor Baines
  • Feb 19
  • 10 min read

Part Two of a Two-Part Series on Our AI Security Operations Platform


In Part One of this series, we introduced Mastermind, the AI knowledge assistant that gives our entire company perfect recall across every engagement, every document, and every conversation. Mastermind solves the knowledge problem. But there's another problem that eats just as much of our team's time, and it lives on the other side of the workflow.


Every penetration test produces findings. A vulnerability scanner flags dozens, sometimes hundreds, of potential issues across a client's application or infrastructure. Some of those findings are critical. Some are informational noise. And some are false positives that look alarming in a report but don't actually represent exploitable risk.


The problem is that telling the difference requires manual validation. An engineer has to take each finding, understand the context, attempt to reproduce or exploit it, observe the application's behavior, and make a judgment call. For a typical web application assessment with 50 to 100 findings, this validation process can consume hours of skilled analyst time, much of it spent on repetitive tasks that follow predictable patterns.


We asked ourselves a simple question: what if an AI agent could handle the predictable part?


The Validation Gap in Security Testing

Scanner tools are excellent at casting a wide net. They test for known vulnerability patterns, check for misconfigurations, and flag anything that matches their detection signatures. But scanners are inherently conservative. They would rather flag something that turns out to be benign than miss something real.


This is the right design philosophy for a detection tool. You want your scanner to be paranoid. The issue is what happens next.


Every flagged finding enters a queue that a human analyst has to work through. For each one, the process looks roughly the same: read the scanner's output, understand what it claims to have found, navigate to the relevant part of the application, attempt to trigger the vulnerability, observe the response, and determine whether the issue is real, partially real, or a false positive.


| Validation Step | Typical Time | What the Analyst Is Doing |
| --- | --- | --- |
| Read scanner output | 1–2 min | Parse the finding type, affected URL, parameter, and evidence |
| Navigate to target | 1–3 min | Open the application, authenticate if needed, find the relevant page or endpoint |
| Craft and inject payload | 2–5 min | Build an appropriate test payload, inject it into the right parameter, submit the request |
| Observe and analyze response | 1–3 min | Check the DOM, HTTP response, console output, and network traffic for confirmation |
| Document the result | 1–2 min | Record whether the finding is confirmed, note any caveats or context |


That's 6 to 15 minutes per finding, performed by a skilled security engineer. Multiply by 50 or 100 findings per engagement, and you're looking at an entire day (or more) of an analyst's time spent on validation alone. Time that could be spent on the creative, judgment-intensive work that actually requires human expertise: analyzing complex attack chains, testing business logic, advising clients on remediation strategy.


We decided to automate the predictable parts of this workflow and let our engineers focus on the parts that matter.


Enter Sentinel

Sentinel is an AI agent that takes vulnerability scanner output, reads each finding, autonomously navigates to the target application in a real browser, attempts to validate the vulnerability, observes the result, and delivers a verdict: Confirmed, Not Confirmed, or Needs Manual Review.


It's not a scanner. It doesn't discover new vulnerabilities. It's a validator. It takes findings that have already been flagged and does the manual verification work that our engineers would otherwise do by hand. Think of it as a tireless junior analyst who can work through a stack of findings at 2 AM, so the senior engineers can wake up to a pre-sorted queue with the confirmed issues already at the top.


The key distinction:

Sentinel doesn't replace the scanner and it doesn't replace the engineer. It replaces the tedious, repetitive validation step that sits between them. The scanner casts the net. Sentinel sorts the catch. The engineer focuses on the fish that matter.


How It Works: The Validation Pipeline

Sentinel's architecture is built around a simple loop that mirrors what a human analyst does, but executes it autonomously and at scale.


| Stage | What Happens |
| --- | --- |
| 1. Ingest | Scanner output (XML or JSON) is parsed. Each finding is extracted with its URL, affected parameter, vulnerability type, severity, and any evidence the scanner captured. |
| 2. Reason | A code-specialized language model reads the finding, understands the vulnerability class, and generates a browser automation script tailored to validate that specific issue. |
| 3. Execute | The generated script runs in a headless browser instance. Sentinel navigates to the target URL, interacts with the application (injecting payloads, submitting forms, following redirects), and captures the response. |
| 4. Analyze | The language model examines the browser's response: the DOM state, HTTP headers, console output, and any observable side effects. It determines whether the vulnerability was successfully triggered. |
| 5. Verdict | Each finding receives a classification: Confirmed (exploit succeeded), Not Confirmed (exploit failed, likely false positive), or Needs Manual Review (ambiguous result requiring human judgment). |
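As a concrete sketch of the ingest stage, here is how scanner JSON might be normalized into per-finding records. The schema (a top-level `findings` array with `url`, `param`, `type`, `severity`, and `evidence` keys) is illustrative, not the actual output format of any particular scanner:

```python
import json
from dataclasses import dataclass

@dataclass
class Finding:
    """One scanner finding, normalized for the validation queue."""
    url: str
    parameter: str
    vuln_type: str
    severity: str
    evidence: str

def ingest(raw: str) -> list[Finding]:
    """Parse scanner JSON output (illustrative schema) into Finding records."""
    data = json.loads(raw)
    return [
        Finding(
            url=f["url"],
            parameter=f.get("param", ""),
            vuln_type=f["type"],
            severity=f.get("severity", "info"),
            evidence=f.get("evidence", ""),
        )
        for f in data["findings"]
    ]

sample = """{"findings": [{"url": "https://app.example.com/search",
  "param": "q", "type": "reflected_xss", "severity": "high",
  "evidence": "<script>alert(1)</script>"}]}"""
parsed = ingest(sample)  # one Finding, ready for the Reason stage
```

Normalizing up front means the later stages never have to care which scanner produced the finding.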


The entire loop takes roughly 5 to 15 seconds per finding, depending on the complexity of the vulnerability and the responsiveness of the target application. For a typical scan with 50 findings, Sentinel completes the full validation pass in under 12 minutes. Compare that to the half day or more it would take an analyst to do the same work manually.



| Metric | Figure |
| --- | --- |
| Per-finding validation time | 5–15 sec |
| Full scan validation (50 findings) | < 12 min |
| Analyst time reclaimed per engagement | 4–8 hrs |


The Brain: A Code-Specialized Language Model

Sentinel's intelligence comes from a language model specifically chosen for code generation and security reasoning. This isn't a general-purpose chatbot being asked to do security work. It's a model that was trained extensively on code, understands web protocols and browser APIs, and can reason about how web applications behave when they receive unexpected input.


The model runs locally on dedicated GPU hardware, just like the intelligence layer in Mastermind. No scanner output, no target application data, and no validation results ever leave our network. This is especially important for Sentinel because the data it processes is inherently sensitive: it's working with live vulnerability details, target URLs, and exploit payloads.


When Sentinel receives a finding, the model doesn't just blindly replay what the scanner did. It reasons about the vulnerability class and generates a validation approach appropriate for the specific context. For a reflected cross-site scripting finding, it crafts a payload, injects it through the browser, and checks whether the script actually executes in the DOM. For an open redirect, it follows the redirect chain and verifies the destination. For a server-side injection, it constructs a proof-of-concept request and analyzes the response for indicators of successful exploitation.
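A minimal sketch of that class-specific reasoning, reduced to a static dispatch table. In the real system a language model generates a fresh script per finding; the payloads, marker token, and check names below are simplified illustrations:

```python
# Illustrative dispatch: each vulnerability class gets its own payload
# and its own definition of "success". All names here are hypothetical.

MARKER = "sentinel_7f3a"  # unique token so evidence of execution is unambiguous

STRATEGIES = {
    "reflected_xss": {
        "payload": f"<script>window.__{MARKER}=1</script>",
        # Confirmed only if the script actually runs, not merely reflects.
        "check": "dom_flag",
    },
    "open_redirect": {
        "payload": "https://attacker.example/" + MARKER,
        # Confirmed if the redirect chain lands on the injected destination.
        "check": "final_url",
    },
    "sql_injection": {
        "payload": "' OR '1'='1",
        # Confirmed if the response shows injection indicators; weak or
        # ambiguous signals are escalated instead of guessed at.
        "check": "response_diff",
    },
}

def plan_validation(vuln_type: str) -> dict:
    """Pick a validation strategy; unknown classes go straight to a human."""
    return STRATEGIES.get(vuln_type, {"check": "needs_manual_review"})
```

The important property is the fallback: anything the system has no strategy for is routed to manual review rather than silently marked Not Confirmed.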


This reasoning capability is what separates Sentinel from a simple replay tool. It adapts its approach based on what it observes, and it can handle variations in application behavior that would break a rigid automation script.


The Browser: Real Interaction, Not Simulation

One of the most important design decisions in Sentinel is that it validates vulnerabilities using a real browser, not just HTTP requests. Many web application vulnerabilities, particularly client-side issues like cross-site scripting, only manifest when the application's JavaScript executes in a browser context. Sending raw HTTP requests and inspecting the response body misses the actual exploitation path.


Sentinel drives a headless browser instance that renders pages, executes JavaScript, processes redirects, handles cookies and session state, and interacts with the DOM exactly like a human would. When it injects an XSS payload, it doesn't just check if the payload appears in the response HTML. It checks whether the script actually executes, whether it can access the document's cookies, whether the application's content security policy blocks it.
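The executed-versus-merely-reflected distinction can be sketched as a small classification function. The input flags would come from the live browser session (for example, a DOM flag the payload sets when it runs, and a content-security-policy violation event from the console); the names and exact policy are illustrative:

```python
def xss_verdict(payload_reflected: bool, script_executed: bool,
                csp_blocked: bool) -> str:
    """Classify an XSS attempt from observed browser state.

    All three signals are assumptions about what the browser layer
    reports: script_executed from a DOM flag the payload sets,
    csp_blocked from a CSP violation event, payload_reflected from
    the rendered HTML.
    """
    if script_executed:
        return "Confirmed"
    if payload_reflected and csp_blocked:
        # Payload lands in the page but CSP stops it: exploitable only
        # if the policy can be bypassed, so a human should look.
        return "Needs Manual Review"
    if payload_reflected:
        # Reflected but never ran (output-encoded, wrong context, etc.).
        return "Needs Manual Review"
    return "Not Confirmed"
```

Note that reflection alone never yields Confirmed: that is exactly the false-positive pattern a request-only tool would fall for.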

This browser-based approach means Sentinel can validate vulnerability classes that are notoriously difficult to confirm with network-level tools alone:


| ✅ Strong Validation Coverage | ⚠️ Requires Human Judgment |
| --- | --- |
| Reflected and stored cross-site scripting (XSS) | Complex multi-step authentication vulnerabilities |
| Open redirects and URL manipulation | Race conditions and timing-dependent flaws |
| Form-based injection vulnerabilities | Deep business logic vulnerabilities |
| Client-side logic bugs and DOM manipulation | Vulnerabilities requiring extensive application context |
| Cross-site request forgery (CSRF) | Chained exploits spanning multiple systems |
| Authentication flow weaknesses | Subtle information disclosure patterns |
| Missing security headers and misconfigurations | Social engineering or phishing attack paths |


This coverage breakdown is intentional. We designed Sentinel to handle the high-volume, pattern-based findings that consume the most analyst time, while explicitly flagging complex or ambiguous findings for human review. The goal isn't to eliminate human involvement. It's to focus human expertise where it adds the most value.
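In code, that triage policy is little more than set membership. The class identifiers below are hypothetical stand-ins for the strong-coverage column above:

```python
# Hypothetical identifiers for the vulnerability classes Sentinel
# validates autonomously; everything else is escalated by default.
AUTO_VALIDATE = {
    "reflected_xss", "stored_xss", "open_redirect", "form_injection",
    "csrf", "dom_manipulation", "missing_security_header",
}

def route(vuln_type: str) -> str:
    """Send pattern-based classes to Sentinel, everything else to a human."""
    return "sentinel" if vuln_type in AUTO_VALIDATE else "manual_review"
```

Defaulting unknown classes to manual review keeps the automation conservative: a finding is only handled autonomously when it is explicitly in scope.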


Privacy by Design: Everything Runs Locally

Just like Mastermind, Sentinel runs entirely on our own infrastructure. The language model, the browser automation engine, the validation orchestrator, and the reporting pipeline all operate on hardware we own. No vulnerability data, no target application details, and no exploit payloads are ever sent to an external service.


This matters for two reasons. First, the data Sentinel processes is among the most sensitive information in our entire operation. It's working with confirmed vulnerability details, proof-of-concept exploits, and detailed information about client application behavior. Second, Sentinel actively interacts with target applications. The browser sessions, the payloads it generates, and the responses it observes all need to stay within our controlled environment.


Running locally also gives us complete control over the validation environment. We can configure network isolation, set up proxy chains for traffic inspection, and ensure that Sentinel's browser sessions are properly sandboxed. This level of operational control would be impossible with a cloud-based service.


What Changes for Our Team

The most visible impact is speed. What used to take half a day now takes minutes. But the more significant change is in how our engineers spend their time.

Before Sentinel, a typical web application assessment workflow looked like this: run the scanner, wait for results, then spend several hours manually working through each finding to separate the real issues from the noise. The actual analysis, the creative part where the engineer thinks about attack chains, tests business logic, and develops meaningful recommendations, got compressed into whatever time was left.


With Sentinel, the workflow shifts. The scanner runs. Sentinel validates. By the time the engineer sits down to analyze the results, they already have a pre-sorted queue: confirmed vulnerabilities at the top, likely false positives filtered to the bottom, and a small set of ambiguous findings flagged for manual investigation. The engineer can skip straight to the work that requires human judgment and creativity.


| Before Sentinel | After Sentinel |
| --- | --- |
| Run scanner (automated) | Run scanner (automated) |
| Manually validate each finding (4–8 hours) | Sentinel validates findings (< 15 minutes) |
| Sort and prioritize confirmed findings | Review pre-sorted, pre-validated results |
| Analyze complex vulns (remaining time) | Analyze complex vulns (most of the day) |
| Write report under time pressure | Write report with full context and depth |


The result is better reports, faster turnaround, and engineers who get to spend their time doing the work they were hired for instead of burning hours on mechanical validation tasks.


Sentinel and Mastermind: Better Together

Sentinel and Mastermind are designed to complement each other as parts of a unified platform.


When Sentinel confirms a vulnerability, that finding flows into the knowledge base that Mastermind manages. The next time an engineer asks Mastermind about a client's security posture, the response includes not just what previous reports documented, but what Sentinel validated on the latest scan. When a sales lead asks about the current state of a client relationship, Mastermind can reference the fact that Sentinel just confirmed three critical findings on their latest assessment, and that remediation guidance was delivered last week.


We're also building integration between Sentinel and complementary network-level assessment tools that handle infrastructure testing: port scanning, service enumeration, and CVE exploitation against running services. Sentinel's browser-based approach is powerful for web application vulnerabilities, but some vulnerability classes live at the network and protocol layer, outside the browser's reach. By combining both approaches, we get comprehensive coverage that neither could achieve alone.


| Vulnerability Domain | Browser-Based (Sentinel) | Network-Level Tools |
| --- | --- | --- |
| Web application (XSS, CSRF, injections) | ✅ Primary strength | ➖ Limited coverage |
| Authentication and session handling | ✅ Full browser context | ➖ Cannot observe client-side state |
| Infrastructure CVEs (service-level) | ➖ Outside browser scope | ✅ Primary strength |
| Port and service exposure | ➖ Not applicable | ✅ Direct protocol testing |
| API endpoint vulnerabilities | ✅ Full HTTP interaction | ✅ Request-level testing |


The vision is a platform where our scanners flag potential issues, Sentinel and its network-level counterparts validate them autonomously, and Mastermind makes the results queryable and actionable for every department in the company. Each component amplifies the value of the others.


Where We're Headed

The current version of Sentinel processes scanner output in batch: export the findings, feed them to Sentinel, get results. The next step is real-time integration. We're building a pipeline where findings flow from our scanning tools directly into Sentinel's validation queue as they're discovered, with results streaming back to our project management platform in near real time.
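A minimal sketch of that streaming model, using an in-process queue and a worker thread. The production pipeline would presumably use a durable queue and run the full reason/execute/analyze loop where the placeholder verdict sits here:

```python
import queue
import threading

findings_q: "queue.Queue[dict]" = queue.Queue()
results = []

def validator_worker() -> None:
    """Drain findings as the scanner emits them; None is the shutdown signal."""
    while True:
        finding = findings_q.get()
        if finding is None:
            break
        # Placeholder for the reason/execute/analyze loop; a real worker
        # would produce Confirmed / Not Confirmed / Needs Manual Review.
        results.append({**finding, "verdict": "Needs Manual Review"})
        findings_q.task_done()

worker = threading.Thread(target=validator_worker)
worker.start()
# The scanner side streams findings into the queue as it discovers them.
for f in [{"id": 1, "type": "reflected_xss"}, {"id": 2, "type": "csrf"}]:
    findings_q.put(f)
findings_q.put(None)  # no more findings for this scan
worker.join()
```

The queue decouples discovery from validation, so verdicts can stream back to the project management platform while the scan is still running.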


We're also working on expanding Sentinel's reasoning capabilities for more complex validation scenarios. Today, it excels at single-step vulnerabilities: inject a payload, observe the result. The next generation will handle multi-step validation chains: authenticate as a user, navigate to a specific workflow, trigger a vulnerability that only manifests under certain application state conditions, and verify the impact.


Longer term, we see Sentinel evolving from a validator into an active tester. Rather than waiting for a scanner to flag findings, it will proactively explore application surfaces, identify potential weaknesses based on patterns it has learned from our historical engagement data (powered by Mastermind's knowledge layer), and generate its own test cases. This is the direction the industry is moving, and we're building the infrastructure to get there.


The Bigger Picture

Sentinel is part of a broader shift in how we think about security consulting. The firms that treat AI as a bolt-on feature, a chatbot wrapper on top of the same manual workflows, will find themselves outpaced by firms that rethink the workflow itself.


We're not adding AI to our existing process. We're rebuilding the process around what AI makes possible. Mastermind eliminates the knowledge friction. Sentinel eliminates the validation friction. Together, they let our team operate at a level of speed, depth, and consistency that manual workflows simply cannot match.


The human expertise doesn't go away. It gets amplified. Our engineers spend less time on mechanical tasks and more time on the creative, judgment-intensive work that clients actually pay for. That's not a marginal improvement. It's a fundamental shift in how much value a security consultancy can deliver per engagement.


Sentinel is currently in active development alongside Mastermind, with both systems sharing the same local infrastructure. We'll continue to share updates as the platform matures and our team begins integrating it into live engagements.


This is part two of a two-part series on our AI security operations platform. Part one covers Mastermind, our AI-powered knowledge and coordination assistant for the entire company.


 
 
 
