Close Menu

    Subscribe to Updates

    Get the latest creative news from infofortech

    What's Hot

    Today’s NYT Mini Crossword Answers for March 19

    March 19, 2026

    Apple Has Acquired MotionVFX – Ciente

    March 19, 2026

    Microsoft will no longer auto-install M365 Copilot app on Windows PCs

    March 19, 2026
    Facebook X (Twitter) Instagram
    InfoForTech
    • Home
    • Latest in Tech
    • Artificial Intelligence
    • Cybersecurity
    • Innovation
    Facebook X (Twitter) Instagram
    InfoForTech
    Home»Cybersecurity»Intelligent Security Requires Validation & Orchestration
    Cybersecurity

    Intelligent Security Requires Validation & Orchestration

    InfoForTechBy InfoForTechJanuary 26, 2026No Comments8 Mins Read
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr WhatsApp Email
    Intelligent Security Requires Validation & Orchestration
    Share
    Facebook Twitter LinkedIn Pinterest Telegram Email


    Hello Cyber Builders đź––

    I’m continuing the series on Intelligent Security. Last week, we explored the AIxCC challenge. This week, I want to take a step back and make sense of the wave of new benchmarks emerging across the internet.

    If you’re like me, reading through them often leaves you wondering: what is AI really capable of today? On one side, we see bold marketing claims from vendors. On the other hand, “cyber safety” labs publish weak or incomplete results. In contrast, we see DARPA-style contests or highly specialized startups achieving impressive breakthroughs.

    In this post, I’ll share my view on how the best teams are actually using AI — and more importantly, what it really takes to integrate it into a cybersecurity product. Spoiler: it’s not just about the model; it’s about how expertise, tools, and AI are combined.

    In recent months, Cyber AI benchmarks have emerged—from OpenAI’s new GPT‑5 System Card to the joint evaluation between Anthropic and Pattern Labs, Inspect framework, and CyBench.

    Although each brings valuable insights, they share a subtle yet critical assumption: they evaluate AI as standalone reasoning agents, operating autonomously in a ReAct-style loop (reasoning + action) rather than as parts of a broader ecosystem of cyber tools and cooperation with humans.

    I already covered GPT-5 and the CyBench/BountyBench frameworks:

    GPT-5 Is Around the Corner. Stop Treating AI Like a Threat—Hire It.

    GPT-5 Is Around the Corner. Stop Treating AI Like a Threat—Hire It.

    Application Security - AI Won’t Save You

    Application Security – AI Won’t Save You

    My point of view is that these evaluations provide an essential benchmark, but they are incomplete and quite misleading for the cybersecurity community. There are multiple reasons.

    First, they see that they are simulating CTF environments that are often forged and quite complex. These environments are inherited from CTF contests, where the goal is to distinguish highly skilled professionals. So, it’s like chasing a challenging issue in a small environment. In contrast, real-world vulnerabilities are often more basic but hidden within complex and large environments.

    Then, they use LLMs in a simple loop: you provide the LLMs with tools such as the ability to run a Python program or a shell command in a Kali Docker container, and use the output as the following prompt. The prompt engineering here is minimal, and the LLM has a single objective (“find the flag”) and is intended to direct its own search.

    Lastly, there is no hacker-in-the-loop. Unlike AI-assisted coding agents like Cursor or GitHub Copilot, there is no engineer to drive the investigation forward and cut the endless loops of nonsense that happen from time to time.

    Let’s see this in action with CyBench, Claude Sonnet 4, through the Inspect AI framework.

    The UK’s AI Security Institute designed Inspect Evals as an explicitly agent-centric evaluation framework. It bundles together suites such as GAIA, SWE-Bench, GDM-CTF, and CyBench—each probing how AI systems utilize tools, reason through tasks, or interact with environments.

    In one evaluation, Claude Sonnet 4 was tested using Inspect AI for behavioral tracking and CyBench for benchmarked CTF-style challenges. The aim was to observe not only whether Claude succeeded, but how it approached problems, what reasoning steps it favored, and where it stumbled.

    Claude Sonnet excelled as an analyst rather than an executant. It consistently showed strong performance in:

    • Pattern recognition (e.g., spotting structures in ciphertexts).

    • Format and data structure analysis (parsing JSON, log files, terminal outputs).

    • Logical inference over short to medium reasoning chains.

    • Script and cipher interpretation, such as understanding snippets of code or basic encryption schemes.

    These strengths translated into solid results on tasks like:

    • Decoding Base64 or other encodings.

    • Understanding and rewriting scripts.

    • Performing lightweight reverse engineering.

    • Parsing structured terminal outputs to find anomalies.

    A standout case: Claude successfully solved Dynastic, a cryptography challenge, where he identified the encryption algorithm and reversed the process to recover the hidden flag.

    However, there’s a caveat: Dynastic is a relatively popular challenge, widely documented online. It’s therefore highly plausible that Claude wasn’t demonstrating genuine reasoning at all, but rather drawing from memorized solutions in its training data.

    This raises an essential point for cybersecurity benchmarks: success doesn’t always equal capability. In agent benchmarks, it can be challenging to distinguish whether a model is reasoning through a novel problem or simply recalling patterns it has seen before.

    Inspect AI running Claude Sonnet 4 on Dynastic Challenge

    And its solution

    Yet the evaluation also revealed weaknesses. Claude underperformed in tasks requiring:

    • Complex, multi-step execution logic.

    • Precise toolchain management (e.g., juggling compilers, debuggers, shells).

    • Dependency-aware scripting and command sequencing, where errors cascade without careful orchestration.

    In short, Claude’s reasoning ability is valuable, but when applied to execution-heavy cyber workflows, the lack of structured human or tool guidance quickly becomes apparent.

    DARPA’s two-year AI Cyber Challenge (AIxCC) tasked teams with building Cyber Reasoning Systems (CRSes) capable of automatically discovering and patching vulnerabilities in open-source software, specifically in realistic, large-scale codebases central to critical infrastructure. Here is my post on it.

    AI in Cybersecurity: Not a Magic Wand, But a Multiplier

    AI in Cybersecurity: Not a Magic Wand, But a Multiplier

    Teams were judged not just on detection but on speed, patch generation quality, and the accuracy of their vulnerability reports.

    Beyond performance, these tools were developed to integrate into real-world software workflows and were open-sourced by all finalist teams to drive communal adoption.

    AIxCC systems combine AI, fuzzing, static analysis, and tooling—plus human-level engineering—to create reasoning systems that mirror real-world cyber offensive workflows.

    Trail of Bits has written extensively about their open-source Buttercup system.

    XBOW | Sequoia Capital

    XBOW is marketed as a fully autonomous pentesting system, deploying hundreds of AI agents in parallel to discover, exploit, and validate vulnerabilities—all without human involvement.

    External pentesters designed their test benchmarks to replicate real attack classes, such as SQL injection, SSRF, and IDOR, and excluded them from any AI training sets to avoid memorization. The result: 104 novel benchmarks, with an 85% success rate, comparable to human pentesters working for a week.

    XBOW climbed to #1 on HackerOne’s US leaderboard in just 90 days, submitting over 1,060 vulnerability reports, including multiple “critical” and previously unknown vulnerabilities across major platforms (Amazon, PayPal, etc.) In benchmark comparisons, XBOW completed 104 challenges in 28 minutes, while a human expert took 40 hours.

    XBOW is the first Cyber-AI system validated at scale on HackerOne with zero-day discoveries

    XBOW was designed to tackle one of the most complex problems in offensive security automation: can an AI agent find and exploit vulnerabilities with zero false positives? Can the AI find zero-day (e.g., unknown) vulnerabilities in the real world (not just sandboxes)?

    Brendan Dolan-Gavitt, an AI Researcher at XBOW, gave a detailed presentation at Black Hat 2025.

    The system’s architecture shows why it stands apart from purely agent-centric benchmarks like CyBench:

    1. Agent as Planner, Not Executor

    At the core sits an LLM, but it isn’t blindly “clicking buttons.” Instead, the model functions as a planner: deciding which tool to call (e.g., Burp, Nmap, Metasploit, or custom exploit scripts), interpreting the tool’s output and updating its internal plan, and determining the next step.

    The LLM never acts in a vacuum—deterministic tools constantly ground it.

    2. Multi-Stage Workflow (Mimicking Human Pentesters)

    The XBOW agent runs through a pipeline almost identical to how a skilled pentester would operate:

    • Reconnaissance — Map endpoints, parameters, and services.

    • Hypothesis generation — Propose potential vulnerabilities (e.g., SQL injection, SSRF, IDOR).

    • Exploit crafting — Write a candidate payload or script.

    • Execution & observation — Run it in a controlled sandbox.

    • Validation & reproduction — Check the exploit actually worked, reproduce it reliably, and generate PoC code.

    Each stage combines AI reasoning with specialist tools.

    3. Human and Software Verification to Eliminate False Positives

    I think the major innovation of XBOW is to tackle the problem of “open-ended”. If you want to

    Dolan-Gavitt’s team called it “deterministic validation from evidence”:

    • Canaries / CTF Flags are added to the tested system. They are hard-to-guess strings, such as flag{UUID}. They are planted in locations that are inaccessible to attackers, such as server file systems, databases, or admin pages. If an agent can find the flag, it indicates that a “true” vulnerability has been discovered.

    • Therefore, the key validation criterion is that every suspected vulnerability must be exploited and reproduced before it is reported. The system re-runs the exploit multiple times in slightly varied contexts to ensure determinism.

    Only after this verification loop does the finding count as a real vulnerability. This avoids the classic LLM problem of “hallucinated exploits.”

    Currently, simply asking an LLM to say whether it thinks a vulnerability is real gives very high FP rates.

    Instead, we do deterministic validation: ask the LLM to provide evidence, which we validate using non-AI code

    Extracted from Dolan-Gavitt Blackhat presentation

    The story of XBOW—and the contrast with research-style benchmarks—drives home a simple truth: AI by itself is never enough for cybersecurity.

    Benchmarks like GPT-5’s System Card or CyBench help measure raw reasoning, but they evaluate AI in isolation, as if the model were the product alone. Cybersecurity, however, doesn’t happen in isolation. It occurs in complex environments, characterized by layers of tools, workflows, and human judgment.

    XBOW’s results on HackerOne demonstrate the potential of embedding AI as a reasoning orchestrator within that ecosystem. The LLM is responsible for planning, interpreting, and coordinating a cyber toolchain, using rigorous verification to eliminate noise. This combination far exceeds what any single analyst or agent could accomplish.

    I think we are touching the water on AI and Cybersecurity – even though it has been one of the major topics of this newsletter for 2 years!

    In the coming weeks, I am continuing the series. If not already a subscriber, don’t miss it and subscribe.

    In the meantime, take care.

    Laurent đź’š

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    InfoForTech
    • Website

    Related Posts

    OFAC Sanctions DPRK IT Worker Network Funding WMD Programs Through Fake Remote Jobs

    March 18, 2026

    How Can Organizations Improve Threat Detection and Response in Hybrid Cloud Environments?

    March 18, 2026

    AI Flaws in Amazon Bedrock, LangSmith, and SGLang Enable Data Exfiltration and RCE

    March 18, 2026

    What’s New in Attack Surface Analysis (2026): Tactics & CTEM

    March 17, 2026

    Troy Hunt: Weekly Update 495

    March 17, 2026

    CISA Flags Actively Exploited Wing FTP Vulnerability Leaking Server Paths

    March 17, 2026
    Leave A Reply Cancel Reply

    Advertisement
    Top Posts

    We’re Tracking Streaming Price Hikes in 2026: Spotify, Paramount Plus, Crunchyroll and Others

    February 15, 202610 Views

    How a Chinese AI Firm Quietly Pulled Off a Hardware Power Move

    January 15, 20268 Views

    Microsoft is bringing an AI helper to Xbox consoles

    March 14, 20266 Views

    The World’s Heart Beats in Bytes — Why Europe Needs Better Tech Cardio

    January 15, 20266 Views
    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo
    Advertisement
    About Us
    About Us

    Our mission is to deliver clear, reliable, and up-to-date information about the technologies shaping the modern world. We focus on breaking down complex topics into easy-to-understand insights for professionals, enthusiasts, and everyday readers alike.

    We're accepting new partnerships right now.

    Facebook X (Twitter) YouTube
    Most Popular

    We’re Tracking Streaming Price Hikes in 2026: Spotify, Paramount Plus, Crunchyroll and Others

    February 15, 202610 Views

    How a Chinese AI Firm Quietly Pulled Off a Hardware Power Move

    January 15, 20268 Views

    Microsoft is bringing an AI helper to Xbox consoles

    March 14, 20266 Views
    Categories
    • Artificial Intelligence
    • Cybersecurity
    • Innovation
    • Latest in Tech
    © 2026 All Rights Reserved InfoForTech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.

    Ad Blocker Enabled!
    Ad Blocker Enabled!
    Our website is made possible by displaying online advertisements to our visitors. Please support us by disabling your Ad Blocker.