How AI Found a Critical Windows Flaw Hidden in Plain Sight for Six Years
An inside look at Microsoft’s Multi‑Model Agentic Scanning Harness (MDASH) and the vulnerability it uncovered in the Windows networking stack
By Aaron Rose · Tech Reader Magazine · June 19, 2026
Morning Coffee
Ethan had rules about Tuesdays.
Nothing before ten. No Slack before coffee. And never open the security dashboard before the second cup.
He’d been on the Windows Networking Platform team for eleven years. He’d lived through the IPv4-to-dual‑stack transition, written memory management code he was proud of and some he preferred not to remember, and developed the calm that comes from surviving too many P0 incidents that turned out to be nothing. In his experience, the panic‑to‑signal ratio in enterprise security hovered around forty to one.
So when his laptop chimed at 8:47 on a Tuesday in March with a notification reading MDASH — Priority Finding — tcpip.sys — Authentication Boundary, he poured the second cup first.
He almost didn’t open it.
What Was Waiting
When he finally did, the finding wasn’t a theory.
MDASH — Microsoft’s Multi‑Model Agentic Scanning Harness, integrated into his team’s CI/CD pipeline eight weeks earlier — hadn’t flagged a suspicious pattern or a code smell. It had traced a complete exploit path through 847 lines of networking code, identified the precise sequence of concurrent operations required for a remote, unauthenticated attacker to execute arbitrary code at kernel privilege, and assembled a proof‑of‑exploitability before Ethan’s team finished standup.
The vulnerability was in tcpip.sys. It had been there since 2019.
Six years. Through four major Windows releases. Through multiple security audits, penetration tests, and red‑team exercises. Through three security leads, two of whom had reviewed that exact section of code.
It had waited.
How You Miss Something for Six Years
To understand what MDASH found, you have to understand why nobody else did.
The bug wasn’t obvious. It wasn’t a buffer overflow or a hardcoded credential. It wasn’t the kind of thing that shows up in first‑pass static analysis. It was a race condition living in the interaction between two subsystems — the TCP state machine and the authentication token validator — that only became exploitable under a specific sequence of concurrent operations no human reviewer would naturally trace end‑to‑end.
Brooke, the team’s security architect, later called it a layered assumption failure. Each subsystem was correct in isolation. The TCP state machine handled transitions properly. The authentication validator behaved exactly as specified. The vulnerability lived in the gap — in the assumption that a particular token validation would always complete before a particular state transition.
Reasonable when written. True in almost all real‑world conditions. But breakable by an attacker who knew exactly where to push.
“Finding it requires holding both subsystems in your head at once,” Brooke said. “Understanding their timing dependencies. Then reasoning about what happens when those dependencies break. No human reviewer does that across a 200,000‑line codebase. It’s not intelligence. It’s scale.”
This is the class of problem MDASH was built to address — not because it is smarter than Brooke, but because it is different in kind.
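A toy sketch can make the "layered assumption failure" concrete. The Python below is not the tcpip.sys code and not anything MDASH produced; the step names (`validate_start`, `validate_done`, `state_to_established`, `grant_access`) are hypothetical stand-ins for the two subsystems. It enumerates every ordering of two short operation sequences and reports the orderings in which access is granted before validation has completed.

```python
from itertools import combinations

# Hypothetical step names; each subsystem is correct on its own.
VALIDATOR = ["validate_start", "validate_done"]
STATE_MACHINE = ["state_to_established", "grant_access"]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's internal order."""
    total = len(a) + len(b)
    for positions in combinations(range(total), len(a)):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(total)]


def violates_assumption(schedule):
    """True if access is granted before token validation has completed."""
    return schedule.index("grant_access") < schedule.index("validate_done")


all_schedules = list(interleavings(VALIDATOR, STATE_MACHINE))
unsafe = [s for s in all_schedules if violates_assumption(s)]

print(f"{len(unsafe)} of {len(all_schedules)} orderings break the assumption")
for s in unsafe:
    print(" -> ".join(s))
```

With two steps per subsystem there are only six orderings to check. The enumeration counts orderings, not how likely each one is, so an ordering that almost never occurs in normal traffic still shows up here. The number of orderings also grows combinatorially as steps are added, which is the scale problem Brooke describes.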
The Harness
MDASH is not a model. Microsoft is deliberate about this distinction.
It is an orchestration harness: a structured pipeline of more than a hundred specialized agents, organized into sequential stages and running across an ensemble of AI models. The models are components. The architecture is the product.
The pipeline runs in six stages.
Prepare
Before scanning, MDASH builds context: call graphs, data‑flow paths, memory allocation patterns, privilege boundaries — a working map of how the system behaves, not how documentation claims it behaves. This scaffolding guides every downstream agent.
Scan
A fleet of specialized agents fans out across the codebase. Each has a narrow mandate: use‑after‑free patterns, integer overflows, user‑controlled input paths, authentication boundary crossings, timing‑assumption violations. They work in parallel, without fatigue, without cognitive shortcuts, without the “I’ve seen this before” bias that helps humans move fast but blinds them to the unfamiliar.
Validate
Static analysis tools often surface vulnerabilities in code paths that can’t actually be reached. MDASH filters those out early. The validation agents ask one question: Is this reachable in real conditions? Most findings die here.
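The reachability question can be sketched as a graph search. This is a minimal illustration, assuming a call graph is already available from the Prepare stage; the function names, the `ENTRY_POINTS` list, and the finding records are all hypothetical and do not describe MDASH's actual data structures.

```python
from collections import deque

# Hypothetical call graph: caller -> callees. All names are illustrative.
CALL_GRAPH = {
    "recv_packet": ["parse_header", "update_tcp_state"],
    "parse_header": ["check_token"],
    "update_tcp_state": [],
    "check_token": [],
    "legacy_debug_hook": ["unsafe_copy"],  # nothing calls this function
    "unsafe_copy": [],
}
ENTRY_POINTS = ["recv_packet"]  # assumed attacker-reachable entry points


def reachable_from(graph, entries):
    """Breadth-first search returning every function reachable from entries."""
    seen = set(entries)
    queue = deque(entries)
    while queue:
        fn = queue.popleft()
        for callee in graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def filter_reachable(findings, graph, entries):
    """Keep only findings whose function is reachable from an entry point."""
    live = reachable_from(graph, entries)
    return [f for f in findings if f["function"] in live]


findings = [
    {"id": "F1", "function": "check_token", "kind": "timing-assumption"},
    {"id": "F2", "function": "unsafe_copy", "kind": "buffer-overflow"},
]
surviving = filter_reachable(findings, CALL_GRAPH, ENTRY_POINTS)
print([f["id"] for f in surviving])
```

Here `F2` is dropped because no entry point leads to `unsafe_copy`. A real validator would also need to reason about data flow and branch conditions, since a call edge existing does not mean an attacker can drive execution down it.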
Debate
This is where MDASH diverges from every prior automated security system.
A separate set of adversarial agents challenges each surviving finding. They argue that the vulnerability is not exploitable, that the context makes it safe, that the required conditions can’t be assembled. The original finding must survive that pressure. If it can’t, it’s dropped.
This stage is why MDASH’s false‑positive rate is low enough to be operationally useful. The system argues with itself before it argues with you.
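One way to picture the debate stage is as a filter in which a finding survives only if every objection raised against it is answered. The sketch below is an assumption about the shape of that logic, not Microsoft's implementation: the `Finding` class, the challenger functions, and the objection labels are hypothetical, and real adversarial agents would generate objections from the code rather than return fixed strings.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    id: str
    claim: str
    # Maps an objection label to the evidence that answers it (hypothetical).
    rebuttals: dict = field(default_factory=dict)


# Each challenger stands in for an adversarial agent and raises one objection.
def not_reachable(finding):
    return "path is unreachable"


def conditions_unattainable(finding):
    return "timing window cannot be hit"


CHALLENGERS = [not_reachable, conditions_unattainable]


def debate(finding, challengers):
    """Return (survives, unanswered); survival requires answering every objection."""
    objections = [challenge(finding) for challenge in challengers]
    unanswered = [o for o in objections if o not in finding.rebuttals]
    return (not unanswered, unanswered)


strong = Finding(
    id="F1",
    claim="token validation can complete after the state transition",
    rebuttals={
        "path is unreachable": "call path from network input traced",
        "timing window cannot be hit": "interleaving reproduced in test",
    },
)
weak = Finding(
    id="F2",
    claim="possible overflow in a length field",
    rebuttals={"path is unreachable": "call path traced"},
)

for f in (strong, weak):
    survives, unanswered = debate(f, CHALLENGERS)
    print(f.id, "survives" if survives else f"dropped: {unanswered}")
```

The design choice worth noticing is that the burden of proof sits with the finding: a single unanswered objection is enough to drop it, which trades some missed bugs for a queue engineers can trust.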
Dedupe
Multiple agents often find the same root cause from different angles. This stage collapses duplicates and surfaces one clean finding with the strongest evidence.
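Deduplication of this kind can be sketched as grouping reports by root cause and keeping the best-supported one. The example assumes each report already carries a normalized `root_cause` label and a numeric `evidence_score`; both fields, and the agent names, are hypothetical. In practice, deciding that two reports share a root cause is the hard part, and this sketch takes it as given.

```python
from collections import defaultdict

# Hypothetical reports from three agents; two describe the same root cause.
reports = [
    {"agent": "timing", "root_cause": "token-check/state-transition race",
     "evidence_score": 0.72},
    {"agent": "auth-boundary", "root_cause": "token-check/state-transition race",
     "evidence_score": 0.91},
    {"agent": "int-overflow", "root_cause": "length field wraparound",
     "evidence_score": 0.40},
]


def dedupe(reports):
    """Collapse reports sharing a root cause, keeping the strongest evidence."""
    groups = defaultdict(list)
    for report in reports:
        groups[report["root_cause"]].append(report)

    merged = []
    for root_cause, group in groups.items():
        best = max(group, key=lambda r: r["evidence_score"])
        merged.append({
            "root_cause": root_cause,
            "best_evidence_from": best["agent"],
            "evidence_score": best["evidence_score"],
            "reported_by": sorted(r["agent"] for r in group),
        })
    return merged


merged = dedupe(reports)
for finding in merged:
    print(finding["root_cause"], "<-", finding["reported_by"])
```

Keeping the `reported_by` list preserves the fact that independent agents converged on the same issue, which is itself a useful signal for a reviewer.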
Prove
The final stage constructs the exploit path — the specific sequence of operations, the timing window, the mechanism — and demonstrates exploitability. What lands in an engineer’s queue is not a hypothesis. It is a case.
Ethan’s Tuesday
When Ethan opened the MDASH finding, he saw a complete exploit path through tcpip.sys: the concurrent operation sequence required to trigger the race condition, the authentication boundary violation, the resulting privilege escalation, and a demonstration of remote code execution at kernel level with no authentication.
Severity: Critical. CVSS: 9.8.
He read it twice. Then he called Brooke.
“Is this real,” he asked.
She was quiet for a moment. “Let me look.”
It took her forty minutes to trace the path — through the TCP state machine, across the authentication boundary, through the timing dependency, into the exploit. She could do it only because MDASH had already mapped the route.
At the end she sat back and said something Ethan chose not to repeat to his manager.
It was real. It had been real since 2019.
The Fix
Ethan’s team had the patch written by Thursday.
This matters because people often assume that once a vulnerability is found, fixing it is trivial. For complex bugs, finding and fixing are both hard. Understanding what’s broken well enough to fix it without introducing new problems requires the same deep subsystem knowledge that makes the bug hard to find.
What MDASH changed wasn’t the difficulty of the fix. It was the starting point.
The finding gave the team the exploit path, the root cause, the timing dependency, the subsystem interaction. They didn’t spend days reproducing the issue or debating whether it was real. They started from certainty.
By Monday, the patch cleared internal review. By the end of the month, it shipped on Patch Tuesday as the fix for a Critical Remote Code Execution vulnerability in tcpip.sys, one of four Critical RCEs in that cohort surfaced by MDASH and patched before public disclosure.
Multi‑year vulnerabilities are common in large codebases. What changed was the speed and certainty of discovery.
What This Changes
The obvious question is: how do we find more bugs faster?
The product answer is simple: deploy MDASH in more pipelines.
The deeper question is structural: what happens when continuous autonomous inspection becomes infrastructure — when the harness runs on every commit, against every change, and the coverage problem that allowed six‑year‑old vulnerabilities becomes a solved problem not through more humans but through a system that never stops reviewing?
David, who leads security architecture for a large enterprise division that adopted MDASH in early 2026, described the shift:
“We used to think of security review as a gate. You write code, you pass through the gate, you go to production. Gates have capacity limits. They create backlogs. Engineers learn to work around them. MDASH turns the gate into an environment. The code is always being read.”
The implications are still emerging. What’s visible is the trajectory: MDASH scored 88.45% on the CyberGym benchmark at launch, then jumped to 96.55% in under three weeks — not because the models improved, but because the harness did. Better debate logic. Tighter deduplication. More specialized agents.
The system learns at the pipeline level, not just the model level.
This is Microsoft’s architectural argument: the harness does the work. The model is a component. Upgrade the model and performance improves incrementally. Improve the harness and performance improves structurally.
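The two CyberGym scores quoted above can also be read as a change in what the system misses. The short calculation below uses only the two figures from the article and assumes each score is the percentage of benchmark tasks solved, so that 100 minus the score is the share not solved; that reading is an assumption, not something the benchmark numbers alone confirm.

```python
launch_score = 88.45  # percent, as reported at launch
later_score = 96.55   # percent, as reported under three weeks later

# Assumed interpretation: the remainder is the share of tasks not solved.
launch_miss = 100 - launch_score
later_miss = 100 - later_score

point_gain = later_score - launch_score
relative_miss_reduction = (launch_miss - later_miss) / launch_miss

print(f"gain: {point_gain:.2f} points")
print(f"missed share: {launch_miss:.2f}% -> {later_miss:.2f}%")
print(f"relative reduction in misses: {relative_miss_reduction:.1%}")
```

Under that assumption, an 8.10-point gain corresponds to the missed share falling from 11.55% to 3.45%, roughly a 70% reduction.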
The Counterargument
Not everyone agrees this is the long‑term answer.
Researchers at XBOW argue that rigid harnesses may eventually constrain newer models. As frontier models become capable of tracing complex vulnerability chains on their own, a fixed pipeline built for older limitations might work against the grain of the tools it runs. The harness, in this view, is scaffolding built for a generation of models that needed it — and scaffolding can outlive its usefulness.
It’s a fair concern. The counterargument is that MDASH’s harness is not fixed — it improved substantially in the first weeks after launch — and that the debate‑and‑prove architecture provides adversarial discipline that raw model capability doesn’t automatically replicate. A more capable model that isn’t structured to challenge its own findings will still produce false positives.
The harness isn’t scaffolding around a weak model. It’s quality control applied to whatever model runs inside it.
The debate will continue. The CVEs, meanwhile, keep getting patched.
The Vulnerability That Waited
Ethan kept a copy of the MDASH finding on his desktop for a few weeks after the patch shipped. Not as a trophy — he was too experienced for that — but as a reference point.
He had reviewed that section of tcpip.sys himself three years earlier during a security initiative prompted by a different vulnerability. He had read the code carefully. He had found nothing.
“I wasn’t looking for the right thing,” he said. “Because I didn’t know the right thing existed. You can’t solve that by being more careful. The search space is too large. You need a different kind of coverage.”
The patch shipped. The CVE was disclosed. The networking stack that hundreds of millions of Windows machines depend on is safer than it was the week before Ethan’s Tuesday.
The vulnerability waited six years.
MDASH surfaced it in a single pipeline run — built on weeks of integration, years of engineering, and a harness designed to see what humans can’t hold in their heads at once.