Measuring what matters: detection engineering beyond the dashboard

MITRE: Defense Evasion, Discovery

Detection engineers live in a world of competing metrics: alert volume climbs while executives demand proof of value. MITRE ATT&CK heatmaps show coverage gaps closing and mean time to detect shrinking, yet incidents still slip through. Teams often end up optimizing for metrics that make leadership comfortable rather than metrics that reveal ground truth.

The DRAPE index (Detection Reliability And Precision Efficiency) represents a recent attempt to unify detection quality measurement. The core premise is sensible: balance true positive detection against false positive noise in a single score. For detection engineers managing hundreds of rules across multiple data sources, this offers practical appeal. For example: a rule generating 50 true positives but 5,000 false positives earns a negative score while a rule catching 10 true positives with 5 false positives earns a positive score. The math attempts to mirror how practitioners already think about signal versus noise.
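
The published formula is not reproduced here, but a toy stand-in built on the same signal-versus-noise idea shows the shape of such a score. The function below and its scaling are assumptions for illustration, not the actual DRAPE math:

```python
def toy_precision_score(true_positives: int, false_positives: int) -> float:
    """Toy signal-vs-noise score in [-1, 1].

    A hypothetical stand-in for a DRAPE-style index, not the published formula.
    """
    total = true_positives + false_positives
    if total == 0:
        return 0.0
    return (true_positives - false_positives) / total

# Mirrors the example above: the noisy rule scores negative,
# the quieter rule scores positive.
print(toy_precision_score(50, 5_000))  # ~ -0.98
print(toy_precision_score(10, 5))      # ~ +0.33
```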

But single-number metrics carry hidden assumptions. They work when your detection logic can be tuned, when you control the rule parameters, when false positives stem from threshold misconfigurations or poor regex authoring. They fail when the fundamental approach cannot distinguish attack behavior from normal operations.

The measurement problem runs deeper than scoring

Detection engineers spend most of their time not building new detections, but managing existing ones. Tuning thresholds. Suppressing known false positives. Adding contextual filters. Updating regex patterns when attackers change one character. This maintenance burden exists because signature-based and rule-based detection are, by definition, not generalizable.

Consider a detection rule for lateral movement via PsExec. The rule fires on process execution patterns: psexec.exe spawning cmd.exe on remote systems. A competent red team renames the binary and the rule breaks. Detection engineers add wildcards, process hashes, parent-child relationships, and network connection patterns, and each refinement exposes new edge cases. When IT uses PsExec for legitimate administration, the rule requires exclusions based on who or what initiated the process. Those exclusions become attack paths.
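
A deliberately oversimplified sketch of such a rule makes the brittleness concrete. The field names are illustrative, not tied to any particular EDR schema:

```python
def psexec_lateral_movement_rule(event: dict) -> bool:
    """Naive signature: flag cmd.exe spawned by psexec.exe.

    Illustrative only; real rules add hashes, parent-child chains, and
    network context, but face the same brittleness.
    """
    parent = event.get("parent_image", "").lower()
    child = event.get("image", "").lower()
    return parent.endswith("psexec.exe") and child.endswith("cmd.exe")

# Fires on the textbook case...
print(psexec_lateral_movement_rule(
    {"parent_image": r"C:\Tools\PsExec.exe",
     "image": r"C:\Windows\System32\cmd.exe"}))  # True

# ...but the same binary renamed by a red team slips through.
print(psexec_lateral_movement_rule(
    {"parent_image": r"C:\Tools\svchelper.exe",
     "image": r"C:\Windows\System32\cmd.exe"}))  # False
```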

The DRAPE index would score this rule poorly once false positives accumulate. But the problem is not the scoring system; it is that rules cannot distinguish malicious intent from legitimate administrative action because they operate on observable artifacts, not behavioral understanding.

Traditional metrics assume the detection approach can succeed given sufficient tuning. They measure efficiency within a fundamentally limited paradigm. What if the paradigm itself is the constraint?

From metrics that measure noise to metrics that measure understanding

The LogLM does not generate alerts that need scoring. It identifies sequences of activity in flow data that represent attack progression. There are no thresholds to tune, no regex patterns to update, no exclusion lists to maintain. The model learned what normal network behavior looks like across billions of network flows from different networks. When an attacker moves laterally, the sequence of connections violates learned patterns even if the attacker uses renamed tools, living-off-the-land binaries, or entirely novel techniques.

This changes what metrics matter. Traditional detection engineering optimizes alert precision: reducing false positives for a fixed set of known attack patterns. Foundation model detection optimizes behavioral recognition: identifying attack intent regardless of specific implementation.

Detection without prior knowledge: Does the model identify novel attack techniques not seen during training? Zero-shot detection measures whether the model learned principles of attack behavior rather than memorizing specific patterns. When we deployed LogLM to detect threats in telecom network traffic, it identified ransomware variants, fuzzing attempts, and DDoS patterns with >99% detection rates across attack categories it had never explicitly trained on. The model recognized attack logic: reconnaissance scanning patterns, exploitation timing, lateral movement sequences.

Intent-based coverage: Rather than counting rules mapped to MITRE techniques, intent-based coverage asks: can the system detect the underlying attacker objectives regardless of specific tooling? An attacker might use PsExec, WMI, PowerShell Remoting, or custom SSH tunnels for lateral movement. Traditional detection requires separate rules for each tool. Intent detection recognizes the movement pattern itself. LogLM achieves this by learning from network flow metadata: connection timing, session establishment patterns, data transfer characteristics. The specific protocol or tool becomes less relevant than the intent signature.
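
One way to make this measurable is to score coverage per attacker objective rather than per rule. The tool-to-intent mapping below is a hypothetical sketch, not a complete taxonomy:

```python
# Hypothetical mapping of tooling variants to the attacker objective they serve.
TOOL_TO_INTENT = {
    "psexec": "lateral_movement",
    "wmi": "lateral_movement",
    "powershell_remoting": "lateral_movement",
    "ssh_tunnel": "lateral_movement",
}

def intent_coverage(detected_tools: set) -> dict:
    """An intent counts as covered only if every tooling variant that serves
    it is detectable, not just the one we happened to write a rule for."""
    coverage = {}
    for tool, intent in TOOL_TO_INTENT.items():
        covered_so_far = coverage.get(intent, True)
        coverage[intent] = covered_so_far and (tool in detected_tools)
    return coverage

# A single PsExec rule leaves the lateral movement *intent* uncovered.
print(intent_coverage({"psexec"}))           # {'lateral_movement': False}
print(intent_coverage(set(TOOL_TO_INTENT)))  # {'lateral_movement': True}
```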

Operational sustainability: How much analyst time does the detection approach consume? Traditional signature-based systems generate thousands of alerts requiring triage. Each alert represents either a true positive requiring investigation or a false positive requiring investigation and suppression logic, which leads to permanent maintenance overhead. Foundation models reduce this burden not by scoring alerts differently, but by operating at a higher level of abstraction. Instead of alerting on individual suspicious process executions, the model surfaces sequences that represent complete attack chains: initial access followed by privilege escalation followed by lateral movement. This contextual detection reduces alert volume while increasing actionable intelligence.
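
As a rough illustration of the alert-volume effect only, not of how LogLM works internally, here is what collapsing per-event alerts into per-host attack chains looks like. The schema and time window are assumptions:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical per-event alerts that a rule-based system would emit individually.
alerts = [
    {"host": "ws-042", "stage": "initial_access",       "time": datetime(2024, 5, 1, 9, 2)},
    {"host": "ws-042", "stage": "privilege_escalation", "time": datetime(2024, 5, 1, 9, 17)},
    {"host": "ws-042", "stage": "lateral_movement",     "time": datetime(2024, 5, 1, 9, 40)},
    {"host": "db-007", "stage": "initial_access",       "time": datetime(2024, 5, 1, 14, 5)},
]

def collapse_into_chains(alerts, window=timedelta(hours=2)):
    """Group alerts by host and merge those within a time window into one incident."""
    by_host = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a["time"]):
        chains = by_host[a["host"]]
        if chains and a["time"] - chains[-1]["last_seen"] <= window:
            chains[-1]["stages"].append(a["stage"])
            chains[-1]["last_seen"] = a["time"]
        else:
            chains.append({"stages": [a["stage"]], "last_seen": a["time"]})
    return [(host, c["stages"]) for host, cs in by_host.items() for c in cs]

# Four raw alerts collapse into two incidents, one of them a full attack chain.
for host, stages in collapse_into_chains(alerts):
    print(host, "->", " > ".join(stages))
```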

The explainability requirement

Detection engineers building rules can explain exactly why an alert fired. The process name matched, a specific registry key was modified, a network connection targeted port 445. This explainability is often presented as an advantage of traditional detection over AI-based approaches.

But explainability without accuracy is false comfort. A rule that explains why it fired on a false positive is no more useful than a model that correctly identified the attack without a detailed explanation. The critical question is not "can you explain the detection logic" but "do you understand why this is malicious."

LogLM provides explainability at the level that matters: which flow sequences triggered detection, what those sequences represent in network behavior, how they deviate from learned normal patterns. This is not the same as explaining why a regex matched, but it answers the more important question: what behavior indicates attack progression?

At a major US bank, explainability came through operational proof: the model found previously undetected lateral movement attempts that signature-based systems missed. Using DeepTempo's alerts, security analysts traced the identified flow sequences back to specific network sessions, validated the suspicious behavior, and confirmed the detections. In this case, the model's understanding of underlying network behavior proved more valuable than rule logic built on ever more precise regular expressions.

What detection engineers actually need to measure

If you manage a detection engineering program, your current metrics probably include: number of detections deployed, MITRE ATT&CK coverage percentage, alert volume trends, mean time to acknowledge, true positive rate, false positive rate. These metrics are not wrong per se, but, read carefully, they also reveal why existing approaches are not sufficient. The false positive rate in particular highlights the fundamental gap in rule- and signature-based detection: you often need near-perfect knowledge of your environment to prevent a flood of false positives.

Consider instead:

Attack detection capability against novel techniques: Run red team exercises using new attack frameworks or techniques not covered by existing detections. What percentage of the attack chain is visible? Traditional signature-based detection will miss novel techniques by definition. Intent models should detect new implementations of known attack patterns. This metric reveals whether your detection strategy can handle evolution in attacker tradecraft.
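
A simple way to report this after a red team debrief is the percentage of chain stages with at least one detection. The stage names and outcomes below are invented for illustration:

```python
def attack_chain_visibility(exercise_results: dict) -> float:
    """Percentage of attack-chain stages with at least one detection."""
    if not exercise_results:
        return 0.0
    detected = sum(1 for seen in exercise_results.values() if seen)
    return 100.0 * detected / len(exercise_results)

# Illustrative exercise using a framework your rules have never seen.
results = {
    "reconnaissance": True,
    "initial_access": False,
    "privilege_escalation": False,
    "lateral_movement": True,
    "exfiltration": False,
}
print(f"{attack_chain_visibility(results):.0f}% of the chain was visible")  # 40%
```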

Maintenance overhead per detection: Track analyst hours spent tuning, updating, and managing each detection rule or model over a defined period. Signature-based rules require continuous maintenance as attackers adapt and business processes change. Models trained on behavioral patterns require less frequent updates. This metric exposes the true operational cost of different detection approaches. A rule scoring well on DRAPE today might consume hundreds of analyst hours annually in maintenance. A behavioral model might require quarterly updates but minimal day-to-day intervention.
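
Tracking this can be as lightweight as logging hours against each detection and summing them over the review period. A minimal sketch with made-up numbers:

```python
from collections import Counter

# Hypothetical maintenance log: (detection name, analyst hours spent).
maintenance_log = [
    ("psexec_lateral_movement_rule", 6.5),  # exclusion added for IT admin hosts
    ("psexec_lateral_movement_rule", 4.0),  # regex updated after red team rename
    ("dns_tunneling_threshold_rule", 3.0),  # threshold re-tuned after FP spike
    ("behavioral_sequence_model",    1.5),  # quarterly review, no tuning needed
]

hours_per_detection = Counter()
for detection, hours in maintenance_log:
    hours_per_detection[detection] += hours

for detection, hours in hours_per_detection.most_common():
    print(f"{detection}: {hours:.1f} analyst hours this quarter")
```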

Time from attack to alert at scale: This differs from mean time to detect. It measures how quickly detection scales across attack progression. An attacker typically executes multiple stages: reconnaissance, initial access, privilege escalation, lateral movement, data exfiltration. Signature-based systems might detect one stage hours after execution. Behavioral systems should recognize attack progression in near real-time as the sequence unfolds. This metric captures whether your detection approach provides defenders with time to respond before damage occurs.
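
Measured per stage rather than as a single mean, the calculation might look like this sketch; the timestamps are invented:

```python
from datetime import datetime

# Hypothetical ground truth from a red team exercise vs. when alerts appeared.
stage_executed = {
    "reconnaissance":   datetime(2024, 5, 1, 9, 0),
    "initial_access":   datetime(2024, 5, 1, 9, 30),
    "lateral_movement": datetime(2024, 5, 1, 10, 15),
}
alert_raised = {
    "reconnaissance":   datetime(2024, 5, 1, 9, 4),
    "lateral_movement": datetime(2024, 5, 1, 10, 21),
    # no alert for initial_access at all
}

for stage, executed in stage_executed.items():
    if stage in alert_raised:
        lag = (alert_raised[stage] - executed).total_seconds() / 60
        print(f"{stage}: alert after {lag:.0f} minutes")
    else:
        print(f"{stage}: never detected")
```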

Detection confidence under adversarial pressure: What happens when attackers actively try to evade your detections? Red teams should explicitly attempt evasion: renaming tools, using alternative protocols, introducing delays between attack stages, operating within normal business hours, or even using known evasion techniques and frameworks. Does your detection rate drop to zero? Fall to 50%? Remain stable? This metric reveals whether your detection approach is fundamentally brittle or robust to adversarial adaptation.
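
A summary of such an exercise can be as simple as a per-scenario detection rate. The scenarios and outcomes below are hypothetical:

```python
# Hypothetical evasion scenarios and whether each attack run was still detected.
evasion_runs = {
    "baseline_tooling":      [True, True, True, True],
    "renamed_binaries":      [True, False, True, True],
    "alternative_protocols": [True, True, False, True],
    "slow_and_low_timing":   [False, True, True, False],
}

for scenario, outcomes in evasion_runs.items():
    rate = 100.0 * sum(outcomes) / len(outcomes)
    print(f"{scenario}: {rate:.0f}% detection rate")

# A brittle detection approach collapses on the evasion rows;
# a robust one stays close to its baseline rate.
```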

Beyond scoring: measuring what matters

The DRAPE index attempts to solve a real problem: detection engineers need objective metrics to evaluate rule performance and justify program investment. The index succeeds at measuring efficiency within rule-based detection paradigms. But the paradigm itself has fundamental limitations.

Signatures catch only known threats. They require constant maintenance as attackers adapt. They generate operational overhead through false positives that must be triaged. Most critically, they cannot detect novel attack techniques that exploit behavioral patterns rather than specific artifacts.

LogLMs approach detection differently. They learn attack behavior patterns from large-scale training data. They recognize attack progression based on the actual flows in your network rather than artifact matching. They generalize to detect novel implementations of known attack patterns without requiring new rules or signatures.

This shifts which metrics reveal value. Instead of scoring individual detection rules on their true positive and false positive rates, measure whether your overall detection approach can identify attacks it has never seen before, sustain detection capability as attackers evolve their techniques, and provide defenders with actionable intelligence rather than alert noise.

For detection engineers, this means: if you find yourself spending most of your time tuning false positives, writing exclusion rules, and updating detection logic when attackers make minor changes, you are optimizing within a paradigm that cannot fundamentally solve the problem. The metrics might improve slightly quarter over quarter. The underlying detection gap persists.

Foundation models for cybersecurity do not represent an incremental improvement but a categorical shift: from detecting what we explicitly program to recognizing what the model learned about attack behavior. The right metrics measure whether that shift delivers operational advantage.

Practical next steps

Detection engineers reading this might think: foundation models sound promising, but I still need to score my existing detections tomorrow morning. Fair point. Here is the practical path forward:

Continue using metrics like DRAPE to optimize your current detection rules. They provide value for managing rule-based systems. But simultaneously, start measuring the fundamental limitations those rules cannot overcome. Track the novel attacks your rules missed, the analyst hours spent on false positive triage and tuning, and the detection failures during red team exercises that use new techniques.

These metrics reveal the ceiling of what rule-based detection can achieve. When that ceiling becomes visible and your optimization curves flatten despite continued investment, it becomes apparent that a new approach is required: the metrics shift from scoring individual rules to measuring whether your detection capability keeps pace with attacker evolution.

That is the measurement that matters.

To see how LogLM measures against those limitations in your environment, contact hello@deeptempo.ai.

The detection engineering discipline matured by focusing on measurable outcomes: alert accuracy, coverage gaps, response times. Foundation models for threat detection continue that discipline while expanding what can be measured. If your current metrics show optimization within a rule-based paradigm, consider whether measuring the paradigm's fundamental limitations might reveal more about where to invest next.
