Every security vendor now claims "AI-powered" detection. Some use deep learning models trained on millions of examples. Others apply simple statistical rules and call them AI. Distinguishing genuine machine learning capabilities from marketing requires a structured evaluation framework. Security leaders evaluating AI detection systems need specific questions to ask, tests to run, and red flags to recognize. This guide provides that framework, vendor-neutral criteria that help practitioners cut through marketing claims and assess whether AI detection will deliver operational value in their environment.

Questions to ask vendors

The initial conversation with vendors should establish what their "AI" actually is and how it works. Specific technical questions reveal substance behind marketing claims.

Model architecture and approach

Ask what type of machine learning the system uses. Deep learning neural networks? Random forests? Gradient boosted trees? Statistical anomaly detection? Rule-based systems with AI-sounding names? The architecture matters because different approaches have different strengths, weaknesses, and operational characteristics. Vendors should explain their choice of architecture and why it fits security detection.

Ask whether the system uses true machine learning or rule-based heuristics. Many tools marketed as AI are actually just rule-based systems with sophisticated-sounding names. Request technical documentation showing model architecture. Genuine ML systems have training procedures, learned parameters, and inference mechanisms. Rule-based systems rely on if-then logic regardless of what they are called, or whether an AI is described as applying the rules.

Ask about foundation-model approaches versus single models or multi-model ensembles. Systems using multiple models provide robustness against evasion and generally achieve better performance, but they can be brittle when conditions shift. Foundation models generalize well and are adaptable. Vendors should explain how many models they use, how those models differ, and how results are combined.

Ask whether models use supervised, unsupervised, or semi-supervised learning. Supervised models learn from labeled examples (known attacks and benign traffic). Unsupervised models identify anomalies without labels. Semi-supervised learning combines both. Each approach has different data requirements, performance characteristics, and deployment considerations. Vendors should articulate their learning approach and justify why it fits security detection.
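To make the distinction concrete, the sketch below contrasts a supervised classifier trained on labeled examples with an unsupervised anomaly detector trained only on unlabeled traffic; it uses scikit-learn and synthetic feature vectors purely for illustration, not any vendor's implementation.

```python
# Minimal contrast of supervised vs. unsupervised detection
# (synthetic feature vectors; real systems use engineered network features).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest

rng = np.random.default_rng(0)
benign = rng.normal(0, 1, size=(1000, 5))     # simulated normal traffic features
attacks = rng.normal(4, 1, size=(50, 5))      # simulated attack features

# Supervised: requires labels for both classes.
X = np.vstack([benign, attacks])
y = np.array([0] * len(benign) + [1] * len(attacks))
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Unsupervised: trained on unlabeled (mostly benign) traffic only.
iso = IsolationForest(contamination=0.01, random_state=0).fit(benign)

new_flow = rng.normal(4, 1, size=(1, 5))      # a suspicious-looking flow
print("supervised verdict:", clf.predict(new_flow))    # 1 = attack class
print("unsupervised verdict:", iso.predict(new_flow))  # -1 = anomaly
```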

Training data characteristics

Ask what data the models were trained on. Public datasets? Proprietary vendor data? Customer data? The training data determines what the model learned, what it can detect, and how well it generalizes. Models trained solely on public attack datasets from 2017 may not recognize techniques that emerged later. Models trained on diverse data from many organizations generalize better than models trained on narrow datasets.

Ask about training data volume and diversity. How many examples? How many attack types? How many different network environments represented? Larger, more diverse training datasets generally produce more robust models. Be skeptical of vendors claiming strong performance from small training sets; deep learning in particular requires substantial data. Ask whether the models are also trained on benign data, because recognizing normal behavior is just as important as identifying malicious behavior.

Ask how training data is labeled and validated. Who labels attack data as malicious versus benign? What quality control prevents mislabeling? Incorrect labels poison training and create blind spots. Vendors should describe their data validation processes and quality metrics.

Ask how often models are retrained. Attack techniques evolve, so models trained once and never updated become obsolete. Monthly or quarterly retraining is reasonable. Annual retraining is insufficient. A model that is never retrained will degrade over time as threats evolve. Active learning is the gold standard.

Ask whether customer data is used to improve models. Some vendors aggregate anonymized telemetry across customers to continuously improve detection. This provides better models. Other vendors train only on vendor-controlled data, providing more control but potentially slower adaptation to new threats. Understand the data sharing model before deploying.

Performance metrics and validation

Ask about false positive rates measured in realistic environments. Vendors should provide precision metrics, not just recall. Ask specifically: "What percentage of your alerts are true positives in production deployments?" Expect precision above 90% for operationally viable systems. Lower precision creates unsustainable alert volumes.
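To see why the precision threshold matters operationally, a back-of-the-envelope calculation (with an assumed daily alert volume) shows how precision translates into the number of false alerts analysts must triage each day.

```python
# Rough arithmetic: how precision translates into daily triage load.
# The alert volume is an assumed figure for illustration.
daily_alerts = 200

for precision in (0.95, 0.90, 0.70, 0.50):
    true_positives = daily_alerts * precision
    false_positives = daily_alerts * (1 - precision)
    print(f"precision {precision:.0%}: ~{false_positives:.0f} false alerts/day "
          f"alongside ~{true_positives:.0f} true ones")
```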

Ask about detection latency. How long from attack occurrence to alert generation? Real-time detection (seconds) enables rapid response. Batch detection (hours) is acceptable for threat hunting but problematic for active incident response. Latency depends on system architecture and deployment model.

Ask what attacks the system does not detect well. Every detection system has blind spots. Vendors claiming comprehensive coverage without acknowledged gaps are either lying or ignorant of their own limitations. Honest vendors explain detection limitations: "We excel at network-based attack detection but have limited visibility into memory-only attacks" or "Our models detect known attack families well but may miss entirely novel techniques."

Ask for independent validation results. Has the system been tested by third parties? MITRE ATT&CK evaluations? Academic research? Customer case studies? Independent validation provides more credibility than vendor-produced performance claims. Rigorous evaluations, such as one conducted with a major bank, provide concrete evidence of performance in realistic conditions.

Explainability and transparency

Ask how the system explains its detections. Does it provide human-readable reasoning for why it alerted? Model explainability is essential for analyst trust and investigation efficiency. Systems that only provide confidence scores without explanation force analysts to blindly trust or ignore alerts.

Ask what information accompanies each alert. Source/destination context? Timeline of related events? MITRE ATT&CK mapping? Supporting evidence from multiple data sources? Rich alert context enables efficient investigation. Minimal context ("anomaly detected, confidence 0.87") provides insufficient information for action.

Ask about model transparency. Will the vendor explain how their models work in technical detail? Some vendors consider model architecture proprietary. Others provide technical documentation. Full transparency is ideal but not always available. Minimum requirement is understanding what features the model considers and how it makes decisions conceptually.

Ask whether you can audit model behavior. Can you test the system with known benign and malicious inputs to verify it behaves as expected? Auditable systems enable ongoing validation. Black-box systems require blind trust, which is inappropriate for security-critical functions.

Operational requirements

Ask about deployment complexity. How long does typical deployment take? What infrastructure requirements exist? What expertise is needed? Vendors should provide realistic timelines (weeks to months for enterprise deployments) and honest skill assessments. Claims of "deploy in minutes" often hide significant operational complexity. Truly generalizing foundation models are an exception: by their nature they come close to working out of the box.

Ask about ongoing maintenance requirements. How much effort does tuning, updating, and monitoring the system require? Some AI detection systems are low-maintenance after initial deployment due to active learning. Others require continuous analyst attention to tune baselines and suppress false positives. Understand the operational burden before committing.

Ask about skill requirements for your team. What training is needed to operate the system effectively? Can existing SOC analysts handle it, or do you need data scientists? Most organizations need systems operable by security analysts, not just ML experts. Requiring specialized ML expertise for basic operation is usually a non-starter.

Ask about vendor lock-in. Can you export data and models if you switch vendors? Are APIs documented? Is integration with other tools straightforward? Lock-in risks future flexibility and negotiating position. Prefer systems with open architectures and documented interfaces.

Benchmark datasets and testing

Requesting vendor performance on standard datasets helps compare different systems objectively. Several public datasets exist for security detection benchmarking.

Relevant datasets

CICIDS2017 and CIC-IDS2018 are widely-used network intrusion datasets from the Canadian Institute for Cybersecurity. They contain labeled network traffic representing various attack types: DoS, DDoS, port scans, brute force, web attacks, infiltration, and botnet activity. Many academic papers and commercial tools report performance on these datasets, enabling comparison.

NSL-KDD is a cleaned version of the older KDD Cup '99 dataset. It is less realistic than CICIDS but still used for benchmarking. Many tools can achieve high accuracy on NSL-KDD because its attacks are older and more easily distinguishable. Strong NSL-KDD performance is necessary but not sufficient evidence of real-world effectiveness.

The UNSW-NB15 dataset contains network traffic and labeled attacks from the University of New South Wales. It includes modern attack types and is considered more realistic than older datasets. Performance on UNSW-NB15 provides better indication of real-world capability than NSL-KDD.

MITRE ATT&CK evaluations provide scenario-based testing against emulated adversary behavior. These tests evaluate detection coverage across multiple attack stages and techniques. Published results enable comparing vendors that participated in evaluations. However, not all vendors participate, and participation does not guarantee operational effectiveness in your environment.

Vendor-specific datasets might be proprietary but should still be documented. If vendors test on their own datasets, they should describe dataset characteristics: how many attacks, what types, what time period, what networks represented. Vendors refusing to discuss testing datasets are hiding something.

What to request from vendors

Ask for performance metrics on at least one public dataset. Vendors should provide precision, recall, and F1 scores on datasets you can independently verify. If they refuse, question why they will not demonstrate performance on standard benchmarks.

Ask for confusion matrices showing true positives, false positives, true negatives, and false negatives for each major attack category. Aggregate metrics hide category-specific weaknesses. A system might have 95% overall accuracy but only 20% accuracy on lateral movement; category-level detail is the only way to see that.
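One way to request or reproduce this level of detail is a per-category classification report rather than a single accuracy figure; the sketch below assumes you have ground-truth labels and the system's verdicts for a benchmark or POC dataset, with placeholder values shown for illustration.

```python
# Per-category metrics reveal weaknesses that aggregate accuracy hides.
# The labels here are placeholders; in practice they come from ground truth
# and the system's verdicts on a benchmark or POC dataset.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["benign", "benign", "lateral_movement", "dos", "lateral_movement", "dos", "benign"]
y_pred = ["benign", "benign", "benign",            "dos", "lateral_movement", "dos", "benign"]

print(confusion_matrix(y_true, y_pred, labels=["benign", "dos", "lateral_movement"]))
print(classification_report(y_true, y_pred, zero_division=0))
```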

Ask how their performance compares to published baselines. Academic papers establish baseline performance for datasets. Vendors claiming better-than-baseline performance should explain how they achieve it. Vendors performing worse than published baselines should explain why their approach is still valuable.

Ask to test the system on your own data. The ultimate validation is performance in your environment on your traffic. Request proof-of-concept that includes testing on representative samples of your network data, not just vendor-selected test sets.

Warning signs in test results

Perfect or near-perfect scores (99%+ across all metrics) suggest overfitting to test data or testing on unrealistic data. Real-world performance rarely achieves such scores. Be skeptical of claims that seem too good to be true. Such results are not impossible, but they are unlikely with most approaches used today.

Inconsistent metrics across datasets suggest the system is tuned specifically to individual datasets rather than learning general principles. A system that scores 98% on one dataset but 70% on another has not learned robust detection patterns.

Metrics reported only on aggregate data without category breakdowns might hide poor performance on specific attack types. Demand category-level metrics to identify strengths and weaknesses.

Refusal to test on your data or to provide any quantitative metrics is the biggest red flag. Vendors confident in their systems welcome testing. Those avoiding quantitative evaluation are hiding poor performance.

Red flags in AI security marketing

Marketing hyperbole is endemic in cybersecurity. Certain claims indicate vendors are selling snake oil rather than genuine capabilities.

"100% detection rate" is physically impossible for any real-world system. Attacks evolve faster than training data updates. Novel attack techniques emerge constantly. Achieving perfect detection requires knowing all possible attacks in advance, which is logically impossible. This claim indicates either deception or fundamental misunderstanding of detection limitations. Even in systems trained on normal there will be small numbers of things being missed. There is no way around this fact without flagging 100% of traffic.

"Zero false positives" is similarly impossible in production environments. Real networks are highly variable. Legitimate behavior occasionally resembles attacks. Achieving zero false positives requires either extremely conservative detection (missing many attacks) or testing only on pristine data that does not represent real environments. Operational systems generate some false positives. The question is whether the rate is manageable.

Vague "AI-powered" or "machine learning-driven" without technical specifics usually indicates the system uses simple statistical methods or rule-based heuristics with AI-adjacent terminology. Genuine ML systems can articulate specific architectures, training approaches, and technical details. Vagueness suggests there is no real ML to discuss.

"Learns your environment" without explaining how learning works might mean the system just establishes statistical baselines (simple anomaly detection) rather than actual machine learning. Learning implies model updates based on observed data. Some vendors market baseline establishment as "learning" when it is just calculating means and standard deviations.

"Explainable AI" as a bullet point without demonstration of actual explanations is marketing without substance. Request examples of actual explanations the system provides. If the vendor cannot show clear, actionable explanations, the "explainable AI" claim is marketing fiction.

"Next-generation" or "revolutionary" AI detection usually means nothing specific. These are marketing adjectives without technical content. Focus on concrete capabilities, not superlatives.

"No tuning required" contradicts operational reality. All detection systems require some environmental adaptation to be at their best. They need to learn organizational baselines, integrate with existing infrastructure, and adjust to local false positive patterns. Systems truly requiring no tuning either generate excessive false positives (no tuning = no accuracy) or are not actually learning anything (static rules). This might change as AI evolves but the reality on the ground is that these systems don't generalize well enough yet to accomplish a true no tune functionality.

Claims about "understanding attacker intent" or "reasoning like a human" attribute capabilities AI systems do not possess. Current AI can recognize patterns and make predictions. It does not have human-like understanding or reasoning. This anthropomorphic marketing obscures what systems actually do. That said attack intent is something more advanced systems can see as it is a test of actions in sequence that exposes it.

Model transparency and explainability requirements

Understanding how detection systems make decisions is not optional for security operations. Transparency enables validation, troubleshooting, and analyst trust.

Minimum explainability for operational use includes: what triggered the alert (which events, patterns, or anomalies), what context supports the detection (related activities, historical patterns), what confidence level the system has (and what that confidence means), and how the alert maps to attack frameworks like MITRE ATT&CK. Without these, analysts cannot effectively investigate alerts.
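As a reference point during evaluation, the following hypothetical alert payload collects those minimum explainability fields in one place; the field names and values are illustrative, not any vendor's schema.

```python
# Hypothetical alert structure containing the minimum explainability fields.
# Field names and values are illustrative only.
alert = {
    "triggering_events": ["failed SMB logons from host A", "service creation on host B"],
    "supporting_context": {
        "related_activity": "host A scanned 14 internal hosts in the prior hour",
        "historical_pattern": "host A has never authenticated to host B before",
    },
    "confidence": 0.91,            # plus documentation of what this score means
    "mitre_attack": {"tactic": "Lateral Movement", "technique": "T1021.002"},
    "top_features": [              # what the model weighted most heavily
        ("authentication_novelty", 0.42),
        ("time_of_day", 0.27),
        ("destination_criticality", 0.18),
    ],
}

for key, value in alert.items():
    print(f"{key}: {value}")
```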

Feature importance for each detection helps analysts understand what the model considered relevant. If a lateral movement alert triggered because of unusual authentication patterns, timing characteristics, and destination system risk, analysts can verify each factor. Feature importance also helps identify when models focus on spurious correlations rather than genuine attack indicators.

Model decision boundaries should be explainable conceptually even if not in precise mathematical terms. Analysts need to understand roughly what makes activity cross from "probably benign" to "probably malicious" in the model's assessment. This understanding enables reasoning about edge cases and explaining detections to management or auditors.

Counterfactual explanations help analysts understand what would have made the detection different. "If this connection occurred during business hours instead of 3 AM, it would not have triggered" provides actionable insight. Systems providing counterfactuals enable analysts to assess whether detection logic makes sense.
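Counterfactuals can also be probed directly during a POC if the system exposes a scoring interface; the sketch below uses a stand-in `score()` function (purely hypothetical, not a vendor API) and simply re-scores the same event with one attribute changed.

```python
# Probing a counterfactual: re-score the same event with one attribute changed.
# `score` is a stand-in for a vendor scoring interface; the logic is illustrative.
def score(event):
    base = 0.2
    if event["hour"] < 6:                 # off-hours activity raises suspicion
        base += 0.5
    if event["destination_new"]:          # never-before-seen destination
        base += 0.2
    return min(base, 1.0)

event = {"hour": 3, "destination_new": True}
counterfactual = {**event, "hour": 14}    # same event, but during business hours

print("original score:", score(event))                 # 0.9 -> would alert
print("business-hours score:", score(counterfactual))  # 0.4 -> would not alert
```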

The transparency required varies by use case. Automated blocking based on AI decisions requires high transparency because errors have immediate consequences. Threat hunting leads from AI models can tolerate less transparency because human analysts validate before action. Align transparency requirements with how you will use the system.

Some vendors resist transparency claiming proprietary models. This creates tension: they want you to trust their system, but they will not explain how it works. For security-critical applications, insufficient transparency should be disqualifying. You would not deploy security controls you cannot understand or validate.

Performance validation through proof-of-concept

The only reliable way to evaluate AI detection is testing it in your environment on your data. Well-designed POCs reveal whether vendor claims translate to operational reality.

POC design principles

Test duration should be weeks, not days. AI systems that learn organizational behavior need time to establish baselines and demonstrate adaptation. One-week POCs are too short to observe false positive patterns or detection effectiveness across attack types. Four to eight weeks provides meaningful data.

Test on representative traffic, not cherry-picked samples. Include production network segments, not just lab environments. Include normal operational variability: maintenance windows, new application deployments, business peak periods. AI systems that work in pristine lab conditions often fail in messy production reality.

Include red team testing during POC. Simulate attacks the vendor claims to detect. Test both known attack types (to validate baseline capability) and custom attacks (to test generalization). Systems should catch simulated attacks without prior knowledge of simulation schedule or techniques.

Measure false positive rates carefully. Count every alert that does not represent actual malicious activity. Categorize false positives by cause (legitimate administrative activity, new application behavior, data quality issue). This categorization reveals whether false positives are tunable or systemic.
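During the POC, a lightweight tally like the one below (with assumed analyst triage labels) keeps the false positive rate and its causes explicit rather than anecdotal.

```python
# Tallying POC alerts by triage outcome to compute false positive rate and causes.
# Triage labels are assumed to come from analyst review of each alert.
from collections import Counter

triage = [
    "true_positive", "fp_admin_activity", "fp_new_application", "true_positive",
    "fp_admin_activity", "fp_data_quality", "true_positive", "fp_new_application",
]

counts = Counter(triage)
false_positives = sum(v for k, v in counts.items() if k.startswith("fp_"))
fp_rate = false_positives / len(triage)

print(counts)
print(f"false positive rate: {fp_rate:.0%}")
```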

Collect analyst feedback throughout POC. Are explanations clear and actionable? Is investigation time reasonable? Do alerts provide value versus generating busywork? Analyst experience predicts whether the system will be operationally sustainable long-term.

Success criteria

Define quantitative success criteria before POC begins. Example criteria: detection of 90%+ of red team attacks, false positive rate below 5%, mean time to investigate alerts under 15 minutes, analyst satisfaction rating above 7/10. Measurable criteria prevent moving goalposts or subjective assessment.
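Writing the criteria down as executable checks, as in this sketch with assumed POC measurements, keeps the assessment from drifting once results arrive.

```python
# Pre-agreed POC success criteria expressed as executable checks.
# Measured values are placeholders to be filled in from the POC.
measured = {
    "red_team_detection_rate": 0.92,
    "false_positive_rate": 0.04,
    "mean_minutes_to_investigate": 12,
    "analyst_satisfaction": 7.5,
}

criteria = {
    "red_team_detection_rate": lambda v: v >= 0.90,
    "false_positive_rate": lambda v: v <= 0.05,
    "mean_minutes_to_investigate": lambda v: v <= 15,
    "analyst_satisfaction": lambda v: v >= 7.0,
}

for name, check in criteria.items():
    status = "PASS" if check(measured[name]) else "FAIL"
    print(f"{name}: {measured[name]} -> {status}")
```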

Require detection of novel attacks, not just known threats. The AI's value proposition is catching new attack variants. If it only catches known attacks that signature-based tools also catch, the AI provides minimal incremental value. Test specifically for generalization to attacks not in training data.

Assess operational fit, not just technical performance. Does deployment integrate with existing workflows? Do required skills match team capabilities? Does ongoing maintenance fit operational capacity? Technically excellent systems that are operationally impractical deliver no value.

Compare to existing detection capabilities. Does AI detection catch attacks current tools miss? Does it generate fewer false positives? Does it reduce analyst workload? The comparison justifies investment. If AI detection provides no advantage over existing tools, why deploy it?

Independent testing considerations

Third-party testing provides validation free from vendor or buyer bias. Consider engaging independent security testing labs or academic research groups to evaluate AI detection systems.

Independent testing should use attack scenarios relevant to your industry and threat model. Generic testing reveals general capability but may miss industry-specific requirements. Banking sector attacks differ from healthcare sector attacks. Ensure testing covers threats you actually face.

Independent testers should have access to ground truth, knowing when attacks occurred and when they did not. This enables accurate calculation of true positives, false positives, true negatives, and false negatives. Without ground truth, testing cannot measure accuracy reliably.

Published independent testing results from other organizations can supplement your POC. MITRE evaluations, academic papers, and detailed customer case studies provide data points. No single test is definitive, but multiple independent validations build confidence.

Evaluation criteria and scoring framework

A structured scoring framework helps compare vendors objectively. Weight criteria based on your organizational priorities.

| Evaluation Criterion | Weight | Scoring Guidance | Maximum Points |
| --- | --- | --- | --- |
| Detection Performance | 30% | | 30 |
| Novel attack detection (red team) | | 90%+ detection: 10pts, 70-90%: 7pts, 50-70%: 4pts, <50%: 0pts | 10 |
| Known attack detection | | 95%+ detection: 10pts, 85-95%: 7pts, 75-85%: 4pts, <75%: 0pts | 10 |
| False positive rate | | <1%: 10pts, 1-5%: 7pts, 5-10%: 4pts, >10%: 0pts | 10 |
| Explainability | 20% | | 20 |
| Alert explanation clarity | | Clear reasoning with evidence: 7pts, Some explanation: 4pts, Minimal: 0pts | 7 |
| Feature importance transparency | | Shows what factors triggered: 7pts, Partial visibility: 4pts, Opaque: 0pts | 7 |
| MITRE ATT&CK mapping | | Specific technique mapping: 6pts, Tactic-level: 3pts, None: 0pts | 6 |
| Operational Fit | 20% | | 20 |
| Deployment complexity | | Straightforward deployment: 7pts, Moderate effort: 4pts, Complex: 0pts | 7 |
| Skill requirements | | Security analyst operable: 7pts, Specialized training: 4pts, Expert only: 0pts | 7 |
| Ongoing maintenance | | Minimal tuning needed: 6pts, Moderate: 3pts, Intensive: 0pts | 6 |
| Technical Foundation | 15% | | 15 |
| Architecture appropriateness | | Deep learning for security: 5pts, Adequate ML: 3pts, Rules only: 0pts | 5 |
| Training data quality | | Large diverse dataset: 5pts, Moderate: 3pts, Limited: 0pts | 5 |
| Model updating frequency | | Monthly/quarterly: 5pts, Semi-annual: 3pts, Annual/never: 0pts | 5 |
| Vendor Characteristics | 15% | | 15 |
| Transparency and honesty | | Clear about limitations: 5pts, Some honesty: 3pts, Overpromises: 0pts | 5 |
| Independent validation | | Multiple independent tests: 5pts, Some validation: 3pts, Vendor-only: 0pts | 5 |
| Integration capability | | Open APIs, documented: 5pts, Limited integration: 3pts, Locked: 0pts | 5 |
| Total | 100% | | 100 |

Scoring guidance:

  • 85-100 points: Strong candidate, likely to deliver value
  • 70-84 points: Viable option with some reservations
  • 55-69 points: Significant concerns, proceed cautiously
  • Below 55 points: Not recommended

Adjust weights based on organizational priorities. Detection performance and explainability are universally important. Organizations with limited staff might weight operational fit higher. Organizations facing novel threats might weight technical foundation higher.

Multiple evaluators should score independently, then compare. This reduces individual bias. Discuss score discrepancies to reach consensus or identify areas requiring additional vendor clarification.

Document scoring rationale. Future reviews will need to understand why decisions were made. Documentation also helps if vendors contest scores or if you need to justify selection to management.
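To keep scoring consistent across evaluators, the framework above can be reduced to a small script; the sub-scores below are hypothetical inputs from a single evaluator, and the category maximums already encode the weights.

```python
# Computing a weighted total from the evaluation framework above.
# Sub-scores are hypothetical inputs from one evaluator.
scores = {
    "detection_performance": 24,   # out of 30
    "explainability": 16,          # out of 20
    "operational_fit": 14,         # out of 20
    "technical_foundation": 12,    # out of 15
    "vendor_characteristics": 11,  # out of 15
}

total = sum(scores.values())       # category maximums already reflect the weights

if total >= 85:
    verdict = "Strong candidate"
elif total >= 70:
    verdict = "Viable with reservations"
elif total >= 55:
    verdict = "Significant concerns"
else:
    verdict = "Not recommended"

print(f"total: {total}/100 -> {verdict}")
```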

Making the decision

After evaluation, selection requires balancing technical performance, operational fit, and business considerations.

Technical performance is necessary but not sufficient. A system that detects attacks well but also flags 50% of traffic as false positives daily is operationally unusable. A system requiring data science PhDs to operate is impractical for most SOCs. Balance detection capability with operational reality.

Consider total cost of ownership beyond licensing. Include deployment services, training, ongoing tuning, infrastructure requirements, and maintenance. Some vendors have low license costs but high operational overhead. Others have higher license costs but lower operational burden. Compare TCO, not just purchase price.
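A simple multi-year comparison (all figures hypothetical) illustrates why TCO can reorder vendors relative to license price alone.

```python
# Illustrative three-year TCO comparison; all figures are hypothetical.
def three_year_tco(license_per_year, deployment, training, annual_ops):
    return deployment + training + 3 * (license_per_year + annual_ops)

vendor_a = three_year_tco(license_per_year=80_000, deployment=40_000,
                          training=15_000, annual_ops=120_000)   # cheap license, heavy tuning
vendor_b = three_year_tco(license_per_year=150_000, deployment=20_000,
                          training=5_000, annual_ops=30_000)     # pricier license, low upkeep

print(f"Vendor A 3-year TCO: ${vendor_a:,}")
print(f"Vendor B 3-year TCO: ${vendor_b:,}")
```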

Assess vendor viability and support. AI detection is complex. You need responsive vendor support for troubleshooting, updates, and guidance. Evaluate vendor financial stability, customer base, support responsiveness, and roadmap. Avoid vendors who might not exist in three years.

Reference checks reveal operational reality. Speak with current customers in similar industries and of similar size. Ask specifically about false positive rates, operational burden, vendor support quality, and whether they would deploy the system again knowing what they now know. Honest reference feedback is invaluable.

Start small if possible. Deploy to one network segment or one use case before enterprise-wide rollout. This limits risk while demonstrating value. Some vendors resist pilot deployments, preferring enterprise sales. Pilots that prove value justify larger deployments; vendors confident in their products should support pilots.

Plan for change management. Introducing AI detection changes SOC workflows. Allocate time for training, process development, and adaptation. Budget for these soft costs; they often exceed technology costs for successful deployments.

The AI detection landscape

The market for AI-based security detection is maturing but still has significant variability in capability and marketing honesty. Some vendors provide genuine deep learning capabilities trained on massive datasets. Others rebrand statistical anomaly detection or rule-based systems as "AI." Practitioners must distinguish substance from marketing.

The industry has recognized that deep learning approaches provide advantages for pattern recognition in security telemetry. But not all deep learning implementations are equally effective. Training data quality, model architecture choices, and operational implementation determine whether theoretical advantages translate to practical detection capability.

The evaluation framework in this guide helps practitioners cut through marketing and assess actual capabilities. The questions reveal what systems really do versus what marketing claims. The testing validates whether vendor demonstrations translate to production performance. The scoring framework enables objective comparison across vendors.

The goal is not finding perfect AI detection (that does not exist) but finding systems that provide meaningful improvement over current capabilities at acceptable operational cost. Most organizations need AI detection that catches sophisticated attacks signature-based tools miss while generating manageable false positive volumes and fitting into existing SOC workflows. These criteria are achievable with careful vendor evaluation.

The investment in thorough evaluation prevents costly mistakes. Deploying AI detection that underperforms, generates excessive false positives, or proves operationally unmanageable wastes budget and damages team morale. Rigorous evaluation upfront identifies vendors likely to deliver value and avoids those that will not. The evaluation time is an investment in successful deployment.

Security leaders should demand evidence, not promises. Test claims through POCs. Validate performance with independent data. Check references thoroughly. Be skeptical of marketing superlatives. Focus on concrete capabilities and operational fit. This disciplined approach separates AI detection systems that deliver value from those that merely claim to.