We tested against the CIC Datasets! How did we do?


The CSE-CIC-IDS2018 dataset is one of the most cited benchmarks in network intrusion detection research. Published as a collaboration between Canada's Communications Security Establishment and the Canadian Institute for Cybersecurity, it covers seven attack scenarios against a simulated organization of 420 client machines and 30 servers. Researchers have used it extensively to validate detection approaches, and performance numbers in the published literature routinely run high. The problem is that those numbers largely reflect training performance. Models tuned to this dataset's distribution tend to collapse when evaluated on traffic from a different environment. Cross-dataset evaluations have shown detection accuracy dropping from near-perfect to near random chance once the training set changes.

DeepTempo ran a zero-shot evaluation against NF-CSE-CIC-IDS2018-v2, the NetFlow-native version of this dataset published by Sarhan, Layeghy, and Portmann at the University of Queensland. The model had never seen this dataset before inference. No labels were used for training. No tuning was applied to this environment. The results below reflect what LogLM and its classifier layer produce when presented with unseen traffic at scale.

The dataset is publicly available from the University of Queensland Research Data Manager.

Why public benchmarks matter for evaluating detection claims

Most vendor detection evaluations share a structural problem: the data they test against is not something you can inspect. A vendor claims 97% detection accuracy on customer traffic. The traffic is anonymized, the labels are internal, the methodology is summarized in a paragraph. There is no way to reproduce the result, dispute the methodology, or compare it against another approach run on the same data. You are being asked to trust the number.

Public benchmark datasets resolve this. NF-CSE-CIC-IDS2018-v2 has fixed, published ground truth labels. The flows, the attack scenarios, and the class composition are all documented and accessible. Any team can download the dataset, run their own model against it, and produce numbers that are directly comparable to ours. If the methodology is sound, the results hold up. If something is off, it shows.

This also means the comparison set is large. Dozens of published papers report performance on this dataset using supervised classifiers, ensemble methods, and deep learning approaches trained directly on the data. That body of work gives concrete context for what an unseen-data zero-shot result means relative to approaches that had full access to the training distribution. The benchmark doesn't just validate a claim in isolation; it places the result in a lineage of reproducible work that the detection community can interrogate.

For practitioners evaluating threat detection capabilities, this distinction is worth holding onto. A number produced on an opaque internal dataset is a marketing claim. A number produced on a public dataset with documented methodology is something you can verify yourself.

The dataset

NF-CSE-CIC-IDS2018-v2 is a NetFlow-reformatted version of the original CIC dataset, re-generated from the source PCAP files with 43 extended NetFlow features. Where the original CIC release used CICFlowMeter to extract 80+ derived statistical features, this version produces standard NetFlow records directly from the captures. That distinction matters for DeepTempo: the model ingests native NetFlow telemetry, so this format removes any artifact introduced by CICFlowMeter's feature extraction and reflects the actual signal the model would see in a production deployment. The dataset and its feature set are documented in full by its authors.

The underlying traffic was constructed on Amazon AWS by CIC and CSE, emulating a five-department enterprise with diverse operating systems and services. The seven attack scenarios include Brute Force, Heartbleed, Botnet, DoS, DDoS, Web Attacks, and Infiltration. Of the 8.4 million total flows evaluated, 7.37 million (87.9%) are benign and 1.02 million (12.1%) are labeled as attack traffic. The class imbalance is intentional and reflects realistic enterprise ratios where attack traffic is a minority of the total.

This composition matters for evaluating detection quality. A model that classifies everything as benign achieves 87.9% accuracy. The real signal lies in how cleanly it identifies the 1.02 million attack flows while leaving the 7.37 million benign flows undisturbed.
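
To make that baseline concrete, here is a minimal sketch of what a degenerate "classify everything as benign" model scores on this class balance. It uses the rounded flow counts quoted above, so the computed accuracy lands near, rather than exactly at, the 87.9% figure derived from the exact counts.

```python
# Illustrative only: why raw accuracy is misleading at this class balance.
# Counts are the rounded totals for NF-CSE-CIC-IDS2018-v2 cited above.
benign = 7_370_000
attack = 1_020_000
total = benign + attack

# A degenerate "classifier" that labels every flow benign:
accuracy = benign / total   # every benign flow "correct", every attack missed
recall = 0 / attack         # zero attack flows detected

print(f"accuracy: {accuracy:.1%}, attack recall: {recall:.0%}")
```

High accuracy with zero recall is exactly the failure mode the rest of the metrics below are designed to expose.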

What was evaluated

DeepTempo ingested the raw flow data and ran inference using the pre-trained LogLM foundation model and binary classifier. The evaluation covered all seven attack scenarios simultaneously, with no per-scenario tuning and no access to the dataset's label structure during inference. This is the same deployment configuration used against live production traffic: the model receives flows, constructs behavioral timelines between endpoint pairs, and the classifier assigns intent to each timeline.

The MITRE ATT&CK tactics represented across the seven scenarios span Credential Access (Brute Force, Heartbleed), Impact (DoS, DDoS), Command and Control (Botnet), Initial Access (Web Attacks), and Lateral Movement (Infiltration).

Results

Metric                            Result
Total flows evaluated             8.4M
Overall accuracy                  99.26%
Attack detection rate (recall)    94.5%
Precision (attack)                99.4%
F1-score (attack)                 96.9%
False positive rate               0.085%
False positives (absolute)        6,265 out of 7.37M benign flows
True positives                    963K out of 1.02M attack flows

Confusion matrix

                    Predicted benign    Predicted attack
Actual benign       7,370,000 (99.9%)   6,265 (0.1%)
Actual attack       55,787 (5.5%)       963,000 (94.5%)
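
The headline metrics follow arithmetically from these four counts. A quick sketch using the rounded figures above (small rounding drift against the table is expected, since the published counts are approximate):

```python
# Recomputing the headline metrics from the (rounded) confusion-matrix
# counts published above.
tn, fp = 7_370_000, 6_265   # actual benign: predicted benign, predicted attack
fn, tp = 55_787, 963_000    # actual attack: predicted benign, predicted attack

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
fpr       = fp / (fp + tn)

print(f"accuracy={accuracy:.2%} precision={precision:.1%} "
      f"recall={recall:.1%} f1={f1:.1%} fpr={fpr:.3%}")
```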

The false positive figure is worth dwelling on. Of 7.37 million benign flows, 6,265 were flagged as malicious. That is 0.085% of benign traffic, or roughly one false alert per 1,177 legitimate flows. In a production environment receiving tens of millions of flows per day, this translates to a manageable alert volume rather than the noise that causes analyst fatigue and missed detections in legacy systems.

The 5.5% false negative rate, representing 55,787 missed attack flows, reflects the challenge of the dataset itself. CSE-CIC-IDS2018 contains documented labeling inconsistencies, with audits estimating up to 7.5% mislabeled flows. Some portion of what is recorded as a missed detection is likely mislabeled traffic in the ground truth. The model has no access to these labels during inference and makes its determinations from behavioral timeline structure alone.

Why zero-shot matters

Most published results on CSE-CIC-IDS2018 are within-dataset: train on one partition, test on another partition from the same source. Research has documented that models achieving near-perfect accuracy in this configuration often perform near random chance when tested against traffic from a different environment. This is the cross-dataset generalization problem. It is also the central operational problem with deployed IDS: the environment where the model is trained is rarely the environment where the attack happens.

Zero-shot evaluation removes that distinction. The model receives no information about this dataset's structure, label distribution, or attack composition before inference. The 96.9% F1 and 0.085% false positive rate reflect what the model produces on first contact with a previously unseen environment. That is the same condition that applies to every new customer deployment and to every attacker who operates in an environment the model has never been trained against.

This capability comes from the foundation model architecture. LogLM learns representations of behavioral timelines at scale across diverse environments. The classifier layer then interprets those representations to assign intent. Because the foundation model learns structural patterns that are intrinsic to how attacks are conducted, rather than statistical artifacts of a particular dataset's feature distribution, those patterns hold across environments the model was never exposed to during training. As we cover in "From Packets to Patterns," this is the practical difference between a model that has memorized a dataset and a model that understands what attack activity structurally looks like.

What the behavioral timeline structure shows

Each attack scenario in CSE-CIC-IDS2018 has a distinctive behavioral timeline structure. A Brute Force timeline shows a particular pattern of how flows between endpoints are organized, including the density, direction, and distribution of activity, that differs from how legitimate authentication traffic is structured. A Botnet C2 timeline has a different structural signature. An Infiltration timeline has another. These structures exist independent of whether any individual flow is unusual. An attacker conducting Brute Force using normal protocols and staying within rate limits still produces a timeline that the classifier recognizes as Credential Access activity.

This is the distinction from threshold-based approaches. Anomaly detection fails when attackers operate within the baseline, because it measures deviation from normal rather than recognizing the structure of attack activity. DeepTempo's classifiers are not asking whether something looks different from expected traffic. They are asking what the behavioral timeline structure is attempting to accomplish. Attackers can make individual flows appear normal. They cannot make the structural signature of a Brute Force campaign look like routine service-to-service communication while the campaign is active.

What this evaluation covers and does not cover

This evaluation addresses binary detection accuracy: benign or malicious. MITRE tactic classification was not evaluated here. The seven attack scenarios are diverse in character, which means the results reflect the model's ability to generalize across Credential Access, Impact, Command and Control, Initial Access, and Lateral Movement in a single zero-shot pass. Detection performance across individual attack categories was not broken out in this evaluation.

The dataset also represents a simulated enterprise environment, not a live production network. Production environments contain traffic patterns that no benchmark fully captures. The 2026 threat landscape increasingly includes AI-assisted attacks designed to evade models trained on historical datasets, which is a separate evaluation question from what this benchmark measures.

What the results do establish is that a foundation model trained without access to this dataset's labels or structure can match the detection performance of supervised models trained directly on it, while maintaining a false positive rate that is operationally viable at scale.

Get in touch to run a 30-day, risk-free assessment in your environment. DeepTempo will analyze your existing data to identify active threats and catch what your existing NDRs and SIEMs might be missing.
