DeepTempo uses a deep learning foundation model to detect cyber threats that traditional detection approaches (think rules) miss. Attackers now mount AI-powered attacks that abuse your existing infrastructure in ways never seen before and hide behind normal activity. We approach this problem by observing operational activity, reasoning about malicious intent, and spotting these new threats early in the attack chain.
Early in our product journey, we decided to use flow logs as the initial data source for LogLM, our foundation model. Our choice of NetFlow and other flow logs was not obvious to everyone, because flow data has a long history of being seen as noisy, difficult, and sometimes disappointing for security detection. This blog explains why we made a different calculation and why flow logs are a uniquely strong foundation for a modern LogLM-driven approach.
The historical view of flow data
For many years, flow data was viewed as hard to work with and prone to high false-positive rates. Several factors contributed to this perception. One of its earliest variants, NetFlow, was originally created for network operations rather than threat detection. Early exports were sampled, incomplete, or missing context. The data was often spread across devices and difficult to consolidate. Rules-based detection approaches became brittle as networks evolved. Traditional machine learning methods such as XGBoost or random forest models also performed inconsistently unless they were frequently retrained on labeled data, something many enterprises lacked the time and resources to maintain.
A number of industry experts have summarized the concerns. They noted limitations such as missing payloads, blind spots due to sampling, difficulties in correlating flows across devices, and operational challenges with storage and normalization. Many security teams came to believe that flow data was better suited for bandwidth monitoring or forensics than primary detection.
All that said, we arrived at the view that flow data offers unique advantages that other log sources do not.
Nowhere to run, nowhere to hide - the flow opportunity
Nearly every significant cyber attack involves communication. Lateral movement, command and control, data staging, exfiltration, and persistence mechanisms all leave traces in flow telemetry. Because of this, flow records capture a large percentage of attacker behaviors. Our estimate is that flow data can reveal nearly every attack on a cloud, hybrid, or on-premises environment; only attacks that are SaaS-only, or that somehow complete their work on a single host, avoid leaving footprints on the network. No other point of visibility within the modern enterprise environment (cloud, data center, and so on) offers the detection coverage that flow data can.
In addition to seeing at least the critical parts of almost every attack chain, flow telemetry is sometimes the only information you have. For example, bring-your-own-device environments, such as much of Stanford and other college campuses, are full of unmanaged endpoints that will never run corporate EDR agents. Telecommunications environments frequently have very limited host visibility and rely almost entirely on flow-level monitoring across vast infrastructures. In these settings, flow data is not just one optional perspective; it is often the primary one.
Flow data has another property that makes it compelling: it is extremely difficult for an attacker to avoid. An adversary can disable an endpoint agent or run fileless techniques, but it is far more challenging to conduct meaningful operations without generating observable traffic. Even encrypted traffic produces patterns of timing, size, and communication partners that can be modeled.
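To make that last point concrete, here is a minimal sketch, for illustration only and not DeepTempo's implementation, of the kind of metadata that remains observable when payloads are encrypted: per-pair volumes, inter-flow timing, and peer counts. The Flow fields and aggregations are our own assumptions, not a specific export format.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class Flow:
    """One flow record; field names are illustrative, not a specific export format."""
    src: str
    dst: str
    start: float      # epoch seconds
    bytes_sent: int
    packets: int

def encrypted_traffic_features(flows: list[Flow]) -> dict:
    """Summarize timing, size, and peer patterns that survive encryption."""
    by_pair = defaultdict(list)
    peers = defaultdict(set)
    for f in flows:
        by_pair[(f.src, f.dst)].append(f)
        peers[f.src].add(f.dst)

    features = {}
    for pair, fs in by_pair.items():
        fs.sort(key=lambda f: f.start)
        gaps = [b.start - a.start for a, b in zip(fs, fs[1:])]
        features[pair] = {
            "flows": len(fs),
            "avg_bytes": mean(f.bytes_sent for f in fs),
            "avg_packets": mean(f.packets for f in fs),
            # Highly regular gaps between flows can suggest beaconing-style command and control.
            "mean_gap_s": mean(gaps) if gaps else None,
            "distinct_peers_of_src": len(peers[pair[0]]),
        }
    return features
```

None of these features require payload visibility, which is the point: the conversation itself is the signal.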
A foundation model needs the right training substrate
Foundation models like our own LogLM are pretrained on very large volumes of consistent data. They benefit when the training corpus is abundant, regular, and reflects meaningful structure. Flow data satisfies these requirements better than most other telemetry sources. There is a significant amount of flow data in any enterprise. It accumulates continuously and reflects patterns of life that are ideal for self-supervised learning. It is regular and structured because flows capture the same types of events again and again in standardized formats. While there are variations between NetFlow, IPFIX, sFlow, and cloud equivalents, the differences are modest compared to the diversity found in application logs, endpoint logs, or system logs, which vary dramatically by vendor, application, and environment.
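To illustrate how similar the variants are in practice, here is a hedged sketch of a normalized flow schema plus a parser for the AWS VPC Flow Logs default (version 2) record format. The schema and field names are our own illustrative choices, not a DeepTempo or vendor specification.

```python
from dataclasses import dataclass

@dataclass
class NormalizedFlow:
    """Common denominator of NetFlow, IPFIX, sFlow, and cloud flow logs.
    Field names are illustrative; real exporters differ in naming and extras."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: int
    bytes: int
    packets: int
    start: int   # epoch seconds
    end: int     # epoch seconds

def from_vpc_flow_log(line: str) -> NormalizedFlow:
    """Parse one AWS VPC Flow Log record in the default format.
    Assumes a complete record; NODATA/SKIPDATA records with '-' fields
    are not handled in this sketch."""
    # Default format: version account-id interface-id srcaddr dstaddr srcport
    # dstport protocol packets bytes start end action log-status
    f = line.split()
    return NormalizedFlow(
        src_addr=f[3], dst_addr=f[4],
        src_port=int(f[5]), dst_port=int(f[6]),
        protocol=int(f[7]),
        packets=int(f[8]), bytes=int(f[9]),
        start=int(f[10]), end=int(f[11]),
    )
```

A comparable mapping for NetFlow or IPFIX would touch the same handful of fields, which is exactly the regularity that makes flow data a good pretraining substrate.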
Self-supervised training on flow data allows the foundation model to learn normal behavioral patterns for every entity and for every conversation between entities. This creates a flexible and generalizable representation space. It also means that, once trained, the model requires only modest fine-tuning, if any, to adapt to a new environment. DeepTempo's experience supports this claim. In multiple pilot deployments, our LogLM produced high-quality, actionable anomaly detection from network flow data alone. The accuracy is real: so far it has surpassed both traditional rules-based approaches and prior ML models by a substantial margin while also reducing their operational overhead.
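As a rough illustration of the self-supervised idea, learning what is normal from unlabeled flows and then scoring deviations, here is a toy baseline. It is emphatically not LogLM; it assumes the NormalizedFlow records from the sketch above and an arbitrary bucketing scheme of our own invention.

```python
from collections import Counter
import math

def tokenize(flow) -> str:
    """Discretize a flow into a coarse token; the buckets here are illustrative only."""
    size = "small" if flow.bytes < 1_000 else "large" if flow.bytes > 100_000 else "medium"
    return f"port:{flow.dst_port}|{size}"

class SelfSupervisedBaseline:
    """Learn token frequencies per source from unlabeled flows, then score how
    surprising new flows are. A toy stand-in for a foundation model, meant only
    to show that no labels are required to learn a 'pattern of life'."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def fit(self, flows):
        # "Training" is simply observing unlabeled traffic.
        for f in flows:
            self.counts[(f.src_addr, tokenize(f))] += 1
            self.total += 1

    def anomaly_score(self, flow) -> float:
        # Higher score means the flow is less likely under the learned baseline.
        seen = self.counts[(flow.src_addr, tokenize(flow))]
        return -math.log((seen + 1) / (self.total + 1))
```

A foundation model replaces the crude token counts with a learned representation space, but the training signal is the same: the data itself.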
You can read more about one telco’s results with DeepTempo LogLM here. And you can read about the value of foundation models in cyber security here.
Managing challenges with flow data
While flow data gives us a keen vantage point for threat detection, we still had to account for the complexity our customers might face in collecting it. Thankfully, data collection is no longer the barrier it once was. Platforms such as Cribl, cloud-native logging systems like CloudWatch, and other managed flow-collection pipelines have made it far easier to gather rich, unsampled flow telemetry at scale. Modern exporters and cloud environments provide consistent formats and rapid export. Flow enrichment has also improved. NetFlow is no longer an isolated or impoverished signal.
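For example, here is a minimal sketch of pulling VPC Flow Log records from CloudWatch Logs with boto3. The log group name is an assumption; substitute whatever group your flow logs are delivered to.

```python
import boto3  # requires AWS credentials configured in the environment

# Assumed log group name for illustration; replace with your own.
LOG_GROUP = "/vpc/flow-logs"

def fetch_flow_records(start_ms: int, end_ms: int):
    """Yield raw VPC Flow Log lines from CloudWatch Logs for a time window."""
    logs = boto3.client("logs")
    kwargs = {"logGroupName": LOG_GROUP, "startTime": start_ms, "endTime": end_ms}
    while True:
        resp = logs.filter_log_events(**kwargs)
        for event in resp.get("events", []):
            yield event["message"]  # one flow record per event
        token = resp.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token
```

Records fetched this way can be fed straight into a normalization step like the earlier sketch, which is what makes the modern collection story so much easier than it used to be.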
Our experience is that when the data is complete, consistent, and richly contextualized, the network perspective becomes one of the most strategic vantage points for threat detection.
To summarize: the historical challenges with flow data (sampling and missing context, missing payloads, correlation across devices, storage and normalization overhead, brittle rules, and ML models that demand constant labeled retraining) are increasingly manageable with modern collection pipelines, while the characteristics that make it well suited for DeepTempo's foundation-model approach remain: it is abundant, consistently structured, covers nearly every stage of an attack, and is nearly impossible for an adversary to avoid.
Final Thoughts
DeepTempo chose flow telemetry in spite of its challenges because it aligns with our view that attackers armed with AI and automation will unleash new types of attacks on our socio-economic infrastructure. Flow data gives our LogLM a wide-angle view of attacker behavior; these flows are nearly impossible for an adversary to completely evade; and the structure and abundance of these logs are ideal for training our LogLM foundation model. Flow data, when interpreted through a LogLM foundation model, becomes a rich, reliable, and scalable source of detection capability, advancing the state of the art in detecting especially advanced, signature-avoiding attacks. Talk to us to give it a try.