How a major bank validated DeepTempo's accuracy

In March of 2025 we reached a milestone in our design partnership with a major financial institution in New York. This blog focuses on the evaluation process itself: the criteria we tested, the methods we used, and what we learned. The goal is to share techniques that could accelerate the adoption of deep learning in threat detection.

We have detailed write-ups of our methodology, architecture, and results available under NDA. Users can also try a version of our Tempo model running as a Native App on Snowflake. More about our approach is available in our blog posts and resources on our website.

Evaluation criteria

This bank brought the technical depth needed to properly assess a deep learning approach to cybersecurity. They acquired an NVIDIA SuperPOD years before their peers, which signaled both vision and commitment to innovation. That forward-looking stance made them an ideal partner as we work to reshape cybersecurity through deep learning.

Together, we defined evaluation criteria in two phases:

Phase I:

  • Accuracy (including F1 score)
  • Adaptability

Phase II:

  • Explainability
  • Efficiency
  • Return on investment

Measuring accuracy

Accuracy was the initial focus. This is standard for any approach that attempts to find incidents. Like most deep learning teams, we track F1 scores, which balance precision and recall, as aggregate measures of accuracy.
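To make the metric concrete, here is a minimal sketch of how an F1 score falls out of confirmed detections, false alarms, and misses. The counts are invented for illustration, not results from the evaluation.

  # Minimal sketch: F1 from analyst verdicts (hypothetical counts).
  def f1_score(tp: int, fp: int, fn: int) -> float:
      """F1 is the harmonic mean of precision and recall."""
      precision = tp / (tp + fp) if (tp + fp) else 0.0
      recall = tp / (tp + fn) if (tp + fn) else 0.0
      if precision + recall == 0:
          return 0.0
      return 2 * precision * recall / (precision + recall)

  # Example: 42 flagged sequences confirmed malicious, 6 false alarms,
  # and 9 known-bad sequences the model missed.
  print(f1_score(tp=42, fp=6, fn=9))  # ~0.85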

Measuring accuracy in cybersecurity presents specific challenges. Labels are scarce. Most enterprises have relatively few known attacks available. This scarcity is precisely why transfer learning from a foundation model matters. Our LogLM was pretrained on massive datasets, allowing it to generalize to new environments. Crucially, this pretraining does not require any labels.
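To illustrate why no labels are needed, consider a self-supervised objective such as next-event prediction: the raw log stream manufactures its own training targets. The event tokens and window size below are hypothetical, not our actual tokenization.

  # Minimal sketch: self-supervised training pairs from an unlabeled log stream.
  events = ["dns_query", "tcp_syn", "tls_handshake", "http_get",
            "tcp_fin", "dns_query", "smb_open"]

  WINDOW = 3  # context length

  # Each pair asks the model to predict the next event from its context.
  # No human labels are involved; the data supervises itself.
  pairs = [(events[i:i + WINDOW], events[i + WINDOW])
           for i in range(len(events) - WINDOW)]

  for context, target in pairs:
      print(context, "->", target)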

Another challenge: overfitting. Rules can only detect the exact patterns they were written for. Machine learning models trained on specific indicators face similar limits. They may catch one attack pattern while missing derivatives of that same pattern. A foundation model is intended to limit the risk of overfitting and to be more resilient to the inevitable changes in behavior over time.

The lack of labels also complicates precision measurement. Our partner bank manually reviewed LogLM's output. Our model identified dozens of concerning sequences and flagged IP addresses contained within those sequences. Their team investigated each sequence, using the IP addresses and related IPs as anchors.

The lack of labels is also why self-supervised pretraining matters. Very large-scale pretraining of foundation models was impractical until transformer-based approaches enabled models of unprecedented scale, accuracy, and adaptability.

We also built synthetic MITRE ATT&CK sequences to help calibrate the model. We discussed this approach in a prior blog.
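For readers who want a feel for what such calibration data can look like, here is a hypothetical sketch of composing a synthetic sequence for a known technique. The event vocabulary and technique-to-event mapping are illustrative, not our calibration corpus.

  # Minimal sketch: a synthetic log sequence for an ATT&CK technique,
  # embedded in benign background traffic. Mappings are hypothetical.
  import random

  TECHNIQUE_EVENTS = {
      "T1046": ["tcp_syn", "tcp_syn", "tcp_rst"],         # Network Service Discovery
      "T1048": ["dns_query", "large_outbound_transfer"],  # Exfiltration Over Alternative Protocol
  }
  BENIGN = ["http_get", "dns_query", "tls_handshake"]

  def synthetic_sequence(technique_id: str, noise: int = 4) -> list[str]:
      """Embed a technique's events inside benign background traffic."""
      seq = [random.choice(BENIGN) for _ in range(noise)]
      insert_at = random.randrange(len(seq) + 1)
      return seq[:insert_at] + TECHNIQUE_EVENTS[technique_id] + seq[insert_at:]

  print(synthetic_sequence("T1046"))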

We should also credit the Canadian Institute for Cybersecurity. Their datasets, published years ago, helped us pretrain our Tempo LogLM over the last two years.

The bank also runs red team evaluations, and we participated in these as well.

The final test: SOC trust. Do analysts trust the solution in production? Do they act on the alerts, or do they deprioritize them? This is the most critical test. How will the insights from our Tempo LogLM be used in production?

Testing adaptability

Adaptability is underemphasized in cybersecurity. Most machine learning solutions require months of tuning to show effectiveness in a new environment. Rules are hard-coded to work best within a particular environment. This lack of adaptability creates brittle systems that are expensive to maintain and slow to deliver benefits.

We build foundation models called LogLMs. These LogLMs generalize well and demonstrate capabilities quickly, often without any customization at all.

To quantify adaptation, our bank partner suggested a four-part approach. In all cases, the model identified concerning sequences, and their detection engineers examined the validity of these results.

The four evaluation stages:

  1. No adaptation, most limited dataset
  2. No adaptation, additional data provided
  3. Rapid classifier-based adaptation, limited dataset
  4. Rapid classifier-based adaptation with additional data provided

This approach first established a baseline for the model, then demonstrated the ability to adapt within the bank's environment with additional data and a classifier.
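One way to picture the matrix: a frozen foundation model supplies embeddings, and "rapid classifier-based adaptation" can be as light as fitting a small head on top of them. The harness below is a hypothetical sketch using scikit-learn and random stand-in data, not the bank's actual protocol; its scores show only the structure of the four stages.

  # Minimal sketch: the four-stage adaptation matrix as an evaluation harness.
  # embed() stands in for a frozen LogLM; data and classifier are hypothetical.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)

  def embed(n_sequences: int) -> np.ndarray:
      """Placeholder for frozen LogLM embeddings (e.g., 512-dim per sequence)."""
      return rng.normal(size=(n_sequences, 512))

  def evaluate(adapt: bool, extra_data: bool) -> float:
      n = 500 if extra_data else 100
      X, y = embed(n), rng.integers(0, 2, size=n)
      if not adapt:
          return 0.5  # stages 1-2: score the pretrained model's flags as-is (stubbed)
      # Stages 3-4: rapid adaptation = a lightweight head on frozen embeddings.
      clf = LogisticRegression(max_iter=1000).fit(X, y)
      return clf.score(X, y)

  stages = [(False, False), (False, True), (True, False), (True, True)]
  for i, (adapt, extra) in enumerate(stages, start=1):
      print(f"Stage {i}: adapt={adapt}, extra_data={extra}, score={evaluate(adapt, extra):.2f}")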

Many machine learning models take months of tuning to achieve acceptable precision, a precision easily lost as the environment changes. Our Tempo LogLM compared well to these models before any adaptation, then showed further improvement in tests 2, 3, and 4.

This approach should be straightforward for any deep learning solution to follow to demonstrate whether it adapts well. If it does, it is likely a foundation model. Not all deep learning is the same. Graph neural networks, for instance, have not been shown to generalize well, as they are not foundation models.

Demonstrating explainability

The real test for any set of indicators is whether the SOC comes to rely on them. We heard early from advisors like Chris Bates (former CISO and chief trust officer at SentinelOne) that a more accurate black box would be of limited interest to SOC operators. Our founding engineer Josiah Langley shared that as a former threat hunter at Dragos, he had to deeply understand rules and indicators to know how or whether to act on their alerts.

Since our underlying model relies on many-to-many comparisons of 512-dimensional tensors, explainability was a challenge. We started addressing it even before founding DeepTempo.

Measuring explainability is not straightforward. Our approach includes:

Mapping incidents to MITRE ATT&CK patterns: We may not think in 512 dimensions, but if you work in a SOC, you speak MITRE ATT&CK. We provide and measure the accuracy of this mapping (see the sketch after this list).

Creating and providing sequences within logs: Can a user immediately see the concerning sequence? We have invested significant thought into making all the embeddings our model creates useful, including those flagged as concerning and those that are not. This usefulness can be evaluated through human feedback and the accuracy of the model in predicting the ground truth of the sequences themselves.

Dashboards: Splunk, Snowflake, and other vendors have the attention of data analysts and SOC teams. We provide dashboards that fit our information into these environments. Whether users rely on our dashboards or their own, they can apply the full context of these solutions. For example, they can look at everything impacting an IP, including the outputs of our Tempo LogLM. A demo of an example dashboard is available on our YouTube channel.
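As a concrete illustration of the first item above, the incident-to-ATT&CK mapping can be scored like any classification task, by comparing predicted technique IDs against analyst-confirmed ones. The sequence IDs and techniques below are invented for illustration.

  # Minimal sketch: scoring incident-to-ATT&CK mapping accuracy.
  # Predicted and analyst-confirmed technique IDs are hypothetical.
  predicted = {"seq-001": "T1046", "seq-002": "T1048", "seq-003": "T1071"}
  confirmed = {"seq-001": "T1046", "seq-002": "T1048", "seq-003": "T1105"}

  correct = sum(predicted[s] == confirmed[s] for s in predicted)
  print(f"Mapping accuracy: {correct}/{len(predicted)} = {correct / len(predicted):.0%}")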

Measuring efficiency

To measure efficiency, we capture concrete metrics and attempt to measure harder-to-quantify soft metrics.

First, the soft metrics. When discussing our approach with potential users, we want to understand them better. We often ask: How do you build and maintain your rules-based indicators? Who built them? How are they documented? How are they tested?

These questions help us understand what the team is doing and emphasize the costs of the massive technical debt under which most of the cybersecurity industry operates. This naturally leads to discussions about efficiency gains from a quick-to-adapt, extremely accurate solution with built-in explainability. These gains are often hard to quantify.

Easier to quantify is our models' and software's ability to handle large data streams while using relatively small amounts of computing and memory. Without getting into proprietary details, we have shown that the approach can scale horizontally with standard containerized approaches from NVIDIA and Snowflake. In many cases, the bottleneck is getting logs back to a location for analysis. In these cases, our models and related software run in a decentralized manner.
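As a rough sketch of the decentralized pattern, log records can be sharded by a stable key so the same source always lands on the same worker near the data. The hashing scheme and worker count below are a generic illustration, not our proprietary design.

  # Minimal sketch: sharding a log stream across N workers by source IP.
  import hashlib

  N_WORKERS = 4

  def shard_for(source_ip: str) -> int:
      """Stable hash so the same source always maps to the same worker."""
      digest = hashlib.sha256(source_ip.encode()).hexdigest()
      return int(digest, 16) % N_WORKERS

  records = [
      {"src": "10.0.0.5", "event": "tcp_syn"},
      {"src": "10.0.1.9", "event": "dns_query"},
      {"src": "10.0.0.5", "event": "tls_handshake"},
  ]
  for record in records:
      print(f"worker {shard_for(record['src'])} <- {record}")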

As of fall 2025, DeepTempo is an official Cribl partner. We are seeing enormous ROI from customers who use Cribl in part to gather telemetry data for our LogLM.

Calculating ROI

A deep technology investor who generally avoids cybersecurity explained to us in an all-company meeting last fall that there are two kinds of solutions in cybersecurity: those that just document things and keep lawyers and regulators happy, and those that apply deep technology to the fundamental job of cybersecurity, which is greater security.

We add value by reducing the risk of advanced attacks. How do we quantify this? What is the value of reducing the risk of potentially successful attacks? In the case of our bank partner, they are a bedrock of capitalism itself. What is the benefit of further securing that foundation?

We also have hard ROI from cost avoidance. Users decrease their retention of flow logs in expensive systems as they come to trust our solution to better identify and alert on certain attack vectors. Our embeddings for retroactive use cases, along with the log sequences we parse out and make immediately available, also give users the confidence to push a greater percentage of their logs into lower-cost data lakes like Snowflake, object storage, and other platforms.
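As a back-of-the-envelope illustration of the cost-avoidance math, consider shifting a share of flow-log retention from a premium tier to low-cost storage. All prices and volumes below are invented placeholders, not real figures.

  # Minimal sketch: cost avoidance from moving flow-log retention to cheaper storage.
  # All figures are hypothetical placeholders, not real pricing.
  daily_gb = 2_000                 # flow logs ingested per day
  hot_cost_per_gb_month = 2.50     # premium SIEM retention
  cold_cost_per_gb_month = 0.03    # data lake / object storage
  shifted_fraction = 0.70          # share of retention moved to cold storage

  monthly_gb = daily_gb * 30
  savings = monthly_gb * shifted_fraction * (hot_cost_per_gb_month - cold_cost_per_gb_month)
  print(f"Estimated monthly cost avoidance: ${savings:,.0f}")  # ~$103,740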

Our approach to pricing attempts to share the hard ROI benefits and leave all soft ROI benefits to the user.

Like many enterprise vendors, we have a detailed ROI model used by larger customers to document their decision to rely on our Tempo LogLM. Other users just try it out, burning off some of their Snowflake credits to get started.

What we learned

Cybersecurity has inherent measurement challenges, which likely explains some unwise and backward-facing investments. While spending on cybersecurity is increasing rapidly, so are losses. Attackers, thanks to their success, have much more to spend than the $200-$250 billion we collectively spend on cybersecurity. These attackers do not share our measurement challenges. They have a very simple method of measuring their success.

Measurement across at least the following criteria has proven helpful to our users: accuracy, adaptability, explainability, efficiency, and ROI. We hope this blog and other work, including our open source contributions, will help buyers make more fact-based decisions about necessary investments in improved cybersecurity.

