The pitch and the gap
The vertical foundation model is an attractive thesis: a model trained on the data of one domain can outperform a generalist within that domain at a fraction of the size and cost. BloombergGPT, BioGPT, Med-PaLM, and Code Llama are existence proofs in finance, biomedical text, medicine, and code.
The argument is also incomplete. The published papers describe models. Production systems are a different category. Between the model paper and the deployed product sit several engineering tracks that rarely make it into the writeup, and they are where most projects stall. This post is about those tracks, drawn from the work of building DeepTempo's LogLM.
Track 1: training data
Public datasets in most domains are too small, too clean, or too narrow to train a production foundation model. In security telemetry, the published datasets that exist are either synthetic (and thus do not capture the messiness of real environments) or scoped to specific research questions.
Real training data has to be assembled. In our case the corpus included substantial volumes of flow logs, application logs, identity events, and threat intelligence across multiple environment types: cloud, data center, hybrid, and OT.
Diversity over volume. A larger corpus from a single environment is less useful than a smaller corpus that covers cloud, data center, and OT separately. The model has to learn structure that generalizes.
Adversarial coverage. Pretraining on benign telemetry produces a model that is good at understanding benign telemetry. To get a model that recognizes attacker intent, the corpus has to include enough adversarial activity for the model to learn its structure. This requires labeled or labelable adversarial data, which is expensive to gather.
Privacy-aware curation. Operational telemetry contains identifiers, addresses, and content that has to be handled carefully. The corpus pipeline includes dedicated tooling for redaction, hashing, and access control, separate from the model training itself.
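To make the curation tooling concrete, here is a minimal sketch of a pseudonymization pass over log records, assuming records arrive as dictionaries. The field names and the keyed-hash scheme are illustrative assumptions, not a description of DeepTempo's pipeline.

```python
# Minimal sketch of privacy-aware curation: keyed hashing keeps identifiers
# joinable across records without exposing raw values. Field names are
# illustrative assumptions, not a real schema.
import hashlib
import hmac
import re

SECRET_KEY = b"rotate-per-corpus"  # managed separately, under access control

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable keyed hash."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def redact_record(record: dict) -> dict:
    """Hash direct identifiers and scrub IPs embedded in free-text fields."""
    out = dict(record)
    for field in ("src_ip", "dst_ip", "user"):
        if field in out:
            out[field] = pseudonymize(out[field])
    if "cmdline" in out:
        out["cmdline"] = IP_RE.sub(lambda m: pseudonymize(m.group()), out["cmdline"])
    return out
```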
Track 2: tokenization and architecture
Most foundation-model research papers start from a standard transformer and adjust the training data and objectives. For a vertical model that has to run at line rate against a production telemetry stream, the architecture itself needs domain-specific adjustments.
For LogLM, the tokenizer was the first place this showed up. Natural-language tokenizers shred IP addresses, GUIDs, command-line arguments, and file paths into subword fragments that lose the structure that matters for detection. A custom tokenizer that preserves field-level structure was a substantial engineering investment in itself.
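To illustrate what preserving field-level structure can mean, the toy sketch below tokenizes key=value log lines and emits typed tokens for IPs, GUIDs, and paths instead of letting a subword vocabulary shred them. The token names and the key=value assumption are illustrative; this is not LogLM's tokenizer.

```python
# Toy field-aware tokenizer for key=value log lines. Typed placeholder tokens
# (<IP>, <GUID>, <PATH>) are illustrative, not the production vocabulary.
import re

IP_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")
GUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")

def tokenize_field(key: str, value: str) -> list[str]:
    """Emit a key token plus typed value tokens instead of subword fragments."""
    if IP_RE.match(value):
        return [key, "<IP>"] + value.split(".")               # keep octet structure
    if GUID_RE.match(value):
        return [key, "<GUID>"]
    if "/" in value or "\\" in value:
        return [key, "<PATH>"] + value.replace("\\", "/").split("/")
    return [key, value]

def tokenize_record(line: str) -> list[str]:
    """Split a whitespace-delimited key=value record into structured tokens."""
    tokens: list[str] = []
    for pair in line.split():
        key, _, value = pair.partition("=")
        tokens.extend(tokenize_field(key, value) if value else [key])
    return tokens

# Example: tokenize_record("src=10.0.0.5 proc=C:\\Windows\\cmd.exe action=login")
```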
Architecture choices about context length, attention pattern, and embedding dimension follow from the production constraints. A long context window is desirable because attacks unfold over long sequences. A long context window is also expensive at inference. The model that ships balances these. Getting the balance right took multiple iterations against held-out evaluation data.
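One way to see the cost side of that tradeoff: the attention score computation scales quadratically with sequence length. The sketch below is back-of-envelope arithmetic with assumed model dimensions, not LogLM's configuration.

```python
# Rough FLOP count for attention scores (QK^T plus the attention-weighted sum
# over V) as context length grows. Dimensions are illustrative assumptions.
def attention_score_flops(seq_len: int, d_model: int, n_layers: int) -> float:
    return 2 * (seq_len ** 2) * d_model * n_layers

for seq_len in (1_024, 8_192, 65_536):
    flops = attention_score_flops(seq_len, d_model=1024, n_layers=24)
    print(f"{seq_len:>6} tokens -> ~{flops / 1e12:.1f} TFLOPs in attention scores alone")
```

Doubling the context does not double the attention cost; it roughly quadruples it, which is why the shipped context length is a budget decision rather than a wish list.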
Track 3: the classifier head and the labeled data underneath it
A foundation model produces embeddings. Detection requires labels. The classifier head that maps embeddings to MITRE ATT&CK techniques is where the foundation model becomes a detection product.
The classifier head is much smaller than the foundation model. The work behind it is much larger. Building the classifier requires labeled examples of activity tagged to MITRE techniques. There is no public corpus of this at production scale. We built one. The work is slow, requires domain expertise, and never finishes. Each round of labels improves the classifier, surfaces edge cases, and informs what new labels are needed next.
This is one of the places vertical models compound. Without this loop a vertical foundation model is a research artifact rather than a production system.
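For readers unfamiliar with the pattern, the head itself can be tiny relative to the work of labeling. The sketch below is a generic classifier head over frozen embeddings; the dimensions and technique list are illustrative assumptions, not LogLM's.

```python
# Generic sketch: a small head mapping frozen foundation-model embeddings to
# MITRE ATT&CK technique labels. Sizes and the label set are assumptions.
import torch
import torch.nn as nn

TECHNIQUES = ["T1046", "T1071", "T1110", "T1486", "benign"]  # example label set

class TechniqueHead(nn.Module):
    def __init__(self, embed_dim: int = 1024, n_labels: int = len(TECHNIQUES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_labels),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)  # logits over techniques

# Training touches only the head; the foundation model stays frozen.
head = TechniqueHead()
logits = head(torch.randn(8, 1024))
loss = nn.functional.cross_entropy(logits, torch.randint(0, len(TECHNIQUES), (8,)))
```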
Track 4: inference at production volumes
A model that takes ten seconds per inference is not a detection system. Production telemetry arrives at rates that require sub-second inference, and often well under that.
Production inference involves quantization and distillation to reduce model size without losing accuracy on the tasks that matter, hardware-aware inference scheduling across deployment targets, batching strategies that maximize throughput without exceeding latency budgets, and caching of intermediate representations for sequences that share context with previously seen sequences. Each is its own ongoing engineering track. None appear in the model paper. All matter for whether the model is usable in production.
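As one concrete example of the engineering hiding in that paragraph, here is a sketch of latency-budgeted micro-batching. The budget, batch cap, and queue interface are illustrative assumptions, not a description of the actual serving stack.

```python
# Sketch: group incoming sequences into batches for throughput, but flush
# before the oldest queued item exceeds its latency budget. Numbers are
# illustrative assumptions.
import queue
import time

MAX_BATCH = 64
LATENCY_BUDGET_S = 0.05  # flush before the oldest item waits more than 50 ms

def serve(incoming: "queue.Queue[str]", run_model) -> None:
    batch: list = []
    deadline = None
    while True:
        timeout = None if deadline is None else max(deadline - time.monotonic(), 0.0)
        try:
            batch.append(incoming.get(timeout=timeout))
            if deadline is None:
                deadline = time.monotonic() + LATENCY_BUDGET_S
        except queue.Empty:
            pass  # budget expired with a partial batch; flush it below
        if batch and (len(batch) >= MAX_BATCH or time.monotonic() >= deadline):
            run_model(batch)  # quantized model on a hardware-specific runtime
            batch, deadline = [], None
```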
Track 5: evaluation that means something
Reporting accuracy on a held-out test split is the bare minimum. Production accuracy has to be measured against the conditions of real deployments.
For LogLM this includes zero-shot accuracy on environments the model has not seen during pretraining, adaptation curves over the first thirty days of deployment, false positive rate across deployment types, and robustness against adversarial mutations. Each requires its own evaluation infrastructure, its own held-out data, and its own reporting cadence. The figures we publish (99 percent on common TTPs, 85 percent zero-shot improving to 94 percent after adaptation, sub-five-percent false positives) hold up against this evaluation discipline. They do not hold up if the discipline is dropped.
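To show what the smallest unit of that infrastructure looks like, the sketch below computes detection rate and false positive rate per deployment type from labeled predictions. The record layout and label convention are assumptions for illustration, not the actual harness.

```python
# Sketch: per-deployment-type metrics from (environment, predicted, true) triples.
# The "benign" label convention is an illustrative assumption.
from collections import defaultdict

def per_environment_metrics(records):
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for env, pred, truth in records:
        predicted_malicious = pred != "benign"
        actually_malicious = truth != "benign"
        if predicted_malicious:
            stats[env]["tp" if actually_malicious else "fp"] += 1
        else:
            stats[env]["fn" if actually_malicious else "tn"] += 1
    return {
        env: {
            "detection_rate": c["tp"] / max(c["tp"] + c["fn"], 1),
            "false_positive_rate": c["fp"] / max(c["fp"] + c["tn"], 1),
        }
        for env, c in stats.items()
    }
```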
What this means for teams considering vertical foundation models
The pitch is sound. A vertical model in your domain can outperform a generalist at a fraction of the cost. The work to ship one is substantial, distributed across the tracks above, and ongoing rather than one-time.
If your team is considering building one in another domain, the question worth asking is not can we train the model but can we sustain training data assembly, classifier labeling, inference engineering, evaluation infrastructure, and adaptation pipelines simultaneously, indefinitely. The answer determines whether the project produces a research paper or a product.
If your team is considering buying one, the question worth asking the vendor is not what is the model architecture but what does your evaluation discipline look like, what is your false positive rate under production conditions, and how does the model adapt over time without drifting. Vendors who cannot answer those questions in detail have a research artifact, not a production system.
