
Training Your Own LLM: A Complete Guide from Theory to Code


When people hear I work in the AI space, the follow-up question tends to be “Can I train my own AI model like ChatGPT?” While the answer is complex, I’m going to break it down into simple terms that anyone can understand, and then provide the exact code you need to make it happen. We’ll focus on training a new model from scratch using a single RTX 3090 GPU.

The Reality Check

Before we dive in, let’s set some expectations. Training a language model on a single GPU (even a powerful one like the RTX 3090) is like trying to build a car in your garage. You can do it, but it won’t be a Ferrari. Your model will be much simpler than ChatGPT or Claude, but the process of building it will teach you a lot about how AI works.

Why an RTX 3090? With 24GB of VRAM, it’s one of the most powerful consumer GPUs available. Think of VRAM as your AI’s working memory: the more it has, the bigger the “brain” it can handle. For comparison, training the largest language models requires thousands of GPUs working together. But don’t let that discourage you; every journey starts with a single step!

Step 1: Preparing Your Kitchen (Computer Setup)

Imagine you’re about to cook a complex meal. Before you start, you need to prepare your workspace. Let’s break down exactly what we need and why:

# Create and activate a new virtual environment
python -m venv llm_training
source llm_training/bin/activate   # On Linux/Mac
# Or use this on Windows:
# llm_training\Scripts\activate

# Upgrade pip and install required packages
python -m pip install --upgrade pip
pip install torch torchvision torchaudio
pip install transformers datasets wandb numpy tqdm

Let’s understand what each package does:
torch: This is PyTorch, our main AI cooking tool. It provides the fundamental building blocks for deep learning.
transformers: Think of this as a recipe book from Hugging Face, containing pre-tested model architectures.
datasets: A library that helps us handle large amounts of text data efficiently.
wandb: (Weights & Biases) Like a smart kitchen timer that tracks our training progress.
tqdm: A simple progress bar to know how long until our AI is “cooked.”

Pro tip: The RTX 3090 runs hot when training. Make sure you have good ventilation in your PC case. Think of it like having proper kitchen ventilation when cooking at high temperatures!
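Before moving on, it’s worth a quick check that PyTorch was installed with CUDA support and can actually see the 3090:

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Should report something like "NVIDIA GeForce RTX 3090" with roughly 24 GB
    print("GPU:", torch.cuda.get_device_name(0))
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.0f} GB")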

Step 2: Gathering Your Ingredients (Training Data)

Just like a chef needs fresh ingredients, an AI needs good data to learn from. Here’s how we prepare our data:

import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def prepare_dataset():
    # Load tokenizer (we'll use GPT-2's tokenizer)
    tokenizer = AutoTokenizer.from_pretrained('gpt2')
    # GPT-2's tokenizer has no padding token by default, so reuse the end-of-text token
    tokenizer.pad_token = tokenizer.eos_token

    # Load a sample dataset (using Wikitext for this example)
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=512,
            padding="max_length"
        )

    # Tokenize the dataset (map stores plain lists; we convert them to
    # PyTorch tensors later, when building the DataLoader)
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset["train"].column_names
    )
    return tokenized_dataset, tokenizer

What’s happening here? Let’s break it down:

1. The Tokenizer: Think of this as your food processor. It chops text into smaller pieces (tokens) that your AI can digest. We’re using GPT-2’s tokenizer because it’s well-tested and works great with English text.

2. The Dataset: Wikitext is like a starter cookbook. It contains well-formatted, clean text that’s perfect for learning. In real applications, you might want to use your own custom dataset, but Wikitext is great for learning the process.

3. Text Processing: The tokenize_function is like your prep work (a quick sanity check of its output follows this list). It:
— Cuts text into chunks of 512 tokens (like cutting ingredients into bite-sized pieces)
— Adds padding (like making sure all your ingredients fill their containers evenly)
— Converts everything into PyTorch tensors (the format our model can “eat”)
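
To make the “food processor” analogy concrete, here is a quick, optional check of what the tokenizer actually produces; the sentence is just an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
sample = tokenizer("Training a language model at home", truncation=True, max_length=512)
print(sample["input_ids"])                                   # integer token IDs
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))  # the text pieces they map to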

Step 3: Creating Your Recipe (Model Architecture)

Now comes the exciting part: building your AI’s brain! This is where the magic happens. Let’s look at each component:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        # batch_first=True keeps tensors as [batch, seq, embed] throughout
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Position-wise feed-forward network
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

The TransformerBlock is like a single layer of your AI’s brain; its forward pass is sketched right after this list. Each component has a specific job:
MultiheadAttention: Imagine multiple readers looking at different parts of a text simultaneously
ff (feed-forward): The processing unit that helps the model understand patterns
LayerNorm: Keeps the numbers in check, like ensuring your cooking temperature stays consistent
Dropout: Randomly turns off some connections during training, like practicing cooking with different ingredients missing to become more adaptable
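
The listing above only declares the layers; the article never shows the block’s forward pass. Here is a minimal sketch of one, to be added inside TransformerBlock. The pre-norm ordering and the residual connections are my assumptions rather than something specified in the original:

    def forward(self, x, attn_mask=None):
        # Self-attention with a residual connection (pre-norm)
        normed = self.ln1(x)
        attn_out, _ = self.attention(normed, normed, normed, attn_mask=attn_mask)
        x = x + self.dropout(attn_out)
        # Position-wise feed-forward, also with a residual connection
        x = x + self.dropout(self.ff(self.ln2(x)))
        return x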

class SmallLLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, num_heads=8,
                 num_layers=6, ff_dim=1024, max_seq_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embedding = nn.Parameter(torch.randn(max_seq_len, embed_dim))
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ])
        self.fc = nn.Linear(embed_dim, vocab_size)
        self.dropout = nn.Dropout(0.1)

The SmallLLM class is your complete AI recipe; as before, a sketch of its forward pass follows the list. Let’s understand the ingredients:
vocab_size: How many words your model knows
embed_dim: How deeply it understands each word (256 dimensions)
num_heads: How many different ways it looks at the text (8 perspectives)
num_layers: How many thinking layers it has (6 layers)
max_seq_len: Maximum length of text it can process at once (512 tokens)
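
Likewise, SmallLLM needs a forward pass of its own before any of the later code will run. A minimal sketch, again added inside the class; the causal mask (each token may only attend to earlier tokens) is my addition and worth double-checking against your PyTorch version:

    def forward(self, input_ids):
        seq_len = input_ids.shape[1]
        # Token embeddings plus learned positional embeddings
        x = self.dropout(self.embedding(input_ids) + self.pos_embedding[:seq_len])
        # Causal mask: True marks positions that may NOT be attended to
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device),
            diagonal=1
        )
        for block in self.transformer_blocks:
            x = block(x, attn_mask=causal_mask)
        # Project back to vocabulary logits: [batch, seq_len, vocab_size]
        return self.fc(x)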

Step 4: Setting Up the Kitchen (Training Configuration)

Before we start cooking (training), we need to set up our workspace properly. This is where we define all the important settings that will affect how our model learns:

class TrainingConfig:
    def __init__(self):
        self.learning_rate = 3e-4
        self.batch_size = 32
        self.epochs = 10
        self.warmup_steps = 1000
        self.max_grad_norm = 1.0
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

Let’s break down these crucial settings:
learning_rate: How big each learning step should be (0.0003). Too high and your model might miss the best solution; too low and training takes forever.
batch_size: How many examples to look at once (32). Think of it like cooking 32 dishes at once to learn faster.
epochs: How many times to go through the entire dataset (10). Each pass helps refine the model’s understanding.
warmup_steps: Like preheating your oven, we start with a lower learning rate and gradually increase it.
max_grad_norm: Prevents the model from making too drastic changes at once, like avoiding sudden temperature changes while cooking.

def create_data_loader(dataset, config):
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=config.batch_size,
        shuffle=True
    )

The DataLoader is like your kitchen conveyor belt, it feeds data to your model in organized batches. Shuffling the data is like mixing up your recipe practice to avoid memorizing the cookbook order.
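
One practical detail worth flagging: a Hugging Face dataset produced by map stores plain Python lists, so the DataLoader above would not hand the training loop the PyTorch tensors it expects. A small fix, assuming the column names produced by our tokenizer, is to set the dataset format before building the loader:

# Have the tokenized dataset return PyTorch tensors for these columns
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
train_loader = create_data_loader(dataset["train"], config)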

Step 5: The Cooking Process (Training Loop)

Now comes the actual training. This is where your RTX 3090 really starts cooking:

from tqdm import tqdm
import wandb

def train_model(model, train_loader, config):
    # Initialize wandb for tracking
    wandb.init(project="small-llm-training")

    # Move the model onto the GPU (or CPU fallback)
    model.to(config.device)

    # Standard next-token objective: cross-entropy over the vocabulary
    criterion = nn.CrossEntropyLoss()

    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=config.learning_rate,
        steps_per_epoch=len(train_loader),
        epochs=config.epochs
    )

The training setup includes:
wandb: Your training dashboard, showing you real-time progress
optimizer: The AdamW optimizer is like your master chef, adjusting all the model’s parameters
scheduler: Controls the learning rate throughout training, like adjusting cooking temperature over time

    for epoch in range(config.epochs):
        model.train()
        total_loss = 0
        progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{config.epochs}')

        for batch in progress_bar:
            input_ids = batch["input_ids"].to(config.device)
            labels = input_ids.clone()

            # Forward pass
            outputs = model(input_ids)

            # Calculate loss: predict token t+1 from tokens up to t,
            # so shift the logits and labels by one position
            shift_logits = outputs[:, :-1, :]
            shift_labels = labels[:, 1:]
            loss = criterion(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1)
            )
            total_loss += loss.item()

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), config.max_grad_norm)
            optimizer.step()
            scheduler.step()

            progress_bar.set_postfix(loss=loss.item())

        # Track the average loss for this epoch
        wandb.log({"epoch": epoch, "train_loss": total_loss / len(train_loader)})

The training loop is where all the magic happens:
1. Forward Pass: The model tries to predict the next word in each sequence
2. Loss Calculation: We measure how wrong the predictions were (the snippet after the analogy below turns this into a more intuitive number)
3. Backward Pass: The model learns from its mistakes
4. Parameter Update: Small adjustments are made to improve future predictions

Think of it like a cooking competition:
— The model makes a dish (prediction)
— Judges taste it (calculate loss)
— The chef learns what went wrong (backward pass)
— Recipes are adjusted slightly (parameter update)
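
If raw cross-entropy feels abstract, a common trick is to convert the epoch’s average loss into perplexity: roughly, how many tokens the model is still “choosing between” on average. A tiny sketch, using the variables from the loop above:

import math

# total_loss and train_loader come from the training loop above
avg_loss = total_loss / len(train_loader)
print(f"Perplexity: {math.exp(avg_loss):.1f}")  # lower is better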

Step 6: Tasting the Results (Text Generation)

After training, here’s how you can generate text with your model:

def generate_text(model, tokenizer, prompt, max_length=50, temperature=0.7):
    model.eval()
    # Figure out which device the model's weights live on
    device = next(model.parameters()).device

    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids)
            # Take the logits for the last position and apply temperature
            next_token_logits = outputs[:, -1, :] / temperature
            next_token = torch.multinomial(
                torch.softmax(next_token_logits, dim=-1), num_samples=1
            )
            input_ids = torch.cat([input_ids, next_token], dim=-1)

            # Stop if the model emits the end-of-text token
            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(input_ids[0])

The generation process is like having your AI chef create a new recipe:
temperature: Controls randomness (0.7 is a good balance). Higher values (>1.0) make output more creative but potentially nonsensical; lower values (<0.5) make it more focused but potentially repetitive.
max_length: Maximum number of tokens to generate

The generation happens one token at a time, like adding ingredients one by one to create a complete dish.
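
Using it is then a single call; the prompt here is just an illustrative example:

print(generate_text(model, tokenizer, "The history of artificial intelligence", max_length=50))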

Putting It All Together

Here’s how to run the complete training process:

def main():
    # Initialize config
    config = TrainingConfig()

    # Prepare dataset
    dataset, tokenizer = prepare_dataset()
    # Return PyTorch tensors for the columns the training loop needs (see Step 4)
    dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    train_loader = create_data_loader(dataset['train'], config)

    # Initialize model
    model = SmallLLM(
        vocab_size=tokenizer.vocab_size,
        embed_dim=256,
        num_heads=8,
        num_layers=6
    )

    # Train model
    train_model(model, train_loader, config)

if __name__ == "__main__":
    main()
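
Before committing your GPU to a long run, it can be worth sanity-checking how large the model actually is. A quick sketch; with GPT-2’s roughly 50k-token vocabulary and the defaults above, expect something on the order of a few tens of millions of parameters:

model = SmallLLM(vocab_size=50257)  # GPT-2's vocabulary size
num_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {num_params / 1e6:.1f}M")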

Important Technical Notes

1. Hardware Management:
— Monitor GPU temperature (try to stay within 70–85°C under load)
— Watch VRAM usage (should stay under 22GB for stability; see the snippet below)
— Consider running nvidia-smi in a separate terminal to monitor GPU status
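
You can also keep an eye on memory from inside the training script itself, using PyTorch’s built-in counters:

if torch.cuda.is_available():
    # Memory occupied by tensors vs. memory reserved by PyTorch's caching allocator
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    reserved_gb = torch.cuda.memory_reserved() / 1024**3
    print(f"Allocated: {allocated_gb:.1f} GB | Reserved: {reserved_gb:.1f} GB")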

2. Training Duration:
— Expect anywhere from hours (on the small Wikitext example) to weeks (on larger datasets) of training on the RTX 3090
— First few hours will show rapid improvement
— Progress slows but continues the longer you train

3. Checkpointing:
— Save model checkpoints every epoch (a minimal sketch follows below)
— Keep at least the last 3 checkpoints
— Store training logs separately
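
The article doesn’t include checkpointing code, so here is a minimal sketch of what saving and restoring might look like; the file naming is just an illustrative choice:

# At the end of each epoch (inside train_model)
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "scheduler_state_dict": scheduler.state_dict(),
}, f"checkpoint_epoch_{epoch}.pt")

# Later: restore a checkpoint and continue training
checkpoint = torch.load("checkpoint_epoch_9.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
scheduler.load_state_dict(checkpoint["scheduler_state_dict"])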

4. Common Issues and Solutions:
— Out of Memory: Reduce the batch size or model size, or accumulate gradients (sketched below)
— Slow Training: Check whether the CPU or data loading is the bottleneck
— Poor Results: Increase model size or training data
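
One common way to cut VRAM usage without shrinking the effective batch size is gradient accumulation: run several small batches, then take a single optimizer step. A hedged sketch of how the inner loop above might change (accumulation_steps is an illustrative new setting, not part of the original config):

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

for step, batch in enumerate(progress_bar):
    input_ids = batch["input_ids"].to(config.device)
    outputs = model(input_ids)
    loss = criterion(
        outputs[:, :-1, :].reshape(-1, outputs.size(-1)),
        input_ids[:, 1:].reshape(-1)
    )
    # Scale the loss so the accumulated gradients match one full-size batch
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), config.max_grad_norm)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()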

Conclusion

Training your own LLM is like raising a digital baby. It takes time, patience, and lots of computing power. While your homemade AI won’t compete with the big models, the process of building it will teach you incredible things about how artificial intelligence works.

The most important things to remember:
1. Start small and scale up
2. Monitor your training metrics
3. Be patient; good models take time
4. Experiment with different hyperparameters
5. Keep good documentation of what works and what doesn’t

Remember: Every great AI started as a simple model. Your first attempt doesn’t need to be perfect — it just needs to be a starting point for your journey into AI development.

Author’s Note: This is a simplified implementation of a very complex process. Many technical details and optimizations have been omitted for clarity. If you’re serious about training your own LLM, I recommend diving deeper into the technical documentation for PyTorch and transformer architectures. The code provided is a starting point and may need adjustments based on your specific hardware and requirements.
