Perivitta Rajendran

AI in Finance: ML for Trading, Risk, and Fraud Detection

2026-06-15T02:00:00+00:00

Introduction

Finance and machine learning have a longer shared history than almost any other industry pairing. Banks were building neural network-based fraud detectors in the early 1990s, long before deep learning became a household term. Quantitative hedge funds were running statistical arbitrage algorithms before the term "machine learning" had reached mainstream awareness. The industry was doing AI before it called it AI.

Today the transformation is far deeper and more visible. Fraud is caught in milliseconds. Credit decisions that once required a loan officer's judgment are now automated at scale. High-frequency trading firms run algorithms that execute thousands of trades per second based on signals no human could perceive. Risk models assess the probability of default for millions of borrowers simultaneously. AI is not coming to finance; it has been there for decades and is now embedded in nearly every layer of the industry.

This guide covers the four domains where AI's impact in finance is most substantial: fraud detection, credit scoring, algorithmic trading, and risk modelling. For each, it explains what the technology actually does, where it succeeds, and where it still fails in ways that matter.

Problem Statement: Why Finance Was an Early Adopter

Several properties of financial data made machine learning unusually attractive to the industry early on, before the broader technology world had caught up.

Financial data is abundant, structured, and already digital. Unlike healthcare, which stores information in PDFs and handwritten notes, or manufacturing, which encodes knowledge in physical processes, banks and markets have generated enormous quantities of clean, structured, time-stamped data for decades. Transaction records, price feeds, account histories, and credit files are exactly the kind of data that classical machine learning works well on.

The stakes are high and the feedback is fast. A fraud detection model that misclassifies a fraudulent transaction loses money in minutes. An algorithmic trading model's performance is visible in real-time profit and loss. This tight feedback loop, rare in medicine or policy, allowed financial firms to train, evaluate, and improve models quickly.

The business case was immediately quantifiable. Cutting fraud losses by a tenth of a percentage point on a billion-dollar transaction book saves a million dollars. Better credit models reduce default rates. Better trading algorithms generate alpha. In an industry obsessed with marginal returns, machine learning offered measurable, dollar-denominated value from day one.

Core Concepts and Terminology

Term	Plain English Definition
Fraud detection	Using machine learning to identify transactions, accounts, or behaviours that are likely fraudulent, in real time or near-real time.
Credit scoring	Assigning a numerical score to a borrower that predicts the probability they will repay a loan. Used to automate lending decisions.
Algorithmic trading	Using computer algorithms to execute trades automatically based on predefined rules or model outputs, often without human involvement in individual decisions.
High-frequency trading (HFT)	A form of algorithmic trading where the time advantage is measured in microseconds. Firms co-locate servers next to exchange matching engines to minimise latency.
Alpha	Returns that exceed what would be expected given market risk. A model generates alpha if it identifies profitable opportunities that cannot be explained by general market movements.
Feature engineering	The process of creating input variables for a machine learning model from raw data. In finance, features might include transaction velocity, time since last login, or a borrower's debt-to-income ratio.
False positive	A legitimate transaction or customer incorrectly flagged as fraudulent or high-risk. In fraud detection, false positives cause friction for real customers.
False negative	A fraudulent transaction or risky borrower that the model fails to flag. In fraud detection, false negatives result in direct losses.
Model explainability	The ability to explain why a model made a specific decision in terms a human can understand. Required by regulation for some credit decisions.
Overfitting	When a model performs well on training data but fails on new data because it has memorised patterns specific to the training set rather than learning general relationships.

How It Works: The Four Core Applications

Each major application area in finance uses machine learning in a distinct way, shaped by its specific data, constraints, and objectives.

Fraud Detection. Every card transaction is scored by a model in real time, typically within 50 to 100 milliseconds of the card being swiped. The model ingests dozens to hundreds of features: the transaction amount, merchant category, geography, time of day, the cardholder's typical spending patterns, and whether the card has been used recently in a different location. It outputs a fraud probability score. If the score exceeds a threshold, the transaction is declined or sent for manual review. Modern fraud systems use a combination of gradient boosted trees for the main scoring model and graph neural networks to detect fraud rings where multiple accounts and merchants are connected.
Credit Scoring. Traditional credit scoring relied on a small number of variables: payment history, amounts owed, length of credit history, new credit, and credit mix. These are the five categories behind the FICO score. Machine learning models can incorporate hundreds or thousands of variables, including alternative data such as utility payment history, rental records, or even mobile phone usage patterns. This allows lenders to score "thin file" borrowers who lack traditional credit history but are in fact reliable. Gradient boosted trees and logistic regression with feature engineering are the dominant approaches, partly because they satisfy regulatory requirements for explainability.
Algorithmic Trading. Quantitative trading models look for statistical patterns in price, volume, order flow, news sentiment, and alternative data (satellite imagery of parking lots, shipping container counts, credit card spending aggregates) to predict short-term price movements. A model might learn that when a particular combination of order book imbalance and recent price momentum occurs, a security tends to rise over the next 30 seconds. The model then places a buy order and exits when the predicted move materialises. At the high-frequency end, these strategies operate at microsecond timescales using custom hardware. At longer horizons, hedge funds run statistical arbitrage strategies that hold positions for days or weeks based on machine learning signals.
Risk Modelling. Banks and insurers use machine learning to estimate the probability that a borrower defaults, a counterparty fails, or an extreme market move occurs. Credit risk models assess loan portfolios. Market risk models estimate Value at Risk (VaR), the loss that a portfolio would exceed only a small percentage of the time. Stress testing models simulate what would happen to a bank's balance sheet under scenarios like a 30% equity market decline combined with a spike in unemployment. Machine learning supplements classical statistical models here, particularly in capturing non-linear relationships and tail risks that linear models underestimate.
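
As a concrete illustration of the Value at Risk idea described under Risk Modelling, here is a minimal sketch of historical-simulation VaR. It assumes daily returns are available as simple fractions and uses a nearest-rank quantile; the function name `historical_var` and the sample data are hypothetical, and production risk systems typically use far richer methods.

```python
import math


def historical_var(returns, confidence=0.99):
    """Historical-simulation VaR: the loss exceeded in about (1 - confidence) of past periods.

    `returns` are periodic portfolio returns as fractions (0.01 means +1%).
    The result is expressed as a positive number for a loss.
    """
    if not returns:
        raise ValueError("need at least one return observation")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    losses = sorted(-r for r in returns)  # flip sign so that losses are positive
    # Nearest-rank empirical quantile; round() guards against float noise in the product.
    rank = max(1, math.ceil(round(confidence * len(losses), 9)))
    return losses[rank - 1]


# Hypothetical daily returns: 100 evenly spaced values from -5.0% to +4.9%.
sample_returns = [i / 1000 for i in range(-50, 50)]
var_95 = historical_var(sample_returns, confidence=0.95)
print(f"1-day 95% VaR: {var_95:.1%} of portfolio value")
```

With 100 observations and 95% confidence, exactly five historical losses are larger than the reported figure, which is the "small percentage of the time" in the definition.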

Practical Example: Real-Time Fraud at a Major Bank

Consider how a major retail bank handles 10 million card transactions per day. Without automation, reviewing even a fraction of them for fraud would require thousands of analysts. With machine learning, the process is largely automated.

When a customer uses their card at a petrol station in Kuala Lumpur at 2am having last used it in London six hours earlier, the fraud model receives signals that in combination are highly unusual: geographically impossible travel time, unusual hour, merchant category mismatch with spending history, and transaction amount at the round-number threshold frequently used in card testing attacks. The model outputs a fraud score of 0.94 out of 1.0. The transaction is declined automatically.
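
The "geographically impossible travel time" signal can be sketched as a single engineered feature: the speed a cardholder would need to travel between two consecutive card uses. The coordinates, the 1,000 km/h cutoff, and the function names below are illustrative assumptions, not any bank's actual feature definition.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 1000.0  # assumed cutoff, roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(prev_location, curr_location, hours_between):
    """Speed needed to get from one card use to the next; infinite if no time elapsed."""
    distance = haversine_km(*prev_location, *curr_location)
    if hours_between <= 0:
        return math.inf if distance > 0 else 0.0
    return distance / hours_between


london = (51.5074, -0.1278)        # approximate city-centre coordinates
kuala_lumpur = (3.1390, 101.6869)  # approximate city-centre coordinates
speed = implied_speed_kmh(london, kuala_lumpur, hours_between=6.0)
impossible_travel = speed > MAX_PLAUSIBLE_SPEED_KMH
print(f"implied speed: {speed:.0f} km/h, impossible travel: {impossible_travel}")
```

In a real system this value would be one input among many to the scoring model rather than a standalone rule.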

The model has learned these patterns from millions of historical transactions, both fraudulent and legitimate, along with labels indicating which were ultimately confirmed as fraud. Gradient boosted tree models are particularly good at this task because they capture the interaction effects between features (the combination of impossible travel time AND unusual hour is far more suspicious than either alone).

Meanwhile, the bank's false positive rate must remain below a threshold that would cause unacceptable customer friction. A customer travelling internationally who gets their card declined at every transaction will close their account. The model is calibrated to balance these two costs, and the threshold is adjusted based on business rules about acceptable false positive rates in different transaction contexts.
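
One simple way to picture this calibration is to assign a cost to each kind of error and pick the threshold that minimises total cost on labelled historical data. The sketch below does this; the scores, labels, and cost figures are invented for illustration, and real calibration usually involves many more constraints.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Cost of declining at `threshold`: false positives annoy customers, false negatives lose money."""
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_pos * cost_fp + false_neg * cost_fn


def best_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total cost (first one wins on ties)."""
    return min(candidates, key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Hypothetical model scores and confirmed outcomes (1 = fraud, 0 = legitimate).
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.70, 0.94, 0.97]
labels = [0, 0, 0, 0, 1, 0, 1, 1]
candidates = [0.1, 0.3, 0.5, 0.65, 0.9]

# When missed fraud is expensive, a lower threshold wins.
print(best_threshold(scores, labels, cost_fp=5, cost_fn=200, candidates=candidates))
# When customer friction is expensive, a higher threshold wins.
print(best_threshold(scores, labels, cost_fp=500, cost_fn=200, candidates=candidates))
```

Changing the two cost parameters moves the chosen threshold, which is why the threshold is as much a business decision as a modelling one.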

Advantages

Speed and Scale Impossible for Humans

A machine learning model can score millions of transactions per second. No human team could match this throughput. For fraud detection, speed is existential: fraud happens in seconds, and a model that responds in 100 milliseconds prevents losses that a model responding in one second cannot.

Pattern Detection Beyond Human Intuition

Machine learning models can detect patterns in hundreds of variables simultaneously, including subtle interaction effects between variables that a human analyst would never think to look for. A fraud ring that routes transactions through a specific network of shell merchant accounts, timed to avoid round-number amounts, and using slightly rotated device fingerprints is invisible to a human reviewer but potentially detectable by a graph model trained on the underlying network structure.

Consistent and Auditable Decisions

A model applies the same logic to every input, and each decision can be logged and replayed. Human loan officers, by contrast, may make different decisions based on factors they are not supposed to consider. Properly governed machine learning credit decisions are therefore more consistent and easier to audit, which is both a fairness advantage and a compliance advantage.

Continuous Improvement from Feedback

Fraud models improve as new fraud patterns are detected and labelled. Credit models improve as loan outcomes are observed. The feedback loop between model deployment and model retraining is a structural advantage that compounds over time for well-resourced institutions.

Limitations and Trade-offs

Adversarial Adaptation

Fraudsters and adversarial traders actively study and adapt to the models used against them. A fraud pattern that the model catches reliably today will be modified by sophisticated fraud operations until it no longer triggers detection. This creates an arms race that requires continuous model updates and monitoring, unlike most machine learning deployments where the environment is relatively static.

Regulatory Constraints on Explainability

In many jurisdictions, a lender who denies credit must provide the applicant with a specific reason. A gradient boosted tree model with hundreds of features can identify the most important reason for a denial, but the explanation is sometimes fragile or counterintuitive. Regulators in the EU and US have imposed requirements that push financial firms toward more interpretable models or require secondary explanation layers on top of complex ones.

Historical Data Encodes Historical Biases

Credit models trained on historical lending data inherit the biases of past human decisions. If a particular demographic group was systematically denied credit by biased loan officers in the past, the model learns to associate features correlated with that group with default risk, even when the causal relationship does not exist. Detecting and correcting these biases is a major active challenge in algorithmic lending.

Model Risk in Trading

Trading models that work in backtesting frequently fail in live deployment. The patterns they learned may be specific to a particular market regime, or their trading itself changes the market dynamics they were designed to exploit. Major losses from algorithmic trading errors, including the Knight Capital incident in 2012 where a faulty algorithm lost 440 million dollars in 45 minutes, illustrate how model risk in trading can translate rapidly into catastrophic outcomes.

Common Mistakes

Training on Biased Historical Labels

Fraud labels are only available for transactions that were investigated or reported. If the old fraud detection system never flagged certain transaction types, those types will not appear as fraud in the training data even if they were fraudulent. The new model learns that those transaction types are safe, perpetuating the gap. This selection bias in the training labels, a close relative of survivorship bias, is one of the most insidious problems in financial ML.

Ignoring Class Imbalance in Fraud Detection

Fraud rates in most consumer payment systems are below 0.1 percent. At that rate, a model that predicts "not fraud" for every transaction achieves at least 99.9% accuracy but catches zero fraud. Fraud models must be evaluated on precision-recall curves and metrics like F1 or area under the precision-recall curve, not accuracy, and trained with techniques that handle class imbalance such as oversampling, undersampling, or cost-sensitive loss functions.
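
The accuracy trap is easy to reproduce. The sketch below builds a synthetic set of 100,000 transactions with a 0.1% fraud rate and compares a model that never flags anything against a hypothetical model that catches 80 of the 100 frauds at the price of 120 false alarms. All numbers are invented for illustration.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall); precision and recall are 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Synthetic data: 100 frauds followed by 99,900 legitimate transactions.
y_true = [1] * 100 + [0] * 99_900

always_legit = [0] * 100_000                                  # never flags anything
useful_model = [1] * 80 + [0] * 20 + [1] * 120 + [0] * 99_780  # 80 hits, 120 false alarms

for name, preds in [("always 'not fraud'", always_legit), ("useful model", useful_model)]:
    acc, prec, rec = evaluate(y_true, preds)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f}")
```

The useful model has slightly lower accuracy than the do-nothing model, yet it is the only one of the two that catches any fraud, which is why precision and recall are the metrics to watch.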

Overfitting to Market Regime in Trading

A trading model trained on bull market data will not have seen the dynamics of a bear market or a liquidity crisis. Backtesting that covers only a benign period dramatically overstates future performance. Walk-forward validation, where the model is retrained at each time step and tested only on future data, is more honest but still cannot prepare for regime changes not present in the historical data.
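
A minimal sketch of walk-forward splitting follows, assuming observations are already sorted by time and indexed from zero. The generator name and parameters are hypothetical; libraries such as scikit-learn offer comparable time-series splitters.

```python
def walk_forward_splits(n_samples, train_size, test_size, expanding=True):
    """Yield (train_indices, test_indices) pairs in which every test index is later than every train index.

    With expanding=True the training window grows from the start of the data;
    otherwise it is a rolling window of fixed length `train_size`.
    """
    start = train_size
    while start + test_size <= n_samples:
        train_start = 0 if expanding else start - train_size
        yield range(train_start, start), range(start, start + test_size)
        start += test_size


# Ten time-ordered observations, initial training window of four, test blocks of two.
for train_idx, test_idx in walk_forward_splits(10, train_size=4, test_size=2):
    print(f"train {list(train_idx)} -> test {list(test_idx)}")
```

Each model is fitted on the training indices only and scored on the block that follows, so no future information leaks into training.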

Treating Compliance as an Afterthought

Building a sophisticated ML credit model and then discovering it violates the Equal Credit Opportunity Act is an expensive mistake. Fairness analysis, explainability requirements, and model documentation should be incorporated from the design phase, not retrofitted after deployment.

Best Practices

Separate Detection and Explanation Layers

Use a high-performance model (gradient boosted trees, neural network) for the actual scoring decision, and a separate explanation layer (an interpretable surrogate such as logistic regression, or a feature attribution method such as SHAP values) to generate the explanation that goes to the customer or regulator. This preserves model performance while meeting explanation obligations, provided the explanation is checked for fidelity to the scoring model.

Monitor for Distribution Shift

Financial data distributions shift constantly. The spending patterns of a typical credit card user in 2020 were dramatically different from those in 2019 due to the pandemic. A model trained before a major economic shift will degrade rapidly. Set up monitoring dashboards that track key feature distributions and model score distributions in real time, and retrain on a schedule that reflects how quickly your data changes.
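
One widely used drift measure in credit risk is the Population Stability Index (PSI), which compares how a feature or score is distributed across bins in a baseline period versus a recent period. The sketch below is a minimal version; the bin edges and sample data are hypothetical, and the thresholds mentioned afterwards are industry rules of thumb rather than formal standards.

```python
import bisect
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%).

    `bin_edges` must be sorted; values fall into len(bin_edges) + 1 bins.
    `eps` floors empty bins so the logarithm stays defined.
    """
    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            counts[bisect.bisect_right(bin_edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


# Hypothetical model scores: the recent period has far more high scores than the baseline.
baseline_scores = [0.1] * 50 + [0.3] * 30 + [0.7] * 20
recent_scores = [0.1] * 20 + [0.3] * 30 + [0.7] * 50
edges = [0.2, 0.5]

print(f"PSI: {population_stability_index(baseline_scores, recent_scores, edges):.3f}")
```

A common heuristic treats PSI below 0.1 as stable, 0.1 to 0.25 as worth investigating, and above 0.25 as a significant shift that may justify retraining.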

Run Regular Bias Audits

For credit and fraud models, run structured tests of model outcomes across protected demographic groups at least quarterly. Report the results to compliance teams. Build correction mechanisms into the model development pipeline before deployment, not after a regulatory finding.
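
A basic audit check can be sketched as a comparison of approval rates across groups. The example below computes each group's approval rate and the ratio of the lowest to the highest rate. The 0.8 cutoff is the "four-fifths" heuristic borrowed from US employment guidance and is used here only as an illustrative assumption; the group labels and data are invented, and a real audit would apply the tests your compliance team and regulators specify.

```python
from collections import defaultdict

FOUR_FIFTHS = 0.8  # illustrative heuristic cutoff, not a legal standard for credit


def approval_rates(decisions):
    """Map each group to its approval rate, given (group, approved) pairs."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, was_approved in decisions:
        total[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / total[group] for group in total}


def adverse_impact_ratio(rates):
    """Lowest group approval rate divided by the highest; 1.0 means parity."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Hypothetical audit sample: group A approved 80 of 100, group B approved 60 of 100.
decisions = (
    [("A", True)] * 80 + [("A", False)] * 20
    + [("B", True)] * 60 + [("B", False)] * 40
)
rates = approval_rates(decisions)
ratio = adverse_impact_ratio(rates)
print(rates, f"ratio={ratio:.2f}", "needs review" if ratio < FOUR_FIFTHS else "ok")
```

A ratio below the chosen cutoff does not prove discrimination, but it is the kind of result that should be escalated to compliance for a closer look.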

Stress Test Against Adversarial Examples

For fraud models, periodically test against synthetic adversarial examples generated by your own security team mimicking how sophisticated fraud rings adapt. This red-teaming approach surfaces vulnerabilities before fraudsters find them in production.

Comparison: AI Applications Across Financial Domains

Domain	Primary ML Methods	Time Horizon	Key Success Metric	Biggest Risk
Fraud Detection	Gradient boosted trees, graph neural networks, anomaly detection	Milliseconds to minutes	Precision-recall at a given false positive rate	Adversarial adaptation by fraud operations
Credit Scoring	Logistic regression, gradient boosted trees, neural networks	Months to years	Default prediction accuracy (AUC, KS statistic)	Regulatory non-compliance, inherited bias
Algorithmic Trading	Reinforcement learning, LSTMs, gradient boosted trees, classical statistics	Microseconds to weeks	Risk-adjusted returns (Sharpe ratio)	Regime change, market impact, model failure
Risk Modelling	Survival models, neural networks, scenario simulation, tree models	Days to years	Accuracy of loss estimates under stress scenarios	Model risk during tail events not in training data

Frequently Asked Questions

Will AI replace financial analysts and traders?

AI has already replaced a significant portion of repetitive quantitative work: executing routine trades, scoring credit applications, and monitoring transactions for fraud. However, tasks requiring contextual judgment, relationship management, regulatory navigation, and creative problem-solving remain predominantly human. The realistic picture is not replacement but restructuring: fewer people doing execution tasks, more people doing oversight, strategy, and the work of building and maintaining the AI systems themselves. Goldman Sachs's US cash equities trading desk reportedly employed 600 traders in 2000; by 2017 it had two, supported by 200 computer engineers.

How does AI detect fraud it has never seen before?

Fraud detection models are not purely rule-based; they learn general patterns of anomalous behaviour. A transaction that deviates sharply from a customer's established baseline across multiple dimensions simultaneously will score highly even if that specific combination has never appeared in training data. Anomaly detection methods explicitly model what "normal" looks like and flag deviations, rather than trying to catalogue all possible fraud patterns. That said, truly novel fraud methods do initially evade detection until examples accumulate, which is why continuous model updates and manual review queues for borderline cases remain essential.

Are AI-driven credit decisions fair?

The fairness of AI credit decisions depends heavily on the training data, the features used, and the fairness criteria applied. ML models trained on historical data can inherit historical discrimination. Using features that correlate with protected characteristics (such as neighbourhood, which correlates with race) can produce proxy discrimination even when the protected characteristic itself is excluded. The regulatory framework for algorithmic fairness in credit is evolving rapidly in the US, EU, and UK. Responsible lenders run ongoing fairness audits and apply fairness constraints to their model training, though there is genuine tension between maximising predictive accuracy and satisfying fairness criteria.

What happened in the 2010 Flash Crash and can AI prevent that?

The 2010 Flash Crash saw the Dow Jones Industrial Average drop about 1,000 points in minutes before recovering, triggered by a combination of algorithmic trading feedback loops and a large sell order that overwhelmed liquidity. Algorithmic systems amplified each other's signals: falling prices triggered automatic selling, which drove prices lower, which triggered more selling. Circuit breakers (automatic pauses in trading when prices move too fast) have been implemented in exchanges globally since then, and they do interrupt these feedback loops. But AI cannot prevent all such events; it can also cause them. The more correlated algorithmic trading strategies become, the more simultaneously they respond to the same signals, and the more violent the market moves when they all act together.

What is alternative data and how is it used in finance?

Alternative data refers to non-traditional data sources used to generate investment signals or improve financial models. Examples include satellite imagery of retail parking lots (predicting sales before earnings announcements), aggregated credit card transaction data (tracking consumer spending patterns), shipping container AIS data (measuring trade flows), social media sentiment, and weather data for commodity traders. Quantitative hedge funds pay substantial amounts for these datasets because they provide information advantages before that information appears in standard financial reports. The edge erodes quickly once many participants have the same data, which is why alternative data providers constantly seek new sources.

References

Bauguess, S. W. (2017). The Role of Big Data, Machine Learning, and AI in Assessing Risks: A Regulatory Perspective. Speech at the OpRisk North America Conference, New York. Published by the U.S. Securities and Exchange Commission.
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. Foundational paper for one of the most widely used model families in financial ML applications.
Chen, T., and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. XGBoost is the dominant algorithm in production fraud and credit scoring systems.
U.S. Securities and Exchange Commission and U.S. Commodity Futures Trading Commission. (2010). Findings Regarding the Market Events of May 6, 2010. Joint report on the Flash Crash, describing algorithmic trading feedback dynamics.
Doshi-Velez, F., and Kim, B. (2017). Towards a Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608. Framework for thinking about model explainability requirements in high-stakes domains.

Key Takeaways

Finance adopted machine learning earlier than almost any other industry, driven by abundant structured data, fast feedback loops, and clear dollar-denominated value from model improvements.
Fraud detection, credit scoring, algorithmic trading, and risk modelling are the four domains with the deepest AI integration. Each uses different model types, operates at different time scales, and faces different failure modes.
Fraud detection models must balance false positive rates (blocking legitimate customers) against false negative rates (missing fraud). This balance is a business decision, not just a technical one.
Credit models face regulatory requirements for explainability and non-discrimination that constrain model complexity and require ongoing fairness audits.
Algorithmic trading models are subject to regime change and market impact, meaning they degrade as market conditions change and as their own trading behaviour alters the patterns they were designed to exploit.
The biggest ongoing challenge in financial AI is not model performance on historical data but model robustness when the world changes, whether through new fraud tactics, economic regime shifts, or market structure changes driven by the models themselves.

Knowledge Distillation: How Small Models Learn from Big Ones

2026-06-13T02:00:00+00:00

Introduction

Every year, the largest AI models get bigger. GPT-4, Gemini Ultra, Claude Opus: these models run on clusters of thousands of GPUs and cost hundreds of millions of dollars to train. Deploying them at scale costs dollars per thousand requests. For a startup, a hospital system, or a developer building a mobile app, that is simply not viable.

Knowledge distillation is one of the most practical answers to this problem. Instead of training a small model from scratch and accepting that it will be less capable, distillation trains the small model to mimic a large one. The large model, called the teacher, has already learned a rich internal representation of the world. The small model, called the student, learns not just from raw data labels but from the teacher's own output distributions, which contain far more information than a simple correct-or-incorrect signal.

The result is a student model that often punches well above its weight class. DistilBERT, distilled from BERT, retains about 97 percent of BERT's performance on standard benchmarks while being 40 percent smaller and 60 percent faster. Microsoft's Phi-3 Mini, a 3.8 billion parameter model, outperforms models many times its size on reasoning tasks, partly because it was trained on carefully distilled data derived from much larger models.

This guide explains how distillation works mechanically, why it works at all, and how to decide when it is the right tool for your deployment problem.

Problem Statement: The Cost Gap Between Training and Deployment

The AI industry faces a structural tension. State-of-the-art performance requires large models with hundreds of billions of parameters. But most real-world deployment constraints (latency budgets, memory limits, cost per query, edge hardware, offline inference) pull in the opposite direction. You cannot run a 70 billion parameter model on a smartphone. You cannot serve it at 10 milliseconds per response on a single CPU server. You cannot afford it if your application processes millions of queries per day on thin margins.

The naive solution is to train a smaller model. But smaller models trained from scratch on raw data are just less capable. The training signal available from a dataset of labelled examples has a ceiling, and small models hit that ceiling lower than large ones.

The insight behind knowledge distillation is that the large model, after training, has already extracted and compressed a great deal of knowledge about the problem into its parameters. Its output probabilities across all classes, not just the top prediction, encode subtle relationships: how similar concepts relate, which errors are plausible, which distinctions matter. A small model trained directly on this richer signal can learn more than one trained on raw labels alone, approaching the performance of the teacher at a fraction of its size and cost.

This problem is not unique to deep learning. The underlying idea, that a knowledgeable expert can teach a novice faster than the novice could learn from first principles, is as old as apprenticeship. Distillation is its formal implementation in machine learning.

Core Concepts and Terminology

Term	Plain English Definition
Teacher model	The large, pre-trained model whose knowledge is being transferred. It is frozen during distillation; only used to generate training signals for the student.
Student model	The smaller model being trained. Its goal is to approximate the teacher's behaviour while using fewer parameters and less compute at inference time.
Soft labels	The teacher's output probability distribution over all classes, as opposed to a hard label which is simply the correct class. Soft labels contain information about which wrong answers are plausible and how similar different outputs are.
Hard labels	The ground-truth correct answers from the training dataset. A hard label for an image of a cat is simply "cat," with no information about the model's uncertainty or the cat's similarity to a dog.
Temperature (T)	A parameter that divides the logits before the softmax, controlling how "soft" the resulting distribution becomes. Higher temperature spreads probability more evenly across all classes, revealing more information in the distribution. Applied to both teacher and student during distillation; set to 1 at test time.
Distillation loss	A loss function measuring the difference between the student's output distribution and the teacher's output distribution. Most often computed as KL divergence.
KL divergence	A measure of how different one probability distribution is from another. Used to push the student's output distribution toward the teacher's distribution.
Feature distillation	A variant where the student is also trained to match the teacher's internal intermediate representations, not just its final outputs.
Data-free distillation	A family of methods that perform distillation without the original training data, generating synthetic inputs using the teacher itself.
Logits	The raw unnormalized scores produced by the final layer of a neural network, before the softmax function converts them into probabilities.

How It Works: The Distillation Process

The mechanics of knowledge distillation are straightforward once you understand what soft labels contain and why temperature matters.

Train or obtain a large teacher model. This model is trained to high performance on your task using standard methods. It could be a model you trained yourself, or a pre-trained model like GPT-4 or Llama 3 that you are licensing or accessing via API. The teacher does not change during distillation.
Prepare the training data. You pass your training dataset through the teacher model and collect its outputs for every example, ideally the raw logits so that temperature can be applied afterwards. The resulting probability distributions become the soft labels. For a classification task with 1,000 classes, each soft label is a vector of 1,000 numbers that sum to one.
Apply temperature scaling to the teacher's outputs. Before using the teacher's softmax outputs as training targets, you divide the logits by a temperature value T (typically 2 to 20). At T=1, the distribution is the standard softmax. At T=5, the distribution spreads out, and the small probabilities on non-top classes become larger and more informative. This is where the "dark knowledge" lives: the teacher's belief that a cat image is 3% dog and 1% fox tells you something meaningful about visual similarity.
Define the student's combined loss function. The student is trained on a weighted combination of two losses. The first is the standard cross-entropy loss against the hard labels from your dataset. The second is the distillation loss, measuring the KL divergence between the student's temperature-scaled outputs and the teacher's temperature-scaled outputs. A typical weight is 90% distillation loss, 10% hard label loss, but this is tuned per task.
Train the student normally. With this combined loss, you train the student model using standard gradient descent. The student learns simultaneously from the ground truth data and from the teacher's probabilistic judgements.
Restore temperature to 1 for inference. When the student model is deployed, temperature is reset to 1 and the model produces standard probability distributions. The temperature was only needed during training to amplify the soft label signal.
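
The combined loss from the steps above can be sketched for a single example in plain Python. This is a minimal illustration under stated assumptions, not a training loop: the logits are made up, `alpha` is the weight on the distillation term, and the multiplication by T squared is a common convention from Hinton et al. (2015) for keeping gradient magnitudes comparable, which the steps above do not mention.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the maximum."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, true_class, temperature=4.0, alpha=0.9):
    """Weighted sum of the soft-label KL term and the hard-label cross-entropy term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # T^2 scaling is an assumed convention, see the note above.
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Hard-label cross-entropy uses the student's ordinary (T=1) probabilities.
    hard_loss = -math.log(softmax(student_logits)[true_class])
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Hypothetical logits for a three-class problem where class 0 is correct.
teacher = [4.0, 1.5, 0.2]
agreeing_student = [4.0, 1.5, 0.2]
disagreeing_student = [0.2, 1.5, 4.0]
print(f"agreeing:    {distillation_loss(agreeing_student, teacher, true_class=0):.4f}")
print(f"disagreeing: {distillation_loss(disagreeing_student, teacher, true_class=0):.4f}")
```

In a real framework the same two terms are computed over batches of tensors and the teacher's logits are held fixed while gradients flow only through the student.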

Practical Example: Distilling a Sentiment Classifier

Imagine you have a large BERT-large model (340 million parameters) that classifies customer reviews as positive, neutral, or negative with 94% accuracy. It takes 80 milliseconds per review on your server. You need sub-10 millisecond latency for a real-time dashboard.

You decide to distill it into a smaller 4-layer transformer with 66 million parameters. Here is what the process looks like in practice.

First, you run all 100,000 training reviews through BERT-large and soften its outputs at temperature T=4. For a strongly positive review, the teacher's standard (T=1) output might be: positive 91%, neutral 8%, negative 1%. At T=4, this becomes approximately positive 54%, neutral 29%, negative 17%. The neutral and negative signals, invisible in the hard label "positive," are now visible to the student.
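
The effect of temperature on a single prediction can be checked directly. The sketch below takes a T=1 probability vector, treats its log-probabilities as logits (valid because softmax is unchanged by adding a constant to every logit), and re-applies softmax at a higher temperature. The function name `soften` is hypothetical.

```python
import math


def soften(probs, temperature):
    """Re-apply softmax at `temperature`, using log-probabilities as the logits."""
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Teacher output at T=1 for a strongly positive review: positive, neutral, negative.
teacher_at_t1 = [0.91, 0.08, 0.01]
softened = soften(teacher_at_t1, temperature=4.0)

for label, before, after in zip(["positive", "neutral", "negative"], teacher_at_t1, softened):
    print(f"{label:>8}: T=1 {before:.0%} -> T=4 {after:.0%}")
```

Raising the temperature leaves the ranking of the classes unchanged but shrinks the gaps between them, which is what makes the minority classes visible to the student.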

The student then trains with 90% weight on these soft labels and 10% weight on the original hard labels. After 3 epochs, the student reaches 91% accuracy on the test set, compared to 94% for the teacher. But the student runs in 7 milliseconds, well within the latency budget, uses one-fifth the memory, and costs a fraction as much to serve.

The 3-percentage-point accuracy gap is the cost of the compression. Whether that trade-off is acceptable depends on your specific product requirements. For many applications, 91% accuracy with 10x faster inference is the right answer.

Advantages

Better Accuracy Than Training a Small Model from Scratch

A small model trained from scratch on your dataset is limited by what the dataset can teach it. Distillation lets the student access the teacher's implicit knowledge about relationships, ambiguities, and uncertainty, which is not present in the raw labels. The student can therefore achieve accuracy that a from-scratch trained model of the same size could not.

Faster Inference at Deployment

The primary reason to distill is deployment economics. A student that is 3x to 10x smaller typically runs several times faster and cheaper, although the exact speedup depends on the hardware, batch size, and architecture. For applications where the teacher's full capability is not needed on every query, this is a straightforward win.

Works Across Modalities

Distillation is not specific to text classification. It has been applied to image classification, object detection, speech recognition, code generation, and large language model fine-tuning. The core mechanism, training a student on the teacher's output distributions, applies anywhere the teacher produces probability distributions.

Preserves Model Interpretability Options

Because the student is a standard neural network of your choosing, you can select an architecture that supports interpretability methods. You could distill a black-box ensemble into a smaller model with attention mechanisms that are easier to audit.

Enables On-Device and Edge Deployment

Models that would never fit in a smartphone's memory or a browser's WebAssembly environment can be distilled into versions that do. Apple uses distillation extensively to build on-device models for features like Siri and autocorrect that run without a network connection.

Limitations and Trade-offs

Performance Gap Below Teacher

Distillation narrows the gap between a small and large model, but it does not close it entirely. The student will almost always be somewhat less accurate than the teacher. If your task requires the absolute maximum performance and you have the infrastructure to serve a large model, distillation may not be the right choice.

Requires Access to Teacher Outputs

Standard distillation requires that you can run inference on the teacher model and collect its output probabilities. If your teacher is a closed model accessible only through an API, you may not have access to full probability distributions. Some APIs return only the top prediction or a confidence score, not the full softmax distribution, which limits what you can extract.

Training Cost Is Not Zero

Distillation requires running your entire training dataset through the teacher model (which may be expensive via API) and then training the student. For very large datasets and very large teachers, the teacher inference pass alone can be costly. You are trading training cost for deployment cost savings, and the payoff requires sufficient deployment volume to justify the upfront expense.

Distribution Shift Sensitivity

If the teacher was trained on data from a different distribution than your deployment data, the soft labels it produces may not generalise well to your use case. Distilling a general-purpose language model into a domain-specific student works best when the teacher has at least some competence on the target domain.

Hyperparameter Sensitivity

The temperature T and the weighting between soft and hard label losses are significant hyperparameters that require tuning. The optimal values vary substantially across tasks and architectures. Getting distillation to work well requires experimentation, which adds time to your development cycle.

Common Mistakes

Using Temperature 1 for the Soft Labels

At temperature 1, the teacher's output distribution for a correctly predicted example is already very peaked, with nearly all probability mass on the correct class. The soft label is almost identical to the hard label, and the student gains almost no additional information. Always experiment with temperatures above 1, typically between 2 and 10, to reveal the dark knowledge in the distribution.

Choosing a Student Architecture That Is Too Small

Distillation cannot create something from nothing. A student with a fraction of a percent of the teacher's capacity will hit a capacity wall regardless of the quality of the training signal. The student must be large enough to represent the behaviour the teacher is demonstrating. A rule of thumb is to start with a student that is 20% to 50% the size of the teacher and compress further only if initial results are acceptable.

Distilling to a Completely Different Architecture Without Feature Matching

Output-only distillation works well when student and teacher share a similar architectural family. When they are very different (distilling a large transformer into a convolutional network, for example), output-only distillation often struggles. In these cases, adding intermediate feature matching, training the student to also match the teacher's internal representations, significantly improves outcomes.

Skipping Evaluation on Task-Specific Metrics

Distillation is often evaluated on benchmark accuracy, but your actual task may care about precision, recall, F1, calibration, or latency at a specific percentile. A student that matches the teacher on accuracy may perform very differently on these metrics. Always evaluate against what actually matters in your deployment.

Assuming Distillation Fixes a Bad Teacher

Distillation transfers what the teacher knows, including its biases, failure modes, and calibration errors. If the teacher is poorly calibrated or biased on certain subpopulations, the student will inherit these problems. Distillation is not a model improvement technique; it is a model compression technique.

Best Practices

Start with Output Distillation, Add Feature Distillation If Needed

Output distillation (matching only the final softmax distributions) is simpler to implement and often sufficient. Start there. If the performance gap between student and teacher is larger than acceptable, add intermediate layer matching: pick one or two internal layers in the teacher and train the student to match their activations via an adapter projection.

Tune Temperature with a Validation Set

Run a quick sweep over temperatures (2, 4, 8, 16) and measure validation accuracy for each. The optimal temperature varies significantly by task. Higher temperatures work better when the teacher is very confident on most examples. Lower temperatures work better when the teacher's distributions are already soft.

Use the Teacher for Data Augmentation

Generate additional synthetic training examples by prompting the teacher on edge cases, out-of-distribution inputs, or augmented versions of your data. Label these with the teacher's soft outputs. This is particularly effective for language tasks where you can generate varied phrasings of the same underlying query.

Consider Progressive Distillation for Very Large Compression Ratios

If you need to compress a model by more than 10x, distilling directly to the final size often leaves too large a performance gap. Consider distilling in stages: first from the teacher to a medium intermediate model, then from the intermediate model to the final small student. Each step is a more tractable compression ratio.

Comparison: Model Compression Approaches

Method	How It Works	Best For	Key Trade-off
Knowledge Distillation	Train a smaller student model to mimic a larger teacher's output distributions	Achieving near-teacher accuracy in a smaller model; any modality	Requires teacher inference pass; student is still a trained model of its own
Quantization	Reduce the numerical precision of model weights from 32-bit floats to 8-bit or 4-bit integers	Reducing memory and speeding up inference on the same model	Some accuracy loss; may require calibration data; hardware dependent
Pruning	Remove individual weights, neurons, or attention heads that contribute little to model output	Creating sparse models; reducing parameter count without retraining from scratch	Irregular sparsity is hard to accelerate; structured pruning loses more accuracy
Architecture Search (NAS)	Automatically find a smaller architecture that achieves good performance on your task	Finding the most efficient architecture for a given accuracy target	Very computationally expensive to run; requires task-specific search
LoRA / Adapter Fine-tuning	Add small trainable modules to a frozen large model instead of updating all parameters	Efficient fine-tuning of large models; does not reduce deployment model size	Does not reduce inference cost; base model must still be served

Frequently Asked Questions

Does distillation always make a worse model than the teacher?

Almost always, yes, in the sense that the student will have somewhat lower performance than the teacher on the teacher's original training distribution. However, a distilled student often outperforms a model of the same size trained from scratch on hard labels alone, because the teacher's soft labels provide a richer training signal than raw dataset labels. In other words, the student is typically better than a model of its size trained without distillation, even if it does not surpass the teacher.

What is "dark knowledge" and why does it matter?

Dark knowledge is Geoffrey Hinton's term for the information encoded in the non-top probabilities of a teacher's softmax distribution. When a teacher classifies an image of a BMW as "automobile" with 98% confidence, it also assigns small probabilities to "truck" (1.5%) and "van" (0.3%). These small values encode the teacher's learned understanding that automobiles, trucks, and vans share visual features. A student trained only on the hard label "automobile" never sees this relationship. Dark knowledge transfers structural understanding of the problem, not just which answer is correct.

Can you distill from a model you do not have weights for, like GPT-4?

Yes, but with limitations. If you only have API access, you can use the model's text outputs as training data for a smaller model. This approach, which this guide calls dataset distillation (not to be confused with the logit-matching output distillation described under Best Practices), generates a large set of (prompt, response) pairs using the teacher and trains the student on them directly. You lose the soft label signal (you only get generated text, not probability distributions), but you still benefit from the teacher's knowledge being encoded in the generated outputs. This is how many smaller instruction-following models are trained.

How is Phi-3 related to distillation?

Microsoft's Phi-3 models are a prominent example of the "textbook quality" data approach that the Phi team introduced with Phi-1. Rather than distilling softmax distributions, the approach trains a small model (3.8B parameters for Phi-3-mini) on heavily filtered web data combined with a large corpus of high-quality synthetic data generated by larger language models. The result is a model that performs remarkably well on reasoning and coding benchmarks despite its small size, because much of its training data reflects the implicit structure of the larger models' understanding. It is a form of data-level distillation rather than logit-level distillation.

When should I use distillation versus quantization?

These are not mutually exclusive and are often combined. Quantization is faster to apply (no retraining required, just post-processing) and works well when you need to reduce memory usage of an existing model. Distillation requires retraining but produces a smaller model that is more portable and can be further quantized afterward. Use quantization when you have a trained model and want to reduce its footprint with minimal engineering effort. Use distillation when you have flexibility in the student architecture and want to maximise performance at a target size. Use both when you need the deepest compression ratio.

References

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531. The foundational paper introducing the temperature-scaled soft label framework.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Demonstrated that BERT could be compressed 40% with only a 3% performance drop.
Abnar, S., and Zuidema, W. (2020). Quantifying Attention Flow in Transformers. ACL 2020. Relevant background on intermediate feature matching in transformer distillation.
Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Carignan, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. (2023). Textbooks Are All You Need. arXiv preprint arXiv:2306.11644. Describes the Phi-1 approach to data-quality-driven small model training.
Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. (2014). FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550. Introduced intermediate feature matching as an extension to output-only distillation.

Key Takeaways

Knowledge distillation trains a small student model to mimic a large teacher's output distributions, not just match hard labels. This richer training signal allows the student to exceed what its size alone would normally allow.
Temperature scaling is the key mechanism that makes soft labels informative. Higher temperature spreads probability mass across all classes, revealing the teacher's beliefs about similarity and ambiguity.
The student's loss is a weighted combination of distillation loss (matching the teacher's soft outputs) and standard cross-entropy loss (matching hard labels). The distillation component typically receives most of the weight.
Distillation does not require the student to have the same architecture as the teacher. Any architecture that produces probability distributions can be the student.
Distillation is not a replacement for architecture search, quantization, or pruning. It is most effective when combined with those methods, applied at different stages of the compression pipeline.
The main practical decision is access to teacher outputs. Full probability distributions give the best training signal. If only API text outputs are available, dataset distillation (training on the teacher's generated text) is still substantially better than training on unfiltered data alone.

Decision Trees: A Complete Guide with Hand-Worked Examples

2026-06-13T02:00:00+00:00

Introduction

A decision tree is one of the most intuitive models in machine learning. It makes predictions by asking a sequence of yes/no questions about the input features, branching left or right at each node until it reaches a leaf that gives the prediction. Every path from root to leaf is a human-readable rule: "if area > 2000 and bedrooms > 3, predict price > $450k."

That interpretability is why decision trees are used in medical diagnosis, credit risk, and fraud detection — domains where you need to explain the reasoning behind a decision, not just report a number. They are also the building block of the most powerful ensemble methods: Random Forest trains hundreds of trees in parallel; XGBoost trains them in sequence. Understanding a single tree is therefore a prerequisite for understanding both.

1. The Core Idea: Find the Best Split

Building a tree is a recursive process. At each node, the algorithm asks: which feature, and which threshold, produces the most informative split? "Most informative" means the two resulting groups are as pure as possible, where pure means predominantly one class.

There are two standard measures of impurity that define what "best" means:

Gini impurity — used by CART (the algorithm behind scikit-learn's implementation)
Entropy / Information Gain — used by ID3 and C4.5

Both measures agree in most practical situations. We will derive and use both.

2. Gini Impurity

Gini impurity measures the probability that a randomly chosen element from a node would be misclassified if it were labelled according to the class distribution at that node.

\[ \text{Gini}(S) = 1 - \sum_{k=1}^{K} p_k^2 \]

where $p_k$ is the proportion of class $k$ at node $S$, and $K$ is the number of classes.

A perfectly pure node (all one class) has Gini = 0
A maximally impure binary node (50/50 split) has Gini = 0.5

When we evaluate a split, we compute the weighted Gini of the two child nodes:

\[ \text{Gini}_{\text{split}} = \frac{n_L}{n} \cdot \text{Gini}(L) + \frac{n_R}{n} \cdot \text{Gini}(R) \]

We choose the split that minimises $\text{Gini}_{\text{split}}$.
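
The two formulas above translate almost line for line into code. This is a small sketch with made-up label lists; the function names are illustrative and not taken from any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = gini(["No"] * 4)             # all one class
balanced = gini(["No", "Yes"] * 3)  # 50/50 binary node

# Hypothetical split: a pure left child and a mixed right child.
split = gini_split(["No"] * 4, ["No", "No", "Yes", "Yes", "Yes", "Yes"])
print(pure, balanced, round(split, 3))
```

The pure node evaluates to 0 and the 50/50 node to 0.5, matching the two boundary cases listed above.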

3. Information Gain and Entropy

Entropy measures the average unpredictability of a node's class distribution:

\[ H(S) = - \sum_{k=1}^{K} p_k \log_2 p_k \]

A pure node has entropy 0. A 50/50 binary split has entropy 1 bit. Information Gain is the reduction in entropy achieved by the split:

\[ \text{IG}(S, f) = H(S) - \left[ \frac{n_L}{n} H(L) + \frac{n_R}{n} H(R) \right] \]

We choose the feature and threshold that maximises information gain.
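
Entropy and information gain can be sketched the same way. The label lists are hypothetical and the helper names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of the class distribution in a node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

# Hypothetical node with 5 Yes and 5 No, split into two mostly pure children.
parent = ["Yes"] * 5 + ["No"] * 5
left = ["Yes"] * 4 + ["No"]
right = ["Yes"] + ["No"] * 4
print(f"IG = {information_gain(parent, left, right):.3f} bits")
```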

4. Worked Example: Building a Tree by Hand

Suppose we have 10 loan applicants and want to predict whether they default (D = Yes) or repay (D = No) based on two features: income level (High / Low) and credit score (Good / Poor).

#	Income	Credit Score	Default?
1	High	Good	No
2	High	Good	No
3	High	Poor	No
4	Low	Good	No
5	Low	Poor	Yes
6	Low	Poor	Yes
7	Low	Poor	Yes
8	High	Poor	No
9	Low	Good	No
10	Low	Poor	Yes

The root node has 6 No and 4 Yes. Its entropy is:

\[ H(\text{root}) = -\tfrac{6}{10}\log_2\tfrac{6}{10} - \tfrac{4}{10}\log_2\tfrac{4}{10} \approx 0.971 \text{ bits} \]

Step 1: Evaluate the split on Income

Income = High: rows {1,2,3,8} → 4 No, 0 Yes → $H = 0$

Income = Low: rows {4,5,6,7,9,10} → 2 No, 4 Yes → $H = -\tfrac{2}{6}\log_2\tfrac{2}{6} - \tfrac{4}{6}\log_2\tfrac{4}{6} \approx 0.918$

\[ \text{IG}(\text{Income}) = 0.971 - \left[\tfrac{4}{10}(0) + \tfrac{6}{10}(0.918)\right] \approx 0.971 - 0.551 = 0.420 \text{ bits} \]

Step 2: Evaluate the split on Credit Score

Credit = Good: rows {1,2,4,9} → 4 No, 0 Yes → $H = 0$

Credit = Poor: rows {3,5,6,7,8,10} → 2 No, 4 Yes → $H \approx 0.918$

\[ \text{IG}(\text{Credit}) = 0.971 - \left[\tfrac{4}{10}(0) + \tfrac{6}{10}(0.918)\right] \approx 0.420 \text{ bits} \]

Both splits give the same information gain here. We pick Income arbitrarily (ties are broken by index). The High Income branch is already pure (all No). On the Low Income branch, we recurse.

Step 3: Recurse on the Low Income branch

6 samples: rows {4,5,6,7,9,10}. Credit Score = Good → {4,9} both No (pure). Credit Score = Poor → {5,6,7,10} all Yes (pure). The second split on Credit Score produces two pure leaves. The tree is done.

Rule	Prediction
Income = High	No Default
Income = Low AND Credit = Good	No Default
Income = Low AND Credit = Poor	Default
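
The hand calculation can be double-checked with a short script. The sketch below encodes the ten applicants from the table, recomputes the information gain for both features, and confirms that the three rules classify every training row correctly. The function names are illustrative.

```python
import math
from collections import Counter

# The ten applicants from the table: (income, credit score, default?)
ROWS = [
    ("High", "Good", "No"), ("High", "Good", "No"), ("High", "Poor", "No"),
    ("Low", "Good", "No"), ("Low", "Poor", "Yes"), ("Low", "Poor", "Yes"),
    ("Low", "Poor", "Yes"), ("High", "Poor", "No"), ("Low", "Good", "No"),
    ("Low", "Poor", "Yes"),
]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature_index):
    """Information gain from splitting rows on one categorical feature."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[feature_index], []).append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def predict(income, credit):
    """The three rules read off the finished tree."""
    if income == "High":
        return "No"
    return "No" if credit == "Good" else "Yes"

ig_income = info_gain(ROWS, 0)
ig_credit = info_gain(ROWS, 1)
all_correct = all(predict(inc, cred) == label for inc, cred, label in ROWS)
print(f"IG(Income) = {ig_income:.3f}, IG(Credit) = {ig_credit:.3f}, rules fit all rows: {all_correct}")
```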

5. Splitting on Continuous Features

Real datasets have continuous features like age, income as a dollar amount, or transaction size. For a continuous feature, the algorithm tests every possible threshold (midpoints between adjacent sorted values) and computes information gain for each. The threshold that produces the highest information gain is chosen.

For a feature with $n$ unique values, there are $n-1$ candidate thresholds. This is why tree training on large datasets can be slow: for each node, every feature's thresholds must be evaluated. XGBoost's histogram-based split finding is a direct optimization of this step.

Worked Example: Choosing a Threshold

Suppose we have 6 applicants with a continuous Age feature and a loan default label:

#	Age	Default?
1	22	No
2	25	No
3	30	Yes
4	35	Yes
5	38	Yes
6	42	No

Root entropy: 3 No, 3 Yes → $H = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.0$ bit.

Candidate thresholds (midpoints between consecutive sorted ages): 23.5, 27.5, 32.5, 36.5, 40.

Evaluating Age ≤ 27.5:
Left (Age ≤ 27.5): rows {1,2} → 2 No, 0 Yes → $H = 0$
Right (Age > 27.5): rows {3,4,5,6} → 1 No, 3 Yes → $H = -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \approx 0.811$

\[ \text{IG}(\text{Age} \leq 27.5) = 1.0 - \left[\tfrac{2}{6}(0) + \tfrac{4}{6}(0.811)\right] \approx 1.0 - 0.541 = 0.459 \text{ bits} \]

Evaluating Age ≤ 40:
Left (Age ≤ 40): rows {1,2,3,4,5} → 2 No, 3 Yes → $H \approx 0.971$
Right (Age > 40): rows {6} → 1 No, 0 Yes → $H = 0$

\[ \text{IG}(\text{Age} \leq 40) = 1.0 - \left[\tfrac{5}{6}(0.971) + \tfrac{1}{6}(0)\right] \approx 1.0 - 0.809 = 0.191 \text{ bits} \]

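Of the five candidates, the threshold Age ≤ 27.5 gives the highest information gain (0.459 bits) and is selected as the best split. The algorithm repeats this process for every feature at every node, always choosing the best split available at that node. The search is greedy, so each choice is locally optimal rather than optimal for the tree as a whole.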
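
The full threshold search for this example is only a few lines of code. The sketch below evaluates every midpoint between adjacent sorted ages and picks the one with the highest information gain. The function names are illustrative.

```python
import math
from collections import Counter

ages = [22, 25, 30, 35, 38, 42]
defaults = ["No", "No", "Yes", "Yes", "Yes", "No"]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def threshold_gains(values, labels):
    """Information gain for each midpoint between adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    parent = entropy(ys)
    gains = {}
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between identical values
        threshold = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gains[threshold] = parent - children
    return gains

gains = threshold_gains(ages, defaults)
best_threshold = max(gains, key=gains.get)  # 27.5 for this data

for threshold, gain in gains.items():
    print(f"Age <= {threshold}: IG = {gain:.3f} bits")
```

A production implementation would sort each feature once and update class counts incrementally instead of recomputing entropy from scratch at every threshold.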

6. Decision Trees for Regression

When the target is continuous, impurity is replaced by variance reduction (or equivalently, minimizing mean squared error). Each leaf predicts the mean of training targets that fell into it.

\[ \text{MSE}(S) = \frac{1}{n} \sum_{i \in S} (y_i - \bar{y}_S)^2 \]

The split that most reduces the weighted MSE of the two child nodes is chosen. This is exactly how regression trees in Random Forest and gradient boosting work.

Worked Example: Regression Split

Suppose we want to predict house price from house size:

Size (sq ft)	Price ($k)
900	150
1100	200
1400	280
1800	350
2200	420

Root mean: $\bar{y} = (150+200+280+350+420)/5 = 280$. Root MSE = $\frac{1}{5}[(150-280)^2 + (200-280)^2 + (280-280)^2 + (350-280)^2 + (420-280)^2] = \frac{1}{5}[16900 + 6400 + 0 + 4900 + 19600] = 9560$.

Evaluating Size ≤ 1250 (splitting {900,1100} from {1400,1800,2200}):
Left mean = 175, Left MSE = $\frac{1}{2}[(150-175)^2 + (200-175)^2] = 625$
Right mean = 350, Right MSE = $\frac{1}{3}[(280-350)^2 + (350-350)^2 + (420-350)^2] \approx 3267$

\[ \text{Weighted MSE} = \tfrac{2}{5}(625) + \tfrac{3}{5}(3267) = 250 + 1960 = 2210 \]

Variance reduction = 9560 − 2210 = 7350. The algorithm compares this against all other candidate thresholds and picks the one with the largest variance reduction. At prediction time, a new house with Size 1050 falls in the left leaf and gets predicted price $175k (the mean of that leaf's training targets).
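
The comparison across all candidate thresholds can be sketched as follows. The code works with sums of squared errors so that the arithmetic stays exact for this data. The helper names are illustrative.

```python
sizes = [900, 1100, 1400, 1800, 2200]   # already sorted
prices = [150, 200, 280, 350, 420]

def sse(ys):
    """Sum of squared deviations from the mean (n times the MSE)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def variance_reduction(xs, ys, threshold):
    """Parent MSE minus the size-weighted MSE of the two children."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    weighted_mse = (sse(left) + sse(right)) / len(ys)
    return sse(ys) / len(ys) - weighted_mse

# Candidate thresholds are the midpoints between adjacent sizes.
thresholds = [(a + b) / 2 for a, b in zip(sizes, sizes[1:])]
reductions = {t: variance_reduction(sizes, prices, t) for t in thresholds}
best = max(reductions, key=reductions.get)

for t, r in reductions.items():
    print(f"Size <= {t}: variance reduction = {r}")
```

On this toy data the thresholds 1250 and 1600 happen to tie at a variance reduction of 7350, so which one is chosen depends on the tie-breaking rule. Here `max` returns the first, 1250.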

7. Controlling Tree Complexity

An unconstrained tree will grow until every leaf is pure, perfectly fitting the training set and badly overfitting. Several hyperparameters control this:

Hyperparameter	Effect	sklearn name
Max depth	Limits the number of splits from root to leaf. Depth 1 = a single decision stump.	`max_depth`
Min samples per leaf	A leaf must have at least this many training samples. Prevents splits on tiny subgroups.	`min_samples_leaf`
Min samples to split	A node must have at least this many samples before it can be split further.	`min_samples_split`
Max features	At each split, only consider a random subset of features. Reduces correlation between trees in ensembles.	`max_features`
Min impurity decrease	A split is only made if it reduces impurity by at least this amount.	`min_impurity_decrease`

Practical Guidance on Choosing Values

max_depth is the most important lever. Start with 3–5 for most tabular datasets. A depth-3 tree has at most 8 leaves, which is interpretable and often surprisingly effective. Only increase depth once you have confirmed that the model is underfitting (high training error, not just high test error).

min_samples_leaf directly prevents the tree from memorizing noise. A common rule of thumb: set it to at least 1% of your training set size. For a 10,000-row dataset, min_samples_leaf=100 means no leaf can represent fewer than 100 examples — small enough to be specific, large enough to be reliable. For regression, larger values (5%+) are often better since noise in continuous targets is harder to filter.

min_samples_split is typically set to 2 * min_samples_leaf. There is rarely a reason to tune it independently.

max_features matters most in ensemble contexts. For a single tree used for interpretability, leave it at None (use all features). In Random Forest, max_features="sqrt" (the scikit-learn default for RandomForestClassifier) decorrelates trees effectively. For gradient boosting, a value around 0.5–0.8 acts as column subsampling.

The general tuning strategy: fix max_depth first, then adjust min_samples_leaf to reduce overfitting, then use cost-complexity pruning (Section 8) to fine-tune.

8. Cost-Complexity Pruning

Cost-complexity pruning, the most common form of post-pruning, builds the full tree first, then removes branches that provide little benefit. It adds a regularization term weighted by $\alpha$ that penalizes tree complexity:

\[ R_\alpha(T) = R(T) + \alpha |T| \]

where $R(T)$ is the training error and $|T|$ is the number of leaves. Higher $\alpha$ produces a smaller, more regularized tree. The optimal $\alpha$ is found by cross-validation. In scikit-learn this is controlled by ccp_alpha.
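
To see how $\alpha$ shifts the preferred tree size, the sketch below scores a few candidate subtrees with the formula above. The error values and leaf counts are invented for illustration and do not come from a real dataset.

```python
# Hypothetical candidate subtrees: (name, training error R(T), number of leaves |T|)
candidates = [
    ("full tree", 0.02, 40),
    ("pruned to 12 leaves", 0.06, 12),
    ("pruned to 6 leaves", 0.10, 6),
    ("root only", 0.40, 1),
]

def cost_complexity(error, leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|."""
    return error + alpha * leaves

def best_subtree(alpha):
    """Name of the candidate with the lowest penalized cost at this alpha."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[2], alpha))[0]

for alpha in (0.0, 0.005, 0.01, 0.1):
    print(f"alpha={alpha}: {best_subtree(alpha)}")
```

With no penalty the full tree wins on training error alone. As $\alpha$ grows, each extra leaf has to earn its place, and the preferred subtree shrinks step by step toward the root.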

How to Find the Right ccp_alpha

Scikit-learn exposes the full pruning path via cost_complexity_pruning_path(), which returns the effective alphas and corresponding impurities at each pruning step. You cross-validate over this set of alphas to find the one that maximises validation accuracy:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Build the full tree first to get candidate alphas.
# X_train and y_train are assumed to be an existing training split.
full_tree = DecisionTreeClassifier(random_state=42)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # exclude the last alpha, which prunes the tree down to the root

# Cross-validate each alpha
cv_scores = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

# Pick the alpha with the best CV score
best_alpha = ccp_alphas[np.argmax(cv_scores)]
print(f"Best ccp_alpha: {best_alpha:.5f}")

# Train final model
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
final_tree.fit(X_train, y_train)

As an illustration of what this can look like (the actual numbers depend entirely on the dataset): the full unpruned tree has 40+ leaves and 72% test accuracy; after pruning with the optimal alpha it has 6 leaves and 87% test accuracy. When that happens, the pruned tree is both more accurate and more interpretable, a rare win in both directions.

When to use pruning vs pre-pruning hyperparameters: Use max_depth and min_samples_leaf when you have a clear interpretability requirement and want a fixed-size tree. Use ccp_alpha when you want to let the data determine the tree shape — it finds the pruning level that optimally trades off training fit and tree complexity.

9. Advantages and Limitations

Advantages	Limitations
Fully interpretable — every prediction has a traceable rule path	High variance — small changes in training data produce very different trees
Handles both categorical and continuous features (natively in CART and C4.5; scikit-learn requires categorical features to be numerically encoded)	Prone to overfitting without depth or leaf constraints
Requires no feature scaling or normalization	Axis-aligned splits cannot capture diagonal decision boundaries efficiently
Handles missing values with surrogate splits (in classic CART; scikit-learn does not implement surrogate splits)	Biased toward features with more unique values when using information gain
Fast to train and predict on tabular data	A single tree rarely achieves competitive accuracy on its own

The high variance problem is the most practically important limitation. If you resample your training data (for example, draw a bootstrap sample or a random 80% subsample) and retrain the tree, you often get a completely different structure. This instability means single trees are unreliable for most production use cases. It is the primary motivation for ensemble methods.

The axis-aligned splits limitation means decision trees struggle with features that only matter in combination. If a class boundary is "x + y > 5", a tree needs many splits to approximate a diagonal boundary, while logistic regression captures it in one linear term. For these patterns, trees need to be deep (and therefore prone to overfitting) to compete.

The feature bias in information gain is a real concern: features with many unique values (like a customer ID) appear highly informative simply because they can partition the data into many small pure groups. C4.5 addresses this with the Gain Ratio, which divides information gain by the entropy of the split itself and so penalises splits into many small groups. CART reduces the problem by making only binary splits, although Gini-based trees can still favour high-cardinality features. Scikit-learn uses Gini with binary splits by default.
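
The bias is easy to reproduce on the ten-applicant loan data from Section 4 by adding a customer ID column. In the sketch below, the ID column is a hypothetical addition and the split is treated as multiway, as in ID3 and C4.5. Gain ratio is computed as information gain divided by the entropy of the partition sizes.

```python
import math
from collections import Counter

def entropy(items):
    n = len(items)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_and_ratio(feature_values, labels):
    """Information gain and gain ratio for a multiway split on one feature."""
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)  # entropy of the partition itself
    return gain, (gain / split_info if split_info > 0 else 0.0)

labels = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes"]
income = ["High", "High", "High", "Low", "Low", "Low", "Low", "High", "Low", "Low"]
customer_id = list(range(1, 11))  # hypothetical identifier column

income_gain, income_ratio = gain_and_ratio(income, labels)
id_gain, id_ratio = gain_and_ratio(customer_id, labels)
print(f"Income:      IG = {income_gain:.3f}, gain ratio = {income_ratio:.3f}")
print(f"Customer ID: IG = {id_gain:.3f}, gain ratio = {id_ratio:.3f}")
```

By raw information gain the ID column looks best, because every single-row group is pure. After dividing by the split information, Income ranks higher, which is the behaviour the normalisation is meant to produce.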

Despite these limitations, a decision tree is the right choice when the model's decisions must be explained to a non-technical audience, when regulatory requirements demand auditability (credit scoring, medical diagnosis), or when you want a fast interpretable baseline before moving to an ensemble.

10. From Trees to Ensembles

The high variance of a single decision tree is its main weakness. The insight behind ensemble methods is that averaging many high-variance, low-bias models can dramatically reduce variance without materially increasing bias:

Random Forest trains many trees on bootstrap samples of the data and averages their predictions. Each tree sees a random subset of features at each split, decorrelating the trees so that averaging produces a meaningful reduction in variance.
Gradient Boosting (XGBoost, LightGBM) trains shallow trees sequentially, each correcting the residuals of the ensemble so far. The depth constraint keeps individual trees weak (high bias), and the boosting process reduces bias at the ensemble level.

Both approaches work because of properties of the individual tree: interpretable splits, no feature scaling requirement, and good handling of tabular data with mixed feature types.
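
The variance argument for Random Forest can be made concrete with a standard textbook result. It is stated here under the simplifying assumption that all $B$ trees have the same prediction variance $\sigma^2$ and the same pairwise correlation $\rho$. The symbols $B$, $\sigma^2$, $\rho$, and $T_b$ are notation introduced for this illustration.

\[ \text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 \]

As $B$ grows, the second term shrinks toward zero and the variance approaches $\rho\,\sigma^2$. Adding more trees therefore helps only up to a point. The remaining variance is set by how correlated the trees are, which is why sampling a random subset of features at each split matters.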

11. Python Implementation

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train with depth constraint and cost-complexity pruning
clf = DecisionTreeClassifier(
    criterion="gini",
    max_depth=4,
    min_samples_leaf=5,
    ccp_alpha=0.01,
    random_state=42
)
clf.fit(X_train, y_train)

print(f"Train accuracy: {accuracy_score(y_train, clf.predict(X_train)):.3f}")
print(f"Test accuracy:  {accuracy_score(y_test,  clf.predict(X_test)):.3f}")

# Visualize
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(clf, feature_names=load_iris().feature_names,
          class_names=load_iris().target_names,
          filled=True, ax=ax)
plt.tight_layout()
plt.savefig("decision_tree.png", dpi=150)

Frequently Asked Questions

What is the difference between Gini impurity and information gain?

Both measure node purity but use different formulas. Gini impurity measures the probability of misclassifying a randomly chosen sample and is used by CART (scikit-learn). Information gain measures the entropy reduction from a split and is used by ID3 and C4.5. In practice they produce very similar trees; Gini is slightly faster to compute since it avoids a logarithm.

How do you prevent a decision tree from overfitting?

The main controls are max_depth, min_samples_split, and min_samples_leaf. Limiting depth is the simplest approach. Post-pruning with cost-complexity pruning (the ccp_alpha parameter in scikit-learn) removes the branches that contribute the least error reduction relative to the leaves they add. Cross-validation is used to tune the right pruning strength.

When should you use a decision tree over other models?

Use decision trees when interpretability is a hard requirement — medical diagnosis, credit decisions, and compliance contexts where you must explain the prediction. They also require no feature scaling and handle mixed data types natively. For pure predictive accuracy, Random Forest or XGBoost almost always outperform a single tree.

How does a decision tree relate to Random Forest and XGBoost?

Both ensemble methods are built on decision trees. Random Forest trains many deep trees in parallel on random data and feature subsets, then averages their predictions (bagging). XGBoost trains shallow trees in sequence, where each tree corrects the residual errors of the previous ones (boosting). Understanding a single tree is the prerequisite for understanding both.

Key Takeaways

A decision tree recursively splits data by choosing the feature and threshold that maximises information gain (entropy reduction) or minimises Gini impurity at each node.
Unpruned trees overfit. Control complexity with max_depth, min_samples_leaf, and ccp_alpha.
For regression, variance reduction (MSE) replaces impurity as the split criterion; each leaf predicts the mean of its training targets.
A single tree has high variance. Random Forest reduces that variance by averaging many trees, and gradient boosting improves accuracy by correcting errors sequentially; both give up the easy interpretability of a single tree in exchange.

References

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81–106.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). Springer.
Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

LLM-as-Judge: How to Evaluate AI Models Automatically at Scale

2026-06-11T02:00:00+00:00

Introduction

Evaluating a language model is harder than it sounds. For classification tasks with a fixed set of correct answers, automated metrics work fine. But most of what makes a language model useful is not captured by exact match accuracy. Is the explanation clear? Is the tone appropriate? Is the response helpful without being verbose? Does the code work correctly even though it looks different from the reference solution?

These questions require judgment, and judgment has traditionally meant human annotators. Human evaluation is the gold standard, but it is slow, expensive, and difficult to run at scale. A model deployed to millions of users generates outputs faster than any annotation team can review them. Running an A/B test between two model versions, or evaluating a new model against a benchmark of 10,000 open-ended questions, is impractical if every output requires a human read.

LLM-as-judge addresses this by using a capable language model as the evaluator. Rather than asking a person to score a response, you ask a model. The result is automated evaluation that can run at any scale, at low cost, and in near real time. This post explains how it works, when it is reliable, and how to avoid the failure modes that make it misleading.

Problem Statement

The fundamental challenge in evaluating generative AI is that quality is multidimensional and context dependent. A correct answer that is condescending is worse than a slightly less precise answer that respects the user. A technically accurate code snippet that introduces a security vulnerability is worse than a slightly less elegant version that is safe. Traditional metrics like BLEU, ROUGE, and perplexity do not capture these dimensions.

Human evaluation captures them, but at a cost: expert annotators are expensive, inter-annotator agreement on subjective dimensions is often low, and annotation throughput is fundamentally limited. For organizations running continuous deployment of AI systems, the evaluation bottleneck can slow iteration cycles significantly and make it impossible to catch regressions before they reach users.

LLM-as-judge offers a middle path: evaluation that is faster and cheaper than human annotation but more nuanced than reference-based metrics. It is not a replacement for human evaluation but a way to extend human judgment to scales that human annotators cannot reach. The key insight is that judging quality is often easier than generating it: a model that cannot reliably produce excellent responses can often still distinguish better responses from worse ones.

Core Concepts and Terminology

Term	Definition
LLM-as-judge	Using a large language model to evaluate the outputs of another model, assigning scores or preferences based on a rubric or comparison.
Pointwise evaluation	Scoring a single model output in isolation, typically on a numerical scale or categorical label (e.g., 1-5, or poor/acceptable/good).
Pairwise evaluation	Presenting the judge with two responses to the same input and asking which is better, producing a preference rather than an absolute score.
Rubric	A set of criteria given to the judge model specifying what dimensions to evaluate and what constitutes high versus low quality on each dimension.
Position bias	A tendency for the judge model to prefer the response presented first (or second), regardless of actual quality.
Verbosity bias	A tendency for judge models to prefer longer responses, even when brevity is more appropriate.
Self-enhancement bias	A tendency for a model to prefer outputs that resemble its own outputs or align with its own training, creating a conflict of interest when a model judges itself or a closely related model.
MT-Bench	A multi-turn benchmark where GPT-4 is used as the judge to evaluate chat model responses, one of the first widely adopted LLM-as-judge benchmarks.
Calibration set	A curated sample of examples with known human judgments, used to validate whether an LLM judge's scores correlate reliably with human assessment before using it at scale.

How It Works

LLM-as-judge is a prompt engineering task at its core, but the design of that prompt determines whether the results are meaningful or misleading.

Choose the evaluation mode. Pointwise evaluation scores each response independently. Pairwise evaluation compares two responses head to head. Pairwise judgments tend to be more reliable because comparing is easier than scoring in isolation, but they produce a ranking rather than an absolute measure, and the number of comparisons grows quadratically with the number of models being compared.
Write a precise rubric. The judge model needs to know what to evaluate. A vague instruction like "score the quality of this response" produces inconsistent results. A rubric specifying the dimensions (accuracy, clarity, completeness, appropriate tone), what each score on the scale means, and any domain specific standards produces much more consistent and interpretable scores. The rubric is the primary lever on evaluation quality.
Include the full context. The judge needs the original question or prompt alongside the response being evaluated. Without this, it cannot assess relevance, appropriateness, or whether the response actually addresses the request. In agentic systems, this may include the full conversation history and tool outputs.
Ask for a rationale before the score. Prompting the judge to explain its reasoning before giving a score, a chain-of-thought approach, improves consistency and makes the evaluation auditable. You can read the rationale to understand what the judge was attending to and identify cases where its reasoning is flawed.
Run multiple trials and aggregate. For any given response, running the judge multiple times with temperature above zero and averaging the scores reduces variance. Variance in judge scores is a signal about evaluation uncertainty. High variance means the evaluation is not stable and should not be trusted without more trials.
Control for known biases. For pairwise evaluation, swap the order of the two responses in a second evaluation and compare results. If the judge picks whichever response is shown first in both orderings, position bias is driving the result, not quality. Preferences that stay the same across both orderings are more trustworthy.
Validate against human judgments. Calibrate your judge setup on a sample where human evaluations are available. If the judge's rankings correlate strongly with human rankings on that sample, you have evidence it is measuring something real. If not, revisit the rubric before trusting the judge at scale.
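
The steps above can be sketched as a small pairwise harness. This is a minimal illustration, not a definitive implementation: `call_judge` is a hypothetical callable standing in for whatever model API you use, and the prompt wording, the `Verdict:` output convention, and the `resp_1`/`resp_2` labels are all assumptions made for the example.

```python
PROMPT_TEMPLATE = """You are evaluating two responses to the same question.
Rubric: judge accuracy, completeness, and appropriate tone.
Question: {question}
Response A: {first}
Response B: {second}
Explain your reasoning, then finish with one line: Verdict: A, Verdict: B, or Verdict: tie"""

def parse_verdict(judge_output):
    """Read the verdict from the last line; unparseable output counts as a tie."""
    lines = judge_output.strip().splitlines()
    if not lines:
        return "tie"
    last_line = lines[-1].strip().lower()
    if last_line.endswith("verdict: a"):
        return "A"
    if last_line.endswith("verdict: b"):
        return "B"
    return "tie"

def judge_pair(call_judge, question, resp_1, resp_2):
    """Compare two responses in both orders and keep only consistent results."""
    forward = parse_verdict(call_judge(
        PROMPT_TEMPLATE.format(question=question, first=resp_1, second=resp_2)))
    reverse = parse_verdict(call_judge(
        PROMPT_TEMPLATE.format(question=question, first=resp_2, second=resp_1)))
    # Map positional verdicts (A/B) back to the underlying responses
    winner_forward = {"A": "resp_1", "B": "resp_2", "tie": "tie"}[forward]
    winner_reverse = {"A": "resp_2", "B": "resp_1", "tie": "tie"}[reverse]
    # Disagreement across orderings signals position bias or an unstable judgment
    return winner_forward if winner_forward == winner_reverse else "inconsistent"

def always_first_judge(prompt):
    """Stub judge that always prefers position A, to show how bias is caught."""
    return "Both are plausible.\nVerdict: A"

print(judge_pair(always_first_judge, "How do I reset my password?", "reply one", "reply two"))
```

Asking for the reasoning before the verdict line follows the rationale-first step, and treating order-dependent results as inconsistent rather than picking a side keeps position bias out of the win rate.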

Practical Example

Suppose you are developing a customer support chatbot and want to evaluate whether a new model version produces better responses than the existing one. You have 5,000 question-response pairs from production logs where human agents had to intervene, suggesting the original model's response was inadequate.

You generate responses from both the old model and the new model to each of the 5,000 questions. You then run a pairwise LLM judge on each pair, presenting both responses in randomized order and asking the judge to determine which response would better resolve the customer's issue, with a specific rubric covering accuracy, resolution completeness, and appropriate tone. You run each comparison twice with the responses in opposite order to detect position bias.

After removing the pairs where the judge gives a tie or shows clear position bias, the judge prefers the new model in 67 percent of the remaining pairs. You spot-check 50 cases manually and confirm the judge's calls are reasonable in 88 percent of them. You have automated, scalable evidence that the new model is better on this distribution, achieved in a few hours rather than the weeks it would take to collect equivalent human annotations.

The 12 percent disagreement rate between the judge and human reviewers is expected and acceptable for this use case. Before shipping, you also run a manual review on the cases flagged as highest-stakes by the judge, ensuring that the automated evaluation did not miss critical safety or compliance issues.

Advantages

Scales to Any Volume

Running a judge model costs roughly the same per evaluation as running the model being evaluated. There is no human bottleneck. This means you can evaluate every output in a production system, run full benchmark sweeps on every model checkpoint, and detect regressions in near real time. Scale is the primary reason LLM-as-judge has become standard practice in AI development pipelines.

Captures Qualitative Dimensions

Unlike reference-based metrics, an LLM judge can evaluate tone, clarity, relevance, and helpfulness — dimensions that matter for user experience but have no ground truth string to compare against. A response that is factually correct but needlessly condescending will score poorly on a well-designed rubric, as it should. These subjective quality signals are what distinguish a usable product from a technically correct one.

Fast Iteration Cycles

Being able to evaluate a model change on thousands of examples in an hour rather than weeks enables rapid iteration on model improvements. Development teams can test a new prompt, a fine-tuned checkpoint, or a context engineering change and get quality signal the same day. This speed advantage compounds over a development cycle: more iterations means more opportunities to catch problems and improve quality.

Consistent Rubric Application

A well-prompted judge applies the same criteria every time. Human annotators vary in interpretation, attention, and fatigue over long annotation sessions. Consistency, even imperfect consistency, has value for comparative evaluation where you need to measure changes in quality across model versions. Consistent measurement of a relative change is more actionable than noisy measurement of an absolute level.

Auditable Reasoning

With chain-of-thought prompting, the judge's reasoning is visible and can be inspected, disagreed with, or used to understand what properties are driving scores. When a judge marks a response poorly, you can read why. This transparency is absent from reference-based metrics, which give you a number but no explanation of what drove it.

Limitations and Trade-offs

Biases Compound and Are Hard to Measure

Judge models carry the same biases as any language model: preferences for verbosity, confidence in fluent-sounding text regardless of accuracy, and stylistic preferences from their training. These biases become measurement artifacts in your evaluations. Worse, they are difficult to quantify without the human calibration set that many teams skip building. An evaluation system with unmeasured biases produces results that feel authoritative but may be systematically wrong.

Cannot Catch Factual Errors It Does Not Know About

A judge model evaluates plausibility based on its training. If the correct answer to a question is a recent fact the judge was not trained on, it may mark a wrong answer correct because it sounds right. This is particularly concerning for domains where facts change frequently: financial data, medical guidelines, regulatory requirements, current events. The judge's knowledge cutoff is a hard ceiling on its factual checking ability.

Self-Evaluation Is Unreliable

Asking a model to judge its own outputs, or outputs from a model closely related to it, introduces a conflict of interest that is difficult to remove through prompt engineering alone. Self-enhancement bias causes models to systematically prefer their own stylistic patterns and reasoning approaches. Always use a different judge model from the model being evaluated, and prefer a model from a different training lineage when possible.

Calibration Varies by Domain

A judge that correlates well with human judgments on general text may perform poorly on specialized domains like medical, legal, or technical content where the judge has limited domain expertise. Domain-specific vocabulary, implicit conventions, and specialized correctness criteria require a judge that has been trained on or calibrated against domain expert annotations. General-purpose judges applied to specialized domains produce unreliable results.

Does Not Replace Human Evaluation for High-Stakes Decisions

Deploying a model to production, publishing a safety evaluation, or making consequential decisions about model quality should not rest on LLM-as-judge alone. The stakes are too high and the failure modes too systematic. LLM-as-judge is a production accelerator for routine quality monitoring; it is not a safety gate for decisions where errors have real consequences.

Common Mistakes

Using a Vague Rubric

Instructions like "evaluate quality" give the judge too much latitude and produce inconsistent, uninterpretable scores. The judge will infer its own criteria, which may not match what you care about. Define exactly what you are measuring and what each point on your scale means. A rubric is not done until a person reading it could score responses the same way the model does.

Not Checking for Position Bias

If you run pairwise evaluations in a single order without swapping, position bias can dominate your results. A common finding is that the first response is preferred 55-65 percent of the time regardless of actual quality. Always run comparisons in both orders and check for consistency. Pairs where the preferred response changes with ordering should be flagged as ties or excluded.

Treating Judge Scores as Ground Truth

LLM judge scores are a proxy for quality. They are useful for relative comparisons, trend detection, and regression monitoring, but they are not ground truth for absolute quality claims. Validate them against human judgment on a calibration set before relying on them for decisions that affect product quality or safety.

Using the Same Model as Judge and Evaluated Model

This creates self-enhancement bias that inflates scores for the evaluated model in ways that do not reflect actual quality improvements. Use the strongest available independent model as the judge. If you are evaluating GPT-4o outputs, do not use GPT-4o as the judge. Use Claude, Gemini, or another model from a different training lineage.

Ignoring the Variance in Scores

A single judge evaluation has meaningful variance. Running the same evaluation multiple times and reporting the variance tells you how confident the evaluation is. Low-variance evaluations are more trustworthy than high-variance ones. A result of "Model A preferred in 55% of comparisons" means something very different depending on whether the standard error of that estimate is 1 percent or 8 percent.
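
As a rough illustration of how much the sample size matters, the sketch below computes the standard error of a win rate using the normal approximation for a proportion. It assumes the comparisons are independent and that ties have already been excluded; the function name and the example counts are made up for illustration.

```python
import math

def win_rate_with_interval(wins, total, z=1.96):
    """Return the win rate, its standard error, and a ~95% confidence interval."""
    p = wins / total
    se = math.sqrt(p * (1 - p) / total)  # standard error of a proportion
    return p, se, (p - z * se, p + z * se)

# The same 55% win rate measured on 2,500 comparisons and on 40 comparisons
for wins, total in ((1375, 2500), (22, 40)):
    p, se, (low, high) = win_rate_with_interval(wins, total)
    print(f"n={total}: win rate {p:.2f}, SE {se:.3f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With 2,500 comparisons the interval stays above 50 percent; with 40 it spans well below 50 percent, so the same headline number supports very different conclusions.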

Best Practices

Write Rubrics Collaboratively with Domain Experts

Write rubrics collaboratively with domain experts and iterate on them using the cases where judge results surprise you. The quality of the rubric is the primary driver of evaluation quality, and domain experts can identify dimensions and failure modes that generalists miss. Plan to spend at least as much time on rubric design as on judge model selection.

Always Include Chain-of-Thought Reasoning

Always include a chain-of-thought step in your judge prompt, asking for reasoning before the score. It improves consistency and makes the evaluation interpretable. When the judge reasons poorly before giving a score, the reasoning makes that visible. Without the reasoning step, a bad score looks the same as a good one.

Build and Maintain a Calibration Set

Build a calibration set of 100 to 500 examples with human judgments. Measure how well your judge setup correlates with that ground truth before using it at scale. Maintain the calibration set over time, adding new examples when you discover failure modes. A calibration set is the only reliable signal that your judge is measuring something real.

Match the Evaluation Mode to the Decision

Use pairwise evaluation when comparing two systems; use pointwise when you need absolute quality thresholds rather than relative rankings, such as determining whether responses meet a minimum bar before deployment. The choice affects what statistical analysis is appropriate downstream and what decisions the results can support.

Report Variance Alongside Point Estimates

Report confidence intervals and variance alongside point estimates. A result of "Model A is preferred in 55% of comparisons" with high variance is very different from the same number with low variance. Reporting only point estimates misleads stakeholders about how much confidence to place in the comparison.

Version Your Judge Prompts and Rubrics

Maintain a changelog of your judge prompts and rubrics. When evaluation methodology changes, historical comparisons are invalidated. Versioning evaluation methodology prevents silent regressions where a quality improvement appears to occur because the judge changed rather than the model. Treat your evaluation system with the same discipline as your training pipeline.

Comparison: Evaluation Methods

Method	Speed	Cost	Qualitative dimensions	Bias risk
Human annotation	Slow	High	Yes	Human inconsistency, annotator fatigue
Reference-based metrics (BLEU, ROUGE)	Very fast	Very low	No	Penalizes valid paraphrases, rewards superficial matches
LLM-as-judge (pointwise)	Fast	Low to moderate	Yes	Verbosity bias, self-enhancement, factual blind spots
LLM-as-judge (pairwise)	Fast	Moderate (quadratic scaling)	Yes	Position bias; mitigated by order randomization
Automated unit tests	Very fast	Very low	Only what tests explicitly check	Tests only what was anticipated
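
The quadratic scaling noted for pairwise evaluation can be made concrete by counting judge calls. This is a back-of-the-envelope sketch; the function names are illustrative, and it assumes every pair of models is compared on every prompt.

```python
import math

def pointwise_judge_calls(num_models, num_prompts):
    """One judge call per model output."""
    return num_models * num_prompts

def pairwise_judge_calls(num_models, num_prompts, both_orders=True):
    """One call per model pair per prompt, doubled when both orderings are run."""
    pairs = math.comb(num_models, 2)
    return pairs * num_prompts * (2 if both_orders else 1)

for n in (2, 5, 10):
    print(n, pointwise_judge_calls(n, 1000), pairwise_judge_calls(n, 1000))
```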

Frequently Asked Questions

Which model should I use as a judge?

Use the most capable model available that is not the model being evaluated. In practice, GPT-4o and Claude 3.5 Sonnet are commonly used as judges for evaluation of mid-tier models. The judge should be at least as capable as the model being judged, ideally more capable, because a weaker judge cannot reliably identify the failures of a stronger model.

How do I know if my LLM judge is actually reliable?

Build a calibration set: a set of examples where you have both LLM judge scores and human evaluation scores. Compute the correlation or agreement rate between them. Agreement above 80 percent on pairwise judgments is a reasonable threshold for confidence. Below that, revisit your rubric and judge model selection before using the evaluation at scale.
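
A minimal sketch of that check, assuming the judge and the human reviewers each produce one label per pair ("A", "B", or "tie"). The labels below are invented for illustration, and the 0.80 cutoff is the rule of thumb from the answer above rather than a universal standard.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of calibration examples where the judge matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Hypothetical calibration labels for ten pairwise comparisons
judge = ["A", "B", "A", "A", "tie", "B", "A", "B", "A", "B"]
human = ["A", "B", "B", "A", "tie", "B", "A", "A", "A", "B"]

rate = agreement_rate(judge, human)
print(f"Judge-human agreement: {rate:.0%}")
if rate < 0.80:
    print("Below threshold: revisit the rubric or judge model before scaling up")
```

Raw agreement does not correct for agreement that would occur by chance, so on heavily imbalanced label sets a chance-corrected statistic may be a better fit.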

Can I use LLM-as-judge for safety evaluation?

With significant caution. Safety evaluation using LLM-as-judge is used in practice, but the stakes of false negatives (judging an unsafe output as safe) are high. LLM judges can be manipulated by adversarial inputs and can miss subtle policy violations. Safety evaluation should include human review and red-teaming alongside automated methods, not replace them.

Is pairwise or pointwise evaluation better?

Pairwise tends to be more reliable for model comparisons because the task of "which is better" is easier and less dependent on calibration than "what score does this deserve on a 1-5 scale." Pointwise is better when you need absolute quality thresholds rather than relative rankings, such as determining whether responses meet a minimum bar before deployment.

How should I handle cases where the judge gives a tie?

Ties are useful information: they mean the judge cannot distinguish a meaningful quality difference. Report the tie rate alongside win rates. A high tie rate on a pairwise comparison suggests the two models being compared are close in quality on that distribution, which is itself a valid finding. Do not force the judge to break ties artificially — the forced break introduces noise rather than signal.

References

Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36 (NeurIPS 2023).
Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). AlpacaEval: An Automatic Evaluator of Instruction-following Language Models. GitHub Repository.
Wang, P., Li, L., Chen, L., Zhu, D., Lin, B., Cao, Y., ... & Sui, Z. (2023). Large Language Models are not Fair Evaluators. arXiv preprint arXiv:2305.17926.
Liusie, A., Manakul, P., & Gales, M. J. (2024). LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models. arXiv preprint arXiv:2307.07889.
Shen, T., Jin, R., Huang, Y., Liu, C., Dong, W., Guo, Z., ... & Cheng, X. (2023). Large Language Model Alignment: A Survey. arXiv preprint arXiv:2309.15025.
Zeng, Z., Yu, J., Gao, T., Meng, Y., Goyal, T., & Chen, D. (2024). Evaluating Large Language Models at Evaluating Instruction Following. International Conference on Learning Representations.
Shankar, S., Zamfirescu-Pereira, J., Hartmann, B., Parameswaran, A., & Arawjo, I. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST '24).

Key Takeaways

LLM-as-judge automates evaluation by using a capable model to score or compare outputs, enabling quality assessment at a scale and speed that human annotation cannot match.
The quality of the rubric is the single most important factor. Vague instructions produce vague results; precise rubrics produce actionable scores.
Known biases, including position bias, verbosity bias, and self-enhancement bias, must be actively controlled for rather than ignored.
Always validate your judge setup against human judgments on a calibration set before trusting it at scale. Correlation with human judgment is the only reliable signal that the judge is measuring something real.
LLM-as-judge complements but does not replace human evaluation for high-stakes decisions, safety assessments, or novel domains where the judge has limited coverage.

Edge AI: Running LLMs on Your Phone Without the Cloud

2026-06-10T02:00:00+00:00

Introduction

For most of the history of AI, the assumption was simple: powerful models live in data centers, and devices are thin clients that send data up and receive results back. Running a large language model required racks of GPUs, megawatts of power, and a reliable internet connection.

That assumption no longer holds. Models like Phi-3-mini, Gemma 2B, and Mistral 7B run comfortably on a modern smartphone. Apple Intelligence processes most requests entirely on the device, never sending your messages, photos, or documents to a server. Google's Gemini Nano powers features in Pixel phones with no network call required. Edge AI, the practice of running machine learning models directly on the device where data is generated, has moved from research curiosity to shipping product.

This post explains how on-device AI works, why it matters, what its real limitations are, and where it is already delivering better results than cloud-based approaches.

Problem Statement

Cloud-based AI has three structural problems that on-device AI addresses directly.

The first is privacy. When you send a message to a cloud AI service, that message travels to a server, is processed, and a response comes back. Along the way, your data passes through networks, is logged by servers, and may be stored, reviewed, or used to train future models. For applications involving personal health data, private conversations, financial records, or sensitive business information, this is a meaningful concern that is difficult to resolve with contractual assurances alone.

The second is latency. Even with fast internet, a round trip to a remote server takes time. For real-time applications like live transcription, instant translation, or interactive on-screen assistance, even 200 milliseconds of network latency is perceptible. On-device inference eliminates that round trip entirely, making millisecond response times achievable for the right model sizes.

The third is availability. Cloud AI requires a connection. Devices often do not have one, or the connection is slow or unreliable. An AI feature that stops working in a tunnel, on a flight, or in a rural area with poor signal is a degraded experience. On-device models work regardless of connectivity, delivering consistent behavior in all conditions.

Core Concepts and Terminology

Term	Definition
Edge AI	Running AI model inference directly on the end-user device rather than on a remote server.
Quantization	A technique that reduces model size by representing weights with lower-precision numbers (e.g., 4-bit integers instead of 32-bit floats), trading a small amount of accuracy for large reductions in memory and compute.
Neural Processing Unit (NPU)	A dedicated chip found in modern smartphones and laptops, designed specifically to accelerate neural network operations with high efficiency and low power consumption.
Small Language Model (SLM)	A language model with a parameter count small enough (typically 1B to 7B parameters) to run on consumer hardware without requiring a data center GPU.
Model pruning	Removing weights from a trained model that contribute little to its output, reducing size with minimal accuracy loss.
Knowledge distillation	Training a smaller model (the student) to mimic the behavior of a larger model (the teacher), transferring capability into a smaller footprint.
Private Cloud Compute (PCC)	Apple's architecture that routes AI requests requiring more compute than the device can handle to cloud servers with strong privacy guarantees, verified through cryptographic attestation.
GGUF / llama.cpp	A file format (GGUF) and an open-source runtime (llama.cpp) that together allow quantized language models to run efficiently on consumer CPUs and GPUs, including Apple Silicon and x86 machines.

How It Works

Running a language model on a phone sounds impossible until you understand the techniques that make it practical. Here is what happens under the hood.

Start with a capable but smaller model. Frontier models like GPT-4 are far larger (their exact sizes are often undisclosed, but estimates run to hundreds of billions of parameters or more) and require enormous memory. Edge AI uses models in the 1B to 7B parameter range, which are still capable at many tasks but fit within the memory budget of a phone. Microsoft's Phi-3-mini and Google's Gemma 2B were designed specifically for this use case, trained on high-quality curated data to maximize capability at small parameter counts.
Quantize the weights. A 7B parameter model stored in 32-bit floating point requires roughly 28 GB of memory. The same model quantized to 4-bit integers requires about 3.5 GB, comfortably fitting in the RAM of a modern flagship phone. Quantization reduces precision but modern techniques (like GPTQ and AWQ) recover most of the lost quality through careful calibration on representative data.
Use the NPU for acceleration. The Apple Neural Engine in the A17 Pro (iPhone 15 Pro) and A18 (iPhone 16) chips, the Qualcomm Hexagon NPU in Android flagships, and similar chips in mid-range devices are optimized for the matrix multiplication operations that dominate transformer inference. Routing computation through the NPU achieves significantly better tokens-per-second than the CPU at a fraction of the power draw, enabling interactive speeds without draining the battery.
Load the model once, keep it in memory. On a phone, startup latency matters. On-device frameworks keep the model loaded in memory so that inference can begin immediately without a model load step on every request. The model loads once when the application starts, and subsequent inferences run without that overhead.
Return results locally. The generated tokens never leave the device. The entire inference loop runs on-chip. No network call is made unless the task explicitly requires external data, such as fetching a web page or calling an API.
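
The memory figures in the quantization step follow from simple arithmetic, sketched below. The function name is illustrative, and the estimate covers weight storage only: it ignores the KV cache, activations, and the small overhead that quantized formats add for scales and offsets.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B-parameter model at several precisions
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 32-bit floats to 4-bit integers cuts weight storage by a factor of eight, which is what moves a 7B model from data center territory into the RAM budget of a flagship phone.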

Practical Example

Apple Intelligence on the iPhone 15 Pro, iPhone 15 Pro Max, and the full iPhone 16 lineup is the most widely deployed example of on-device language model inference as of 2026. When you use Writing Tools to rewrite a paragraph, the request goes to a language model running on the Apple Neural Engine, not to a server. The text you are editing never leaves your device. The response appears in about the same time it would take a cloud model to respond, but without any network round trip.

For tasks that require more compute than the device model can handle, such as generating a complex image or answering a research question, Apple's Private Cloud Compute architecture routes the request to cloud servers running Apple Silicon hardware. Crucially, these servers publish cryptographic attestations of their software configuration that any device can verify. Apple states that data sent to PCC is not accessible to Apple or to anyone else.

This hybrid design, on-device for common tasks and privacy-preserving cloud for demanding ones, is the architecture that most serious edge AI deployments are converging on. The on-device model handles the high-frequency, latency-sensitive, privacy-critical cases. The cloud handles the low-frequency, high-complexity cases with stronger privacy protections than conventional cloud AI services offer.

Advantages

True Privacy by Default

Data that never leaves the device cannot be logged, stored, or leaked. For applications involving sensitive personal data, this is not just a feature; it is a prerequisite. On-device inference changes the privacy model fundamentally: instead of trusting a third-party server operator's data handling practices, users retain direct control over their data by never transmitting it in the first place.

Zero Latency from Network Round Trips

On-device inference is bounded only by the hardware, not by network conditions. For real-time features, this makes a perceptible difference in responsiveness. Live transcription, keyboard autocorrect, image tagging, and document classification all benefit from sub-50ms response times that cloud inference cannot reliably achieve over consumer networks.

Works Offline, Always

On-device models function in the absence of any network connection. Features that depend on cloud AI degrade or disappear without connectivity. On-device features do not. For applications used in transportation, field work, healthcare settings with restricted connectivity, or simply in everyday contexts where network reliability varies, offline capability is a significant practical advantage.

Lower Per-Request Cost at Scale

Cloud inference incurs a compute cost for every request. On-device inference has no marginal cost per request for the developer once the device is in a user's hands. For applications with very high query volume, such as keyboard suggestions, real-time translation, or continuous audio processing, this economic difference is significant. The compute runs on hardware the user has already paid for, so the application developer is not billed on a per-query basis.

Reduced Regulatory Complexity

Applications that process personal data on-device often find it simpler to comply with data protection regulations like GDPR and HIPAA because no personal data is transmitted or stored externally. On-device processing can reduce the scope of a data processing agreement, simplify a compliance posture, and enable applications in regulated industries that cannot risk transmitting sensitive data to third-party servers.

Limitations and Trade-offs

Smaller Models, Lower Capability Ceiling

A 3B parameter quantized model will not match the reasoning capability of a 70B parameter cloud model on complex tasks. For multi-step reasoning, broad factual recall, nuanced creative writing, or tasks requiring knowledge of recent events, cloud models still win by a meaningful margin. The gap is closing with each generation of small models, but it has not closed.

Memory Constraints Are Real

Even with quantization, running a language model alongside other apps requires careful memory management. On devices with less than 8 GB of RAM, performance degrades noticeably or models cannot load at all without aggressive compression that further reduces quality. Not all devices your users carry are flagship devices, and the distribution of device capabilities in your user base matters for feature design.

Battery Impact Under Sustained Load

Neural network inference is computationally intensive. Sustained on-device inference draws more power than most other tasks a phone performs. Short queries on a well-optimized NPU are manageable, but long-running agentic tasks or continuous audio processing can meaningfully reduce battery life. Thermal throttling under sustained load also reduces performance over time.

Fragmented Hardware Ecosystem

The gap between flagship devices and mid-range or budget devices is significant. An experience that runs smoothly on an iPhone 16 Pro may be unusably slow on a 3-year-old mid-range Android phone. On Android in particular, the diversity of hardware configurations means that performance testing must cover a representative range of devices, not just the models your team carries.

Update Lag Compared to Cloud

Cloud models can be updated instantly for all users. On-device models are bundled with software updates, which take time to roll out and depend on users installing them. A model with a discovered bias or error cannot be corrected overnight for the entire user base. This matters most for safety-critical applications where model behavior needs to be updatable in response to discovered issues.

Common Mistakes

Assuming On-Device Always Means Worse Quality

For short-form tasks such as summarization, quick classification, and text transformation, a small on-device model often performs comparably to a large cloud model. The quality gap is largest on knowledge-intensive and multi-step reasoning tasks. Evaluate your specific use case before concluding that cloud inference is required: with the right task scope, an on-device model can be entirely sufficient.

Ignoring Thermal Throttling in Benchmarks

Many device benchmarks run a model for a short burst. Real applications run inference repeatedly over time. Sustained inference triggers thermal throttling that reduces performance significantly on most devices. Test with sustained load patterns that match your actual usage, not just peak burst performance. A model that runs at 30 tokens per second in a benchmark may run at 12 tokens per second after five minutes of continuous use.
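One way to apply this is to record throughput per time window across a long run instead of reporting a single burst figure. The sketch below is a minimal harness under stated assumptions: `generate_tokens` is a hypothetical callable standing in for whatever inference call your runtime exposes, and it is assumed to return the number of tokens each call produced.

```python
import time
from typing import Callable, List


def sustained_throughput(
    generate_tokens: Callable[[], int],
    duration_s: float,
    window_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Run inference repeatedly and return tokens/sec for each time window.

    generate_tokens is a hypothetical stand-in for a real inference call;
    it must return how many tokens that call produced.
    """
    rates: List[float] = []
    start = clock()
    window_start = start
    window_tokens = 0
    while clock() - start < duration_s:
        window_tokens += generate_tokens()
        now = clock()
        if now - window_start >= window_s:
            # Close the current window and start a new one.
            rates.append(window_tokens / (now - window_start))
            window_start = now
            window_tokens = 0
    return rates


def throttling_ratio(rates: List[float]) -> float:
    """Last-window throughput as a fraction of first-window throughput."""
    if len(rates) < 2:
        raise ValueError("need at least two windows to compare")
    return rates[-1] / rates[0]
```

A ratio well below 1.0 over a session-length run suggests the burst number overstates what users will experience. The `clock` parameter exists only so the harness can be exercised with a simulated clock.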

Treating All Edge Deployments as Equivalent

Running a model on an NPU-equipped flagship phone, a laptop with Apple Silicon, a Raspberry Pi, and an IoT microcontroller are four entirely different engineering problems with different memory budgets, compute profiles, power envelopes, and software toolchains. Learnings from one do not transfer directly to another. Scope your deployment target early and design for it specifically.

Skipping Quantization Evaluation on Your Task

Different quantization levels have different accuracy trade-offs for different tasks and domains. A 4-bit quantized model that performs well on general reasoning benchmarks may perform significantly worse on medical terminology, legal language, or code in unusual programming languages. Evaluate quantized models on your specific use case rather than assuming published benchmarks reflect your workload.

Best Practices

Choose Model Size with Memory Headroom

Choose the model size that fits within the device's memory budget with headroom for other processes. Tight memory margins cause system pressure, background process termination, and degraded user experience. A model that uses 80 percent of available RAM on a target device will behave unpredictably in real usage where other apps compete for memory.

Route Computation Through the NPU

Use the device's dedicated neural processing unit rather than the CPU. The power efficiency and throughput difference is substantial: NPU inference typically delivers 3x to 10x better tokens-per-second per watt compared to CPU inference. Most on-device AI frameworks (Core ML, ONNX Runtime, MediaPipe) route to the NPU automatically when available, but verify this in your specific configuration.

Evaluate Quantization on Your Specific Task

Evaluate quantized model quality on your specific task and domain before committing to a quantization level. General benchmarks are a starting point, not a final answer. Run your evaluation on a representative sample of the inputs your application will actually process, including edge cases and domain-specific vocabulary.
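A task-specific comparison can be quite small. The following sketch assumes each candidate (for example a full-precision baseline and a 4-bit variant) is wrapped in a hypothetical `predict` callable that maps an input string to a label, and that you supply your own labeled examples. The `max_drop` tolerance is an illustrative parameter, not a recommended value.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the model output matches the expected label."""
    if not examples:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    return correct / len(examples)


def compare_quantization_levels(
    candidates: Dict[str, Callable[[str], str]],
    examples: List[Example],
    baseline: str,
    max_drop: float,
) -> Dict[str, bool]:
    """Mark each candidate acceptable if its accuracy is within max_drop of the baseline."""
    scores = {name: accuracy(fn, examples) for name, fn in candidates.items()}
    floor = scores[baseline] - max_drop
    return {name: score >= floor for name, score in scores.items()}
```

Exact-match accuracy suits classification-style tasks; for generative tasks you would likely swap in a scoring function appropriate to your domain.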

Design Hybrid Systems Thoughtfully

Design systems that use on-device models for common, latency-sensitive tasks and route demanding tasks to the cloud with appropriate privacy protections. The routing decision should be transparent to users where possible, and the fallback behavior when cloud routing is unavailable should be explicitly designed, not left as an error state.
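The routing decision and its fallback can be made explicit in code. This is a minimal sketch under assumed criteria: the `Request` fields, the `max_local_tokens` threshold, and the `DEGRADED_ON_DEVICE` fallback are all hypothetical choices for illustration, not part of any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on_device"
    CLOUD = "cloud"
    # Explicit fallback: answer locally with reduced quality and tell the user.
    DEGRADED_ON_DEVICE = "degraded_on_device"


@dataclass
class Request:
    prompt_tokens: int
    needs_recent_knowledge: bool


def choose_route(req: Request, cloud_available: bool, max_local_tokens: int = 2048) -> Route:
    """Pick a route; thresholds here are illustrative assumptions."""
    demanding = req.prompt_tokens > max_local_tokens or req.needs_recent_knowledge
    if not demanding:
        return Route.ON_DEVICE
    if cloud_available:
        return Route.CLOUD
    # Cloud is unreachable: degrade deliberately instead of surfacing an error.
    return Route.DEGRADED_ON_DEVICE
```

Returning a distinct value for the fallback case lets the interface show the user which path was taken.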

Test on Your Actual Device Distribution

Test on the actual device distribution your users have, not just the latest flagship. The performance gap between device tiers is wide. Identify the minimum supported device specification early and verify acceptable performance on it before shipping. Monitor performance metrics by device model in production to catch regressions on specific hardware.

Monitor Battery and Thermal Behavior Under Real Usage

Monitor battery and thermal behavior under real usage patterns, not just peak benchmark conditions. Set power budgets for your inference workload and test whether the application stays within them over a realistic session length. Users notice battery drain more quickly than they notice quality improvements.

Comparison: On-Device vs. Cloud AI

Dimension	On-Device	Cloud
Privacy	Data stays on device by default	Data transmitted to external servers
Latency	No network round trip	Network-dependent, typically 100-500ms additional
Offline capability	Full functionality	Requires connectivity
Model capability	Limited by device hardware	Virtually unlimited compute
Per-request cost	Zero marginal cost	Billed per token
Update speed	Dependent on app/OS update rollout	Instant for all users
Battery impact	Higher on sustained use	Network only; compute offloaded

Frequently Asked Questions

What phones can actually run a language model today?

Any iPhone from the iPhone 15 Pro onward, with A17 Pro or newer chips, can run Apple Intelligence on-device. On Android, devices with Qualcomm Snapdragon 8 Gen 2 or newer, or Google's Tensor G3 or newer, have sufficient NPU capability. Mid-range devices with 8 GB or more RAM can run smaller quantized models through apps like llamafile or MLC Chat, though more slowly. Phones with 4 GB or less RAM will struggle with most language models.

Are on-device models actually private?

Inference on a device you control, using a model stored locally, is private in the meaningful sense: the data does not leave the device during processing. Caveats apply: the app using the model may still transmit data for other purposes, and the model itself was trained on data elsewhere. On-device inference addresses the inference-time privacy concern, not the entire data lifecycle.

How much smaller are on-device models than cloud models?

Cloud models like GPT-4 are estimated at several hundred billion parameters. On-device models typically range from 1B to 7B parameters before quantization. After 4-bit quantization, a 3B model might occupy around 1.5 GB of memory and a 7B model around 3.5 GB. The quality gap is real but narrowing rapidly as smaller models are trained more efficiently on better data.
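The memory figures above follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below counts weight storage only and assumes 1 GB = 10^9 bytes; the KV cache, activations, and runtime overhead add to this in practice.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB, ignoring KV cache and runtime overhead."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (3, 7):
    print(f"{size}B at 4-bit: {weight_memory_gb(size, 4):.1f} GB")
# prints:
# 3B at 4-bit: 1.5 GB
# 7B at 4-bit: 3.5 GB
```

The same function shows why quantization matters: at 16 bits per weight, the 3B model would need about 6 GB for weights alone.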

Is Apple Intelligence actually private?

For on-device tasks, yes: no data leaves the device. For tasks routed to Private Cloud Compute, Apple has published significant technical detail about how the architecture prevents access to user data even by Apple employees. External security researchers have been given access to verify these claims. It represents a significantly stronger privacy model than conventional cloud AI services, though it still involves sending data to infrastructure Apple operates.

Can I run a local model on my laptop today?

Yes, and relatively easily. Tools like Ollama, LM Studio, and llamafile allow anyone with a modern laptop to download and run quantized language models with a few commands. On Apple Silicon MacBooks, the unified memory architecture is particularly well-suited to this, allowing larger models than phones can handle. A MacBook Pro with 16 GB of RAM can comfortably run a 7B to 13B parameter model at useful speeds.

References

Apple. (2024). Apple Intelligence Overview. Apple Machine Learning Research.
Apple. (2024). Private Cloud Compute: A new frontier for AI privacy in the cloud. Apple Security Research.
Abdin, M., Aneja, J., Awadalla, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219.
Gemma Team. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295.
Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323.
Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978. (MLSys 2024 Best Paper Award)
Gerganov, G. et al. (2023). llama.cpp: Inference of LLaMA model in pure C/C++. GitHub Repository.

Key Takeaways

On-device AI is no longer theoretical. Small quantized language models run on flagship smartphones today, with no network required.
The three core advantages are privacy (data never leaves the device), latency (no network round trip), and offline availability (works without a connection).
Quantization and knowledge distillation are the key techniques that make capable models small enough to fit in device memory and fast enough to be interactive.
A hybrid approach, on-device for common tasks and privacy-preserving cloud for demanding ones, is the architecture most serious deployments are adopting.
The capability gap between on-device and cloud models is real but closing, driven by better training methods and hardware improvements in every new chip generation.

AI Coding Assistants in 2026: Cursor, GitHub Copilot, and the Future of Software Development

2026-06-09T02:00:00+00:00

Introduction

Three years ago, an AI coding assistant meant a smarter autocomplete. It would finish the line you were typing, suggest a function signature, or generate a boilerplate class when prompted. Impressive, but still fundamentally a text-completion tool that required a human to drive every decision.

The tools available in 2026 are categorically different. Cursor can open your entire codebase, understand the relationships between files, refactor a module across twenty files simultaneously, and explain why it made each change. GitHub Copilot now reviews pull requests, suggests fixes for failing tests, and integrates into the CI pipeline. Devin and its competitors take a task description and attempt to deliver a working pull request with no further input.

This is not incremental improvement. It is a shift in what the relationship between a developer and their tools looks like. This post explains what each major tool does, where they genuinely deliver value, where they fall short, and how working developers are actually incorporating them into their workflows.

Problem Statement

Software development is one of the most cognitively demanding professions that exist. Developers hold large mental models of codebases, context-switch constantly between tasks, and spend a surprising fraction of their time on work that is mechanical rather than creative: writing boilerplate, translating a spec into routine code, searching documentation, writing tests for logic they already understand.

AI coding assistants target that mechanical fraction. The promise is that by automating the low-creativity, high-volume work, developers can spend more time on the decisions that actually require human judgment: system design, trade-off evaluation, understanding user needs, and handling the genuinely novel problems that do not have a Stack Overflow answer.

The challenge is that the line between mechanical and creative work is not always clear, and tools that cross that line without flagging it create new categories of risk: subtle bugs introduced by confidently wrong suggestions, security vulnerabilities generated from outdated training data, and codebases that grow faster than anyone understands them.

Core Concepts and Terminology

Term	Definition
Inline completion	Real-time code suggestions that appear as ghost text while the developer types, accepted with a single keystroke.
Chat interface	A conversational panel inside the IDE where the developer asks questions or gives instructions in natural language.
Multi-file editing	The ability of a tool to understand and modify multiple files in a codebase in a single operation.
Agentic coding	A mode where the AI plans and executes a sequence of actions (read file, write code, run test, fix error) autonomously toward a goal.
Codebase indexing	The process of embedding and storing a codebase so that relevant files and symbols can be retrieved quickly during inference.
AI code review	Automated analysis of a pull request or diff to identify bugs, style violations, security issues, or logic errors.
SWE-bench	A benchmark of real-world GitHub issues used to evaluate how well AI agents can resolve actual software bugs.
Diff review	The presentation of AI-proposed code changes as a structured diff that the developer inspects and accepts or rejects before anything is written to disk.

How It Works

Most AI coding assistants follow a similar underlying architecture, though the user-facing experience varies significantly between tools.

The IDE or editor sends context to the model. This includes the current file, surrounding files, the cursor position, recent edits, and any explicit instructions from the developer. The amount of context sent varies by tool and depends on codebase indexing. The quality of what the tool sends to the model is the primary driver of output quality.
Codebase indexing makes retrieval possible. Tools like Cursor index the entire repository using embeddings. When you ask a question or trigger a completion, the tool retrieves the most relevant files and symbols from the index and includes them in the context sent to the model. This is what allows the tool to answer questions about code it has never explicitly been shown in the current session.
The model generates a completion or response. For inline completions, this is a continuation of the current code. For chat, it is an explanation or a suggested change. For agentic tasks, it is a plan followed by a sequence of tool calls: reading files, writing edits, running terminal commands, and checking outputs.
Edits are proposed as diffs. Rather than rewriting files directly, most tools present proposed changes as a diff that the developer can review and accept or reject before anything is written to disk. Agentic tools may apply edits automatically and run tests to verify them, but the best tools still surface the diff for human review.
Feedback loops improve results. The developer's acceptance or rejection of suggestions, the outcome of test runs, and any follow-up corrections are fed back into the context, allowing the model to adjust its next action. Longer agentic loops accumulate this feedback over multiple steps and converge on working solutions.
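The indexing and retrieval step described above can be sketched in a few lines. This is a toy illustration, not how any specific tool is implemented: word counts stand in for learned embeddings, and the file names and contents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a learned embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(files: Dict[str, str]) -> Dict[str, Counter]:
    """Embed every file once so queries only need to embed the question."""
    return {path: embed(source) for path, source in files.items()}


def retrieve(index: Dict[str, Counter], query: str, k: int = 2) -> List[str]:
    """Return the k file paths most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda path: cosine(index[path], q), reverse=True)
    return ranked[:k]
```

The retrieved paths are what would be added to the context sent to the model. Real tools typically index smaller chunks such as functions or symbols and use a vector store, but the retrieve-then-include shape is the same.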

Practical Example

Suppose a developer needs to add pagination to a REST API endpoint that currently returns all records. Without an AI tool, this involves reading the existing handler, updating the query logic, modifying the response schema, updating the API documentation, and writing tests for the new parameters.

With Cursor in agent mode, the developer types a one-sentence instruction: "Add limit and offset pagination to the /users endpoint and update the tests." Cursor reads the existing handler, the database query layer, the test file, and the API schema. It proposes changes across all four files simultaneously. The developer reviews the diff, notices that the tool used a different default page size than the project's convention, corrects that in the diff, and accepts the rest. The test suite passes. The whole process takes a few minutes instead of an hour.

The developer did not stop thinking. They reviewed the output, caught the convention mismatch, and made a judgment call. The tool did the mechanical work of reading the existing code, understanding the pattern, and translating the requirement into correct changes across multiple files. That is the realistic version of what these tools deliver well.

The same task with a weaker workflow, copying the handler into a chat window and asking "how do I add pagination?", produces a generic explanation that the developer must still manually translate into their specific codebase. The difference is not the model but the context: Cursor sent the actual code; the chat window sent only the question.
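For concreteness, the core of the pagination change in this example could take roughly the following shape. It is a framework-free sketch: `DEFAULT_LIMIT` and `MAX_LIMIT` are hypothetical project conventions, and a real handler would push `limit` and `offset` into the database query instead of slicing an in-memory sequence.

```python
from typing import Any, Dict, Sequence

DEFAULT_LIMIT = 20   # hypothetical project convention for page size
MAX_LIMIT = 100      # hypothetical upper bound to protect the backend


def paginate(records: Sequence[Any], limit: int = DEFAULT_LIMIT, offset: int = 0) -> Dict[str, Any]:
    """Return one page of records plus the metadata a client needs to page further."""
    if limit < 1 or limit > MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    page = list(records[offset:offset + limit])
    return {"items": page, "total": len(records), "limit": limit, "offset": offset}
```

The default page size is exactly the kind of detail a reviewer checks against project convention, as in the scenario above.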

Advantages

Significant Speed Gains on Routine Tasks

Boilerplate generation, test writing, documentation, and straightforward feature additions are genuinely faster with AI assistance. Developers commonly report 2x to 3x speedups on these categories of work, which works out to roughly 20 to 40 percent across a working day that mixes task types. The gains are largest on tasks that are well-defined and repetitive, where the developer already knows exactly what should be produced.

Lower Barrier to Unfamiliar Territory

Working in an unfamiliar language, framework, or codebase is less intimidating when you can ask questions and get contextual answers without leaving the editor. A developer who knows Python well can be productive in a Go codebase much sooner than before, because the assistant fills in framework-specific patterns while the developer focuses on the logic.

Catches Common Errors Proactively

AI code review flags obvious issues like off-by-one errors, missing null checks, and insecure patterns before they reach human reviewers. These are exactly the errors that humans miss most often in review: they are mechanical rather than conceptual, and reviewers who have been looking at code for hours skip over them. Automated pre-screening reduces the load on human reviewers and lets them focus on design-level concerns.

Documentation Is Easier to Maintain

Generating and updating docstrings, README sections, and inline comments from code is a task AI tools handle well, making it more likely that documentation stays current. Outdated documentation is one of the most persistent problems in software projects. AI assistance lowers the marginal cost of keeping it accurate enough that developers actually do it.

Reduces Context-Switching

Asking the assistant a question about an API or a design pattern inside the editor is faster than switching to a browser, running a search, and returning. Every context switch costs time and breaks concentration. Keeping the question-and-answer loop inside the IDE reduces these interruptions and keeps developers in flow longer.

Limitations and Trade-offs

Confident Incorrectness

These tools can produce plausible-looking code that is subtly wrong. The polish of the output does not reliably signal its correctness. A function that compiles, passes linting, and reads naturally can still contain a logic error that only surfaces under specific input conditions. Developers who accept suggestions without reading them introduce bugs at scale — faster than they would have introduced them without the tool.

Security Risks from Training Data

Models trained on public code learn insecure patterns that appear in that code. Generated code may contain SQL injection vulnerabilities, improper input validation, or outdated cryptography that looked correct in training data from several years ago. The model has no awareness that a pattern it learned from an old Stack Overflow answer has since been deprecated or found to be insecure.
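The SQL injection case is easy to demonstrate with Python's built-in `sqlite3` module. The table and function names below are hypothetical; the contrast between string interpolation and parameter binding is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(name: str):
    # Vulnerable pattern: user input is spliced directly into the SQL text.
    return conn.execute(f"SELECT id, name FROM users WHERE name = '{name}'").fetchall()


def find_user_safe(name: str):
    # Parameterized query: the value is bound separately from the SQL text.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()


payload = "x' OR '1'='1"
print(len(find_user_unsafe(payload)))  # 2: the injected condition matches every row
print(find_user_safe(payload))         # []: the payload is treated as a literal name
```

Both functions return the same result for ordinary input, which is why the unsafe version can pass a test suite that never tries a hostile value.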

Weak on Novel Architectures

When a codebase has unusual design patterns or domain-specific conventions that are not well represented in training data, the model frequently produces suggestions that violate those conventions. Internal frameworks, proprietary abstractions, and highly opinionated codebases create exactly the conditions where AI assistance underperforms.

Agentic Tools Can Make Large Mistakes

A model operating autonomously across files can propagate an incorrect assumption through dozens of changes before a test failure surfaces the problem. Undoing that is costly, especially when the agentic loop has touched many files. The more autonomous the tool, the more important it is to establish short verification checkpoints before each major change batch.

Privacy and IP Concerns

Code sent to cloud-based assistants may be stored or used for training. Organizations with sensitive intellectual property or compliance requirements need to evaluate this carefully before adopting cloud tools. Enterprise tiers of most major tools offer explicit commitments against training on customer code, but verifying those commitments requires reading the contract, not just the marketing copy.

Common Mistakes

Accepting Suggestions Without Reading Them

The speed benefit of AI assistance disappears if you spend time debugging confidently generated bugs. Read every suggestion before accepting it. At minimum, verify that the generated code does what you believe it does before moving on. The review step is not overhead; it is the quality gate that makes the tool safe to use at speed.

Asking Vague Questions

"Fix this" produces worse results than "This function should return an empty list when the input is None, but it currently throws a TypeError. Fix that case." Specificity in instructions dramatically improves output quality. The more precisely you describe the expected behavior, the constraints, and the failure mode, the more accurately the tool can address the actual problem.

Trusting the Tool on Security-Sensitive Code

Authentication, authorization, cryptography, and input validation are areas where AI-generated code should be reviewed with higher skepticism and ideally by a security-aware developer. A model that has learned from millions of code examples has also learned from millions of insecure examples. Generated security code that passes all tests can still contain subtle vulnerabilities.

Using AI to Avoid Understanding the Codebase

Developers who use AI to navigate code they never actually understand become dependent on the tool to maintain code they cannot reason about independently. This creates fragility: when the tool produces a wrong suggestion, you cannot catch it because you do not understand the code well enough to know what correct looks like. Understanding is not optional; it is the safety net.

Letting AI Write All the Tests

Tests written by AI to satisfy AI-written code can pass trivially while covering nothing meaningful. The AI will write tests that pass its own implementation, not tests that verify the specification. Write or critically review tests yourself, especially for business-critical logic. The value of a test suite comes from its ability to catch future regressions, not from its current pass rate.
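The difference between a test that mirrors the implementation and one that checks the specification can be shown with a small example. Everything here is hypothetical: `apply_discount` is an illustrative function with a deliberate bug, and the rule that a price never drops below zero is an assumed business requirement.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical implementation with a bug: percent above 100 yields a negative price."""
    return price * (1 - percent / 100)


def check_mirrors_implementation() -> None:
    # Restates the formula, so it passes whether or not the formula is right.
    assert apply_discount(200.0, 25.0) == 200.0 * (1 - 25.0 / 100)


def check_against_specification() -> None:
    # Encodes the assumed business rule: the price never goes below zero.
    assert apply_discount(200.0, 150.0) == 0.0


def passes(check) -> bool:
    """Run a check and report whether it passed."""
    try:
        check()
        return True
    except AssertionError:
        return False


print(passes(check_mirrors_implementation))  # True
print(passes(check_against_specification))   # False: the bug is caught
```

Only the second check can fail when the implementation is wrong, which is what makes it useful for catching regressions.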

Best Practices

Use AI Most Aggressively on Code You Already Understand

Your ability to catch mistakes is the quality gate. The tool is most valuable when you can review its output quickly and accurately. If you would not be able to spot an error in the generated code, you are not ready to accept it without a more thorough check. AI assistance amplifies your existing knowledge; it does not substitute for it.

Give the Tool Explicit Context

When starting a task, tell the tool what the function should do, what conventions the codebase uses, and what the failure mode of a wrong answer would be. Tools like Cursor can read your codebase automatically, but explicit instructions about project conventions and constraints always improve results over relying on inference alone.
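As an illustration of how a precise instruction maps to a precise change, suppose the instruction states that a function should return an empty list when its input is None instead of raising a TypeError. The function below is hypothetical; the guard clause is the entire fix the instruction asks for.

```python
from typing import Iterable, List, Optional


def normalize_tags(tags: Optional[Iterable[str]]) -> List[str]:
    """Hypothetical function: lowercase and trim tags, tolerating a missing input."""
    if tags is None:
        # Without this guard, iterating over None raises TypeError.
        return []
    return [tag.strip().lower() for tag in tags]
```

Because the instruction named both the expected behavior and the failure mode, the resulting change is small enough to verify at a glance.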

Run Tests After Every AI-Assisted Change

Catching a bad suggestion early is much cheaper than unwinding a sequence of changes built on top of it. Run your test suite after every significant AI-assisted change, not just at the end of a session. If you are using an agentic mode, configure the agent to run tests automatically after each file modification so failures surface immediately.

Maintain Your Own Understanding of the Codebase

Use AI to move faster through work you already understand, not to replace understanding you never built. Read the generated code as carefully as you would read a pull request from a junior developer. Over time, your pattern recognition improves and your review becomes faster — but it should never become perfunctory.

Evaluate Tools on Your Actual Stack Before Adopting

Different tools have different strengths across languages, frameworks, and codebase sizes. Test with your actual stack in a sandbox before adopting a tool for production use. Published benchmarks reflect aggregate performance across many tasks and languages; they may not predict how the tool behaves on your specific codebase.

Check Your Organization's Code Sharing Policy

Before using any cloud-based assistant with proprietary code, verify that it complies with your organization's data handling requirements. This is not a one-time check: review policies when you renew subscriptions, when a tool updates its terms of service, and when the sensitivity of the code you are working with changes.

Tool Comparison

Tool	Best for	Key capability	Watch out for
Cursor	Multi-file editing and codebase-wide refactoring	Full codebase indexing, agent mode, diff review	Large agentic runs can propagate errors
GitHub Copilot	Teams already on GitHub; PR review integration	Inline completions, PR review, CI integration	Less context-aware than Cursor for large codebases
Claude Code (CLI)	Terminal-driven, agentic development tasks	Long-horizon tasks, bash integration, large context	Requires comfort with CLI-first workflows
Devin / SWE-agents	Fully autonomous task completion	End-to-end issue resolution with no human steps	High variance outputs; still requires careful review
Codeium / Supermaven	Fast inline completions at low or no cost	Speed and low latency completions	Less powerful on complex multi-file tasks

Frequently Asked Questions

Will AI coding assistants replace software developers?

Not in the near term, and likely not in the sense the question implies. What is changing is the composition of a developer's work. The mechanical fraction is being automated, which means the judgment, design, and communication fractions become proportionally more important. Developers who build those skills alongside their technical skills are well positioned. Developers who treat AI assistance as a substitute for understanding their craft are not.

How much faster does coding actually get?

It depends heavily on the task type. For boilerplate, test generation, and documentation, experienced developers commonly report 2x to 3x speed improvements on those specific tasks. For novel algorithmic problems, complex architecture decisions, or debugging subtle runtime issues, the improvement is much smaller. Across a full working day that mixes task types, 20 to 40 percent overall productivity gains are the figures most commonly cited by developers who have adopted these tools seriously.
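The two sets of figures are consistent once you account for how much of the day is routine work. The estimate below is an Amdahl-style sketch: the fraction of time spent on routine tasks (0.3 and 0.4 here) is an illustrative assumption, not a measured value.

```python
def overall_speedup(routine_fraction: float, routine_speedup: float) -> float:
    """Overall speedup when only the routine fraction of working time gets faster."""
    if not 0.0 <= routine_fraction <= 1.0:
        raise ValueError("routine_fraction must be between 0 and 1")
    if routine_speedup <= 0:
        raise ValueError("routine_speedup must be positive")
    new_time = (1.0 - routine_fraction) + routine_fraction / routine_speedup
    return 1.0 / new_time


print(f"{overall_speedup(0.3, 2.0):.2f}")  # 1.18
print(f"{overall_speedup(0.4, 3.0):.2f}")  # 1.36
```

Under these assumptions, a 2x to 3x gain on routine tasks yields roughly 18 to 36 percent overall, which is in line with the 20 to 40 percent range developers cite.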

Is it safe to use these tools with private company code?

It depends on the tool and your organization's policies. Most enterprise tiers of tools like Copilot and Cursor offer explicit commitments that code is not stored or used for training. Self-hosted and local models eliminate the concern entirely. Read the terms of service carefully and consult your legal and security teams before using any cloud-based tool with sensitive code.

Which tool should a beginner start with?

GitHub Copilot is the most widely used and has the most resources and community support. It integrates into VS Code, JetBrains, and most major editors with minimal setup. Start there, learn to use inline completions effectively, and then explore more powerful tools like Cursor once you have a sense of where you want more capability.

Can these tools help with learning to code?

Yes, with an important caveat. Using AI to get explanations, understand error messages, and see examples of patterns is genuinely useful for learning. Using AI to generate code you then submit without understanding is not learning; it is deferring learning while producing an artifact that you cannot maintain or debug. The best use for learners is to ask why, not just what.

References

Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? International Conference on Learning Representations.
GitHub. (2024). GitHub Copilot: The AI Pair Programmer. GitHub Documentation.
Cursor. (2025). Cursor Documentation. Anysphere Inc.
Ziegler, A., Kalliamvakou, E., Li, X. A., Rice, A., Rifkin, D., Simister, S., ... & Aftandilian, E. (2022). Productivity Assessment of Neural Code Completion. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming.
Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., & Karri, R. (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. IEEE Symposium on Security and Privacy.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., ... & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

Key Takeaways

AI coding assistants in 2026 span a wide range from inline autocomplete to fully autonomous software agents, and the right tool depends on your task and comfort with reviewing AI output.
The biggest gains come on mechanical, repetitive tasks. Novel problems, architecture decisions, and security-sensitive code still require human judgment.
Accepting suggestions without reading them is the most common and costly mistake. AI assistance amplifies developer speed but also amplifies the rate at which errors can be introduced.
The developers getting the most value from these tools are not those who use AI to avoid thinking; they are those who use AI to move faster through work they already understand.
Privacy and security review of cloud-based tools is not optional for professional developers working with proprietary code.

Context Engineering: The New Skill That Is Replacing Prompt Engineering

2026-06-08T02:00:00+00:00

Introduction

A few years ago, prompt engineering was considered a genuine craft. Getting a language model to behave the way you wanted required careful wording, clever framing, and a mental model of how the model would interpret your instructions. Communities formed around sharing the best prompts. Job postings appeared. People wrote books.

Something has shifted. As language models have grown more capable, the exact wording of a prompt has become less decisive. What matters far more now is what surrounds the prompt: the documents, examples, instructions, memory, tool outputs, and conversation history that fill the context window alongside it. This is what practitioners now call context engineering, and it is quickly becoming the most important skill in applied AI development.

This post explains what context engineering is, how it differs from prompt engineering, why it matters more as models scale, and how to do it well in production systems.

Problem Statement

Modern language models are extraordinarily capable inside the context window. They can reason, summarize, translate, code, and plan. But they are also fundamentally stateless. Every time you call a model, it sees only what you put in front of it. It has no persistent memory, no ambient awareness of your system, and no direct access to the world.

This means the quality of a model's output is almost entirely determined by the quality of its input. A model with 200,000 tokens of context capacity is only as useful as what you choose to fill those tokens with. Put in noisy, redundant, or misordered information and the model will produce mediocre results no matter how cleverly you word the instruction at the end. Put in precise, relevant, well-structured context and even a modest instruction will yield excellent output.

The practical implication is clear: optimizing the phrasing of your prompt is a local optimization. Optimizing what you put in the context window is a global one. Context engineering is that global optimization.

Core Concepts and Terminology

Term	Definition
Context window	The maximum number of tokens a model can process in a single call, including both input and output.
System prompt	Instructions placed at the start of the context that define the model's persona, constraints, and task framing.
Retrieval-augmented generation (RAG)	A technique that retrieves relevant documents from an external store and injects them into the context before inference.
Few-shot examples	Input-output pairs placed in the context to show the model the expected format or reasoning style.
Conversation history	Prior turns in a dialogue that provide continuity and allow the model to refer back to earlier information.
Tool output	The result of a function call or API request that is injected back into the context for the model to reason over.
Context compression	Techniques such as summarization or filtering that reduce context size while preserving essential information.
Lost in the middle	A documented phenomenon where models attend less reliably to information placed in the middle of long contexts.
Token budget	A deliberately allocated limit on the number of tokens each component of the context is allowed to consume, enforced to prevent any single component from crowding out others.
Dynamic few-shot selection	The practice of choosing few-shot examples at runtime based on semantic similarity to the current input, rather than using a fixed set of examples for all queries.

How It Works

Context engineering is less a single technique and more a discipline of decisions made before the model ever runs. Here is how a well-engineered context is typically assembled:

Define the role and constraints in the system prompt. This comes first and sets the frame. A well-written system prompt does not just name the role; it specifies what the model should and should not do, what format it should use, and what assumptions it can make about the user. Think of it as the standing instructions that apply to every interaction.
Retrieve only what is relevant. If your system uses RAG, do not dump an entire knowledge base into the context. Use semantic search or keyword filtering to pull the two to five documents most relevant to the current query. Irrelevant documents add noise, consume token budget, and make it harder for the model to locate the actual answer.
Place the most important information at the edges. Models attend more strongly to content near the beginning and end of the context. Put your task instructions and critical facts either early or late in the window, not buried in the middle. If a piece of information is critical enough that the model must not miss it, consider stating it twice: once near the top and once near the instruction.
Select few-shot examples that match the current input. Static examples written once at deployment time are often suboptimal. Dynamic example selection picks the examples most similar to the current query from a library, giving the model better pattern guidance. Even three well-chosen dynamic examples typically outperform ten static ones.
Compress conversation history as it grows. Long conversations fill the context with stale information. Summarize earlier turns into a compact memory block and retain only the most recent raw exchanges, keeping the context fresh and within budget. Summarization preserves semantic content; truncation from the front discards the original framing that gave the conversation its meaning.
Inject tool outputs cleanly. When a tool returns data, format it clearly before inserting it. Label what the data is, where it came from, and when it was retrieved. Raw JSON blobs or API dumps are harder for the model to reason over than structured prose or labeled tables.
Order the components for logical flow. The model reads the context sequentially. Arrange components so that each builds naturally on the previous one: persona, then background, then examples, then the current task. Components that conflict or repeat one another reduce coherence without adding value.
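
The assembly steps above can be sketched as a single function. This is a minimal illustration rather than a reference implementation: the function name, the bracketed section labels, and the `max_docs` default are all assumptions made for the example. It orders components as persona, background, examples, history, then task, and repeats an optional critical fact near both edges of the context.

```python
def assemble_context(system_prompt, documents, examples, history_summary,
                     recent_turns, question, critical_fact=None, max_docs=5):
    """Build one context string in a fixed, inspectable order (hypothetical layout)."""
    parts = [("SYSTEM", system_prompt)]
    if critical_fact:
        # State the critical fact early, close to the top edge.
        parts.append(("CRITICAL", critical_fact))
    # Keep only the top-ranked documents; the caller is assumed to pass them ranked.
    for i, doc in enumerate(documents[:max_docs], start=1):
        parts.append((f"DOCUMENT {i}", doc))
    for i, (example_in, example_out) in enumerate(examples, start=1):
        parts.append((f"EXAMPLE {i}", f"Input: {example_in}\nOutput: {example_out}"))
    if history_summary:
        parts.append(("HISTORY SUMMARY", history_summary))
    if recent_turns:
        parts.append(("RECENT TURNS", "\n".join(recent_turns)))
    if critical_fact:
        # Repeat it near the bottom edge, just before the task.
        parts.append(("REMINDER", critical_fact))
    parts.append(("TASK", question))
    return "\n\n".join(f"[{label}]\n{text}" for label, text in parts)


context = assemble_context(
    system_prompt="You are a support agent for a software product.",
    documents=["Reset steps: open Settings, then Security.", "Billing FAQ."],
    examples=[("How do I log out?", "Open the profile menu and choose Log out.")],
    history_summary="User previously asked about billing.",
    recent_turns=["user: hi", "agent: Hello, how can I help?"],
    question="How do I reset my password?",
    critical_fact="The user is on product version 2.1.",
)
print(context)
```

Keeping the assembly in one place makes the ordering a reviewable decision rather than something scattered across the codebase.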

Practical Example

Consider a customer support agent that answers questions about a software product. A naive implementation puts the user's question directly into a chat prompt with a brief system instruction. A context-engineered implementation looks quite different.

The system prompt defines the agent's persona, tone, escalation policy, and the product version it is supporting. Before inference, the agent retrieves the three most relevant sections from the product documentation using the user's question as a search query. If the user has contacted support before, a compressed summary of prior interactions is included. If the user's account data is available, the relevant fields (plan tier, recent errors) are injected in a labeled block. Recent conversation turns are included in full. The user's question comes last.

The model never sees a different prompt wording between runs. What changes is the context surrounding the question. The agent consistently produces accurate, personalized answers not because the instruction was perfectly worded, but because the context contained exactly the information needed to reason well.

This is the practical difference between prompt engineering and context engineering. Prompt engineering asks: how should I word this? Context engineering asks: what information does the model need, and how should I structure and order it?

Advantages

Scales with Model Capability

As models get better at using long contexts, good context engineering compounds in value. The investment in structuring context pays off more with each model generation. A context pipeline designed carefully today will become more valuable as future models improve at attending to the information you provide, not less.

Model-Agnostic by Design

A well-designed context pipeline works across different model providers. Switching from one model to another requires little rework when the context structure is clean. You are not locked into a specific vendor's prompt format or quirks; the information architecture transfers, and the switching cost stays low.

Separates Concerns Cleanly

The information retrieval logic, memory management, and instruction design can each be developed and tested independently, making the system easier to maintain. A bug in retrieval quality can be diagnosed and fixed without touching the prompt or the output formatting layer. This separation dramatically reduces the surface area of debugging.

Reduces Prompt Sensitivity

When the context is rich and well-ordered, small changes in wording have less impact on output quality. The system becomes more robust to the kind of prompt fragility that plagues simpler setups, where rephrasing a question by a few words changes the answer significantly. Robustness is a production requirement, not a nice-to-have.

Enables Transparency and Auditability

Because the context is explicit and inspectable, you can audit exactly what information the model had access to when it produced any given output. This is essential for debugging, compliance review, and understanding why a model produced a particular response. Few other parts of an AI system offer this level of transparency into model behavior.

Limitations and Trade-offs

Token Cost Scales with Context Size

More context means higher inference cost and latency. Every token injected must be paid for and processed. Context engineering requires careful budgeting, and the cost of rich context grows with query volume. At scale, the difference between a 2,000-token and a 10,000-token context per request is a meaningful expense difference that affects product economics.
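
To make the scale of that difference concrete, here is a small arithmetic sketch. The price of 3.00 USD per million input tokens and the volume of one million requests per month are purely hypothetical figures chosen for illustration; substitute your provider's actual pricing and your own traffic.

```python
def monthly_input_cost(tokens_per_request, requests_per_month, usd_per_million_tokens):
    """Input-token cost per month in USD (ignores output tokens and caching)."""
    return tokens_per_request * requests_per_month * usd_per_million_tokens / 1_000_000


# Hypothetical price and volume, for illustration only.
PRICE = 3.0
REQUESTS = 1_000_000

small = monthly_input_cost(2_000, REQUESTS, PRICE)
large = monthly_input_cost(10_000, REQUESTS, PRICE)
print(small, large, large - small)  # 6000.0 30000.0 24000.0
```

Under these assumed numbers, the richer context costs five times as much per month, which is why the token budget deserves the same scrutiny as any other infrastructure cost.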

Retrieval Quality Is the Primary Bottleneck

If your retrieval system returns the wrong documents, no amount of downstream context structuring will save the response. Retrieval quality directly caps output quality. A significant portion of context engineering effort must therefore go into the retrieval system itself, not just the context format. Retrieval failures are context failures.

Lost-in-the-Middle Risk Persists

Very long contexts can still cause the model to miss information placed in the middle. Mitigation requires deliberate placement and sometimes repetition of critical facts. No context engineering technique fully eliminates this effect; it can only reduce it through careful positioning and selective emphasis.

Complexity Overhead Can Exceed the Benefit

A well-engineered context pipeline involves multiple moving parts: retrievers, summarizers, formatters, and selectors. Each introduces a failure mode and maintenance burden. For simple applications, the overhead may not be justified. Context engineering is most valuable when the output quality gain clearly exceeds the pipeline complexity cost.

No Guaranteed Grounding

Even with excellent context, models can still hallucinate or over-rely on training knowledge rather than context-provided facts. Context engineering reduces this risk substantially but does not eliminate it. Verification mechanisms, citations, and confidence signals remain necessary complements for high-stakes applications.

Common Mistakes

Retrieving Too Many Documents

Padding the context with loosely relevant content is worse than being selective. Irrelevant documents dilute the signal and push critical information further from the edges where the model attends best. In practice, two to five highly relevant documents consistently outperform ten loosely relevant ones. Relevance is the constraint; volume is not the goal.

Ignoring Position Effects

Placing critical instructions in the middle of a long context is a reliable way to have them under-weighted. Always position key content at the start or end of the context window. If you cannot avoid placing something important in the middle, repeat a summary of it near the end where the model will attend again before generating its response.

Using Static Few-Shot Examples for Every Query Type

Examples written for one kind of input pattern mislead the model on other patterns. A customer support agent with examples about billing questions will handle billing well and everything else inconsistently. Select examples dynamically based on the current input to give the model pattern guidance that matches what it is actually being asked to do.

Never Compressing History

Allowing conversation history to grow unbounded until it hits the context limit creates a cliff where the system suddenly forgets everything. Compress proactively rather than reactively. A well-written 300-token summary often carries more useful context than the same 300 tokens spent on the most recent raw exchanges, because a summary can retain the conversation's framing and earlier decisions rather than only what was said last.

Injecting Raw Data Without Labels

Dropping a tool output into the context without explaining what it is forces the model to guess at its meaning, units, and recency. Always label data sources, what the numbers represent, what units are being used, and when the data was retrieved. A labeled table is dramatically easier for a model to reason over than an unlabeled JSON blob.
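
A minimal sketch of such labeling follows. The function name, the field names, and the example tool are hypothetical; the point is only that source, units, and retrieval time travel with the data.

```python
from datetime import datetime, timezone


def format_tool_output(name, source, rows, units, retrieved_at=None):
    """Render a dict of tool results as a labeled block instead of a raw blob."""
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    header = [
        f"Tool: {name}",
        f"Source: {source}",
        f"Retrieved: {retrieved_at.isoformat()}",
    ]
    # Attach a unit to each value when one is known; leave it bare otherwise.
    lines = [f"- {key}: {value} {units.get(key, '')}".rstrip()
             for key, value in rows.items()]
    return "\n".join(header + lines)


block = format_tool_output(
    name="account_lookup",                      # hypothetical tool name
    source="billing database",                  # hypothetical source label
    rows={"plan_tier": "pro", "latency_p95": 420, "recent_errors": 3},
    units={"latency_p95": "ms", "recent_errors": "errors in last 24 h"},
    retrieved_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(block)
```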

Optimizing the Prompt Before the Context

Spending hours on instruction wording while leaving retrieval and structure unexamined is misplaced effort. In most production systems, context structure and retrieval quality affect output quality far more than the exact wording of the instruction. Fix the context first, then refine the prompt.

Best Practices

Treat Context Design as a First-Class Engineering Concern

Document what each component of the context is for and why it is ordered the way it is. Context structure should be version-controlled alongside the code. When the structure changes, the change should go through the same review process as any other system change, because context structure changes are model behavior changes.

Log Full Contexts and Inspect Them

Reading the actual context the model received before a bad output will reveal the root cause faster than any other debugging method. Build logging into your context assembly pipeline from the start. In development, read every context manually before assuming the system is working. Most production bugs in AI systems are context bugs, not model bugs.

Build a Token Budget and Enforce It

Assign token allocations to each context component and instrument your pipeline to alert when any component exceeds its allocation. Enforce the budget at runtime rather than hoping components stay within bounds. Without enforcement, components tend to grow over time as engineers add features, and the context silently degrades in quality.
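
One way to enforce a budget at runtime is sketched below. It assumes a crude whitespace word count as a stand-in for a real tokenizer, and the component names and limits are hypothetical. In a real pipeline you would count with the tokenizer that matches your model, and you might summarize rather than trim.

```python
def enforce_budget(components, budget):
    """Trim each component to its allocation and report usage.

    Token counts are approximated by whitespace-separated words (an assumption).
    A component with no entry in `budget` raises KeyError, which is deliberate:
    every component should have an explicit allocation.
    """
    trimmed, report = {}, {}
    for name, text in components.items():
        limit = budget[name]
        words = text.split()
        over = len(words) > limit
        trimmed[name] = " ".join(words[:limit]) if over else text
        report[name] = {"used": len(words), "limit": limit, "over": over}
    return trimmed, report


components = {
    "system": "You are a concise support agent.",
    "documents": "alpha beta gamma delta epsilon zeta eta theta",
}
budget = {"system": 50, "documents": 5}  # hypothetical allocations

trimmed, report = enforce_budget(components, budget)
for name, stats in report.items():
    if stats["over"]:
        print(f"ALERT: {name} used {stats['used']} of {stats['limit']} tokens")
```

The report is the part worth wiring into monitoring: silent growth in one component is exactly the failure the budget is meant to surface.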

Test Retrieval Quality Independently of Model Quality

Evaluate whether your retriever returns the right documents before evaluating whether the model produces the right answers. Use a test set of queries with known ground-truth relevant documents and measure recall and precision at each retrieval depth. A retrieval system that fails to return relevant documents at rank one to five cannot be saved by better context formatting downstream.
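
The measurement itself is simple. The sketch below computes precision and recall at a given depth for one query; the document IDs are made up for the example, and a real evaluation would average these numbers over the whole test set.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for one query.

    retrieved: ranked list of document IDs returned by the retriever.
    relevant:  set of ground-truth relevant document IDs.
    """
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d3", "d1", "d7", "d2", "d9"]   # hypothetical ranked results
relevant = {"d1", "d2", "d4"}                # hypothetical ground truth

for k in (1, 3, 5):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```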

Use Summarization to Manage History, Not Truncation

Truncating conversation history from the start loses the earliest context that gave the conversation its framing and purpose. Summarizing preserves it in compressed form. A good rule of thumb is to summarize conversation turns older than five to ten exchanges into a memory block that is refreshed as the conversation continues.
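
A sketch of that rule of thumb is shown below. The `summarize` argument is a placeholder for whatever summarizer you use (typically a model call); the default here only clips and joins old turns so the example runs on its own, and it is not a real summarizer.

```python
def compress_history(turns, keep_recent=6, summarize=None):
    """Split history into (summary_of_older_turns, recent_raw_turns).

    Returns (None, turns) when there is nothing old enough to summarize.
    """
    if summarize is None:
        # Placeholder only: clip each old turn and join them.
        summarize = lambda old: " | ".join(turn[:40] for turn in old)
    split = len(turns) - keep_recent
    if split <= 0:
        return None, list(turns)
    old, recent = turns[:split], turns[split:]
    return summarize(old), recent


turns = [f"turn {i}" for i in range(10)]
summary, recent = compress_history(
    turns,
    keep_recent=4,
    summarize=lambda old: f"{len(old)} earlier turns",  # stand-in summarizer
)
print(summary)  # 6 earlier turns
print(recent)   # ['turn 6', 'turn 7', 'turn 8', 'turn 9']
```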

Maintain a Curated Few-Shot Example Library

Build a curated library of high-quality input-output examples and use embedding-based search to select the best match for each query at runtime. Invest time in example quality: a library of 50 excellent, diverse examples will outperform a library of 500 mediocre ones. Prune the library regularly to remove low-quality or redundant examples.
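
The selection step can be sketched as follows. To keep the example self-contained, it uses bag-of-words counts with cosine similarity as a stand-in for a real embedding model; that substitution is an assumption of this sketch, and the library entries are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, library, k=3):
    """Return the k library examples whose inputs are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(library,
                    key=lambda ex: cosine(query_vec, embed(ex["input"])),
                    reverse=True)
    return ranked[:k]


library = [  # hypothetical curated examples
    {"input": "Why was I charged twice this month", "output": "Billing answer"},
    {"input": "I cannot log in to my account", "output": "Login answer"},
    {"input": "How do I export my data", "output": "Export answer"},
]

best = select_examples("I was charged twice on my invoice", library, k=1)
print(best[0]["output"])  # Billing answer
```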

Version Your Context Templates

When context structure changes, track what changed and how output quality was affected. Treat context template versions as you would treat model versions: with changelogs, regression tests, and a clear rollback path. Without versioning, it is impossible to attribute a quality change to a context change versus a model change.

Comparison: Prompt Engineering vs. Context Engineering

Dimension	Prompt Engineering	Context Engineering
Primary focus	Wording of the instruction	What information surrounds the instruction
Scope	Single prompt or template	Entire context assembly pipeline
Skills involved	Writing, linguistics, intuition	Systems design, information retrieval, data engineering
Impact on output	Moderate, diminishing with model scale	High, increasing with model scale
Transferability	Often model-specific	Generally model-agnostic
Testability	Hard to isolate variables	Each component can be tested independently
Relevant for agents	Partially	Centrally; agents are almost entirely context management

Frequently Asked Questions

Is context engineering only relevant for agents and RAG systems?

No, though it is most visible in those settings. Even a simple single-turn chatbot benefits from thoughtful context design: what examples to include, how to word the system prompt, whether to include user metadata. Context engineering applies wherever a model has a context window, which is always.

Does context engineering replace fine-tuning?

They address different problems. Fine-tuning changes what the model knows and how it behaves by default. Context engineering shapes what the model attends to at inference time. In many production cases, context engineering delivers most of the gains that developers initially hoped to get from fine-tuning, with less cost and faster iteration. Fine-tuning is still valuable for teaching the model new behaviors or domain-specific styles that cannot be reliably conveyed through context alone.

How do I know if my context is well-engineered?

The most direct signal is output quality under varied inputs. A well-engineered context produces consistently good outputs across diverse queries, not just the ones you tested on. You can also log and inspect contexts manually, run ablations by removing individual components and measuring the impact, and evaluate retrieval quality independently of the downstream model.

What happens when the context window is full?

You have to decide what to drop. This is one of the most consequential decisions in context engineering. Options include compressing conversation history through summarization, dropping the least relevant retrieved documents, shortening few-shot examples, or using a hierarchical approach where a cheaper model decides what to include before the main model runs. The decision should be policy-driven and consistent, not ad hoc.
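
As one example of a consistent policy, the sketch below keeps retrieved documents in descending relevance order and skips any document that would push the total past the limit. The relevance scores, the whitespace token count, and the greedy rule itself are assumptions for illustration; other policies are equally valid as long as they are applied the same way every time.

```python
def fit_documents(docs, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep the most relevant documents that fit within max_tokens.

    docs: list of (relevance_score, text) pairs.
    Returns (kept_docs, tokens_used).
    """
    kept, used = [], 0
    for score, text in sorted(docs, key=lambda d: d[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= max_tokens:
            kept.append((score, text))
            used += cost
    return kept, used


docs = [  # hypothetical scores and contents
    (0.9, "reset password steps"),
    (0.2, "company history and mission statement"),
    (0.7, "security settings"),
]
kept, used = fit_documents(docs, max_tokens=5)
print([text for _, text in kept], used)
# ['reset password steps', 'security settings'] 5
```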

Will larger context windows make context engineering less important?

Unlikely. Larger windows increase how much you can include, but they do not change the fact that relevance and position still matter. A 1-million-token context filled carelessly will produce worse results than a 32,000-token context filled thoughtfully. The discipline scales with window size rather than becoming obsolete: larger windows raise the ceiling of what context engineering can achieve.

References

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33.
Anthropic. (2024). Claude's Model Specification. Anthropic Technical Documentation.
Brown, T., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., ... & Wang, H. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997.
Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E., ... & Zhou, D. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context. Proceedings of the 40th International Conference on Machine Learning.
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2024). Improving Text Embeddings with Large Language Models. arXiv preprint arXiv:2401.00368.

Key Takeaways

Context engineering is the practice of deliberately designing what goes into the model's context window: what information, in what order, at what level of compression.
As models become more capable, the wording of individual prompts matters less. What the context contains matters more.
The most impactful levers are retrieval quality, position of critical information, dynamic example selection, and history compression.
Treating context as inspectable, versionable, and testable infrastructure, rather than an afterthought, is what separates production-grade AI systems from demos.
Context engineering is not a replacement for prompt engineering but a broader discipline that subsumes it.

Vision Language Models (VLMs): How GPT-4o, Claude, and LLaVA Understand Images

2026-06-07T02:00:00+00:00

What you will learn: How VLMs bridge pixels and language tokens, covering CLIP encoders, patch tokenisation, projection layers, and the three dominant architectures used in production.

Why it matters: VLMs power GPT-4o, Claude Vision, Gemini, and the open-source LLaVA family. Understanding their internals is now a core skill for ML engineers building multimodal applications.

Architecture: Three paradigms dominate: encoder-projector-LLM (LLaVA-style), cross-attention fusion (Flamingo-style), and native multimodal (GPT-4o-style), each with distinct trade-offs.

Key insight: A 336x336 image becomes 576 visual tokens via patch tokenisation, each carrying rich spatial semantics that the language model attends to alongside text tokens.

Watch out for: Hallucination on fine-grained spatial details, high per-image token cost, and resolution limits that cause failures on small text or dense diagrams.

When you send a photo of a handwritten invoice to GPT-4o and ask it to extract the line items, or when you upload a chart to Claude and it summarises the trend, something extraordinary is happening under the hood. A model that was built around sequences of text tokens is somehow processing the continuous, high-dimensional signal of an image and integrating that information into its reasoning chain. How?

Vision Language Models (VLMs) are the class of architectures that make this possible. They bridge the gap between the continuous world of pixels and the discrete world of language tokens, enabling a new generation of applications: visual question answering (VQA), image captioning, optical character recognition at scale, chart and table understanding, document parsing, medical image analysis, and multimodal agents that can see and act on the world.

In 2026, VLMs have moved from research curiosity to production infrastructure. GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and the open-source LLaVA family are all deployed at scale. Understanding how they work, not just that they work, is now a core competency for ML engineers. This post gives you that understanding in full technical depth.

The Problem VLMs Solve

A pure language model operates in token space. Its input is a sequence of integers (token IDs), each mapped to a vector via an embedding table, and its output is a probability distribution over the vocabulary. Everything is discrete and one-dimensional. Images are neither of those things.

A 336 x 336 pixel RGB image contains 338,688 raw numerical values. Even at reduced resolution, the raw pixel array is a dense, spatially structured, continuous signal. Feeding raw pixels directly into a transformer would require attention over hundreds of thousands of positions, making computation prohibitively expensive. More fundamentally, raw pixel values carry no semantic structure: the number 127 in position (42, 83, 2) tells the model nothing useful by itself.
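
The arithmetic behind these numbers is easy to check. The sketch below also compares pixel-level attention with attention over 14x14 patches, the patch size used in the examples later in this post; the function name is made up for illustration.

```python
def image_footprint(height, width, channels=3, patch=14):
    """Return (raw values, pixel positions, patch tokens) for an image."""
    raw_values = height * width * channels
    pixel_positions = height * width
    patch_tokens = (height // patch) * (width // patch)
    return raw_values, pixel_positions, patch_tokens


raw_values, pixel_positions, patch_tokens = image_footprint(336, 336)
print(raw_values)     # 338688
print(patch_tokens)   # 576

# Self-attention cost grows with the square of sequence length, so compare
# the number of pairwise interactions at pixel level and at patch level.
print(pixel_positions ** 2 // patch_tokens ** 2)  # 38416
```

Treating every pixel as a position would mean roughly 38,000 times more pairwise attention computations than working with 576 patch tokens.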

The core challenge of VLMs is therefore a representation mismatch: the language model expects semantically rich, fixed-dimensional vectors arranged in a short sequence. Images are high-dimensional, spatially structured, and continuous. Bridging this gap requires three things: (1) a vision encoder that converts raw pixels into compact, semantically meaningful representations; (2) a projection mechanism that maps those representations into the language model's embedding space; and (3) a training procedure that teaches the combined system to align visual and linguistic meaning.

Getting this wrong in any of the three places produces a model that confidently hallucinates image content, fails on spatially precise questions, or cannot generalise to images outside its training distribution.

Core Concepts and Terminology

Term	Definition	Why It Matters
Vision Encoder	A neural network (typically a Vision Transformer or CNN) that converts raw pixel data into a grid of feature vectors.	Determines the quality and type of visual representations available to the language model.
Language Model Backbone	The pretrained LLM (e.g., LLaMA, Vicuna, Mistral) that receives visual and text tokens and generates output.	Provides all the reasoning, instruction-following, and language generation capability.
Visual Tokens	The sequence of vectors produced by the vision encoder (and optionally compressed by a projector) that are fed into the LLM alongside text tokens.	Each visual token represents a region of the image in the language model's embedding space.
Image Patches	Non-overlapping rectangular regions of the input image (e.g., 14x14 or 16x16 pixels) that the ViT embeds independently before applying self-attention across them.	The patch size directly controls how many visual tokens are produced per image.
CLIP	Contrastive Language-Image Pretraining (OpenAI, 2021). A dual-encoder model trained to align image and text representations in a shared embedding space.	CLIP's vision encoder is the most widely used backbone for VLMs because its representations are semantically aligned with language.
ViT (Vision Transformer)	An image encoder that divides an image into fixed-size patches, linearly embeds each patch, and applies transformer self-attention over the resulting sequence.	ViTs produce the per-patch token sequences that VLMs consume. CLIP uses a ViT as its image encoder.
Cross-Attention	An attention mechanism in which queries come from one modality (e.g., text) and keys/values come from another (e.g., image features).	Used in Flamingo-style architectures to let the language model attend to image regions at every transformer layer.
Projection Layer	A trainable module (linear, MLP, or Q-Former) that maps vision encoder output vectors into the LLM's embedding dimensionality.	The projection layer is the primary trainable interface between the two modalities in many VLMs.
Multimodal Alignment	The process of training or fine-tuning the combined system so that visual and language representations are compatible in a shared semantic space.	Without alignment, the LLM cannot interpret visual tokens and produces incoherent outputs.
Instruction Tuning	Fine-tuning a pretrained model on (instruction, response) pairs so it learns to follow natural language instructions, including multimodal ones.	Converts a pretrained VLM into a useful assistant that responds correctly to "describe this image" or "what is the trend in this chart?"

Architecture Overview

Three dominant paradigms have emerged for building VLMs, each making different trade-offs between flexibility, training cost, and performance.

Architecture 1: Encoder + Projector + LLM (LLaVA-Style)

This is the simplest and most widely used open-source architecture. The data flow is:

Stage	Component	What Happens
1	Input Image	Raw pixels (H x W x 3) fed into the vision encoder
2	CLIP ViT-L/14	Image divided into patches; each patch becomes a D_vision-dimensional embedding vector
3	Projection Layer	Linear or MLP maps patch embeddings from vision space into the LLM's embedding dimension
4	Token Concatenation	Visual tokens are prepended to the text token sequence to form a single combined input
5	LLM (LLaMA / Vicuna)	Processes the full combined sequence; self-attention spans both visual and text tokens
6	Output	Autoregressive text generation conditioned on both image and prompt

LLaVA-style architecture: the vision encoder and LLM are coupled through a lightweight projection layer. Every transformer layer in the LLM can attend to every visual token.

The vision encoder (typically CLIP ViT-L/14 or ViT-L/14@336px) processes the image and produces a sequence of patch embeddings. These are passed through a projection layer that maps them from the vision encoder's hidden dimension (e.g., 1024) to the LLM's embedding dimension (e.g., 4096). The resulting visual tokens are then prepended to the text token sequence, and the LLM processes the combined sequence autoregressively.
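
The shapes involved can be traced with a small NumPy sketch. It is a shape walkthrough under heavy simplification, not LLaVA's code: random matrices stand in for the trained CLIP encoder and projector, the encoder is reduced to a single linear patch embedding with no attention layers, and the ten text-token embeddings are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)


def patchify(image, patch=14):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


d_vision, d_llm = 1024, 4096  # dimensions quoted in the text above

image = rng.random((336, 336, 3)).astype(np.float32)
patches = patchify(image)                                  # (576, 588)

# Stand-in for the vision encoder: one linear patch embedding, no attention.
W_embed = (rng.standard_normal((patches.shape[1], d_vision)) * 0.02).astype(np.float32)
vision_features = patches @ W_embed                        # (576, 1024)

# Projection layer: maps vision features into the LLM embedding space.
W_proj = (rng.standard_normal((d_vision, d_llm)) * 0.02).astype(np.float32)
visual_tokens = vision_features @ W_proj                   # (576, 4096)

# Placeholder embeddings for a 10-token text prompt.
text_tokens = rng.standard_normal((10, d_llm)).astype(np.float32)

# Visual tokens are prepended to the text tokens.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(patches.shape, visual_tokens.shape, llm_input.shape)
# (576, 588) (576, 4096) (586, 4096)
```

The final sequence has 586 positions, of which 576 come from the image, which makes the context-window cost of a single image easy to see.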

Trade-offs: Simple to implement and train. The entire image is visible to every layer of the LLM via self-attention. However, the visual token count can be large (576 tokens for a 336x336 image with 14x14 patches), consuming a significant portion of the context window. The projection layer is the only newly introduced component and carries most of the cross-modal mapping; the vision encoder and LLM can each be frozen or fine-tuned depending on compute budget.

Example models: LLaVA-1.5, LLaVA-NeXT, BakLLaVA, MoE-LLaVA, ShareGPT4V.

Architecture 2: Cross-Attention Fusion (Flamingo-Style)

In Flamingo (DeepMind, 2022), the image and text modalities are kept separate. The language model backbone is frozen, and new cross-attention layers are interleaved between its existing transformer layers. These cross-attention layers receive queries from the text stream and keys/values from a pooled representation of image features.

Component	Role	Key Detail
Vision Encoder (NFNet or ViT)	Extracts visual features from the input image	Produces a variable-length sequence of patch embeddings
Perceiver Resampler	Compresses visual features to a fixed token count	Learnable query vectors pool patch embeddings down to 64 tokens regardless of image size
Cross-Attention Layers	Inserted between frozen LLM blocks	Text hidden states act as queries; image features are keys and values
Frozen LLM Backbone	Language generation	Original weights unchanged; only the cross-attention layers and Perceiver are trained
Output	Text response	Generated autoregressively, informed by image features at every layer depth

Flamingo-style architecture: cross-attention layers injected between frozen LLM blocks allow the language model to attend to compressed image features at every depth, without disturbing the pretrained text weights.

A key component is the Perceiver Resampler, which uses a small set of learnable query vectors to compress the variable-length patch sequence from the vision encoder down to a fixed number of tokens (e.g., 64). This keeps the cross-attention computation tractable regardless of image resolution.
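The pooling behaviour described above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration, not Flamingo's actual implementation: the query vectors are random stand-ins for learned parameters, and the dimension of 32 is an arbitrary assumption chosen to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(patch_embeddings, queries):
    """Single-head cross-attention: fixed queries pool over a variable-length patch sequence.

    patch_embeddings: (num_patches, dim)
    queries:          (num_queries, dim)
    returns:          (num_queries, dim)
    """
    dim = queries.shape[-1]
    scores = queries @ patch_embeddings.T / np.sqrt(dim)  # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)                    # each query's weights sum to 1
    return weights @ patch_embeddings                     # (num_queries, dim)

rng = np.random.default_rng(0)
dim, num_queries = 32, 64
queries = rng.normal(size=(num_queries, dim))  # random stand-in for learned queries

for num_patches in (256, 576):
    patches = rng.normal(size=(num_patches, dim))
    pooled = resample(patches, queries)
    print(num_patches, "->", pooled.shape)  # output length is always num_queries
```

Whatever the number of input patches, the output has exactly 64 rows, which is what keeps the downstream cross-attention cost independent of image resolution.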

Trade-offs: The frozen LLM backbone is protected from catastrophic forgetting. Cross-attention at every layer gives the model fine-grained control over when and how it uses image information. However, the architecture is more complex to implement, and the cross-attention adds inference latency at every layer.

Example models: Flamingo, OpenFlamingo, IDEFICS. (IDEFICS2 moved away from cross-attention to a concatenation-based design.)

Architecture 3: Native Multimodal (GPT-4o-Style)

The most capable but least open architecture trains a single unified model end-to-end on interleaved image, text, and audio data from the start. Rather than adapting a pretrained LLM to accept images, the model is pretrained jointly across modalities, allowing every layer to develop natively multimodal representations.

GPT-4o is believed to tokenise images into discrete visual tokens using a learned tokeniser, producing image tokens that live in the same vocabulary as text tokens, though the exact architecture has not been publicly disclosed by OpenAI. The model then processes these as a unified sequence.

Trade-offs: No modality boundary means the model can reason more deeply about relationships between text and image at every layer. End-to-end training allows the vision and language representations to co-evolve. The cost is enormous: joint pretraining requires vastly more compute, data, and engineering complexity. The architectural details of GPT-4o and Claude's vision system are not publicly disclosed.

Example models: GPT-4o, Gemini 1.5 Pro, Chameleon (Meta). GPT-4V and Claude 3 Opus are often grouped here as well, although their architectures are undisclosed.

How CLIP Works

CLIP (Contrastive Language-Image Pretraining) is foundational to understanding most open-source VLMs. Published by OpenAI in 2021, CLIP trains two encoders simultaneously: an image encoder (typically a ViT) and a text encoder (a transformer). The training signal is contrastive: for a batch of N (image, caption) pairs, the model is trained to maximise the cosine similarity of the N matching pairs and minimise the similarity of the N^2 - N non-matching pairs.

CLIP applies a contrastive form of supervised learning: instead of predicting a single label, it learns to match images to their correct captions out of an entire batch. An image encoder and a text encoder are trained together on 400 million image-caption pairs. After training, images and their descriptions land close together in a shared vector space, which is why CLIP representations transfer so naturally into language models as visual backbones.

Because those 400 million (image, text) pairs were scraped from the internet, CLIP's image encoder learns to produce representations that are semantically aligned with natural language. A CLIP embedding of a photo of a golden retriever will be close to the text embedding of "a golden retriever" and far from that of "a sports car". This alignment is exactly what makes CLIP representations useful as a visual backbone for VLMs: the image features are already in a language-compatible semantic space.
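The contrastive objective can be written down compactly. The sketch below is a NumPy illustration under simplifying assumptions: the temperature is fixed at 0.07 (CLIP learns it during training), and the embeddings are random vectors rather than encoder outputs.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix."""
    # L2-normalise so that dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N); diagonal = matching pairs

    def cross_entropy_diag(l):
        # Row-wise log-softmax, then pick the diagonal entry (the correct pair)
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Image-to-text uses rows, text-to-image uses columns
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
aligned = text + 0.01 * rng.normal(size=(8, 16))  # "images" that sit near their captions
shuffled = aligned[::-1]                           # every pair deliberately mismatched

print("aligned loss: ", clip_contrastive_loss(aligned, text))
print("shuffled loss:", clip_contrastive_loss(shuffled, text))
```

The aligned batch yields a loss near zero because each diagonal entry dominates its row and column; the shuffled batch is heavily penalised because the high-similarity entries sit off the diagonal.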

ViT Patch Tokenisation

CLIP's image encoder is a Vision Transformer (ViT). The ViT processes images as follows:

Divide the image into a grid of non-overlapping patches. For ViT-L/14, each patch is 14x14 pixels. A 224x224 image produces 16x16 = 256 patches. A 336x336 image produces 24x24 = 576 patches.
Flatten each patch into a 1D vector of length 14*14*3 = 588, then project it to the model's hidden dimension D (e.g., 1024) via a learned linear layer. This is the "patch embedding".
Prepend a learnable [CLS] token to the sequence, then add learnable positional embeddings to every token so the model knows where each patch came from.
Pass the resulting sequence (length 577 for 336x336 with ViT-L/14) through L transformer layers with multi-head self-attention.
The [CLS] token output is typically used as the global image representation for CLIP's contrastive loss. The full patch token sequence (without [CLS]) is used as visual tokens in LLaVA.

The critical insight is that each patch token in the final layer's output corresponds to a specific spatial region of the image. Self-attention allows patches to attend to each other, so a patch token representing the sky can incorporate information from the horizon patches. But the spatial correspondence is preserved: visual token 42 always corresponds to the same 14x14 region.
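The patch arithmetic and the fixed token-to-region mapping can be checked directly. This is a small NumPy sketch; the function names `patchify` and `token_region` are hypothetical helpers, and the token index is assumed to exclude the [CLS] token.

```python
import numpy as np

def patchify(image, patch=14):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)

def token_region(index, image_size=336, patch=14):
    """Return (top, left, bottom, right) pixel bounds for a patch token index."""
    cols = image_size // patch
    row, col = divmod(index, cols)
    return (row * patch, col * patch, (row + 1) * patch, (col + 1) * patch)

image = np.zeros((336, 336, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)      # (576, 588)
print(token_region(42))  # (14, 252, 28, 266)
```

Token 42 sits in the second row of the 24x24 grid (row 1, column 18), so it always covers pixel rows 14 to 27 and columns 252 to 265, whatever the image content.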

Visual Token Projection

The vision encoder produces patch embeddings in its own hidden space (e.g., D_vision = 1024 for ViT-L/14). The LLM operates in its own embedding space (e.g., D_llm = 5120 for LLaMA-2-13B). These spaces are not compatible: a vector from CLIP cannot be directly inserted into LLaMA's residual stream and produce meaningful computation.

The projection layer solves this by learning a mapping from D_vision to D_llm. Three main approaches are used: a single linear layer, a small MLP, and a Q-Former.

Linear and MLP Projection (LLaVA, LLaVA-1.5)

The original LLaVA uses a single linear layer: W ∈ R^(D_llm x D_vision), applied independently to each patch token. Fast, simple, surprisingly effective. LLaVA-1.5 found that a two-layer MLP with a GELU activation outperformed a single linear layer, and that is the variant sketched here.

# Simplified two-layer MLP projector (LLaVA-1.5 style)
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim)
        return self.proj(image_features)
        # output: (batch, num_patches, llm_dim)

Q-Former (BLIP-2, InstructBLIP)

The Querying Transformer (Q-Former) is a more sophisticated bottleneck. It contains a fixed set of N learnable query vectors (e.g., 32 queries) that cross-attend to the full patch sequence from the frozen vision encoder. The Q-Former outputs N vectors, which a final linear layer projects into D_llm space, regardless of the original image resolution. This dramatically compresses the visual token count (for example, from 576 patch tokens down to 32) at the cost of some spatial detail, and adds a trained interface that can be pretrained on image-text tasks independently.

Token Count: Why 576 Tokens

With ViT-L/14@336px: the image is 336x336 pixels. The patch size is 14x14. The number of patches is (336/14) x (336/14) = 24 x 24 = 576. Each patch becomes one visual token in the LLM's input sequence. This is why a single 336x336 image in LLaVA-1.5 consumes 576 context tokens, which is significant in a 4096-token context window. LLaVA-NeXT addresses this with higher resolution support via dynamic resolution strategies that tile the image.
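The same arithmetic as a small helper, useful when budgeting context for a given encoder. The function name `visual_token_count` is a hypothetical helper, and square inputs are assumed.

```python
def visual_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

tokens = visual_token_count(336, 14)
context_window = 4096

print(tokens)                            # 576
print(f"{tokens / context_window:.1%}")  # 14.1% of the context window
print(visual_token_count(224, 14))       # 256
```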

Training Pipeline

Most open-source VLMs follow a two-stage training pipeline, pioneered by LLaVA and refined by subsequent work.

Stage 1: Projection Pretraining (Feature Alignment)

Goal: Teach the projection layer to map vision encoder outputs into vectors that the frozen LLM can interpret.

Data: Large-scale image-caption pairs (e.g., CC3M, LAION-CC-SBU, approximately 558K pairs in LLaVA-1.5 Stage 1).

Setup: Both the vision encoder and the LLM are frozen. Only the projection layer weights are updated. The LLM is trained to predict the caption tokens given the projected visual tokens.

Why freeze the LLM? The LLM already has strong language priors. Updating it on caption pairs alone could cause catastrophic forgetting of its broader language capabilities. Stage 1 focuses exclusively on teaching the projection layer to speak the LLM's language.

Duration: Typically 1 epoch on the alignment dataset. Computationally cheap compared to Stage 2.

Stage 2: Visual Instruction Tuning

Goal: Teach the model to follow multimodal instructions, answer questions about images, and engage in visual dialogue.

Data: Multimodal instruction-following datasets: LLaVA-Instruct-150K, ShareGPT4V, VQA datasets, TextVQA, GQA, OCR-VQA, and document understanding datasets. LLaVA-1.5 uses approximately 665K instruction samples.

Setup: The vision encoder is frozen. The projection layer and the LLM are both trained (or the LLM is trained with LoRA adapters to reduce compute). The model is trained to generate correct responses to instructions like "Describe this image in detail", "What is the text in this sign?", "How many people are in the image?".

Why curriculum matters: Stage 1 must complete before Stage 2. If both are run together, the untrained projection layer produces garbage vectors, and the LLM's updates will attempt to compensate, degrading its language quality. The sequential curriculum cleanly separates the alignment problem from the instruction-following problem.
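In PyTorch terms, the two stages differ only in which parameters receive gradients. The sketch below uses a toy stand-in model (the `TinyVLM` class and its layer sizes are hypothetical) to show one way of switching between them; a real pipeline would apply the same idea to the actual vision encoder, projector, and LLM.

```python
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy stand-in with the three VLM components (sizes are arbitrary)."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.projector = nn.Linear(8, 16)
        self.llm = nn.Linear(16, 16)

def configure_stage(model: nn.Module, stage: int) -> None:
    """Stage 1 trains only the projector; Stage 2 trains the projector and the LLM."""
    trainable = {1: ("projector",), 2: ("projector", "llm")}[stage]
    for name, param in model.named_parameters():
        # Parameter names look like "projector.weight"; the prefix is the component
        param.requires_grad = name.split(".")[0] in trainable

def trainable_modules(model: nn.Module) -> list:
    """Names of top-level components that currently receive gradients."""
    return sorted({n.split(".")[0] for n, p in model.named_parameters() if p.requires_grad})

model = TinyVLM()
configure_stage(model, 1)
print(trainable_modules(model))  # ['projector']
configure_stage(model, 2)
print(trainable_modules(model))  # ['llm', 'projector']
```

The vision encoder stays frozen in both stages, matching the setup described above.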

Data quality matters more than quantity: LLaVA-1.5 achieved state-of-the-art results with only 665K instruction samples by using GPT-4-generated high-quality conversation data, outperforming models trained on 10x more but lower-quality data.

Practical Example: "What Is in This Image?"

Let's trace exactly what happens when a user sends a photo of a busy street with the question "What is in this image?" to a LLaVA-1.5 (13B) model.

Step 1: Image Preprocessing

The image is resized and center-cropped to 336x336 pixels. Pixel values are normalised using CLIP's mean and std. The image tensor has shape (3, 336, 336).

Step 2: Patch Tokenisation

The ViT-L/14@336px divides the image into 24x24 = 576 patches, each 14x14 pixels. Each patch is linearly embedded to a 1024-dimensional vector. A [CLS] token is prepended, giving sequence length 577.

Step 3: Vision Encoder Processing

The 577-token sequence passes through 24 transformer layers (ViT-L configuration). Each layer applies multi-head self-attention (16 heads, dim 64 each) and an MLP. Patches corresponding to buildings, cars, people, and traffic lights develop specialised representations as higher layers encode increasingly abstract features. The output is 576 patch embeddings, each of shape (1024,). (The [CLS] token is discarded for LLaVA; some models use it for global context.)

Step 4: Projection to Language Space

The two-layer MLP projector maps each of the 576 patch embeddings from (1024,) to (5120,), matching LLaMA-2-13B's hidden dimension. Output: 576 visual tokens in LLM embedding space.

Step 5: Token Sequence Construction

The text question "What is in this image?" is tokenised to approximately 7 text tokens. A special placeholder in the prompt template is replaced by the 576 visual tokens.

The final input sequence fed to the LLM begins with 576 visual tokens (one per image patch), followed by the 7 text tokens that represent the question "What is in this image?". The total context length is approximately 583 tokens. Every transformer layer in the LLM can attend across this full sequence, meaning each word the model generates can be influenced by any image patch.
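The placeholder substitution can be illustrated without any model. In this sketch the sentinel id, the text token ids, and the function name `splice_visual_tokens` are all made up for illustration; a real implementation splices embedding vectors rather than labelled tuples.

```python
IMAGE_PLACEHOLDER = -200  # hypothetical sentinel id marking the image slot

def splice_visual_tokens(text_ids, num_visual_tokens, placeholder=IMAGE_PLACEHOLDER):
    """Expand the placeholder into num_visual_tokens visual slots.

    Returns a list of ("visual", patch_index) and ("text", token_id) entries.
    """
    sequence = []
    for token_id in text_ids:
        if token_id == placeholder:
            sequence.extend(("visual", i) for i in range(num_visual_tokens))
        else:
            sequence.append(("text", token_id))
    return sequence

question_ids = [101, 102, 103, 104, 105, 106, 107]  # made-up ids for the 7 text tokens
prompt_ids = [IMAGE_PLACEHOLDER] + question_ids

sequence = splice_visual_tokens(prompt_ids, 576)
print(len(sequence))                              # 583
print(sequence[0], sequence[575], sequence[576])  # ('visual', 0) ('visual', 575) ('text', 101)
```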

Step 6: LLM Autoregressive Generation

LLaMA-2-13B processes the 583-token sequence. All 40 transformer layers apply self-attention over the full sequence, meaning every text token can attend to every visual token. The model attends to the spatial regions relevant to each generated word: when generating "street", it attends heavily to road-patch tokens; when generating "buildings", it attends to upper-image patches.

The model generates tokens one at a time: "The", "image", "shows", "a", "busy", "city", "street", "with", ... until an end-of-sequence token is produced.

Python Implementation

The following example shows how to load LLaVA-NeXT (LLaVA-1.6) using the HuggingFace transformers library and run visual inference.

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# Load model and processor
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # LLaVA-NeXT, HF-compatible
processor = LlavaNextProcessor.from_pretrained(model_id)

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"          # automatically distributes across available GPUs/CPU
)

# Load an image from URL (or use PIL.Image.open for local files)
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/240px-PNG_transparency_demonstration_1.png"
image = Image.open(requests.get(url, stream=True).raw)

# Build the conversation prompt using the LLaVA chat template
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image? Describe it in detail."},
        ],
    },
]

# Apply the processor's chat template to format the prompt
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Process inputs: tokenises text + encodes image into visual tokens
inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt"
).to(model.device)

# Print token counts
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_image_tokens = (inputs["input_ids"] == image_token_id).sum().item()
print(f"Image tokens: {num_image_tokens}")
print(f"Total input tokens: {inputs['input_ids'].shape[1]}")

# Generate response
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,           # greedy decoding for determinism
    )

# Decode only the newly generated tokens (not the prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Model response:")
print(response)

For LLaVA-1.5 specifically, using the HF-converted checkpoint:

from transformers import AutoProcessor, LlavaForConditionalGeneration
import torch
from PIL import Image

# LLaVA-1.5 weights converted to the HuggingFace format.
# (The original liuhaotian/llava-v1.5-7b checkpoint is intended for the LLaVA repo's own loader.)
model_id = "llava-hf/llava-1.5-7b-hf"

# The processor bundles CLIP's image preprocessor and the tokenizer
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

def run_llava_inference(image: Image.Image, question: str) -> str:
    """Run LLaVA-1.5 inference on a single image and question."""

    # Format prompt with LLaVA's special image token
    # The <image> placeholder marks where visual tokens will be inserted
    prompt = f"USER: <image>\n{question} ASSISTANT:"

    # Preprocess image (resize to 336x336, normalise with CLIP stats) and tokenise text
    inputs = processor(
        images=image,
        text=prompt,
        return_tensors="pt"
    ).to(model.device, torch.float16)

    with torch.inference_mode():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,
            use_cache=True,
        )

    # Decode only new tokens
    output_text = processor.decode(
        output_ids[0, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    ).strip()

    return output_text

# Example usage
image = Image.open("street.jpg")
answer = run_llava_inference(image, "How many cars are in this image?")
print(answer)

For batch inference with multiple images (important for production throughput):

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

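processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# Left-pad so that every prompt ends right where generation begins
processor.tokenizer.padding_side = "left"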
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

def batch_inference(images: list, questions: list, batch_size: int = 4):
    """Process multiple image-question pairs in batches."""
    results = []

    for i in range(0, len(images), batch_size):
        batch_images = images[i:i + batch_size]
        batch_questions = questions[i:i + batch_size]

        conversations = [
            [{"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": q}
            ]}]
            for q in batch_questions
        ]

        prompts = [
            processor.apply_chat_template(conv, add_generation_prompt=True)
            for conv in conversations
        ]

        # Padding is required for batch processing
        inputs = processor(
            images=batch_images,
            text=prompts,
            return_tensors="pt",
            padding=True
        ).to(model.device)

        with torch.inference_mode():
            output_ids = model.generate(**inputs, max_new_tokens=256)

        for j, out in enumerate(output_ids):
            prompt_len = inputs["input_ids"][j].shape[0]
            response = processor.decode(out[prompt_len:], skip_special_tokens=True)
            results.append(response)

    return results

Comparison of Major VLMs

Model	Architecture Type	Vision Encoder	LLM Backbone	Open / Closed	Best Use Case
LLaVA-1.5	Encoder + MLP Projector + LLM	CLIP ViT-L/14@336px	Vicuna-7B / 13B	Open (weights)	General VQA, baseline research, self-hosted deployment
LLaVA-NeXT	Encoder + MLP Projector + LLM (dynamic resolution)	CLIP ViT-L/14@336px (tiled)	Mistral-7B / LLaMA-3-8B / 70B	Open (weights)	High-res documents, OCR, chart understanding
BLIP-2	Encoder + Q-Former + LLM	CLIP ViT-L or EVA-ViT-G	OPT-2.7B / FlanT5-XXL	Open (weights)	Image captioning, zero-shot VQA
InstructBLIP	Encoder + Q-Former + LLM (instruction-tuned)	CLIP ViT-L or EVA-ViT-G	Vicuna-7B / 13B, FlanT5	Open (weights)	Instruction-following VQA, science diagrams
Flamingo	Cross-attention fusion (Perceiver Resampler)	NFNet-F6	Chinchilla-70B (frozen)	Closed (weights not released)	Few-shot multimodal reasoning, interleaved image-text
GPT-4V / GPT-4o	Native multimodal (unified tokenisation)	Undisclosed	GPT-4 class (undisclosed)	Closed (API only)	Complex visual reasoning, multimodal agents, fine-grained OCR
Claude 3 Vision	Native multimodal (undisclosed)	Undisclosed	Claude 3 class (undisclosed)	Closed (API only)	Document analysis, chart interpretation, long-form visual reasoning
Gemini Vision	Native multimodal (interleaved tokens)	Undisclosed (likely SigLIP-based)	Gemini 1.5 Pro / Flash	Closed (API only)	Long-context video understanding, document OCR at scale

Advantages of VLMs

Visual reasoning: VLMs can answer complex questions that require integrating visual evidence with world knowledge. "Is the food in this image appropriate for someone with celiac disease?" requires recognising food items, knowing their ingredients, and understanding dietary restrictions simultaneously.
Zero-shot generalisation: CLIP-pretrained VLMs generalise to visual concepts not explicitly seen in instruction tuning, because the vision encoder's representations already cover a vast range of visual categories.
Document understanding: Combining OCR capability with language understanding, VLMs can process contracts, forms, invoices, and research papers in a single pass, extracting structured information without explicit layout parsing.
Chart and table parsing: VLMs understand the visual grammar of charts (axes, legends, bars, lines) and can extract data, identify trends, and answer quantitative questions about plotted data.
Accessibility applications: Image captioning and visual question answering enable screen readers and assistive tools that describe images to visually impaired users in rich, contextual language.
Unified pipeline: A single VLM replaces a pipeline of specialised models (object detector, OCR engine, caption model, VQA model), reducing inference infrastructure complexity and the error propagation that occurs when chaining separate models.

Limitations and Trade-offs

Hallucination on fine-grained visual details: VLMs frequently hallucinate object attributes, counts, and spatial relationships. Asking "How many red cars are in the parking lot?" often yields plausible but incorrect numbers. The language model's priors about what is likely to appear in a scene can dominate over actual visual evidence.
Poor spatial reasoning: Tasks requiring precise spatial understanding ("Is the red ball to the left or right of the blue cube?") are systematically difficult because the patch tokenisation and self-attention mechanism do not preserve strong spatial inductive biases.
High token cost per image: A single 336x336 image consumes 576 context tokens in LLaVA-1.5. Processing 10 images in a conversation consumes 5,760 tokens before any text. This limits the number of images per conversation and drives up inference cost significantly.
Resolution constraints: CLIP ViT-L/14 was pretrained at 224x224. Fine-tuning at 336x336 helps, but images with small text or fine detail (e.g., PCB diagrams, microscopy) still lose information. LLaVA-NeXT's dynamic tiling partially addresses this.
Text in images: While VLMs can read printed text in images, they struggle with handwriting, dense text layouts, non-Latin scripts, and low-contrast text. Dedicated OCR systems like Tesseract or cloud vision APIs still outperform general VLMs on heavy-OCR tasks.
Limited auditability of closed models: Proprietary VLMs cannot be audited for what they actually "see". Their visual capabilities are characterised only through benchmarks and empirical testing, not by examining internal representations.

Common Mistakes

Over-relying on VLMs for precise measurements: VLMs cannot reliably read exact numerical values from charts, measurements from photos, or precise coordinates. If your application requires precise numerical extraction, combine the VLM with specialised computer vision tools.
Ignoring resolution limits: Sending a 4000x3000 pixel image to LLaVA-1.5 will downsample it to 336x336 before processing, discarding most of the detail. If the task requires reading small text or detecting small objects, use a model with dynamic high-resolution support (LLaVA-NeXT, GPT-4o) or pre-crop the region of interest.
Not providing sufficient text context: VLMs perform significantly better when the text prompt provides context about the task. "What do you see?" is worse than "You are analysing a medical X-ray. Describe any abnormalities in the lung region." The instruction-tuned LLM backbone benefits from context just as it does in text-only tasks.
Using the wrong model for the task: A general VQA model is not the right tool for production OCR at scale. If you need to extract all text from thousands of scanned documents, use a document-specific model (PaddleOCR, AWS Textract, Azure Form Recognizer). Use VLMs where visual reasoning, not just text extraction, is needed.
Forgetting to benchmark on your specific distribution: A VLM that achieves 80% on VQAv2 may perform far worse on your domain-specific images (medical scans, satellite imagery, engineering drawings). Always evaluate on representative samples from your target distribution before production deployment.
Processing images sequentially when batching is available: For offline processing tasks, batching images together (with padding) achieves significantly higher GPU utilisation than one-at-a-time inference.

Best Practices

Image Resolution Selection

Match resolution to task requirements. For general scene understanding and conversational QA, 336x336 (LLaVA-1.5) is sufficient. For document parsing, dense text, or fine-grained recognition, use models with higher native resolution or dynamic tiling (LLaVA-NeXT supports tiled inputs such as 672x672, 336x1344, and 1344x336). Never send images larger than the model's native resolution without checking how the library handles resizing.

Prompt Engineering for Visual Tasks

Structure prompts to specify: (1) what the image contains or what type it is, (2) what specific information you need, (3) the format of the answer. Example: "This image is a bar chart. Extract the numerical value for each bar and return them as a Python dictionary with bar labels as keys." is far more effective than "Read the chart."
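One way to keep prompts consistent across a codebase is a small template helper. The function below is a hypothetical sketch that simply assembles the three parts in order; the wording of each part remains the caller's responsibility.

```python
def build_visual_prompt(image_type: str, request: str, answer_format: str) -> str:
    """Assemble a prompt from (1) image type, (2) requested information, (3) answer format."""
    return f"This image is {image_type}. {request} {answer_format}"

prompt = build_visual_prompt(
    image_type="a bar chart",
    request="Extract the numerical value for each bar.",
    answer_format="Return them as a Python dictionary with bar labels as keys.",
)
print(prompt)
```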

When to Use VLMs vs Dedicated Models

Use VLMs when: the task requires combining visual evidence with reasoning or world knowledge, the task is too varied for a specialised model, or you need a conversational interface over visual content. Use dedicated models when: you need maximum accuracy on a well-defined narrow task (face detection, license plate OCR, medical image segmentation), latency is critical, or cost per image must be minimised.

Evaluation with Visual Benchmarks

Standard benchmarks for VLM evaluation:

MMBench: Multi-task visual understanding benchmark with objective multiple-choice questions across 20 ability dimensions.
MMMU: Massive Multidisciplinary Multimodal Understanding. College-level questions across 30 subjects requiring domain expertise and visual reasoning.
TextVQA: Questions that require reading and reasoning about text within images. Specifically targets OCR capability integrated with language understanding.
GQA: Real-world visual reasoning with compositional questions and scene graphs for structural evaluation.
MME: Perception and cognition benchmarks with binary yes/no answers, measuring specific fine-grained capabilities.
POPE: Polling-based Object Probing Evaluation, specifically designed to measure object hallucination rates.
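For binary probing benchmarks such as POPE, scoring reduces to standard classification metrics over yes/no answers. The sketch below assumes the model's answers have already been normalised to "yes" or "no"; the function name `pope_metrics` and the sample data are made up for illustration.

```python
def pope_metrics(predictions, labels):
    """Score yes/no answers to object-presence probes ("Is there a <object> in the image?")."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)   # hallucinated objects
    fn = sum(p == "no" and l == "yes" for p, l in pairs)   # missed objects
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),  # a high value suggests a bias toward "yes"
    }

labels      = ["yes", "no",  "yes", "no", "yes", "no"]
predictions = ["yes", "yes", "yes", "no", "yes", "yes"]

for name, value in pope_metrics(predictions, labels).items():
    print(f"{name}: {value:.3f}")
```

In this toy sample the model never misses a real object (recall 1.0) but says "yes" to two absent ones, which shows up as lower precision and a yes-ratio well above the 50% base rate.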

Frequently Asked Questions

How is GPT-4o different from GPT-4V?

GPT-4V (the visual capability of GPT-4 Turbo) was an adaptation of GPT-4 to accept images, likely using a connector-based approach. GPT-4o was trained natively as a multimodal model from pretraining, processing images, text, and audio in a unified architecture. The key practical differences: GPT-4o is significantly faster (optimised for real-time use), has lower per-token cost, handles higher-resolution images more effectively, and supports native audio input/output in addition to vision. GPT-4o also reportedly tokenises images into discrete visual tokens natively rather than mapping through a separate encoder, enabling tighter integration between modalities.

Why do VLMs hallucinate about images?

VLM hallucination has several root causes. First, the language model backbone has strong prior distributions over co-occurring concepts: if the visual context suggests a kitchen, the model's language priors strongly prefer "refrigerator", "sink", "counter" over unusual objects even if they are not present. Second, the vision encoder produces continuous, compressed representations that lose fine-grained detail: two objects that look different to a human may produce similar patch embeddings. Third, training data often contains noisy or incomplete image-caption pairs, so the model learns to generate plausible descriptions rather than accurate ones. Fourth, the projection layer may not perfectly convey spatial and attribute information from vision to language space. Addressing hallucination requires special training techniques (such as RLHF-V) and careful evaluation with hallucination-focused benchmarks (such as POPE).

Can VLMs understand video?

Yes, with varying approaches. The simplest method is to sample N frames from a video and concatenate their visual tokens, treating the video as a long image sequence. This is the approach used by Video-LLaVA, Video-ChatGPT, and similar models. The limitation is context length: even at 1 frame per second, a 30-second video produces 17,280 visual tokens (30 frames x 576 tokens) at LLaVA's standard token count. Long-context models (Gemini 1.5 Pro with 1M token context) handle this better. More specialised video VLMs use temporal encoding mechanisms or hierarchical frame sampling to handle longer videos efficiently. Gemini 1.5 Pro accepts video input natively via its API, while GPT-4o is typically given a sequence of sampled frames.
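The frame-budget arithmetic is easy to script when planning a sampling rate. In this sketch the helper names are hypothetical, and the 256 tokens reserved for the text prompt and answer are an arbitrary assumption.

```python
def video_token_budget(duration_s: int, fps: int, tokens_per_frame: int = 576) -> int:
    """Visual tokens produced by sampling a video at a fixed frame rate."""
    return duration_s * fps * tokens_per_frame

def frames_that_fit(context_tokens: int, tokens_per_frame: int = 576,
                    reserved_text_tokens: int = 256) -> int:
    """How many frames fit in the context after reserving room for text."""
    return max(0, (context_tokens - reserved_text_tokens) // tokens_per_frame)

print(video_token_budget(30, 1))  # 17280
print(frames_that_fit(4096))      # 6
```

With a 4096-token context, only about six frames fit alongside the prompt, which is why naive frame concatenation breaks down quickly for anything beyond short clips.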

How many tokens does one image use?

It depends on the model and resolution. Reference values: LLaVA-1.5 at 336x336 uses 576 tokens (24x24 patches). LLaVA-NeXT with dynamic tiling at high resolution can use up to 2880 tokens per image (5 tiles of 576 each). BLIP-2 with Q-Former uses 32 tokens regardless of resolution. GPT-4V/4o uses approximately 85 tokens for low-detail mode; in high-detail mode it uses 170 tokens per 512x512 tile plus the 85-token base (so a 1024x1024 image in high detail, which is processed as four tiles, uses approximately 4 x 170 + 85 = 765 tokens). Claude's API does not publicly disclose exact visual token counts but processes images up to 8000x8000 pixels with pricing based on image area.
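The GPT-4V/4o figures above can be reproduced with a short estimator. The 85-token base and 170 tokens per tile come from the text; the resizing steps (fit within 2048x2048, then shrink the shortest side to 768) are an assumption based on OpenAI's published description of high-detail mode and may not match current behaviour, so treat the result as an estimate only.

```python
import math

def estimate_openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens: 85 base tokens plus 170 per 512x512 tile in high detail."""
    if detail == "low":
        return 85
    # Assumed step 1: scale down to fit within a 2048x2048 square
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Assumed step 2: scale down so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(estimate_openai_image_tokens(1024, 1024))         # 765
print(estimate_openai_image_tokens(1024, 1024, "low"))  # 85
```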

Is CLIP the best vision encoder?

CLIP ViT-L/14 is the most commonly used encoder for open-source VLMs due to its strong semantic alignment and wide availability, but it is not universally the best. EVA-CLIP (from BAAI) is a stronger encoder with better performance on dense prediction tasks and is used in InstructBLIP and some LLaVA-NeXT variants. SigLIP (Google, sigmoid loss variant of CLIP) shows better performance on image-text retrieval and is used in PaliGemma. For domain-specific applications, specialised encoders (medical image encoders, satellite imagery encoders) will outperform general-purpose CLIP on their target domain. The trend in 2025-2026 is toward larger encoders (ViT-G, ViT-H class) trained on more diverse data.

References

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023. arXiv:2304.08485
Liu, H., Li, C., Li, Y., & Lee, Y. J. (2024). Improved Baselines with Visual Instruction Tuning (LLaVA-1.5). CVPR 2024. arXiv:2310.03744
Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020
Alayrac, J. B., Donahue, J., Luc, P., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022. arXiv:2204.14198
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. arXiv:2301.12597
Dai, W., Li, J., Li, D., et al. (2023). InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. NeurIPS 2023. arXiv:2305.06500
OpenAI. (2023). GPT-4V(ision) System Card. openai.com
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021. arXiv:2010.11929
Liu, H., Li, C., Li, Y., et al. (2024). LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. llava-vl.github.io

Key Takeaways

VLMs require three components: a vision encoder (converts pixels to semantic patch embeddings), a projection layer (maps vision space to language space), and an LLM backbone (reasons over the combined token sequence). Each component's quality limits overall performance.
CLIP is the dominant vision backbone for open-source VLMs because its contrastive training produces image representations that are already semantically aligned with language, making the projection learning task tractable.
A 336x336 image becomes 576 visual tokens in LLaVA-1.5's pipeline. This token cost is a first-class engineering concern: it determines context window usage, inference latency, and API cost. Dynamic tiling (LLaVA-NeXT) and Q-Former compression (BLIP-2) are the two main strategies for managing it.
Two-stage training is the standard recipe: Stage 1 aligns the projection layer by training on image-caption pairs with the LLM frozen; Stage 2 instills instruction-following via multimodal conversation data with the full model (or LoRA adapters) unfrozen. Skipping Stage 1 leads to poor alignment.
Hallucination is structural, not accidental. The language model's strong priors over plausible visual scenes can override actual visual evidence, especially for fine-grained counts, attributes, and spatial relationships. POPE is the standard benchmark for measuring hallucination rates.
The right tool for the task matters: Use VLMs for tasks requiring visual reasoning plus language understanding. Use dedicated OCR, detection, or segmentation models for tasks requiring maximum precision on narrow, well-defined visual subtasks. The best production systems often combine both.

Mixture of Experts (MoE): The Architecture Behind GPT-4, Mixtral, and Grok

2026-06-06T02:00:00+00:00

Mixture of Experts (MoE): The Architecture Behind Frontier LLMs

What you will learn: How MoE replaces dense feed-forward layers with banks of specialist networks, how the gating router works, and why this lets models scale capacity without scaling compute.

Why it matters: MoE is the architecture behind Mixtral, Grok-1, and DeepSeek-V3, and is widely reported (though unconfirmed) to underlie GPT-4. Understanding it is essential for any engineer working with frontier-scale models.

Key insight: Only 2 of 8 experts (or similar ratios) activate per token. Total parameters are large; active parameters per forward pass are small. That gap is where MoE's efficiency comes from.

Watch out for: Without an auxiliary loss or another balancing mechanism, routing tends to collapse onto one or two experts. Training MoE is more complex than training a dense model with the same number of active parameters.

Covered in depth: Gating mechanisms, token-choice vs expert-choice routing, load balancing, training challenges, a hand-worked routing example, PyTorch implementation, and a comparison of real-world MoE models.

When Mistral released Mixtral 8x7B in December 2023, it demonstrated something striking: a model with 46.7 billion total parameters that matched or outperformed LLaMA 2 70B on most benchmarks while activating only about 12.9 billion parameters per token, with Mistral reporting roughly 6x faster inference. The secret was not a better dataset or a bigger GPU budget. It was a fundamentally different architecture, one that had been theorised since the 1990s but only recently became practical at scale: Mixture of Experts.

The core insight behind MoE is deceptively simple. Not every token in a sequence requires the same kind of processing. A token like "photosynthesis" calls for biological and chemical knowledge; a token like "integrate" might call for mathematical reasoning. Why should both tokens activate exactly the same set of parameters? In a standard dense transformer, they do. Every token flows through every weight in every feed-forward layer, regardless of relevance.

MoE breaks this constraint. Instead of one monolithic feed-forward network (FFN) per transformer layer, MoE replaces it with multiple smaller networks called experts, and a lightweight router that decides, for each token, which subset of experts to activate. The result is a model with far greater total capacity, but no increase in the compute required per token.

This post gives you a rigorous, ground-up understanding of MoE: the theory, the architecture, the training challenges, a hand-worked numerical example, and a clean PyTorch implementation. By the end you will understand why this architecture dominates frontier model design from 2024 through 2026, and what trade-offs you accept when you use it.

The Problem MoE Solves

To appreciate why MoE exists, you need to understand the scaling wall that dense transformers hit.

The Dense Model Scaling Problem

Scaling laws, first documented systematically by Kaplan et al. (2020) and later refined in the Chinchilla paper (Hoffmann et al., 2022), show that a dense transformer's loss decreases predictably as you increase parameters and training tokens. More parameters means more capacity to memorise facts, learn syntax, and generalise across domains. Larger models are simply better, and the industry spent 2020 to 2023 proving this empirically.

But the cost of running a dense model scales linearly with its parameter count. If you double the number of parameters, you roughly double the FLOPs per forward pass, double the memory bandwidth required, and double the GPU memory needed to store the model. At 7 billion parameters this is manageable. At 70 billion parameters it requires careful engineering. At 700 billion parameters it becomes financially brutal at inference time.
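As a rough illustration of that linear scaling, the sketch below applies two common rules of thumb. Both are assumptions for illustration rather than figures from this post: a forward pass costs about 2 FLOPs per parameter per token, and weights stored in fp16/bf16 take 2 bytes per parameter. The function name `dense_cost` is hypothetical.

```python
# Rule-of-thumb cost model for a dense decoder (assumptions, not measurements):
#   forward FLOPs per token ~= 2 * parameter_count
#   weight memory           ~= parameter_count * bytes_per_param (2 for fp16/bf16)

def dense_cost(params: float, bytes_per_param: int = 2) -> tuple[float, float]:
    """Return (forward GFLOPs per token, weight memory in GB)."""
    gflops_per_token = 2 * params / 1e9
    weight_gb = params * bytes_per_param / 1e9
    return gflops_per_token, weight_gb


for params in (7e9, 70e9, 700e9):
    gflops, gb = dense_cost(params)
    print(f"{params / 1e9:>5.0f}B params: ~{gflops:,.0f} GFLOPs/token, ~{gb:,.0f} GB of weights")
```

Every 10x increase in parameters shows up as a 10x increase in both columns, which is the coupling that MoE is designed to loosen.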

The fundamental tension is this: you want capacity at training time, but you want cheapness at inference time. In a dense model, these two goals are in direct conflict. Every parameter you add for better quality is another parameter you pay to run on every token at inference.

The MoE Escape Hatch

MoE breaks the tight coupling between model capacity and per-token compute. You can have a model with, say, 46 billion parameters, but only activate 12 billion of them for any given token. Training teaches all 46 billion parameters to specialise, so the total knowledge in the model is large. But inference only pays for the 12 billion parameters that are actually used.

This is analogous to a hospital. A hospital employs hundreds of specialists: cardiologists, neurologists, dermatologists, oncologists. When a patient arrives, only the relevant specialists are called in. You do not call every specialist for every patient just because they are all on staff. The total expertise of the hospital is large, but the cost of treating any one patient is bounded by the number of specialists actually needed.

Core Concepts and Terminology

Term	Definition
Expert	A distinct feed-forward network within an MoE layer. Each expert has its own weights and learns to specialise in a subset of the input distribution.
Router / Gating Network	A small learned linear layer that takes a token's hidden representation and produces a probability score for each expert. Determines which experts process each token.
Top-k Routing	The routing strategy where each token activates exactly k experts (typically k=1 or k=2). Only the top-k scoring experts receive the token; others are bypassed.
Sparse Activation	The property that only a small fraction of all parameters are activated for any given token. In Mixtral 8x7B, 2 of 8 experts fire per token: 25% of MoE parameters are active.
Load Balancing	The goal of distributing tokens roughly evenly across experts so no single expert becomes a bottleneck while others are idle.
Expert Capacity	A hard limit on how many tokens each expert can process in a single batch, expressed as a multiple of the average expected load (the capacity factor).
Auxiliary Loss	An additional loss term added during training to encourage balanced routing. Without it, experts collapse: the router learns to always pick the same one or two experts.
Token Dropping	When an expert's capacity buffer is full and a token that was routed to it gets discarded. The token then passes through unmodified (or is handled by a fallback mechanism).
Hard MoE	Routing with a discrete top-k selection. The router makes a hard binary decision: a token either goes to an expert or it does not. Most production MoE models use this.
Soft MoE	Routing where every expert processes a weighted combination of all tokens, with weights from the router. Differentiable but computationally expensive; used in research.

MoE Architecture Deep Dive

Where Experts Live in the Transformer

A standard transformer layer has two sub-layers: a multi-head self-attention (MHSA) module and a feed-forward network (FFN). In a dense model, both are present in every layer, and they run for every token in every batch.

In an MoE model, the FFN sub-layer in selected layers (typically every other layer, or every layer) is replaced by an MoE layer. The MHSA sub-layer is kept as-is. The MoE layer contains N independent FFN experts plus a gating network. The gating network routes each token to k of the N experts, runs only those k experts, and combines their outputs.

In a dense transformer, every token flows through the same feed-forward network weights. In an MoE layer, a lightweight gating network acts like a dispatcher: it evaluates each token, selects its top-k expert networks, runs only those forward passes, and combines their outputs using the gating scores as weights. Experts not selected perform no computation at all. That is how MoE achieves higher capacity without proportional compute cost.

The Gating Network

The gating network is a simple linear projection: a weight matrix of shape [hidden_dim, num_experts]. Given a token's hidden state vector of dimension hidden_dim, the gating network computes one logit per expert via a matrix multiply and then applies a softmax to get routing probabilities.

Concretely, for a token with hidden state h and gating weight matrix W_g:

Compute logits: logits = h @ W_g (shape: [num_experts])
Apply softmax: scores = softmax(logits) (shape: [num_experts])
Select top-k indices by score
Renormalise the top-k scores so they sum to 1
These renormalised scores become the mixing weights
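The five steps above can be sketched for a single token in a few lines of PyTorch. This is a minimal illustration with toy dimensions and random weights; the helper name `route_token` and the sizes are assumptions chosen for the example, not part of any particular model.

```python
import torch
import torch.nn.functional as F


def route_token(h: torch.Tensor, W_g: torch.Tensor, k: int = 2):
    """Return (expert indices, mixing weights) for one token's hidden state."""
    logits = h @ W_g                                  # [num_experts]
    scores = F.softmax(logits, dim=-1)                # probabilities over experts
    topk_scores, topk_idx = torch.topk(scores, k)     # keep the k best experts
    weights = topk_scores / topk_scores.sum()         # renormalise so they sum to 1
    return topk_idx, weights


torch.manual_seed(0)
hidden_dim, num_experts = 16, 8                       # toy sizes for illustration
h = torch.randn(hidden_dim)
W_g = torch.randn(hidden_dim, num_experts)

idx, w = route_token(h, W_g)
print("selected experts:", idx.tolist())
print("mixing weights:  ", w.tolist())
```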

The gating network has very few parameters relative to the experts. In Mixtral 8x7B, each expert is a SwiGLU FFN with three linear projections (hidden size 4096, inner size 14336), or roughly 176 million parameters per expert per layer. The gating matrix adds only 4096 * 8 = 32,768 parameters per layer, negligible by comparison.
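To see how small the router is, and where the total and active parameter counts come from, here is a back-of-the-envelope count. The configuration values are assumptions taken from Mixtral 8x7B's publicly released model config rather than from this post, and the breakdown ignores any implementation-specific extras.

```python
# Assumed Mixtral-8x7B-style configuration (from the public model config,
# not from this article); treat the values as illustrative.
hidden, ffn, layers = 4096, 14336, 32
num_experts, top_k = 8, 2
vocab, n_heads, n_kv_heads, head_dim = 32000, 32, 8, 128

expert = 3 * hidden * ffn                  # three projections of one SwiGLU expert
router = hidden * num_experts              # gating matrix, per layer
attn = (2 * hidden * n_heads * head_dim    # query and output projections
        + 2 * hidden * n_kv_heads * head_dim)  # key and value projections (grouped-query)
norms = 2 * hidden                         # two norm weight vectors per layer
embed = 2 * vocab * hidden + hidden        # input embedding, output head, final norm

shared = layers * (attn + router + norms) + embed   # used by every token
total = shared + layers * num_experts * expert      # what must sit in memory
active = shared + layers * top_k * expert           # what runs for one token

print(f"one expert, one layer: {expert / 1e6:.0f}M parameters")
print(f"router, one layer:     {router:,} parameters")
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the count lands on the 46.7B total and 12.9B active figures quoted for Mixtral 8x7B, with the routers contributing about one million parameters across the whole model.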

Top-k Selection: Why k=1 or k=2

The choice of k has significant practical consequences.

k=1 (Switch Transformer style): Each token activates exactly one expert. This minimises compute but also means the model cannot hedge. If the router is slightly wrong, there is no fallback. The expert must handle the token entirely. Training with k=1 tends to be less stable because the gating network receives high-variance gradient signals.

k=2 (Mixtral style): Each token activates two experts, and their outputs are combined with weighted averaging. This is more robust: if expert A and expert B both partially specialise in the token's domain, both contribute. Training is more stable than k=1 because the gradient can flow through two paths. The cost is that you activate twice the expert FLOPs per token compared to k=1.

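k>2: In models with a small number of large experts, returns diminish: each additional expert adds compute and reduces specialisation pressure. Larger k mainly appears in fine-grained designs that split capacity across many small experts, such as DeepSeek-V2 (6 of 160) and DeepSeek-V3 (8 of 256).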

Expert Capacity Buffer

During batched training and inference, multiple tokens in the same batch may route to the same expert. If 50% of tokens in a batch all want expert 3, expert 3 cannot process them all efficiently without becoming a serial bottleneck.

The capacity buffer solves this. Each expert is assigned a capacity: the maximum number of tokens it will process in one forward pass. The capacity is typically set as:

capacity = (batch_tokens * k / num_experts) * capacity_factor

A capacity factor of 1.0 means each expert handles exactly its fair share. A capacity factor of 1.25 gives a 25% buffer to absorb natural load variation. If more tokens are routed to an expert than its capacity allows, the excess tokens are dropped: they bypass that expert and their hidden state is passed through unchanged. During training, token dropping is tolerable if rare; during inference, it degrades output quality.
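A small sketch makes the capacity arithmetic and the effect of overflow concrete. It includes the top-k factor in the average load, as the implementation later in this post does. The helper names and the skewed routing pattern are hypothetical, constructed only to force an overflow.

```python
import torch


def expert_capacity(num_tokens: int, num_experts: int, top_k: int, capacity_factor: float) -> int:
    """Maximum number of token assignments one expert accepts per forward pass."""
    avg_load = num_tokens * top_k / num_experts
    return int(avg_load * capacity_factor)


def count_dropped(topk_indices: torch.Tensor, num_experts: int, capacity: int) -> int:
    """Count (token, expert) assignments that exceed each expert's capacity."""
    load = torch.bincount(topk_indices.flatten(), minlength=num_experts)
    return int((load - capacity).clamp(min=0).sum())


num_tokens, num_experts, top_k = 1024, 8, 2
capacity = expert_capacity(num_tokens, num_experts, top_k, capacity_factor=1.25)
print("capacity per expert:", capacity)  # 320

# Hypothetical skewed routing: every token picks expert 0, plus one of the
# remaining seven experts in round-robin order.
tokens = torch.arange(num_tokens)
topk_indices = torch.stack(
    [torch.zeros(num_tokens, dtype=torch.long), 1 + tokens % 7], dim=1
)
dropped = count_dropped(topk_indices, num_experts, capacity)
print("dropped assignments:", dropped)  # 704
```

Expert 0 receives 1024 assignments against a capacity of 320, so 704 of them overflow, while the other seven experts stay well under their limit.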

Data Flow for One Token Through an MoE Layer

Let us trace a single token through an MoE layer step by step. Assume 8 experts, k=2 routing, and the token's hidden state is a vector of dimension 4096.

Gating computation: The hidden state (shape [4096]) is multiplied by the gating weight matrix (shape [4096, 8]) to produce 8 logits.
Softmax: The 8 logits become 8 probabilities summing to 1.0.
Top-2 selection: The two highest probabilities are identified, say expert 3 (score 0.41) and expert 7 (score 0.33).
Score renormalisation: The two selected scores are renormalised: expert 3 gets weight 0.41/(0.41+0.33) = 0.554, expert 7 gets weight 0.446.
Capacity check: Both experts check whether their capacity buffers have room. If yes, the token is added to their input buffers.
Expert forward passes: Expert 3 and expert 7 each run their FFN independently on the token's hidden state, producing two output vectors.
Weighted combination: The two output vectors are combined: output = 0.554 * expert3_output + 0.446 * expert7_output.
Residual add: The combined output is added back to the input hidden state (standard transformer residual connection).

Routing Mechanisms

The gating function is the heart of MoE. Different routing strategies trade off between training stability, load balance, and computational tractability.

Token-Choice Routing (Standard)

In token-choice routing, each token independently selects its top-k experts. The router processes each token and outputs a distribution over experts; the top-k are activated. This is the most common scheme, used in Mixtral, Switch Transformer, and most other production MoE models.

Mechanism: For each token, compute gating scores for all N experts, take the top-k, renormalise, and combine expert outputs with those weights.

Advantages: Simple to implement. Each token gets its preferred experts. Easy to understand.

Disadvantages: Load imbalance is common. Popular experts get overloaded; unpopular experts starve. Requires auxiliary loss to prevent collapse. Token dropping is necessary when capacity is exceeded.

Expert-Choice Routing

In expert-choice routing, the perspective is flipped. Instead of each token choosing its top-k experts, each expert chooses its top-k tokens from the batch, where this k is a per-expert token budget rather than the number of experts per token. Each expert is guaranteed to process exactly that many tokens, eliminating capacity overflow by construction.

Mechanism: For each expert, compute affinity scores between that expert and all tokens, take the top-k tokens, and process them. Each expert processes exactly k tokens regardless of batch composition.

Advantages: Perfect load balance. No token dropping. No auxiliary loss needed for balancing.

Disadvantages: Some tokens may not be processed by any expert (if no expert selects them), or may be selected by multiple experts (redundant compute). Variable coverage per token makes masking and loss computation more complex. Not used in most production models at scale, though it appeared in Google's research.
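The flipped perspective is easy to see in code. The sketch below is a minimal illustration of expert-choice selection with toy sizes and random weights. The function name and the `tokens_per_expert` parameter are assumptions for the example, and the expert forward passes are omitted.

```python
import torch
import torch.nn.functional as F


def expert_choice_route(x: torch.Tensor, W_g: torch.Tensor, tokens_per_expert: int):
    """Each expert picks its highest-affinity tokens.

    Returns (token indices, gate weights), each [num_experts, tokens_per_expert].
    """
    scores = F.softmax(x @ W_g, dim=-1)                 # [num_tokens, num_experts]
    gate, token_idx = torch.topk(scores.t(), tokens_per_expert, dim=-1)
    return token_idx, gate


torch.manual_seed(0)
num_tokens, hidden_dim, num_experts = 16, 32, 4
x = torch.randn(num_tokens, hidden_dim)
W_g = torch.randn(hidden_dim, num_experts)

token_idx, gate = expert_choice_route(x, W_g, tokens_per_expert=8)

# Load is fixed by construction; coverage per token is not.
coverage = torch.bincount(token_idx.flatten(), minlength=num_tokens)
print("tokens per expert:", token_idx.shape[1])
print("experts per token:", coverage.tolist())
```

Every expert processes exactly 8 tokens, but the per-token coverage printed on the last line typically varies, with some tokens picked by several experts and others by none.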

Soft MoE

Soft MoE, proposed by Google in 2023, avoids the hard top-k selection entirely. Instead of routing each token to a discrete set of experts, Soft MoE constructs a weighted "slot" for each expert that is a convex combination of all tokens, weighted by routing scores. Each expert then processes its slot, and the outputs are recombined.

Mechanism: For each expert, compute a softmax-weighted sum of all token representations. This "input slot" is processed by the expert. The output slot is then distributed back to tokens via another softmax weighting.

Advantages: Fully differentiable. No discrete routing decisions, so no gradient estimation issues. No token dropping by construction.

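Disadvantages: The dispatch and combine steps mix every token with every slot, which adds cost that sparse top-k routing avoids. Because each slot blends all positions in the sequence, the scheme also does not fit causal autoregressive decoding without modification. Better thought of as a research technique than a practical scaling strategy for LLMs.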
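The dispatch and combine weightings described above can be sketched as two softmaxes over the same logit matrix, one across tokens and one across slots. This is a simplified illustration with one slot per expert and toy linear layers standing in for the experts. The names `soft_moe` and `phi` are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def soft_moe(x: torch.Tensor, phi: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
    """x: [num_tokens, d]; phi: [d, num_slots]; one slot per expert in this sketch."""
    logits = x @ phi                          # [num_tokens, num_slots]
    dispatch = F.softmax(logits, dim=0)       # per slot: convex weights over tokens
    combine = F.softmax(logits, dim=1)        # per token: convex weights over slots
    slots = dispatch.t() @ x                  # [num_slots, d] weighted token mixtures
    slot_out = torch.stack([f(s) for f, s in zip(experts, slots)])
    return combine @ slot_out                 # [num_tokens, d]


torch.manual_seed(0)
num_tokens, d, num_experts = 6, 8, 3
x = torch.randn(num_tokens, d)
phi = torch.randn(d, num_experts)
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(num_experts)])

y = soft_moe(x, phi, experts)
print(y.shape)  # torch.Size([6, 8])
```

There is no top-k call anywhere, which is why the whole layer is differentiable end to end.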

Training Challenges

Expert Collapse

The most serious failure mode in MoE training is expert collapse. Early in training, by random chance, one expert produces slightly better outputs than the others. The router's gradient signal reinforces sending tokens to that expert. That expert then receives more training signal and improves faster, widening the gap. Eventually, nearly all tokens route to one or two experts, and the rest are effectively unused.

A collapsed MoE model carries the memory footprint and training cost of the full model, but the effective capacity of only one or two experts. It is the worst of both worlds.

Load Imbalance and the Capacity Factor

Even without full collapse, natural imbalance degrades efficiency. If 40% of tokens route to expert 1 and only 5% to expert 8, expert 1 overflows its buffer while expert 8 sits idle. The capacity factor must be set high enough to absorb real-world imbalance without excessive token dropping.

Setting the capacity factor too high wastes memory (pre-allocated buffers that go unused). Setting it too low causes token dropping and quality degradation. A common default is 1.25, but this requires tuning per-model.

The Auxiliary Load Balancing Loss

To counteract collapse and imbalance, practitioners add an auxiliary loss term to the total training objective. The idea is to penalise the router whenever its routing decisions are unequal across experts.

Conceptually, the auxiliary loss works as follows. For each expert, you compute two quantities: the fraction of tokens routed to it (call this the load fraction) and the average routing probability assigned to it across all tokens. You then multiply these two quantities for each expert and sum the results. This sum is minimised when routing is perfectly uniform.

In plain English: if expert 3 always gets high routing scores AND always gets chosen, the product is large and the loss penalises this. The router is pushed toward distributing both scores and selections more evenly.

The auxiliary loss is added to the main cross-entropy loss with a small coefficient, typically 0.01 or 0.001. Too large a coefficient over-regularises and prevents experts from specialising; too small and collapse occurs anyway. This coefficient is one of the most sensitive hyperparameters in MoE training.
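To make the behaviour of this loss concrete, the sketch below evaluates the load-fraction times mean-probability sum, scaled by the number of experts, on two hand-built routing patterns. The probabilities are invented for illustration and the coefficient is left out so the raw values are easy to compare.

```python
import torch
import torch.nn.functional as F


def load_balance_loss(gate_scores: torch.Tensor, top_k: int) -> torch.Tensor:
    """Switch-style balance term: num_experts * sum_i(load_fraction_i * mean_prob_i)."""
    num_tokens, num_experts = gate_scores.shape
    _, topk_idx = torch.topk(gate_scores, top_k, dim=-1)
    mask = F.one_hot(topk_idx, num_classes=num_experts).float()   # [tokens, k, experts]
    load_fraction = mask.sum(dim=(0, 1)) / (num_tokens * top_k)
    mean_prob = gate_scores.mean(dim=0)
    return num_experts * (load_fraction * mean_prob).sum()


num_tokens, num_experts = 8, 4

# Balanced: token i prefers expert i % 4 (probability 0.7, the others 0.1 each).
preferred = torch.arange(num_tokens) % num_experts
balanced = 0.1 + 0.6 * F.one_hot(preferred, num_classes=num_experts).float()

# Collapsed: every token prefers expert 0 with the same probabilities.
all_zero = torch.zeros(num_tokens, dtype=torch.long)
collapsed = 0.1 + 0.6 * F.one_hot(all_zero, num_classes=num_experts).float()

balanced_loss = load_balance_loss(balanced, top_k=1).item()
collapsed_loss = load_balance_loss(collapsed, top_k=1).item()
print(f"balanced:  {balanced_loss:.2f}")   # 1.00
print(f"collapsed: {collapsed_loss:.2f}")  # 2.80
```

With a coefficient of 0.01, the collapsed pattern contributes 0.028 to the total loss against 0.010 for the balanced one, a small but consistent gradient pushing the router back toward uniform use.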

Communication Overhead in Distributed Training

In dense models, tensor parallelism and pipeline parallelism distribute the computation of each layer across devices. In MoE models, experts naturally map to expert parallelism: different experts live on different devices. This is efficient when routing is balanced.

However, when a token on device A is routed to an expert on device B, the token's hidden state must be transferred across the network interconnect. This all-to-all communication is a latency bottleneck, especially at large scales. The Switch Transformer paper dedicated significant engineering effort to this problem. DeepSeek-V3 introduced novel communication-compute overlap techniques to mitigate it.

Why MoE Models Are Harder to Fine-Tune

Fine-tuning an MoE model presents unique challenges. First, the full model must fit in GPU memory to allow gradient computation through all experts, which is expensive. Second, parameter-efficient fine-tuning (PEFT) methods like LoRA, when applied only to attention or dense layers, leave the expert weights frozen and may not adapt domain-specific knowledge effectively. Third, the routing distribution learned during pre-training may be miscalibrated for a new domain, causing suboptimal expert utilisation during fine-tuning. Finally, training instability (gradients through the sparse discrete routing) is more pronounced with smaller fine-tuning datasets.

Practical Example: Hand-Worked MoE Layer

Let us work through a concrete numerical example. We have a 4-expert MoE layer and a batch of 3 tokens. We use k=2 routing.

Step 1: Router Computes Scores

The gating network takes each token's hidden state and produces a score for each of the 4 experts. After applying softmax, we get the following routing probabilities:

Token	Expert 1 Score	Expert 2 Score	Expert 3 Score	Expert 4 Score
Token A	0.10	0.55	0.25	0.10
Token B	0.40	0.08	0.12	0.40
Token C	0.05	0.60	0.30	0.05

Step 2: Top-2 Selection Per Token

Token	Selected Experts (top 2)	Raw Scores	Renormalised Weights
Token A	Expert 2, Expert 3	0.55, 0.25	0.688, 0.312
Token B	Expert 1, Expert 4	0.40, 0.40	0.500, 0.500
Token C	Expert 2, Expert 3	0.60, 0.30	0.667, 0.333

Step 3: Expert Activation Count

Expert	Tokens Assigned	Load (of 3 tokens, k=2 so 6 total assignments)
Expert 1	Token B	1 token (16.7% of assignments)
Expert 2	Token A, Token C	2 tokens (33.3% of assignments)
Expert 3	Token A, Token C	2 tokens (33.3% of assignments)
Expert 4	Token B	1 token (16.7% of assignments)

Expert 2 and Expert 3 are more loaded than Expert 1 and Expert 4. In a large training run, this imbalance would grow without the auxiliary loss pushing toward uniformity.

Step 4: Output Combination

After all activated experts run their forward passes, outputs are combined:

Token A output = 0.688 * Expert2(h_A) + 0.312 * Expert3(h_A)
Token B output = 0.500 * Expert1(h_B) + 0.500 * Expert4(h_B)
Token C output = 0.667 * Expert2(h_C) + 0.333 * Expert3(h_C)

Each expert ran exactly once (processing the tokens assigned to it in a batched forward pass), and the weighted sum reconstructs the token-level output. Notice that Expert 2 processed both Token A and Token C in a single batched operation, which is computationally efficient.
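The tables above can be reproduced in a few lines, which is a useful sanity check when adapting the numbers. This sketch uses the routing probabilities from Step 1 as given; expert indices are converted to 1-based numbering to match the tables.

```python
import torch

# Routing probabilities from Step 1 (rows: tokens A, B, C; columns: experts 1-4)
scores = torch.tensor([
    [0.10, 0.55, 0.25, 0.10],   # Token A
    [0.40, 0.08, 0.12, 0.40],   # Token B
    [0.05, 0.60, 0.30, 0.05],   # Token C
])

topk_scores, topk_idx = torch.topk(scores, k=2, dim=-1)
weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)   # Step 2
load = torch.bincount(topk_idx.flatten(), minlength=4)          # Step 3

for name, idx, w in zip("ABC", topk_idx, weights):
    experts = [i + 1 for i in idx.tolist()]                     # 1-based, as in the tables
    print(f"Token {name}: experts {experts}, weights {[round(v, 3) for v in w.tolist()]}")
print("assignments per expert:", load.tolist())
```

Token B's two selected experts are tied at 0.40, so the order in which they are listed is not significant.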

Python Implementation

The following implementation covers the core MoE layer: gating, top-k routing with a capacity buffer, expert forward passes, and the auxiliary load balancing loss.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single expert: a SwiGLU FFN with three linear projections and SiLU gating."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # gate projection (passed through SiLU)
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)  # down projection back to hidden_dim
        self.w3 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # up projection (multiplied by the gate)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: element-wise product of SiLU(w1(x)) and w3(x), then projected back
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class MoELayer(nn.Module):
    """
    Sparse Mixture of Experts layer.

    Args:
        hidden_dim:      Dimension of the token hidden states.
        ffn_dim:         Inner dimension of each expert FFN.
        num_experts:     Total number of experts (N).
        top_k:           Number of experts activated per token (k).
        capacity_factor: Multiplier on the average expert load to set capacity.
        aux_loss_coef:   Weight for the auxiliary load-balancing loss.
    """

    def __init__(
        self,
        hidden_dim: int,
        ffn_dim: int,
        num_experts: int = 8,
        top_k: int = 2,
        capacity_factor: float = 1.25,
        aux_loss_coef: float = 0.01,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.aux_loss_coef = aux_loss_coef

        # Gating network: projects hidden_dim to num_experts logits
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

        # Expert networks
        self.experts = nn.ModuleList(
            [ExpertFFN(hidden_dim, ffn_dim) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: Token hidden states, shape [batch_size, seq_len, hidden_dim].

        Returns:
            output:    Same shape as x.
            aux_loss:  Scalar auxiliary load-balancing loss.
        """
        batch_size, seq_len, hidden_dim = x.shape

        # Flatten tokens: treat batch and sequence as one dimension
        # Shape: [batch_size * seq_len, hidden_dim]
        x_flat = x.reshape(-1, hidden_dim)  # reshape also handles non-contiguous inputs
        num_tokens = x_flat.shape[0]

        # ── Gating ──────────────────────────────────────────────────────────
        # Raw logits from the gating network
        gate_logits = self.gate(x_flat)                    # [num_tokens, num_experts]
        gate_scores = F.softmax(gate_logits, dim=-1)       # [num_tokens, num_experts]

        # ── Top-k selection ──────────────────────────────────────────────────
        # Select the top-k experts for each token
        topk_scores, topk_indices = torch.topk(gate_scores, self.top_k, dim=-1)
        # topk_scores:  [num_tokens, top_k]
        # topk_indices: [num_tokens, top_k]

        # Renormalise the top-k scores so they sum to 1 per token
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        # ── Auxiliary load-balancing loss ────────────────────────────────────
        # Fraction of tokens routed to each expert (discrete indicator)
        expert_mask = F.one_hot(topk_indices, num_classes=self.num_experts).float()
        # expert_mask: [num_tokens, top_k, num_experts]
        tokens_per_expert = expert_mask.sum(dim=[0, 1])            # [num_experts]
        fraction_routed = tokens_per_expert / (num_tokens * self.top_k)

        # Average routing probability for each expert
        mean_gate_scores = gate_scores.mean(dim=0)                 # [num_experts]

        # Auxiliary loss: dot product of fraction routed and mean scores,
        # scaled by num_experts so the target value is ~1.0 at perfect balance
        aux_loss = self.aux_loss_coef * self.num_experts * (
            fraction_routed * mean_gate_scores
        ).sum()

        # ── Expert capacity ──────────────────────────────────────────────────
        # Average tokens per expert (with top_k factor)
        avg_load = (num_tokens * self.top_k) / self.num_experts
        capacity = int(avg_load * self.capacity_factor)

        # ── Expert forward passes ────────────────────────────────────────────
        output_flat = torch.zeros_like(x_flat)

        for expert_idx, expert in enumerate(self.experts):
            # Find all (token, k_slot) positions assigned to this expert
            # expert_mask[:, :, expert_idx]: [num_tokens, top_k]
            token_positions = expert_mask[:, :, expert_idx].nonzero(as_tuple=False)
            # token_positions[:, 0] are token indices
            # token_positions[:, 1] are the k-slot indices (0 or 1 for top-2)

            if token_positions.numel() == 0:
                continue

            token_indices = token_positions[:, 0]

            # Apply capacity: drop tokens beyond capacity
            if token_indices.shape[0] > capacity:
                token_indices = token_indices[:capacity]
                token_positions = token_positions[:capacity]

            # Gather the routing weights for these (token, expert) pairs
            k_slot_indices = token_positions[:, 1]
            routing_weights = topk_scores[token_indices, k_slot_indices]  # [n_assigned]

            # Run expert on the selected tokens
            expert_inputs = x_flat[token_indices]                  # [n_assigned, hidden_dim]
            expert_outputs = expert(expert_inputs)                 # [n_assigned, hidden_dim]

            # Weight outputs and accumulate
            weighted_outputs = expert_outputs * routing_weights.unsqueeze(-1)
            output_flat.index_add_(0, token_indices, weighted_outputs)

        # Reshape back to [batch_size, seq_len, hidden_dim]
        output = output_flat.view(batch_size, seq_len, hidden_dim)

        return output, aux_loss


# ── Usage example ────────────────────────────────────────────────────────────

def demo():
    batch_size, seq_len, hidden_dim, ffn_dim = 2, 128, 512, 2048

    moe = MoELayer(
        hidden_dim=hidden_dim,
        ffn_dim=ffn_dim,
        num_experts=8,
        top_k=2,
        capacity_factor=1.25,
        aux_loss_coef=0.01,
    )

    x = torch.randn(batch_size, seq_len, hidden_dim)
    output, aux_loss = moe(x)

    print(f"Input shape:   {x.shape}")        # [2, 128, 512]
    print(f"Output shape:  {output.shape}")    # [2, 128, 512]
    print(f"Aux loss:      {aux_loss.item():.4f}")

    # In a training loop, add aux_loss to the main loss:
    # total_loss = cross_entropy_loss + aux_loss
    # total_loss.backward()

demo()

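A few notes on the implementation above. The index_add_ operation accumulates weighted expert outputs into the output tensor, so each token's final output is the sum of the contributions from its selected experts. The capacity check truncates the token list to capacity tokens; dropped assignments contribute nothing and the remaining weights are not renormalised, so a production implementation would track dropped tokens for monitoring. The layer returns only the expert mixture, so the caller is expected to add the residual connection. The auxiliary loss computation follows the formulation from the Switch Transformer paper but is applied per-call rather than accumulated across steps. The Python loop over experts is written for clarity; optimised implementations batch or fuse this dispatch.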

Real-World Models Using MoE

Model	Organisation	Total Parameters	Active Parameters (per token)	Total Experts	Active Experts (k)	Notes
Switch Transformer	Google	Up to 1.6T	~1/N of MoE params	Up to 2048	1	First large-scale MoE transformer; k=1 routing
GLaM	Google	1.2T	~96B	64	2	Matched GPT-3 quality at 1/3 the training energy
GShard	Google	600B	~13B	2048	2	Multilingual translation; scaled to 2048 experts
Mixtral 8x7B	Mistral AI	46.7B	~12.9B	8	2	Open weights; matched LLaMA 2 70B at lower cost
Mixtral 8x22B	Mistral AI	141B	~39B	8	2	Strongest open-weights MoE at release in 2024
GPT-4	OpenAI	~1.8T (rumoured)	~220B (rumoured)	~16 (rumoured)	2 (rumoured)	Architecture unconfirmed; MoE widely reported by insiders
Grok-1	xAI	314B	~86B	8	2	Open weights released March 2024; MoE confirmed
DeepSeek-V2	DeepSeek	236B	~21B	160	6	Fine-grained MoE; also uses Multi-head Latent Attention
DeepSeek-V3	DeepSeek	671B	~37B	256	8	Reported GPU cost of about $5.6M for the final training run; auxiliary-loss-free load balancing

Note on GPT-4: OpenAI has not officially confirmed GPT-4's architecture. The MoE figures cited above originate from reporting by George Hotz and others, and should be treated as credible rumour rather than confirmed fact.

Advantages

Compute efficiency at scale. Mixtral 8x7B matches LLaMA 2 70B in quality while activating about 12.9B parameters per token instead of 70B, roughly 5x less compute per inference token. This is not a minor optimisation; at production scale it changes the economics entirely.
Better scaling laws. MoE models follow more favourable scaling curves than dense models when parameter count is measured against compute budget. You get more capability per FLOP spent on training.
Expert specialisation. Empirical studies show that individual experts develop preferences for particular token types: syntax-heavy text, mathematical expressions, code, specific languages. The model learns a natural division of labour.
Parallelism-friendly architecture. Expert parallelism maps cleanly to multi-device setups. Each expert can live on a separate GPU or node, making very large models tractable to train and serve.
Knowledge capacity. Total parameter count determines how much factual knowledge a model can store. MoE lets you grow this capacity cheaply, since adding experts does not increase per-token inference cost proportionally.
Proven at frontier scale. Many frontier labs (Google, Mistral, xAI, DeepSeek, and reportedly OpenAI) now use MoE or MoE-inspired architectures. The technique has been validated across many independent training runs at different scales.

Limitations and Trade-offs

Memory vs compute trade-off. The full model must be loaded into memory even though only a fraction of parameters are active per token. Serving Mixtral 8x7B requires loading all 46.7B parameters, not just the 12.9B that run for any given token. This requires significantly more accelerator memory than a dense model with the same per-token compute.
Communication costs in distributed inference. Serving an MoE model at scale with expert parallelism requires token-to-expert routing across devices, which introduces network latency. For latency-sensitive applications, this can be worse than a dense model served on a single large GPU.
Training instability. MoE models are more sensitive to hyperparameters than dense models. The auxiliary loss coefficient, the learning rate schedule, and the warmup period all interact in complex ways. A misconfigured run can produce a collapsed model with poor quality.
Fine-tuning difficulty. Full fine-tuning requires loading and updating all expert weights. PEFT methods that bypass expert weights may miss important domain adaptation. Routing distributions shift during fine-tuning and may diverge from the pre-training distribution in ways that hurt generalisation.
Token dropping. When experts are overloaded, tokens are dropped. Dropped tokens receive no expert processing, which degrades output quality. Monitoring and minimising token dropping is essential for production systems.
Reproducibility and debugging complexity. The non-deterministic routing (token permutations, capacity overflows) makes debugging MoE models harder than dense models. Bugs in the routing logic can silently degrade quality without obvious error signals.
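
To make the memory versus compute trade-off concrete, here is a rough sketch that turns the parameter counts quoted above into weight memory. The 2 bytes per parameter figure assumes 16-bit weights, which is an assumption for illustration rather than something this guide specifies, and the estimate ignores activations and the KV cache.

```python
# Rough weight-memory estimate for an MoE model (weights only).
# Assumption: 16-bit weights (2 bytes per parameter). Activations and the
# KV cache are ignored, so real serving needs more than this.

BYTES_PER_PARAM = 2  # fp16 / bf16, assumed for illustration


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the memory needed to hold the weights, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = 46.7e9   # every expert must be resident in memory
active_params = 12.9e9  # parameters actually used for any one token

resident_gb = weight_memory_gb(total_params)
active_gb = weight_memory_gb(active_params)

print(f"Weights that must be loaded: {resident_gb:.1f} GB")
print(f"Weights touched per token:   {active_gb:.1f} GB")
print(f"Total / active ratio:        {total_params / active_params:.2f}x")
```

The ratio on the last line is the gap between what you pay for in memory and what you pay for in compute.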

Common Mistakes

Ignoring the auxiliary loss entirely. Some practitioners omit the load balancing loss, assuming the model will naturally distribute load. It will not. Expert collapse is the default outcome without explicit regularisation. Always include the auxiliary loss and monitor expert utilisation during training.
Setting the capacity factor too close to 1.0. A capacity factor of 1.0 means any imbalance causes token dropping. Real routing distributions are never perfectly uniform. Use at least 1.1, and prefer 1.25 as a starting point. Reduce only if memory is severely constrained.
Applying PEFT only to attention and dense layers. LoRA or adapters applied exclusively to attention weights will not adapt the experts, which contain the bulk of domain-specific knowledge in an MoE model. Either fine-tune expert weights directly or apply LoRA to expert FFN weights as well.
Confusing total parameters with active parameters. Reporting Mixtral 8x7B as a "7B model" is inaccurate (it has 46.7B parameters). Reporting it as a "46B model" overstates inference cost (only 12.9B parameters are active per token). Distinguish clearly between total parameter count (relevant for memory and storage) and active parameter count (relevant for compute and latency).
Assuming MoE expert specialisation is guaranteed. Experts develop soft specialisation during training, but this is an emergent property, not a guaranteed one. If the auxiliary loss is too strong, experts become nearly identical to ensure balanced load, losing the benefit of specialisation.
Underestimating the impact of token dropping at inference. Token dropping during training is a controlled regulariser. Token dropping during inference is a quality bug. Evaluate your model's drop rate on representative inference workloads and increase capacity factor if drops exceed 1-2%.
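
The capacity factor and drop-rate advice above can be checked with a few lines of arithmetic. The sketch below uses one common convention for expert capacity (capacity factor times the average number of routed slots per expert); the function names and the skewed routing pattern are hypothetical, chosen only to illustrate the effect.

```python
import math
from collections import Counter


def expert_capacity(num_tokens: int, num_experts: int, k: int, capacity_factor: float) -> int:
    """Maximum routed slots one expert accepts (one common convention)."""
    return math.ceil(capacity_factor * num_tokens * k / num_experts)


def drop_rate(assignments: list[int], num_experts: int, k: int, capacity_factor: float) -> float:
    """Fraction of routed slots that overflow their expert's capacity.

    `assignments` is a flat list of expert ids, one entry per routed slot,
    so it holds num_tokens * k entries.
    """
    num_tokens = len(assignments) // k
    cap = expert_capacity(num_tokens, num_experts, k, capacity_factor)
    load = Counter(assignments)
    dropped = sum(max(0, load[e] - cap) for e in range(num_experts))
    return dropped / len(assignments)


# Hypothetical, mildly imbalanced routing: 80 tokens, 8 experts, k=1.
# Expert 0 receives 14 tokens, expert 1 receives 12, the other six get 9 each.
assignments = [0] * 14 + [1] * 12 + [e for e in range(2, 8) for _ in range(9)]

for cf in (1.0, 1.25, 1.5):
    rate = drop_rate(assignments, num_experts=8, k=1, capacity_factor=cf)
    print(f"capacity factor {cf}: drop rate {rate:.2%}")
```

With this routing pattern, a capacity factor of 1.0 drops 6 of the 80 tokens, 1.25 drops 1, and 1.5 drops none, which is why a small amount of headroom goes a long way.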

Best Practices

Start with a well-validated configuration. If building on open-source infrastructure, start from a configuration close to Mixtral's published architecture (8 experts, k=2) with commonly used routing settings (capacity factor around 1.25, auxiliary loss coefficient in the 0.01 to 0.02 range) and deviate only when you have a specific reason. Validated configurations save weeks of debugging.
Monitor expert utilisation throughout training. Log the fraction of tokens routed to each expert at regular intervals. A healthy training run should show relatively uniform utilisation (no expert above 25-30% of load for 8-expert k=2). Early detection of imbalance allows you to adjust the auxiliary loss coefficient before the run completes.
Tune the auxiliary loss coefficient carefully. Too high and experts become identical; too low and collapse occurs. Start at 0.01. If utilisation is uneven after 10% of training, increase to 0.02. If experts are identical (measuring by cosine similarity of weights), reduce to 0.005.
Use expert parallelism for models with many experts. If you have 8 experts and 8 GPUs, assign one expert per GPU. This minimises cross-device communication. For models with more experts than GPUs, use expert groups and profile the all-to-all communication overhead carefully.
Prefer k=2 over k=1 for better training stability. k=1 routing (Switch Transformer style) is computationally cheaper but prone to instability. For most use cases, k=2 provides a better quality-stability balance and is the choice made by Mixtral, among other open-weights MoE models. Fine-grained designs such as DeepSeek-V2 go further and activate a larger number of smaller experts.
Use MoE when parameter count is the primary bottleneck. MoE is the right choice when you need to store more knowledge than a dense model can hold within your compute budget. If you need a small, fast, cheap model for latency-sensitive production, a well-distilled dense model is usually preferable. MoE excels at frontier-scale pretraining and large-scale inference services where throughput (tokens/second across many requests) matters more than per-request latency.
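
As a minimal sketch of the monitoring advice above, the snippet below computes per-expert utilisation and a Switch Transformer style load balancing loss (the number of experts times the sum, over experts, of the fraction of tokens dispatched to the expert multiplied by its mean router probability). It assumes k=1 routing for simplicity; the function names and the toy router outputs are hypothetical.

```python
def utilisation(assignments: list[int], num_experts: int) -> list[float]:
    """Fraction of routed tokens that each expert received."""
    counts = [0] * num_experts
    for expert_id in assignments:
        counts[expert_id] += 1
    return [c / len(assignments) for c in counts]


def load_balancing_loss(router_probs: list[list[float]], assignments: list[int]) -> float:
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    f_i is the fraction of tokens dispatched to expert i and P_i is the mean
    router probability for expert i. Perfectly uniform routing gives 1.0;
    larger values indicate imbalance.
    """
    num_experts = len(router_probs[0])
    f = utilisation(assignments, num_experts)
    p = [sum(row[i] for row in router_probs) / len(router_probs) for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Toy case 1: four tokens spread evenly over four experts.
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3])

# Toy case 2: the router has collapsed onto expert 0.
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0])

print(f"balanced loss:  {balanced:.2f}")
print(f"collapsed loss: {collapsed:.2f}")
print("utilisation when collapsed:", utilisation([0, 0, 0, 0], 4))
```

Logging the utilisation list at regular intervals, and alerting when its maximum crosses a threshold you choose, is usually enough to catch collapse early.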

Frequently Asked Questions

Does GPT-4 really use MoE?

OpenAI has never officially confirmed GPT-4's architecture. The widespread belief that it uses MoE originates from public remarks by George Hotz in mid-2023, who claimed GPT-4 consists of 8 MoE experts each around 220B parameters, with 2 activated per token. This figure has been cited and repeated enough to become widely accepted, but it remains unverified by OpenAI. What we can say with confidence is that the compute economics and performance profile of GPT-4 are consistent with a large MoE architecture, and that OpenAI had access to all the prior MoE research that would make this choice natural.

Why is Mixtral 8x7B not 56 billion parameters effectively?

The "8x7B" naming is somewhat misleading. Mixtral 8x7B has 8 experts, each with roughly the FFN capacity of a 7B model. But the model is not simply 8 independent 7B models stacked together. The attention layers are shared across all experts, and there is only one set of attention weights per transformer layer, not 8. The total parameter count is approximately 46.7B because the non-MoE components (embeddings, attention, layer norms) are counted only once. Of those 46.7B parameters, roughly 12.9B are active for any given token (the shared components plus 2 of the 8 expert FFN blocks).
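
The 46.7B and 12.9B figures can be reproduced with simple arithmetic. The sketch below assumes the architecture hyperparameters reported for Mixtral 8x7B (hidden size 4096, 32 layers, expert FFN hidden size 14336, 32 query heads and 8 key-value heads of dimension 128, vocabulary of 32,000), three weight matrices per expert FFN, no biases, and separate input and output embeddings. Treat these as assumptions if your checkpoint differs.

```python
# Parameter count for a Mixtral-8x7B-like model (assumed hyperparameters).
D_MODEL = 4096
N_LAYERS = 32
FFN_HIDDEN = 14336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

# One expert FFN: gate, up, and down projections.
expert_ffn = 3 * D_MODEL * FFN_HIDDEN

# Attention: query and output projections are full width; key and value
# projections are narrower because key-value heads are shared (grouped-query).
attention = 2 * D_MODEL * (N_HEADS * HEAD_DIM) + 2 * D_MODEL * (N_KV_HEADS * HEAD_DIM)

router = D_MODEL * N_EXPERTS   # one linear gate per layer
norms = 2 * D_MODEL            # two normalisation layers per block

shared_per_layer = attention + router + norms
embeddings = 2 * VOCAB * D_MODEL + D_MODEL  # input + output embeddings + final norm

total = N_LAYERS * (shared_per_layer + N_EXPERTS * expert_ffn) + embeddings
active = N_LAYERS * (shared_per_layer + TOP_K * expert_ffn) + embeddings

print(f"total parameters:  {total / 1e9:.1f}B")   # about 46.7B
print(f"active per token:  {active / 1e9:.1f}B")  # about 12.9B
```

Almost all of the gap between the two numbers comes from the six expert FFN blocks per layer that sit idle for any given token.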

Why can't I just run more experts for better quality?

Adding more experts helps only up to a point. First, more experts means more total parameters, which increases memory requirements even if active compute stays the same. Second, with fixed k, more experts means each expert sees fewer tokens per training step, which slows expert learning. Third, more experts require larger all-to-all communication overhead in distributed settings. Fourth, load balancing becomes harder with more experts, as rare experts may be poorly trained. DeepSeek-V2 showed that fine-grained MoE with many small experts (160 experts, k=6) can outperform coarse-grained MoE, but this comes with significant engineering complexity.

Is MoE better than dense for all tasks?

No. MoE is better when you need to maximise quality for a given training compute budget and can tolerate higher memory requirements. Dense models are preferable when you need the lowest possible inference latency (no routing overhead, no all-to-all communication), when you have very limited serving memory, when you need to fine-tune the model frequently on small datasets, or when you are operating at a scale where the memory bandwidth cost of a sparse model outweighs the compute savings. Many production deployments serve distilled dense models that were trained using larger MoE teacher models, combining the best of both approaches.

What is expert specialisation, and can I observe it?

Expert specialisation refers to the phenomenon where different experts in a trained MoE model develop preferences for different types of tokens. Studies of trained models have found that some experts preferentially handle punctuation and formatting, others handle numeric tokens, others activate for specific languages, and others handle domain-specific vocabulary. You can observe this by tracking, for each expert, which tokens most frequently route to it and analysing their linguistic properties. The degree of specialisation varies: models with stronger auxiliary loss (ensuring balance) tend to show weaker specialisation, while models with more lenient load balancing often develop more distinct expert personas.
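
One simple way to do that tracking is to log, for each token, which experts it was routed to and then count tokens per expert. The sketch below assumes you can already export such a log from your model; the `routing_log` contents and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def top_tokens_per_expert(routing_log, top_n=3):
    """Return the most frequently routed tokens for each expert.

    `routing_log` is an iterable of (token_text, expert_ids) pairs, where
    expert_ids lists the experts chosen for that token (length k).
    """
    per_expert = defaultdict(Counter)
    for token, expert_ids in routing_log:
        for expert_id in expert_ids:
            per_expert[expert_id][token] += 1
    return {
        expert_id: [tok for tok, _ in counts.most_common(top_n)]
        for expert_id, counts in sorted(per_expert.items())
    }


# Hypothetical log from a k=2 model with four experts.
routing_log = [
    (",", (0, 2)),
    (".", (0, 3)),
    (",", (0, 2)),
    ("42", (1, 2)),
    ("7", (1, 3)),
    ("42", (1, 3)),
    ("def", (2, 3)),
]

favourites = top_tokens_per_expert(routing_log, top_n=1)
for expert_id, tokens in favourites.items():
    print(f"expert {expert_id}: {tokens}")
```

In this toy log expert 0 leans toward punctuation and expert 1 toward numbers. On a real model you would run this over a large, varied corpus before drawing conclusions.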

References

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). "Adaptive mixtures of local experts." Neural Computation, 3(1), 79-87. The original MoE paper.
Shazeer, N., et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. First application of MoE to large-scale NLP with LSTMs.
Lepikhin, D., et al. (2021). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." ICLR 2021. Scaled MoE to 600B parameters for multilingual translation.
Fedus, W., Zoph, B., and Shazeer, N. (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022. k=1 routing; demonstrated 1.6T parameter MoE transformers.
Du, N., et al. (2022). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022. 1.2T parameter MoE model matching GPT-3 at 1/3 the training energy.
Zoph, B., et al. (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906. Comprehensive study of MoE training stability and fine-tuning.
Mistral AI (2024). "Mixtral of Experts." arXiv:2401.04088. Technical report for Mixtral 8x7B; first major open-weights MoE LLM.
DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434. Fine-grained MoE with 160 experts; introduces Multi-head Latent Attention.
DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. 671B MoE model with auxiliary-loss-free load balancing and multi-token prediction.
Puigcerver, J., et al. (2023). "From Sparse to Soft Mixtures of Experts." arXiv:2308.00951. Proposes Soft MoE as a fully differentiable alternative to hard top-k routing.

Key Takeaways

MoE decouples model capacity from per-token compute. You can have a model with 46B total parameters that activates only 13B per token. Total parameter count and active parameter count are two separate, independently important metrics.
The router is the critical component. A well-trained router that achieves balanced expert utilisation is what separates a good MoE model from one that collapses to using a single expert. The auxiliary load balancing loss is not optional.
Top-2 routing is a common practical sweet spot. k=1 is cheaper but less stable; larger k adds compute, with diminishing returns for coarse-grained experts. Mixtral and Grok-1 use k=2, while fine-grained designs such as DeepSeek-V2 activate more, smaller experts per token.
Memory is the price you pay for compute efficiency. MoE models require loading all expert weights into memory even though only a fraction are active per token. This trade-off is worth it at large scale but may not be at smaller scales.
Expert specialisation is emergent, not programmed. You do not explicitly assign domains to experts. The model learns its own division of labour through gradient descent. This specialisation is real and measurable, but it is fragile and can be destroyed by overly aggressive load balancing.
MoE is now the dominant frontier architecture. GPT-4, Grok-1, Mixtral, and DeepSeek-V3 all use or are credibly reported to use MoE. Understanding this architecture is no longer optional for practitioners working at the frontier of language model engineering.

Diffusion Models Explained: The Math-Free Guide to How Stable Diffusion and DALL-E Work

2026-06-05T02:00:00+00:00

Introduction

In the span of just a few years, AI-generated images went from a niche curiosity to a technology that genuinely fools the human eye. Type a sentence into a text box and seconds later you have a photorealistic oil painting, a surrealist fantasy landscape, or a product photograph that never existed. The technology making this possible, in almost every major system from Stable Diffusion to DALL-E 2 to Midjourney, is called a diffusion model.

The name sounds technical, and the original papers are dense with probability theory. But the underlying idea is one of the most intuitive in all of machine learning. This guide strips away the math and gives you a clear mental model of what is actually happening when you press "generate." You will understand why these systems produce such high-quality images, why they are slow, why your prompt wording matters so much, and why this approach beat a decade of competing research.

No prior knowledge of neural networks is required, though familiarity with the general idea of machine learning (a model learns from examples) will help.

Problem Statement: What Came Before, and Why It Was Hard

Before diffusion models dominated the field, two approaches shared the spotlight: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Both had real strengths, and both had frustrating, sometimes fundamental weaknesses.

GANs work through competition. A generator network tries to produce convincing fake images, and a discriminator network tries to catch them. They train together, each improving in response to the other, like a forger and an art detective locked in an arms race. When it works, the results are spectacular. GAN-generated faces reached photorealistic quality years before diffusion models existed. But training a GAN is notoriously fragile. The generator and discriminator can fall into unstable feedback loops. One of the most common failure modes is called mode collapse, where the generator learns to produce only a narrow range of outputs that reliably fool the discriminator, ignoring the full diversity of the real data. Getting a GAN to produce a wide variety of high-quality images across many categories, rather than a narrow slice of them, was a persistent unsolved problem.

VAEs take a different approach. They compress images into a compact numerical summary (a latent vector) and then learn to reconstruct them. Because they explicitly model uncertainty in that compression, you can sample from the learned space to generate new images. VAEs are stable to train and produce diverse outputs, but the images tend to be blurry. The compression step throws away detail, and the reconstruction step cannot recover it perfectly.

Autoregressive models, the kind that power text generation, were also applied to images by generating them one pixel (or patch) at a time. This produced high-quality results but was extremely slow, and scaling to high resolutions was computationally punishing.

In short: the field could get sharpness without diversity (GANs at their best), diversity without sharpness (VAEs), or quality at the cost of speed (autoregressive). Diffusion models, introduced in their modern form by Ho et al. in 2020, found a way to combine sharpness and diversity with stable training by reframing the problem entirely, although, as the limitations section explains, they pay for it in generation speed.

Core Concepts and Terminology

Term	Plain English Definition
Diffusion process	The overall framework: gradually destroy an image by adding noise, then train a model to reverse that destruction.
Forward process	The noise-adding direction. A real training image is progressively corrupted, step by step, until it is indistinguishable from random noise. The model does not learn this; it is a fixed mathematical procedure.
Reverse process	The direction the model learns. Starting from pure noise, the model predicts and removes a small amount of noise at each step, gradually revealing a coherent image.
Noise schedule	A plan that controls how much noise is added at each step of the forward process, typically starting with very little and ramping up until the image is completely destroyed.
Denoising	The act of predicting and subtracting noise from a partially noisy image. This is what the neural network learns to do.
U-Net	The architecture most commonly used for the denoising network. It has an encoder that compresses the noisy image and a decoder that rebuilds it, with shortcut connections that preserve fine-grained detail.
Latent diffusion	A faster variant where the diffusion process happens in a compressed latent space rather than on full-resolution pixels. Stable Diffusion uses this approach, which is why it is more efficient than operating directly on pixels.
CLIP	A model from OpenAI trained to understand the relationship between images and text. In text-to-image systems, CLIP (or a similar encoder) converts your text prompt into a numerical representation that guides the denoising network.
Conditioning	The mechanism by which external information, such as a text prompt, an edge map, or a reference image, is fed into the denoising network to steer what kind of image gets generated.
Classifier-free guidance	A technique that strengthens the influence of your prompt on the generated image. The model runs two denoising predictions at each step, one with the prompt and one without, and amplifies the difference. Higher guidance scale means stronger prompt adherence, but too high and quality suffers.

How It Works: The Four Phases

Understanding diffusion models means understanding four distinct phases: forward corruption during training data preparation, the training objective itself, inference (generating new images), and text conditioning. Each phase builds on the last.

Phase 1: The Forward Process (Destroying Images to Build a Teacher)

Imagine you have a beautiful photograph of a mountain at dawn. Now imagine someone sprinkles a light dusting of television static over it. You can still make out the mountain, but there is noise. They add more static. More still. After hundreds of rounds, the original photograph is completely buried. You are left with a grey, featureless haze.

This is the forward process. For every image in the training dataset, the system creates a long sequence of progressively noisier versions of that image, from the original all the way to pure random noise. The crucial insight is that every step of this destruction is precisely known. At step 50 out of 1,000, you know exactly how much noise was added and exactly what the partially noisy image looks like. This is not learned; it is a fixed recipe.

This gives the training process an enormous, free supply of labelled examples: for every image at every noise level, we know exactly what noise was added.
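
For readers who do want to peek under the hood, here is a minimal sketch of that fixed recipe. It assumes a linear noise schedule running from 0.0001 to 0.02 over 1,000 steps (the schedule used in the original DDPM paper) and a tiny four-value "image"; the function names are hypothetical. A useful property is that you can jump straight to any noise level in one shot, without simulating every intermediate step.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise amounts, ramping up linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): how much original signal survives."""
    out, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        out.append(product)
    return out


def add_noise(x0, t, alpha_bars, rng):
    """Return the image at noise level t (1-indexed) and the noise used."""
    a = alpha_bars[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


betas = linear_betas()
alpha_bars = cumulative_alphas(betas)

image = [0.8, -0.3, 0.1, 0.5]  # toy "image" with four pixel values
rng = random.Random(0)
for t in (1, 50, 500, 1000):
    xt, noise = add_noise(image, t, alpha_bars, rng)
    print(f"step {t:4d}: surviving signal scale = {math.sqrt(alpha_bars[t - 1]):.4f}")
```

The returned `noise` list is exactly the label the network is later trained to predict, which is where the free supply of training examples comes from.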

Phase 2: Training (Teaching the Model to See Through Noise)

The neural network, typically a U-Net, is handed a noisy image and told what noise level it is at. Its job is to predict the noise that was added. If it can do that accurately, it can subtract the noise and recover a cleaner version.

Think of it like an art restoration expert who has seen thousands of damaged paintings. They have learned the patterns of how deterioration works, what canvas looks like under grime, what brush strokes suggest beneath a layer of varnish. Given a damaged painting and told roughly how degraded it is, they can make educated guesses about what to clean away.

Because we have millions of training images and hundreds of noise levels per image, the model sees hundreds of millions of training examples and builds a deep, rich understanding of what makes images look coherent. Importantly, there is no adversary, no discriminator, and no fragile balancing act. The loss function is straightforward: how close was the model's noise prediction to the actual noise? This stability is one reason diffusion models train so reliably.
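
The loss described above is short enough to write out. In this sketch the "networks" are hypothetical stand-ins rather than a real U-Net: one always predicts zero noise, and one cheats by knowing the clean image, which shows what a perfect prediction would score. The value of `alpha_bar` (how much of the original signal survives at the chosen noise level) is an arbitrary example.

```python
import math
import random


def mse(prediction, target):
    """Mean squared error between predicted and actual noise."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def make_training_example(x0, alpha_bar, rng):
    """Noise a clean image and return (noisy image, noise that was added)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n for x, n in zip(x0, noise)]
    return xt, noise


def zero_model(xt, alpha_bar):
    """Placeholder network that has learned nothing: predicts zero noise."""
    return [0.0] * len(xt)


def make_oracle_model(x0):
    """Placeholder network that cheats by knowing the clean image."""
    def model(xt, alpha_bar):
        return [(v - math.sqrt(alpha_bar) * x) / math.sqrt(1.0 - alpha_bar) for v, x in zip(xt, x0)]
    return model


clean = [0.8, -0.3, 0.1, 0.5, -0.7, 0.2, 0.0, 0.4]  # toy 8-pixel image
alpha_bar = 0.5                                      # example noise level
xt, true_noise = make_training_example(clean, alpha_bar, random.Random(0))

zero_loss = mse(zero_model(xt, alpha_bar), true_noise)
oracle_loss = mse(make_oracle_model(clean)(xt, alpha_bar), true_noise)

print(f"untrained (zero) model loss: {zero_loss:.4f}")
print(f"perfect (oracle) model loss: {oracle_loss:.2e}")
```

Training simply nudges the real network's weights so that its loss moves from the first number toward the second, averaged over many images and noise levels.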

Phase 3: Inference (Sculpting from Static)

Once trained, the model can generate new images from scratch. You start with an image of pure random noise, the equivalent of a block of unmarked marble. You ask the model: "if this were a noisy image at step 1,000, what noise would you predict?" The model makes a prediction, you subtract a small amount of that predicted noise, and you have a slightly less noisy image. Repeat this for all 1,000 steps and, like a sculptor progressively revealing a form, a coherent image emerges.

At first the image will just look like a blurry blob with vague structure. By the midpoint you might see rough shapes and colour zones. In the final steps, fine details snap into focus: textures, edges, facial features. The process is like developing a photograph in a darkroom, where the image gradually materialises out of the chemical solution.

This sequential nature is why diffusion models are slow. There is no shortcut to skip from noise to finished image in one step (though recent research has reduced step counts dramatically, from 1,000 to as few as 4 or 8 with certain samplers).
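
The generation loop can be sketched in a few lines. This is a minimal illustration of a DDPM-style sampler under stated assumptions: a linear noise schedule, a four-value stand-in "image", and a hypothetical `predict_noise` function that cheats by knowing the target so that the example runs without a trained U-Net.

```python
import math
import random

NUM_STEPS = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

TARGET = [0.8, -0.3, 0.1, 0.5]  # hypothetical image the "model" has learned


def predict_noise(xt, t):
    """Stand-in for the trained U-Net (assumption: it knows TARGET exactly)."""
    ab = alpha_bars[t]
    return [(x - math.sqrt(ab) * x0) / math.sqrt(1.0 - ab) for x, x0 in zip(xt, TARGET)]


def sample(rng):
    """Start from pure noise and denoise one step at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for t in reversed(range(NUM_STEPS)):  # t = 999, 998, ..., 0
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Remove a small, scaled portion of the predicted noise.
        mean = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


image = sample(random.Random(0))
print([round(v, 4) for v in image])
```

Note that each pass through the loop needs the output of the previous one, which is exactly the sequential dependency that makes generation slow.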

Phase 4: Text Conditioning (How Your Prompt Steers the Process)

A diffusion model trained only on images without any guidance will generate images at random. To steer it toward a specific subject, you need conditioning.

In systems like Stable Diffusion and DALL-E 2, your text prompt is passed through a text encoder, most often one trained with CLIP, which converts the words into a rich numerical representation. This representation is fed into the U-Net at every denoising step, nudging the predicted noise in a direction that makes the emerging image more consistent with the prompt.

Think of it as the sculptor having a reference photograph on the table while they work. Each time they pick up the chisel, they glance at the reference and make sure the form they are revealing is moving toward the intended subject. The guidance scale controls how tightly they follow that reference. At a low guidance scale, the sculptor feels free to improvise. At a high guidance scale, they stick closely to the reference, sometimes at the cost of a natural, flowing finish.
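
In code, the guidance scale is just a multiplier on the difference between two noise predictions, one made with the prompt and one without. The sketch below uses made-up three-value predictions; the function name is hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend the two predictions, amplifying the direction the prompt adds."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Made-up noise predictions for a three-value "image".
eps_without_prompt = [0.10, -0.20, 0.30]
eps_with_prompt = [0.20, -0.10, 0.10]

for scale in (0.0, 1.0, 7.5):
    guided = classifier_free_guidance(eps_without_prompt, eps_with_prompt, scale)
    print(f"scale {scale}: {[round(v, 3) for v in guided]}")
```

A scale of 0 ignores the prompt, a scale of 1 uses the prompted prediction unchanged, and anything above 1 pushes past it, which is the sculptor following the reference more and more rigidly.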

Practical Example: "A Red Fox Sitting in a Snowy Forest at Sunset"

Let us walk through exactly what happens, step by step, when you type this prompt into a system like Stable Diffusion.

Prompt encoding. Your text is tokenised and passed through the CLIP text encoder. The output is a sequence of vectors, each capturing the meaning and relationships between the words: red, fox, sitting, snowy, forest, sunset, and the relationships between them.
Sampling the starting noise. The system draws a random sample of pure Gaussian noise. This is your blank canvas. Every pixel is an independent random value. There is no image here yet.
First denoising step. The U-Net receives the noisy canvas, the CLIP encoding of your prompt, and the current timestep (1,000 out of 1,000). It predicts the noise component. Because of the prompt conditioning, the predicted noise is not neutral; it is biased toward removing noise in ways that would move the remaining signal toward a fox in a snowy setting.
Gradual refinement. Over many steps (say, 50 steps with a modern sampler), the same process repeats. By step 15 or so, you might see an orange-tinged blob against a pale background. By step 30, the shape of an animal begins to distinguish itself. By step 45, fur texture, snow detail, and the warm glow of a low sun start to appear.
Latent to pixel space. In Stable Diffusion specifically, all of the above happens in a compressed latent space (roughly 64x64 for a 512x512 output). Once denoising is complete, a separate decoder network (the VAE decoder) expands this compressed representation back to full-resolution pixels, recovering fine texture and colour detail.
Final image. You see a 512x512 (or higher) image of a fox in a winter forest, lit by sunset light, that did not exist before you pressed generate.

The entire process, from random noise to rendered image, typically takes one to ten seconds on modern hardware, depending on step count and resolution.
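
The saving from working in latent space is easy to quantify. The sketch below assumes the layout commonly used by Stable Diffusion v1 models, where the autoencoder shrinks each side by a factor of 8 and the latent has 4 channels; other model families use different values.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor (channels, height, width) for an image size."""
    return (latent_channels, height // downsample, width // downsample)


pixel_values = 512 * 512 * 3             # RGB image at 512x512
channels, lat_h, lat_w = latent_shape(512, 512)
latent_values = channels * lat_h * lat_w

ratio = pixel_values / latent_values

print(f"latent shape: {channels} x {lat_h} x {lat_w}")
print(f"values per denoising step: {latent_values:,} instead of {pixel_values:,}")
print(f"reduction: {ratio:.0f}x")
```

Every denoising step therefore touches 48 times fewer values than it would in pixel space, under these assumptions.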

Advantages: Why Diffusion Models Beat GANs

For years, GANs held the crown for image generation quality. Diffusion models displaced them for several interconnected reasons.

Training Stability

GANs require careful balancing of two networks that are in direct competition. If the discriminator gets too strong too fast, the generator receives no useful gradient signal and stops learning. If the generator improves too quickly, the discriminator collapses. Practitioners spend enormous effort tuning learning rates, regularisation techniques, and architectural choices just to keep training from diverging.

Diffusion models have none of this. The training objective, predict the noise that was added, is a straightforward supervised learning problem. There is a single network, a single loss, and gradients flow cleanly. Training a diffusion model is about as stable as training a standard image classifier.

Mode Coverage and Diversity

Because GANs optimise for fooling a discriminator, they are prone to finding and exploiting gaps in the discriminator's knowledge, rather than learning a complete model of the data distribution. Mode collapse, where the generator produces only a subset of the possible outputs, is a persistent problem.

Diffusion models learn to model the full data distribution by training on all noise levels simultaneously. They must learn what coherent images look like across all scales, from broad composition to fine texture. The result is dramatically better diversity: ask for "a dog" and you might get a poodle, a labrador, a terrier, a cartoon dog, or a painterly dog, not the same GAN-optimal dog face every time.

Image Quality and Resolution

When combined with latent diffusion (operating in compressed space) and large-scale training, diffusion models produce images that surpass the sharpest GANs on standard benchmarks and, perhaps more importantly, hold up to close human inspection. The iterative refinement process allows the model to add detail progressively, without having to commit to fine structure before the broader composition is established.

Controllability

Because conditioning is built into the architecture at a fundamental level, diffusion models accept a rich variety of guidance signals: text prompts, reference images, depth maps, edge maps, pose skeletons. ControlNet extensions, for example, allow you to specify the exact pose of a figure while letting the model freely generate the appearance. This kind of fine-grained control was significantly harder to achieve with GANs.

Limitations and Trade-offs

Diffusion models are not without significant costs and weaknesses.

Slow Inference

Generating one image requires running the neural network many times, once per denoising step: hundreds of times or more with the original sampler. Compare this to a GAN, which makes a single forward pass. Even with modern fast samplers (DDIM, DPM-Solver, LCM) that reduce step counts from 1,000 to 20 or fewer, diffusion models are still fundamentally sequential. Each step depends on the result of the previous one, so you cannot parallelise across the steps of a single image.

Compute Cost

Training a large diffusion model requires enormous computational resources. Stable Diffusion's training run reportedly cost hundreds of thousands of dollars in GPU time. Running inference, while cheap per image on consumer hardware, becomes expensive when generating thousands of images for commercial applications.

Prompt Sensitivity

Small changes in wording can produce dramatically different outputs. Adding or removing a single word, reordering phrases, or using synonyms can shift the image significantly. This makes diffusion models powerful but somewhat unpredictable for users who have not developed intuition for prompt engineering. The relationship between prompt and output is not always transparent or consistent.

Memorisation Concerns

Research has shown that diffusion models can, in certain conditions, reproduce near-exact copies of training images, particularly for images that appeared many times in the training set. This raises intellectual property and privacy concerns, especially for models trained on internet-scraped data without explicit consent from image creators. The legal and ethical landscape around this remains unsettled.

Compositionality Failures

Diffusion models sometimes struggle with prompts that require precise spatial relationships or counting. "Three red balls on a blue shelf with a green lamp to the left" may produce something that captures the gist but misplaces elements. Compositional reasoning, which comes naturally to language models, does not translate perfectly to the image generation process.

Common Mistakes

Misunderstanding What "Steps" Means

Many new users assume that more steps always means better quality, without limit. In practice, returns diminish quickly. Going from 10 to 30 steps makes a large visual difference. Going from 50 to 200 steps in most samplers makes almost no perceptible difference and just wastes time. The right step count depends on the sampler being used: DDIM and DPM-Solver converge faster than the original DDPM sampler.

Over-Prompting and Under-Prompting

Over-prompting means stuffing your prompt with every adjective and style keyword you can think of, hoping more instructions equals better results. In practice, overly long prompts can cause the model to pay uneven attention to different parts, sometimes ignoring important elements entirely. Under-prompting means giving so little information that the model defaults to its most average interpretation. Effective prompts are specific where it matters and concise where detail is not needed.

Treating Guidance Scale as "Quality"

Guidance scale is often described as a "quality" or "prompt adherence" slider, which leads users to push it to extreme values. Very high guidance scale (above 15 or 20, depending on the model) tends to produce over-saturated, artificial-looking images with distorted details, because the model is being pushed too hard away from naturalness and toward prompt matching. A guidance scale between 7 and 12 is a reasonable starting range for most models.

Using the Wrong Model for the Task

Different models have different strengths. A model fine-tuned for photorealism will produce poor anime-style images. A model fine-tuned for concept art may not produce accurate text overlays. Using the base Stable Diffusion model for a task that a specialised fine-tune handles much better is a common mistake when starting out.

Ignoring the Negative Prompt

The negative prompt field in most UIs tells the model what to avoid generating. Ignoring it means accepting whatever artifacts, watermarks, or compositional issues the model defaults to. Using a basic negative prompt like "blurry, low quality, deformed hands, watermark" can substantially improve output quality with no extra effort.

Best Practices

Choosing Step Count

Start with 20 to 30 steps for rapid iteration when exploring prompts. Increase to 40 to 50 for final outputs. With LCM (Latent Consistency Models) or Turbo variants, 4 to 8 steps can produce surprisingly strong results. Avoid spending compute budget on step counts above 50 unless you are using a specific sampler known to benefit from them.
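
Choosing a step count amounts to choosing which of the model's training noise levels to visit. The sketch below shows one simple, evenly spaced convention, assuming a model trained with 1,000 noise levels; real samplers use their own spacing rules and update formulas, so treat the function as illustrative only.

```python
def sampling_timesteps(num_train_steps, num_sampling_steps):
    """Evenly spaced noise levels to visit, from noisiest to cleanest."""
    stride = num_train_steps // num_sampling_steps
    return [i * stride for i in reversed(range(num_sampling_steps))]


draft = sampling_timesteps(1000, 20)   # rapid iteration
final = sampling_timesteps(1000, 50)   # final output
turbo = sampling_timesteps(1000, 4)    # few-step distilled models

print("20 steps:", draft[:3], "...", draft[-1])
print("50 steps:", final[:3], "...", final[-1])
print(" 4 steps:", turbo)
```

Fewer steps means bigger jumps between noise levels, which is why very low step counts generally need models or samplers designed for them.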

Setting Guidance Scale

For photorealistic models, try guidance scale 7 to 9 as a default. For artistic or stylised models, 5 to 7 often feels more natural. If your image looks plastic, oversaturated, or has strange edge artifacts, lower the guidance scale before trying anything else.

Model Selection: Stable Diffusion vs DALL-E vs Midjourney

System	Best For	Key Strength	Key Weakness
Stable Diffusion (open-source)	Custom workflows, fine-tuning, local use	Fully open, extensible, large community ecosystem of fine-tunes	Requires technical setup; quality varies widely by model version
DALL-E 3 (OpenAI)	Prompt-accurate generation, text in images	Best prompt-following of any major system; handles complex instructions well	Closed API only; less stylistic flexibility
Midjourney	Aesthetic, editorial, and artistic images	Consistently beautiful default outputs; strong stylistic coherence	Less controllable; closed; originally available only through Discord
Adobe Firefly	Commercial use with IP safety	Trained on licensed content; safe for commercial projects	More conservative outputs; less cutting-edge quality

Using ControlNet for Compositional Control

If you need control over the layout of an image rather than just the content, ControlNet extensions for Stable Diffusion let you provide a skeleton, depth map, or edge map that the model must respect. This is the most reliable way to specify exact spatial arrangement without fighting the model's own compositional tendencies.

Seeding for Reproducibility

Every image generation starts from a random noise sample. Setting a fixed seed lets you reproduce a result exactly, or vary just one element (the prompt, the guidance scale) while keeping everything else constant. This is invaluable for iterative refinement.
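The idea can be illustrated without any image model. The sketch below uses Python's standard `random` module as a stand-in for the noise sampler; the helper `initial_noise` is a hypothetical name, and real pipelines draw a much larger tensor, but the principle is the same.

```python
import random


def initial_noise(seed, size=4):
    """Draw a small Gaussian noise vector from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = initial_noise(seed=42)
same_b = initial_noise(seed=42)
different = initial_noise(seed=43)

# Same seed: identical starting noise, so the rest of the pipeline
# sees exactly the same input.
print(same_a == same_b)

# Different seed: a different starting point, hence a different image.
print(same_a == different)
```

When comparing two prompts or two guidance scales, keeping the seed fixed means any difference in the output comes from the setting you changed rather than from a new noise sample.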

Comparison: Diffusion vs GAN vs VAE vs Autoregressive

Property	Diffusion Model	GAN	VAE	Autoregressive (e.g. DALL-E 1)
Image Quality	Very high; the best outputs are hard to distinguish from photographs	High; best GANs are photorealistic	Moderate; tends toward blurriness	High for its era; can be sharp
Diversity	Very high; covers the full data distribution well	Low to moderate; mode collapse is common	High; samples from a well-defined latent space	High; sequential generation naturally explores diversity
Training Stability	High; single supervised objective, no adversarial games	Low; adversarial balance is fragile	High; straightforward reconstruction loss	High; standard cross-entropy training
Inference Speed	Slow; tens to hundreds of sequential neural network calls	Fast; single forward pass	Fast; single forward pass	Very slow; generates one token at a time
Controllability	Very high; rich conditioning (text, image, depth, pose)	Moderate; conditioning possible but complex	Moderate; latent space interpolation works well	Moderate; token-level control of attributes
Notable Systems	Stable Diffusion, DALL-E 2/3, Midjourney, Imagen	StyleGAN, BigGAN, CycleGAN	VQ-VAE, early image synthesis experiments	DALL-E 1, ImageGPT, PixelCNN

Frequently Asked Questions

Is Midjourney a diffusion model?

Midjourney has not published technical details about its architecture, so we cannot say with certainty. However, several observable behaviours are consistent with a diffusion-based approach: the iterative refinement visible while a generation is in progress, the way outputs respond to prompt guidance, and the general character of the results. The overwhelming majority of production text-to-image systems built after 2022 use diffusion as their core mechanism, and Midjourney almost certainly does too, possibly with proprietary modifications.

Why do more steps improve quality up to a point?

Each denoising step is an approximation. The model predicts the noise at the current noise level, removes a portion of it, and hands off to the next step. With very few steps, each approximation is large and can accumulate errors, leading to artifacts and incoherence. With more steps, each individual approximation is smaller and more accurate. Beyond a certain threshold, the approximations are already accurate enough that adding more steps does not meaningfully reduce error, which is why quality plateaus. The exact threshold depends on the sampler: some samplers are mathematically designed to converge faster and require fewer steps.
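The plateau can be seen in a much simpler setting. The sketch below is an analogy, not a diffusion sampler: it integrates the toy equation dx/dt = -x with fixed-size Euler steps and compares the result with the exact answer. The function name `euler_decay` and the choice of equation are assumptions made for illustration.

```python
import math


def euler_decay(num_steps):
    """Integrate dx/dt = -x from t=0 to t=1 using fixed-size Euler steps."""
    x = 1.0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        x += dt * (-x)  # one small approximate update per step
    return x


exact = math.exp(-1.0)  # true value of x at t=1

for n in (5, 10, 20, 50, 100):
    error = abs(euler_decay(n) - exact)
    print(f"{n:>3} steps: error = {error:.4f}")
```

Going from 5 to 10 steps removes far more error than going from 50 to 100. Diffusion samplers show the same pattern of diminishing returns, and better-designed samplers reach the plateau in fewer steps.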

What is LoRA for image models?

LoRA stands for Low-Rank Adaptation. It is a fine-tuning technique that allows you to teach a pre-trained model new concepts (a specific person's face, a particular art style, a custom object) without retraining the entire model. Instead of updating all of a model's billions of parameters, LoRA adds a small set of new parameters that modify specific layers. The resulting LoRA file is tiny (often just a few megabytes) compared to the full model. You can download community-created LoRAs to add a character, a painting style, or a photography aesthetic to an otherwise general-purpose base model.
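A rough sketch of why the files are so small, assuming the usual LoRA formulation in which a frozen weight matrix W is adjusted by the product of two thin matrices B and A. The layer size (768 by 768) and rank (8) are illustrative assumptions, and the helper names are hypothetical.

```python
def full_param_count(d_out, d_in):
    """Parameters in one dense weight matrix of shape d_out x d_in."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Parameters in the LoRA pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A); W itself is never modified."""
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


full = full_param_count(768, 768)
lora = lora_param_count(768, 768, rank=8)
print(full, lora, f"{100 * lora / full:.1f}% of the original layer")

# Tiny worked example: a 2x2 weight adjusted by a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
print(lora_effective_weight(W, B, A))
```

Because only B and A are trained and saved, the stored file scales with the rank rather than with the size of the base model.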

Can diffusion models generate video?

Yes. Extending diffusion models to video is an active and fast-moving research area. Systems such as Sora (OpenAI) and Stable Video Diffusion treat a video as a sequence of frames and apply diffusion across both the spatial (pixel) and temporal (frame) dimensions. The core mechanism, learning to reverse a noising process, applies directly. The main challenge is the vastly increased computational cost: generating even a few seconds of video can require orders of magnitude more compute than a single image.

Are the images generated by diffusion models copyrightable?

This is an active legal question with no definitive global answer as of mid-2026. In the United States, the Copyright Office has held that purely AI-generated content without meaningful human authorship is not copyrightable, but that images where a human made substantial creative choices in the process may be eligible for some protection. The situation varies by jurisdiction. In addition, lawsuits are ongoing in multiple countries over whether training on copyrighted images without consent constitutes infringement. Anyone using AI-generated images commercially should seek legal advice specific to their jurisdiction and intended use.

References

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems (NeurIPS) 33. The original paper establishing the modern DDPM framework.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). The paper introducing latent diffusion, the foundation of Stable Diffusion.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125. The DALL-E 2 paper describing the use of CLIP embeddings for text-conditioned diffusion.
Song, J., Meng, C., and Ermon, S. (2020). Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502. Introduced DDIM, a faster sampler that reduced required inference steps from around a thousand to a few dozen.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Gontijo-Lopes, R., Karagol Ayan, B., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS. The Imagen paper from Google Brain, demonstrating the importance of large language models for text understanding in image generation.
Ho, J., and Salimans, T. (2022). Classifier-Free Diffusion Guidance. arXiv preprint arXiv:2207.12598. Introduced the guidance technique that most production systems use to balance prompt adherence and image quality.

Key Takeaways

Diffusion models generate images by learning to reverse a carefully structured noise-adding process. The core loop is simple: destroy images with noise during training, learn to undo that destruction, then apply that knowledge starting from pure noise at inference time.
The training stability of diffusion models, rooted in a straightforward supervised objective rather than an adversarial game, is a primary reason they outpaced GANs in quality, diversity, and reliability.
Text prompts guide generation through a text encoder such as CLIP, which translates language into a numerical representation. Classifier-free guidance amplifies the influence of this representation, and the guidance scale controls that amplification.
Latent diffusion, used in Stable Diffusion, dramatically reduces compute by running the denoising process in a compressed space and only expanding to full resolution at the final step.
The main trade-off is inference speed: sequential denoising steps cannot be parallelised, making image generation fundamentally slower than GAN alternatives, though modern samplers have reduced this cost significantly.
Understanding step count, guidance scale, model selection, and negative prompting gives you practical leverage over outputs and helps you diagnose quality issues when they arise.

