How AI Image Generators Actually Work.
The three dominant architectures, explained from first principles. Citations to the original papers, not paraphrased Medium posts.
The clearest analogy for the dominant architecture is sculpting an image out of noise. Imagine a block of static. The model has been trained to take that static and, in repeated small steps, remove the parts that don't belong, until what is left is the image you described. That is diffusion. Behind the analogy sits a precise mathematical procedure published in 2020. Two other architectures preceded it (GANs) and run alongside it (autoregressive token models). Understanding all three lets you read a vendor's docs page and predict, before you ever generate an image, what the tool will be good and bad at.
Diffusion models.
The current dominant architecture. Stable Diffusion, Imagen, Flux, and DALL-E 3 are all diffusion-based.
Training a diffusion model requires a dataset of image-caption pairs and a fixed schedule for adding Gaussian noise. The image at step t=0 is the original. At each step a small amount of noise is added; by step t=T (often 1000 in the original formulation) the image is indistinguishable from pure noise. The model is a neural network trained to predict, given a noised image and the timestep, the noise that was added; predicting the noise is equivalent to predicting the clean image, and from either prediction the slightly less noisy image at the previous timestep can be computed.
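The forward noising process can be sketched in a few lines. This is an illustrative NumPy toy, not production code: the linear schedule values are representative of the DDPM paper's range, and the "image" is a flattened random vector standing in for real pixels.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule, beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product, alpha-bar_t

def noise_image(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in closed form: no loop over steps needed."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps                   # training target: the network predicts eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)         # stand-in for a flattened image
xt, eps = noise_image(x0, T - 1, rng)
# At t = T-1, alpha-bar_t is tiny, so x_t is almost pure Gaussian noise.
```

The closed-form sampling of x_t is why training is cheap per example: the model sees random timesteps, not full noising trajectories.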
At generation time the process runs in reverse. Start with pure Gaussian noise. The trained model predicts the noise to remove for one step. Subtract it. Repeat. By the end of the schedule you have an image. To condition on a text prompt, the prompt is encoded by a separate text model (CLIP, T5, or a similar encoder) and the resulting embedding is fed into the denoiser at each step, so the denoising trajectory bends toward images consistent with the prompt.
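The reverse loop above can be sketched as follows. This is a minimal illustration of the DDPM update rule only: `predict_noise` is a placeholder returning zeros where the trained (and, in practice, prompt-conditioned) denoiser network would go, so the loop shows the mechanics, not real generation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(xt, t):
    # Placeholder for the trained denoiser; a real model would also take the
    # text embedding here so the trajectory bends toward the prompt.
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)          # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # Mean of the reverse step: subtract the predicted noise, rescaled.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:                        # add fresh noise except at the final step
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

Practical samplers (DDIM, DPM-Solver) replace this 1000-step loop with far fewer steps, which is where the 20-50 step figures quoted later come from.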
The original formulation is in Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (2020) verified April 2026. Score-based generative modelling, a closely related framework, is in Song and Ermon's earlier work and is mathematically equivalent under reasonable assumptions.
A key efficiency advance is latent diffusion. Rombach et al. observed that running diffusion in pixel space is wasteful: most of the computation happens at high resolutions where image structure is dominated by texture rather than content. They introduced an autoencoder that compresses images into a much smaller latent space (typically a 64×64 latent for a 512×512 image), runs the diffusion process there, and decodes back to pixels. This is the architecture behind Stable Diffusion. The paper is Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2022) verified April 2026.
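The saving is easy to quantify. Using Stable Diffusion's published shapes (a 4-channel 64×64 latent for a 3-channel 512×512 image), the denoiser's per-step working set shrinks by roughly 48×:

```python
# Back-of-envelope for why latent diffusion is cheaper: every denoising step
# operates on the latent, not the pixel grid. Shapes are Stable Diffusion's.
pixel_elems = 512 * 512 * 3          # 786,432 values per step in pixel space
latent_elems = 64 * 64 * 4           # 16,384 values per step in latent space
ratio = pixel_elems // latent_elems  # → 48
```

The autoencoder's encode/decode cost is paid once per image, while the ~48× saving applies to every one of the 20-1000 denoising steps.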
A further architectural shift is to replace the conventional UNet denoiser with a transformer. Peebles and Xie's "Scalable Diffusion Models with Transformers" (DiT, 2022) verified April 2026 showed that transformer denoisers scale better with model size. Several 2024-2026 production generators are diffusion transformers.
Generative Adversarial Networks.
The previous generation. Largely supplanted by diffusion for general text-to-image, still useful in narrow tasks.
A GAN is two networks trained against each other. The generator produces images. The discriminator is trained to tell real images from generated ones. Each network's loss depends on the other: the generator improves when it fools the discriminator; the discriminator improves when it correctly identifies fakes. At equilibrium the generator produces images indistinguishable from the training distribution. The architecture was introduced in Goodfellow et al., "Generative Adversarial Nets" (2014) verified April 2026.
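The opposing objectives can be written down directly. This sketch uses stand-in discriminator scores rather than real networks; the losses follow the original paper's formulation, with the non-saturating generator loss Goodfellow et al. recommend in practice.

```python
import numpy as np

# d_real / d_fake are discriminator sigmoid outputs in (0, 1):
# its estimated probability that an input is a real training image.

def discriminator_loss(d_real, d_fake):
    # D wants real images scored near 1 and generated images near 0.
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating form: G wants D to score its fakes near 1.
    return -np.mean(np.log(d_fake))

d_real = np.array([0.9, 0.8])        # D is fairly confident these are real
d_fake = np.array([0.2, 0.1])        # D is fairly confident these are fake
# Here D's loss is low and G's loss is high, so gradient updates push G
# toward outputs that fool D -- the adversarial dynamic in miniature.
```

Mode collapse, in these terms, is G finding a narrow set of outputs with high `d_fake` scores and concentrating all its probability mass there.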
GANs were the dominant image-generation architecture from roughly 2014 to 2020. StyleGAN and BigGAN produced impressive results in narrow domains (faces, animals, specific style categories). Two practical problems pushed the field toward diffusion: training instability (GANs collapse, oscillate, or fail to converge in ways that are difficult to diagnose) and mode collapse, where the generator finds a small subset of outputs that consistently fool the discriminator and stops exploring the rest of the distribution.
GANs remain in use for tasks where the training-instability problem is manageable: face generation, super-resolution, image-to-image translation, certain styles of artistic transfer. They are not the architecture you find behind a 2026 text-to-image generator marketed for general use.
Autoregressive token-based models.
Treats an image as a sequence of tokens, like a language model treats text. Used by Parti, early DALL-E, and several recent multimodal models.
The autoregressive approach treats image generation as a language problem. First, an encoder is trained to map image patches to a fixed vocabulary of discrete tokens, a process called vector quantisation. Each image becomes a sequence of tokens drawn from this vocabulary. Then a transformer is trained to predict the next token given the previous tokens and a conditioning signal (the text prompt). At generation time the transformer produces a token sequence, which the decoder converts back to pixels.
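The vector-quantisation step can be sketched as a nearest-neighbour lookup against a learned codebook. Sizes here are toy (real codebooks hold thousands of entries), and the patch embeddings are random stand-ins for an encoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 8))  # 16-entry vocabulary, 8-dim embeddings
patches = rng.standard_normal((4, 8))    # 4 patch embeddings from an encoder

def quantise(patches, codebook):
    # Squared distance from every patch to every codebook entry,
    # then the index of the nearest entry becomes that patch's token.
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

tokens = quantise(patches, codebook)     # one integer token id per patch
# A transformer is then trained to predict tokens[i] from tokens[:i]
# plus the text-prompt conditioning, exactly like next-word prediction.
```

In training, the codebook itself is learned jointly with the encoder and decoder; here it is fixed only to keep the lookup step visible.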
The technique was systematised in Esser, Rombach, Ommer, "Taming Transformers" (2020) verified April 2026. Google's Parti scaled the approach: Yu et al., "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation" (2022) verified April 2026. OpenAI's original DALL-E (2021) used an earlier variant.
Autoregressive models historically had an edge on text rendering and structured layout, because the sequential left-to-right generation matches how text is read. Diffusion models have closed most of that gap with larger text encoders. The current frontier is hybrid: some recent multimodal models use autoregressive heads for high-level layout and diffusion for refinement, or use unified token vocabularies that span text and image.
Architecture comparison at a glance.
| Aspect | Diffusion | GAN | Autoregressive |
|---|---|---|---|
| Training stability | High | Low (mode collapse) | Medium |
| Inference speed | Slow (many steps) | Fast (single pass) | Slow (token-by-token) |
| Sample diversity | Excellent | Limited | Excellent |
| Text rendering | Improving rapidly | Weak | Historically strong |
| Open-weight options | Many (SD, Flux) | Some (StyleGAN) | Few |
| Local-run feasibility | High (with right GPU) | High | Low (large models) |
Why this matters for choosing a generator.
Open-weight vs closed-weight. Diffusion models are well-represented in both camps. Stable Diffusion variants and Black Forest Labs' Flux are open-weight and runnable locally. DALL-E, Imagen, Midjourney, and Adobe Firefly are closed proprietary. If you need to fine-tune, run on-prem for compliance, or self-host for cost reasons, the architecture choice constrains you.
Inference speed. Diffusion is iterative: a typical generation runs 20-50 denoising steps. Faster samplers (DPM-Solver, Euler) and distillation (consistency models, latent consistency models) have brought generation down to 1-4 steps for some models, but most production systems still accept the latency of a longer schedule in exchange for quality.
Text in images. Diffusion historically struggled with rendering legible text. The fix has been larger text encoders (T5-XXL, large CLIP variants) and character-aware fine-tuning, not architectural change. Generators that emphasise text rendering (Imagen 3, Ideogram, recent SD3 variants) typically discuss their text-encoder size in the docs. See /capabilities.
Reproducibility. Diffusion outputs are deterministic given the same seed, model weights, sampler, and step count. This matters for design workflows where you want to iterate on a prompt while keeping the underlying composition fixed. GAN outputs are also seed-deterministic. Autoregressive outputs are deterministic with greedy or fixed-temperature sampling, less so with stochastic sampling.
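Seed determinism in miniature: seeding the random number generator reproduces the same initial noise bit-for-bit, so a diffusion sampler started from it retraces the same trajectory (assuming identical weights, sampler, and step count). A NumPy sketch:

```python
import numpy as np

# Same seed, same generator, same draw → identical starting noise.
a = np.random.default_rng(42).standard_normal(4)
b = np.random.default_rng(42).standard_normal(4)
same = bool((a == b).all())          # True: bit-identical noise tensors
```

The caveat in practice is hardware: some GPU kernels are non-deterministic, so identical seeds can still diverge across devices or library versions.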
The architecture choice is one input. The capability framework on /capabilities turns these architectural facts into a buying decision.