How AI Image Generators Actually Work.
The three dominant architectures, explained from first principles. Citations to the original papers, not paraphrased Medium posts.
The clearest analogy for the dominant architecture is sculpting an image out of noise. Imagine a block of static. The model has been trained to take that static and, in repeated small steps, remove the parts that don't belong, until what is left is the image you described. That is diffusion. Behind the analogy sits a precise mathematical procedure published in 2020. Two other architectures preceded it (GANs) and run alongside it (autoregressive token models). Understanding all three lets you read a vendor's docs page and predict, before you ever generate an image, what the tool will be good and bad at.
Diffusion models.
The current dominant architecture. Stable Diffusion, Imagen, Flux, and DALL-E 3 are all diffusion-based.
Training a diffusion model requires a dataset of image-caption pairs and a fixed schedule for adding Gaussian noise. The image at step t=0 is the original. At each step a small amount of noise is added; by step t=T (often 1000 in the original formulation) the image is indistinguishable from pure noise. The model is a neural network trained to predict, given a noised image and the timestep, the noise that was added; predicting the noise is equivalent to predicting the clean image, and from either prediction the slightly less noisy image at the previous timestep can be computed.
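The forward noising process can be sketched in a few lines. This is an illustrative NumPy toy, not production code: the linear schedule values are representative of the DDPM paper's range, and the "image" is a flattened random vector standing in for real pixels.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule, beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product, alpha-bar_t

def noise_image(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in closed form: no loop over steps needed."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps                   # training target: the network predicts eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)         # stand-in for a flattened image
xt, eps = noise_image(x0, T - 1, rng)
# At t = T-1, alpha-bar_t is tiny, so x_t is almost pure Gaussian noise.
```

The closed-form sampling of x_t is why training is cheap per example: the model sees random timesteps, not full noising trajectories.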
At generation time the process runs in reverse. Start with pure Gaussian noise. The trained model predicts the noise to remove for one step. Subtract it. Repeat. By the end of the schedule you have an image. To condition on a text prompt, the prompt is encoded by a separate text model (CLIP, T5, or a similar encoder) and the resulting embedding is fed into the denoiser at each step, so the denoising trajectory bends toward images consistent with the prompt.
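The reverse loop above can be sketched as follows. This is a minimal illustration of the DDPM update rule only: `predict_noise` is a placeholder returning zeros where the trained (and, in practice, prompt-conditioned) denoiser network would go, so the loop shows the mechanics, not real generation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(xt, t):
    # Placeholder for the trained denoiser; a real model would also take the
    # text embedding here so the trajectory bends toward the prompt.
    return np.zeros_like(xt)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)          # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # Mean of the reverse step: subtract the predicted noise, rescaled.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:                        # add fresh noise except at the final step
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

Practical samplers (DDIM, DPM-Solver) replace this 1000-step loop with far fewer steps, which is where the 20-50 step figures quoted later come from.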
The original formulation is in Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (2020) verified April 2026. Score-based generative modelling, a closely related framework, is in Song and Ermon's earlier work and is mathematically equivalent under reasonable assumptions.
A key efficiency advance is latent diffusion. Rombach et al. observed that running diffusion in pixel space is wasteful: most of the computation happens at high resolutions where image structure is dominated by texture rather than content. They introduced an autoencoder that compresses images into a much smaller latent space (typically a 64×64 latent for a 512×512 image), runs the diffusion process there, and decodes back to pixels. This is the architecture behind Stable Diffusion. The paper is Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2022) verified April 2026.
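The saving is easy to quantify. Using Stable Diffusion's published shapes (a 4-channel 64×64 latent for a 3-channel 512×512 image), the denoiser's per-step working set shrinks by roughly 48×:

```python
# Back-of-envelope for why latent diffusion is cheaper: every denoising step
# operates on the latent, not the pixel grid. Shapes are Stable Diffusion's.
pixel_elems = 512 * 512 * 3          # 786,432 values per step in pixel space
latent_elems = 64 * 64 * 4           # 16,384 values per step in latent space
ratio = pixel_elems // latent_elems  # → 48
```

The autoencoder's encode/decode cost is paid once per image, while the ~48× saving applies to every one of the 20-1000 denoising steps.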
A further architectural shift is to replace the conventional UNet denoiser with a transformer. Peebles and Xie's "Scalable Diffusion Models with Transformers" (DiT, 2022) verified April 2026 showed that transformer denoisers scale better with model size. Several 2024-2026 production generators are diffusion transformers.
Generative Adversarial Networks.
The previous generation. Largely supplanted by diffusion for general text-to-image, still useful in narrow tasks.
A GAN is two networks trained against each other. The generator produces images. The discriminator is trained to tell real images from generated ones. Each network's loss depends on the other: the generator improves when it fools the discriminator; the discriminator improves when it correctly identifies fakes. At equilibrium the generator produces images indistinguishable from the training distribution. The architecture was introduced in Goodfellow et al., "Generative Adversarial Nets" (2014) verified April 2026.
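The opposing objectives can be written down directly. This sketch uses stand-in discriminator scores rather than real networks; the losses follow the original paper's formulation, with the non-saturating generator loss Goodfellow et al. recommend in practice.

```python
import numpy as np

# d_real / d_fake are discriminator sigmoid outputs in (0, 1):
# its estimated probability that an input is a real training image.

def discriminator_loss(d_real, d_fake):
    # D wants real images scored near 1 and generated images near 0.
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating form: G wants D to score its fakes near 1.
    return -np.mean(np.log(d_fake))

d_real = np.array([0.9, 0.8])        # D is fairly confident these are real
d_fake = np.array([0.2, 0.1])        # D is fairly confident these are fake
# Here D's loss is low and G's loss is high, so gradient updates push G
# toward outputs that fool D -- the adversarial dynamic in miniature.
```

Mode collapse, in these terms, is G finding a narrow set of outputs with high `d_fake` scores and concentrating all its probability mass there.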
GANs were the dominant image-generation architecture from roughly 2014 to 2020. StyleGAN and BigGAN produced impressive results in narrow domains (faces, animals, specific style categories). Two practical problems pushed the field toward diffusion: training instability (GANs collapse, oscillate, or fail to converge in ways that are difficult to diagnose) and mode collapse, where the generator finds a small subset of outputs that consistently fool the discriminator and stops exploring the rest of the distribution.
GANs remain in use for tasks where the training-instability problem is manageable: face generation, super-resolution, image-to-image translation, certain styles of artistic transfer. They are not the architecture you find behind a 2026 text-to-image generator marketed for general use.
Autoregressive token-based models.
Treats an image as a sequence of tokens, like a language model treats text. Used by Parti, early DALL-E, and several recent multimodal models.
The autoregressive approach treats image generation as a language problem. First, an encoder is trained to map image patches to a fixed vocabulary of discrete tokens, a process called vector quantisation. Each image becomes a sequence of tokens drawn from this vocabulary. Then a transformer is trained to predict the next token given the previous tokens and a conditioning signal (the text prompt). At generation time the transformer produces a token sequence, which the decoder converts back to pixels.
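The vector-quantisation step can be sketched as a nearest-neighbour lookup against a learned codebook. Sizes here are toy (real codebooks hold thousands of entries), and the patch embeddings are random stand-ins for an encoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 8))  # 16-entry vocabulary, 8-dim embeddings
patches = rng.standard_normal((4, 8))    # 4 patch embeddings from an encoder

def quantise(patches, codebook):
    # Squared distance from every patch to every codebook entry,
    # then the index of the nearest entry becomes that patch's token.
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

tokens = quantise(patches, codebook)     # one integer token id per patch
# A transformer is then trained to predict tokens[i] from tokens[:i]
# plus the text-prompt conditioning, exactly like next-word prediction.
```

In training, the codebook itself is learned jointly with the encoder and decoder; here it is fixed only to keep the lookup step visible.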
The technique was systematised in Esser, Rombach, Ommer, "Taming Transformers" (2020) verified April 2026. Google's Parti scaled the approach: Yu et al., "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation" (2022) verified April 2026. OpenAI's original DALL-E (2021) used an earlier variant.
Autoregressive models historically had an edge on text rendering and structured layout, because the sequential left-to-right generation matches how text is read. Diffusion models have closed most of that gap with larger text encoders. The current frontier is hybrid: some recent multimodal models use autoregressive heads for high-level layout and diffusion for refinement, or use unified token vocabularies that span text and image.
Architecture comparison at a glance.
| Aspect | Diffusion | GAN | Autoregressive |
|---|---|---|---|
| Training stability | High | Low (mode collapse) | Medium |
| Inference speed | Slow (many steps) | Fast (single pass) | Slow (token-by-token) |
| Sample diversity | Excellent | Limited | Excellent |
| Text rendering | Improving rapidly | Weak | Historically strong |
| Open-weight options | Many (SD, Flux) | Some (StyleGAN) | Few |
| Local-run feasibility | High (with right GPU) | High | Low (large models) |
Why this matters for choosing a generator.
Open-weight vs closed-weight. Diffusion models are well-represented in both camps. Stable Diffusion variants and Black Forest Labs' Flux are open-weight and runnable locally. DALL-E, Imagen, Midjourney, and Adobe Firefly are closed proprietary. If you need to fine-tune, run on-prem for compliance, or self-host for cost reasons, the architecture choice constrains you.
Inference speed. Diffusion is iterative: a typical generation runs 20-50 denoising steps. Faster samplers (DPM-Solver, Euler) and distillation (consistency models, latent consistency models) have brought generation down to 1-4 steps for some models, but most production systems still accept the latency of a longer schedule in exchange for quality.
Text in images. Diffusion historically struggled with rendering legible text. The fix has been larger text encoders (T5-XXL, large CLIP variants) and character-aware fine-tuning, not architectural change. Generators that emphasise text rendering (Imagen 3, Ideogram, recent SD3 variants) typically discuss their text-encoder size in the docs. See /capabilities.
Reproducibility. Diffusion outputs are deterministic given the same seed, model weights, sampler, and step count. This matters for design workflows where you want to iterate on a prompt while keeping the underlying composition fixed. GAN outputs are also seed-deterministic. Autoregressive outputs are deterministic with greedy or fixed-temperature sampling, less so with stochastic sampling.
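Seed determinism in miniature: seeding the random number generator reproduces the same initial noise bit-for-bit, so a diffusion sampler started from it retraces the same trajectory (assuming identical weights, sampler, and step count). A NumPy sketch:

```python
import numpy as np

# Same seed, same generator, same draw → identical starting noise.
a = np.random.default_rng(42).standard_normal(4)
b = np.random.default_rng(42).standard_normal(4)
same = bool((a == b).all())          # True: bit-identical noise tensors
```

The caveat in practice is hardware: some GPU kernels are non-deterministic, so identical seeds can still diverge across devices or library versions.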
The architecture choice is one input. The capability framework on /capabilities turns these architectural facts into a buying decision.