Independent guide. Not affiliated with any AI image platform.
Best AI Image Generator · The 2026 Buying Guide
The framework · 2,200 words

Six capabilities that separate AI image generators.

The dimensions that actually matter, and a self-test protocol for each. Score any generator on these axes and you can decide for your own use case without anyone ranking products for you.

How to use this page.

Each capability has four parts:
  1. What it is. Plain English plus the technical grounding.
  2. Why it matters. Which use cases lean on this axis.
  3. How to test it. A three-prompt protocol you can run on any generator.
  4. Research. Published benchmarks or papers where they exist.
The combined result is a reusable scorecard. Pair it with the 15-question evaluation checklist for an end-to-end review.

01

Resolution.

Native output size and the upscaling pipeline.

Resolution is the simplest axis but the most often misunderstood. A generator's native resolution is the size of image its model was trained to produce directly. A generator's effective resolution after upscaling is something else entirely. Many generators bolt on a separate super-resolution step that interpolates a 1024×1024 native output up to 4096×4096, which is fine for some uses (web) and inadequate for others (print).

What to read on a vendor's docs page: the native model resolution, supported aspect ratios, the upscaling method (latent upscale, ESRGAN, separate model), and any megapixel ceiling on the API plan.

Why it matters

Print work, large-format displays, billboards, and high-DPI marketing assets fail at 1024px native even if upscaled. Concept artists working at native resolution care about prompt-to-pixel fidelity at that resolution.
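The print-size arithmetic is worth making concrete. A minimal sketch, assuming the common 300 DPI print standard (a convention, not a claim from any vendor's docs):

```python
def max_print_inches(width_px: int, height_px: int, dpi: int = 300) -> tuple[float, float]:
    """Largest physical print size a pixel dimension supports at a given DPI."""
    return (width_px / dpi, height_px / dpi)

# A 1024x1024 native image at print-quality 300 DPI:
w, h = max_print_inches(1024, 1024)     # roughly 3.4 x 3.4 inches
# The same image upscaled to 4096x4096 prints at ~13.7 inches,
# but the added pixels are interpolated, not generated detail.
w4, h4 = max_print_inches(4096, 4096)
```

This is why an upscaled 4× output can satisfy the arithmetic and still fail a print proof: the megapixels are there, the detail is not.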

Test protocol
  1. Generate the same prompt at three resolutions: native, 2× upscale, 4× upscale.
  2. Compare fine detail (skin texture, fabric weave, foliage) for over-smoothing or hallucination.
  3. Try an aspect ratio off the trained defaults (e.g., 21:9 ultrawide). Does composition collapse?
What the research says

PartiPrompts and DrawBench include resolution-related stress prompts. Most current production diffusion generators have native resolutions in the 1024-2048 range; some recent SD3-class models go to 4096 native.

02

Prompt adherence.

Does the model produce what you described, or what it has seen most often?

Prompt adherence is the gap between what you wrote and what came out. Models trained on large datasets develop strong priors: generate "a businessman" and you get a man in a suit, even if you specified neither. The further your prompt strays from the training distribution's centre of mass, the more the model's priors take over.

Three benchmarks are commonly cited. PartiPrompts covers 1,600 prompts across categories testing object counts, spatial relationships, and world knowledge. DrawBench, from the Imagen team, is similar in spirit. GenEval (2023) is more compositional, scoring object counts, colours, and positions.

Why it matters

Marketing teams writing structured briefs care about adherence to brand colour, layout, and explicit attributes. Concept artists exploring ideas care less.

Test protocol
  1. Compositional: "three red apples and two green pears arranged in a row, the green pears on the left". Score: object count, colour, position.
  2. Negation: "a beach scene without any people". Diffusion struggles with negation; some generators handle it better than others.
  3. Counterfactual: "a giraffe wearing a tuxedo eating spaghetti". Score: do all three concepts appear coherently?
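Tallying the protocol above into a GenEval-style number is straightforward: score each explicit attribute pass/fail and take the fraction passed. A minimal sketch; the check names are illustrative, not taken from the benchmark itself:

```python
def adherence_score(checks: dict[str, bool]) -> float:
    """GenEval-style per-prompt score: fraction of explicit attributes satisfied."""
    return sum(checks.values()) / len(checks)

# Manual scoring of the compositional prompt from step 1
# (three red apples, two green pears, pears on the left):
checks = {
    "apple_count_is_3": True,
    "pear_count_is_2": True,
    "apples_are_red": True,
    "pears_are_green": True,
    "pears_on_left": False,   # spatial position is the attribute most often dropped
}
score = adherence_score(checks)   # 0.8
```

Averaging this across all three protocol prompts gives a per-generator adherence number you can compare directly.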
What the research says

GenEval scores at the model level have risen from <30% accuracy on early SD1.5 to >70% on top 2024-2026 models. Vendor-published results should be read with the usual scepticism, but the absolute numbers are still a useful anchor.

03

Text rendering.

Readable text in the image. Architecturally hard until recent generations.

Text rendering was the canonical embarrassment of early diffusion models. The model knew "a sign that says HELLO" meant something in the image should resemble letters but produced shapes that looked like an alien alphabet. The technical reason was that early models used CLIP ViT-L/14 as the text encoder, which compresses text aggressively and loses character-level information.

The fix has been bigger text encoders (T5-XXL with 4.6B parameters in Imagen and SD3) and character-aware training data (rendering text as image-level features the model can see). Imagen and Ideogram both publish their text-rendering improvements as a marketing point; SD3 made T5 conditioning a notable shift.

Why it matters

Marketing assets with logos, posters, signage, packaging mockups, social-card text, infographic labels. If your use case has any text-on-image requirement this is a top-three axis.

Test protocol
  1. Short word: "a coffee cup with the word ESPRESSO printed on it".
  2. Phrase: "a vintage poster reading 'GRAND OPENING SATURDAY' in art deco lettering".
  3. Mixed case and numbers: "a storefront sign reading 'Open from 9:00 to 18:30'".
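Scoring these three prompts can be made objective with a normalised edit distance between the target string and what actually rendered (read off manually or via OCR). A self-contained sketch, assuming nothing beyond the standard library:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum single-character edits (insert/delete/substitute) between strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def text_accuracy(target: str, rendered: str) -> float:
    """1.0 = perfect rendering; 0.0 = nothing recoverable."""
    if not target:
        return 1.0
    return max(0.0, 1 - levenshtein(target.upper(), rendered.upper()) / len(target))

text_accuracy("ESPRESSO", "ESPRESO")   # 0.875: one dropped letter
```

Case-folding before scoring is a judgment call; drop the `.upper()` calls if your use case (the mixed-case storefront sign in step 3) makes casing part of the requirement.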
What the research says

No standardised benchmark dominates. Anecdotal community testing on Reddit and Hacker News tracks the state of the art month to month. Vendor docs describing the text encoder (T5-XXL vs CLIP-L) are a useful proxy.

04

Style control.

Presets, LoRAs, reference images, style transfer APIs.

Style control covers the ways a generator lets you pin output to a specific aesthetic. Five mechanisms are common: prompt-engineered style tokens ("in the style of art nouveau"), preset style packs (vendor-curated and selectable from a UI), reference images (you upload an example and the generator matches the style), LoRAs and other lightweight fine-tunes (small adapters trained on a specific style), and full fine-tuning (training a custom variant of the model).

Open-weight models support LoRAs natively because the architecture is exposed. Closed-weight models offer reference images and presets as the available levers. If your team needs a consistent visual identity across hundreds of images, the depth of style-control mechanisms matters more than the model's out-of-box quality.

Why it matters

Brand teams running a campaign. Concept artists with a defined aesthetic. Anyone producing volume content with consistent look-and-feel.

Test protocol
  1. Style transfer: same prompt, three different style references. Does the generator faithfully apply each?
  2. Style preservation: same style reference, three different subject prompts. Does the style stay constant or get overwritten by subject?
  3. Style stacking: layer multiple style modifiers ("art nouveau" + "sepia tones" + "detailed line work"). Does it cohere or muddle?
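Step 2 (style preservation) can be made quantitative if you embed each output with an image encoder such as CLIP: the mean pairwise cosine similarity of the embeddings is a rough consistency score. A sketch over plain Python lists; the embedding step itself is assumed, not shown:

```python
from itertools import combinations
from math import sqrt

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def style_consistency(embeddings: list[list[float]]) -> float:
    """Mean pairwise cosine similarity across outputs sharing one style reference."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A generator that holds the style across three subject prompts should score noticeably higher than one whose style gets overwritten by the subject; the absolute value depends on the encoder, so compare generators on the same embeddings.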
What the research says

LoRA architecture is described in Hu et al., "LoRA: Low-Rank Adaptation" (2021). The technique has become standard for community-trained style adapters on open-weight models.
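The LoRA update from Hu et al. is simple enough to sketch: a frozen weight matrix W gets a trainable low-rank correction BA, scaled by alpha/r. A numpy illustration with arbitrarily chosen shapes; this is the core idea, not any particular library's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 4, 8           # r << min(d, k) is the point of LoRA

W = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01      # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialised
                                        # so the adapter starts as a no-op

def forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x (BA)^T: base model plus the style adapter."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Only A and B are trained on the style dataset, which is why a style LoRA is megabytes rather than gigabytes and why open-weight models, whose W is exposed, support them natively.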

05

Photorealism.

Lighting, materials, depth, photographic-style fine-tuning.

Photorealism in 2026 is essentially solved at the close-up level: generated portraits, products, food, and isolated objects look photographic. The harder problems are scene-scale: depth-of-field consistency, accurate shadows from a specified light source, reflective surfaces, and material physics under unusual lighting (a translucent object backlit, for example). These remain inconsistent across all current generators.

The architectural levers are training data composition (heavy photographic content vs broader artistic content), photographic fine-tuning, and prompt vocabularies for photographic concepts (camera, lens, film stock, lighting setup terminology). Generators positioned for photorealistic output typically train on filtered photographic subsets and surface camera-style prompt parameters in their UI.

Why it matters

E-commerce product photography, lifestyle marketing imagery, real-estate visualisation, automotive marketing.

Test protocol
  1. Material rendering: "a glass of water on a marble countertop, late afternoon sunlight from the left, shallow depth of field". Look at glass refraction, water meniscus, marble veining.
  2. Skin and texture: "close-up portrait, natural light, no make-up, freckles, 50mm lens". Look at pore detail and the uncanny-valley at the edges.
  3. Scene-scale depth: "a street view in Lisbon, three layers of depth, golden hour". Check shadow direction consistency across layers.
What the research says

No single accepted photorealism benchmark dominates. The GenEval and PartiPrompts subsets cover photographic prompts. Community side-by-side comparisons on Reddit's /r/StableDiffusion and similar venues track the evolving state of the art.

06

Subject consistency.

Same character or product across multiple images. Architecturally difficult.

Subject consistency means generating the same character (face, body, costume) or the same product (geometry, materials, colour) across multiple images so the result reads as a coherent series. Diffusion models are not natively consistent; small prompt changes shift the output. The mechanisms developed to enforce consistency are reference images (IP-Adapter, ControlNet variants), seed locking with prompt scaffolding, character-trained LoRAs, and recent "character sheet" modes.

For commercial concept-art workflows this is often the binding constraint. A studio commissioning fifty illustrations of the same protagonist needs a workflow, not just a model.

Why it matters

Concept art for film, games, comics. Brand mascots and product hero shots reused across assets. Storyboarding and narrative illustration.

Test protocol
  1. Character: train a LoRA or use a reference-image feature on a single defined character; generate ten variations across different poses and lighting; count the breaks in geometry, costume, hair detail.
  2. Product: same approach with a product hero image; check material colour drift, logo distortion, geometry flips.
  3. Scene continuity: generate three views of the same scene from different angles; check whether elements stay in their right positions.
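Counting breaks, as steps 1 and 2 require, reduces to a simple rate once you annotate each image. A sketch; the break categories in the comment are illustrative examples, not a fixed taxonomy:

```python
def consistency_rate(breaks_per_image: list[int]) -> float:
    """Fraction of images in a series with zero identity breaks."""
    return sum(1 for b in breaks_per_image if b == 0) / len(breaks_per_image)

# Ten character variations, manually annotated with break counts
# (e.g. wrong eye colour, missing costume element, changed hair detail):
breaks = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
consistency_rate(breaks)   # 0.7: seven of ten images hold the character
```

For the fifty-illustration studio scenario, this rate is the number to track across workflow changes (new LoRA, different reference image, tighter prompt scaffold).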
What the research says

IP-Adapter (2023) is one of the more cited reference-image conditioning techniques. ControlNet variants for pose and depth conditioning extend the toolkit.

Score, don't rank.

Run the test protocols on two or three generators that fit your access model and budget. Score each axis on a 1-5 scale. Weight the axes by what your use case actually demands. The result is a personal recommendation built from observable evidence, not someone else's subjective ranking. Then read /licensing to confirm the commercial-use side, and /training-data to check provenance.
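The scorecard described above can be sketched in a few lines. The weights and scores here are illustrative, not recommendations; set your own from your use case:

```python
def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted average of 1-5 axis scores, normalised back to a 1-5 scale."""
    total_w = sum(weights.values())
    return sum(scores[axis] * weights[axis] for axis in scores) / total_w

# Example weighting for a marketing team: text rendering and style
# control dominate; raw photorealism matters less.
weights = {"resolution": 1, "adherence": 2, "text": 3,
           "style": 3, "photorealism": 1, "consistency": 2}
generator_a = {"resolution": 4, "adherence": 3, "text": 5,
               "style": 4, "photorealism": 3, "consistency": 2}
weighted_score(generator_a, weights)   # ~3.67 out of 5
```

Run the same weights over two or three candidates and the comparison falls out directly; change the weights, not the scores, when your use case changes.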