Training data and provenance.
The question no listicle asks. The answer that separates commercially-safe generators from the rest.
What training data is, for a text-to-image model.
A text-to-image model learns by being shown billions of pairs: an image, and the caption or alt-text that describes it. Across the dataset the model learns associations between linguistic concepts and visual features. The aesthetic range, stylistic coverage, ability to render text, and cultural breadth of the model trace directly back to what was in the training corpus.
Datasets are sourced in four broad ways: web crawls (scraping publicly accessible images and their HTML alt-text), licensed catalogues (commercial stock libraries, photo agencies), proprietary first-party collections (a vendor's own image archive), and public-domain archives (museum collections, government works, expired-copyright corpora). Most production models combine sources.
The composition of the dataset is the single most important determinant of what the model can do, and it is the question with the most legal and commercial weight. A generator that trained on web-scraped art carries different liability tail risk from one that trained on licensed stock. Both can produce excellent images. They sit in different commercial postures.
Public datasets.
LAION is the most discussed family of open datasets. The non-profit Large-scale Artificial Intelligence Open Network publishes corpora compiled from Common Crawl. LAION-5B verified April 2026 contains roughly 5.85 billion image-text pairs filtered for image-text similarity with CLIP; LAION-400M verified April 2026 is an earlier 400 million-pair dataset. Note that LAION withdrew the original LAION-5B in December 2023 after child sexual abuse material was found among the linked URLs, and released a cleaned Re-LAION-5B in 2024.
LAION's corpora are not images; they are URLs and metadata. Researchers download the images themselves at training time. The dataset is therefore an index: the images it points to remain on the original hosts under their original copyright. Training a model on LAION means fetching and processing those images, and whether that constitutes infringement depends on jurisdiction, the training methodology, and any applicable text-and-data-mining exceptions.
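The index-not-archive structure can be sketched concretely. The field names below (url, caption, similarity) mirror the general shape of LAION-style metadata rows but are illustrative, not the exact schema, and the fetch step is indicated only in a comment.

```python
# Sketch: treating a LAION-style dataset as an index of URL/caption rows.
# Field names are illustrative; real releases ship parquet files with a
# similar but not identical schema.

def select_pairs(rows, min_similarity=0.3):
    """Keep rows whose CLIP image-text similarity clears a threshold."""
    return [r for r in rows if r["similarity"] >= min_similarity]

rows = [
    {"url": "https://example.org/cat.jpg", "caption": "a tabby cat",  "similarity": 0.34},
    {"url": "https://example.org/ad.png",  "caption": "SALE 50% OFF", "similarity": 0.12},
]

kept = select_pairs(rows)
# The images themselves stay on the original hosts; a training pipeline
# would fetch each kept row's "url" at training time.
print([r["caption"] for r in kept])  # ['a tabby cat']
```

The similarity filter is the mechanism behind the "filtered by CLIP" description: low-scoring pairs, where caption and image diverge, are dropped before any image is downloaded.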
Stable Diffusion and many open-weight community models trained on LAION subsets, and Google's original Imagen drew partly on LAION-400M alongside internal datasets. The legal challenges to web-scraped training (Andersen v Stability AI, Getty Images v Stability AI) are partly tests of whether LAION-style training is lawful in the relevant jurisdictions.
Other public datasets include Conceptual Captions (Google's release), YFCC100M (Yahoo's Flickr Creative Commons dataset), and COCO (Common Objects in Context, whose annotations are released under a Creative Commons licence). These are smaller and used more for evaluation than for primary training.
Licensed-data models.
Several vendors position themselves around training exclusively on data they hold rights to.
Adobe Firefly. Adobe states that Firefly is trained on Adobe Stock, openly licensed content, and public-domain works. The claim underpins Adobe's "commercially safe" positioning and is reinforced by the indemnification offered on enterprise plans. The Adobe gen-AI user guidelines verified April 2026 and the Firefly FAQ verified April 2026 are the canonical sources.
Getty Generative AI. Getty Images launched its own generator trained on its licensed library. Outputs are commercially licensed and indemnified within the contract terms. The Getty Generative AI page verified April 2026 describes the service.
Shutterstock + OpenAI. Shutterstock licensed its catalogue to OpenAI for training and offers a generator embedded in its platform. Shutterstock's AI generator terms verified April 2026 describe the licence.
Vendors that disclose training-data sourcing publicly are signalling confidence in their position. Vendors that do not disclose are not necessarily exposed; they may have negotiated licences or trained on web-scraped data without published disclosure. The disclosure itself is the buyer-relevant signal.
Opt-out mechanisms.
Rights holders who wish to exclude their work from AI training have several mechanisms available, each honoured to a varying degree by vendors.
Spawning and HaveIBeenTrained. The HaveIBeenTrained verified April 2026 service lets creators search whether their work appears in major training datasets and submit opt-out registrations. Spawning publishes an API verified April 2026 that vendors can query at training time to honour opt-outs. Stability AI and others have publicly committed to honouring Spawning opt-outs.
robots.txt and crawler signals. AI vendors now publish dedicated user-agent tokens that robots.txt rules can target. Common tokens include GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl, whose crawl data feeds many training pipelines), and Google-Extended (a Google product token that controls AI-training use of crawled content rather than a separate crawler). Disallowing these in robots.txt is the most basic opt-out signal, but it depends on the crawler honouring the directive, which is not legally required.
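A minimal robots.txt carrying these signals can be checked with Python's standard-library parser. The crawler tokens are the published ones; the site and paths are made up, and, as noted above, whether a crawler honours the rules is voluntary.

```python
import urllib.robotparser

# robots.txt that disallows the published AI-training crawler tokens
# while leaving ordinary crawlers unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/gallery/"))         # False
print(rp.can_fetch("SomeSearchBot", "https://example.com/gallery/"))  # True
```

The `User-agent: *` block matters: without it, a blanket `Disallow: /` aimed at AI crawlers can accidentally block search indexing as well.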
NoAI metadata tags. DeviantArt introduced HTML robots-meta directives (noai and noimageai) that signal do-not-train. C2PA verified April 2026 (the Coalition for Content Provenance and Authenticity) is developing more comprehensive provenance metadata that travels with the image and signals usage permissions.
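The DeviantArt-style signal is a page-level directive in the standard robots meta tag. A hypothetical page head carrying it might look like this; the directive names are as published, but honouring them remains voluntary for crawlers:

```html
<head>
  <!-- noai: do not use this page's content for AI training.
       noimageai: do not use the images specifically.
       Advisory only; compliance depends on the crawler. -->
  <meta name="robots" content="noai, noimageai">
</head>
```

The same directives can also be sent server-side in an X-Robots-Tag HTTP header, which covers image files that have no HTML page of their own.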
Platform-side opt-outs. Several vendors publish opt-out forms for their own training. Stability AI's opt-out verified April 2026, OpenAI's data controls verified April 2026, and Meta's data-objection forms are the published mechanisms; the operational reality varies.
EU TDM opt-out. Under Article 4 of the EU Copyright Directive (2019/790), rights holders can reserve their rights against text-and-data mining, provided the reservation is explicit and machine-readable; robots.txt and dedicated metadata are the typical mechanisms. The EU AI Act reinforces the obligation on AI providers to respect such reservations.
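One emerging machine-readable convention for the Article 4 reservation is the W3C community-group TDM Reservation Protocol (TDMRep), which expresses the reservation as an HTTP response header or a well-known JSON file. The fragment below follows the draft's published shape, but the spec is still evolving and should be checked against the current draft before deployment:

```
# HTTP response header form
TDM-Reservation: 1

# or, site-wide, in /.well-known/tdmrep.json
[{"location": "/", "tdm-reservation": 1}]
```

A value of 1 asserts the reservation; the draft also allows pointing at a licensing policy for miners who want to negotiate rather than simply abstain.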
How to evaluate a generator's training-data disclosure.
Five things to look for on a vendor's docs or legal pages:
- Explicit sourcing statement. "Trained on Adobe Stock, public-domain works, and openly licensed content" is a precise statement. "Trained on a carefully curated dataset" is not.
- Opt-out mechanism. Does the vendor publish a way for rights-holders to opt out? Has it committed publicly to honouring Spawning, robots.txt, or the EU TDM opt-out?
- Training-data summary under the EU AI Act. Providers of general-purpose AI models placed on the EU market must publish a sufficiently detailed summary of the content used for training. Some non-EU vendors publish the summary anyway.
- Indemnification scope tied to data. If the vendor offers indemnification, does it tie the scope to the training-data sourcing? "We indemnify because our training data is licensed" is a stronger position than "we indemnify subject to standard exceptions".
- Output watermarking and provenance. Does the vendor embed C2PA-style provenance metadata in outputs? This is unrelated to training data but is a parallel signal of the vendor's overall posture on provenance.
A generator that scores well on these five points is publishing the information you need to make an informed commercial-use decision. A generator that does not is not necessarily exposed, but you bear the burden of asking the questions yourself before relying on the model for high-stakes work.
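The five-point check reduces to a trivial scoring sketch. The criterion names mirror the list above; the vendor answers are a made-up illustration, not an assessment of any real product.

```python
# Sketch: scoring a vendor's training-data disclosure against the
# five buyer-relevant criteria above. Inputs are illustrative.

CRITERIA = [
    "explicit_sourcing_statement",
    "published_opt_out_mechanism",
    "eu_ai_act_training_summary",
    "indemnity_tied_to_data_sourcing",
    "c2pa_output_provenance",
]

def disclosure_score(vendor: dict) -> tuple[int, list[str]]:
    """Return (score out of 5, criteria the vendor does not meet)."""
    missing = [c for c in CRITERIA if not vendor.get(c, False)]
    return len(CRITERIA) - len(missing), missing

# Hypothetical vendor answers, for illustration only.
vendor = {
    "explicit_sourcing_statement": True,
    "published_opt_out_mechanism": True,
    "eu_ai_act_training_summary": False,
    "indemnity_tied_to_data_sourcing": True,
    "c2pa_output_provenance": False,
}

score, missing = disclosure_score(vendor)
print(score)    # 3
print(missing)  # ['eu_ai_act_training_summary', 'c2pa_output_provenance']
```

The `missing` list is the useful output: each entry is a question to put to the vendor before relying on the model for high-stakes work.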