Over the past two years, copyright owners have filed dozens of lawsuits against AI companies, arguing their work was scraped and fed into models without permission. As of late 2025, at least 63 copyright cases have been filed against AI developers in the U.S. alone, with more abroad.
Many of those lawsuits revolved around text. Increasingly, they center on images and video. The big takeaway for companies: scraped visual data is no longer a safe foundation for commercial products.
The licensed visual data bottleneck
Advanced vision models need three things at once: specific content, diversity, and legal clarity. Today, most datasets miss at least one.
Scraped web images are broad but messy and risky. Legacy stock archives are clean but often skewed toward Western, commercial, and studio settings. Bespoke shoots are accurate but slow and expensive.
Licensing deals are now the center of many high-profile partnerships. Getty Images’ multi-year agreement with Perplexity, for example, gives the startup access to Getty’s creative and editorial visuals for AI search, with attribution and compensation.
Scarcity of specific content
Developers can find plenty of generic lifestyle imagery. The trouble starts when they need niche or rare scenarios.
Think of:
- Industrial faults on specific machines
- Region-specific infrastructure and public services
- Cultural and religious settings that rarely appear in Western stock archives
- Edge cases in safety, accessibility, or disability contexts
When those scenes don’t exist at scale, models hallucinate or fail. Models trained on such gaps develop a skewed view of the world. They underperform on the people and places that were barely present in the data, and they generate visuals that feel off, or outright offensive, to anyone outside the dominant frame.
Data quality and missing metadata
Even when teams have the rights, the files themselves often aren’t ready for training. Images arrive with incomplete tags, inconsistent categories, or no labels at all. Crucial context is missing, and this leaves engineers guessing or relabeling by hand.
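Even a lightweight audit can surface these gaps before training starts. The sketch below is a minimal, hypothetical example: it assumes a JSON manifest with one record per image and counts how many records are missing each required field. The manifest filename and the field names ("file", "license", "source", "tags") are assumptions for illustration, not a standard schema.

```python
# A minimal sketch of a metadata audit over a hypothetical JSON manifest.
# Field names and the manifest path are illustrative, not a standard format.
import json
from collections import Counter

REQUIRED_FIELDS = ["file", "license", "source", "tags"]

def audit_manifest(path: str) -> Counter:
    """Count how many records are missing each required field."""
    missing = Counter()
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    for record in records:
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value in (None, "", []):
                missing[field] += 1
    return missing

if __name__ == "__main__":
    gaps = audit_manifest("images_manifest.json")  # hypothetical file
    for field, count in gaps.most_common():
        print(f"{field}: missing in {count} records")
```

Even a report this simple tells a team whether a dataset can go straight into a pipeline or needs a relabeling pass first.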
How the industry is responding
Under pressure from both performance and regulation, the sector is converging on three main responses.
Licensing platforms as data infrastructure
To replace scraped web images, AI teams are increasingly buying access to licensed archives. Large content companies now sell training-ready image and video packages with clear rights and metadata, instead of leaving customers to reverse-engineer consent after the fact.
Alongside those incumbents, newer platforms are built directly around AI training use cases. Wirestock, for example, aggregates creator content, handles licensing, and supplies visual datasets under explicit AI-training terms.
For creators, this work looks less like “upload and hope” stock and more like defined projects. Through AI freelance photography jobs, creators receive briefs and are paid for accepted sets that go into training.
Synthetic data to fill the gaps
Where real-world images are hard to collect, teams are turning to synthetic data. They use simulation tools, 3D pipelines, or generative models to produce task-specific visuals, then mix those with real, licensed content.
Synthetic datasets can cover edge cases and balance distributions, but they still depend on real imagery as a reference point. Without that anchor, models risk learning from a closed loop of their own outputs.
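In practice, teams often pin that blend explicitly rather than letting synthetic content dominate. The sketch below is a simplified illustration of the idea: it draws a training list from two pools, real licensed images and synthetic ones, at a fixed ratio. The 70/30 split, the pool names, and the file paths are assumptions for the example, not a recommended setting.

```python
# A simplified sketch of mixing real licensed and synthetic samples at a
# fixed ratio. The 70/30 split and the example paths are illustrative only.
import random

def build_training_list(real_paths, synthetic_paths,
                        real_fraction=0.7, total=10_000, seed=0):
    """Return a shuffled list of image paths drawn from both pools."""
    rng = random.Random(seed)
    n_real = int(total * real_fraction)
    n_synth = total - n_real
    # Sample with replacement so smaller pools can still fill their quota.
    mixed = [rng.choice(real_paths) for _ in range(n_real)]
    mixed += [rng.choice(synthetic_paths) for _ in range(n_synth)]
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    real = [f"licensed/img_{i}.jpg" for i in range(500)]       # hypothetical paths
    synthetic = [f"synthetic/img_{i}.png" for i in range(200)]  # hypothetical paths
    batch = build_training_list(real, synthetic)
    print(len(batch), batch[:3])
```

Keeping the ratio in code, rather than in someone’s head, also makes it easy to show later how much of a model’s training diet was anchored in real, licensed imagery.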
Regulation that demands transparency
Lawmakers are starting to demand visibility into training sources. California’s AB-2013, for example, will require many generative AI developers serving the state to disclose what kinds of data they used and where it came from.
Training data can no longer sit in an unnamed bucket; it has to be documented well enough that regulators, customers, and creators can see how it was assembled.
What this means for AI builders
Scraped, anonymous image folders are now a liability. They slow teams down, attract legal scrutiny, and make every new product conversation harder than it needs to be.
The safer pattern is to train on visual data you can explain. Someone on your team should be able to say, in one sentence, what a dataset contains, where it came from, and what the license allows. If that’s impossible, the model is sitting on borrowed time.
Make a short list of the models that matter for revenue or reputation, and document their main training sources. Treat anything scraped or undocumented as “under review,” then start replacing those sets with licensed or commissioned data.
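One way to start is a small, structured inventory rather than ad-hoc notes. The sketch below assumes a simple record per dataset (name, model it feeds, acquisition route, license summary) and flags anything scraped or undocumented as “under review.” The field names, statuses, and example entries are illustrative assumptions, not a compliance standard.

```python
# A minimal sketch of a dataset provenance inventory. Fields, statuses, and
# example entries are illustrative assumptions, not a legal standard.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    name: str
    model: str           # model or product the dataset feeds
    acquisition: str     # "licensed", "internal", "commissioned", "scraped", "unknown"
    license_note: str    # one-sentence summary of what the license allows

    @property
    def status(self) -> str:
        # Anything scraped or undocumented goes under review.
        if self.acquisition in ("scraped", "unknown") or not self.license_note:
            return "under review"
        return "cleared"

inventory = [
    DatasetRecord("retail-shelves-v2", "shelf-detector", "licensed",
                  "AI training and commercial use permitted"),
    DatasetRecord("web-crawl-2021", "shelf-detector", "scraped", ""),
]

for record in inventory:
    print(f"{record.name}: {record.status}")
```

A spreadsheet works just as well; the point is that each dataset gets a one-line answer to what it is, where it came from, and what the license allows.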
FAQs
We’re not a big AI lab. Do we really need to worry about this now?
If you’re shipping AI features to customers, yes. Enterprise buyers, regulators, and partners are starting to ask where training data comes from, regardless of company size.
What’s a realistic first step to de-risk our visual data?
Start with a spreadsheet. List your key models, the datasets you used, and how those datasets were acquired: licensed archive, internal content, public scrape, or “not sure.” From there, pick one or two high-impact models and start sourcing licensed datasets to replace their riskiest sets.
Can synthetic data solve this on its own?
No. Synthetic images help with coverage and rare scenarios, but they still need real, licensed imagery as a reference. Without that anchor, models risk drifting into a closed loop of their own outputs and failing on real scenes.