by Jon Gold

Senior Writer

Generative AI training data sets are now trackable – and often legally complicated

news

Oct 26, 2023 · 3 mins

Enterprise Applications, Generative AI, Legal

A new tool, Data Provenance Explorer, lets users pick through the questionable provenance of many large data sets used for AI training.

A new online tool allows users to identify, track and learn about the legal status of training data sets for generative AI, and a quick glance shows that many may have licensing issues.

The tool, dubbed the Data Provenance Explorer, is the result of a joint effort between machine learning and legal experts from MIT, generative AI API provider Cohere, and 11 other organizations, with Harvard Law School, Carnegie Mellon University and Apple all among the contributors. The Data Provenance Explorer lets researchers, journalists and anyone else search through thousands of AI training databases and trace the “lineage” of widely used data sets.

The idea is to provide a way to explore the sometimes murky world of training data used to develop generative AI. In an official statement announcing the Data Provenance Explorer, the team behind it described a “data transparency crisis” that could complicate the development and commercial use of generative AI systems.

Crowdsourced data sets lack licenses

“Crowdsourced aggregators like GitHub, Papers with Code, and many of the open source LLMs [large language models] trained from data on these aggregators, have an extremely high proportion of missing data licenses … ranging from 72% to 83%,” the group said. “In addition, the licenses that are assigned by crowdsourced aggregators frequently allow broader use than the original intent expressed by the authors of a data set.”

The need for responsibly developed AI is something that the industry appears to be well aware of, according to Kathy Lange, a research director for IDC. The headlong rush to deploy generative AI has created a public focus on the safe and legal use of data, she said.

“Understanding the provenance of the data; how it was collected, processed, and transformed can impact the trust in AI model results,” Lange said. “AI vendors prioritizing data provenance will have a leg-up in the market for customers requiring transparency, accountability, and compliance initiatives.”

AI data has, in some respects, become a battleground. Lange highlighted the recent introduction of the Nightshade tool, which subtly changes digital art in such a way as to confuse AI creators attempting to use copyrighted works for training data. Moreover, authors and other copyright holders have begun to take legal action against the use of their works in generative AI training – comedian and author Sarah Silverman is among those suing OpenAI for this reason. However, the legal landscape for those claims remains murky.

Jon Gold covers IoT and wireless networking for Network World. He can be reached at jon_gold@ifoundrycodg.com.
