A new tool, Data Provenance Explorer, lets users pick through the questionable provenance of many large data sets used for AI training. Credit: Shutterstock A new online tool allows users to identify, track and learn about the legal status of training data sets for generative AI, and a quick glance shows that many may have licensing issues. The tool, dubbed the Data Provenance Explorer, is the result of a joint effort between machine learning and legal experts from MIT, generative AI API provider Cohere, and 11 other organizations — Harvard Law School, Carnegie Mellon University and Apple are all among the contributors. The Data Provenance Explorer lets researchers, journalists and anyone else search through thousands of AI training databases and trace the “lineage” of widely used data sets. The idea is to provide a way to explore the sometimes murky world of training data used to develop generative AI. In an official statement announcing the Data Provenance Explorer, the team behind it described a “data transparency crisis” that could complicate the development and commercial use of generative AI systems. Crowdsourced data sets lack licenses “Crowdsourced aggregators like GitHub, Papers with Code, and many of the open source LLMs [large language models] trained from data on these aggregators, have an extremely high proportion of missing data licenses … ranging from 72% to 83%,” the group said. “In addition, the licenses that are assigned by crowdsourced aggregators frequently allow broader use than the original intent expressed by the authors of a data set.” The need for responsibly developed AI is something that the industry appears to be well aware of, according to Kathy Lange, a research director for IDC. The headlong rush to deploy generative AI has created a public focus on the safe and legal use of data, she said. “Understanding the provenance of the data; how it was collected, processed, and transformed can impact the trust in AI model results,” Lange said. “AI vendors prioritizing data provenance will have a leg-up in the market for customers requiring transparency, accountability, and compliance initiatives.” AI data has become nothing less than a battleground, in certain respects. Lange highlighted the recent introduction of the Nightshade tool, which subtly changes digital art in such a way as to confuse AI creators attempting to use copyrighted works for training data. Moreover, authors and other copyright holders have begun to take legal action against the use of their works in generative AI training – comedian and author Sarah Silverman is among those suing OpenAI for this reason. However, the legal landscape for those claims remains murky in many respects. Related content news analysis Apple earnings: About that iPhone 'slump' in China Based on information from Thursday's earnings report, it seems that data pointing to an iPhone slump in China were over-baked. By Jonny Evans May 03, 2024 9 mins iMac iPhone Apple news Microsoft begins to phase out ‘classic’ Teams Microsoft is encouraging Teams customers to move to the new, faster version of the collaboration app; the older version will be switched off next year. By Matthew Finnegan May 03, 2024 3 mins Microsoft Teams Collaboration Software Productivity Software news analysis Apple confirms it will open up the iPad in Europe this fall The latest efforts to comply with Europe’s Digital Markets Act mean developers can offer to side load apps to both iPhones and iPads in the EU. Apple has also taken steps to improve what it offers to smaller and non-commercial developers in the By Jonny Evans May 02, 2024 6 mins iPad Apple Mobile Apps news Udacity offers laid-off US workers free access to its courses for 30 days Sign-ups will be available over the next 30 days By Lucas Mearian May 02, 2024 4 mins Technology Industry IT Jobs IT Skills Podcasts Videos Resources Events SUBSCRIBE TO OUR NEWSLETTER From our editors straight to your inbox Get started by entering your email address below. Please enter a valid email address Subscribe