A recent piece by John Herrman begins with a familiar, reassuring claim: that tech companies may run out of training data for their AI models. He cites a research paper co-authored by 49 researchers (a ratio of more than one author per page!) called “Consent in Crisis: The Rapid Decline of the AI Data Commons.” As that title suggests, the paper takes off from the questionable proposition that something like a “data commons” can exist in the same way a village may share common lands — as though the data were simply there and what is important is an equitable way of maintaining it for use. As the paper puts it: “The web has become the primary communal source of data, or ‘data commons’, for general-purpose and multi-modal AI systems.”
What community are they talking about? And what aspect of the “web” do they mean? The text, the interactions, the metadata, the behavioral data, the location data, or what? Since when is the “web” owned in common? It verges on tautology, but if a piece of information is valuable, someone will assert property rights over it, and if property rights are invoked, you can assume the information involved can be exploited. Once machine learning models make seemingly valueless data purportedly useful, property rights begin to be claimed over it again. There will likewise be an incentive for more data to be made, with or without the consent of the objects of that data.
“Data” is made, not found and depleted like a natural resource.