I follow lots of Tumblrs that are basically archives of old images associated with a particular place or time. A typical one will share nothing but magazine ads from the 1950s, or nothing but 20th century tourist photos of Las Vegas. As pleasant as it is to see the specific images pop up in my feed from time to time, it also evokes the effort of the archivists involved, the work they must put in trying to access and sift through other archives, the kinds of serendipity they must feel in coming across something worth adding, the social interactions they must have when people send in their own examples and leads, and so on. An ethic of care comes across in every random picture of Fremont Street after dark, of signs advertising nightclub acts of the 1970s, of casinos imploding in the 1990s.
Obviously, if those archivists stopped working to find more examples from history and instead used the material they had already collected to train generative models to produce an endless series of ersatz images in the same vein, I would not detect that same ethos. It would seem as though they had suddenly decided to destroy the concept they loved by turning it into a banal abstraction, an averaged-out set of visual indicators rather than a kaleidoscope of perceptions on an inexhaustible subject of inspiration.
It is easy to take for granted the value of data. It has come to seem self-evidently useful, as necessary and natural as water. It doesn’t even matter what has been measured and datafied; data in the abstract, as an idea, is taken to be a good thing, and of course there should be more of it, to enrich our knowledge of the world and to make anything that is “data-driven” work better. If data is being collected but not leveraged, why bother? Why have an archive of implosion images if not to simulate any implosion image imaginable?
But to accept that at face value would be to neglect the vast infrastructure involved not merely in collecting data and making it useful and tradable, but also in establishing its reputation for objectivity. Measurement is an ideology; among its central tenets is that there is no such thing as datafication, only data itself, naturally given by the things in themselves. It presents itself as a form of representation that transcends representation: Data is no longer about the world but is instead taken to be the world itself, as though materiality were a matter not of atoms but of information. The image of an implosion is an implosion.
Likewise, this ideology would persuade us to ignore the market for data, which shapes what is measured and how, and have us believe it is more like a natural resource, a found material waiting for refinement rather than a structured informational good without any natural status at all. Implosions just happen.
Calls to measure everything and collect as much data as possible are offered as efficient strategies to better grasp the world as it is. But measurement is an act of power, not observation. Datafication always reifies an existing distribution of power that grants the measurers the ability to decide which aspects of the world count and which ones don’t. Having measurements taken as objective — having representations be treated as realities — requires power and recurrent processes of legitimation.
As tempting as it is to say that not everything can or should be measured, that is insufficient, as it suggests that there is one way to measure things, a true and neutral method when applied to the appropriate objects. In reality, anything can be measured in any number of ways; given the right sociopolitical conditions, any kind of data can be legitimized or made to appear spurious. Data is not the alternative or answer to the vagaries of representation; measurements are just another form of representation and are entangled with all the problems that any other kind of mediation brings about. But enough with the platitudes against positivism.
Nonetheless, they seem pertinent to a useful distinction Eryk Salvaggio makes in this Flickr-sponsored essay about generated images, between “archives” and “datasets.” If “an archive is a place where the context of an individual item is preserved,” datasets are archives “viewed through the abstraction of scale.” They are made up of relabeled “items” whose original contexts have been stripped away to make them more readily commensurable and amenable to aggregation.
Restating that in the terms outlined above, an archive recognizes the power relations intrinsic to measurement (and representation in general), whereas a dataset suppresses them (helping entrench the power relations that underwrite the data it assembles). An archive attempts to retain how and why representations were made; a dataset disregards all that so that representations can masquerade as universal facts. When representations become data, they reinforce the utility of the infrastructures (algorithmic decision-making systems, AI models, etc.) developed to exploit them. And those infrastructures in turn reinforce the power relations authorizing the data.
Archives demand and invite interpretation, prompting the elaboration of social processes for constructing and debating knowledge. Datasets are often meant to end interpretation and circumvent social sense-making processes. By eliminating context (the subjective dimension of representation), data can be used to manipulate a world conceived as entirely lacking subjects and containing only objects. Those doing the manipulating can conceive of themselves as escaping that world, of remaining subjects themselves, though their subjecthood is conditional on their being able to manipulate others as objects.
The output of generative models reflects and reinforces the importance of datasets at the expense of archives; its purpose is to pollute and eventually extinguish archives so that datasets can reign unquestioned. That output obscures the power relations underlying the models’ training data while asserting the unchallengeable authority of sheer scale. Taking it at face value would reinforce the power of the tech companies that make the models, as their efforts to ram AI down people’s throats should make plain. In this assessment of Google’s “AI Overview” strategy, Max Read points out that Google is using generative models to replace the pseudo-marketplace-of-ideas offered by search results with something more overtly authoritarian. “There is not even the veneer of transparency or choice: The workings of the LLM are constitutively opaque and what you are given is not a scroll of posts or search results to pick from but a single-source, semi-regurgitated synthesis to be treated as mystically authoritative.”
The reduction of search results coming out of a range of different contexts and intentions to a single answer, a false unity premised on undisclosed priorities, exemplifies the implosion of archives into the repurposed rubble of datasets. If turning archives into datasets is how tech companies pursue domination, treating models’ outputs as archival and insisting on interpreting them as expressions of sociopolitical relations can serve as a kind of resistance (one that is more inspiring than demanding a better “neoliberal software marketplace,” as Read puts it). Salvaggio suggests that the “movement from archive to dataset in the eyes of AI training isn’t permanent,” and that a pointed analysis of generative images can restore insight into the sorts of power relations they are otherwise being deployed to mystify.
While this can be revealing about some of the social conditions and politics that have informed the construction of a dataset — Salvaggio highlights how omissions may become visible as a kind of ghostly trace; how series of generated images can be read negatively, revealing what has been systematically excluded, if not why — it seems like an oblique way of addressing the politics of Flickr aggregating the archives developed by its users into a dataset to be exploited by industry. “Most images shared online were never intended to become ‘data,’ and in many ways, this transformation into data is at odds with the real value at the heart of what these archives preserve,” Salvaggio writes. Perhaps that is what we should primarily see in all AI output, regardless of what is depicted. Every generated image reveals an archive turned into a dataset and the local decisions and negotiations about what should count being subsumed by data aggregators cutting their own deals.
This is excellent. I write about human reproduction and abortion in my newsletter, and the "public fetus" as an entity constructed by ultrasound data, out-of-context photos, and other data points is a figure I'm constantly returning to. There's a lot of great phrasing here about how that process works for any subject.
Yes, indeed. Excellent as always, Rob.
Though I find it notable that Tumblr now stands as a site of relative "authenticity" (a slippery word you're well-skeptical of, I know, as we should be). But it's ironic to me that what previously seemed like a vast self-service laundry for orphaned gif-socks is, in hindsight, a place of relative meaning and even, as you say, community. Amazing to think how quickly things devolved even further, after I wrote this for LARB: https://lareviewofbooks.org/article/tumblrst-tumbl-ever-tumbld-found-angel-history-trapped-flypaper-social-media/