Images are just videos with a single frame
Last week, OpenAI posted this “technical report” (a.k.a. an advertisement) for its video generation model in development, named Sora, complete with cherry-picked examples of its capabilities and bombastic claims of how it will someday overcome its intrinsic limitations and become a “world simulator” rather than a more capable mimic of its training data. At this stage in the hype cycle, we have been invited to ooh and aah about the model without being able to use it ourselves, and as Brian Merchant details here, many have leaped at the chance to be part of OpenAI’s buzz-marketing squad.
As with all the other generative models, it is possible to imagine artists using Sora to help them make compelling work, but it is much easier to imagine corporations and assorted unaffiliated grifters using it to flood the existing spaces for video with material that no one wants to put any effort into making, let alone watch. If generative models make video creation easier, we will see more unwanted video in more and more places, as well as the further development of techniques to use video to confuse and con people. Automating some aspects of video production will mean the elimination of some jobs in the field altogether and the immiseration of those that remain, as those workers will have to make up for the lost ingenuity and camaraderie of their terminated colleagues while accommodating the relentlessness and mindlessness of their replacement. They will have to produce more with less while taking on all the responsibility for the end results (the one thing models will never be able to do).
Sora’s primary purpose, as Merchant suggests, is to attract investment. When OpenAI declares, “We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them,” they are primarily saying, “Please trust us and give us trillions of dollars of funding.” What they mean by “capable simulators of the physical and digital world, and the objects, animals and people that live within them” is “simulators capable of securing more funding for their continued scaling.” They don’t even bother to situate it in some larger context; they just proceed as though the criterion of “capability” is self-evident, manifest in some supposed wow factor in the mundanity of the model’s pattern recognitions. “Sample quality improves markedly as training compute increases,” OpenAI asserts, and then shows three equally pointless, context-free gifs of a dog playing in snow. Some appear to be more photorealistic, but is that really the universal measure of “quality” in all instances? It seems more like an obtuse or insistent literalism.
Consider this passage, which falls under the ambiguously parsable heading “Language Understanding”:
Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.
This glosses over some issues that, if Foucault is to be believed, have proven very significant for our current episteme. In what sense can text “correspond” with video, such that these researchers are okay to just take it for granted? A picture is worth 1,000 words — no more, no less. Developing a model to produce textual captions compounds the issue, withdrawing one layer further into representations of representations whose foundations are insecure, apparently untheorized. From whose perspective are these textual versions complete or sufficient or even “highly descriptive” accounts of what is transpiring in a video? Again, what does “quality” mean, what does “fidelity” mean if you are systematically working to eliminate the subject, for whom such words can have any meaning?
I was especially struck by how, in the midst of some technical generalities about how the model works on “compressed input video” and “spacetime patches,” the OpenAI researchers declare that “images are just video with a single frame.” That seems like a banal truism from the perspective of our screen-saturated lives, where all images are in constant circulation, constant motion, but it also inverts the chronology to shift the ontological emphasis: Video is the “normal” or “natural” equivalent of our sensory experience, and images are just stunted, very short videos, not a totally different kind of representation aiming at something irreducible to a simulation of “experience.”
If once there were images and then moving images, dependent on being printed in some definitive form for their existence, now there is only video, the phenomenology of the screen. Screens organize the full breadth of our existence, so video can therefore be a “world simulator”: nothing (at least nothing that matters) is left out, and it can all be captured in the language of a prompt, which is of course perfectly transparent to itself.
In this short thread in response to the Sora announcement, Roland Meyer notes that “an aesthetic of continuous flow unites many of today's most prominent visual phenomena” in which “time can be stretched, reversed, looped, and modulated without consequences; visual coherence trumps the logic of cause and effect, and any temporal sequence can be seamlessly connected to any other sequence.” Anything that can appear on screen is essentially the same; it can be sutured together without violating the implicit sense of how the world hangs together now. The Sora demo illustrates this by merging different videos together in various ways, or animating still images with predictive renderings of what sorts of movement should have preceded and succeeded what was depicted (a.k.a. “simulating the world”).
Visual coherence is OpenAI’s only apparent criterion for “fidelity”; it simulates a world where causal relations are fully reduced to that coherence. But coherent for whom? Who is looking and why? Even if we accept the replacement of cause and effect with statistical generation and the reduction of experience to a 2-D video simulation on demand, we can’t get around how it all still requires subjectivity. Instead we get to see the world through the meager, stunted perspective afforded by a company and its investors who can see nothing but money.
The dopamine culture
A recent post by Ted Gioia, modestly titled “The State of Culture, 2024,” seemed to resonate with a lot of people this past week. It purports to explain “why entertainment is dead” — why the entertainment industry has become less profitable — by claiming it has been swallowed by a larger “distraction” industry, namely various tech platforms that offer doses of immediacy. But rather than take a political economy–based approach to this development, as Anna Kornbluh does in her recent book, Gioia settles on a reductive behaviorist account “based on body chemistry, not fashion or aesthetics,” as he says. The premise is that human brains are “hardwired” to be abjectly vulnerable to certain manipulative forms of media that can trigger secretions in our lizard brain that leave us hopelessly addicted and incapable of slow traditional concentration. Wicked technologists have developed sly means for insinuating these irresistible media products into our lives, and more and more of the erstwhile humans walking among us have been ensnared by them and are now helplessly prostrate before the evils of clickbait, short texts, and sports betting. Help me now, I’m scrolling.
This description makes comforting sense to me, in part because it absolves me of any responsibility for myself, even as it fails to offer much reason for hope. If dopamine dictates our behavior, who is in a position to resist it and thereby lead the political charge against the “dopamine cartel”? Who are the Ulysseses among us who are tied to the mast? Maybe it’s us, the wise readers and writers of this kind of tech criticism? After all, Gioia can’t be talking about any of his readers when he complains that the victims of the dopamine culture “don’t need Hamlet, a photo of a hamburger will suffice.” (Obviously no one could want both a hamburger and Hamlet; that is why Hamburger Hamlet went out of business.)
Instead of movies, users get served up an endless sequence of 15-second videos. Instead of symphonies, listeners hear bite-sized melodies, usually accompanied by one of these tiny videos—just enough for a dopamine hit, and no more.
Who are these users? We should really take control of their lives for them and rescue them from their body chemistry.
Adrian Hon offered this helpful corrective to the “dopamine” framing, arguing that it is basically used to reserve the term addiction for marginalized people.
It is wrong to use dopamine in place of “addiction” or “habit-forming” but everyone does it anyway. Why? It lends your argument a pinch of authority and style. It also sounds softer than “addiction”, even though that is precisely what most people mean but cannot bring themselves to say. For many, addiction is too harsh a word to use outside of drugs and maybe gambling. It’s a more comforting elision to say you just need the “dopamine”.
Hon also mentions Natasha Dow Schüll’s Addiction by Design, a far better exploration of addictive user interfaces and other systematic techniques companies use to hamper their customers’ judgment.
It also seems incoherent to me to isolate and pathologize “distraction” as a force that somehow obliterates rather than redirects attention. Distraction and absorption imply each other; they are the conditions of each other’s existence. You can’t even recognize “distraction” without some backdrop of absorption. Gioia figures distraction as the pathway to pure passivity and helplessness, as the surrender of “executive focus” and the ultimate submission to algorithmic forces and so on. But that makes distractibility a strictly individual problem, one that you can solve individually by “unplugging yourself from time to time.” Positing a dopamine culture opens an inquiry into how society establishes and sustains its definition of attention only to immediately slam it shut again. Dopamine is a nonexplanation that renames the facts of social predation and exploitation and naturalizes them, no matter how much we might complain about it.
Regarding your last point on dopamine and the addiction to quick bites of content, we need some sort of medical or psychological breakthrough to help us there. Addiction of all other sorts is so poorly treated in society at large that I really worry this new kind, whether it's social media or "being on my phone too much", will go not only untreated, but totally undiagnosed as we adapt to its existence and admit that we have a problem. There are flashes of these admissions in books and blog posts and such lately, but very few people seem to be doing much about them.
Just last week, I got multiple ads on Instagram (which I make it a point to only open on my computer, once or twice a day at the most) for a company that is adding a new layer of business and profit to the mess: Make money off people who need to recover from their addiction to the thing you're marketing to them on. And beyond that, today my friend remarked that he wanted to start a group called "14 hours", where the goal is to spend 2 hours per day on your phone and 2 hours per day doing some sort of physical activity. He said this could be a Discord community. And immediately I saw its failure, because solving the problems of the Internet with the tools of the Internet is, to me, like trying to use a car to fix a car. It's an ecosystem of technology – not a tool. I wish I could think of a better analogy, though.
Does anyone have solutions? Is there any hope? Perhaps the third generation of Ozempic will be our savior, til it's not.
https://donotresearch.substack.com/p/john-menick-the-narco-image -- an interesting complement to this essay