This week, in a widely anticipated move, Google and Microsoft both announced their intention to augment search engines with large language models. It marks a shift from “organizing the world’s information,” to borrow from Google’s mission statement, to simulating it. Whereas web crawlers once attempted to index the internet to keep up with its expansion, offering an ever-updating map of an actual evolving territory, large language models arrest and ingest as much of the internet’s content as possible at one particular moment and process it into a static set of statistical probabilities, which can then offer a procedurally generated terrain that is produced only as you navigate through it.
In this New Yorker piece, Ted Chiang describes ChatGPT and other LLMs as “lossy text-compression algorithms” that have tried to shrink the entire internet into what amounts to a discrete zip file, which can then stand alone as a replacement for the internet that a single company can privately own. Rather than point you directly to original documents — and allow whoever made those documents to possibly get some compensation for them — an LLM search replacement would reconstruct the gist of those documents from its compressed file, filling in the blanks of what was compressed out with its best guesses, derived from its statistical analysis. If those lossy 128kbps MP3s were good enough for the constricted sound capabilities of iPod earbuds, then why wouldn’t lossy versions of knowledge be good enough for the diminished truth capacities of our fallen world?
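To make the compression analogy concrete, here is a deliberately crude sketch (not how ChatGPT or any actual LLM works; the function names and the sample text are my own invention) that throws a text away and keeps only its word-to-word statistics, then “decompresses” by sampling plausible continuations. The reconstruction reads fluently, but the original is unrecoverable; the blanks are filled with statistical best guesses.

```python
# Toy illustration of "lossy text compression": keep only word-to-word
# transition counts from a source text, discard the text itself, then
# regenerate something plausible from the statistics alone.
import random
from collections import defaultdict, Counter

def compress(text):
    """Reduce a text to bigram statistics (the 'lossy archive')."""
    words = text.split()
    table = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        table[a][b] += 1
    return dict(table)

def decompress(table, start, length=20, seed=0):
    """Regenerate text from the statistics: plausible, not faithful."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = table.get(out[-1])
        if not followers:
            break
        words, counts = zip(*followers.items())
        out.append(rng.choices(words, weights=counts, k=1)[0])
    return " ".join(out)

source = (
    "the web is a map of the territory and the model is a copy of the map "
    "and the copy of the map is not the territory"
)
archive = compress(source)
print(decompress(archive, "the"))
# Prints something fluent in the style of the source, but the exact
# original sentence cannot be recovered from the archive alone.
```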
Of course, the simulated internet could be injected into the actual internet to pollute it and make it less useful, more difficult to navigate: As Chiang notes, “the more that text generated by large language models gets published on the Web, the more the Web becomes a blurrier version of itself.” The original zip file then becomes a pristine archive by comparison, a lifeboat of truth in an ocean filled with plastic garbage. Chiang assumes that future LLMs will make every effort to keep previously generated material out of the training data — to keep from “repeatedly making photocopies of photocopies” (the eternal return of the “poor image”!) — but I wonder whether the opposite might be true: If you owned that first zip file, you would have a good long-term incentive to ship out as much plastic as possible, and to get as many people as possible to use your plastic-making machine.
Search engines as we currently know them are already rife with incompatible incentives: They interpret users’ search for knowledge as a desire to be sold something. As is frequently pointed out, when Google’s founders first launched their search engine, they noted that “advertising-funded search engines will be inherently biased toward the advertisers and away from the needs of the consumers” and that “advertising income often provides an incentive to provide poor quality search results.” They even went so far as to suggest “from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want” — that is, that ads are the disinformation or non-information that users want search engines to help them circumvent. Ads are much of the trash heap we are sifting through to find what is valuable.
But now, as John Herrman points out here, search engines mainly serve ads and actively obfuscate information, creating the problem that AI-assisted search is supposedly going to solve.
The conventional search-engine interface is terrible on its own terms — the first thing users see, even in this technology demo, is a row of purchase links and an advertising pseudo-result leading to a website I’ve never heard of. The “search,” such as it is, is immediately and constantly interrupted by the company helping you conduct it.
The chat results are, by contrast, approximately what the user asked for: a list with no ads and a bunch of links, and a summary of the sorts of articles that current Bing users would have encountered eventually, after scrolling past and fending off the various obstacles and distractions and misdirections that are typical of modern search, as designed by Google and Microsoft.
Herrman’s point is that you don’t need AI, or any new technology at all, to improve search; you just need to rewind to Google’s ideology circa 1998 and remove the advertising.
In a piece for Nieman Lab, Joshua Benton suggests that “Google has built an empire by ‘polluting’ its search results with ads. Its defenses may seem impenetrable, but it has one potential point of weakness: a competitor who’d be able to give users answers rather than links to websites that might have answers.” From this view, answers and ads are held to be obviously different, but Google has established its dominance precisely by blurring them together, by making ads seem like the answer. Wouldn’t AI be even more helpful in empire-building if it generated ads as answers to meet whatever prompt a specific situation presents? The advantage of being a search company is that your customers don’t know exactly what they are looking for; they just want answers. LLMs provide answers perfectly suited to the context in which they are asked, contoured to fit any specifications, “accurate” to whatever motives are in play, if not to some idea of objective truth.
It seems unlikely that Google would break its own advertising model with AI-assisted search, or that it feels so threatened by Bing+ChatGPT that it had no choice but to take the drastic step of adding AI to its established business. It seems more likely that these demos are a bait-and-switch, meant to make AI-generated results seem less like a floating garbage island and more like an MP3 that is convenient and “good enough.” Users could thereby be encouraged to abandon the open internet for the company-owned zip file. Ads could still surround or be injected into generated material, perhaps with an added air of credibility and faux objectivity. Rather than simply auction off ad placements, search companies may sell the rights to tamper with a model’s training data, writing commercial messages and ideologies directly into the source code of the simulated reality itself. One can even imagine the model being personalized to each user’s account, with advertisers bidding on the opportunity to pollute those models on a bespoke basis.
In other words, even though tech companies may initially frame AI as advertising’s adversary, they will use AI to develop new ways of helping ads defeat information for users who feel as though they have no alternative means of navigation.
Chiang’s analogy is premised on fidelity, on the idea of an original and a copy and of a clear distinction between them. But the premise of AI is to blur that distinction, to make it more difficult or ultimately irrelevant to distinguish between generated and quoted material. No copies and no originals, just the “simulacrum,” as Baudrillard argued, “the hallucinatory resemblance of the real to itself.” That maps onto tech companies’ incentive to make ads and content indistinguishable. Generative models will be put to work to perform the same obfuscation that search engine interfaces already perform, hiding the source and motivation behind how a piece of information is presented so that commercially funded speech can better masquerade as some kind of truth, or at least as the only pile of garbage on which we can build any kind of understanding.