Do you want new wave or do you want the truth?
This short piece in the New Yorker by Sue Halpern gives an overview of GPT-4 and OpenAI’s refusal to release information about the data on which it was trained. It’s fine as far as those sorts of pieces go, but I bring it up here because I want to make a point about this paragraph, and particularly the sentence I bolded below:
The opacity of GPT-4 and, by extension, of other A.I. systems that are trained on enormous datasets and are known as large language models exacerbates these dangers. It is not hard to imagine an A.I. model that has absorbed tremendous amounts of ideological falsehoods injecting them into the Zeitgeist with impunity. And even a large language model like GPT, trained on billions of words, is not immune from reinforcing social inequities. As researchers pointed out when GPT-3 was released, much of its training data was drawn from Internet forums, where the voices of women, people of color, and older folks are underrepresented, leading to implicit biases in its output.
This sort of point is frequently made: Models trained on biased data will produce biased output, and naive users are at risk of overlooking this, because of the authoritative and “confident” way LLMs present information. That is accurate, but it tends to frame the problem as one of bad data, as if the “ideological falsehoods” were simply contained there and could be removed, perhaps by hiring even more workers in low-wage regions to function as a mercenary truth brigade. The implication is that “bad” training data can be processed into better training data, and then LLMs will start to generate nonideological truth-hoods.