
What generative AI can learn from the primordial swamp

By News Room
Last updated: 2024/08/01 at 8:44 AM


First, we learn that generative AI models can “hallucinate”, an elegant way of saying that large language models make stuff up. As ChatGPT itself informed me (in this case reliably), LLMs can generate fake historical events, non-existent people, false scientific theories and imaginary books and articles. Now, researchers tell us that some LLMs might collapse under the weight of their own imperfections. Is this really the wonder technology of our age on which hundreds of billions of dollars have been spent?

In a paper published in Nature last week, a team of researchers explored the dangers of “data pollution” in training AI systems and the risks of model collapse. Having already ingested most of the trillions of human-generated words on the internet, the latest generative AI models are now increasingly reliant on synthetic data created by AI models themselves. However, this bot-generated data can compromise the integrity of the training sets because of the loss of variance and the replication of errors. “We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models,” the authors concluded.

Like the mythical ancient serpent Ouroboros, it seems, these models are eating their own tails. 
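The loss-of-variance mechanism the Nature paper describes can be illustrated with a toy simulation. This is a minimal sketch, not the paper's experimental setup: the "model" here is just a fitted Gaussian, and the sample size and generation count are illustrative assumptions chosen to make the effect visible.

```python
import random
import statistics

# Toy sketch of "model collapse": each generation of a model is trained
# only on data sampled from the previous generation. Here "training"
# means estimating (mean, stdev) and "generation" means sampling from
# the fitted Gaussian. Estimation noise compounds across generations,
# and the learned variance tends to shrink -- the loss of variance.

random.seed(42)

N = 50              # samples per generation (small, to make drift visible)
GENERATIONS = 1000  # how many times a model trains on its predecessor

def train(data):
    """Fit the toy model: estimate mean and standard deviation."""
    return statistics.mean(data), statistics.stdev(data)

def generate(mu, sigma, n):
    """Produce synthetic training data from the fitted model."""
    return [random.gauss(mu, sigma) for _ in range(n)]

# Generation 0 trains on "real" data drawn from a standard normal.
data = [random.gauss(0.0, 1.0) for _ in range(N)]
stdevs = []
for _ in range(GENERATIONS):
    mu, sigma = train(data)
    stdevs.append(sigma)
    data = generate(mu, sigma, N)  # the next model sees only bot output

print(f"generation 0 stdev: {stdevs[0]:.4f}")
print(f"final generation stdev: {stdevs[-1]:.4f}")
```

Run repeatedly with different seeds, the final standard deviation drifts well below the initial one: the distribution narrows as each model inherits and amplifies its predecessor's sampling error, which is the statistical shape of the tail-eating the researchers warn about.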

Ilia Shumailov, who was the paper’s lead author while a researcher at Oxford University, tells me that the main takeaway from the research is that the rate of development in generative AI is likely to slow as high-quality data becomes more scarce. “The main premise of the paper is that the systems we are currently building will degrade,” he says.

The research company Epoch AI estimates that there are currently 300tn tokens (small units of data) of human-generated public text good enough to be used for training purposes. According to its forecasts, that stock of data might be exhausted by 2028. Then, there will not be enough fresh high-quality human-generated data to feed into the hopper and an over-reliance on synthetic data may become problematic, as the Nature paper suggests.

That does not mean that existing models mostly trained on human-generated data will become useless. Despite their hallucinatory habits, they can still be applied to myriad uses. Indeed, researchers say there may be a first-mover advantage for early LLMs trained on unpolluted data that is now unavailable to next-generation models. Logic suggests that this will also increase the value of fresh, private, human-generated data — publishers take note.

The theoretical dangers of model collapse have been discussed for years and researchers still argue that the discriminate use of synthetic data can be invaluable. Even so, it is clear that AI researchers will have to spend much more time and money on scrubbing their data. One company exploring the best ways of doing so is Hugging Face, the collaborative machine learning platform used by the research community. 

Hugging Face has been creating highly curated training sets including synthetic data. It has also been focusing on small language models in specific domains, such as medicine and science, that are easier to control. “Most researchers despise cleaning the data. But you have to eat your vegetables. At some point, everyone has to dedicate their time to it,” says Anton Lozhkov, a machine learning engineer at Hugging Face.

Although the limitations of generative AI models are becoming more apparent, they are unlikely to derail the AI revolution. Indeed, there may now be renewed focus on adjacent AI research fields, which have been comparatively neglected of late but may lead to new advances. Some generative AI researchers are particularly intrigued by the progress made in embodied AI, as in robots and autonomous vehicles.

When I interviewed the cognitive scientist Alison Gopnik earlier this year, she suggested that it was the roboticists who were really building foundational AI: their systems were not captive on the internet but were venturing into the real world, extracting information from their interactions and adapting their responses as a result.

“That’s the route you’d need to take if you were really trying to design something that was genuinely intelligent,” she suggested.

After all, as Gopnik pointed out, that was exactly how biological intelligence originally emerged from the primordial swamp. Our latest generative AI models may captivate us with their capabilities. But they still have much to learn from the evolution of the most primitive worms and sponges more than half a billion years ago.


