Updater
February 06, 2025, in technology

Can GenAI improve indefinitely?

For a time in the world of GenAI, bigger meant better – often spectacularly so. But development is beginning to run up against a number of limits.



  • STOP PRESS: At the end of January, Chinese developer DeepSeek released a GenAI model delivering performance comparable to current 'foundation' models, but apparently requiring a small fraction of the 'compute' and energy resources. If confirmed, the release indicates that GenAI models may be trained effectively using significantly less hardware and computing time. It remains to be seen, however, whether the model reduces the quantity of training data required or mitigates the theoretical limitations of the GenAI approach itself.

Many observers believe that current Generative AI (GenAI) models are beginning to hit a wall regarding scalability. Some of the limits are resource constraints: the scarcity of available space at data centers and the power to keep these models running. At the same time there is also the very real problem of a lack of quality data to train the models on. And, looking to the longer term, critics point to intrinsic limitations in GenAI technology itself.

We take a look at how serious these obstacles are and how they might be overcome.

Reaching the end of the web?

Much of the early conversation around GenAI centered on the data used to train these models, and whether or not this was a fair use of the content. At first blush, it may seem like the internet offers unlimited amounts of data that GenAI could continue to consume forever, but as it turns out, the industry has already reached the end of the web.

Ilya Sutskever, OpenAI co-founder and former chief scientist, spoke at the Conference on Neural Information Processing Systems (NeurIPS) in December, and Observer reported that he said, “We’ve achieved peak data and there will be no more.” This is due, in part, to digital publishers who now restrict their publicly available data as a response to GenAI using the publishers’ work to train models and then compete with the publishers and other creators.

A Data Provenance report, Consent in Crisis, says, “We find a rapid proliferation of restrictions on web crawlers associated with AI development in both websites’ robots.txt and Terms of Service. We estimate, in one year (2023-04 to 2024-04), ~25%+ of tokens from the most critical domains … have since become restricted by robots.txt”. In other words, the humans who do the work of generating “data” are revolting.
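The mechanics of these restrictions are simple: a site's robots.txt can single out AI crawlers by user-agent name while leaving the rest of the web open. A minimal sketch using Python's standard-library robotparser (the robots.txt rules and URLs here are hypothetical examples, not taken from any particular publisher):

```python
from urllib import robotparser

# A hypothetical robots.txt of the kind publishers now serve to block
# AI training crawlers while still admitting ordinary browsers.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(robots_txt)

# The AI crawler is shut out; everyone else can still fetch the page.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # → False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))  # → True
```

Multiply that one-line `Disallow` across the web's most heavily crawled domains and the "~25%+ of tokens" figure in the report becomes easy to picture.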

According to Observer, Sutskever says this will mean pre-training models will end, and GenAI will need to use “synthetic data or models that improve responses by taking longer to think about potential answers.”

AI’s scalability problem

The initial leap in performance of GenAI tools like ChatGPT was dramatic enough to spark enormous excitement among investors, and equally extravagant expectations. But new models are simply not making the same great strides over their predecessors, leading many to observe, as 2024 drew to a close, that the industry is stagnating.

Allison Morrow from CNN took a look at the hype versus the current reality of GenAI, and the consensus seems to be that data is the lifeblood of AI, and all the computing power in the world won’t change that. Optimists say that some companies may have simply over-invested and will need to scale back — and stop promising that AI will solve all the world’s problems.

Not everyone is so measured. Quoted by Morrow, Gary Marcus, NYU professor emeritus and AI critic, said, “LLMs will not disappear, even if improvements diminish, but the economics will likely never make sense… When everyone realizes this, the financial bubble may burst quickly; even Nvidia might take a hit, when people realize the extent to which its valuation was based on a false premise.”

Of course, where there is potential profit to be made, Silicon Valley is not known for proceeding with caution, or even rationality. So it is looking for ways around its scalability problems.

Is synthetic data the way forward?

AI and synthetic data are not strangers. Synthetic data has been integral, for instance, in training autonomous driving systems, where companies lacked sufficient real-world training data. However, when it comes to GenAI, it seems that synthetic data is not likely to create bigger, more impressive systems. Rather, Observer reports, “Synthetic data generated by large models like OpenAI’s GPT-4 can potentially be used to fine-tune smaller, more specialized models, according to [Kjell Carlsson, head of AI strategy at Domino Data Lab].”

What does that look like in the real world? Carlsson’s example suggests “advertisers may use ChatGPT to generate customer profiles of middle-aged women from Minneapolis who own cars. That data can then be used to train a smaller model representing that customer segment to create targeted ads.”
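As a toy sketch of that workflow: the "profiles" below are hard-coded stand-ins for text a large model would generate, and the classifier is a minimal hand-rolled naive Bayes, not any particular vendor's pipeline. The point is only the shape of the idea: big model produces labeled synthetic examples, small cheap model learns the segment.

```python
import math
from collections import Counter, defaultdict

# Hypothetical synthetic profiles, standing in for LLM-generated text.
# Labels mark whether a profile belongs to the target customer segment.
synthetic_profiles = [
    ("minneapolis mother of two drives an suv to soccer practice", "segment"),
    ("midwestern woman in her forties commutes by car daily", "segment"),
    ("college student in miami rides a bike everywhere", "other"),
    ("retired man in phoenix who no longer drives", "other"),
]

def train_naive_bayes(examples):
    """Count words per label: a minimal multinomial naive Bayes fit."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / sum(label_counts.values()))
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

wc, lc = train_naive_bayes(synthetic_profiles)
print(classify("woman in minneapolis who drives her car", wc, lc))  # → segment
```

The small model never sees real customer data, only the large model's synthetic descriptions of the segment.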

This is hardly a world-shattering prospect. However, there may be more important ways for synthetic data to fill in the gaps when sensitive data is not available. For instance, Observer reports, AI could synthetically generate X-ray images at different angles to train AI models and ultimately help doctors identify tumors.

While this may be a solution in certain cases, there are plenty of reasons to believe synthetic data will not, in the end, provide an answer. A study published in Nature warns, “We discover that indiscriminately learning from data produced by other models causes ‘model collapse’—a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time.”
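The mechanism can be illustrated with a toy experiment (a sketch of the general idea, not the Nature study's setup): fit a simple Gaussian model to data, then train each successive "generation" only on samples drawn from the previous generation's fit. The estimated spread of the distribution drifts toward zero, so the model gradually "forgets" the variability of the original data.

```python
import random
import statistics

def collapse_demo(n=50, generations=2000, seed=42):
    """Toy 'model collapse': each generation is fit (mean, stdev)
    only to samples drawn from the previous generation's fit."""
    random.seed(seed)
    data = [random.gauss(0.0, 1.0) for _ in range(n)]  # the 'real' data
    history = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        history.append(sigma)
        # next generation trains only on synthetic samples from the fit
        data = [random.gauss(mu, sigma) for _ in range(n)]
    return history

history = collapse_demo()
print(f"spread at gen 0: {history[0]:.3f}, at gen {len(history)-1}: {history[-1]:.3g}")
```

Each refit loses a little of the true distribution's tails, and with no fresh real data to correct it, the errors compound until the model's output is far narrower than reality.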

The beginning of the dead end?

As developers continue to seek incremental improvements in performance by tweaking their models, a more fundamental question remains: is the plateauing in GenAI capabilities a sign of an intrinsic limitation in the approach itself?

We explored this possibility in an earlier post, Is GenAI a Dead End?. Long-time GenAI critic Gary Marcus certainly thinks so: in his post CONFIRMED: LLMs have indeed reached a point of diminishing returns, he indulges in an "I told you so" moment.

In the meantime: AI does useful work

While GenAI models may not yet have delivered the visionary goals their proponents have been predicting, they are certainly doing useful work, driving higher productivity in many areas where the management of natural-language content is key.

In fast-paced sectors like news publishing and financial services they are enhancing both the quantity and quality of the content produced by human authors and editors.

Find out more about Eidosmedia AI tools for authors and editors

Interested?

Find out more about Eidosmedia products and technology.

GET IN TOUCH