The paradox of LLMs: Will they be victims of their own success?
As ChatGPT's popularity grows and it overtakes the sources that supplied its original training data, will the quality of the models deteriorate?
LLMs, such as ChatGPT and GitHub Copilot, have gained significant popularity and are revolutionizing the way we access information and generate code. However, their success raises an intriguing paradox: will they ultimately become victims of their own success?
One of the key aspects of LLMs is their reliance on user-generated data for training. Platforms like Stack Overflow, Quora, WebMD, Twitter, and various forums serve as valuable sources of information for these models. But what happens if users increasingly rely on ChatGPT for their answers and stop using traditional search engines? This shift in user behavior could have a detrimental impact on the source sites; we are already witnessing a drastic decline in activity on platforms like Stack Overflow. There is a whole economy of content sites that relies on search traffic. ChatGPT might not completely kill this industry, but it is very likely that the quality and volume of content will drop significantly as users seek instant answers through ChatGPT. The question then arises: where will the data for future models come from?
As the usage of LLMs increases, there is a real possibility that the quality and diversity of user-generated data may decline. With fewer users turning to traditional search engines and source sites, the availability of quality data for training LLMs could diminish significantly.
This phenomenon extends beyond just information retrieval. Take GitHub Copilot, for instance. As users rely more on code generated by GPT models, there is a legitimate concern about the quality of the code that gets published. If developers increasingly depend on code suggestions generated by LLMs, it is reasonable to question whether the overall standard of open-source code will decline. Consequently, future LLMs may have to rely on their own generated data to train themselves, further exacerbating the potential deterioration in quality.
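To make that risk concrete, here is a toy simulation of my own (not anything OpenAI or GitHub has published) of what can happen when each generation of a model is trained only on data sampled from the previous generation. The "model" here is just a Gaussian fit to a small sample, but the pattern, often called model collapse, is the same: diversity shrinks with every round of training on self-generated data.

```python
# Toy illustration of training on self-generated data ("model collapse").
# Each generation fits a Gaussian to a small sample drawn from the previous
# generation's Gaussian, then becomes the data source for the next generation.
import numpy as np

rng = np.random.default_rng(0)

mean, std = 0.0, 1.0          # generation 0: the "real" data distribution
sample_size = 20              # each generation only sees a small sample
generations = 300

for gen in range(1, generations + 1):
    sample = rng.normal(mean, std, sample_size)   # data produced by the previous model
    mean, std = sample.mean(), sample.std()       # fit the next model on that data
    if gen % 100 == 0:
        print(f"generation {gen}: std of generated data = {std:.4f}")

# The standard deviation collapses toward zero: each generation preserves less
# of the diversity that was present in the original data.
```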
Arguably, OpenAI might not have to train GPT models again. As far as I understand, they stopped training on data collected after September 2021, as we keep seeing this response from ChatGPT: 'As of my last update in September 2021…'. It can use embeddings over indexed data to reference the latest information, but there are use cases where this approach has its own limits. For example, with GitHub Copilot, open-source libraries and frameworks evolve so fast that its suggestions are often outdated. I suspect OpenAI and other LLM providers will have to retrain frequently with the latest data. Regardless, even with an embeddings-based approach, will there be enough reliable, quality information out there to reference?
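For readers unfamiliar with the embeddings-based approach mentioned above, here is a minimal sketch of the idea (often called retrieval-augmented generation). The embed() function, the example documents, and the query are stand-ins of my own; a real system would use a trained embedding model and then place the retrieved text into the prompt.

```python
# Minimal sketch of embeddings-based retrieval: index fresh documents outside
# the model, find the ones most similar to a query, and hand them to the model
# so it can answer from current information rather than its training snapshot.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a real embedding model: hashes character
    trigrams into a fixed-size vector and normalizes it."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Illustrative "fresh" documents, e.g. up-to-date library documentation.
documents = [
    "pandas 2.x removed DataFrame.append; use pandas.concat instead.",
    "The requests library exposes Session objects for connection pooling.",
    "numpy.linalg.norm computes vector and matrix norms.",
]
index = np.stack([embed(d) for d in documents])   # one embedding per document

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k indexed documents most similar to the query."""
    scores = index @ embed(query)                 # cosine similarity (vectors are unit length)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

# The retrieved snippet would then be prepended to the model's prompt.
print(retrieve("How do I append rows to a pandas DataFrame now?"))
```

Of course, retrieval only helps if the indexed documents themselves are accurate and kept current, which is exactly the supply of quality content this post is questioning.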
Only time will tell whether and how this challenge will affect future models, but in a world where LLM- or transformer-based AGI models dominate almost all use cases, I suspect this paradox is a real risk. You have to ask: where will the actual reliable, quality information come from?