The Curse of Recursion: Training on Generated Data Makes Models Forget

In May this year, AI researchers from the United Kingdom and Canada jointly published a paper, “The Curse of Recursion: Training on Generated Data Makes Models Forget”. The paper concludes that once the human-written language data available online has been used up and models are trained on generated content instead, those models may degrade, a failure the authors call “model collapse”.

When online text, images, and videos run out, AI training will inevitably draw on content generated by AI itself. This AI-generated content may end up causing “information pollution”. Researchers liken this way of learning to trying to move forward by stepping on one foot with the other. In other words, an AI that learns from content it generated itself is like a person standing on their own foot: they cannot move forward, they can only fall.
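The feedback loop described above can be illustrated with a toy sketch. This is not the paper's experimental setup; it assumes a simple Gaussian distribution as a stand-in for a "model", and the function name `recursive_fit` and its parameters are hypothetical, chosen only for illustration. Each generation is fitted solely on samples drawn from the previous generation's fit, so no fresh real data ever enters the loop.

```python
import random
import statistics


def recursive_fit(generations=200, sample_size=20, seed=0):
    """Repeatedly refit a Gaussian on samples drawn from its own previous fit.

    Toy stand-in for training on generated data (an assumption for
    illustration, not the method used in the paper).
    Returns a list of (mean, std) pairs, one per generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # The current model generates the next generation's training data.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # The next model is fitted only on that generated data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    history = recursive_fit()
    for gen in (0, 50, 100, 200):
        mean, std = history[gen]
        print(f"generation {gen}: mean={mean:.4f}, std={std:.6f}")
```

In runs of this kind, the fitted spread tends to shrink over the generations and the mean wanders away from its starting value, because each round's estimation error is baked into the next round's data. The rare values in the tails of the original distribution are the first to disappear.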

AI data pollution is not a distant issue; it has already started. A few months ago, someone asked an AI app for travel advice about Elephant Trunk Hill and received a pile of incorrect information. The answer was marked as coming from a highly upvoted response on Zhihu. Following this clue, it turned out that the answer on Zhihu had been generated by ChatGPT, and that the account behind it had answered far more than this one question. It appeared to be answering about two questions per minute, non-stop throughout the day. Zhihu later permanently banned the account.

Similarly, as early as February this year, the well-known American science fiction magazine “Clarkesworld” announced a temporary suspension of submissions. The reason was that its editors had received a large number of stories generated by ChatGPT.

So far, AI data pollution has occurred only sporadically. Of course, where there is a problem, there is an opportunity. The opportunity here is that original content posted by real people will become increasingly valuable.

In May this year, Elon Musk announced that Twitter would stop providing free APIs, i.e., data interfaces for developers. All institutions using Twitter data are required to pay a usage fee of $42,000 per month. If they do not pay, Twitter will require them to delete the data they have already downloaded. At the same time, registered users who are not verified can view only 1,000 tweets and comments per day.

Reddit, the largest forum in the United States, also announced that from June 19, all developers who want to access its data will have to pay, and the price is high.

Google might be the biggest winner in the future of AI development. This is not because of advanced technology, but because Google owns YouTube, one of the world’s largest video platforms. Video is data; it is language material. In the future, the most valuable thing in AI will be language data.