An upgraded version of the R1 reasoning model from the Chinese lab DeepSeek was revealed last week, and it performs well on several coding and math benchmarks. The company did not disclose the source of its training data, but some AI researchers surmise that at least part of it came from Google's Gemini family of AI models.
Sam Paech, a Melbourne-based developer who builds "emotional intelligence" evaluations for AI, published what he says is evidence that DeepSeek's latest model was trained on Gemini outputs. In an X post, Paech noted that DeepSeek's model, known as R1-0528, favours words and expressions similar to those favoured by Google's Gemini 2.5 Pro.
That is not a smoking gun, though. Another developer, the pseudonymous creator of SpeechMap, a "free speech eval" for AI, observed that the DeepSeek model's traces, the "thoughts" the model generates as it works toward an answer, "read like Gemini traces."
DeepSeek has faced accusations of training on data from rival AI models before. In December, developers noticed that DeepSeek's V3 model frequently identified itself as ChatGPT, OpenAI's AI-powered chatbot platform, suggesting that ChatGPT chat logs may have been used in its training.
Earlier this year, OpenAI told the Financial Times it had found evidence linking DeepSeek to the use of distillation, a technique for training AI models on the outputs of larger, more capable ones. According to Bloomberg, Microsoft, a close OpenAI partner and investor, discovered in late 2024 that large volumes of data were being exfiltrated through OpenAI developer accounts that OpenAI believes are connected to DeepSeek.
Distillation is a common practice, but OpenAI's terms of service prohibit customers from using its model outputs to build competing AI.
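For readers unfamiliar with the term, here is a minimal sketch of what API-based distillation can look like in practice. Everything in it is illustrative: the prompts, the teacher model name, and the output file are placeholders, and nothing here reflects DeepSeek's actual pipeline.

```python
# Minimal distillation sketch: collect a larger "teacher" model's responses
# to prompts, then save the pairs as supervised fine-tuning data for a
# smaller "student" model. Teacher model and prompts are placeholders.
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a Python function that reverses a linked list.",
]

with open("distill_data.jsonl", "w") as f:
    for prompt in prompts:
        # Ask the teacher model for an answer.
        resp = client.chat.completions.create(
            model="gpt-4o",  # hypothetical teacher; any strong API model
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content
        # Store the (prompt, answer) pair; a student model is later
        # fine-tuned on this file with an ordinary SFT training loop.
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```

Run at scale, a loop like this cheaply converts API access to a strong model into a training set for a rival one, which is exactly why providers restrict it in their terms of service.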
To be clear, many models misidentify themselves and converge on the same words and turns of phrase. That is because the open web, where AI companies source most of their training data, is increasingly littered with AI slop: bots are flooding Reddit and X, and content farms are using AI to churn out clickbait.
This "contamination," if you will, has made it quite difficult to thoroughly filter AI outputs out of training datasets.
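One common, if crude, mitigation is keyword filtering: dropping web documents that contain telltale chatbot phrases. The toy sketch below, with made-up marker strings and documents, shows the idea and also why it falls short, since most AI-generated text carries no such obvious fingerprint.

```python
# Toy heuristic for scrubbing obvious AI-generated text from a web crawl:
# discard documents containing telltale assistant phrases. Real pipelines
# use far more sophisticated classifiers, and even those miss a lot.
SLOP_MARKERS = [
    "as an ai language model",
    "i cannot assist with that",
    "i'm chatgpt",
]

def looks_like_ai_slop(doc: str) -> bool:
    lowered = doc.lower()
    return any(marker in lowered for marker in SLOP_MARKERS)

corpus = [
    "The mitochondria is the powerhouse of the cell.",
    "As an AI language model, I cannot browse the internet.",
]
clean = [doc for doc in corpus if not looks_like_ai_slop(doc)]
print(clean)  # only the first document survives the filter
```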
Even so, AI experts such as Nathan Lambert, a researcher at the nonprofit AI research institute AI2, think it is plausible that DeepSeek trained on data from Google's Gemini.
"If I were DeepSeek, I would definitely create a ton of synthetic data from the best API model out there," Lambert wrote in a post on X. "[DeepSeek] has plenty of money and few GPUs. For them, it's literally more compute."
Partly to prevent distillation, AI companies have been tightening their security measures.
In April, OpenAI began requiring organizations to complete an ID verification process to access certain advanced models. The process requires a government-issued ID from one of the countries OpenAI's API supports; China is not on that list.
In a similar move, Google recently began "summarizing" the traces generated by models available through its AI Studio developer platform, making it harder to train performant rival models on Gemini traces. Anthropic announced in May that it would start summarizing its own models' traces, citing a need to protect its "competitive advantages."
Google has been contacted for comment; if we hear back, we’ll update this article.