DeepSeek R1 came out on January 20, 2025, and triggered strong reactions around the world. Nvidia stock dropped more than 17% in one day, the largest single-day loss of market value by any company in American history. The panic seemed to be sparked by DeepSeek’s claim that they trained R1 on last-generation GPUs for only $5.6 million. Was the market for ever-faster GPUs, and for ever-growing, energy-consuming, expensive compute, about to dry up?
The drama didn’t end there. The next day, OpenAI accused DeepSeek of trying to “replicate” their model by building a massive dataset of GPT-4o responses to use as training data – a process known as model distillation, which OpenAI’s spokesperson said “inappropriately” violated their terms of service. Some people thought it was karma that OpenAI had its data taken when it (allegedly) took training data to begin with. But what is distillation? Does it use the same “data” GPT-4o was trained on? Is it really “stealing”? And whether it is or not, can OpenAI enforce its terms of service?
CITP scholars discussed these questions all week, including at a special session of our Tech and Society reading group on Tuesday, February 4, 2025.
Is the Era of Expensive Hardware and Compute Over? Infrastructure vs. Deployment
CITP scholars were skeptical, reserving judgment until more independent tests emerge.
Assistant Professor Manoel Horta Ribeiro highlights a crucial distinction that many might be overlooking: the difference between having the infrastructure to deploy large language models (LLMs) at scale versus having the infrastructure to train them. Even if R1 was trained on de minimis hardware, deploying it could mean hundreds of thousands of users interacting with it remotely. This computation would be most efficient in large server farms with the latest chips. Manoel points out, “Part of the reason why these tech companies are building infrastructure is also for inference (using the model, not training it), which should be kind of orthogonal to the DeepSeek breakthroughs.”
It remains to be seen how efficient a model DeepSeek can make, and how small a GPU it can run on. Assistant Professor of Computer Science Peter Henderson agrees with Ribeiro, emphasizing the importance of understanding the hardware requirements for inference. He states, “The optimal chipset for inference will evolve as more specialized hardware and optimization techniques are developed. But, for now, NVIDIA chips remain essential for both training runs and deploying models to serve customers at scale.” And although DeepSeek’s smaller, distilled models can be run on hardware as modest as a Raspberry Pi, CITP postdoctoral researcher Dominik Stammbach remains skeptical about the speed and whether this makes for a good user experience.
This discussion led to the question posed by Assistant Professor of Computer Science Andrés Monroy-Hernández: “Is there any reason to be skeptical of the claim that they spent only ~$6 million?” Sure enough, reports soon emerged that DeepSeek’s total cost to get to R1 was orders of magnitude higher, at perhaps $1.6 billion.
Still, Mihir Kshirsagar, CITP Clinic Lead, noted that there was a prevailing belief that building capable AI models would cost well over the amount claimed by DeepSeek. He comments, “There was a ‘sense’ that building capable models would cost north of $100m and therefore would only be available to hyperscalers. This development calls that received wisdom into question.”
Data Theft Allegations, or What’s Good For the Goose…
The controversy over whether DeepSeek’s data was stolen from OpenAI won’t be resolved any time soon, as we wait for regulators to act and courts to decide. But there are serious doubts under current law whether OpenAI ever “stole” anything, or ever had anything “stolen” from it.
Peter Henderson and his colleagues have worked on all aspects of this issue. First, there’s the question of whether OpenAI violated copyright in scraping training data. The U.S. copyright “fair use” defense, they argue, might allow OpenAI to train on copyrighted data. However, if its models generate outputs substantially similar to the data they were trained on, it becomes harder to argue fair use. But there’s also the question of whether OpenAI can claim some sort of legal control over the outputs of its model (the outputs DeepSeek used to train V3, a precursor to R1). That’s a stretch, as Lemley and Henderson argue in this paper. The U.S. Copyright Office has recently agreed, throwing cold water on the idea that model outputs are copyrightable unless there was a “sufficient level of human contribution” to the output. That would put any copyright closer within reach of the person prompting the model than of the company that built it.
Does this mean DeepSeek was free to use OpenAI’s model outputs? Ribeiro explains, “The technique DeepSeek (probably) used is called ‘model distillation.’ It is fairly common, and one way of doing it involves generating data with a stronger model (e.g., OpenAI’s o1) and using it to train a weaker model. OpenAI forbids using the model this way, but it is unclear whether this matters at all.”
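To make the mechanics of distillation concrete, here is a minimal sketch of that recipe in Python, assuming an OpenAI-style API for the “teacher” and a small open-weights “student” fine-tuned with Hugging Face Transformers. The model names, prompts, and hyperparameters are illustrative stand-ins, not DeepSeek’s actual pipeline.

```python
# Minimal sketch of sequence-level model distillation:
# (1) query a stronger "teacher" model for responses,
# (2) fine-tune a smaller "student" model on those prompt-response pairs.

from openai import OpenAI
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Step 1: build a synthetic dataset from the teacher's outputs ---
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompts = ["Explain why the sky is blue.", "Prove that sqrt(2) is irrational."]
pairs = []
for p in prompts:
    resp = client.chat.completions.create(
        model="gpt-4o",  # the "teacher"; any stronger model would do
        messages=[{"role": "user", "content": p}],
    )
    pairs.append((p, resp.choices[0].message.content))

# --- Step 2: supervised fine-tuning of the student on the teacher's outputs ---
student_name = "Qwen/Qwen2.5-0.5B"  # illustrative small open model
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for prompt, answer in pairs:
    text = prompt + "\n" + answer + tok.eos_token
    batch = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    out = student(**batch, labels=batch["input_ids"])  # standard next-token loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

A real distillation run would use far more prompts, batching, and careful filtering of the teacher’s responses, but the basic idea is just this: the student never sees the teacher’s weights or original training data, only its generated text.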
CITP held a special deep dive on DeepSeek, dedicating a Tuesday session of the Technology and Society reading group (Tech Soc) to reading DeepSeek’s research papers and understanding their training innovations. Postdoctoral researcher Dominik Stammbach walked others through the process – from initial training data, to Group Relative Policy Optimization, to Supervised Fine Tuning. Asked for his judgment on DeepSeek’s training data, Stammbach expressed frustration about the lack of detail on this point in the otherwise rather comprehensive tech report released by DeepSeek. “At this point, we can only speculate what the model has been trained on, but I wouldn’t be surprised if some of the training data for DeepSeek V3 or R1, for example, long chains of thought, were actually distilled from proprietary language models. The legal consequences of this, and whether this changes how consumers and researchers can interact with such models, remain to be seen.”
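For readers curious what Group Relative Policy Optimization (GRPO) changes relative to standard RLHF: rather than training a separate critic (value) model, GRPO samples a group of completions for each prompt, scores them, and uses each completion’s reward relative to the group as its advantage. Below is a minimal sketch of that advantage computation with made-up reward values; it omits the clipped policy-gradient update and KL penalty that complete the algorithm.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages for one group of sampled completions.

    Each completion's advantage is its reward relative to the group mean,
    normalized by the group's standard deviation, so no learned critic
    is needed as a baseline.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + 1e-8)  # epsilon avoids division by zero

# Illustrative example: four completions sampled for the same prompt,
# scored by a rule-based or learned reward function.
rewards = np.array([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))
# Completions scoring above the group mean get positive advantages and are
# reinforced in the policy update; below-average ones are pushed down.
```

Dropping the critic is part of what makes this kind of reinforcement learning cheaper to run at scale, which connects directly to the cost discussion above.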
Implications for AI Companies
Manoel summarizes his perspective, “Overall, this could be a systematic problem for companies whose strategy is to train large frontier models and make them available to the general public (e.g., OpenAI). If people can always ‘distill’ a model similar to the large one, this could allow for a substantial ‘second mover advantage.’ Companies that move second could get ‘good enough’ models for a fraction of the cost.”
Inyoung Cheong, Postdoctoral Research Associate at CITP, points out, “OpenAI has argued that they used publicly available data on the internet, while DeepSeek has allegedly distilled information, which could constitute a violation of OpenAI’s Terms of Service. But the New York Times has claimed that OpenAI collected and used their proprietary articles, so maybe the distinction between publicly available vs. proprietary is not very convincing here.” Inyoung’s own research works to bridge law and computer science, focusing on establishing guiding principles for responsible AI systems.
She goes on to say, “But definitely I 100% disagree with DeepSeek’s Privacy Policy and their potential misuse of personal data. I think it could have resonated more if Sam Altman had reacted with ‘I hate that they utilized GPT to develop exploitative data practices,’ instead of ‘You stole my property.’”
TechTakes is a series where we ask members of the CITP community to comment on tech and tech policy-related news. TechTakes is moderated by Steven Kelts, CITP Associated Faculty and lecturer in the Princeton School of Public and International Affairs (SPIA), and Lydia Owens, CITP Outreach and Programming Coordinator.