China’s DeepSeek Adds a Weird New Data Point to The AI Race
V3 and R1 are Impressive Work, With Many Implications – but not "China Has Caught Up"
[If you’ve already read a lot about r1, you might want to skip halfway down, to “A Perfect Storm Led To DeepSeek’s Apparent Upset”. The first half of this post is a recap of the known facts; the second half includes some analysis you may not have seen.]
In the last week, there’s been a crazy level of buzz around a new AI model released by DeepSeek, a Chinese company that has been doing impressive work recently.
The new model is DeepSeek-R1 – “r1” for short. It’s notable for having reasoning capabilities which seem somewhat comparable to OpenAI’s pace-setting o1, despite coming from a much smaller organization (reportedly 100-150 engineers) with correspondingly smaller resources.
People seem to be focusing on the implications for various horse races: “China is catching up with the US”, “open models are going to stay in the race”. But r1 isn’t just an update in the race to AGI; it’s a weird update. For the last few years, the story was that progress in AI was about ever-increasing scale; there was serious talk that by 2027, companies might spend 100 billion dollars on a single training run. Then OpenAI upped the ante with September’s announcement of the o1 “reasoning” model, which extends those giant training runs with new techniques that emerged from years of work by elite researchers.
If state-of-the-art AI requires multi-billion dollar budgets and enormously ambitious multiyear R&D projects, how does a scrappy upstart catch up in a matter of months?
Before diving into that, let’s review what we actually know about r1.
Why R1 is a Big Deal
R1 is based on another model, DeepSeek-V3 (or “v3”), released last month. V3 is a “conventional” large language model (LLM), trained to imitate human writing, without any special reasoning capabilities. In other words, it was an attempt to catch up with GPT-4. The impressive thing is that it largely succeeded: v3 leapfrogged Meta’s Llama-3.3 to become the leading open model, and is not terribly far behind the best closed models.
V3 was widely acknowledged as an impressive accomplishment, especially given that it was trained on a relatively low budget (see below). However, the buzz was muted by the fact that the field had already moved on: the cutting edge is now OpenAI’s new line of models – o1, o3, and variants – that are specially trained to be able to reason their way through complex tasks. Then DeepSeek dropped the real bombshell, announcing r1, which – like o1 – is a “reasoning model”.
The early consensus is that r1 is (mostly) less capable than o1, but that for many tasks, its reasoning abilities put it well ahead of any model other than o1/o3 [1]. Zvi Mowshowitz summarized reactions as follows:
Taking into account everything I’ve seen, r1 is still a notch below o1 in terms of quality of output, and further behind o1 Pro and the future o3-mini and o3. But it is a highly legitimate reasoning model … The vibes on r1 are very good.
r1 does have limitations. It’s not optimized for software engineering. It’s not optimized for answering follow-up questions (“multi-turn conversations”). There are limits on the “context window”, meaning that you can’t give it very large inputs (such as a long technical paper or a big pile of documents) or expect it to think for a long time. And it doesn’t support certain important technical features people have come to expect from polished models from vendors like OpenAI, such as “function calling” and “JSON output”.
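(For the unfamiliar: “JSON output” constrains the model to emit machine-parseable JSON, and “function calling” lets the model ask your code to invoke tools you’ve declared. Here’s a minimal sketch of both, using OpenAI’s Python SDK; the get_weather tool is a made-up example for illustration, and at launch r1’s API offered neither feature.)

```python
# Sketch of the two API features mentioned above, via the OpenAI Python SDK.
# The "get_weather" tool is hypothetical; running this requires an API key.
from openai import OpenAI

client = OpenAI()

# "JSON output": the model is constrained to return valid JSON.
resp = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": "List three colors as a JSON object."}],
)
print(resp.choices[0].message.content)

# "Function calling": the model can request that your code run a declared tool.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)
print(resp.choices[0].message.tool_calls)  # the structured tool invocation
```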
So, r1 is a very capable reasoning model, but less capable than o1. Here is why people are so excited about it.
Like v3, it was developed on a comparatively small budget, suggesting that the future of AI may not depend on ever-increasing budgets (and hence may not be controlled by a few giant tech companies). Again, more on this below.
DeepSeek released the model weights (commonly, if incorrectly, described as the model being “open source”). This means that anyone can run the model on their own hardware; there are no restrictions on how it can be used; anyone can “fine tune” the model to adjust its behavior or train new capabilities; and researchers can study and build on the model. r1 is now the most powerful open model, by a fair margin. DeepSeek also published a technical paper describing many of the techniques they used to create it (though there are plenty of details they haven’t shared). Their papers on the development of v3 and r1 introduce a number of major innovations to the open literature.
It’s 30 times cheaper than o1. There are services which offer access to r1 at about 1/30 the price OpenAI charges for o1. And if you run r1 on your own hardware, it’s free (setting aside the cost of the hardware and its operating expenses).
DeepSeek released smaller versions that can run on a personal computer, and should be very cheap to operate, though less capable than the full model.
You can see the model’s chain of thought. The whole point of these new “reasoning models” like o1 and r1 is that they will think through a problem before coming up with an answer. OpenAI only lets you see o1’s final answer – all you see of the chain of thought is a brief, possibly untrustworthy summary. With r1, you get the full chain of thought. People are finding this fascinating to examine, and sometimes get more value out of the reasoning chain than the final answer. Access to the raw chain of thought also makes it easier to see where the model goes wrong and coax better answers from it.
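To make this concrete: the small distilled variants mentioned above can be run on a single machine with the Hugging Face transformers library, and the chain of thought arrives inline in the output, wrapped in <think>…</think> tags. A rough sketch (the checkpoint name is one of DeepSeek’s published distills; treat the details as illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # smallest published distill
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = [{"role": "user", "content": "What is the sum of the first 10 odd numbers?"}]
inputs = tokenizer.apply_chat_template(
    prompt, add_generation_prompt=True, return_tensors="pt")
output = tokenizer.decode(model.generate(inputs, max_new_tokens=1024)[0])

# The reasoning precedes the closing </think> tag; the final answer follows it.
if "</think>" in output:
    reasoning, answer = output.split("</think>", 1)
    print("CHAIN OF THOUGHT:", reasoning)
    print("FINAL ANSWER:", answer)
```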
Perhaps the most interesting thing about r1 is that it was created by a relatively small team on a relatively small budget. Let’s review what we know about that.
What Went Into Creating R1?
To build o1, OpenAI started with a conventional LLM (perhaps GPT-4o), and applied additional training to enable it to carry out long chains of error-free reasoning. DeepSeek did the same thing, starting with v3 and training it up to create r1. So first we should look into the cost of creating GPT-4o and v3.
It’s not known how much OpenAI spends training their models, but by the time they released GPT-4, they had raised billions of dollars. v3 is probably superior to the original GPT-4 in most ways, and was reported to have been trained for the jaw-droppingly low figure of $5.5 million. However, Nathan Lambert published a nice analysis explaining that v3’s true all-in cost was much higher. Also, DeepSeek’s job was substantially easier than OpenAI’s: the state of the art has advanced dramatically since the March 2023 launch of GPT-4, and imitation is always easier than trailblazing [2]. Even so, DeepSeek has a much smaller engineering team and compute budget than the major AI labs, and their ability to produce a competitive model like v3 is impressive.
That brings us up to the point where OpenAI and DeepSeek had to start training their models to reason. Both companies used “reinforcement learning”, which basically means that they gave the model millions of reasoning tasks, checked its work on each one, and tweaked it to reinforce the behaviors that led to correct outputs. To avoid the need for human evaluators, training uses tasks (such as math problems) where a computer program can automatically determine whether an answer is correct.
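To make the shape of that loop concrete, here’s a toy sketch in Python. It is emphatically not DeepSeek’s or OpenAI’s actual code (the “model” is a trivial stand-in, and real systems update weights with gradient-based methods such as DeepSeek’s GRPO), but it shows the key property: the grader is a program, so no human needs to be in the loop.

```python
import random

# Toy "reasoning tasks": arithmetic problems a program can grade automatically.
TASKS = [{"question": (a, b), "expected": a + b}
         for a in range(10) for b in range(10)]

class ToyModel:
    """A trivial stand-in for an LLM: answers a+b, with a learned offset."""
    def __init__(self):
        self.bias = random.randint(1, 3)  # the "weights" we'll train away

    def generate(self, question):
        a, b = question
        return a + b + self.bias          # its answer, possibly wrong

    def reinforce(self, mean_reward):
        # Crude stand-in for a gradient step: if rewards are low, adjust.
        if mean_reward < 1.0 and self.bias > 0:
            self.bias -= 1

def verify(task, answer):
    """The automatic checker: reward 1.0 for a correct answer, else 0.0."""
    return 1.0 if answer == task["expected"] else 0.0

model = ToyModel()
for step in range(5):
    batch = random.sample(TASKS, 8)
    rewards = [verify(t, model.generate(t["question"])) for t in batch]
    mean = sum(rewards) / len(rewards)
    model.reinforce(mean)                 # reinforce behaviors that scored well
    print(f"step {step}: mean reward {mean:.2f}")
```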
Training a reasoning model in this fashion is a complicated process. You need to work out a detailed approach for how to do the reinforcement learning, and build software to carry out that plan. Next, you need to test your approach; to save time and money, you do this by training a smaller model, and comparing its performance to other small models. You then enter a process of experimentation, training variations on the small model in search of better performance. Eventually you’re ready to train the full-scale model, a task with a lot of moving parts… and you really hope it works out on the first try, because the full training run will take months and cost a lot of money.
OpenAI has been working on the techniques behind o1 for years. This took place behind the scenes, but there were occasional leaks, associated with names like “Q*” and “Strawberry”. The project finally emerged into the light in mid-September, when OpenAI announced o1. I’m not aware of any information as to when DeepSeek began work on r1, but it’s a reasonable guess that they didn’t get down to work in earnest until some time after that announcement. That would limit the time frame to at most 4 months, and possibly less [3].
Zvi estimates that the true cost to develop r1 (once DeepSeek had finished its predecessor v3) was in the low millions, possibly as much as $10M or so. We don’t know what OpenAI spent to develop o1, but given that it’s their flagship offering and they spent years working up to it, a full accounting must run to hundreds of millions. As a point of comparison, my understanding is that OpenAI spent well over $1M just to evaluate their newer o3 model on a single benchmark [4]. So we can guess that OpenAI spent at least an order of magnitude more to develop o1 than DeepSeek spent on r1.
To recap: DeepSeek trained v3 using far less money than OpenAI spent training GPT-4. This is noteworthy, but can perhaps be explained by general advances in algorithms, the fact that it’s easier to follow than to lead, and DeepSeek having a sharp team. But then they followed up with r1, outright leapfrogging most of the industry. How can we explain that?
It’s Difficult to Fit a Coherent Story to These Facts
DeepSeek seems to have developed a near-peer model to o1 in a matter of months, on a comparative shoestring, and using substantially simpler algorithms. Many observers take this as evidence that it’s not very hard to add reasoning capabilities to an LLM. But if that’s so, why didn’t anyone do it sooner?
We know OpenAI was working for years on the research path that led to o1. Why did it take them so long?
Rumors of OpenAI’s work on reasoning have been circulating for a long time, and the ideas behind it are not entirely new. Why didn’t Anthropic, Google, Meta, Elon Musk’s xAI, or any of the larger Chinese labs get into the game sooner? Not to mention various smaller but still well-funded players, such as Mistral, who would benefit hugely from the attention they would have attracted by getting out in front with a reasoning model.
It seems obvious that most of these companies would love to trade positions with DeepSeek right now, excepting only OpenAI (ahead with o1 / o3) and possibly Google and Anthropic (both of whom have things in the works; also Anthropic might have deliberately chosen to not kick off this segment of the race). Meta in particular has been entirely upstaged, and according to an article at The Information, researchers there are “in panic mode”:
Leaders including AI infrastructure director Mathew Oldham have told numerous colleagues they are concerned that the next version of Meta’s flagship AI, Llama, won’t perform as well as the Chinese AI, DeepSeek, according to two Meta employees with direct knowledge of efforts to catch up.
…
Researchers at OpenAI, Meta and other top developers have been scrutinizing the DeepSeek model to see what they could learn from it, including how it manages to run more cheaply and efficiently than some American-made models. … Meta, meanwhile, has set up several war rooms, or specialized groups of researchers, to dissect DeepSeek and use the insights to improve Llama, the employees said.
An earlier anonymous post, allegedly from a Meta employee, suggests that this is down to Meta’s AI team being poorly organized (lightly edited for clarity):
Meta generative AI organization in panic mode
It started with DeepSeek v3, which rendered the [not yet released] Llama 4 already behind in benchmarks. Adding insult to injury was the "unknown Chinese company with 5.5 million training budget".
Engineers are moving frantically to dissect DeepSeek and copy anything and everything we can from it. I'm not even exaggerating.
Management is worried about justifying the massive cost of our generative ai organization. How would they face the leadership when every single "leader" of the org is making more than what it cost to train deepseek v3 entirely, and we have dozens of such "leaders".
DeepSeek r1 made things even scarier. I can't reveal confidential info but it'll be soon public anyways.
It should have been an engineering focused small org but since a bunch of people wanted to join the impact grab and artificially inflate hiring in the org, everyone loses.
If it’s not fundamentally difficult to develop a model like r1, why didn’t anyone do it sooner? If it is difficult, how did DeepSeek pull it off in a hurry with a meager compute budget?
A recent post by Dario Amodei (CEO of Anthropic) suggests that training an r1-caliber reasoning model is not all that difficult:
R1 … is much less interesting from an innovation or engineering perspective than V3. It adds the second phase of training — reinforcement learning, described in #3 in the previous section — and essentially replicates what OpenAI has done with o1 (they appear to be at similar scale with similar results). However, because we are on the early part of the scaling curve, it’s possible for several companies to produce models of this type, as long as they’re starting from a strong pretrained model. Producing R1 given V3 was probably very cheap. We’re therefore at an interesting “crossover point”, where it is temporarily the case that several companies can produce good reasoning models. This will rapidly cease to be true as everyone moves further up the scaling curve on these models.
This still leaves me confused – if adding o1/r1-caliber reasoning capability to a strong conventional model is not all that difficult, why didn’t it happen sooner?
A Perfect Storm Led To DeepSeek’s Apparent Upset
There’s no single factor that can explain how DeepSeek pulled off this feat. But as I researched this post, many potential contributing factors emerged, and I think the truth must be a combination of them. Taken together, the picture that emerges is that DeepSeek was more motivated to rush to market than the leading labs, executed very well, benefited from a combination of favorable circumstances, and produced a model that looks even better than it actually is.
Here are the potential contributing factors.
DeepSeek is a sharp, focused team. As Dean Ball put it,
Part of the reason DeepSeek looks so impressive (apart from just being impressive!) is that they are among the only truly cracked teams releasing detailed frontier AI research.
They have a team of strong engineers, without a lot of product distractions. A lot has been written about their distinctive corporate culture [5], hiring young talent without worrying about conventional credentials. And they’ve been working on this for a while: in a private communication, researcher Dan Hendrycks pointed out that “making LLMs reason through [reinforcement learning]” has been a clear direction for some time, and that papers published by DeepSeek in 2024 show they were pursuing this course.
They’ve genuinely advanced the state of the art. The technical papers they’ve released in recent months identify a number of significant advances in model architecture, which drive down costs for both training and using models (aka “inference”). I had been wondering whether the big US labs might already have privately made similar advances, but Dario Amodei’s post suggests not:
DeepSeek's team did this via some genuine and impressive innovations, mostly focused on engineering efficiency. There were particularly innovative improvements in the management of an aspect called the "Key-Value cache", and in enabling a method called "mixture of experts" to be pushed further than it had before.
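To unpack the “Key-Value cache” Amodei mentions: when a transformer generates its t-th token, attention has to consult the keys and values of every earlier token, and caching them means each new token costs work proportional to the prefix length rather than recomputing everything from scratch. Here’s a toy illustration (my own sketch, not DeepSeek’s design; their actual innovation, “multi-head latent attention”, compresses this cache, which this toy doesn’t attempt):

```python
import numpy as np

d = 8                      # embedding dimension (tiny, for illustration)
k_cache, v_cache = [], []  # the KV cache: one entry appended per token

def decode_step(query, key, value):
    """One generation step: cache this token's K/V, attend over the prefix."""
    k_cache.append(key)
    v_cache.append(value)
    K = np.stack(k_cache)              # (t, d): keys for all tokens so far
    V = np.stack(v_cache)              # (t, d): values for all tokens so far
    scores = K @ query / np.sqrt(d)    # (t,): similarity to each prior token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over the prefix
    return weights @ V                 # (d,): attention output for this step

rng = np.random.default_rng(0)
for t in range(5):                     # pretend to generate 5 tokens
    q, k, v = rng.standard_normal((3, d))
    out = decode_step(q, k, v)         # reuses cached K/V from earlier steps
    print(f"token {t}: cache holds {len(k_cache)} entries")
```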
More details on DeepSeek’s efficiency advances here. Meanwhile, the big players are spending significant energy on other things, such as support for images, voice input and output, allowing models to search the web, and building various flavors of “agents”.
R1 probably isn’t quite as good as initial evaluations suggest. It’s unquestionably good enough that anyone would have been excited to launch it last year. But people’s opinions of its output are boosted by factors unrelated to its fundamental capability / “intelligence”. In particular, it has a more engaging, less corporate-bland style than o1, and people really appreciate getting to see the chain of thought. Also, its reasoning capabilities may be limited to more narrow domains than o1 (Dean: “I personally find r1 to be quite subpar for my use cases of o1”, referring to legal and policy questions). And anecdotes of r1 looking impressive may travel faster than anecdotes of it looking like a dud.
To some extent, DeepSeek may have plucked “low-hanging fruit” here. From a ChinaTalk report (quoted at more length below):
From the perspectives of explorers and chasers, small companies with limited GPUs must prioritize efficiency, whereas large companies focus on achieving models as quickly as possible. Methods that improve efficiency on a 2,000-GPU cluster may not work effectively on a 10,000-GPU cluster, where stability becomes a higher priority.
In other words, DeepSeek may have found shortcuts that allow them to make something almost as good as o1 more quickly and cheaply, but which might not serve as well when trying to truly push the cutting edge.
DeepSeek’s vibes are being boosted by factors other than their actual technical achievement. People are excited by the release of the model weights, as well as the innovations detailed in their technical papers. Those papers look especially impressive because they shared the sort of meaty details that the big American labs keep hidden. And people may simply be rooting for the underdog.
The big labs were in less of a hurry. Dan also said “It’s safe to assume almost everyone has an o1-level model at least internally” (evidenced by Google’s “Gemini 2.0 Flash Thinking Experimental”, which is presumed to be distilled from a larger, more capable unreleased model), and “since [the] rate of improvement is so high and since compute costs are higher just waiting another month or two brings down costs and increases performance a lot”. Dean reports “the rumor is that both Google and Anthropic have things internally that are as good or better than o3”. Possible explanations for the big labs holding back are that they want to release a more polished product, or just general bureaucracy. And Anthropic in particular may not have enough servers to meet the additional demand of a reasoning model; Claude keeps telling me that it is “temporarily defaulting to concise responses” because “we’re experiencing high demand”.
DeepSeek was able to draft on OpenAI’s work. Dan also indicated that the information OpenAI published in September, along with “some implementation details … leaked through SF conversation”, would have allowed DeepSeek to shortcut the long research process. Or as he later posted (emphasis added):
On top of a good base model, recent reasoning advances (e.g., o1) do not require that much compute to replicate. o1 can be replicated without tens of thousands of H100s.
If you know which research dead ends to avoid (e.g., MCTS) and roughly know what research direction to pursue, replication is much easier. These algorithmic insights likely slowly leaked through SF party conversations and other channels. It probably takes a few months once you have the idea for o1 to replicate it.
That's how DeepSeek did it without much compute.
To clarify, the proof-of-concept from OpenAI did most of the heavy lifting. This clarified that inference-time compute using RL+LLMs can be extraordinarily effective. This led to teams being formed that are dedicated to replication.
I am not claiming something outlandish like Chinese spies had to exactly copy their homework to replicate it themselves. Many details can be filled in through tinkering, which DeepSeek is good at.
ChinaTalk reports on a “January 26 closed-door session hosted by Shixiang 拾象, a VC spun out from Sequoia China”:
AI is similar to a step function, where the compute requirements for followers have decreased by a factor of 10. Followers have historically had lower compute costs, but explorers still need to train many models. … While the vast amount of compute resources spent by explorers may not be visible, without such investment, the next "step" might not occur. [I am not clear on who was speaking here.]
Or as Sam Altman put it last month:
it is (relatively) easy to copy something that you know works.
it is extremely hard to do something new, risky, and difficult when you don't know if it will work.
Multiple observers have pointed out that this may be a particularly auspicious moment for fast-followers: because o1 was the first reasoning model, the scale of training was comparatively small, and thus easier to replicate. Nathan Lambert of the Allen Institute for AI notes that “it’s a new technique so the cost of being at the frontier is the lowest it has been in a long time so progress is fast”.
Maybe they cheated a bit? There are reports – unsubstantiated, but from a credible source – that DeepSeek has “about 50,000 NVIDIA H100s that they can’t talk about because of the US export controls that are in place”. That would be more than 25x the computing capacity they claimed to have used [6]. I’ve also heard a rumor that DeepSeek managed to get access to raw chain-of-thought transcripts from o1 (which they could have used to train their model), and there’s some circumstantial evidence pointing in the same direction [7], as well as a vague report that Microsoft and OpenAI are “investigating whether data output from OpenAI’s technology was obtained in an unauthorized manner by a group linked to Chinese artificial intelligence startup DeepSeek” [8]. Finally, it’s theoretically possible that DeepSeek somehow got access to trade secrets, for instance leaked by a rogue employee. However, I have seen zero evidence of that last possibility, and none of this is in any way substantiated. I’ll also note that DeepSeek has released extensive writeups explaining how they built v3 and r1. It’s hard to plausibly fake that level of detail, so it’s clear that they were not just copying OpenAI’s work.
Other Thoughts
Here is a grab bag of other thoughts raised by DeepSeek’s accomplishment.
Why is r1 so cheap? It’s available (for API use) at about 1/30th what OpenAI charges for o1. As noted earlier, DeepSeek has made important improvements to the public state of the art in AI algorithms, but the trends of the last few years would suggest that OpenAI and the other big labs should have had similar advances by now (see footnote 2).
OpenAI may be providing a more gold-plated service, and they may be taking a larger profit margin. But historically they don’t seem to have been taking huge margins, and it’s hard to explain a 30x difference this way. Perhaps DeepSeek really has gotten ahead (for the moment?) on efficiency gains. In particular, they seem to have taken advantage of the fact that they own their own data centers to optimize the entire “stack”, carefully matching their algorithms to the specific chips they use, even bypassing some of Nvidia’s standard software for controlling GPUs. OpenAI, relying on Microsoft data centers, may not be able to optimize things as tightly (or may not have made this a priority).
Deedy Das of Menlo Ventures notes that Google’s Gemini 2.0 Flash Thinking model is 7x cheaper than r1, and better in at least some ways. And Google, unlike OpenAI, operates their own data centers and is used to optimizing the entire stack.
The big labs aren’t as eager to show off as I’d have guessed. Even if Google and Anthropic have reasons not to launch a flagship reasoning model, I might have expected them to show off unreleased work so as to maintain mindshare. I’d expect this even more for other labs like xAI, a couple of big Chinese labs, maybe Mistral, and others; perhaps none of those labs have a serious reasoning model yet?
The relative ease of “drafting” off the work of the frontier labs may call the business case for massive AI R&D budgets into question. I’ll say more about this in an upcoming post.
We haven’t seen the last of DeepSeek. There’s widespread agreement that r1 was produced in a hurry, and there’s plenty of room for DeepSeek to do better, no additional breakthroughs required.
We should hesitate to make pronouncements about the “race with China” until the dust settles. As I’ve been discussing, the story that DeepSeek caught up with OpenAI through sheer skill doesn’t really add up. DeepSeek has a sharp team, but what they’ve accomplished isn’t exactly the same thing as beating OpenAI at their own game. At the same time, the fact that DeepSeek has leapfrogged Meta undercuts arguments that the open Llama models by themselves are leaking the US advantage.
This doesn’t mean that export controls on AI chips are ineffective. US export controls are too new (and, initially, too leaky) to have had much impact here. The argument that “we did DeepSeek a favor by forcing them to be efficient” doesn’t withstand scrutiny. And training aside, as AI rolls out into the economy, everyone is scrambling for enough chips to meet the demand to use these models. (Samuel Hammond and Miles Brundage have each explained this nicely.)
If there was any doubt, we have fresh evidence that OpenAI is accelerating the overall pace of AI progress. Each new cutting-edge model inspires and informs their competitors, thus in turn increasing the commercial pressure on OpenAI to move quickly. Whether this is good or bad depends on your expectations regarding the impact of rapid progress in AI.
So What Have We Learned?
With v3 and now r1, DeepSeek has done something very impressive. However, they do not appear to have outdone the big American labs; instead, they have picked a different target – a quick dash to a reasoning model that is unpolished, but capable and open. For many purposes, this may be a better target, and there are many lessons to be drawn.
r1 is genuinely impressive, but (like so much work in AI!) less so than it seems at first glance, and is not a true match for OpenAI’s o1, let alone the upcoming o3. Even to achieve this, DeepSeek relied on lessons drawn from the leading labs; they have done top-notch work reproducing existing results at a lower cost, but this is different from independently advancing the frontier of AI capabilities. Meanwhile, their larger competitors have been holding back due to some mix of competing priorities, wanting to create a more complete and polished product, and bureaucracy.
It appears to be less difficult to train a “reasoning” model than previously believed. However, the years of research at OpenAI (and elsewhere) were critical to identify the general recipe. It may also be necessary to start with a very capable “conventional” LLM, of a sort which most labs have not had until fairly recently [9]. And you still need, in Dean’s words, a “truly cracked team”. As Cristóbal Valenzuela puts it, “progress now favors the quick and imaginative rather than the large and established.” This is certainly true if you want to make an attention-grabbing announcement; it remains to be seen whether it translates to business success.
The evidence has tilted toward it being difficult for even the largest and best-funded AI labs to maintain a large lead over their competition (and over open models), but this may partly be a temporary effect of the transition to reasoning models.
Finally, we shouldn’t be too hasty to draw conclusions from DeepSeek-r1. R1 is an important data point, but it’s just one data point, and a weird one. Conclusions like “China is catching up” are premature and oversimplified.
Thanks to Dan Hendrycks, Dean Ball, Helen Toner, Nathan Labenz, Nathan Lambert, Privahini Bradoo, and Siméon Campos.
[1] I haven’t seen a comparison to Google’s Gemini 2 model, currently in pre-release. Google has provided access to an early reasoning-trained version of Gemini 2, awkwardly designated “Gemini 2.0 Flash Thinking Experimental”. Early takes are positive, but we haven’t seen the full model yet – “Flash” is Google’s term for a lightweight version of a model: cheaper, faster, and less capable.
[2] Especially because DeepSeek could use existing AI models like Llama and GPT-4o to generate training data for v3. (I don’t know to what extent they actually did this.)
Dario Amodei recently provided some very helpful context for DeepSeek’s reported training costs:
DeepSeek does not "do for $6M what cost US AI companies billions". I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train (I won't give an exact number). Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors). Sonnet's training was conducted 9-12 months ago, and DeepSeek's model was trained in November/December, while Sonnet remains notably ahead in many internal and external evals. Thus, I think a fair statement is "DeepSeek produced a model close to the performance of US models 7-10 months older, for a good deal less cost (but not anywhere near the ratios people have suggested)".
He goes on to argue that DeepSeek’s reported cost for training v3 is right on the expected trend, given ongoing efficiency improvements across the industry, culminating with:
All of this is to say that DeepSeek-V3 is not a unique breakthrough or something that fundamentally changes the economics of LLM’s; it’s an expected point on an ongoing cost reduction curve. What’s different this time is that the company that was first to demonstrate the expected cost reductions was Chinese. This has never happened before and is geopolitically significant. However, US companies will soon follow suit — and they won’t do this by copying DeepSeek, but because they too are achieving the usual trend in cost reduction.
[3] It’s been reported that Chinese regulation (including censorship) and testing measures can add a 1-2 month delay when releasing a model. This suggests that DeepSeek might have had to complete their primary work on r1 within 2-3 months of the o1 announcement.
[4] See https://arcprize.org/blog/oai-o3-pub-breakthrough. o3 was evaluated on 500 tasks for the ARC-AGI-1 benchmark. In the “high efficiency” mode, they spent $17-20 per task. In the “low-efficiency” mode, the writeup implies that they spent 171 times more (1,024 samples per task versus 6). That works out to almost exactly $1.5 million: 500 tasks × ~$17.50 per task × 171 ≈ $1.5M.
[5] For instance, the CEO of Intuition Machines describes DeepSeek as having “better [infrastructure] engineering than the average lab, bulletproof pipelines and quick experiments, open and collaborative low blame culture. Multi-disciplinary staff hired for excellence rather than role and told to work on what they know best and love most, with low overhead teams led by very technical architects willing to make early bets.”
[6] DeepSeek had reported using 2000 H800s, a less-capable variant of the H100. I’ll note that a data center containing 50,000 H100s would cost well over $1B. It’s possible that the claim they have this many GPUs is exaggerated, or it could be that DeepSeek was merely renting under-the-table access to a large data center somewhere.
The report of 50,000 H100s appears to originate with Dylan Patel of SemiAnalysis, who later clarified that the 50,000 GPUs probably includes other “H”-series chips such as the H20 and H800, and may all have been imported legally, before export bans were in place. (Note that these other chips are less capable than the H100.) The whole idea is controversial; Dan Hendrycks and many others argue it is more likely that DeepSeek’s claim to have only used 2000 GPUs for the final training run is true.
In related news, Epoch AI reports that starting in late 2021, the size of Chinese training runs started gradually falling behind the US. “Initially, Chinese AI developers rapidly scaled up language models, catching up to the top models globally by late 2021. However, Chinese models then started to fall behind, scaling at 3x/year, while the top models globally have scaled at 5x/year since 2018.”
[7] One way DeepSeek could have obtained raw chain-of-thought outputs from o1 would be to use “jailbreaking” techniques to trick o1 into copying the chain of thought into its final output. OpenAI monitors for such attempts, but it’s conceivable that DeepSeek found ways of evading the monitors.
[8] Microsoft Corp. and OpenAI are investigating whether data output from OpenAI’s technology was obtained in an unauthorized manner by a group linked to Chinese artificial intelligence startup DeepSeek, according to people familiar with the matter.
Microsoft’s security researchers in the fall observed individuals they believe may be linked to DeepSeek exfiltrating a large amount of data using the OpenAI application programming interface, or API, said the people, who asked not to be identified because the matter is confidential. Software developers can pay for a license to use the API to integrate OpenAI’s proprietary artificial intelligence models into their own applications.
[9] Thanks to Nathan Labenz for this observation.