DeepSeek
We believe that valuation for AI companies is much simpler than one might think: any valuation, no matter how high, is valid only as long as someone else is willing to find a reason to justify a higher one. Models that help with valuation in the AI space tend to extrapolate sales and profitability from parameters that don’t really exist yet or are so speculative as to mean little. Some parameters are calculable, such as the cost of power or the cost of GPU hardware today, but trying to estimate revenue based on the number of paying users and the contracted price for AI compute time 5 or 10 years out is like trying to herd cats. It’s not going to go the way you think it is.
One variable in such long-term valuation models is the cost of computing time and the time it takes to train the increasingly large models currently being developed. In 2017 the AlphaGo Zero model, the leading model at the time, cost $600,000 to train. That model, for reference, had ~20m parameters and two ‘heads’ (think of a tree with two main branches): one predicted the probability of playing each possible move, and the other estimated the likelihood of winning the game from a given position. While this is a simple model compared to those available today, it was able to defeat the earlier AlphaGo versions that had beaten the world’s top Go players, using reinforcement learning (the ‘Good Dog’ training approach) without any human instruction in its training data. The model initially made random moves and examined the result of each one, improving with every iteration, without any pre-training.
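For readers who prefer to see structure rather than read about it, here is a minimal sketch of that two-headed design in PyTorch. It illustrates only the shared-trunk / policy-head / value-head idea; the layer sizes are our own simplified assumptions, not DeepMind’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedNet(nn.Module):
    """Simplified sketch of a two-headed (policy + value) network, AlphaGo Zero style.

    A shared 'trunk' processes the board, then splits into:
      - a policy head: a probability for each of the 19*19 points (+1 for passing)
      - a value head: a single number estimating the chance of winning from here
    """
    def __init__(self, board_size=19, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(                                      # shared feature extractor
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        flat = channels * board_size * board_size
        self.policy_head = nn.Linear(flat, board_size * board_size + 1)  # move logits
        self.value_head = nn.Sequential(nn.Linear(flat, 64), nn.ReLU(),
                                        nn.Linear(64, 1), nn.Tanh())     # win estimate in [-1, 1]

    def forward(self, board):
        x = self.trunk(board).flatten(1)
        return F.softmax(self.policy_head(x), dim=-1), self.value_head(x)

# One random "board position" in, a move distribution and a win estimate out.
policy, value = TwoHeadedNet()(torch.randn(1, 17, 19, 19))
print(policy.shape, value.item())
```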
GPT-4, a pre-trained transformer model with ~1.75 trillion[1] parameters, cost roughly $40m to train in 2022, and a 2024 training-cost study estimated that training costs for such models have been growing at 2.4x per year since 2016 (“If the trend of growing development costs continues, the largest training runs will cost more than a billion dollars by 2027, meaning that only the most well-funded organizations will be able to finance frontier AI models.”[2]). There are two ways to look at those costs. The first is the hardware acquisition cost, of which ~44% goes to computing chips, primarily GPUs (graphics processing units), used here to process data rather than graphics; ~29% goes to server hardware, ~17% to interconnects, and ~10% to power systems. The second is the cost amortized over the life of the hardware, which includes R&D staff at between 47% and 65% of the total and typically runs between 0.5x and 1x of the acquisition cost.
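As a quick sanity check on that trajectory, the short script below compounds the ~$40m GPT-4 figure at the study’s ~2.4x yearly growth rate. It is back-of-the-envelope arithmetic on the numbers quoted above, not the study’s own cost model.

```python
# Back-of-the-envelope projection of frontier-model training costs using the
# figures quoted above: ~$40m for GPT-4 in 2022 and ~2.4x growth per year.

base_year, base_cost = 2022, 40e6   # GPT-4 training cost estimate cited in the text
growth = 2.4                        # estimated yearly growth factor from Cottier et al.

for year in range(2023, 2028):
    cost = base_cost * growth ** (year - base_year)
    print(f"{year}: ~${cost / 1e9:,.2f}B")

# 2027 comes out around $40m * 2.4^5 ≈ $3.2B, with the $1B mark crossed before
# 2027 -- consistent with the study's warning quoted above.
```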
All in, as models get larger, training gets more expensive, and with many AI companies still experimenting with fee structures, model training costs are a critical part of the profitability equation. Based on the above, they will keep climbing, making profitability even more difficult to predict. That doesn’t seem to have stopped AI funding or valuation increases, but that is where DeepSeek V3 creates a unique situation.
The DeepSeek model is still a transformer model, similar to most of the current large models, but it was developed with the idea of reducing the massive amount of training time required for a model of its size (671 billion parameters), without compromising results. Here’s how it works:
- Training data is tokenized. For example, a simple sentence might be broken down into individual words, punctuation, and spaces, or into letter groups such as ‘sh’, ‘er’, or ‘ing’, depending on the algorithm. The finer the tokens, the more data there is to process, so tradeoffs are made between detail and cost (a toy example of that tradeoff appears after this list).
- The tokens are passed to a gating network, which decides which of the expert networks is best suited to process each particular token. The gating network acts as a route director, choosing the expert(s) that have done a good job with similar tokens previously. While one might think of the ‘expert networks’ as doctors, lawyers, or engineers with specialized skills, each of the 257 experts in the DeepSeek model can change its specialty. This is called dynamic specialization: the experts are not initially trained for specific tasks, but the gating network notices that, for example, Expert 17 seems to be the best at handling tokens that represent ‘ing’, and assigns ‘ing’ tokens to that expert more often (a stripped-down sketch of this routing appears after this list).
- The data that the experts pass to the next level is extremely complex, multi-dimensional information about the token, how it fits into the sequence, and many other factors. While the numbers vary considerably for each token, the data being passed between an expert network and its ‘attention heads’ can run as high as 65,000 data points (note: this is a very rough estimate).
- The expert networks each have 128 ‘attention heads’, each of which looks for a particular relationship within that mass of multi-dimensional data the expert networks pass to them. These could be structural (grammatical), semantic, or other dependencies. DeepSeek has found a way to compress the data being transferred from the experts to the attention heads, which reduces the computational demand on the attention heads. With 257 expert networks, each with 128 attention heads, and the large amount of data contained in each transfer, compute time is the big cost driver for training.
- DeepSeek has found a process (actually two processes) to compress the multi-dimensional data that each expert network passes to its attention heads. Typically, compression would hinder the attention heads’ ability to capture the subtle nuances contained in the data, but DeepSeek seems to have found compression techniques that do not blunt the attention heads’ sensitivity to those subtleties (a rough sketch of the general compression idea appears below, after the tokenization and routing examples).
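To make the tokenization bullet concrete, here is a toy example. The word-level split is real Python; the subword split is hand-written purely for illustration, since real tokenizers (BPE and its relatives) learn their subword vocabulary from data.

```python
# Toy illustration of the tokenization tradeoff: the same sentence split into
# coarse word-level tokens vs. a hand-made "subword" split. Finer tokens carry
# more detail but mean more pieces for the model to process.

sentence = "The training costs keep climbing"

word_tokens = sentence.split()  # coarse: 5 whole-word tokens
subword_tokens = ["The", " train", "ing", " cost", "s", " keep", " climb", "ing"]  # finer: 8 tokens

print(len(word_tokens), word_tokens)
print(len(subword_tokens), subword_tokens)
```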
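The gating step can also be sketched in a few lines. The version below is a generic top-k mixture-of-experts gate with toy sizes (8 experts, 2 picked per token, 16-dimensional tokens) that are our own placeholder numbers; DeepSeek V3’s real layers use 257 experts, far wider vectors, and more sophisticated routing and load-balancing logic.

```python
import torch
import torch.nn.functional as F

# A stripped-down "route director": the gate scores every expert for each token
# and only the top-scoring experts actually run on that token.

num_experts, top_k, dim = 8, 2, 16                        # toy sizes for illustration
gate = torch.nn.Linear(dim, num_experts)                  # learns which expert suits which token
experts = [torch.nn.Linear(dim, dim) for _ in range(num_experts)]

def moe_layer(token):
    scores = F.softmax(gate(token), dim=-1)               # how well each expert fits this token
    weights, chosen = scores.topk(top_k)                  # keep only the best few experts
    out = sum(w * experts[i](token) for w, i in zip(weights, chosen))
    return out / weights.sum()                            # blend the chosen experts' outputs

token = torch.randn(dim)
print(moe_layer(token).shape)   # the experts that were not chosen never run for this token
```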
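Finally, the compression described in the last two bullets can be pictured as a low-rank squeeze: push a wide per-token vector through a much narrower ‘latent’ and expand it back for the heads to use. The sketch below shows only that general idea, with made-up dimensions; it is not DeepSeek’s actual implementation.

```python
import torch

# Generic low-rank compression sketch: store/pass a narrow latent instead of the
# full wide vector, then expand it back for the attention heads. Dimensions here
# are toy assumptions, not DeepSeek's.

wide_dim, latent_dim, num_heads, head_dim = 4096, 512, 128, 32

compress = torch.nn.Linear(wide_dim, latent_dim, bias=False)             # down-projection
expand = torch.nn.Linear(latent_dim, num_heads * head_dim, bias=False)   # back up for the heads

token_state = torch.randn(wide_dim)                   # the large multi-dimensional payload (toy size)
latent = compress(token_state)                        # what actually gets stored / passed around
per_head = expand(latent).view(num_heads, head_dim)   # each of the 128 heads gets its slice

print(token_state.numel(), "->", latent.numel(), "values kept per token")
print(per_head.shape)                                 # the heads still get something to attend over
```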
[1] estimated
[2] Cottier, Ben, et al. “The Rising Costs of Training Frontier AI Models.” arXiv, arxiv.org/. Accessed 31 May 2024.
We note that the DeepSeek model benchmarks shown below are impressive, but some of that improvement might come from the fact that the DeepSeek V3 training data was more oriented toward mathematics and code. We also always remind investors that it is easy to cherry-pick benchmarks that present the best aspects of a model. That said, not every developer requires the most sophisticated general model for their project, so even if DeepSeek did cherry-pick benchmarks (we are not saying they did), a free model of this size and quality is a gift to developers, and the lower training costs are a gift to those who have to pay for processing time or hardware. It’s not the end of the AI era, but it might affect valuations and long-term expectations if DeepSeek’s compression methodology proves as successful in the wild as the benchmarks suggest. The fact that this step forward in AI came from a Chinese company will likely cause ulcers and migraines across the US political spectrum, and it could prompt even more stringent clampdowns on shipments of GPUs and HBM to China, despite the fact that those restrictions don’t seem to be having much of an effect.