Supply Chain Market Research - SCMR LLC

DeepSeek

1/27/2025


The definition of panic is "sudden uncontrollable fear or anxiety, often causing wildly unthinking behavior," but that does little to shine a light on what is causing the panic or the circumstances leading up to it.  Today's 'panic' was caused by an AI research lab in Hangzhou, China, less than two years old, that was spun off from a high-profile quant hedge fund.  Its most recent model, DeepSeek (pvt) V3, has been able to outperform many of the most popular models and is open source, giving 'pay-for' models a new competitor that can be used to develop AI applications without paying a monthly or yearly fee.  By itself, this would simply be added to the list of worries that AI model developers already carry; there are a number of existing open-source AI models, and they have not put OpenAI (pvt), Google (GOOG), Anthropic (pvt), or Meta (FB) out of business.  It is inevitable that as soon as a new model is released, another one comes along that performs a bit better.  But that is not why panic set in today.
We believe that valuation for AI companies is much simpler than one might think, as any valuation, no matter how high, is valid only as long as someone else is willing to find a reason to justify a higher one.  Models that help with valuation in the AI space tend to extrapolate sales and profitability from parameters that don't really exist yet or are so speculative as to mean little.  Some parameters are calculable, such as the cost of power or the cost of GPU hardware today, but trying to estimate revenue based on the number of paying users and the contracted price for AI compute time 5 or 10 years out is like trying to herd cats.  It's not going to go the way you think it is.
One variable in such long-term valuation models is the cost of computing time and the time it takes to train the increasingly large models currently being developed.  In May of 2017, AlphaGo Zero, the leading model at the time, cost $600,000 to train.  That model, for reference, had ~20m parameters and two 'heads' (think of a tree with two main branches): one predicted the probability of playing each possible move, and the other estimated the likelihood of winning the game from a given position.  While this is a simple model compared with those available today, it was able to beat the world's champion Go player using reinforcement learning (the 'good dog' training approach) without any human instruction in its training data.  The model initially made random moves and examined the result of each one, improving its ability each time, without any pre-training.
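For readers who want a concrete picture of what a 'two-headed' network looks like, the sketch below (PyTorch) shows a shared trunk feeding a policy head and a value head.  The layer sizes and input format are our own illustrative assumptions, not AlphaGo Zero's actual architecture.

```python
# Minimal sketch of a two-headed network of the AlphaGo Zero type:
# a shared trunk feeds a policy head (move probabilities) and a value
# head (estimated chance of winning).  Sizes are illustrative only.
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    def __init__(self, board_size=19, channels=64):
        super().__init__()
        # Shared trunk: a couple of convolution layers over the board planes
        self.trunk = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        flat = channels * board_size * board_size
        # Policy head: one logit per board position (plus one for "pass")
        self.policy_head = nn.Linear(flat, board_size * board_size + 1)
        # Value head: a single number in [-1, 1] estimating the winner
        self.value_head = nn.Sequential(nn.Linear(flat, 1), nn.Tanh())

    def forward(self, board_planes):
        x = self.trunk(board_planes).flatten(1)
        return self.policy_head(x), self.value_head(x)

net = TwoHeadedNet()
dummy_board = torch.zeros(1, 3, 19, 19)       # batch of one empty board
policy_logits, value = net(dummy_board)
print(policy_logits.shape, value.shape)       # torch.Size([1, 362]) torch.Size([1, 1])
```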
In 2022, GPT-4, a pre-trained transformer model with ~1.75 trillion[1] parameters, cost $40m to train, and a 2024 training cost study estimated that the training cost for such models has been growing at 2.4x per year since 2016 ("If the trend of growing development costs continues, the largest training runs will cost more than a billion dollars by 2027, meaning that only the most well-funded organizations will be able to finance frontier AI models."[2]).  There are two aspects to those costs.  The first is the hardware acquisition cost, of which ~44% is for computing chips, primarily GPUs (graphics processing units), which here process data rather than graphics, ~29% is for server hardware, ~17% for interconnects, and ~10% for power systems.  The second is the cost amortized over the life of the hardware, which includes between 47% and 65% for R&D staff and runs between 0.5x and 1x of the acquisition cost.
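As a rough sanity check on that growth claim, using only the figures above ($40m in 2022, growing at 2.4x per year), the projection looks like this:

```python
# Back-of-envelope projection of frontier-model training costs, using
# only the figures cited above: ~$40m for GPT-4 (2022) and ~2.4x growth
# per year.  Purely illustrative.
cost_musd, growth = 40, 2.4

for year in range(2023, 2028):
    cost_musd *= growth
    print(f"{year}: ~${cost_musd:,.0f}m")

# 2023: ~$96m, 2024: ~$230m, 2025: ~$553m, 2026: ~$1,327m, 2027: ~$3,185m
# i.e. the trend crosses the billion-dollar mark around 2026-2027,
# consistent with the study's "more than a billion dollars by 2027".
```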
All in, as models get larger, training gets more expensive, and with many AI companies still experimenting with fee structures, model training costs are a critical part of the profitability equation.  Based on the above, those costs will keep climbing, making profitability harder to predict.  That doesn't seem to have stopped AI funding or valuation increases, but that is where DeepSeek V3 creates a unique situation.
The DeepSeek model is still a transformer, similar to most of the current large models, but it was developed with the goal of reducing the massive amount of training time required for a model of its size (671 billion parameters) without compromising results.  Here's how it works:
  • Training data is tokenized.  For example, a simple sentence might be broken down into individual words, punctuation, and spaces, or into letter groups such as 'sh', 'er', or 'ing', depending on the algorithm.  The finer the tokens, the more data is processed, so tradeoffs are made between detail and cost.
  • The tokens are passed to a gating network, which decides which of the expert networks is best suited to process each particular token.  The gating network acts as a route director, choosing the expert(s) that have done a good job with similar tokens previously.  While one might think of the 'expert networks' as doctors, lawyers, or engineers with specialized skills, each of the 257 experts in the DeepSeek model can change its specialty.  This is called dynamic specialization: the experts are not initially trained for specific tasks, but the gating network notices that, for example, Expert 17 seems to be the best at handling tokens that represent 'ing', and assigns 'ing' tokens to that expert more often (a minimal sketch of this kind of routing follows below).
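The sketch below shows that kind of gated 'mixture of experts' routing: a gating network scores every expert for each token and only the top-scoring experts run.  The expert count, sizes, and scoring rule are illustrative assumptions for clarity, not DeepSeek V3's actual configuration or routing rules.

```python
# Minimal sketch of mixture-of-experts routing: the gate scores every
# expert for each token and only the top-k experts actually run.
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)      # the "route director"
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, tokens):                      # tokens: [n_tokens, dim]
        scores = self.gate(tokens).softmax(dim=-1)  # how well each expert fits
        weights, chosen = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):              # run only the chosen experts
            for i, expert_id in enumerate(chosen[:, slot]):
                out[i] += weights[i, slot] * self.experts[int(expert_id)](tokens[i])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(5, 64)).shape)              # torch.Size([5, 64])
```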
Here is where DeepSeek differs…
  • The data that the experts pass to the next level is extremely complex: multi-dimensional information about the token, how it fits into the sequence, and many other factors.  While the numbers vary considerably for each token, the data being passed between an expert network and its 'attention heads' can run as high as 65,000 data points (note: this is a very rough estimate).
  • The expert networks each have 128 'attention heads', each of which looks for a particular relationship within that mass of multi-dimensional data the expert networks pass to them.  Those relationships could be structural (grammatical), semantic, or other dependencies, but DeepSeek has found a way to compress the data being transferred from the experts to the attention heads, which reduces the computational demand on the attention heads.  With 257 expert networks, each with 128 attention heads, and the large amount of data contained in each transfer, that compute time is the big cost driver for training.
  • DeepSeek has found a process (actually two processes) that compresses the multi-dimensional data each expert network passes to its attention heads.  Typically, compression would hinder the attention heads' ability to capture the subtle nuances contained in the data, but DeepSeek seems to have found compression techniques that do not affect the attention heads' sensitivity to those subtleties (a rough sketch of the general idea follows below).
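To make the general idea concrete, the sketch below compresses a wide token representation into a small latent and lets the attention heads work from keys and values rebuilt out of that latent.  The dimensions and projection scheme are our own illustrative assumptions, not DeepSeek's published method.

```python
# Minimal sketch of low-rank compression before attention: a wide token
# representation is projected down to a small latent, and the attention
# heads read keys/values reconstructed from that latent instead of from
# the full-width data.  Dimensions are illustrative assumptions only.
import torch
import torch.nn as nn

dim, latent_dim, n_heads, head_dim = 1024, 128, 8, 64

down_proj = nn.Linear(dim, latent_dim, bias=False)             # compress
up_k = nn.Linear(latent_dim, n_heads * head_dim, bias=False)   # rebuild keys
up_v = nn.Linear(latent_dim, n_heads * head_dim, bias=False)   # rebuild values

tokens = torch.randn(10, dim)                    # 10 tokens at full width
latent = down_proj(tokens)                       # 10 x 128: what gets passed around
keys = up_k(latent).view(10, n_heads, head_dim)  # heads work from the latent
values = up_v(latent).view(10, n_heads, head_dim)
print(latent.shape, keys.shape, values.shape)
# The latent is ~8x narrower than the original representation, which is
# the kind of saving in data movement and compute described above.
```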


[1] estimated

[2] Cottier, Ben, et al. “The Rising Costs of Training Frontier AI Models.” arXiv, arxiv.org/. Accessed 31 May 2024.
 
Looking back at the training costs for large models mentioned above, one would expect a model the size of DeepSeek V3 (671 billion parameters and 14.8 trillion training tokens) to take a massive amount of GPU time and cost $20m to $30m to train, yet the cost to train it was just a bit over $5.5m, based on 2.789 million hours of H800 time at $2.00 per hour, closer to the cost of much smaller models and well outside the expected range.  This means someone has found a way to reduce the cost of training a large model, potentially making it easier for model developers to produce competitive models.  To make matters worse for incumbents, DeepSeek V3 is open source, which allows anyone to use the model for application development.  That undercuts fee-based models that expect to charge more for each increasingly large model and justify those fees on the rising cost of training.  Of course, the fact that such an advanced model is free makes the long-term fee-structure models that encourage high valuations less valid.
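The arithmetic behind that training-cost figure is simple, using only the numbers quoted above:

```python
# Quick check of the training-cost arithmetic cited above.
gpu_hours = 2.789e6            # H800 GPU hours, as quoted
rate_per_hour = 2.00           # $ per GPU hour, as quoted
cost = gpu_hours * rate_per_hour
print(f"${cost/1e6:.2f}m")     # ~$5.58m, vs. the $20m-$30m one might expect
```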
We note that the DeepSeek model benchmarks shown below are impressive, but some of that improvement might come from the fact that the DeepSeek V3 training data was more oriented toward mathematics and code.  We also always remind investors that it is easy to cherry-pick benchmarks that present the best aspects of a model.  That said, not every developer requires the most sophisticated general model for their project, so even if DeepSeek did cherry-pick benchmarks (we are not saying they did), a free model of this size and quality is a gift to developers, and the lower training cost is a gift to those who have to pay for processing time or hardware.  It's not the end of the AI era, but it might affect valuations and long-term expectations if DeepSeek's compression methodology proves as successful in the wild as the benchmarks suggest.  The fact that this step forward in AI came from a Chinese company will likely cause ulcers and migraines across the US political spectrum, and could prompt even more stringent clampdowns on the importation of GPUs and HBM into China, despite the fact that the existing restrictions don't seem to be having much of an effect.
Figure 1 - DeepSeek V3 Benchmarks - Source: DeepSeek