GPT-4 model architecture leaked: 1.8 trillion parameters, built on a Mixture-of-Experts (MoE) design
Industry insiders recently revealed highly specific details about the GPT-4 large model OpenAI released in March this year, including its model architecture, training and inference infrastructure, parameter count, training dataset, token count, training cost, and its use of a Mixture-of-Experts (MoE) design.
One of the authors of the article is Dylan Patel, who previously broke the news about Google's internal document "We Have No Moat, And Neither Does OpenAI."
The following summarizes the main content of the article demystifying the technical details of GPT-4.
The article opens by arguing that the reason OpenAI is not open is not to protect humanity from being destroyed by AI, but because the large models it builds are reproducible. In the future, major internet companies in China and the United States (such as Google, Meta, Tencent, Baidu, and ByteDance), as well as leading AI startups, will be able to build large models that match or even surpass GPT-4.
OpenAI's most durable moat is its feedback from real users, the industry's top engineering talent, and the lead conferred by its first-mover advantage.
According to the report, GPT-4 contains a total of 1.8 trillion parameters across 120 layers, whereas GPT-3 has only about 175 billion parameters. To keep costs reasonable, OpenAI built it with a Mixture-of-Experts (MoE) architecture.
Specifically, GPT-4 uses 16 experts, each with about 111 billion parameters, and each forward pass routes a token through two of them.
In addition, it has 55 billion shared attention parameters. The model was trained on a dataset containing 13 trillion tokens; these tokens are not all unique, as data seen across multiple epochs is counted once per pass.
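The top-2 routing described above can be sketched in a few lines. This is a hypothetical illustration of softmax-gated top-2 MoE routing in general, not OpenAI's actual (unpublished) implementation; the gating weights, expert functions, and dimensions below are invented for the example.

```python
import numpy as np

def top2_moe_forward(x, gate_w, experts):
    """Route one token through the top-2 of N experts via softmax gating.

    x: (d,) token representation
    gate_w: (d, n_experts) gating weights
    experts: list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ gate_w
    top2 = np.argsort(logits)[-2:]          # indices of the two highest-scoring experts
    weights = np.exp(logits[top2])
    weights /= weights.sum()                # softmax renormalized over the chosen two
    # Only the two selected experts run, so compute per token stays far
    # below what a dense model of the same total parameter count would need.
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

# Toy usage: 16 experts, each a simple linear map (stand-ins for expert FFNs)
rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = top2_moe_forward(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (8,)
```

The key property this shows is why MoE keeps cost reasonable: all 16 experts contribute parameters to the model, but each token only pays the compute of two of them.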
The context length in the GPT-4 pre-training stage was 8k; the 32k version is the result of fine-tuning on the 8k model. If the training were done in the cloud at $1 per A100-hour, a single training run would cost as much as $63 million. Today, however, the same training could be done for about $21.5 million.
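The two cost figures can be sanity-checked with simple arithmetic. The article gives only the dollar totals and the $1/A100-hour rate, so the GPU-hour count and implied "today" rate below are derived from those quoted numbers, not independently reported.

```python
# Back-of-the-envelope from the article's figures: $63M at $1 per
# A100-hour implies roughly 63 million A100-hours of training compute.
rate_then = 1.00           # USD per A100-hour (rate quoted in the article)
cost_then = 63_000_000     # USD, quoted cost of one training run
gpu_hours = cost_then / rate_then
print(f"{gpu_hours:.2e} A100-hours")

# For the same amount of compute, the quoted $21.5M "today" cost would
# imply an effective rate of about $0.34 per GPU-hour.
cost_now = 21_500_000      # USD, quoted present-day cost
print(f"${cost_now / gpu_hours:.2f} per GPU-hour")
```

This treats the compute requirement as fixed and attributes the entire saving to cheaper GPU-hours, which is a simplification; efficiency gains on newer hardware would shift the split.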
