Some Stray Thoughts After Leaving the Large Model Industry (Gemini 2.5 Pro Translated Version)

Interviewer: Please tell me the difference between DPO and PPO.

Me: PPO pertains to Yang, manifesting the Fire element, primarily harmonized and constrained by the Reward Model of the Metal element. Metal generates Water, Water overcomes Fire, and GAE indirectly regulates the direction of generation. DPO pertains to Yin, manifesting the Wood element, taking data as its vital essence. Wood can generate Fire, pointing directly to the origin, emphasizing the removal of the Metal element's stagnation, and entering the heart-mind (spirit) directly.

Interviewer: …Sir, is there something seriously wrong with your head?

Me: Yes, otherwise why would I be here working on large models?
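(For the record, and for any reader who would have preferred the boring answer the interviewer was presumably fishing for: stripped of the Five Elements, these are just the standard objectives. PPO maximizes a clipped surrogate whose advantage estimates come, via GAE, from a separately trained reward and value signal; DPO drops the explicit reward model and optimizes preference pairs directly against a frozen reference policy.)

```latex
% PPO: clipped surrogate (maximized), with advantages \hat{A}_t estimated by GAE
% from a learned reward/value signal -- the "Metal element" reward model.
\mathcal{L}^{\mathrm{PPO}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

% DPO: no reward model, no GAE; preference pairs (y_w preferred over y_l)
% are optimized directly against a frozen reference policy -- "data as vital essence".
\mathcal{L}^{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log\sigma\!\Big(
      \beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\right]
```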

That exchange, minus a tiny bit of artistic license, is roughly how my last interview went, in early 2025. Of course, I have since left the industry to become "the next unit of length-chan." If you ask me why, it's basically a sense of despair and powerlessness welling up from the abyss. Some people even complained that my earlier articles were corrupting the youth. To put it more vividly: since mid-2023 I've been getting slapped on one cheek after the other by various frontrunners, and by the second half of 2024 I had started to suspect I was the cuckold in an NTR (netorare) world, watching the methods and conclusions I had painstakingly explained over and over be implemented and validated by one "Blondie" (the stock NTR antagonist) after another. To put it more euphemistically: other companies are as if they've swallowed a steelyard weight and set their hearts on producing something big; I'm as if I've had a steelyard weight forcibly stuffed down there, so nothing big can come out.

Alright, now you should be able to vividly and euphemistically appreciate why I decided to leave. I did initially want to struggle a bit, and interviewed at a few places, but I soon realized that most people's understanding of large models is less self-consistent than a theory of Yin-Yang and the Eight Trigrams, so I simply gave up. By now I even feel somewhat transcendent, sage-like, and can't muster any interest in cursing those middle managers. So I'll just casually sort out the chaotic thoughts in my head.

First, I have to talk about DeepSeek. The name itself sounds like a phonetic rendering of an Anglo-Saxon surname. This company is just terrible: regardless of whether they have actually reduced the cost of training and serving models, they have greatly increased costs for the rest of society. As everyone knows, the dearest wish of today's middle managers is to recruit people who can "carry them to success," and DeepSeek's emergence has sharply raised the bar for who counts as able to carry them. They now look down on Ph.D.s from C9 League universities; candidates should preferably come from the "top two" (Peking University or Tsinghua University), and they must be young, clear-eyed, and capable of pulling all-nighters. This has made the homogenized competition over credentials and metrics among would-be entrants to the field even more brutal. I mean, everyone present here is a victim. But in reality, algorithms are nothing more than the Five Elements and the Eight Trigrams. The best thing DeepSeek did, and their true moat, is a kind of spontaneously formed, systems-engineering-like behavior within the organization. And why does that behavior exist? I speculate, or rather I assert, that it is the instantiation, inside the organization, of the formless concepts in "Little Liang's" mind (presumably Liang Wenfeng, DeepSeek's founder). Correspondingly, the scenes conjured by middle managers with empty minds can only be a complete mess: chicken feathers all over the floor, horses trampling the seedlings.

Similar to the previous point, DeepSeek's other harm is also inflicted passively, and again everyone present here is a victim, because I'm sure you've all witnessed the grand spectacle of AI-generated garbage articles flooding your feeds. To add something I didn't dare say earlier for fear of being attacked: R1's (DeepSeek-R1's) primordial, world-creating hallucination abilities have left the Chinese internet corpus as full of nonsense and falsehoods as a real estate developer's financial report. This will likely make DeepSeek the last domestic large model company able to gain traction without relying on puff pieces. And every product that depends on fresh content (such as RAG) and every scenario that depends on case material (such as medicine or law) will eventually drift into the uncanny-valley universe that DeepSeek inhabits, rationally outputting vast amounts of content unrelated to the real world. (Conversely, models like Wen〇, Hun〇, and so on haven't deeply polluted the Chinese internet, because on the uncanny-valley curve their output at best lands in the "X-brand rubber X-brand dolls" segment, which is still relatively easy to tell apart.)

Of course, R1 is destined to be the savior of researchers who work on standard datasets; give them a baseline and they can churn out filler papers with R1 until 2077. There are so many directions to explore. For example, R1's chain of thought can be made longer, shorter, deeper, or shallower, and the first and second derivatives of each of those can be made larger or smaller. Or take the R1 training pipeline: each component can spawn countless variants, like "Female, Azure, Sakura, Gold, Silver, Low Rank, High Rank, Master Rank, Tempered" (Monster Hunter references), and a change of skin lets you write a whole new batch of papers. Then there are R1's various social and philosophical problems: honesty, equality, inclusivity, universal love. Perhaps some fool will even try "Federated R1 Learning" one day (I'd bet one Nvidia share on it). Add in the countless training and evaluation sets yet to be released, and papers will be constructed like a Cantor set: uncountably many of them, yet of measure zero, providing nothing new.

So R1 is good precisely because it carries no inductive bias; it is a near-perfect practitioner of "the bitter lesson." Conversely, DeepSeek's greatest future danger is falling into inductive bias, because R1's success has masked the fact that v3 is a highly customized architecture. In fact, it is very hard to say whether DeepSeek v3's design choices are optimal among the eight or nine foundation models from the seven companies of the "Six Little Dragons" (a term for prominent Chinese AI startups; the numbers are deliberately imprecise). It's just that their formidable architecture colleagues, with the same resources, gave their algorithm colleagues twice as many chances for trial and error. I don't wish to see them decline, but if they do fall on hard times, the reason will inevitably be a lapse into inductive bias. I sincerely hope they avoid that problem, establish a new model of AGI development, and bring us academic beggars more "water resources" (good material for filler papers) like R1.

Returning to the core topic of AGI: ever since I entered this field I have maintained that the current architectures, even with test-time scaling, are still not the correct path to AGI. Even taking ten thousand steps back, if a quadratic-attention architecture truly is a viable route to AGI, its initial conditions are highly unlikely to lie among the random seeds we can currently write down as integers. From my perspective, the success of test-time scaling is a rather crude reproduction of the recurrent way a nervous system thinks: it works as long as the logic the model has induced is sufficient to keep it self-consistent over sufficiently long texts. That is, of course, extremely difficult, because one defining characteristic of natural language corpora is their lack of self-consistency (code, by contrast, is about as self-consistent as a corpus gets). So this direction will most likely end up solving most coding problems, rather than AGI.
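To make the "crude recurrence" claim concrete, here is a minimal toy sketch, purely my own illustration rather than anything from DeepSeek or any real model; the function names are invented. Test-time scaling is just an outer loop that appends the model's own output to its context and calls the same quadratic-attention forward pass again, so extra budget only helps while the growing context stays self-consistent.

```python
# Toy illustration of test-time scaling as an outer recurrence around a
# feed-forward pass. Names (forward_pass, test_time_scale) are invented
# for this sketch and do not refer to any real library.

def forward_pass(context):
    """Stand-in for one quadratic-attention decode step: in a real model the
    cost grows roughly with len(context)**2. Here it just emits a token that
    records how much context it saw."""
    return f"<thought_{len(context)}>"

def test_time_scale(prompt, budget=5):
    """The 'recurrence': each step conditions on everything produced so far,
    so a larger budget only pays off while the model's induced logic keeps the
    ever-longer context self-consistent."""
    context = [prompt]
    for _ in range(budget):
        context.append(forward_pass(context))
    return context

if __name__ == "__main__":
    print(test_time_scale("Is D reachable from A?", budget=3))
    # ['Is D reachable from A?', '<thought_1>', '<thought_2>', '<thought_3>']
```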

Here, if you have read carefully enough, you will have noticed the word "self-consistency." From the understanding I have gradually formed over the past two years, logical correctness is not necessarily a prerequisite for AGI; self-consistency is enough. This has led me to a "folk science" style answer to the question of the path to AGI. First, an architecture capable of AGI needs a meta-system (or hyper-system) for "logic compilation": there must first be a functional module that provides an instruction set for subsequent operations (this set may conflict with commonly accepted correct logic, but it must be self-consistent), and only on the basis of this instruction set does the model's thinking module perform test-time computation. I have no proof for this judgment, but it seems to me that when humans create new things they do not use existing formal logic; mathematical inferences, for example, often arrive as flashes of insight before being translated into formal logical steps and recorded in language. Second, a system capable of logic compilation should be a "multiple drafts model" (please DeepSeek for yourself what that is), because only then is there enough room to generate self-consistent logic, which ultimately manifests as a framework that can be used for thinking.
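Since normal humans apparently won't understand the prose version, here is a deliberately silly toy sketch of the two-stage idea. It is purely my own illustration of the paragraph above, not anything I have built or anything that exists in a library; every name in it (propose_drafts, self_consistency, compile_logic, think) is invented. A "multiple drafts" stage proposes several candidate rule sets, a compilation stage keeps the most self-consistent one regardless of whether it matches textbook logic, and only then does a "thinking" stage spend test-time compute under those rules.

```python
import random

# Toy sketch only: all functions and data are invented illustrations of the
# "logic compilation + multiple drafts" idea, not a real system or API.

def propose_drafts(prompt, n_drafts=4, seed=0):
    """'Multiple drafts' stage: propose several candidate instruction sets.
    Each draft is a small bundle of toy rules with prior weights; a real
    system would have a learned meta-module here."""
    rng = random.Random(seed)
    rule_pool = [
        ("if A then B", 0.9),
        ("if B then C", 0.8),
        ("if A then not C", 0.3),  # deliberately conflicts with the two above
        ("if C then D", 0.7),
    ]
    return [rng.sample(rule_pool, k=3) for _ in range(n_drafts)]

def self_consistency(draft):
    """Score a draft only for internal coherence, not for agreement with
    'ground truth' logic; that is the distinction the paragraph above insists on."""
    rules = {rule for rule, _ in draft}
    contradiction = {"if A then B", "if B then C", "if A then not C"} <= rules
    return sum(weight for _, weight in draft) - (1.0 if contradiction else 0.0)

def compile_logic(prompt):
    """'Logic compilation' meta-system: freeze the most self-consistent draft
    as the instruction set that downstream thinking must obey."""
    return max(propose_drafts(prompt), key=self_consistency)

def think(prompt, instruction_set, steps=3):
    """Test-time computation stage: roll out a (toy) chain of thought that only
    ever applies rules from the compiled instruction set."""
    trace = [f"step {i}: apply '{rule}'"
             for i, (rule, _) in enumerate(instruction_set[:steps])]
    return "\n".join([f"question: {prompt}"] + trace)

if __name__ == "__main__":
    question = "Is D reachable from A?"
    print(think(question, compile_logic(question)))
```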

Of course, I fully believe that the above paragraphs are largely unlikely to be understood by normal humans, because, to the best of my knowledge, this viewpoint has not been proposed by predecessors. However, I do have some preliminary experiments underway, and I will strive to publish them in the top three Chinese journals in the future for everyone’s amusement.

Let me share another viewpoint that is also likely incomprehensible to normal humans: I support the idea that the development of human society should evolve towards AGI hive-mind-ism. This is because it is perhaps not the only, but possibly the optimal, solution for humanity to break through the next stage of the Great Filter. Of course, as an Anarcho-Communist, I also firmly believe that the current development of large language models can allow us to truly realize OGAS (a Soviet-era economic planning network project) a century after its conception. Although I will be dead by then, just thinking about it makes me very happy.

Finally, please allow me a few more sarcastic remarks about the general environment in China. In a situation where graduates are oversupplied yet middle managers feel there is no one worth hiring, "connections" have perversely become the sole evolutionary path by which the fittest survive. So I hope that everyone present, to avoid being driven out like good money by bad, will express yourselves fully and build active networks with your reliable seniors, peers, and juniors. After all, only by surviving can you produce output; toiling away in silence only earns you a "325" (a bottom-tier performance rating).



