When Your Model Sucks as Much as Your Life (Gemini 2.5 Pro Translated Version)

I have a friend.

He told me, very distressed, that his ‘goddess’ (crush) always ignored him, her reason being that she was busy tuning parameters and training a model, and that the model’s performance was currently terrible. He said his situation was, perhaps, even more vexing than his goddess’s model. Her excuse was like a ‘red-header document’ (an official, unchallengeable directive), instilling a sense of awe and fear. Over time, he no longer dared to look directly at her profile picture, afraid his words might affect her hyperparameters, thereby affecting her experimental results, her paper, her graduation, her million-dollar offer, her big house in California.

After listening, I calmly sat on the edge of the bed, casually pulled out ‘Being and Time’ from the box where I keep my sleeping pills, flipped to the bookmarked page, and stared at each symbol on the paper with the concentration of someone doing cross-stitch. After a long while, amidst my friend’s uneasy breathing, I slowly said: “When you say a model sucks, you’d better first figure out which model you’re talking about. Although this system isn’t something I created independently, you might as well let me teach it to you again. Of course, you’re free to question this system; it’s just a heuristic methodology. If possible, I hope you can come up with something better yourself. In short, for a researcher whose mathematical abilities are ‘so good’ that black-box deep learning is all they can manage, the ‘model’ you speak of can have five meanings. Let’s use NLP tasks as examples here:

\(\mathcal{M}_D\) (Ding an sich / Thing-in-itself): This model borrows Kant’s concept of the ‘thing-in-itself,’ an objective, fundamental existence independent of all our observations and understanding. Using an NLP example, we can imagine that behind the world of NLP, there’s a perfect model supporting the operational logic of all NLP tasks. Although we don’t know what this model is, we can perceive certain phenomena in the NLP world, like predicting the next token after an input sequence. However, these perceived phenomena ultimately cannot tell us what this perfect model truly ‘is.’ This unknown ‘something,’ which nevertheless affects our cognition, is the so-called thing-in-itself model.

\(\mathcal{M}_P\) (Perception): Although we cannot directly make use of a model that is an unknown ‘something,’ we can, based on our senses, form cognitions, impressions, or concepts about its underlying operational logic. In other words, this second-layer model is an experience-based approximation of that unknown model from the first layer. Continuing with the NLP example, this approximate experience could be called ‘Attention is All You Need.’ That is, the attention mechanism is our modeling of the underlying operational logic of NLP—that perfect, yet unknown, ‘something.’ Obviously, this cognitive model is formed from human senses and experience, and is thus naturally a degradation of the thing-in-itself model.

\(\mathcal{M}_I\) (Instantiation): After establishing a cognitive model, our next natural step is to instantiate it. In other words, we need to transform this generalized concept into something that can actually be operated on or computed. Returning to the NLP example, after we come to believe ‘Attention is All You Need,’ we start writing code for Transformers. We implement the attention mechanism as multi-head, softmax-normalized attention over queries, keys, and values (QKV), and we stack many fully connected layers to increase the model’s capacity. Clearly, we don’t always know why we do things a certain way. Part of it is experience gained through induction, but more often it’s heuristic guesswork. Therefore, an instantiated model is a degradation compared to our cognitive model.
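To make that instantiation concrete, here is a minimal sketch in PyTorch of the kind of module this paragraph describes; the dimensions, layer names, and defaults are illustrative placeholders, not any particular production implementation.

```python
# A minimal sketch of instantiating "attention" as concrete tensors, a softmax,
# and a stack of linear layers. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One linear map each for queries, keys, values, plus an output projection.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # Project and split into heads: (batch, n_heads, seq_len, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention: the softmax-normalized QKV of the text.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)
```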

\(\mathcal{M}_R\) (Reachable): Once we have an instantiated model framework, its maximum potential performance is actually already determined, because at best, it can only approximate our cognitive model. Looking back, we actually cannot reach this upper performance limit because we can only initialize parameters, not ‘enlighten’ them. Thus, we always need a model training process. That is to say, after we’ve ‘pulled’ (git cloned) the Transformer code, we need to start working on data, tuning hyperparameters, etc. (obviously, a model’s performance directly depends on this). When we’ve done everything we can, we obtain a ‘reachable model.’ Clearly, this model is also a degradation compared to the instantiated model.
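As a rough illustration of what ‘reaching’ amounts to in practice, here is a minimal, hypothetical training-loop sketch; the optimizer choice, learning rate, step count, and gradient clipping are placeholder ‘alchemy’ knobs, not recommendations.

```python
# A minimal sketch of the "reaching" process: the instantiated architecture is
# fixed, and everything below (data, optimizer, schedule) decides how close the
# trained weights get to its potential. All values here are placeholders.
import torch
import torch.nn as nn

def reach(model: nn.Module, data_loader, steps: int = 10_000, lr: float = 3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    step = 0
    while step < steps:
        for inputs, targets in data_loader:      # data quantity and quality live here
            logits = model(inputs)               # next-token predictions
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # a typical "alchemy" knob
            optimizer.step()
            step += 1
            if step >= steps:
                break
    return model  # the "reachable" model: the best this instantiation actually attained
```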

\(\mathcal{M}_O\) (Observation): Finally, after you’ve trained your model, you still need a method to test the actual performance of the model you’ve obtained. Typically, you’ll use a series of widely recognized tasks, carrying an ‘axiomatic’ connotation, to evaluate it. But essentially, through performance evaluations on multiple tasks, we are merely using a relatively feasible form of inductive reasoning to infer whether the model’s performance is good or bad. In fact, even if we exhaust every available task, we cannot fully reveal all the capabilities of the model you’ve obtained. Perhaps the random seed you used during model initialization allows this model to describe your sucky life with outstanding results, but you wouldn’t know. You’re only ever evaluating an observation-based model, which is clearly also a degradation from the reachable model.
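The inductive nature of this step is easy to see in code: whatever number we report is just an average over the tasks we happened to pick. The sketch below is hypothetical; the predict method and the task suites are stand-ins, not any real benchmark API.

```python
# A minimal sketch of the observation step: the reported "performance" is only
# an average over whichever tasks were chosen. Task contents and model.predict()
# are hypothetical stand-ins.
from statistics import mean

def observe(model, task_suites: dict) -> dict:
    """task_suites maps a task name to a list of (example, label) pairs."""
    scores = {}
    for name, examples in task_suites.items():
        correct = sum(int(model.predict(x) == y) for x, y in examples)
        scores[name] = correct / max(len(examples), 1)
    # The headline number: an induction over a finite, arbitrary slice of tasks.
    scores["average"] = mean(scores.values())
    return scores  # the observed model: only what this slice of tasks reveals
```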

So, when you find your model’s performance sucks, you should have a very clear reverse reasoning process in mind. Our ultimate goal is to obtain an ideal approximation of the thing-in-itself model, and at every step in this process, there’s a degradation in model performance.

Is my model’s poor performance solely due to using incorrect observation methods? Should I design new experiments to test the capabilities the model has already acquired internally? Or, should I rely on current observation methods as the sole basis for evaluation? For example, back then, among the various large models in China, which one didn’t thrash GPT-3 on SuperGLUE? But ChatGPT isn’t something that needs to be evaluated using these datasets. Furthermore, if there isn’t a reasonable method available on the market, can I myself propose a more reasonable one and then assess my model’s performance?

If it’s confirmed that the observation method itself is not the problem, the next step is to consider whether there’s an issue with ‘reachability’ – that is, is the data sufficient, has data noise been removed, is the optimizer correct, are the optimizer’s parameters right, is the training duration long enough, and so on and so forth. These are generally part of the work jokingly referred to as ‘alchemy’ (炼丹). However, in reality, most people’s efforts stop at this stage. The biggest taboo is to directly modify the instantiation or even the cognition when the model hasn’t been ‘fully reached’ (i.e., properly trained). I’ve seen too many students start blindly changing network architectures, randomly adding loss functions, or even haphazardly inventing concepts without fully understanding the issues with their model’s reachability. I advise everyone here not to count on getting lucky, because even if you do manage to improve your model’s observed performance through such means, the result will easily be dismissed as a ‘filler paper’ (水文 - low-quality publication).

Next, we need to address the issue of instantiation. Instantiation is difficult because we usually need to instantiate more than one concept. For example, in Transformers, the implementation of attention needs to balance ‘the attention mechanism itself,’ ‘regularization for ease of training,’ ‘multi-head attention to increase diversity,’ and many other concepts. Behind this, there are also compromises related to memory and computational complexity for long sequences. So, the final result everyone sees is a very cumbersome engineering framework. Therefore, when examining an instantiated model, we should first understand which concepts we’ve integrated, which are more important, or which can be replaced by a two-layer MLP, and understand why doing so is feasible. If it’s just through black-box parameter stacking, the model degradation caused at the instantiation step will be quite severe.
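As a toy illustration of that ‘replace it with a two-layer MLP’ question, here is a sketch in the spirit of MLP-Mixer-style token mixing, which swaps the attention sub-layer for two fully connected layers acting along the sequence dimension; it assumes a fixed maximum sequence length, which is exactly the kind of hidden compromise discussed above.

```python
# An illustrative alternative instantiation: mix tokens with a two-layer MLP
# along the sequence axis instead of attending. Purely a sketch; it trades the
# flexibility of attention for a fixed maximum sequence length.
import torch
import torch.nn as nn

class TwoLayerTokenMixer(nn.Module):
    def __init__(self, seq_len: int, hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(seq_len, hidden)
        self.fc2 = nn.Linear(hidden, seq_len)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); mix along the token axis, not the feature axis.
        x = x.transpose(1, 2)                # (batch, d_model, seq_len)
        x = self.fc2(self.act(self.fc1(x)))
        return x.transpose(1, 2)             # back to (batch, seq_len, d_model)
```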

Next, you need to suspect whether you were wrong from the very beginning – that the operational logic in the realm of the thing-in-itself is not actually like the impressions/concepts/cognitions in your mind. For example, do we really need positional embeddings? Is that unknown ‘something’ really just positional relationships, or is it the decay of attention itself (refer to ALiBi)? Generally speaking, understanding at this level has already moved beyond the realm of ‘filler papers,’ because nothing is more important than a clearer and more accurate cognition of that ‘something.’ But to enhance cognition at this level, one must return to the bottom layer, the observation model, because all cognition is obtained through our observations. This is also why it’s important to keep doing hands-on work; detached from these observations, cognition will gradually be distorted by the cascading degradation of layer after layer of models.
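For the ALiBi example specifically, the alternative cognition can be sketched as a bias added to attention scores that decays linearly with query-key distance, instead of any positional embedding on the input; the slope scheme below follows the ALiBi paper for power-of-two head counts, and everything else is illustrative.

```python
# A minimal sketch of an ALiBi-style bias: attention scores are penalized in
# proportion to how far apart the query and key are, so attention itself decays
# with distance and no positional embedding is added to the input.
import torch

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    # Per-head slopes 2^(-8/n), 2^(-16/n), ... (the ALiBi scheme for power-of-two head counts).
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    positions = torch.arange(seq_len)
    # distance[i, j] = j - i; only non-positive entries matter for causal attention.
    distance = positions[None, :] - positions[:, None]
    # Returns (n_heads, seq_len, seq_len), to be added to raw scores before the softmax.
    return slopes[:, None, None] * distance[None, :, :].clamp(max=0)

# Usage (illustrative): scores = q @ k.transpose(-2, -1) / d_head**0.5 + alibi_bias(t, n_heads)
```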

Having said this, it seems I’ve probably completed another sermon on connectionism. And my friend, after listening, had a face beaming with happiness and said, “Thank you, I’ll go find my goddess right now and tell her how to buy a big house in California.”

I closed ‘Being and Time’ in my hands, looked up at him, and said: “I was just teaching you that your cognition and instantiation of your goddess are problematic.”

My friend’s train of thought was suddenly pulled back from the big house in California. With a look of utter bewilderment, he reacted: “Me? Say what now?!”

I nodded seriously: “I’m talking about you.”



