When Your Model is as Frustrating as Your Life (DeepSeek Translated Version)

I have a friend.

He once confided in me, deeply troubled, that his goddess kept ignoring him because she was busy tuning parameters and training models, and that those models were currently performing in a deeply frustrating way. He said the situation left him feeling even worse than the model left her, because the reason she gave read like an official red-letterhead government document, instilling a sense of dread. Over time, he had grown afraid even to look at her profile picture, worried that a message from him might disrupt her hyperparameters and thereby jeopardize her experimental results, her paper, her graduation, her million-dollar job offer, and her California mansion.

After hearing this, I calmly sat on the edge of my bed, pulled out Being and Time from the box where I keep my sleeping pills, flipped to the bookmarked page, and stared at each symbol on the page with the focus of someone doing cross-stitch. After a long silence, amid my friend’s uneasy breathing, I slowly said:

When you say a model is frustrating, you’d better first clarify which model you’re talking about. This system of distinctions isn’t something I invented on my own, but let me walk you through it again. Of course, you’re free to question it; it’s only a heuristic methodology, and ideally you’d derive something better yourself. In short, for a researcher whose mathematical prowess is limited to black-box deep learning, the “model” you’re referring to can carry five different meanings. Let’s use NLP tasks as the example:

\(\mathcal{M}_D\) (Ding an sich): Borrowing Kant’s concept of the noumenon, this model is an objective, fundamental existence independent of all our observations and understanding. In NLP terms, we can imagine that behind the world of NLP lies a perfect model governing the logic of all NLP tasks. We don’t know what this model is, but we can observe phenomena in the NLP world—for example, we can predict the next token in a sequence. However, these observations still don’t tell us what this perfect model actually is. This unknown “something” that influences our perception is what we call the noumenon model.

\(\mathcal{M}_P\) (Perception): Although we can’t directly use a model we don’t understand, we can form cognitive impressions or conceptualizations of its underlying logic based on our senses. In other words, this second-layer model is an empirical approximation of the first-layer unknown model. Returning to NLP, this approximation might be called “Attention is all you need”—meaning the attention mechanism is our way of modeling the perfect but unknown “something” behind NLP. Obviously, this cognitive model is shaped by human perception and experience, making it a degraded version of the noumenon model.

\(\mathcal{M}_I\) (Instantiation): Once we’ve established a cognitive model, the next step is to instantiate it, that is, to translate the generalized concept into something operable and computable. Back to NLP: after deciding that “Attention is all you need,” we start writing the code for a transformer. We implement attention as a multi-head, normalized softmax over the Q, K, V projections, stack fully connected layers to increase model capacity, and so on. Clearly, we don’t always know why we make these choices: some come from empirical induction, but most are heuristic guesses. Thus, an instantiated model is a further degradation of our cognitive model.
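To make the degradation tangible, here is a minimal sketch of such an instantiation, assuming PyTorch; the head count, hidden sizes, and pre-norm arrangement are illustrative guesses of exactly the kind this paragraph describes, not anyone’s actual model.

```python
# A minimal sketch of the "instantiation" step, assuming PyTorch.
# Every concrete choice here (8 heads, 2048-wide feed-forward, pre-norm
# residuals) is a heuristic layered on top of the bare idea of attention.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        # Multi-head softmax attention over Q, K, V projections.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stacked fully connected layers to add capacity.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual attention: one of many arrangements that happen to work.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.ff(self.norm2(x))

x = torch.randn(2, 16, 512)            # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)     # torch.Size([2, 16, 512])
```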

\(\mathcal{M}_R\) (Reachable): Once we have an instantiated model framework, its performance ceiling is essentially fixed, because the best it can do is approximate our cognitive model. However, we can’t actually reach this ceiling, because we can only initialize parameters, not enlighten them. So we always need a training process. In other words, after git clone-ing a transformer’s code, we start gathering data, tuning hyperparameters, etc. (Obviously, a model’s performance hinges heavily on this.) After doing everything we can, we obtain a reachable model—which, again, is a degraded version of the instantiated model.
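For concreteness, the reachability stage tends to look like the sketch below, again assuming PyTorch; the model, data loader, learning rate, clipping value, and epoch count are all placeholders, which is precisely the point.

```python
# A minimal sketch of the "reachability" stage, assuming PyTorch.
# `model` and `train_loader` are placeholders; every number is an alchemical knob.
import torch
import torch.nn as nn

def train(model, train_loader, epochs: int = 3, lr: float = 3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        for tokens, next_tokens in train_loader:         # next-token prediction pairs
            logits = model(tokens)                        # (batch, seq, vocab)
            loss = loss_fn(logits.flatten(0, 1), next_tokens.flatten())
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # one more heuristic
            optimizer.step()
        print(f"epoch {epoch}: loss {loss.item():.4f}")
    return model   # the reachable model, a further degradation of the instantiated one
```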

\(\mathcal{M}_O\) (Observation): Finally, even after training your model, you still need a way to evaluate its performance. Typically, you’d use a series of widely accepted, “axiomatic” tasks to assess it. But fundamentally, this is just high-probability inductive reasoning—we infer whether the model is good or bad based on its performance across multiple tasks. In reality, even if we exhaust all available tasks, we can’t fully capture every capability of the model. Maybe the random seed used during initialization makes the model exceptionally good at describing your frustrating life—but you’d never know, because you’re only evaluating an observation-based model, which is yet another degradation.
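Sketched as code, with placeholder task names and metric callables rather than any real benchmark harness, the observation stage amounts to this: \(\mathcal{M}_O\) is only the finite set of scores you chose to collect, not the model’s full behavior.

```python
# A minimal sketch of the "observation" stage. Task names and metrics are
# stand-ins; real ones would run SuperGLUE-style suites.
from typing import Callable, Dict

def observe(model, benchmarks: Dict[str, Callable]) -> Dict[str, float]:
    # High-probability induction: a handful of numbers stands in for "capability".
    scores = {name: metric(model) for name, metric in benchmarks.items()}
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores

suite = {
    "task_a": lambda m: 0.71,   # placeholder metric
    "task_b": lambda m: 0.64,   # placeholder metric
}
print(observe(model=None, benchmarks=suite))
```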

So, when you find your model’s performance frustrating, you should mentally retrace this reasoning in reverse. Our ultimate goal is to approximate the noumenon model, but every step in this process introduces degradation.

Is my model’s poor performance due to flawed observation methods? Should I design new experiments to uncover its latent capabilities? Or should I treat current evaluation methods as the sole benchmark? For example, weren’t all those Chinese LLMs crushing GPT-3 on SuperGLUE? Yet ChatGPT isn’t something you’d evaluate with those datasets. If no suitable evaluation exists, can I propose a better one before judging my model?

If observation isn’t the issue, the next step is to examine reachability—is the data sufficient? Is it noisy? Are the optimizer and its settings correct? Is training long enough? This is often mockingly called “alchemy.” But in truth, most people never move beyond this stage. The worst mistake is modifying instantiation or even perception without fully optimizing reachability. I’ve seen too many students randomly tweak architectures, add losses, or invent concepts without addressing reachability first. Don’t gamble on this—even if you boost observed performance, the result will likely be dismissed as low-quality work.
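If you want that discipline in executable form, here is a hedged sketch of “exhaust reachability first”; `build_model`, `fit`, and `score` are placeholder callables standing in for whatever training and evaluation code you already have, and the grid values are illustrative.

```python
# A sketch of sweeping the boring knobs (seed, learning rate, training length)
# before touching the architecture or inventing a new loss.
import itertools
import torch

def sweep_reachability(build_model, fit, score):
    results = []
    for seed, lr, steps in itertools.product([0, 1, 2], [1e-4, 3e-4, 1e-3], [10_000, 50_000]):
        torch.manual_seed(seed)
        model = fit(build_model(), lr=lr, steps=steps)
        results.append(((seed, lr, steps), score(model)))
    # Only if the best of these is still frustrating have you earned the right
    # to blame the instantiation rather than reachability.
    return max(results, key=lambda r: r[1])
```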

Next, we address instantiation. This is hard because we’re usually instantiating multiple ideas at once. For example, a transformer’s attention mechanism must balance “attention itself,” “regularization for trainability,” “multi-head diversity,” and so on—not to mention computational constraints. The result is a complex engineering framework. So when reviewing an instantiated model, first understand which ideas were merged, which are critical, and which could be replaced with, say, a two-layer MLP. Blindly stacking black-box parameters will severely degrade the model at this stage.
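One hedged way to audit an instantiation is to make the mixing module pluggable and swap ideas in and out. Below is a minimal PyTorch sketch in which the two-layer MLP mentioned above stands in for self-attention; the class names are mine and it is an ablation harness, not a recommendation.

```python
# Same residual skeleton, different idea inside: which merged ideas does the
# task actually need?
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """The 'attention itself' idea, wrapped so it is swappable."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class PerTokenMLP(nn.Module):
    """A two-layer MLP applied to each position independently: no cross-token mixing."""
    def __init__(self, d_model: int = 512, d_hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Block(nn.Module):
    """Shared residual skeleton; only the mixer changes."""
    def __init__(self, mixer: nn.Module, d_model: int = 512):
        super().__init__()
        self.mixer = mixer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))

baseline = Block(SelfAttention())
ablation = Block(PerTokenMLP())
```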

Now you might question whether your initial perception was wrong. Maybe the noumenon’s logic isn’t what you imagined. For instance, do we really need positional embeddings? Is the unknown “something” just positional relationships, or is it attention decay (see: ALiBi)? Those who grasp this level rise above low-quality research, because nothing is more important than a clearer, more accurate perception of that “something.” But improving this perception requires returning to observation—all cognition stems from observation, which is why hands-on work is essential. Without it, cognition degrades through each layer of modeling.
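For the ALiBi example, the perceptual shift is from “positions are features to embed” to “attention should simply decay with distance.” A hedged sketch of that bias follows, assuming PyTorch and the paper’s power-of-two slope recipe, with the causal mask and the rest of the attention computation omitted.

```python
# Sketch of the ALiBi idea (Press et al.): add a linear distance penalty to the
# attention scores instead of adding positional embeddings to the inputs.
import torch

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    # Head-specific slopes forming the paper's geometric sequence
    # (exact when n_heads is a power of two).
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = how far key j lies behind query i (0 on and above the diagonal).
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    # Shape (n_heads, seq_len, seq_len); added to the QK^T scores before softmax.
    return -slopes[:, None, None] * distance

scores = torch.randn(8, 128, 128)        # raw attention scores per head
scores = scores + alibi_bias(128, 8)     # "attention decay" with distance
weights = scores.softmax(dim=-1)         # causal mask omitted for brevity
```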

At this point, I seemed to have completed another sermon on connectionism. My friend, now beaming, said, “Thank you! I’ll go tell my goddess how to buy a mansion in California.” I closed Being and Time, looked up, and said:

“I was just teaching you that your perception and instantiation of ‘girlfriend’ are both flawed.”

His mind was abruptly yanked back from California mansions. Stunned, he stammered:

“Me…? What are you talking about?”

I nodded gravely: “Yes, you.”



