From CV to LLM: Please Understand GPT Through the Worldview of Reinforcement Learning (Gemini 2.5 Pro Translated Version)

Lately, I’ve been tormented to the point of being somewhat fed up. People doing face segmentation, detection, and OCR have, one by one, suddenly transformed into multimodal large language model experts, and then they proceed to critique me using a Computer Vision (CV) worldview from the last decade. I actually empathize a lot with these folks living in their large flats in first-tier cities, because apart from not having a large flat myself, I too have repeatedly jumped from traditional CV and traditional Machine Learning (ML) into this field. And when I first arrived, I felt quite insecure, because even the mathematical notation I was accustomed to had different conventions in the world of Reinforcement Learning (RL). Fortunately, I have a naturally self-deprecating personality and would grab newly graduated campus recruits to teach me the details. These people with their large flats are different, though; the psychological superiority that comes with a large flat prevents them from lowering their arrogant heads.

To put it simply, an LLM is, in a sense, this game: 1 2 4 8 16 (?). Some will fill in 32, some 31, and others will give sound reasoning for filling in 114514. All of these answers can be correct under different circumstances, so you just pick one based on your mood. Say you’ve filled in 32 because it seems most likely to help you pass an administrative aptitude test. The game then becomes: 1 2 4 8 16 32 (?), and you continue by filling in 64. But then you hesitate. You start to wonder whether someone with dyscalculia can even understand three-digit numbers, or whether continuing this cycle will eventually turn into an exponential Ponzi scheme. So, after much hesitation, you write <|endofresponse|>, and the game ends there.
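For readers who would rather see the game in code than in prose, here is a minimal sketch of that loop, in Python/PyTorch flavor. The names `model`, `prompt_ids`, and `eor_id` are purely illustrative and not any particular library’s API; all that matters is the shape of the loop: state in, action out, stop when the end token is chosen.

```python
import torch

def play_the_game(model, prompt_ids, eor_id, max_new_tokens=64, temperature=1.0):
    """Continue the sequence one token at a time until <|endofresponse|> is sampled.

    `model` is assumed to map a (1, T) tensor of token ids to (1, T, vocab) logits.
    """
    ids = prompt_ids.clone()                               # current state, e.g. the tokens of "1 2 4 8 16"
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                      # scores over every possible next action
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # 32? 31? 114514? pick one
        ids = torch.cat([ids, next_id], dim=1)             # the action becomes part of the next state
        if next_id.item() == eor_id:                       # the model decides the game is over
            break
    return ids
```

The three training stages described next all act on the model driving this one loop; they differ only in which signal is used to reshape the probabilities.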

The so-called three-stage LLM training prevalent in the market roughly operates on this logic. In the pre-training stage, you find numerous sequences from http://oeis.org (On-Line Encyclopedia of Integer Sequences) to learn their patterns, such as 2 3 5 7 11 (A000040), or 2 5 11 17 29 (A007491), or stranger ones like 1 2 3 5 7 12 (A326083). From this, you grasp the underlying logic of how a sequence is constructed. Now you can write any regular sequence based on your mood. Subsequently, in the instruction fine-tuning stage, you learn how to continue a given sequence with one or more numbers, and to write ‘eor’ (end of response) at the appropriate time to end the game. Finally, in the RLHF (Reinforcement Learning from Human Feedback) stage, you are told you are participating in an aptitude test, so you must reduce the impulse to fill in 114514, even if you have evidence that it’s feasible.
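If it helps to see the division of labor between the first two stages in code: both are next-token cross-entropy, and the real difference is only which tokens the loss is counted on. This is a deliberately simplified sketch with invented function and tensor names, not anyone’s actual pipeline; the third stage is sketched after the numbered list below.

```python
import torch
import torch.nn.functional as F

# Stage 1: pre-training. Next-token prediction over raw sequences (the OEIS phase).
def pretrain_loss(model, token_ids):
    logits = model(token_ids[:, :-1])                      # (B, T-1, vocab), all positions in parallel
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))

# Stage 2: instruction fine-tuning. The same loss, but counted only on the response
# tokens, so the model learns to continue a prompt and to emit <|endofresponse|> on time.
def sft_loss(model, token_ids, response_mask):
    logits = model(token_ids[:, :-1])
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                token_ids[:, 1:].reshape(-1),
                                reduction="none")
    mask = response_mask[:, 1:].reshape(-1).float()        # 1 where the target token belongs to the response
    return (per_token * mask).sum() / mask.sum()
```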

Through the above two crude paragraphs, I hope I have conveyed, in the plainest language possible, what the worldview of reinforcement learning is. If you, coming from CV, can roughly understand this, then you will also surely understand the following points:

  1. Whether in pre-training or instruction fine-tuning, the object GPT operates on is always the next token, or in other words, an action taken from the current state. It is only thanks to the ingenious design of QKV attention and the attention mask that this work can be executed in parallel (a small causal-mask sketch follows this list). This is fundamentally different from how a Vision Transformer (ViT) operates; the latter processes all tokens simultaneously. Therefore, when forcibly stuffing a ViT’s many tokens into a GPT model, one must give some thought to what the action is and what the current state is.
  2. In sampling mode, the token predicted by a GPT model is merely one realization of a possible action. In “folk science” terms, it is as if the probability density function over actions collapses to a specific value upon observation. A ViT’s output embedding, in contrast, is a specific, deterministic embedding. Therefore, the ideal training mode for a GPT model is on-policy sub-optimal sampling, whereas a ViT can be directly SFTed (Supervised Fine-Tuned).
  3. Following the above point, the combinations of tokens a GPT model can predict are already determined in the pre-training phase. Subsequent instruction fine-tuning and RLHF should not give it a new action space. If this unfortunately happens, that data should be moved into the pre-training phase; otherwise, a shift in the action space will most likely leave the model’s capability only partially expressed, or degrade it outright. In my understanding, ViT does not have a similar problem, which is why many papers report that unfreezing ViT weights during the SFT phase improves performance.
  4. A Reward Model (RM) is not a Discriminator, much less a binary classifier for good and bad data. An RM is a weak estimator of whether your action/trajectory satisfies a specific objective (e.g., human preference) in a particular state. Under the current pair-wise training mode, the prerequisite for an RM to work is that all responses are based on the same prompt, which minimizes, as far as possible, the bias introduced by the prompt itself (a pair-wise loss sketch follows this list). Using the earlier example: in the same state “1 2 4 8 16”, the RM deems action=32 more likely than action=114514 to help you pass an aptitude test. But without an identical state, if the RM is handed the sequences “1 2 4 8 16 114514” and “837 130 5 391 3281” directly, it would likely still consider the former more likely to pass an aptitude test. So please don’t fall into the meek thought pattern of “using the RM to clean SFT data.”
  5. RLHF is not equivalent to (SFT loss on positive samples) minus (SFT loss on negative samples), because both the positive and the negative samples come from the same policy model. We are merely evaluating the size of the reward produced by one of its actions, and adjusting the probability of taking that action in the current state accordingly (see the policy-gradient sketch after this list). Of course, RLHF here is still a theoretical and practical sinkhole, but if you fall into the positive-and-negative-samples mindset, then there is nothing left to do on this topic except tune GPT-4o and grind data.
  6. Following the above point, responses sampled from the same prompt should also not be interpreted through a contrastive loss between “positive” and “negative” samples. Even if adding such tricks appears to boost benchmark scores today, I still believe this is not a sound way of thinking, because it cuts the problem off from the core fact that “positive and negative samples originate from the same policy,” turning it instead into “preferential selection among different policies.”
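On point 1, here is a toy illustration of the difference, using plain single-head scaled-dot-product attention over random tensors (illustrative only). The causal mask is what turns one parallel pass over the whole sequence into T separate predictions of “the next action given the state so far”; a ViT attends bidirectionally and has no such notion.

```python
import torch

T, d = 6, 16                                     # sequence length, head dimension
q = torch.randn(T, d)
k = torch.randn(T, d)
v = torch.randn(T, d)
scores = q @ k.T / d ** 0.5                      # (T, T) attention scores

# GPT-style: position t may only look at positions <= t, so the output at t is a
# legitimate prediction of the next token given the state so far, yet all T
# positions are still computed in a single parallel pass during training.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
gpt_out = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1) @ v

# ViT-style: every patch token attends to every other patch token; there is no
# "next token", and hence no built-in state/action decomposition.
vit_out = torch.softmax(scores, dim=-1) @ v
```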
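On point 4, this is roughly what the pair-wise objective looks like: a Bradley-Terry style loss, written against a hypothetical `reward_model` that maps prompt-plus-response token ids to a scalar score. Both responses in a pair continue the same prompt, i.e., the same state, which is exactly where the requirement above comes from.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """Both responses continue the same prompt (the same state "1 2 4 8 16"),
    so the learned score reflects the action rather than the prompt."""
    r_chosen = reward_model(torch.cat([prompt_ids, chosen_ids], dim=1))      # e.g. "... 32"
    r_rejected = reward_model(torch.cat([prompt_ids, rejected_ids], dim=1))  # e.g. "... 114514"
    # Push the preferred continuation's score above the dispreferred one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Notice what this loss never sees: two responses hanging off two unrelated prompts. Scores assigned to sequences from different states are therefore only weakly comparable, which is exactly the trap behind “using the RM to clean SFT data.”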
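On points 5 and 6, the two mindsets can be written out as two toy losses. The first is the pattern being warned against; the second is a bare-bones REINFORCE-style objective (no baseline, no KL penalty, no clipping, far simpler than anything a production RLHF pipeline actually runs). `sample_from` and `reward_model` are hypothetical helpers, not any library’s API.

```python
import torch

def logprob_of(policy, prompt_ids, response_ids):
    """Total log-probability the policy assigns to the response, given the prompt."""
    ids = torch.cat([prompt_ids, response_ids], dim=1)
    logprobs = torch.log_softmax(policy(ids[:, :-1]), dim=-1)        # (B, T-1, vocab)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.size(1) - 1:].sum(dim=-1)          # response positions only

# The mindset to avoid: treat sampled responses as "positive" and "negative" SFT data.
def naive_pos_neg_loss(policy, prompt_ids, good_ids, bad_ids):
    return (-logprob_of(policy, prompt_ids, good_ids)
            + logprob_of(policy, prompt_ids, bad_ids)).mean()

# The RL view: sample from the *current* policy, score that one action with the
# reward model, and re-weight the log-probability of taking it again by the reward.
# `sample_from` is a hypothetical on-policy sampler (the loop in the first sketch
# would do); `reward_model` maps (prompt, response) to a scalar score per sample.
def policy_gradient_loss(policy, reward_model, prompt_ids):
    response_ids = sample_from(policy, prompt_ids)
    reward = reward_model(prompt_ids, response_ids).detach()
    return -(reward * logprob_of(policy, prompt_ids, response_ids)).mean()
```

The second form never needs the notion of a “negative sample” at all: once a baseline is subtracted, a below-average reward simply pushes the probability of that particular action down, without ever reframing the problem as a choice between different policies.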

This is all for now. If I think of anything else to add, I will. Please also offer your criticisms and guidance.



