Vision Large Models Are Utterly Useless (Gemini 2.5 Pro Translated Version)

Recently, due to various coincidences, I have been repeatedly asked a question I find deeply off-putting: how many parameters does your vision model have? My Ph.D. training lets my expression stay in that "1940 map of Europe without Poland" state (i.e., completely impassive), but at the root of it, the sheer meaninglessness this question exudes has long since hardened into a rational aversion in me.

I. The Gap Between 22B and 175B is Approximately 175B

First, I don't deny that as the parameter count of vision models grows, their raw numbers on traditional vision tasks improve accordingly. The best recent example is Google's ViT-22B, even though getting it to train required a small amount of non-open-source data and conservative "alchemy" (i.e., meticulous, perhaps arcane, empirical tuning). None of this stops the community from staying broadly optimistic and believing that computer vision is still worth a large model. Yet in my view, this largest vision "afterlife model" (so large it feels like a send-off for the field: impressive in scale, not in impact), whose presence is as thin as its arXiv page, has if anything proven that "piling up parameters and data volume is meaningless for pure vision."

Think about it for a moment: the training volume of this model reached (JFT-)4B images * 256 tokens/img * 3 epochs, approximately 3T tokens. That is already about double the training volume of the largest LLaMA model. Yet ViT-22B yielded no truly meaningful conclusions, apart from a "three to five pecks" bump (i.e., a tiny amount) on the benchmarks of the "classic old world" (the older, standard datasets) and the rather vacuous finding that the model is biased toward shape (sketch) rather than texture. With a conclusion whose ROI is comparable to a spring/summer counteroffensive (massive effort for negligible gain), will there be another large model in vision? I don't think so. But don't forget: the gap between 22B and 175B is approximately 175B.
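For the record, the arithmetic behind that figure is simply:

$$
4\times10^{9}\ \text{images}\times 256\ \tfrac{\text{tokens}}{\text{image}}\times 3\ \text{epochs}\approx 3.07\times10^{12}\ \text{tokens}\approx 3\text{T},
$$

which is indeed roughly twice the ~1.4T tokens used to train the largest LLaMA.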

II. Vision Models Remain Dull Even When Large

Before moving on to the metaphysical content in the next section, I will first, from an empiricist perspective, recount three “dark clouds hanging over modern computer vision.” Of course, these problems have not gone unnoticed, as every year there is a continuous flood of related papers milking these issues for publications/citations—which indeed proves they are difficult to solve.

The first to be discussed is naturally the clichéd problem of adversarial attacks. To be fair, although the NLP field also has this problem, it has largely been ignored since NLP entered the era of (generative) LLMs. In contrast, in the CV field, this problem haunts every unlucky reviewer like a vengeful spirit, because no one has stepped up to write a paper claiming that a large model could put an end to this area of research. Moreover, nowadays, no one is even willing to test the adversarial attack vulnerability of these large vision models, because people have generally accepted the idea that vision models should be attackable (i.e., it’s an inherent, accepted flaw).

Secondly, there is the "elephant in the room" problem, raised in 2018 to question the effectiveness of detection models and algorithms. The gist: even if I photoshop an elephant into a picture of an ordinary room, our "intelligent" models will happily detect it with IoU > 0.9, however absurd that is from the standpoint of lived experience. This is really a philosophical problem that could shake the very foundations of CV, but solving it does not improve an algorithm's numbers on COCO (it might even hurt them), so people simply treat it as, well, the elephant in the room.

Finally, the third dark cloud is more practical; I like to call it the problem of the disordered sample space. That is, visual information fails to form a systematic structure in the feature space of samples. Here, "visual information" means the abstractable semantics present in images, and "systematic structure" means the relationships between those semantics. Even with advanced self-supervised training techniques that let certain kinds of objects "huddle together for warmth" (i.e., cluster well, as reflected by clean t-SNE separation) without ever knowing their semantics, from a semantic point of view a very small r-neighborhood of a space shuttle can still contain both an orange tabby cat and a jam sandwich. On the bright side, this boosts creativity built on visual features. On the downside, it forces vision researchers to keep tailoring visual features to performance on specific benchmarks.
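To make the r-neighborhood complaint concrete, here is a minimal sketch of how one would probe it: pull features from a self-supervised backbone and list the labels of each query's nearest neighbors. The timm model tag, images, and labels below are placeholders chosen for illustration, not the setup discussed in this post.

```python
# Minimal sketch: probe the local neighborhood of self-supervised features.
# Model tag, images, and labels are illustrative placeholders.
import torch
import timm
from sklearn.neighbors import NearestNeighbors

model = timm.create_model("vit_base_patch16_224.dino", pretrained=True, num_classes=0)
model.eval()

images = torch.randn(100, 3, 224, 224)            # swap in real, normalized images
labels = [f"class_{i % 10}" for i in range(100)]  # swap in real labels

with torch.no_grad():
    feats = model(images).cpu().numpy()           # (N, D) pooled features

nn_index = NearestNeighbors(n_neighbors=6, metric="cosine").fit(feats)
_, idx = nn_index.kneighbors(feats[:5])           # neighbors of the first 5 queries

for q, neighbors in enumerate(idx):
    # If feature space were semantically ordered, these labels would agree;
    # the "disordered sample space" complaint is that often they do not.
    print(labels[q], "->", [labels[j] for j in neighbors[1:]])
```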

III. Knowledge in NLP and CV

First, we need to confirm two concepts. Since the precise definition of these two concepts has long been a tangled issue, here we adopt the general definitions provided by ChatGPT:

Representation: In philosophy of mind and epistemology, “representation” refers to a mental state or entity that stands for something. The definition of representation implies that our minds have the ability to refer to or express objects, ideas, or situations existing in the world. Representations can take various forms, including mental images, beliefs, thoughts, or linguistic symbols, and can be seen as intermediaries between the mind and the external world, enabling us to have knowledge, perception, and understanding of the world.

Concept: On the other hand, in epistemology, a “concept” is an abstract or general mental representation that includes a type or class of things, events, or ideas. Concepts are the basic building blocks of thought and language, enabling us to categorize and organize our experiences and knowledge. Compared to a single representation, concepts are more abstract and general in scope. Concepts are formed through processes of abstraction and generalization, where we identify common features or properties across multiple instances and create a mental category to represent these shared characteristics.

To put it bluntly, both “representation” and “concept” involve expressing and conveying certain information or meaning, but their nature and origin are different. Representation can be seen more as a psychological-level phenomenon, while a concept can be considered a product at the thinking level.

In fact, I believe that after understanding these two concepts, the knowledge acquired by NLP and CV models at the cognitive level becomes easy to distinguish and comprehend. I believe that within the framework of language (especially generative) models, the data we provide to train a model essentially exhibits Relations of Concepts (here I borrow Hume’s idea, but unlike pure deduction, Relations in language models are still obtained through induction). Beyond this, for the understanding of a Concept itself, language models also gain an indirect cognition through its associations with other Concepts, without a direct understanding akin to human mental comprehension (especially for concepts like “time”). In other words, what language models learn are numerous Representations formed by the associations between Concepts, as well as more profound and complex representations presented by the associations between these representations. The representations that exist or can be understood in the human (linguistic) world are as numerous as stars; we need trillions of parameters to memorize these representations and understand their interrelations (or, an N-gram where representations are the quanta).

In contrast, for vision models, within (almost all) supervised or self/unsupervised training frameworks, the optimization objective is essentially to transform specific concrete data into a specific unique Representation, and then, through specific vision tasks (i.e., human experience), to ultimately abstract these representations into Concepts. In other words, it’s a process of learning Matters of Fact. Let me further explain why specific vision tasks are needed. Most people with experience training (self-supervised) large vision models will have this understanding: it’s very difficult to evaluate the effectiveness of vision models. People just choose the ones with good performance on k-NN/Linear Probe/Attentive Probe/Finetune (classification or other downstream tasks) to write papers. But the essential reason here is that without human experience to help the vision model perform abstraction, an algorithm won’t proactively do such abstraction (it has no Loss function or obligation to do so); it only needs to map data to representations. Conversely, if the path to abstracting Concepts is already considered in the algorithm’s design, then the model’s performance on Linear Probe will naturally be better (e.g., Yann LeCun’s I-JEPA, which will be discussed later).
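For readers unfamiliar with the protocol, "Linear Probe" amounts to the following: freeze the backbone, extract features once, and fit a linear classifier on top of them. A minimal sketch, with a placeholder backbone name and random tensors standing in for a real labeled dataset:

```python
# Minimal linear-probe sketch: frozen backbone, trainable linear head only.
# Backbone name and data are placeholders for illustration.
import torch
import torch.nn as nn
import timm

backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False            # the probe never updates the representation

num_classes = 1000                     # placeholder label space
probe = nn.Linear(backbone.num_features, num_classes)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(images, targets):
    with torch.no_grad():
        feats = backbone(images)       # (N, D) pooled features, gradient-free
    loss = nn.functional.cross_entropy(probe(feats), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy call with random tensors; in practice iterate over a labeled dataset.
print(probe_step(torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,))))
```

The point of the protocol is exactly the one made above: the human-labeled downstream task supplies the abstraction step that the self-supervised objective itself never asks for.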

IV. Parameter Count Can’t Solve Conceptual Abstraction

Now, let’s return to that damned parameter count. For current vision tasks, the parameter count of corresponding models probably cannot be the main indicator for a “ChatGPT moment” in large vision models. Back to those three dark clouds over modern computer vision:

Parameter count can only yield marginal gains in the strength of concept induction. Generally, when vision models induce concepts from representations, they cannot escape a general notion of clustering. In other words, we have prior knowledge that “representations of the same concept will cluster together,” e.g., a Gaussian Prior, and then we obtain posterior results through the actual distribution of representations. However, the benefit of increasing parameter count to more precisely express the position of representations in vector space is destined to become marginal, as parameter count does not have a linear relationship with error. So much so that from a certain point, with a limited number of samples, errors originate more from the prior distribution, and ultimately, increasing parameter count only manifests as that 0.1% performance boost on ImageNet. Besides, the side effect of increased parameter count—higher vector space dimensionality—also increases the number of representations needed to induce a concept (curse of dimensionality), thereby providing more space for adversarial examples to exist.
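To spell out the "clustering prior" in the simplest possible terms, here is a toy nearest-class-mean classifier over frozen features: each concept is modeled as a blob around its mean (an equal-covariance Gaussian-style prior), and whatever structure that prior misses becomes error that no amount of backbone parameters will remove. The arrays are random placeholders.

```python
# Toy illustration of the "representations of one concept cluster together" prior:
# model each class as a blob around its mean and classify by nearest mean.
# Random arrays stand in for frozen backbone features.
import numpy as np

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 768))        # placeholder frozen features
train_labels = rng.integers(0, 10, size=1000)     # placeholder labels
test_feats = rng.normal(size=(100, 768))

class_means = np.stack(
    [train_feats[train_labels == c].mean(axis=0) for c in range(10)]
)

# Nearest-class-mean prediction: one Gaussian-like blob per concept.
dists = ((test_feats[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=-1)
preds = dists.argmin(axis=1)
print(preds[:10])
```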

Parameter count cannot, independent of induction, know the existential logic between representations or concepts. Theoretically, neural networks can perform Universal Approximation, and this ability strengthens with increasing parameter count. In fact, from my personal experience, a model with more parameters can generally use a single feature vector to represent a relatively complex scene, such as a plate holding various fruit ornaments. However, we cannot provide high-quality human experience to conceptualize such things. In this example, we can usually only provide concepts like “still life,” “fruit plate,” or “a plate holding various fruit ornaments,” or combinations thereof. However, such concepts are too abstract for the representations the model can actually express. The model usually can only statistically generalize quantities of more primary representations to compose such highly complex representations. So, lacking guidance from data that can enumerate associations like “AND,” “OR,” “NOT,” etc., the model naturally won’t proactively categorize a complex representation into entirely different concepts just because a primary representation is present or absent. For example, lacking data, a model is unlikely to classify an image of a “forest” as a “park” just because there’s a bicycle in it, or classify it as “green mountains” just because a river is missing.

Parameter count also cannot solve the problem of disordered sample space, thus preventing the model from spontaneously learning new concepts arising from associations between concepts, in the absence of human experience. In natural language, concepts themselves have good hierarchical association structures. Therefore, representations formed by the association of concepts or representations (e.g., a sentence) also acquire a structural expression. Based on this structural expression, we can easily continue to create new concepts or representations. In contrast, visual information lacks such hierarchical association structures. Visually similar representations can correspond to completely different concepts (e.g., the letter ‘l’ and the number ‘1’). So, in the absence of human experience (self/unsupervised), it’s difficult to spontaneously form structural expressions in the (representation’s) feature space, and consequently, new concepts cannot be formed spontaneously. A prominent example is that before the emergence of cross-modal Align models, almost no vision algorithm could perform generalized zero-shot learning. Even on specific domains (CUB birds or Oxford flowers), zero-shot performance was very poor. While large language models can draw unicorns by purely interpreting linguistic concepts, a large vision model probably will never recognize “a destroyed Leopard 2 tank” (unless the Russkies can provide it with enough human experience).
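For contrast, here is what cross-modal alignment buys you: zero-shot classification where the candidate concepts come entirely from text prompts. A minimal sketch using the open_clip package; the model choice, prompts, and random input are illustrative only.

```python
# Minimal zero-shot classification sketch via image-text alignment (CLIP-style).
# Model, prompts, and the random input tensor are illustrative placeholders.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_prompts = ["a photo of a forest", "a photo of a park", "a photo of green mountains"]
text_tokens = tokenizer(class_prompts)

@torch.no_grad()
def zero_shot(image_tensor):                       # (1, 3, 224, 224), already preprocessed
    img = model.encode_image(image_tensor)
    txt = model.encode_text(text_tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)  # the class "concepts" live on the text side
    return dict(zip(class_prompts, probs[0].tolist()))

# Placeholder input; in practice use preprocess(PIL.Image.open(...)).unsqueeze(0).
print(zero_shot(torch.randn(1, 3, 224, 224)))
```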

V. Naive Cross-Modality Alignment Doesn’t Solve Fundamental Issues

Anyone who has played with Stable Diffusion will have noticed this phenomenon: most prompts read like incantations of independent words, while scenes described in genuine natural language are hard to render accurately. Anyone who has worked directly with CLIP will have noticed this: the vast majority of image-text matching scores (for positive and negative samples alike) fall within a fairly narrow band, and no intuitive threshold cleanly separates match from mismatch (a quick way to check this yourself is sketched after the two points below). Setting aside issues of how the models were trained, this exposes two problems:

The number of Concepts that the visual end can accurately learn is limited, and they are mostly simple word/phrase-level Concepts. In fact, when I was actually training a Chinese CLIP, I encountered a situation where 800,000 out of the top 1 million learned Concepts were personal names (yes, the proportion of names becomes even higher further down). And this actually conforms to real-world distribution. Gentlemen, you can stand up right now and describe the objects you see using language; you’ll find that the relatively simple visual Concepts encountered in real life are quite scarce, while names of people around you are more common.

It’s not that the visual end cannot learn a relatively complex representation, but the visual end cannot generalize such complex representations into concepts like the textual end can. This leads to discrepancies in matching between the visual and textual sides. In reality, the data in training sets that allows the visual end to generalize a concept corresponding to a complex representation is limited. That is to say, even with the help of language, a large vision model will probably never recognize “a destroyed Leopard 2 tank” (unless the Russkies can provide it with enough human experience).
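The narrow-band score behavior mentioned at the start of this section is easy to verify yourself: compute CLIP-style cosine similarities for matched and deliberately shuffled image-text pairs and compare their ranges. A minimal sketch with placeholder inputs (open_clip again; the model tag is illustrative):

```python
# Quick check of the narrow score-range observation: matched vs. shuffled pairs.
# Inputs are placeholders; use preprocessed images and real captions in practice.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def pair_scores(images, captions):                  # images: (N,3,224,224); captions: list[str]
    img = model.encode_image(images)
    txt = model.encode_text(tokenizer(captions))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    pos = (img * txt).sum(dim=-1)                   # matched pairs
    neg = (img * txt.roll(1, dims=0)).sum(dim=-1)   # shuffled, i.e. mismatched pairs
    return pos, neg

pos, neg = pair_scores(torch.randn(8, 3, 224, 224),
                       [f"placeholder caption {i}" for i in range(8)])
print("matched:   ", pos.min().item(), "to", pos.max().item())
print("mismatched:", neg.min().item(), "to", neg.max().item())
```

On real data the two ranges typically overlap heavily, which is exactly why no single threshold cleanly separates them.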

So why has "Locked-image Tuning" (LiT) become a comparatively sensible way to train CLIP? Because using language to abstract the representations the visual side has already learned into concepts is far easier than using vision to understand complex representations composed of multiple abstract concepts in language.
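For concreteness: in Locked-image Tuning the image tower is frozen ("locked") and only the text tower is trained with the usual symmetric contrastive loss, so language learns to name what the vision model already represents. A minimal sketch with toy stand-in encoders rather than the real LiT setup:

```python
# Minimal LiT-style sketch: frozen image tower, trainable text tower,
# symmetric contrastive (InfoNCE) loss. Encoders are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTower(nn.Module):                 # stand-in for a real image/text encoder
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

image_tower = TinyTower(in_dim=768)         # pretend these are pretrained vision features
text_tower = TinyTower(in_dim=512)

for p in image_tower.parameters():          # "lock" the image side
    p.requires_grad = False

opt = torch.optim.AdamW(text_tower.parameters(), lr=1e-4)
logit_scale = 100.0                         # fixed temperature for the sketch

def lit_step(image_feats, text_feats):
    with torch.no_grad():
        img = image_tower(image_feats)      # frozen embeddings
    txt = text_tower(text_feats)
    logits = logit_scale * img @ txt.T      # (N, N) similarity matrix
    targets = torch.arange(img.size(0))
    loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(lit_step(torch.randn(16, 768), torch.randn(16, 512)))
```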

VI. To Convert or Not to Convert to Language Models

Finally, let’s discuss the way forward for scaling up vision models. Of course, we must exclude models purely aimed at learning representations, such as Go or weather models. The capabilities of these models themselves lie in learning mysterious representations that humans cannot abstract with language. As the data scale of these models increases, the parameter count naturally needs to scale up to enhance representational ability (e.g., on a 99x99 Go board, a model with hundreds of billions of parameters will theoretically perform better than one with tens of billions).

Current academia has broadly found two ways out. One is for vision models to convert entirely to large language models (e.g., Google's PaLM-E and this year's CVPR best paper, Visual Programming), returning vision models to the job of supplying visual representations for concepts while the large language model handles the interpretation of the more complex ones. The vision model's parameters can then be spent learning representations that humans cannot abstract with language (e.g., depth maps, optical flow, hyperspectral signals), patching some of the language model's weaknesses in spatial reasoning and bringing it closer to real-world AGI. However, this mode still relies on the "Russkies" to destroy tanks and on harvesting vast amounts of data to supply human experience. The road ahead is like Louis XVI at age 39: no end (or head) in sight.

Another line of thought is reflected in Yann LeCun’s I-JEPA and Fei-Fei Li’s SiamMAE, where we forcibly make vision models understand the associations between representations. This task itself is not particularly difficult for ViT models with attention mechanisms. The biggest advantage of doing this is that it can partially solve the aforementioned second and third dark clouds. However, because these solutions emphasize individual representations, they find it relatively difficult to learn complex representations formed from multiple representations and concepts, which is particularly evident in poorer finetuning performance on ImageNet (ImageNet has some relatively complex scene categories). And in fact, the current vision academia is not very tolerant of algorithms with mediocre performance. Any bit of open exploration will be relentlessly harassed by reviewers speaking Chinglish. New algorithms seem trapped in an endless black hole, struggling to be born.
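For readers who have not looked at these methods, the following is a deliberately oversimplified sketch of the JEPA-style objective: predict the representation of one region/view from another in latent space instead of reconstructing pixels. It is not the actual I-JEPA or SiamMAE code; the encoders, shapes, and missing EMA update are simplifications.

```python
# Oversimplified JEPA-style objective: predict target-region representations
# from context-region representations in latent space (no pixel reconstruction).
# Not the real I-JEPA/SiamMAE; encoders and shapes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
context_encoder = nn.Linear(768, dim)      # stand-in for a ViT over context patches
target_encoder = nn.Linear(768, dim)       # stand-in for the (typically EMA) target encoder
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def jepa_step(context_patches, target_patches):
    with torch.no_grad():                  # targets are representations, not pixels
        targets = target_encoder(target_patches)
    preds = predictor(context_encoder(context_patches))
    loss = F.mse_loss(preds, targets)      # match representations to representations
    opt.zero_grad()
    loss.backward()
    opt.step()
    # In the real methods the target encoder tracks the context encoder via EMA.
    return loss.item()

print(jepa_step(torch.randn(16, 768), torch.randn(16, 768)))
```

The emphasis on relating individual representations is what buys traction on the second and third dark clouds, at the cost of the weaker complex-scene performance noted above.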

Of course, if you’re interested in how I answer the question “How many parameters does your vision model have?”, I generally say coolly, “ViT-B, 88M. They won’t let us deploy anything bigger.”



