Vision Foundation Models Are Utterly Useless (DeepSeek Translated Version)

Recently, due to various coincidences, I’ve been repeatedly asked a question that irritates me immensely: how many parameters does your vision model have? My Ph.D. training allows me to maintain an expression as unreadable as the 1940 European map (utterly devoid of Poland), but the deeper I dig, the more conceptually meaningless the question appears, and that has bred in me a rational aversion to it.

I. The Gap Between 22B and 175B Is Roughly 175B

First, I won’t deny that as the parameter count of vision models increases, their absolute performance on traditional vision tasks also improves. A recent prime example is Google’s ViT-22B: even though it needed a small amount of non-open-sourced data and conservative alchemy tricks just to get running, the field has kept its optimistic outlook that computer vision is still worthy of foundation models. In my view, however, this “largest vision model to date,” whose presence is as thin as arXiv’s homepage, ironically proves that scaling parameters and data volume is meaningless for pure vision tasks.

A little reflection reveals that this model was trained on ~3T tokens (JFT-4B × 256 tokens/img × 3 epochs), already roughly twice the training volume of the largest LLaMA model. Yet ViT-22B offers no truly meaningful conclusions beyond marginal gains on legacy datasets and the observation that it is more shape-biased than texture-biased. With an ROI comparable to a failed military campaign, will there be another “big vision model”? I doubt it. And let’s not forget: the gap between 22B and 175B is roughly 175B.
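For the skeptical, the arithmetic behind that claim is trivial to redo. A minimal sketch, assuming the usual round numbers (JFT-4B at ~4e9 images, 256 patch tokens per 224×224 image with patch size 14, ~3 epochs, and ~1.4T text tokens reported for LLaMA-65B):

```python
# Back-of-the-envelope check of the training-volume claim above.
# Assumed round numbers: JFT-4B ~ 4e9 images, 256 patch tokens per 224x224
# image (patch size 14), ~3 epochs; LLaMA-65B reportedly saw ~1.4e12 text tokens.
images = 4e9
tokens_per_image = (224 // 14) ** 2      # 256 patch tokens
epochs = 3

vit22b_tokens = images * tokens_per_image * epochs
llama_tokens = 1.4e12

print(f"ViT-22B visual tokens: {vit22b_tokens:.2e}")                  # ~3.1e12, i.e. ~3T
print(f"ratio vs. LLaMA-65B  : {vit22b_tokens / llama_tokens:.1f}x")  # ~2.2x
```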

II. Vision Models Remain Dull, No Matter Their Size

Before diving into metaphysics, let’s empirically recap three “dark clouds looming over modern computer vision.” These issues aren’t ignored—they attract endless papers yearly, all failing to solve them (which ironically confirms their intractability).

Adversarial Attacks: While NLP also faces this, the advent of (generative) LLMs has largely sidelined the issue. In CV, however, it haunts every unfortunate reviewer like a vengeful ghost, because no one has dared to claim that a foundation model can end this problem. Worse, nobody even bothers testing vision foundation models for adversarial robustness anymore—it’s now accepted that vision models are meant to be broken.
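For anyone lucky enough never to have reviewed such a paper, the canonical attack fits in a dozen lines. Below is a minimal FGSM (fast gradient sign method) sketch; `model`, `image`, and `label` are assumed placeholders for any differentiable classifier and an input scaled to [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, label, eps=4 / 255):
    """Fast Gradient Sign Method: one signed-gradient step on the input."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)   # loss w.r.t. the true label
    loss.backward()
    adv = image + eps * image.grad.sign()         # push pixels uphill on the loss
    return adv.clamp(0, 1).detach()               # keep a valid image
```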

The Elephant in the Room: Proposed in 2018 to question detection models’ validity, this problem highlights that even when an elephant is photoshopped into an ordinary room, our “brilliant” models will happily detect it with IoU > 0.9, even though the scene is empirically absurd. It’s a philosophical issue threatening CV’s foundations, but since solving it won’t boost COCO metrics (or might hurt them), everyone just treats it like the elephant in the room.

Disorder in Sample Space: Visual information lacks systematic structure in feature space. Even state-of-the-art self-supervised learning merely clusters objects without understanding their semantics: the features separate cleanly under t-SNE, yet a tiny r-neighborhood around a space shuttle might still include orange tabbies or jam sandwiches. Optimistically, this boosts creativity; pessimistically, it forces researchers to chase benchmark gains rather than meaningful features.
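If you want to reproduce the disappointment yourself, a sketch like the following is enough; an ImageNet ResNet-50 stands in for whatever SSL backbone you prefer, and `images/` is a placeholder folder:

```python
from pathlib import Path

import torch
from PIL import Image
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
from torchvision.models import resnet50, ResNet50_Weights

image_dir = Path("images")                      # placeholder folder of .jpg files
weights = ResNet50_Weights.IMAGENET1K_V2
encoder = resnet50(weights=weights)
encoder.fc = torch.nn.Identity()                # keep the 2048-d pooled features
encoder.eval()
preprocess = weights.transforms()

paths = sorted(image_dir.glob("*.jpg"))
with torch.no_grad():
    feats = torch.stack([
        encoder(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
        for p in paths
    ]).numpy()

# 1) The 2-D t-SNE map usually looks reassuringly well separated (plot xy to see)...
xy = TSNE(n_components=2, perplexity=min(30, len(paths) - 1)).fit_transform(feats)

# 2) ...yet the nearest neighbours of a single query in the raw feature space
#    can be semantically unrelated -- the r-neighborhood problem above.
nn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(feats)
_, idx = nn.kneighbors(feats[:1])
print("query     :", paths[0].name)
print("neighbours:", [paths[i].name for i in idx[0][1:]])
```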

III. Knowledge in NLP vs. Knowledge in CV

First, let’s define two terms (long-debated, so we’ll use ChatGPT’s colloquial definitions):

Representation: In philosophy of mind/epistemology, a mental state that stands for something else. It implies our minds can refer to objects, ideas, or situations, acting as intermediaries between mind and world (e.g., mental images, beliefs, language symbols).

Concept: An abstract/general mental representation of a category (objects, events, ideas). Concepts organize thought/language via abstraction/generalization.

Crudely, representations are psychological phenomena, while concepts are cognitive constructs.

With these clarified, we can distinguish the knowledge acquired by NLP and CV models:

NLP (Generative Models): Training data essentially presents Relations of Concepts (à la Hume, but inductively learned). Language models understand concepts indirectly via their relations—not through direct human-like comprehension (e.g., of “time”). Thus, they learn representations woven from concept relations, plus deeper layers of such representations. The human (linguistic) world’s representations are countless, requiring trillions of parameters to memorize and relate them (a quantum N-gram of representations).

CV (Supervised/Self-Supervised): The optimization goal is to map concrete data to specific representations, then, via manual task design, abstract these into concepts (a Matters of Fact process, to stay with Hume). Why “manual”? Because without human guidance (e.g., linear probing; see the sketch below), models won’t abstract spontaneously; they’ll just align data to representations. Algorithms like I-JEPA (discussed later) bake in abstraction paths, hence their better linear-probe performance.
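To make “manual guidance” concrete: a linear probe is literally one trainable layer on top of a frozen encoder, with human labels doing all the conceptual work. A minimal sketch, assuming `encoder`, `train_loader`, `feat_dim`, and `num_classes` already exist:

```python
import torch
import torch.nn as nn

encoder.eval()                                    # frozen backbone: representations only
for p in encoder.parameters():
    p.requires_grad_(False)

probe = nn.Linear(feat_dim, num_classes)          # the single trainable layer
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for images, labels in train_loader:
        with torch.no_grad():
            feats = encoder(images)               # data -> representation (the model's job)
        loss = loss_fn(probe(feats), labels)      # representation -> concept (the labels' job)
        opt.zero_grad()
        loss.backward()
        opt.step()
```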

IV. Parameters Can’t Fix Abstract Concepts

Back to the damned parameter count: for current vision tasks, scaling up won’t trigger a “ChatGPT moment.” Revisiting the three dark clouds:

Diminishing Returns on Concept Induction: More parameters marginally improve how precisely representations can be clustered in vector space (e.g., under Gaussian priors). But with finite samples, the error is soon dominated by the prior distribution, leaving gains as meager as +0.1% on ImageNet. Worse, the higher dimensionality that comes with scaling exacerbates the curse of dimensionality, demanding more samples per concept and expanding the adversarial attack surface.
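A toy numeric illustration of the dimensionality half of that argument (i.i.d. Gaussian “features” standing in for real embeddings): as the dimension grows, the nearest and farthest distances from a random query converge, which is exactly the regime where clustering needs ever more samples and adversarial wiggle room abounds.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                    # points per trial

for dim in (2, 32, 512, 8192):
    x = rng.standard_normal((n, dim))       # i.i.d. Gaussian "features"
    q = rng.standard_normal(dim)            # a random query
    d = np.linalg.norm(x - q, axis=1)
    print(f"dim={dim:5d}  nearest/farthest = {d.min() / d.max():.3f}")
# The ratio climbs toward 1 with dimension: "near" and "far" stop being
# distinguishable, so every concept needs more samples to pin down.
```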

No Logic Without Induction: Neural nets are universal approximators, and capacity grows with parameters. Empirically, larger models can encode complex scenes (e.g., a fruit platter) into single feature vectors—but we lack high-quality human labels to conceptualize them. At best, we supply coarse tags (“still life,” “fruit bowl”), too abstract for the model’s granular representations. Without data teaching “AND/OR/NOT” relations, models won’t infer that a “forest + bicycle” should be “park,” or “forest − river” equals “barren hill.”
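What that naive composition would even look like can be sketched in a few lines with CLIP text embeddings, chosen only because they are easy to query; the checkpoint name is just a common public one, and nothing guarantees the arithmetic lands on “park”:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

name = "openai/clip-vit-base-patch32"        # an arbitrary public checkpoint
model = CLIPModel.from_pretrained(name)
tokenizer = CLIPTokenizer.from_pretrained(name)

words = ["forest", "bicycle", "park", "river", "barren hill", "jungle", "garage"]
tokens = tokenizer(words, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model.get_text_features(**tokens)
emb = emb / emb.norm(dim=-1, keepdim=True)
vec = dict(zip(words, emb))

query = vec["forest"] + vec["bicycle"]       # the naive "AND"
query = query / query.norm()
scores = {w: float(query @ vec[w]) for w in words[2:]}
print(sorted(scores.items(), key=lambda kv: -kv[1]))   # is "park" really on top?
```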

Disorder Persists: Visual data lacks language’s hierarchical concept relations, so models can’t spontaneously form structured representations or derive new concepts. For instance, before aligned cross-modal models, zero-shot learning was dismal even in narrow domains (CUB birds, Oxford flowers). While LLMs can draw unicorns from pure text, a vision model may never recognize “a destroyed Leopard 2 tank,” unless the Russians supply enough labeled data.

V. Naive Cross-Modality Alignment Isn’t the Cure

Users of Stable Diffusion notice that prompts work best as incantation-like word lists, not natural sentences. CLIP users see that image-text scores cluster in a narrow band, lacking clear thresholds. Beyond training quirks, this reveals:

Visual Concepts Are Sparse: Top learned concepts are often simple words/phrases. When I trained Chinese CLIP, 80% of the top 1M concepts were names—reflecting real-world scarcity of describable visual concepts versus proper nouns.

Complex Representations ≠ Concepts: Vision models can learn complex representations but can’t abstract them into concepts like text can, causing misalignment. Even with language, few training examples teach, say, “Leopard 2 tank wreckage.” Hence, Locked Image Tuning makes sense: using text to abstract visual representations is easier than vice versa.
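The “no clear threshold” complaint is easy to see for yourself: score one image against a matched caption and a few unrelated ones, and the cosine similarities land in a narrow band rather than splitting into an obvious yes/no. A sketch, where the image path and captions are placeholders and the checkpoint is again an arbitrary public one:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)

image = Image.open("photo.jpg").convert("RGB")          # placeholder image
captions = [
    "a photo of a space shuttle on the launch pad",     # suppose this one matches
    "an orange tabby cat sleeping on a sofa",
    "a jam sandwich on a plate",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
for cap, score in zip(captions, (img @ txt.T).squeeze(0).tolist()):
    print(f"{score:.3f}  {cap}")   # the matched caption wins, but not by a margin
                                   # you could hang a universal threshold on
```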

VI. Surrender to LLMs—Or Not?

Finally, where do scaled-up vision models go? (Excluding pure representation learners like Go/weather models, where scaling does help.)

Full Surrender (PaLM-E, Visual Programming): Let vision models serve LLMs by feeding them visual representations (depth maps, optical flow) to patch spatial reasoning gaps. But this relies on endless human-labeled data—like Louis XVI at 39, the road ahead is interminable.
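A toy version of this pattern, stripped of any real model: a vision module produces an intermediate representation (here a fake depth map from a hypothetical `estimate_depth` stub), which gets flattened into text for an LLM prompt. None of this is PaLM-E’s actual interface; it just shows where the human-shaped bottleneck sits.

```python
import numpy as np

def estimate_depth(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a monocular depth model (MiDaS-style)."""
    h, w, _ = image.shape
    return np.tile(np.linspace(1.0, 10.0, w), (h, 1))    # dummy depth gradient in meters

def depth_to_text(depth: np.ndarray, regions: dict) -> str:
    """Collapse a dense depth map into the kind of text an LLM can consume."""
    lines = [f"- {name}: roughly {depth[ys, xs].mean():.1f} m away"
             for name, (ys, xs) in regions.items()]
    return "Scene layout (from a depth estimator):\n" + "\n".join(lines)

frame = np.zeros((240, 320, 3), dtype=np.uint8)           # placeholder camera frame
depth = estimate_depth(frame)
regions = {"object on the left": (slice(None), slice(0, 160)),
           "object on the right": (slice(None), slice(160, 320))}

prompt = depth_to_text(depth, regions) + "\nQuestion: which object is closer?"
print(prompt)   # this string is what actually gets handed to the LLM
```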

Force-Fed Relations (I-JEPA, SiamMAE): Make ViTs model inter-representation relations. This partly solves the 2nd/3rd dark clouds but struggles with complex scenes (e.g., poor ImageNet fine-tuning). Sadly, the academic gatekeepers—Chinglish-speaking reviewers—torture any underperforming idea, trapping innovation in a black hole.
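For the curious, the “force-fed relations” objective fits in a page: predict a target encoder’s embeddings of masked patches from the visible context, so the loss lives in representation space rather than pixel space. Everything below is a deliberately crude placeholder (random “patch embeddings,” a pooled predictor), not the official I-JEPA code:

```python
import torch
import torch.nn as nn

dim, n_patches, batch = 192, 196, 8

def make_layer():
    return nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

context_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
target_encoder = nn.TransformerEncoder(make_layer(), num_layers=2)   # EMA copy in practice
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

patches = torch.randn(batch, n_patches, dim)        # stand-in for patch embeddings
ctx_idx = torch.arange(0, 150)                      # visible "context" block
tgt_idx = torch.arange(150, n_patches)              # masked "target" block

with torch.no_grad():                               # no gradients through the target branch
    targets = target_encoder(patches)[:, tgt_idx]

ctx = context_encoder(patches[:, ctx_idx])          # encode context patches only
pred = predictor(ctx.mean(dim=1, keepdim=True))     # crude pooled prediction per image
loss = ((pred - targets) ** 2).mean()               # L2 between predicted and target reps
loss.backward()
print(float(loss))
```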

P.S. How do I answer “How many parameters?” Coldly: “ViT-B, 88M. They won’t deploy bigger ones.”



