There's No Such Thing as a Natively Multimodal Model (Gemini 2.5 Pro Translated Version)
Ever since Gemini and the Omni-type models were released, people have kept looking at us as if we had been winning all along. My consistent attitude has been something akin to "Qin Shi Huang hiding a horse on Master Tang's face: winning my ass" (a cynical pun dismissing the idea of easy wins), and indeed, after these few months, apart from a few more people being poached from the original voice-synthesis teams, there has not been any particularly significant movement in this industry. After all, many companies in our galaxy are "based on Mercury" (implying they move very fast, or are very hyped); given the claimed "six-month lag," they probably should have released something by now. If they all wait until Llama 4 ships before acting, then perhaps half a Saturn year will have passed. That is the macroscopic view. Microscopically, everyone has already started using the cheap, score-boosting Omni type to process data, and the "winning" shows up in a very practical way: no longer having to agonize over which version of the Turbo-type model gives more stable results for data processing.
Back to the main point: the primary theme of this article is not sarcasm. I just want to discuss, relatively objectively and calmly, where the blurry boundary of our "forced-to-win-for-business" (a pun implying these models are hyped for business reasons rather than intrinsic merit) natively multimodal models actually lies. If you are familiar with how our domestic industry talks, "native" here can mean "shiny," "awesome," "different," "good for painting a promising picture," or "can be built by two interns"; the one thing it will not mean is "native" in the true sense. Then again, this way of speaking may not even be a problem, because it seems there really is no such thing as a natively multimodal model.
"Natively Multimodal" Is Itself a Re-creation Through Translation/Interpretation
In the current discourse, the forms of input and output an agent receives are generally referred to as "modalities." The word "native," however, seems to originate mostly from readers mentally filling in the blanks of (over-interpreting) the technical reports on Gemini and the Omni-type models. If we trace the original texts, we find only fairly plain statements such as:
- "The Gemini models are natively multimodal, as they are trained jointly across text, image, audio, and video." (Google)
- "OpenAI's first natively fully multimodal model." (OpenAI)
Based on my understanding from "twenty years of cerebral hemiplegia" (a self-deprecating exaggeration), Google's "natively" here, read conservatively, should mean "input and output forms were not separately distinguished during training," or at most "although we froze some modules, at the level of the whole system we did not single out any one form of input for intensive finetuning." OpenAI is even more conservative: they are only willing to admit that input and output forms are not restricted to a single modality, and, following Google's lead, they even blur the details of how those modalities are processed. The unfortunate thing is that the "Big Three Top Conferences" of the "Xinliangji" (a portmanteau likely referring to prominent Chinese AI tech media: Xinzhiyuan, QbitAI, and Jiqizhixin) in our region unquestioningly slapped the "natively multimodal" label on all of this. If the naming were up to me, I think "Transformer for Multi-Domain" (TMD, matching the author's preferred acronym "特喵的," also a euphemism for "f***ing") would be more expressive and more elegant, because when we discuss models in this category, the emphasis is basically on how to fuse data from multiple domains and compress or infer information with Transformers. In other words, if everyone is clear-headed about what is being discussed, it does not matter whether you say "natively multimodal" or "TMD." If you are not so clear-headed, it is better to state things explicitly as "configuration + input + output": for example, "I'd like to discuss a model that is Chameleon + text input + video output," rather than holding a scholastic debate over "Is Chameleon native? Is LLaVA native?" After all, even a plain LLM+TTS pipeline is, at the system level (if not the model level), also a TMD.
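To make the "configuration + input + output" framing concrete, here is a minimal sketch, in PyTorch, of what the TMD reading of "natively multimodal" boils down to: a few per-domain projections feeding one shared Transformer, with no modality treated specially. Every module name, dimension, and the choice of text-only output below is my own illustrative assumption, not a description of Gemini, the Omni models, or Chameleon.

```python
# Minimal sketch (not any specific model): the "TMD" reading of "natively
# multimodal" is just one Transformer over tokens that happen to come from
# different domains. All module names and sizes here are illustrative.
import torch
import torch.nn as nn

class TinyMultiDomainTransformer(nn.Module):
    def __init__(self, vocab_size=1000, patch_dim=3 * 16 * 16, d_model=256):
        super().__init__()
        # Each domain gets its own projection into the shared token space...
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(patch_dim, d_model)   # image patches -> tokens
        self.audio_proj = nn.Linear(80, d_model)          # e.g. 80-dim mel frames -> tokens
        # ...but a single Transformer fuses them with no per-modality special casing.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)     # here: text-token output

    def forward(self, text_ids, image_patches, audio_frames):
        tokens = torch.cat(
            [
                self.text_embed(text_ids),
                self.image_proj(image_patches),
                self.audio_proj(audio_frames),
            ],
            dim=1,
        )
        return self.lm_head(self.encoder(tokens))

# "Configuration + input + output" stated explicitly:
# shared-Transformer config + (text, image, audio) input + text output.
model = TinyMultiDomainTransformer()
logits = model(
    torch.randint(0, 1000, (1, 12)),       # 12 text tokens
    torch.randn(1, 4, 3 * 16 * 16),        # 4 image patches
    torch.randn(1, 8, 80),                 # 8 audio frames
)
print(logits.shape)  # (1, 24, 1000): one sequence, three domains fused
```

Whether you call this thing "native" or "TMD," the substantive choices all sit in the configuration (one shared encoder), the inputs (text, image patches, audio frames), and the output (text logits).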
Language Itself Is Not a Native Modality
Let us assume for a moment that humans are themselves a specialized kind of TMD model. This TMD model takes vision, hearing, touch, and smell/taste as its main input sources, and outputs through muscle actions, physiological responses, or brain signals. We then find that, when we discuss the "native" in nativeness, language itself is not an input or output modality. Language is a secondary signal transformed from input modalities such as vision, hearing, and touch, or the end product carried by secondary output modalities such as "air vibrations" (speech) or "keyboard tapping." From this perspective, humans cannot be directly stimulated by language or text and then emit new language or text token by token. Moreover, when I am "flaming" someone, is the process really "convert auditory signals to text via ASR, process the text in the brain, then convert the brain-generated text into air vibrations perceptible to the opponent's auditory system via TTS"? I don't think so. Not because I don't need to think when flaming people, but because language is an external expression of thought: its relationship to thinking is that humans use language to display, transcribe, and cache thought, and that role is itself neither an input nor an output mode. In other words, the only reason I use language when flaming someone is that I need to convey my thoughts under the constraint of "restricted muscle movement," that is, to display them through language. Without this restriction, I could simply use "Consecutive Normal Punches" (from One-Punch Man), which would surely express my state of mind better. So, in my view, the reason LLMs have developed so tremendously may simply be that machines exploited a "bug": they can use language directly as an input and output domain, benefiting from the fact that language is a highly abstracted display, transcription, and cache of thought, finding patterns in it, and constructing an approximate "language -> thought -> language" process. The best (and most extreme) example is training LLMs on program code (mathematical-logical notation): code is a direct record (carrier) of thought, so models trained on such data acquire strong logical reasoning abilities more easily. Conversely, this process may not replicate well on other, truly native modalities, because those native input sources have not undergone the same abstraction (compression) of information, so it is comparatively hard to extract logical thought from them.
Grounding Is a Pseudo-Proposition
There is a prominent line of thinking at the moment: since the language modality is indispensable, why not simply ground the other modalities to the language modality and let language lead the way? Many people also believe grounding can help the language modality out of the dictionary paradox: if only language exists, the explanation of a concept depends on its relations to other concepts, and those explanatory concepts in turn depend on yet other concepts. Multimodal grounding, the argument goes, provides anchor points for some concepts, so the model can use concrete concepts learned from other modalities to build a more complete language system. But there is a clear flaw in this picture: the instantiation of a linguistic concept happens in the mind of the receiver. Any concept in language is abstract; it does not anchor any specific instance beyond its own abstraction. Such abstracted concepts exist only to convey the content of thought during communication, and the result of instantiating them is shaped by the specific thoughts of both parties, never determined by the concept alone. Precisely because of this, although our models seem able to accurately match "a picture of a cat" with the word "cat" (a toy sketch of what that matching amounts to operationally follows the list below), they find it very difficult to handle the following two situations:
- Situations involving semantic ambiguity, or requiring information from other modalities. For example, the classic Sorites paradox (the grains-and-heap problem): whether some grains count as a heap is just a subjective judgment formed by the receiver of the concept through communication, so there is no definitive instantiation of "heap." Otherwise, if we forcibly ground the distinction on visual features such as "quantity" or "presence of stacking," should a picture of four grains arranged in a small pyramid be matched to the concept of a heap? Or take the sentence "I will punish you": spoken by a burly middle-aged man versus a young maiden, it will inevitably be instantiated differently in thought. Concepts that inherently rely on information from other modalities cannot, and should not, be directly grounded across modalities.
- Concepts that exist only in a single modality like language. Although it sounds strange, there are indeed a large number of concepts that cannot be corresponded with other modalities. These concepts are purely results abstracted by thought and are only meaningful in the communication of thought. Moreover, most of them do not have real, reliable ontological counterparts. For example: “I” (the carrier of thought), “God” (the laws of nature), “empty set” (the starting point of logic), “elementary particles (electrons, photons, quarks, etc.)” (the origin of all things). We might draw a small blue solid sphere to represent an electron, but we certainly cannot match this blue solid sphere with the concept “electron.” And these concepts themselves might be more suitable as fundamental components of a language system than “cat,” “dog,” or “apple.” In short, grounding itself can indeed serve as an indicator of a multimodal model’s capability. But if grounding is made a goal, with the expectation that the text modality can lead the way, then that’s putting the cart before the horse.
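For reference, here is what "accurately matching a picture of a cat with the word cat" usually amounts to operationally: a CLIP-style contrastive objective that pulls paired image and text embeddings together. The sketch below is a toy, with random vectors standing in for real encoders; the point is only that such an objective rewards co-occurrence statistics between modalities and has nowhere to put the receiver-side instantiation discussed above.

```python
# Toy sketch of contrastive grounding (CLIP-style matching), with random
# vectors in place of real image/text encoders. Not any production pipeline.
import torch
import torch.nn.functional as F

def contrastive_grounding_loss(image_emb, text_emb, temperature=0.07):
    """Pull matched (image, caption) pairs together, push mismatched pairs apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(len(image_emb))            # i-th image matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 4 "images" and their 4 "captions" as random feature vectors.
image_emb = torch.randn(4, 128)
text_emb = torch.randn(4, 128)
print(contrastive_grounding_loss(image_emb, text_emb))
```

Nothing in this loss distinguishes a heap from four stacked grains, or "I will punish you" spoken by two different speakers; it only ranks pairings.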
Unnaturalness and Unlearnability
But very unfortunately, if our multimodal models are language-centric, then much of the time we have no choice but to rely on active grounding to train them. This touches on the nativeness and learnability of concepts themselves: how strongly a concept is tied to "native" cognition and experience, and how quickly it can form connections with other concepts. Take a trivial example, the "red color family" ("magenta," "carmine," "rose red," "scarlet," and so on). If these concepts are never actively associated with visual input, a pure language model will not arrive at a precise instantiation when it receives them, nor will its outputs follow the principle of "letting the other party instantiate them accurately." In other words, even though these concepts exist very naturally in native cognition, they are not learnable for a language model, much as a visually impaired person can discuss a colorful world only by hearsay, with no internal concept behind the words. Conversely, the advantage of language-centric models is that they make non-natural concepts learnable. Take the blue solid sphere again: the native experience we can actually observe is interference fringes (alternating light and dark bands) on a backplate. That visual input lets us learn the concept "interference fringes" quite directly, but it is unlikely to let us learn the underlying concept "electron," even given many such pictures. Acquiring the concept of an electron is only possible by organizing, through language (or mathematical symbols), the indirect signals electrons leave in the input sources we can perceive; in a sense this is a shortcut that highly intelligent beings like humans have built for their own thinking. Lacking this mechanism and given only video input, our delicate models might well interpret "gravity" as "a disconnected region spontaneously moving along the Y-axis." Hence the curious spectacle: the language modality must evolve toward multimodal models, yet multimodal models must operate by borrowing the non-native language modality.
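As a toy illustration of what "actively associating" the red family with visual input could look like, here is a nearest-neighbor lookup from an RGB sample to a hand-picked named anchor. The anchor values are rough assumptions of mine, not standardized color definitions; the point is only that this mapping lives outside language, so a pure language model never gets it for free.

```python
# Toy illustration of "actively associating" fine-grained color words with a
# visual signal: nearest-neighbor lookup from an RGB sample to a named anchor.
# The anchor values are rough, hand-picked assumptions, not standardized colors.
RED_FAMILY = {
    "scarlet":  (255, 36, 0),
    "carmine":  (150, 0, 24),
    "magenta":  (255, 0, 255),
    "rose red": (194, 30, 86),
}

def name_red_shade(rgb):
    """Return the red-family word whose anchor is closest to the given RGB triple."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(RED_FAMILY, key=lambda name: dist(RED_FAMILY[name], rgb))

print(name_red_shade((200, 20, 80)))   # -> "rose red" (closest anchor)
```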
Positional Embedding and Temporal Embedding
At the same time, there are other views that reject language as a native modality, arguing that language as an input or output source lacks temporality, and that its counterpart, sequentiality, is not equivalent to temporality. The essence of temporality is that information must be encoded/decoded (sampled/emitted) in segments that correspond to real-world time, with an absolute temporal order. Sequentiality, by contrast, requires neither encoding/decoding by time segments (with a large enough vocabulary, a whole sentence could be encoded as a single ID) nor strict adherence to temporal order (as in "The printing order of characters does not affcet reading comprehnsion," where the misspellings are deliberate, implying a jumbled order still reads fine). In other words, positional embedding only provides a rough logical pattern among input tokens and cannot serve as a strict temporal embedding; the best fit for the latter remains recurrence-style models with strictly time-ordered input.
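A small sketch of the index-versus-time distinction, using the standard sinusoidal form for both (my own framing, not a proposal from the text above): a positional embedding is driven by the sequence index, while anything deserving the name temporal embedding would have to be driven by real-world timestamps tied to a sampling rate.

```python
# Sketch of the distinction drawn above (my own framing, not a standard API):
# a positional embedding indexes *order* (token 0, 1, 2, ...), while a temporal
# embedding would have to index *real time* (e.g. audio frames at a fixed rate).
import math
import torch

def sinusoidal(values, d_model=64):
    """Standard sinusoidal encoding over arbitrary scalar 'values'."""
    freqs = torch.exp(-math.log(10000.0) * torch.arange(0, d_model, 2) / d_model)
    angles = values[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# Positional: the same 10 tokens get the same codes no matter how fast they
# were produced; only their order matters.
positions = torch.arange(10, dtype=torch.float32)
pos_emb = sinusoidal(positions)

# "Temporal": audio frames sampled every 10 ms carry absolute timestamps,
# so the code is tied to real-world time, not just sequence index.
timestamps = torch.arange(10, dtype=torch.float32) * 0.010   # seconds
time_emb = sinusoidal(timestamps * 1000)                     # scale to ms

print(pos_emb.shape, time_emb.shape)  # both (10, 64), but built from different axes
```

The shapes come out identical, which is exactly the trap: nothing in the positional code knows whether token 3 arrived 30 milliseconds or 30 minutes after token 0.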
To be clear, my arguments here are not aimed at structures like RWKV (the author writes "软文KV," "soft-content KV," likely a playful jab at or typo for RWKV), Mamba, or xLSTM, because current structural designs still cannot get past using sequential structure to model temporal structure: either the language modality is forcibly given temporality, or the other, native modalities are forcibly given sequentiality. And, as defined in the previous section, the thought and logic carried by language depend neither on a particular encoding/decoding segmentation nor strictly on temporal order. That is why the efficiency/performance trade-off of an attention-style solution, which looks up high-weight tokens, is far better than solutions that push tokens through a hidden state one by one. Conversely, the other native modalities (speech, video) not only require temporally uniform encoding/decoding segmentation (manifested as sampling rate or frame rate), but the information a token carries also depends strictly on the tokens before it, so an attention mechanism may actually reduce efficiency there. And yet my personal stance is to support RWKV-like approaches, because thought (and the language that carries it) will never be infinitely wide: if the encoding/decoding segmentation is chosen sensibly and temporality is imbued cleverly, such models can integrate well with native modalities.
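To make the contrast in access patterns concrete, here is a toy comparison (no claim to match RWKV, Mamba, or xLSTM internals): an attention-style lookup that revisits every stored token, versus a recurrent update that folds each strictly ordered step into a fixed-size hidden state.

```python
# Sketch of the two access patterns contrasted above (toy numbers only):
# attention revisits every stored token; a recurrent cell folds each step
# into a fixed-size hidden state as it arrives.
import torch

d = 8
tokens = torch.randn(16, d)          # a short "stream" of 16 token vectors

# Attention-style: each new query looks back over the whole sequence.
q = torch.randn(1, d)
weights = torch.softmax(q @ tokens.t() / d**0.5, dim=-1)   # (1, 16)
attended = weights @ tokens                                # cost grows with length

# Recurrent-style: strictly ordered, one token at a time, constant-size state.
W_h, W_x = torch.randn(d, d) * 0.1, torch.randn(d, d) * 0.1
h = torch.zeros(d)
for x in tokens:                     # each step may correspond to a real time slice
    h = torch.tanh(W_h @ h + W_x @ x)

print(attended.shape, h.shape)       # (1, 8) vs (8,): lookup vs running summary
```

The first pattern suits language, where any earlier token may suddenly matter; the second suits sampled streams, where each step corresponds to a real slice of time and mostly depends on what came just before.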
Closing Remarks
Although academia still has many open questions about the cognition of language itself, it is undeniable that language is currently the most efficient means we have for building intelligent agents. That efficiency shows not only in its close relationship with thought but also in its "synesthetic" effect with other modalities (a piece of text can evoke imagery, or make someone unconsciously sing it out), meaning that abstraction and concretization within "native modalities" can also be mediated by language. So the current family of LLM-based models that can handle multiple input and output sources is by no means a worthless direction of exploration; we simply should not blindly believe it is the ultimate solution. Language is merely an instinct humans evolved to adapt to their environment. Perhaps in the future humans will create new "language modalities" to adapt to the development of intelligence.
Further Reading
Fedorenko, E., Piantadosi, S. T. & Gibson, E. A. F. Language is primarily a tool for communication rather than thought. Nature 630, 575–586 (2024). https://doi.org/10.1038/s41586-024-07522-w

Öhman, C. We are Building Gods: AI as the Anthropomorphised Authority of the Past. Minds & Machines 34, 8 (2024). https://doi.org/10.1007/s11023-024-09667-z

Douven, I. The Role of Naturalness in Concept Learning: A Computational Study. Minds & Machines 33, 695–714 (2023). https://doi.org/10.1007/s11023-023-09652-y