Interpretations and Reflections on AI Alignment (Gemini 2.5 Pro Translated Version)

Recently, I came across an article with some very valuable viewpoints. In short, Professor Liu critically reflects on the term “AI alignment.” Although the article’s wording and views are fairly blunt, as a grassroots practitioner I found it quite well-intentioned. It is common knowledge that the current environment is rife with speculators who borrow concepts for a barbaric land grab (the so-called “AI circle,” which, indeed, has nothing to do with us grassroots practitioners). If similar voices do not come from higher up, we will inevitably fall even more comprehensively behind the imperialist colonial powers (though, waking up on the morning of February 16th, I found we had already fallen further behind).

[What is “AI Alignment”? And is it Necessary? Social Sciences Weekly](https://mp.weixin.qq.com/s?__biz=MzIzNTE5NjgxOA==&mid=2247531342&idx=1&sn=d5911457a70bc6e8e5233628879a2d83&chksm=e8e8e3b3df9f6aa50715a4d7af82b693e5076cfbda52799188da34bcd912e6c37ae4dd4cc4fd&scene=21)

Here, with the conscience of a practitioner, I will share my views on this issue, hoping to contribute a modest effort to the academic community’s crusade against “academic-warlord capitalism.” The article is divided into two parts: the first is GPT-4 Turbo’s summary and evaluation of Professor Liu’s article, and the second is my own interpretations and reflections on the alignment problem. Objectively speaking, GPT-4 wrote it better than I did.

The following is GPT-4’s summary and evaluation of the article

This text critically and skeptically explores the AI alignment problem in depth.

The article’s views can be summarized as follows:

  • The popularization of the AI alignment problem stems from internal conflicts at OpenAI, but there are doubts whether alignment can truly ensure AI development meets human needs and intentions.
  • The concept of AI alignment involves enabling machine learning models to capture and follow human norms and values. However, the alignment problem is twofold: what is the goal of alignment, and how to achieve it.
  • Regarding “what to align with,” there are multiple challenges. Human values are not uniform and are constantly changing, and there’s a gap between ideals and reality. Different AIs may need to align with different objects; for example, a pet dog AI should align with pet dogs, not humans.
  • The role of AI alignment is considered limited. In everyday scenarios, AI does not need to fully understand complex human values in order to act. Imitation and inference are currently the two main approaches to AI alignment, but both have problems: imitation may copy irrelevant behaviors, and inference based on data and logic may fail to match humans’ irrational behavior.
  • AI’s response in uncertain situations carries risks, as it could lead to irreversible consequences and it cannot take responsibility for its actions.
  • Making AI actions conform to human rules might lead to human society adapting to machines rather than the other way around, potentially leading to the mechanization of humans, losing autonomy and spirituality.
  • The view is that rule-making should be an exclusive human right, AI development should be controlled, and AI development directions potentially detrimental to humans should be stopped and changed.
  • The article concludes that AI alignment is not a simple opposition between capital and science, or justice and evil, but a potentially very dangerous attempt. OpenAI’s internal conflict might have been a high-quality AI publicity stunt, and the alignment problem, contrary to public perception, cannot simply solve the issue of AI development deviating from human needs.

The following evaluation can be made of this article’s views:

  • The diversity of human values is not a reason why alignment is impossible. The article points out the diversity and constant change in human values, suggesting this makes AI alignment almost impossible. However, diversity does not mean that consensus or fundamental principles cannot be found. The goal of AI alignment is not to require machines to understand and execute all individual values, but to find universal principles and rules and ensure AI works under these principles. Furthermore, AI can be designed to be adaptable and operate flexibly within different cultures and values.
  • AI alignment focuses not only on behavioral rules but also on the decision-making process. The article mentions that AI alignment might only learn existing behavioral rules, including undesirable ones. However, AI alignment research focuses not only on rules themselves but also on decision-making processes and rational principles. This means AI can be trained to recognize and avoid undesirable behaviors and adopt actions more consistent with ethics and morality.
  • AI alignment does not mean abandoning human decision-making power. The article mentions that ceding decision-making power to AI might lead to humans aligning with AI. However, a core goal of AI alignment is to ensure that AI’s decisions and actions are consistent with human intentions and ethics, not to replace human decision-making. This includes AI being able to request human guidance or cede decision-making to humans in uncertain situations.
  • Technological development and alignment are a continuous process. The article implies AI alignment might be a one-time attempt, which is unrealistic. In reality, AI alignment is a dynamic, ongoing process that needs constant adjustment with changes in technology and societal values. This includes continuously evaluating and improving the behavior of AI systems to ensure they remain aligned in new situations.
  • AI alignment helps reduce risks, not increase danger. The article argues that AI alignment is a dangerous attempt, but in fact, AI alignment research is precisely aimed at reducing the potential risks posed by AI. By ensuring AI systems’ goals are consistent with human well-being and ethical principles, the misuse and abuse of AI can be reduced, thereby protecting humans from the impact of AI’s improper actions.
  • In summary, AI alignment is a complex but necessary task aimed at ensuring that the development of AI technology aligns with humanity’s best interests. Although challenges exist, through interdisciplinary cooperation, transparent research, and continuous iteration, we can move towards safer and more responsible AI applications.

The following are my own interpretations and reflections on the alignment problem

The Meaning and Levels of Alignment

If Professor Liu Yongmou’s article has any shortcoming, it’s probably the lack of a particularly consistent narrative for the definition of the term “alignment.” Professor Liu quotes the book “Human Compatible” (likely referring to Stuart Russell’s work) to give a general discussion of alignment: “ensuring these models capture our norms and values, understand our meaning or intent, and most importantly, act in the way we want.” However, this book, published in 2020, clearly did not anticipate the explosion of large language models a few years later. From today’s perspective, the book is roughly a blend of three modules: traditional AI ethics research, discussions sparked by some deep learning advances from 2015-2018 (e.g., feature learning, reinforcement learning), and discussions brought about by the “adversarial attacks” that dominated academia from 2018-2019. The achievements now demonstrated by large language models (like GPT-4’s summary and evaluation of the article above) directly render some of its views too simplistic or too generalized to guide us on the alignment problems we currently face.

Therefore, I, as a practitioner, venture to define how alignment should be articulated in this current era of large language models. Of course, as GPT-4 taught, technological development and alignment are a continuous process; this articulation of mine will probably become obsolete after a few version updates. Below, I divide the alignment problem into 6 levels, presented from low to high. It should be noted that the definition of alignment should exist on a continuous spectrum, and the levels presented here are to describe specific nodes on this spectrum; there should be intermediate states between any two levels.

  • Fully Controllable: Our algorithm models operate entirely on rules formulated by humans; in other words, the symbolicist articulation. If you don’t know what symbolism is, you have at least heard jokes like “a certain major company’s autonomous driving algorithm has tens of thousands of if-else branches.” At this level of alignment, the algorithm model is simply a concrete encoding of abstract rules derived from human reason. As long as those rules have been vetted, no negative situations will arise, because if a problem does occur you just need to grab the programmer who wrote it (a toy sketch contrasting this level with the next follows this list). Furthermore, one rung above symbolism sits the branch of traditional machine learning algorithms such as Boosting or SVM. Although their operating mechanisms involve data, with the backing of mathematical logic any influence of the data can be understood or controlled, so they are also classified at this level here.

  • Explainability (XAI): The explainability here refers to a state where, although the model operates in a relatively black-box manner, we can construct a theory, through mathematical reasoning or empirical observation, that explains how the black box operates, and in most cases this theory matches what we observe. If we had to label this level with a single term, it would roughly be “physics mode”: on one hand throwing things into a particle collider, on the other trying to devise a chromodynamics model that explains the observed phenomena and, incidentally, predicts phenomena that might appear in the future. At this level, by applying rules, we can achieve response-level controllability of algorithm models (unlike symbolism, where we control every step). This is also one of the main alignment tasks for current large language models: making their answers conform to the ethical views of their rule-makers.

  • Human-like Behavior: This level is relatively difficult to define. If we were to use a popularly known standard, it would probably be meeting the bar of the Turing Test or the Chinese Room thought experiment. In other words, this level of alignment requires algorithm models to be able to continue an ordinary human line of thought without making people feel they are mechanical. For example, whereas the controllability of the previous level might lead an algorithm model to arbitrarily refuse to answer (sensitive questions), alignment at this level should at least involve some simple analysis of the question before refusing. If possible, I would use the algorithm model’s possession of “intellectual judgment capability” (知性判断能力, zhixing panduan nengli: a capacity for understanding or comprehension, akin to 悟性, wuxing) as the yardstick, meaning a higher-level cognitive ability that rests on a perceptual basis. Because intellect (if this word is hard to parse, read it as aptitude or comprehension) is an innate human ability to instinctively recognize objects, when an algorithm model exhibits a similar ability to recognize, there is the possibility of humans recognizing it as kin. Generally, when we work on the “hallucination” problem, we are performing alignment at this level.

  • Reason-like Behavior: If we believe that possessing human-like intellectual capability makes it possible for humans to identify with algorithm models, then possessing human-like rational capability is a very natural extension. The “reason” (Vernunft) spoken of here refers to the human endowment for pursuing broader and higher-level truths (let’s temporarily drop the qualifier “a priori”), one that transcends specific sensory experience and seeks universal and necessary principles. Or, to put it more plainly, the typical manifestation of reason-like behavior is everyone’s favorite game of posing math (or logic) problems to large models, because mathematics is a field where definite new knowledge can be obtained purely through reason and intuition (which, colloquially, you can read as the input prompt), without experience. Generally, the starting point of strong artificial intelligence is at this level, and it is also the upper limit of what “alignment” means in most contexts. If my understanding is correct, Sam Altman’s “superalignment” also sits at this level.

  • Fully Rational Entity: If one accepts functionalism and multiple realizability, then a fully rational entity is essentially a human realized through an algorithm model. This is the most common popular image of strong artificial intelligence and, in a general sense, the endpoint of “alignment.” In other words, we assume that human reason (mental states) is universal: it can appear in different entities in different physical forms and still follow the same logical and moral laws, aesthetic standards, and so on. Its specific implementation details (such as whether it possesses intentionality, desires, emotions, etc.) belong to a secondary scope of discussion.

  • Superhuman Rationality: Finally, let’s imagine an algorithm model that possesses a supra-rationality (or an omnipotent, divine rationality) that human reason cannot touch. At this point the goal of alignment is reversed: the supra-rational entity must explain its mode of operation in a way humans can understand. Of course, discussion at this level is somewhat excessively abstract and science-fictional, but I roughly sense that most people simply cannot follow the first five points, so I am compelled to leave a space here for everyone to exercise their imagination (in other words, bowing to the need for traffic). Conversely, though, it is not entirely absent from reality. To stretch the point, the internal operating mechanism of current large language models is itself a kind of supra-rationality, because high-dimensional space, while not absolutely unreachable by reason in the way God, freedom, and the soul are, is still something reason cannot possibly explain with humanity’s current mathematical and logical tools.
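To make the difference between the first two levels concrete, here is a minimal Python sketch (the function names, thresholds, and banned-topic list are all hypothetical, not drawn from any real system): at the “fully controllable” level every decision is an explicit, human-vetted branch, while at the “explainability” level we give up step-by-step control and only constrain the response that a black-box model finally emits.

```python
# Hypothetical sketch: level 1 (fully controllable rules) vs. level 2
# (response-level control over a black-box model). Illustrative only.

def rule_based_policy(obstacle_distance_m: float, speed_kmh: float) -> str:
    """Level 1: every decision is an explicit, human-written rule.
    If something goes wrong, you can point at the exact branch (and its author)."""
    if obstacle_distance_m < 5:
        return "emergency_brake"
    if obstacle_distance_m < 20 and speed_kmh > 60:
        return "slow_down"
    return "keep_lane"


BANNED_TOPICS = {"explosives", "private address"}  # the rule-maker's list, not the model's

def respond_with_filter(black_box_model, prompt: str) -> str:
    """Level 2: we cannot control every internal step of the model,
    but we can impose rules on the answer it finally emits."""
    answer = black_box_model(prompt)                 # opaque generation process
    if any(topic in answer.lower() for topic in BANNED_TOPICS):
        return "Sorry, I can't help with that."      # response-level controllability
    return answer


# Toy usage: a stand-in "model" that just echoes the prompt.
print(rule_based_policy(obstacle_distance_m=3.0, speed_kmh=80.0))             # emergency_brake
print(respond_with_filter(lambda p: f"You asked about {p}.", "the weather"))  # passes the filter
```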

Existing Algorithm Model Alignment Cannot Produce a Fully Rational Entity

To answer this question, I believe we can discuss it in terms of two necessary underlying assumptions: whether the world of language can completely and accurately map the real world, and whether pure reason can be acquired by imitating experience expressed in language. Personally, I believe neither assumption holds. Therefore, the upper limit of the alignment we can achieve is merely to make algorithm models “look the part”; they will not possess complete rationality (that is, genuine logical reasoning ability).

Regarding the first assumption, we can readily cite Wittgenstein’s views from his later work, “Philosophical Investigations”: “the meaning of language is not fixed, but depends on its use in specific forms of life,” “the functions of language go far beyond stating facts, and include asking questions, giving commands, praying, etc.,” “the relationship between language and reality is complex; language cannot always completely express the real world because the complexity of the real world far exceeds the expressive capacity of language.”

Furthermore, concerning things and concepts themselves: for concrete things (e.g., “stone,” “sun”), concepts at the linguistic level can be described by a posteriori concrete experience. This experience includes not only intuition (impressions) but also descriptions based on relationships between other concepts. For abstract concepts (e.g., “power,” “slavery”), their formation may not depend on concrete experience but can be the product of purely symbolic reasoning (rational processing). However, for human individuals, the interpretation of these concepts depends on their own experiences. So, even assuming that “power,” “slavery,” etc., do exist as facts in the real world, because their interpretation varies from person to person, language (especially the training corpora needed by algorithm models) cannot produce a precise description of them.

As for the second assumption, although the reason-like behavior GPT-4 currently exhibits might genuinely make John Locke unable to keep his coffin lid down (i.e., turn in his grave), we still cannot say definitively whether reason is innate or empirical. Let’s start from a currently widely accepted observation: mixing program code into the training data enhances a model’s reasoning ability. This seems to suggest that we really can make algorithm models acquire the corresponding rationality by feeding them training material that demonstrates “logical reasoning ability” within specific domains. And once we have enough such fragments of rationality, then from the perspective of pragmatism or conceptual relativism (in plain language, duck typing; see the toy sketch below), we do indeed possess complete rationality.
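The duck-typing analogy can be made literal with a toy Python sketch (the classes and “skills” here are purely illustrative, not a real evaluation protocol): we never inspect what an agent “really is,” we only check whether every fragment of rational behavior we care about is present.

```python
# Duck typing as a metaphor for the pragmatist view: never ask what the agent
# "really is", only whether it behaves as expected wherever we probe it.
# Classes and skill names are purely illustrative.

class Human:
    def prove(self, theorem): return f"a proof of {theorem}"
    def plan(self, goal):     return f"a plan for {goal}"

class LanguageModel:
    def prove(self, theorem): return f"a generated proof of {theorem}"
    def plan(self, goal):     return f"a generated plan for {goal}"

class NarrowSolver:
    def prove(self, theorem): return f"a brute-forced proof of {theorem}"
    # no plan(): only a fragment of "rationality", so the pragmatist test fails

def looks_rational(agent) -> bool:
    # Pragmatist test: if every fragment of rationality we can name is there,
    # we call the whole thing rational; no inner essence is required.
    return all(callable(getattr(agent, skill, None)) for skill in ("prove", "plan"))

print(looks_rational(Human()), looks_rational(LanguageModel()), looks_rational(NarrowSolver()))
# True True False
```

Whether enough such behavioral checks ever add up to complete rationality is exactly what the two uncertainties below call into question.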

However, there are two uncertain factors in this line of reasoning.

First, can logical reasoning ability be exhaustively enumerated in language (or in some exhaustive artificial language)? This is hard to say, because when a person is doing mathematical reasoning, at the moment the result is obtained there is no logical process describable in language (the so-called flash of “inspiration”); the reasoning process written down afterwards does not actually correspond to that instant of sudden insight. In other words, I believe that although reason can manifest itself in every aspect, we cannot reproduce reason by exhaustively enumerating all of those aspects, least of all in the training process of large language models.

Second, can the existing GPT “Transformer” architecture (the author uses the punning transliteration “船司伏魔构架,” chuán sī fú mó gòujià, roughly “Ship’s Master Subdues Demons Architecture”) form the structures that reason requires? I believe the answer is no. To put it bluntly, the mechanism of QKV attention + projection + non-linearity + residual (see the sketch below) is not well suited to handling special mathematical operations (e.g., series). Whether the human brain evolved special structures suited to these operations is hard to say (we simply don’t know). And the best mathematical models currently available more or less adopt the workaround of code conversion and interpretation, i.e., translating natural language into machine language and then using standard logical units to do the processing.
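For readers who have never looked inside the box, here is a stripped-down, single-head NumPy sketch of the kind of block meant by “QKV attention + projection + non-linearity + residual” (no layer norm, no multiple heads, random toy weights; a sketch of the standard design, not any lab’s actual code). Every step is a matrix multiplication plus a pointwise non-linearity, which is why operations like summing a series have to be approximated or learned indirectly rather than computed natively.

```python
# Minimal single-head transformer block in NumPy, spelling out
# "QKV attention + projection + non-linearity + residual". Illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # x: (seq_len, d_model)
    q, k, v = x @ Wq, x @ Wk, x @ Wv                    # QKV projections
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v  # scaled dot-product attention
    x = x + attn @ Wo                                   # output projection + residual
    hidden = np.maximum(0.0, x @ W1)                    # non-linearity (ReLU feed-forward)
    return x + hidden @ W2                              # second residual

# Toy usage with random weights: four tokens, model width 8, hidden width 32.
d, h, T = 8, 32, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))
weights = [rng.normal(size=s) * 0.1 for s in [(d, d), (d, d), (d, d), (d, d), (d, h), (h, d)]]
print(transformer_block(x, *weights).shape)  # (4, 8)
```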

Therefore, I optimistically believe that even if OpenAI can lead us, it can lead us by at most ten years, because Q* most likely cannot break through the barrier of the fourth level (reason-like behavior) [doge].

How We Formulate Rules

I largely agree with the principle Professor Liu states in his article, that “rule-making must rely on humans,” but I agree with it for different reasons than the article gives. In my view, before algorithm models cause chaos (“machine rules become human rules, humans have to live according to machine requirements”), the people associated with them will always be the first to cause problems. In other words, if the people who formulate rules for AI are themselves chaotic (see certain draft regulations released for comment), how can good AI rules be formulated?

  • Morality and Values

    Although the term “algorithmic hegemony” has been hyped over and over in the past, in the GPT era, this cold leftover clearly needs to be brought out again, because in this era, the impact of algorithmic hegemony will be even more severe and insidious. In the past, we believed colonialists used algorithmic pushes to control the spread of news. While it’s true we can create an information cocoon through massive pushes, the ultimate reception of this information still relies on passive indoctrination. That is to say, as long as the channels for acquiring information are open, users with even a little bit of brain can actively try to receive other information and not be trapped.

    However, large language models are easily perceived as independent sources of information, and they change the logic of this information dissemination through active user invocation. If large language models are touted as moral or conforming to universal values (and indeed they appear to be so most of the time), then the cost required to accurately determine whether a model’s output poses moral and value risks (i.e., whether it contains “smuggled-in private agendas”) increases significantly. There may not even be that many users in the world with enough brains to make such judgments. From this perspective, OpenAI and Google actively blocking users in our country actually gives us ample opportunity (because they really can do it). Therefore, “we must vigorously develop algorithm models that conform to socialist morality and socialist values” (Awesome! /s).

  • Responsibility and Accountability

    Professor Liu points out in his article: Whether following an “instillation” or “learning” path, autonomous vehicles can solve [ethical dilemmas] with random solutions or by simply braking. What’s important is to bear responsibility for accidents, not to get bogged down in how autonomous driving solves the “trolley problem.” I deeply concur. When an AI system malfunctions or causes harm, there needs to be clear attribution of responsibility. This is bottom-line thinking that must occur once models move beyond the explainable level.

    I won’t even touch the legal and sociological issues here, because I’m not an expert in legal sociology. But as an algorithm (pseudo-)expert, I want to tell you that the direct problem caused by a lack of such thinking is that our evaluation of models becomes very inaccurate. Because the EVA, ah no, the GPT-4 we interact with now is fighting us while wearing its “armor” (restraint bolts, as in Evangelion). The RLHF alignment done by OpenAI is, to some extent, a sacrifice of human-like and reason-like capability in exchange for a kind of responsibility evasion at the explainability level. The true GPT-4 is undoubtedly a beast too powerful to look at directly, a force unimaginable to rookie warriors living under the protection of regularization and shielding mechanisms.

    Let’s try to return to the legal and sociological issues. My view on the responsibility problem is that after algorithm personnel have made efforts to align their models with the morality and values that serve our country’s interests, they should not be held excessively accountable. Instead, algorithm personnel should be allowed to focus their energy as much as possible on improving the algorithm’s own performance. This aligns with GPT-4’s aforementioned principle that “technological development and alignment are a continuous process.” The responsibility for harm caused by the model itself should be borne by society through some insurance mechanism (because the model also brings societal progress), and the necessary funds can be jointly raised by existing model beneficiaries and the state. Looking back at the recently released draft administrative measures for comment, ordering corrections within a time limit is actually an unrealistic and lazy form of governance. What we need to do is accelerate the elimination of entities with lower technological levels, not endlessly update wordlists.

  • Privacy and Autonomy

    Although the protection of privacy and autonomy (understanding and controlling how one’s data is used) is a well-worn topic, and our country has already issued multiple laws and regulations to protect them, I still want to point out two potential blind spots here:

    First, even if user privacy is protected, we should avoid situations where algorithm models create user privacy. To use terms you might not understand, we must not let models develop “intellectual intuition.” To use terms you might understand, we should lower the alignment level of models when dealing with personal information, keeping it within the explainable realm, and should not attempt to let models make intellectual judgments. Finally, in plain language, models must not be allowed to hallucinate by fabricating a person’s information, because users lack judgment, and these hallucinations are highly likely to be spread as real information (thus the model becomes a god, and humans live under rules made by machines).

    Second, we must be wary of the possibility that public power might accelerate and promote the phenomenon of “humans aligning with AI.” Firstly, what is obvious and already happening is that public power will weaken the public’s autonomy over their own privacy for the purpose of “potentially improving efficiency and reducing costs.” Secondly, our censorship and education systems will also (explicitly) rapidly adapt to the changes brought by large language models. At that time, their target audience will have no choice but to adjust their own behavior patterns to adapt to the new system constructed by this change, thereby forming a vicious cycle of humans aligning with AI, AI evolving, and then further domesticating humans.

How Far Behind Are We, and How Do We Change

I originally didn’t intend to write this chapter, because I’m fundamentally a “fun-seeker” (乐子派, someone who mostly enjoys watching the chaos). After all, OpenAI’s valuation, however high, is only about a quarter of a certain baijiu (liquor) brand’s market cap. But recently, various articles have been walking us through the emotional stages of “denial -> anger -> bargaining -> despair -> acceptance,” so I feel I have to say a few words. And if you ask which faction I support, I definitely support the “plan-early faction” (早图派, zǎo tú pài). I have always believed that our gap with the US empire lies in upper-level thinking, not lower-level execution. Although this is very offensive, I must say that in my past experience many people lack thinking grounded in practical experience (not that they don’t think, but that they lack frontline experience). And as everyone knows, this is a rapidly changing field with many unknowns yet to be discovered; a lack of frontline experience makes thinking drift away from the right direction and so fail to form a deep, coherent system of thought. This in turn creates the misperception that the field is shallow (strikingly visible in the explanations on public WeChat accounts; I even know that a large number of PhDs in China do their research by reading tech media like Xinzhiyuan or QbitAI), which then attracts an influx of unqualified researchers toiling futilely on meaningless points. In contrast, OpenAI keeps throwing out concepts like “scaling laws” and “AGI is compression.” Even though these views are flawed in my opinion, if we don’t speak up (or lack the ability to speak up), we will lose our voice (and become even less capable of speaking).

So, the change we need to make is for bottom-level algorithm workers to make their voices heard. Record your thoughts and spread them. Even if correct and profound content gets no traffic, it is still what should be done.

Closing Remarks: When I started conceptualizing this article, Sora (“馊腊,” a pun, literally “rancid preserved meat”) hadn’t come out yet, and even SHA:000001 (the Shanghai Composite Index) hadn’t bottomed out. By the time I finished writing, Avdiivka had already fallen, and along with it a certain “Li Someone’s Boat” (李某舟, likely referring to a prominent Chinese AI figure) had become the father of Chinese AI. Some followers messaged me hoping I would write something about Sora. I just want to say that if this were the Civilization game series, Japan would already have achieved a cultural victory, and that Sora itself does not overturn anything I am writing now or have previously published. So please, speak up more; otherwise there will only be unrated players like me spouting nonsense here.



