Scar Literature from Reviewing CVPR - The Decline of Top Conferences and the Emptiness of Papers (Gemini 2.5 Pro Translated Version)

Update (Jan 24): I checked – for the 5 papers I gave a 1, the other reviews were also 1-2 points. For the one paper I gave a 5, the other reviews were also 4-5 points. So, can the “saints” (holier-than-thou critics) shut up now?

Clarification: Please don’t take things out of context. Deciding to give a reject in 10 minutes and spending only 10 minutes on the entire review are two different concepts. For a very mediocre paper, you know its type after reading the abstract and introduction. The next step is to research the most relevant papers and find material to write comments. In the end, I still have to spend several hours cumulatively writing comments for each paper.

Also, why 5 rejects out of 6 papers? Because the acceptance rate for top conferences is typically in the 20-25% range. If you have 6 papers, you’ll definitely need to clearly reject 3-5. Isn’t that perfectly normal? If you give 1 reject, 1 accept, and 4 borderline, do you think the AC (Area Chair) wouldn’t want to kill someone?

Then there are people getting PTSD because I gave 5 rejects, and people treating several review outcomes as if they were independent random events. Firstly, I can confidently say that the other reviews for these papers were also bound to be negative (and I did seriously write one or two strengths and suggestions for improvement for each). If, during the discussion phase, I found my scoring was wrong, I would certainly change it. Secondly, I wrote this precisely because I couldn’t stand how poor the quality of the papers was this time, and wanted to talk about how papers should actually be written.

Furthermore, I received requests for 6 emergency reviews. After briefly reading the abstracts, I found they were all of mediocre quality. If I had accepted them all, I would probably have had to reject 9-10 out of 12 papers. I declined them all precisely because I didn’t want to torture myself (of course, the cap might be 8 papers per reviewer; I didn’t dare to click and find out).

I want to emphasize one more thing: for a junior reviewer (generally, those who receive fewer than 8 papers, with some margin of error), your job is to provide timely, accurate, and clear review comments. Once you’ve done that, your work is already well and excellently completed. Don’t act with the heart of a Bodhisattva trying to save all sentient beings; such an idea only provides verbal gratification and has no practical meaning.

I’m forced to say this again: if you’ve read the guidelines carefully, they generally ask reviewers to express their opinions as clearly as possible and to avoid borderline scores unless genuinely unsure. This is why ICLR previously experimented with a 1-3-6-8 rating scale, and why CVPR removed the “strong” qualifier this year – to reduce reviewers’ guilt about giving scores at either extreme, in other words, to encourage reviewers to express their opinions as clearly as possible. But you just love giving borderline scores, don’t you? The more borderline scores you give, the more random the AC’s meta-review becomes, the more people will make speculative submissions and succeed at them, and the worse the conference’s quality will become.

And yes, I am indeed quite arrogant. Given the average quality of CVPR papers nowadays, if it takes you a whole day to read one paper, I’m afraid you might not be qualified to review for others, right? (/s, no offense to anyone this doesn’t apply to)

Original Text Below:

In fact, I had refused to review for many years. Submitting and reviewing (especially in deep learning) is essentially a quasi-religious activity: debating a research subject on the basis of scientific faith, something that should inherently have a high barrier to entry. However, when I gradually found myself having to spend several hours on absurd material like “ResNet is the greatest neural network,” “mass data smelting will allow us to surpass the UK and the US,” or “fighting roosters is also an adversarial attack” (a nonsensical pun in Chinese on “adversarial attack” and “rooster fighting”), I naturally gave up participating in such activities.

This time, however, I was spammed by the PC (Program Committee) with about five or six review invitation emails (“The quality of the conference strongly depends on the willingness of reviewers like you to provide thoughtful, high-quality reviews on time.” But they never invited me to be an AC, huh?). So, with a slightly curious mindset, I accepted, and was assigned manuscripts with IDs approaching 18000, which gave me a little “AI shock.” The reviewing experience was indeed worse than a few years ago. Out of the 6 papers I handled, 5 were given a 1-point Reject within 10 minutes (just after reading the abstract and introduction). Of course, some might question how I differ from the reviewers described on Zhihu (a Chinese Q&A platform). Indeed, everyone’s definition of an excellent paper varies. Occasionally, someone like me who almost never looks at experimental performance is a godsend for some (and naturally a disaster for most). However, friends who enjoy achieving SOTA (state-of-the-art) on a Cantor set (i.e., on very niche or insignificant problems) can certainly try submitting a few more times; you’ll eventually encounter a reviewer unlike me.

So, from this perspective, I want to discuss two issues: first, how have top conferences fallen into such a chaotic state, and second, what simple standards should we use to judge the quality of papers.

I. Reasons for the Influx of Disastrous-Quality Manuscripts in Top Conferences

Applying concepts from economics, the fundamental reason for this situation is that the academic market itself is not sufficiently efficient. In other words, a researcher’s value cannot be quickly and accurately reflected by their various academic activities. Therefore, if a researcher hopes to achieve academic success, they have no choice but to rely on speculative submissions to top conferences that might yield excess returns. Of course, designing a value-assessment system for academic researchers is a grand topic beyond my capabilities (and no one would listen to me anyway). Here, I only wish to explore why the market is inefficient and why speculative behavior can succeed.

Firstly, in theory, even to form a weak-form efficient market, the value assessment of an academic researcher should be based on a full understanding of their public information (all publications, etc.). But the workload and the standardization of judgment this requires are beyond what ordinary organizations can accomplish. Thus, current academic evaluation systems adopt a path akin to “efficiency first, with fairness also considered,” for example, judging quantifiable indicators such as the frequency of first-authored papers at top conferences and citation counts to quickly label a researcher’s value. Considering practical factors, especially in academic systems where professors also handle administrative duties, this approach itself is neutral. However, in an academic system with a large number of administrators, merely transplanting this path without any initiative of one’s own is quite passive and ineffective. In particular, for those who have obtained administrative “hats” (positions/status) through this paper-centric evaluation model, if they continue to evaluate the next generation of researchers with the same model, it will directly lead the next generation in this market into “involution” (内卷 – intense internal competition for diminishing returns) driven by the maximization of self-interest.

To digress a bit, more than institutional reasons, the difference in the level of needs satisfied by academic activities (i.e., human nature) is probably the core factor for market failure. In my view, academic activities should at least start from satisfying cognitive needs (curiosity, desire for knowledge), and obtaining high-level “hats” should be about satisfying higher-level needs like aesthetics and self-actualization (values, morals). If an academic researcher is merely pursuing physiological or safety needs (job security, benefits), and the purpose of obtaining a “hat” is to satisfy specific esteem needs (achievement, fame, status), it will inevitably lead to a distortion of academic value judgment. However, I still maintain a relatively moderate attitude here: a researcher’s needs hierarchy largely depends on the external environment rather than personal factors. For example, our “Break the Five Onlys” special action (a Chinese academic reform initiative targeting overemphasis on papers, titles, diplomas, awards, and projects) is essentially about providing an external environment where researchers don’t have to worry about lower-level needs (I’m not evaluating its results here, but I personally think it has positive significance).

On another level, a large number of relatively junior researchers are attempting such speculative behavior (a disclaimer here: although factors from a certain great Eastern country [China] are involved, the number of garbage manuscripts from another great Eastern country [likely India] has increased significantly over the past two years). My personal understanding is that the fundamental reason is the severe decline in the credibility of so-called top conferences. In other words, these researchers have lost their reverence for descriptors like “Top,” “Tier 1,” or “S-class,” and have abstracted acceptance into a purely random event. This shows in the fact that even as a long-serving reviewer for these top conferences, I cannot tell whether a particular submission from myself or my students will be accepted; I can’t even say for what reasons it might be rejected. And yet, I’m still f***ing reviewing other people’s papers! One can imagine how ridiculously random the current reviewing process is (let’s not even mention the top three AI/ML conferences; even UAI/AISTATS are approaching a similar state). In my ideal conception, a top conference should have enough credibility that low-quality manuscripts would realize they cannot be accepted and give up before even submitting, just as not everyone would attempt to submit to the four great mathematics journals. But when I find myself cursing “what is this trash?” at the posters every year, I can’t find any reason not to make such speculative submissions myself.

However, unlike the market inefficiency caused by human nature, I personally see the decline in the credibility of top conferences as driven by institutional reasons. Excluding those who pursue Reviewer/AC roles as “hats” for utilitarian purposes (I don’t know whether that excludes most people), a Reviewer/AC who “generates electricity with love” (works purely out of passion) should be doing this job primarily to make our environment better; otherwise, they could simply decline without any penalty. Specifically, there are two main problems. First, reviewer expertise is genuinely limited, and this limitation comes not only from insufficient experience (e.g., the bizarre precedent set by a certain top ML conference where submitters volunteer as reviewers) but also from the explosive growth in research fields and in the number of papers within them, so much so that without spending several hours doing a survey, you might not even know where the paper you’re reviewing stands or what its real contributions to the field are (especially in interdisciplinary fields). Second, the role of ACs in the current review process is quite weak (especially for those who treat the AC role as a “hat”). Since an AC might be handling dozens of papers, they often have no choice but to make simple decisions based on whether the review scores lean positive or negative. This, in turn, further exacerbates reviewers’ tendency to give borderline scores: reviewers are unwilling to take responsibility for the AC’s simple positive-or-negative decision and prefer to give a middling score while waiting for a more definitive review from someone else. Thus a death spiral forms: the review process, which should be a group discussion that reaches a conclusion, becomes a personal show for some “crazy reviewer.” If they dislike a paper and give it a low score, the paper is almost certainly rejected, even if the reviewer is unqualified, their comments are childish and meaningless, or they simply copy-paste review comments from a submission to another conference (I’ve encountered this multiple times).

Here, I will summarize some failed institutional attempts and offer a possible solution. First, the biggest failure is open review: it essentially feeds more information into an already inefficient market for evaluating academic value, so it was doomed to be meaningless from the start. Judging by the results, open review not only failed to improve the quality of manuscripts and of the corresponding review comments (anecdotal evidence from my surroundings; feel free to correct me if you disagree) but also caused side effects such as the homogenization of review comments and the “borrowing” of academic results. Of course, I’m not rejecting this approach entirely; for high-quality papers, reading the reviews and rebuttals can help raise one’s own standards, but as a system for optimizing the review process itself, it is a failure. Second is multi-round review, i.e., journal-style reviewing. The intention was to enhance communication between authors and reviewers, but in practice most reviewers do not provide second-round opinions (especially for clear rejections), leaving authors shouting into the void, which severely degrades the submission experience (e.g., I will never again submit to certain conferences/journals). There was also an attempt at a “no rating” review model, but the result was basically equivalent to a 3-tier system: accept / get lost / whatever. If ACs lack a sense of responsibility, it more easily leads to one-vote vetoes. Besides, if I’m not going to be publicly shamed by a reject, why not submit something “just for the heck of it”? What if it gets in?

So, what institutional solutions might improve the manuscript quality at such top conferences? Although I haven’t managed a conference, based on my long-term “scarring” experience, I believe adopting an “Abstract-Invited Submission” process could partially solve this problem. This process has three steps: First, all authors wishing to submit should provide a short (CVPR double-column, 2-4 pages) abstract (abstracts allow dual submission). This abstract is then scored by general volunteer reviewers. Subsequently, ACs select a portion of abstracts with higher scores or those they personally deem valuable and send submission invitations to their authors. Finally, the full manuscripts submitted by invited authors (full manuscripts do not allow dual submission) are reviewed by teams led by senior reviewers, and a selection of the best is accepted. This process both exempts reviewers from writing formal reviews for poor manuscripts and gives invited authors a relatively clear expectation of acceptance. If any big shots happen to see this, perhaps they could test its effectiveness.
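For concreteness, here is a minimal sketch, in Python, of how the three stages might be wired together. It is purely illustrative: the class names, quotas, and score aggregation below are hypothetical choices made for the sketch, not part of the proposal itself.

```python
# Illustrative sketch of the proposed "Abstract-Invited Submission" pipeline.
# All names, thresholds, and data structures here are hypothetical; the proposal
# only specifies the three stages, not any concrete parameters.

from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Abstract:
    paper_id: int
    scores: list[float] = field(default_factory=list)  # scores from volunteer reviewers

    @property
    def mean_score(self) -> float:
        return mean(self.scores) if self.scores else 0.0


def stage1_score_abstracts(abstracts: list[Abstract]) -> list[Abstract]:
    """Stage 1: short (2-4 page) abstracts are scored by general volunteer reviewers."""
    # In reality each abstract would be assigned to several volunteers;
    # here the scores are assumed to have been filled in already.
    return abstracts


def stage2_invite(abstracts: list[Abstract], quota: int, ac_picks: set[int]) -> list[int]:
    """Stage 2: ACs invite the top-scoring abstracts, plus any they personally value."""
    ranked = sorted(abstracts, key=lambda a: a.mean_score, reverse=True)
    invited = {a.paper_id for a in ranked[:quota]} | ac_picks
    return sorted(invited)


def stage3_review_full(invited_ids: list[int], full_scores: dict[int, float],
                       accept_quota: int) -> list[int]:
    """Stage 3: invited full manuscripts are reviewed by senior-reviewer-led teams;
    the best of them are accepted."""
    submitted = [(pid, full_scores[pid]) for pid in invited_ids if pid in full_scores]
    submitted.sort(key=lambda x: x[1], reverse=True)
    return [pid for pid, _ in submitted[:accept_quota]]


if __name__ == "__main__":
    abstracts = [
        Abstract(1, [4.0, 5.0]), Abstract(2, [2.0, 1.0]),
        Abstract(3, [3.5, 4.0]), Abstract(4, [1.0, 2.0]),
    ]
    invited = stage2_invite(stage1_score_abstracts(abstracts), quota=2, ac_picks={4})
    accepted = stage3_review_full(invited, {1: 4.5, 3: 3.0, 4: 2.0}, accept_quota=2)
    print("invited:", invited)    # [1, 3, 4]
    print("accepted:", accepted)  # [1, 3]
```

The only design decision the sketch commits to is the one stated above: volunteer scoring gates the invitations, ACs may add their own picks on top of the ranking, and only invited full manuscripts compete for acceptance.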

II. Measuring Article Value from an Epistemological Perspective

Next, I want to talk about those five poor sods whose papers were given a 1-point Reject within 10 minutes (of course, from another perspective, I might be the unlucky one, having to waste so much time providing unrewarded guidance on what are clearly “drafts for comment”). This begins with Kant’s account of knowledge, because in my philosophy, if a research manuscript cannot provide knowledge, it should be rejected. According to my rudimentary understanding of Deng Xiaomang’s translation, Kant, in the “Critique of Pure Reason,” proposed a framework distinguishing a priori (conceptual) from a posteriori (empirical) knowledge. The former is knowledge that exists prior to experience; it concerns form and logical structure and does not depend on specific empirical content. The latter is based on sensory experience; it depends on our interaction with the external world (there are some translation issues here; see the Q&A below). Kant believed that although pure reason can produce necessary knowledge, such as the laws of logic and mathematics, this knowledge does not directly tell us facts about the physical world. Conversely, all knowledge of the physical world comes from experience, but the very possibility of such knowledge is determined by a priori conditions. Knowledge therefore requires combining sensory intuition with rational concepts (making synthetic judgments).

I know you might dislike this act of quoting convoluted philosophical scriptures, so to put it in plain language: firstly, analyses of single concepts (“XXX is the cause of problem YYY, so using method ZZZ can solve it”) or single experimental results (“We used XXX, so we achieved SOTA on YYY”) are insufficient for an article to produce necessary knowledge. Secondly, by performing some meaningless, intimidating mathematical derivations (“a theorem that requires ten assumptions to prove”) or seemingly complete but very hollow ablation studies (“we removed every component and ran an experiment once”), it is also difficult to integrate the views or assertions in an article into necessary knowledge (even though most people do this).

I suspect you will still find the above explanation abstruse, so next I will go through some paper-writing paradigms directly and discuss whether each type of article is, in my eyes, truly meaningful (a compact schematic restatement follows the list):

  • Based on theory T or experiment E, conclude A holds: This type of article might seem invalid at first glance, but in reality, the subject of such articles is usually a relatively new field. For example, the first paper proposing adversarial attacks. Providing “a new field” itself is a result of the joint action of reason and experience, so such articles are usually considered very valuable.
  • Based on theory T, conclude A holds, and simultaneously based on experiment E, conclude A’ holds: This type of article is common in the classic machine learning field and is a very solid way of argumentation. Although A’ is an approximation of A, it is sufficient to provide reasonable and reliable experience to form definite knowledge.
  • Based on experiment E, conclude A holds, and simultaneously based on theory T, conclude A’ holds: This writing paradigm is the erroneous path I mentioned earlier of using intimidating mathematics to improve article quality, often appearing in algorithm-focused deep learning articles in known fields. But in my view, if reason is needed to “complete” knowledge, it should participate in the process of deriving A from experiments; your theory should be an analysis of the experimental results. Analyzing an approximate A’ is a process separated from experience and does not in itself form knowledge. So, the strength of this article still depends on the strength of experiment E. Many people feel wronged when they “were doing experiments and writing theory and suddenly got rejected,” but in fact, it’s not unjust at all.
  • Given that based on experiment E, A holds; based on experiment E’, conclude A’ holds: Unlike deriving A from theory, when a similar experiment is used to verify a similar result, the upper bound of its strength is the original inference from E to A, which is not necessary knowledge (it may have flaws and yet have been accepted). So the premise of this writing paradigm is itself weak, and the lack of rational analysis makes it even more hollow.
  • Given that based on experiment E, A and B hold; based on experiment F, conclude A+B holds: This paradigm is our most common “Frankenstein monster” (patchwork) model. Its main problem lies in innate human thinking: if two things that hold true are stitched together, they will naturally hold true. Therefore, this experiment F is considered to offer no new experience for forming knowledge (meaning, “is it necessary to experiment on something that’s a no-brainer?”). But conversely, whether this logic is correct actually varies from person to person, so some people might also think that “A+B holds” does provide new knowledge, but its strength might not be very good, i.e., what’s commonly called a “filler paper.” Of course, I usually reject such articles decisively, so please rest assured.
  • Given that based on experiment E, A, B, C, D hold; based on experiment F, conclude A’+B+C’+D holds: This is a typical systematic application article. Although it might also seem like a filler paper, the interesting point is that finding the combination of A+B+C+D and replacing components within it is itself a manifestation of reason at work. Compared to simply A+B, finding a reasonable scheme to solve an application problem is a good manifestation of knowledge. However, whether it should appear in a top conference also depends on the strength of this application and its results.
  • Given that based on experiment E, A={A1, A2} holds; and simultaneously based on experiment E’, conclude A={A1} holds or does not hold: This paradigm is actually very interesting. It’s different from stitching A+B together; instead, it examines whether A itself holds. Its rationality lies in a rigorous analysis of A’s composition rather than a relatively loose combination of different A and B. Even if the result is that A does not hold, as long as the experiment is strong enough (for example, the seminal paper demonstrating that Adam, the optimizer, does not hold), I would still be quite willing to give it a very high evaluation.
  • Given that based on experiment E, A={A1, A2} holds; and simultaneously based on experiment E’, conclude A={A1, A2’} holds: This paradigm (I don’t even think it can be called a paradigm) is common in articles by novice researchers. To put it bluntly, it’s about taking someone else’s code, changing one component (perhaps doing some ablation studies), and then the performance improves. In my view, the premise for knowledge in this article is entirely built upon the earlier experiments for A, so it provides almost no new knowledge. Even if the author provides a theory to demonstrate that A’={A1, A2’’} holds, it is still a straightforward inference from known knowledge (and its strength is limited by A itself). Even if such an article has very good results (meeting the standard for a “filler paper”), I find it difficult to give it a high evaluation. Moreover, to reject it, one usually has to write a lot of arguments demonstrating why it doesn’t provide knowledge, which is truly exhausting.
  • Given that based on experiment E, A={A1, A2} holds; based on experiment F or theory T, conclude A={A1, A3} holds: Here, A3 is different from the aforementioned A2’; it’s a completely different component. As long as experiment F is sufficiently strong, I personally consider it a good paradigm for providing knowledge.
  • Given that based on experiment E, A={A1, A2} and B={B1, B2} hold; and simultaneously based on a further experiment, conclude C={A1, B2} holds: This writing paradigm is difficult to describe in a few words. It’s better than directly stitching two solutions together, but reason still doesn’t seem to have played a sufficient role. I can’t give a definitive verdict, but usually such articles can achieve an effect similar to one of the paradigms above by restructuring the framework (the story).
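As promised above, here is a compact schematic restatement of the main paradigms. The notation is an informal shorthand introduced here purely for readability (it is not standard logic): T ⊢ A means “claim A is derived from theory T,” E ⊨ A means “claim A is supported by experiment E,” and “given X; show Y” means X is prior work taken as a premise while Y is the paper’s contribution.

```latex
% Informal shorthand (assumes amsmath):
%   T \vdash A   -- claim A is derived from theory T
%   E \models A  -- claim A is supported by experiment E
%   "given X; show Y" -- X is prior work taken as a premise, Y is the paper's contribution
\begin{align*}
&\textbf{New field:}          && \text{show } T \vdash A \ \text{or} \ E \models A \\
&\textbf{Solid:}              && \text{show } T \vdash A \ \text{and} \ E \models A' \quad (A' \approx A) \\
&\textbf{Decorative theory:}  && \text{show } E \models A \ \text{and} \ T \vdash A' \quad (\text{strength still rests on } E) \\
&\textbf{Weak replication:}   && \text{given } E \models A;\ \text{show } E' \models A' \\
&\textbf{Frankenstein:}       && \text{given } E \models A,\, B;\ \text{show } F \models A{+}B \\
&\textbf{Component swap:}     && \text{given } E \models A=\{A_1,A_2\};\ \text{show } E' \models \{A_1,A_2'\} \\
&\textbf{New component:}      && \text{given } E \models A=\{A_1,A_2\};\ \text{show } F \models \{A_1,A_3\}
\end{align*}
```

The labels on the left are just shorthand for the corresponding descriptions above; the systems-paper, decomposition, and hybrid paradigms are omitted because they do not compress well into a single line.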


