Research on Large Language Models Based on CCCP-Era Technologies (awesome-cccp-llm)
This document provides a systematic review of key research literature on Large Language Models (LLMs), grounded in classical mathematical and engineering theories developed during the mid-20th century. These theories offer a robust analytical framework for understanding the internal mechanisms, interpretability, robustness, and controllability of LLMs. They also offer a viable path of “academic archaeology” for PhD students searching for a research direction.
Stochastic Processes & Statistical Physics
This section treats LLMs as high-dimensional dynamical systems, utilizing Markov chains, Stochastic Differential Equations (SDEs), and Random Matrix Theory to analyze their dynamical evolution and parameter structures.
Large Language Models as Markov Chains
https://arxiv.org/pdf/2410.02724v2
This paper proposes an innovative analytical perspective by abstracting the inference mechanism of LLMs (rather than the language itself) as a Markov chain over the vocabulary space. By constructing this model, researchers can apply Markov chain theory to analyze the system’s stationary distribution, thereby gaining a theoretical understanding of the long-term generation behavior of LLMs.
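To make the stationary-distribution idea concrete, here is a minimal sketch (a toy 3-token vocabulary with an arbitrary transition matrix, not the paper's actual construction): the stationary distribution $\pi$ satisfies $\pi P = \pi$ and describes long-run token frequencies.

```python
import numpy as np

# Toy next-token transition matrix over a 3-token "vocabulary":
# row i gives the distribution of the next token given current token i.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# The stationary distribution is the left eigenvector of P for
# eigenvalue 1, i.e. the eigenvector of P.T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi /= pi.sum()                    # normalize to a probability vector

print(pi)                         # long-run token frequencies
```

For an actual LLM the state space is the full vocabulary (or bounded context windows over it), so the stationary distribution is characterized analytically rather than via a dense eigendecomposition.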
A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process
https://arxiv.org/pdf/2501.16783
This paper focuses on LLM safety, specifically the self-amplification of potential biases within “Chain-of-Thought” (CoT) processes. Its core contribution is modeling this dynamic process as a continuous-time Stochastic Differential Equation (SDE). Based on this model, the study uses the Fokker-Planck equation to analyze the system’s phase-transition characteristics. This perspective offers a new diagnostic criterion for LLM safety: ensuring the system remains in a “subcritical” region to prevent the runaway amplification of bias.
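The subcritical/supercritical distinction can be illustrated with a toy linear SDE, $dX = \lambda X\,dt + \sigma\,dW$, simulated via Euler-Maruyama (an illustrative stand-in for the paper's severity dynamics, not its actual model): for $\lambda < 0$ severity stays bounded, while for $\lambda > 0$ it drifts toward blow-up.

```python
import numpy as np

def simulate(lam, sigma=0.1, x0=0.5, dt=0.01, steps=2000, seed=0):
    """Euler-Maruyama for dX = lam * X dt + sigma dW (toy severity model)."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(steps):
        x += lam * x * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

sub = simulate(lam=-0.5)   # subcritical: severity stays near zero
sup = simulate(lam=+0.5)   # supercritical: severity amplifies exponentially
print(sub, sup)
```

The safety criterion then amounts to keeping the effective drift coefficient of the CoT dynamics in the subcritical regime.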
Small Singular Values Matter: A Random Matrix Analysis of Transformer Models
https://arxiv.org/pdf/2410.17770
This study employs Random Matrix Theory (RMT) to analyze the structure of high-dimensional LLM weight matrices. The core methodology uses RMT (specifically the Marchenko-Pastur law) as a “null model” describing pure noise. By observing the degree to which the singular-value distribution of actual weight matrices deviates from the RMT prediction, the researchers identify where information is stored. A key finding is that “small singular values matter”: information is stored not only in the largest singular values but also in the smallest ones that deviate from RMT predictions, which has significant implications for model pruning and interpretability.
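A minimal sketch of the null-model test (using a synthetic random matrix in place of a real weight matrix): eigenvalues of $W^\top W$ for i.i.d. noise fall inside the Marchenko-Pastur bulk $[\sigma^2(1-\sqrt{q})^2,\ \sigma^2(1+\sqrt{q})^2]$, so eigenvalues outside it are candidates for information-carrying directions.

```python
import numpy as np

n, m = 1000, 400                  # matrix shape; aspect ratio q = m / n
q = m / n
sigma = 1.0                       # assumed noise scale
# A pure-noise "weight matrix" with entry variance 1/n:
W = np.random.default_rng(0).standard_normal((n, m)) / np.sqrt(n)

s = np.linalg.svd(W, compute_uv=False)
lam = s**2                        # eigenvalues of W^T W
lam_plus = sigma**2 * (1 + np.sqrt(q))**2    # upper edge of MP support
lam_minus = sigma**2 * (1 - np.sqrt(q))**2   # lower edge of MP support

outliers = np.sum((lam > lam_plus) | (lam < lam_minus))
print(f"eigenvalues outside MP bulk: {outliers} of {len(lam)}")
```

For a pure-noise matrix nearly all eigenvalues land inside the bulk; for a trained weight matrix, the paper's point is that the informative deviations appear at both the large-eigenvalue and the small-eigenvalue edges.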
Black-box Detection of LLM-generated Text Using Generalized Jensen-Shannon Divergence
https://arxiv.org/abs/2510.07500v1
This paper introduces “SurpMark,” a method for detecting AI-generated text. The central idea is to model the dynamic changes of token “surprisal” during the generation process as a Markov chain. Efficient detection is achieved by calculating the Generalized Jensen-Shannon Divergence (GJS score) between this chain and a reference chain of human writing.
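The generalized Jensen-Shannon divergence at the heart of the score can be sketched directly (with toy surprisal-state distributions, not SurpMark's actual chains): $\mathrm{GJS}_w(P_1,\dots,P_n) = H\!\left(\sum_i w_i P_i\right) - \sum_i w_i\,H(P_i)$, which is zero iff all the distributions coincide.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gjs(dists, weights):
    """Generalized Jensen-Shannon divergence:
    H(sum_i w_i P_i) - sum_i w_i H(P_i)."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mix = weights @ dists
    return entropy(mix) - sum(w * entropy(p) for w, p in zip(weights, dists))

# Toy distributions over three surprisal states (low / medium / high):
human = np.array([0.2, 0.5, 0.3])
model = np.array([0.6, 0.3, 0.1])
score = gjs([human, model], [0.5, 0.5])
print(score)
```

In the detection setting, the two arguments are the empirical surprisal-transition statistics of the text under test and of a human-writing reference, and a large score flags machine generation.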
Quantitative Representation & Information Theory
This section utilizes two paradigms of information theory—Kolmogorov’s algorithmic perspective and Shannon’s probabilistic perspective—as “probes” to measure and attribute information flow within LLMs.
Position: Understanding LLMs Requires More Than Statistical Generalization
https://arxiv.org/abs/2405.01964
This paper uses Kolmogorov Complexity (KC) as a profound theoretical tool. Its core contribution is arguing that one of the fundamental reasons deep learning models (including LLMs) can generalize lies in their inherent “Simplicity Bias”—the tendency of models to learn simpler functions (i.e., functions with lower KC) during the training process.
The KoLMogorov Test: Compression by Code Generation
https://arxiv.org/abs/2503.13992v1
This paper demonstrates a clever “meta-analysis” approach. It inverts the analytical relationship by using code-generating LLMs (CodeLMs) as tools to estimate the upper bound of the Kolmogorov Complexity for any given sequence $x$. This is implemented by prompting the LLM to generate the shortest program $p$ that outputs $x$, where the length of $p$ serves as the estimate for KC.
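The upper-bound idea is easy to demonstrate without an LLM: the length of any program that reproduces $x$ (or of any lossless compression of $x$) upper-bounds $K(x)$ up to a constant. A minimal illustration, with `zlib` standing in as a cheap non-LLM baseline:

```python
import zlib

x = "ab" * 500                     # a highly regular 1000-character string
program = "print('ab' * 500)"      # a much shorter description of x

# Both the program length and the compressed size upper-bound K(x):
print(len(x), len(program), len(zlib.compress(x.encode())))
```

In the paper's setup a CodeLM is prompted to search for such short programs automatically, and the length of the best program found serves as the complexity estimate.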
Entropy-Lens: The Information Signature of Transformer Computations
https://arxiv.org/abs/2502.16570v1
This paper proposes an analytical technique called “Entropy Flow.” Its core contribution lies in tracing the evolution of Shannon entropy in token representations layer-by-layer within the Transformer. By visualizing and quantifying where information is “processed,” “compressed,” or “combined,” it opens the “black box” of internal information processing.
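The per-layer quantity being traced is simply the Shannon entropy of the vocabulary distribution read out at each layer. A minimal sketch with random stand-in logits (in the paper these come from unembedding each layer's residual stream; here growing logit scale mimics a belief sharpening with depth):

```python
import numpy as np

def shannon_entropy(logits):
    """Entropy (nats) of the softmax distribution over the vocabulary."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)

# Hypothetical per-layer logits for one token position (vocab size 50):
rng = np.random.default_rng(0)
layer_logits = [scale * rng.standard_normal(50) for scale in (0.1, 1.0, 5.0)]
profile = [shannon_entropy(l) for l in layer_logits]
print(profile)   # entropy typically falls as the distribution sharpens
```

Plotting such profiles across layers and token positions yields the "information signature" the paper uses to compare computations.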
Cybernetics & Optimal Control
This section treats the LLM as a system to be “manipulated,” applying the conceptual frameworks of Cybernetics (Wiener) and the mathematical tools of Optimal Control (Pontryagin) to achieve model alignment and controllability.
Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space
https://arxiv.org/abs/2510.26219v1
This paper presents AISP, a training-free alignment method. The core idea is to apply a perturbation $u$ as a “control input” in the LLM’s pre-logit space and use optimal control techniques to guide the LLM toward generating high-reward outputs in real-time, achieving controllability during the inference phase.
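The sampling-based control loop can be sketched in a few lines (all names, the linear readout, and the reward below are illustrative stand-ins, not AISP's actual implementation): sample candidate perturbations $u$, apply each to the pre-logit hidden state, and keep the one with the highest reward.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 100
W = rng.standard_normal((vocab, d))   # stand-in unembedding matrix
h = rng.standard_normal(d)            # stand-in pre-logit hidden state
target = 7                            # toy reward: logit of a target token

def reward(u):
    """Score the logits produced by the perturbed hidden state h + u."""
    return (W @ (h + u))[target]

# Sampled control inputs, always including "no control" as a baseline:
candidates = np.vstack([np.zeros(d), 0.5 * rng.standard_normal((64, d))])
best = max(candidates, key=reward)
print(reward(best))
```

The training-free character of the method comes from this structure: only forward passes and reward evaluations are needed, with no gradient updates to the model.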
Data Selection via Optimal Control for Language Models
https://arxiv.org/abs/2410.07064
This study extends the application of optimal control theory from “test-time” to “training-time.” Its core contribution is the rigorous formalization of the pre-training data selection process as an optimal control problem, with the goal of calculating an optimal data selection strategy that maximizes the model’s final performance.
Prompt Engineering Through the Lens of Optimal Control
https://arxiv.org/abs/2310.14201
This paper applies the optimal control framework to prompt engineering. Its core contribution is formalizing the prompt optimization process as an optimal control problem, seeking the optimal “control input” (i.e., the prompt) to steer the model toward a desired output state.
What’s the Magic Word? A Control Theory of LLM Prompting
https://arxiv.org/abs/2310.04444v4
This paper critically analyzes the limitations of directly applying Classical Control Theory (CCT) to LLMs. Its core contribution points out a fundamental “impedance mismatch” between the two. The analysis notes that LLM dynamics are discrete, high-dimensional, and possess a “Shift-and-Grow” characteristic (the dynamic growth of the KV cache) that CCT is unequipped to handle.
Unveiling LLM Mechanisms Through Neural ODEs and Control Theory
https://arxiv.org/abs/2406.16985v1
The core contribution of this paper is the re-modeling of the Transformer’s discrete dynamical process as continuous-time Neural Ordinary Differential Equations (Neural ODEs). This builds a bridge for applying continuous-domain control theory to analyze and ensure the stability and reliability of LLM training dynamics.
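The discrete-to-continuous correspondence rests on the observation that a residual update $h_{t+1} = h_t + f(h_t)$ is the Euler discretization (with step 1) of the ODE $dh/dt = f(h)$. A toy sketch with a stable linear "layer" shows how step size changes the dynamics:

```python
import numpy as np

A = np.array([[-0.5, 1.0], [-1.0, -0.5]])   # toy stable linear "layer"
f = lambda h: A @ h

def integrate(h0, dt, steps):
    """Euler integration of dh/dt = f(h)."""
    h = h0.copy()
    for _ in range(steps):
        h = h + dt * f(h)
    return h

h0 = np.array([1.0, 0.0])
coarse = integrate(h0, dt=1.0, steps=4)     # 4 "residual layers"
fine = integrate(h0, dt=0.01, steps=400)    # near-continuous flow to t = 4
print(coarse, fine)                          # coarse grows; fine decays
```

The fine trajectory tracks the true contracting flow $e^{tA}h_0$, while the unit-step residual iteration is expansive for the same $f$; it is exactly this continuous-time view that lets stability results from control theory be brought to bear on training dynamics.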
Signal Processing & Frequency Domain Analysis
This section treats language and Transformer components (such as positional encodings) as “signals,” using tools like Fourier analysis and Wavelet transforms to deconstruct them and guide architectural improvements.
Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
https://arxiv.org/abs/2412.17739v1
This paper treats Transformer positional encoding as a signal processing problem. Its core contribution is a deep frequency-domain analysis of the internal working mechanism of RoPE (Rotary Position Embedding), mathematically linking it to the Non-Uniform Discrete Fourier Transform (NU-DFT).
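A minimal RoPE sketch makes the frequency-domain view tangible: each feature pair $(2i, 2i{+}1)$ is rotated by angle $m\,\theta_i$ at position $m$, with $\theta_i = \text{base}^{-2i/d}$, and the set $\{\theta_i\}$ is precisely the non-uniform frequency grid analyzed through the NU-DFT lens.

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Apply rotary position embedding to vector x at position m."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Key property: the attention score depends only on relative position.
s1 = rope(q, 5) @ rope(k, 3)
s2 = rope(q, 12) @ rope(k, 10)
print(s1, s2)   # equal, since 3 - 5 == 10 - 12
```

Length generalization then becomes a question of how these fixed frequencies extrapolate beyond the positions seen in training, which is where the frequency-domain diagnosis applies.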
Wavelet-based Positional Representation for Long Context
https://arxiv.org/abs/2502.02004
This paper provides a complete closed loop from theoretical analysis to architectural improvement. The analysis points out that while RoPE is similar to a wavelet transform, its key flaw (poor extrapolation performance) lies in using only fixed scale parameters. Based on this diagnosis, the study proposes a new positional encoding method using the full wavelet transform—capable of capturing multi-scale information—and empirically demonstrates that this method significantly improves the model’s extrapolation capabilities on long sequences.