-
How Many Different Outputs Can a Transformer Generate?
Maxime Meyer*, Mario Michelessa*, Caroline Chaux, Vincent Y. F. Tan.
We formalize the notion of accessible sequences—those that a transformer can produce for some prompt—and quantify how accessibility depends on prompt length and model parameters, closely matching empirical values.
We show that only a handful of characteristics of a transformer's architecture suffice to closely predict the number of different sentences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor of less than 10 across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.
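The notion of accessible outputs can be illustrated with a toy experiment (our sketch, not the paper's setup): fix a tiny random autoregressive model, sweep over every prompt of a given length, greedy-decode a fixed-length continuation for each, and count how many distinct continuations appear relative to all possible sequences. The model below is a crude stand-in (a mean-pooled embedding followed by a linear head), not an actual transformer.

```python
# Toy sketch: count distinct greedy-decoded outputs over all prompts of a
# given length. The "model" is a hypothetical stand-in, not a transformer.
import itertools
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, OUT_LEN = 4, 8, 5
E = rng.normal(size=(VOCAB, D))   # token embeddings
W = rng.normal(size=(D, VOCAB))   # output head

def greedy_decode(prompt, out_len=OUT_LEN):
    toks = list(prompt)
    for _ in range(out_len):
        h = E[toks].mean(axis=0)               # crude context summary
        toks.append(int(np.argmax(h @ W)))     # greedy next token
    return tuple(toks[len(prompt):])

for plen in (1, 2, 3):
    prompts = itertools.product(range(VOCAB), repeat=plen)
    accessible = {greedy_decode(list(p)) for p in prompts}
    print(f"prompt length {plen}: {len(accessible)} of "
          f"{VOCAB ** OUT_LEN} length-{OUT_LEN} outputs accessible")
```

Since decoding is deterministic, at most `VOCAB ** plen` outputs are accessible from prompts of length `plen`, mirroring the paper's point that accessibility is governed by prompt length rather than by the size of the output space.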
-
Online Learning-to-Defer with Varying Experts.
Maxime Meyer*, Hoang Duy Dang*, Yannis Montreuil*, Lai Xing Ng, Axel Carlier, Wei Tsang Ooi.
We study Learning-to-Defer in the online setting, which accounts for varying expert availability and accuracy.
Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distributions. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of $O((n+n_e)T^{2/3})$ in general and $O((n+n_e)\sqrt{T})$ under a low-noise condition, where $T$ is the time horizon, $n$ the number of labels, and $n_e$ the number of distinct experts observed across rounds. The analysis builds on novel $\mathcal H$-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.
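The online deferral setting can be pictured with a minimal epsilon-greedy loop (an illustrative sketch of ours, not the paper's $\mathcal H$-consistent surrogate method): at each round the learner either answers with the model or defers to one of the currently available experts, and observes only the reward of the chosen arm. The accuracies and the availability schedule below are hypothetical.

```python
# Epsilon-greedy sketch of online learning-to-defer with bandit feedback
# and a varying expert pool. Accuracies and schedule are made up.
import random

random.seed(0)
T, EPS = 2000, 0.1
MODEL_ACC = 0.4                      # hypothetical model accuracy
EXPERT_ACC = {"e1": 0.7, "e2": 0.9}  # hypothetical expert accuracies

value, count, total = {}, {}, 0.0
for t in range(T):
    # expert pool varies: e2 drops out every other 500-round block
    avail = ["model", "e1"] + (["e2"] if (t // 500) % 2 == 0 else [])
    for a in avail:
        value.setdefault(a, 0.0)
        count.setdefault(a, 0)
    if random.random() < EPS:
        arm = random.choice(avail)                # explore
    else:
        arm = max(avail, key=lambda a: value[a])  # exploit best estimate
    acc = MODEL_ACC if arm == "model" else EXPERT_ACC[arm]
    reward = 1.0 if random.random() < acc else 0.0  # bandit feedback only
    count[arm] += 1
    value[arm] += (reward - value[arm]) / count[arm]  # incremental mean
    total += reward
print(f"average accuracy over {T} rounds: {total / T:.3f}")
```

The loop only ever compares arms that are available in the current round, which is the structural difference from batch L2D: the comparator itself changes as experts come and go.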
-
Memory Limitations of Prompt Tuning in Transformers.
Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan.
We prove that the amount of information a transformer can memorize in a prompt scales at most linearly with its length, and that transformers suffer inherent memory limitations, regardless of prompt length.
Despite the empirical success of prompt tuning in adapting pretrained language models to new tasks, theoretical analyses of its capabilities remain limited. Existing theoretical work primarily addresses universal approximation properties, demonstrating results comparable to standard weight tuning. In this paper, we explore a different aspect of the theory of transformers: the memorization capability of prompt tuning. We provide two principal theoretical contributions. First, we prove that the amount of information memorized by a transformer cannot scale faster than linearly with the prompt length. Second, and more importantly, we present the first formal proof of a phenomenon empirically observed in large language models: performance degradation in transformers with extended contexts. We rigorously demonstrate that transformers inherently have limited memory, constraining the amount of information they can retain, regardless of the context size. This finding offers a fundamental understanding of the intrinsic limitations of transformer architectures, particularly their ability to handle long sequences.
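The linear scaling in the first contribution can be motivated by a rough counting heuristic (our back-of-envelope sketch under a finite-precision assumption, not the paper's proof technique):

```latex
% Illustrative counting heuristic, not the paper's argument.
% A tuned soft prompt consists of m vectors in R^d stored at b bits of
% precision, so there are at most 2^{mdb} distinct prompts, and hence
\[
  I(\text{memorized content};\, \text{model outputs})
  \;\le\; \log_2 \bigl|\{\text{possible prompts}\}\bigr|
  \;\le\; m\, d\, b \;=\; O(m) \quad \text{bits, for fixed } d \text{ and } b.
\]
```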
-
Online Learning of Pure States is as Hard as Mixed States.
Maxime Meyer, Soumik Adhikary, Naixu Guo, Patrick Rebentrost.
We study online learning of quantum states and prove that learning a pure state is as hard as learning a mixed state (in terms of fat-shattering dimension); that is, the rank of the density matrix has no impact on the complexity of the learning problem.
Quantum state tomography, the task of learning an unknown quantum state, is a fundamental problem in quantum information. In standard settings, the complexity of this problem depends significantly on the type of quantum state that one is trying to learn, with pure states being substantially easier to learn than general mixed states. A natural question is whether this separation holds for any quantum state learning setting. In this work, we consider the online learning framework and prove the surprising result that learning pure states in this setting is as hard as learning mixed states. More specifically, we show that both classes share almost the same sequential fat-shattering dimension, leading to identical regret scaling. We also generalize previous results on full quantum state tomography in the online setting to (i) the $\varepsilon$-realizable setting and (ii) learning the density matrix only partially, using smoothed analysis.
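As background for the setting (not this paper's dimension result), the online state-learning loop can be sketched with a standard matrix-exponentiated-gradient update: maintain a Hermitian parameter $H$, predict $\mathrm{Tr}(E\,w)$ for the density matrix $w \propto e^{H}$, and step $H$ along the gradient of the squared loss. All constants below are arbitrary.

```python
# Matrix-exponentiated-gradient sketch for online prediction of Tr(E rho),
# illustrating the online state-learning setting; constants are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
d, T, eta = 4, 300, 0.2

psi = rng.normal(size=d) + 1j * rng.normal(size=d)  # hidden pure state
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())

def density(H):
    # normalized matrix exponential of a Hermitian matrix
    vals, vecs = np.linalg.eigh(H)
    e = np.exp(vals - vals.max())
    M = (vecs * e) @ vecs.conj().T
    return M / np.trace(M).real

H = np.zeros((d, d), dtype=complex)
losses = []
for t in range(T):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    A = A @ A.conj().T                         # random PSD observable
    Eop = A / np.linalg.eigvalsh(A).max()      # scale so 0 <= E <= I
    w = density(H)                             # current hypothesis state
    pred = np.trace(Eop @ w).real
    truth = np.trace(Eop @ rho).real
    losses.append((pred - truth) ** 2)
    H -= eta * 2.0 * (pred - truth) * Eop      # exponentiated-gradient step
print(f"mean loss, first/last 50 rounds: "
      f"{np.mean(losses[:50]):.4f} / {np.mean(losses[-50:]):.4f}")
```

Note that the learner's hypothesis $w$ is always full rank even when the target is pure, which is one intuition for why rank does not separate the two cases in this framework.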