Can autoregressive models be trained to produce more cataphoric text?
This explores cataphora — when a text refers forward to something it hasn't introduced yet ('Before *she* spoke, Maria paused') — and whether left-to-right next-token models can be trained to plan those forward references, or whether the autoregressive setup itself is the obstacle.
The honest answer the corpus points toward is that this is less a training problem than a generation-order problem. Cataphora requires the model to commit to a global plan before it emits the early tokens that depend on that plan: the pronoun in 'Before *she* spoke' is only coherent if 'Maria' has, in some sense, already been decided. Autoregressive generation produces text strictly left-to-right, each token conditioned only on what came before, which is exactly the regime where forward commitments are hardest to honor. Several notes here frame transformer generation as *flow rather than storage*: knowledge exists only in the act of performance, contextual and inseparable from the unfolding sequence (Do transformer models store knowledge or generate it continuously?). A model that improvises forward has no natural place to stash 'and here's what that pronoun will turn out to mean.'
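The generation-order point can be written down with the standard chain-rule factorization. This is a sketch in notation that the notes themselves do not use: the indices i (position of the forward-pointing pronoun), j (position of the later antecedent), and T (sequence length) are introduced here only for illustration.

```latex
% Left-to-right factorization: each token sees only its past.
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})

% Cataphoric pronoun at position i, antecedent at position j > i.
% By the law of total probability, the pronoun's distribution is an
% implicit average over every antecedent the text might later reveal:
p(x_i \mid x_{<i}) = \sum_{x_j} p(x_i \mid x_{<i}, x_j)\, p(x_j \mid x_{<i})
```

Nothing in the factorization forbids cataphora, since the sum is well defined. The difficulty is that sampling fixes x_i before any x_j exists, so whatever plan the model has for the antecedent must live implicitly in its hidden state rather than in the emitted text.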
The most direct light the corpus throws on your question comes from work that sidesteps autoregression entirely. Diffusion language models succeed precisely on the *global* control tasks — syntax, length, structure — that plug-and-play autoregressive methods can't reach, because their continuous latents let gradients flow across the whole sequence at once rather than through a discrete left-to-right bottleneck (Can diffusion models enable control that autoregressive models cannot reach?). Cataphora is exactly that kind of global property: you'd want to constrain the early reference and the late antecedent jointly. So the corpus's implicit verdict is that the cleanest route to more cataphoric text may be to change the paradigm — denoise the whole passage in parallel — rather than to coax a sequential decoder into faking foresight.
If you want to keep an autoregressive backbone, the interesting middle path is giving it a *plan* to decode from. Latent-thought language models condition generation on latent thought vectors, pairing fast, local variational learning of those latents with slow learning of the global decoder, so that latent size becomes a scaling dimension separate from raw parameters (Can latent thought vectors scale language models beyond parameters?). A latent that encodes 'this sentence sets up a referent resolved three sentences down' is the kind of representation that could make forward reference deliberate rather than accidental. Neural-memory architectures point the same way from a different angle: separating short-term attention from a long-term store that prioritizes 'surprising' tokens lets a model carry structural commitments across long spans without quadratic cost (Can neural memory modules scale language models beyond attention limits?).
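To make "a plan to decode from" concrete, here is a deliberately tiny Python sketch. It is not a latent-thought model and not taken from any of the notes: `Plan`, `CANDIDATES`, `make_plan`, and `decode` are hypothetical names, and the "latent" is just a hand-written record. The only point it illustrates is the ordering: the referent is fixed first, and the left-to-right emission reads from that fixed plan, so the early pronoun and the late name cannot disagree.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a global latent: decided before any token."""
    referent: str  # the entity the text will eventually name
    pronoun: str   # the forward-pointing form emitted before the name


# Illustrative candidates only; a real system would learn this space.
CANDIDATES = [Plan("Maria", "she"), Plan("Tom", "he")]


def make_plan(rng: random.Random) -> Plan:
    """Commit to the referent up front (the 'slow, global' step)."""
    return rng.choice(CANDIDATES)


def decode(plan: Plan) -> str:
    """Emit tokens left to right, each one conditioned on the fixed plan."""
    tokens = ["Before", plan.pronoun, "spoke,", plan.referent, "paused."]
    return " ".join(tokens)


if __name__ == "__main__":
    print(decode(make_plan(random.Random(0))))
```

A purely sequential sampler would have to pick the pronoun and hope a matching name follows. Here the agreement is guaranteed by construction, which is the property a learned planning latent would need to approximate.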
The cautionary thread is that you probably can't get there by prompting or light fine-tuning alone. The corpus repeatedly finds that surface-level interventions reorganize what a model already does rather than installing new behavior, and that strong training priors override in-context instructions unless you intervene deeper in the representations (Why do language models ignore information in their context?). Cataphora is statistically rarer and structurally harder than its backward-looking cousin, anaphora, so a next-token objective will under-reward it by default. Training for 'more cataphoric text' therefore likely means changing the *objective or the architecture* so that forward planning is both represented and rewarded, not just asking nicely. The thing worth taking away: the question 'can we train for cataphora?' quietly becomes 'can a model plan before it speaks?', and that is the same frontier where diffusion, latent-thought, and neural-memory approaches are all pushing.
Sources (5 notes)
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Diffusion-LM succeeds on six fine-grained control tasks (spanning syntax, semantics, infilling, and length) where plug-and-play methods fail. Its continuous latent variables allow gradients to flow across the entire sequence simultaneously, replacing the discrete-token bottleneck and enabling parallel denoising.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.