Can event boundaries be identified from statistical regularities without understanding events?
This explores whether a system can carve a stream of experience into discrete events — finding the seams between 'one thing happening' and 'the next' — purely by picking up on statistical patterns, without any grasp of what an event actually is.
The corpus answers yes, and surprisingly well, while quietly complicating what 'finding a boundary' even means. The cleanest evidence is that GPT-3's event boundaries in narrative correlate with averaged human judgments *more strongly than individual human annotators do* (Do language models segment events like human consensus does?). The model has no theory of events; it was trained only to predict the next token. Yet next-token prediction over enough diverse text seems to build in a statistical consensus about where things begin and end. That suggests the 'seams' between events leave a measurable trace in word-level regularities, and that detecting the trace does not require understanding the event.
A deeper formalization suggests why this works at all. Epiplexity aims to measure the structural information a computationally bounded observer can actually extract from data. It separates learnable regularity from time-bounded entropy, the part that stays unpredictable for that observer (What can a bounded observer actually learn from data?). Read against the segmentation result, it reframes the question: event boundaries may simply *be* one of the high-yield regularities in the data, learnable by anything that compresses well. No comprehension is required, only enough capacity to notice that some transitions are more predictable than others.
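The sketch below makes "noticing that some transitions are more predictable than others" concrete with transitional-probability segmentation. This is a classic statistical-learning technique from research on how infants segment speech into words. It is a swapped-in illustration: it is neither what GPT-3 does nor a computation of epiplexity. The toy "events", their tokens, and the function names are hypothetical. The only signal used is how often one token follows another, and the stream is cut wherever that probability dips to a strict local minimum.

```python
from collections import Counter


def transition_probs(stream):
    """Estimate P(next | current) for each adjacent pair in the stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])  # tokens that have a successor
    return [pair_counts[(a, b)] / first_counts[a] for a, b in zip(stream, stream[1:])]


def segment(stream):
    """Cut the stream wherever the transition probability is a strict local minimum."""
    tps = transition_probs(stream)
    cuts = []
    for i, tp in enumerate(tps):
        left = tps[i - 1] if i > 0 else float("inf")
        right = tps[i + 1] if i + 1 < len(tps) else float("inf")
        if tp < left and tp < right:
            cuts.append(i + 1)  # the boundary falls before stream[i + 1]
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(stream[start:cut])
        start = cut
    pieces.append(stream[start:])
    return pieces


# Hypothetical toy "events": fixed action sequences, concatenated in varied order.
A = ["wake", "stretch", "rise"]
B = ["brew", "pour", "sip"]
C = ["open", "read", "close"]
order = [A, B, C, A, C, B, A, B, C, B, A, C]
stream = [tok for event in order for tok in event]

for piece in segment(stream):
    print(piece)
```

Inside each toy event every transition is certain. Transitions out of an event are split across several successors, so the dips land exactly on the seams without the code knowing anything about waking, brewing, or reading. Real text is far noisier than this.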
But here the corpus turns the question on its head. Several notes argue that statistical success at boundary-finding is not the same as having events. The sharpest is the claim that AI produces 'event-residue, not utterances'. The output carries surface markers inherited from training but lacks the event structure that real communication has, so the human reader unilaterally animates the residue into a pseudo-event, supplying the missing structure themselves (Does AI generate genuine utterances or just text patterns?). A model can therefore place boundaries that match human consensus while the 'event' lives entirely on the human side of the interaction. The statistics locate the seam; we import the understanding.
This pattern, statistical competence standing in for genuine grasp, recurs across the collection. Models reproduce human causal-reasoning *errors* (weak explaining-away, Markov violations). This is apparently because they absorbed the statistics of how humans talk, not because their reasoning is categorically inferior (Do large language models make the same causal reasoning mistakes as humans?). Reasoning traces, too, appear to work as computational scaffolding rather than meaningful steps. Deliberately corrupted, semantically irrelevant traces train models about as well as correct ones (Do reasoning traces need to be semantically correct?). Trace *length* tracks problem difficulty only in-distribution and decouples from it out-of-distribution, reflecting recall of training schemas rather than adaptive computation (Does longer reasoning actually mean harder problems?). The throughline: structure that looks like understanding is often recovered statistics.
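For readers unfamiliar with the two error types, a small worked example may help. The sketch below enumerates a three-variable collider network, in which two independent causes A and B share a common effect C. It computes the normative answers that human and model judgments are roughly scored against. The noisy-OR link and all parameter values are hypothetical.

Normatively, two conditions should hold:
- Learning B alone should not change belief in A (the Markov condition, since A and B are independent here).
- Learning that B is present after observing C should lower belief in A (explaining away).

"Weak explaining away" means judged belief drops less than this. A "Markov violation" means judging that B changes A when it should not.

```python
from itertools import product

# Hypothetical parameters for the collider A -> C <- B.
P_A, P_B = 0.3, 0.3              # base rates of the two causes
LEAK, W_A, W_B = 0.05, 0.8, 0.8  # noisy-OR leak and causal strengths
NAMES = ("A", "B", "C")


def p_c_on(a, b):
    """Noisy-OR: C stays off only if the leak and every active cause all fail."""
    p_off = (1 - LEAK) * ((1 - W_A) if a else 1) * ((1 - W_B) if b else 1)
    return 1 - p_off


def joint():
    """Full joint distribution over (A, B, C), built by enumeration."""
    table = {}
    for a, b, c in product((0, 1), repeat=3):
        pa = P_A if a else 1 - P_A
        pb = P_B if b else 1 - P_B
        pc = p_c_on(a, b) if c else 1 - p_c_on(a, b)
        table[(a, b, c)] = pa * pb * pc
    return table


def prob(query, evidence, table):
    """P(query | evidence); both are dicts such as {"A": 1}."""
    def holds(world, cond):
        return all(world[NAMES.index(k)] == v for k, v in cond.items())
    num = sum(p for w, p in table.items() if holds(w, {**evidence, **query}))
    den = sum(p for w, p in table.items() if holds(w, evidence))
    return num / den


t = joint()
print("P(A=1)          ", round(prob({"A": 1}, {}, t), 3))
print("P(A=1 | B=1)    ", round(prob({"A": 1}, {"B": 1}, t), 3))          # Markov: same as P(A=1)
print("P(A=1 | C=1)    ", round(prob({"A": 1}, {"C": 1}, t), 3))
print("P(A=1 | C=1,B=1)", round(prob({"A": 1}, {"C": 1, "B": 1}, t), 3))  # explaining away: lower
```

With these made-up numbers, P(A=1 | C=1) is about 0.57 and falls to about 0.34 once B=1 is also known, while P(A=1 | B=1) stays at the 0.3 base rate.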
The thing you didn't know you wanted to know: the answer isn't a clean yes or no. Yes, boundaries fall out of statistical regularity without comprehension, well enough to track human consensus more closely than individual annotators do. But what that buys you is a *placement of seams*, not a possession of events. The understanding the boundaries seem to imply is quietly contributed by whoever reads the output. The statistics find where to cut; meaning is what the human brings to the cut.
Sources (6 notes)
GPT-3's event boundaries correlate more strongly with averaged human annotations than individual human annotators do. This suggests language models may pre-compute statistical consensus through training on diverse text, or that next-token prediction parallels human event cognition.
Epiplexity formalizes the structural information a computationally bounded observer can extract from data, separating learnable regularity from time-bounded entropy. This task-free measure correlates with out-of-distribution generalization and explains why some datasets enable broader transfer than others.
AI output carries communicative markers inherited from training data but lacks the event structure that produces actual utterances. Users supply the missing orientation through interpretive labor, creating a pseudo-event with structure only on the human side.
LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.