Open reviews

March 7, 2026 — Open review of Chen, Raccah and colleagues:

Chen, P.*, Raccah, O.*, Vo, V., Poeppel, D., & Gureckis, T. Distinct paths to false memory revealed in hundreds of narrative recalls. PsyArXiv. DOI


Chen, Raccah and colleagues tackle a very interesting and challenging question: How can we quantitatively study false memories, at scale, in unconstrained verbal recall of naturalistic narratives? They leverage modern LLMs to identify these errors in hundreds of free recalls across four different story stimuli from an open naturalistic recall dataset (recently released by the authors). Specifically, the authors use an LLM to match recall sentences back to events in the original narrative, then identify whether the recalled events contain factual errors or confabulations. The authors validate these LLM-based judgments against human judgments in two behavioral experiments, showing that LLM–human agreement is comparable to inter-human agreement. Finally, the authors seek to identify which semantic factors may drive false memories. Again using LLMs and embeddings, they find that the semantic centrality of an event predicts its retention in recall, that similarity of an event to other narratives tends to introduce factual errors, and that the surprise of an event tends to trigger confabulations. This manuscript is already quite solid: the writing is clear, the approach is well-motivated, the methods are rigorous and described in excellent detail, and the figures are very well designed. I have one main comment/question about the approach and interpretation; most of my other comments are fairly minor.

Major comments:

I had a difficult time understanding how to interpret the “Similarity to narrative corpus” metric you ultimately find is related to factual errors. If I understand correctly, you compute sentence embeddings for length-matched sequences in the NarrativeXL corpus, compute the cosine similarity between a given event and all of the embeddings from NarrativeXL, then average these cosine similarities (page 15). This is interpreted as the resemblance of a story event to “common world knowledge”, “stereotypical narrative patterns”, “familiar narrative templates” (page 6), “familiar schema” (page 10), “prototypical narrative content”, and “common story patterns” (page 15). Is this really what this metric represents? I can imagine that there are many prototypical narrative structures in NarrativeXL, sure, but I would also imagine that some of them are quite different from one another. If a given story event is highly similar to one subset of narrative templates from NarrativeXL, it could also be highly dissimilar from another subset of narrative templates. This might yield an intermediate average cosine similarity overall. How should we interpret a story event with relatively low average similarity? Is it dissimilar from all prototypes of narrative structure? It could also be highly similar to one type of narrative structure, but highly dissimilar from a much larger subset of narrative templates. How should we interpret a story event with high average similarity? How could a story event be meaningfully similar to all (or most) narrative templates? Wouldn’t this lack of specificity suggest that the story event is somewhat “null” in structure? I’m trying to imagine these story events as vectors in a high-dimensional embedding space; the NarrativeXL dataset will yield a huge cloud of vectors, and presumably there are somewhat meaningful clusters or structures among these narrative vectors.
What is the interpretation of a story event whose vector is highly similar to many of these narrative vectors? My point is that I worry this metric could also be unfavorably interpreted as reflecting something like a “neutral” or “central” or “boring” event :) rather than meaningfully reflecting “prototypical narrative content”. You might be able to convince readers this metric has something to do with prototypical narrative templates by showing some examples of story events with high average cosine similarity to the NarrativeXL embeddings? Alternatively, I wonder if there’s some other related metric you could use rather than averaging across all those cosine similarities; I’m imagining a prototypical narrative template might be a story event that lives in a high-density local region of NarrativeXL embeddings or lives close to a cluster centroid of NarrativeXL embeddings.
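To make this concern concrete, here is a minimal sketch (in Python, with hypothetical function names and a toy two-cluster corpus; this is my reading of the metric, not the authors' code) contrasting the averaged-similarity metric with the kind of local-density alternative I have in mind. An event that sits squarely inside one small cluster of templates, but far from a larger cluster, scores only at an intermediate level on the average, yet scores high on mean similarity to its k nearest neighbors:

```python
import numpy as np

def mean_corpus_similarity(event_vec, corpus_vecs):
    """My reading of the paper's metric: average cosine similarity
    between one event embedding and every corpus embedding."""
    e = event_vec / np.linalg.norm(event_vec)
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return float((C @ e).mean())

def knn_density_similarity(event_vec, corpus_vecs, k=10):
    """Hypothetical alternative: mean cosine similarity to the k nearest
    corpus neighbors, so proximity to one tight cluster of templates is
    enough to score high, and distant clusters cannot wash it out."""
    e = event_vec / np.linalg.norm(event_vec)
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = np.sort(C @ e)[::-1]  # similarities, descending
    return float(sims[:k].mean())

# Toy corpus: a small cluster of one template type and a larger,
# orthogonal cluster of another.
corpus = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 20)
event = np.array([1.0, 0.0])  # squarely inside the small cluster

print(mean_corpus_similarity(event, corpus))       # → 0.2 (intermediate)
print(knn_density_similarity(event, corpus, k=5))  # → 1.0 (clearly "close to a template")
```

The same high-kNN score would also fall out of a distance-to-nearest-centroid version; either way, the point is that averaging over the whole corpus conflates "close to one prototype" with "moderately far from everything".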

In your language about “two dissociable routes” to factual errors and confabulations, you also interpret the results as indicating “distinct cognitive mechanisms” leading to the two types of errors. Do the results really let us infer two “distinct cognitive mechanisms” or “multiple underlying mechanisms”? I understand that you see a clean split in the statistical effects (Fig. 4), but the two types of errors are defined to be distinct (and in fact sometimes co-occur). I’m trying to imagine an alternative to the “two distinct cognitive mechanisms” account in which both types of errors stem from a unified geometric representational format for memory and semantics. Is this a possibility? Do the current results rule it out?

The authors include “serial position (order of the event within the story)” as a fixed effect within their statistical models (which I think makes sense). However, I’m wondering if it might also make sense to include a U-shaped variable (or quadratic term on serial position)? This would allow you to account for primacy and recency effects in the recall sentences.
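For concreteness, a sketch of what I mean (Python, with hypothetical names; not the authors' model code): a centered linear serial-position predictor plus its square, where the squared term is U-shaped over event order and can therefore absorb primacy and recency effects that a linear term alone cannot.

```python
import numpy as np

def serial_position_terms(n_events):
    """Hypothetical predictors for the mixed-effects models: a centered,
    scaled linear serial-position term plus its square. The squared term
    is largest at the first and last events and smallest in the middle,
    so it can capture primacy and recency effects."""
    pos = np.arange(1, n_events + 1, dtype=float)
    linear = (pos - pos.mean()) / pos.std()  # centered/scaled event order
    quadratic = linear ** 2                  # U-shaped over event order
    return linear, quadratic

linear, quadratic = serial_position_terms(21)
# The quadratic term peaks at both ends and bottoms out at the middle event:
assert quadratic[0] == quadratic[-1]
assert quadratic[10] == quadratic.min()
```

In an lme4-style formula this would amount to adding something like `I(position^2)` alongside the linear position term.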

Can you say a little bit more about how we should interpret the observation that inter-rater reliabilities for identifying some of these false memories were fairly low (e.g., ~60–70% agreement)? Does this observation just reflect that the task is fairly hard? Or that it is fairly subjective? What are the implications? (E.g., is this kind of subjectivity something we simply have to accommodate in complex, naturalistic tasks?)

Minor comments:

Can you say a little bit more about how you deal with recall sentences where both types of false memories co-occur? I think you mentioned this at some point but now I can’t find it.

Page 3: “An detailed” > “A detailed”

Page 6: “semantically coherent word list” > “lists”?

Page 8: “results using USE and GPT2-Small”—do you mean MPNet here?

Page 12: “generated and verified the narrative transcripts”—word missing here?

Page 12: “To stay below context window limits”—is there a precise token limit on the context window for GPT-4o? (OpenAI’s documentation lists a 128k-token context window for GPT-4o; it may be worth stating the limit you assumed.)

Page 12: “Recall transcripts were segmented into individual sentences.” How exactly were sentence boundaries defined/applied? Deciding where sentence boundaries are in continuous speech isn’t entirely trivial, right?
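As a toy illustration of why this matters (the regexes below are illustrative only, not the paper's method): a naive split on periods fragments abbreviations, and even a slightly smarter rule changes the sentence count, which would in turn change the unit of analysis.

```python
import re

transcript = "Then Mr. Jones left. He didn't come back."

# Naive split on every period: breaks "Mr." into its own fragment.
naive = [s for s in re.split(r"\.\s*", transcript) if s]
# → ['Then Mr', 'Jones left', "He didn't come back"]

# Slightly smarter (still toy) rule: split only after a period preceded by
# three word characters and followed by whitespace, which spares "Mr."
smarter = [s for s in re.split(r"(?<=\w{3})\.\s+", transcript) if s]
# → ['Then Mr. Jones left', "He didn't come back."]
```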

This is not directly related to the current results, but I’m curious: Have you tried seeing how often an AI agent creates false memories or confabulations for these stimuli? I suppose this would be tricky to assess and depend heavily on prompting, but it’d be neat to see whether AI “recalls” of the same stimuli are more accurate or perhaps yield similar proportions of false memory types.