Open reviews

August 4, 2024 — Open review of Raccah and colleagues:

Raccah, O., Chen, P., Gureckis, T. M., Poeppel, D., & Vo, V. (2024). The Naturalistic Free Recall Dataset: four stories, hundreds of participants, and high-fidelity transcriptions. PsyArXiv. DOI


Raccah and colleagues introduce the Naturalistic Free Recall Dataset (NFRD), which comprises verbal recall data (and time-stamped transcripts) from 200+ subjects across four spoken story stimuli. The authors thoroughly describe the dataset, then perform several validation analyses using cutting-edge methods for modeling the content of the story stimuli and spoken recall data. They use these methods to replicate some well-known memory effects observed in less-naturalistic list-learning paradigms. This is an excellent manuscript and already in better shape than the vast majority of papers I review. I provide a few relatively minor comments and clarifying questions below, but I think this is pretty much publishable as-is—great work!

Comments:

My first two comments are really just intended to highlight the novelty (and challenge) of this kind of paradigm and the associated analyses. There’s been so much work on naturalistic stimuli, but verbal recall is still undervalued as a naturalistic behavior in its own right. We rarely recall arbitrary lists to one another, but we very often tell each other about a movie we saw, a paper we read, and so on—it’s much closer to the kind of narrative communication and “storytelling” that’s a core part of our everyday lives (e.g., Bruner, 1986). This is a very naturalistic way to probe what content from a real-world narrative really “sticks” in the mind of a listener (and what doesn’t).

The other side of this coin is that I think you could emphasize a bit more how challenging it is to extend the tools and mindset of cognitive psychology to this kind of behavior. I’m not sure these challenges even register for an experimentalist accustomed to studying list learning. In experiments using stimuli like word lists, the elements are already discrete, and the recollection of a particular word is trivially matched to that word occurring in the list. When you try to generalize this framework to real-world narrative memory, the discrete elements become much less obvious, and matching a recalled idea back to the original idea is much more challenging—because the meaning is what’s important, not the precise words used. So here, we use tools like event segmentation to identify (semi-)discrete events, and we use topic modeling and word embeddings to numerically quantify “meaning” so that we can measure the match between stimulus and recall. We happen to be living through a revolution in natural language processing, with all sorts of new ways to quantify the structure and meaning of natural language, which makes datasets like this all the more exciting. Anyway, I would just make sure not to undersell the challenge and the innovation needed to elevate our science to the kind of thing that people actually care about, with context and meaning and an engaging narrative arc.

I have a couple of minor methodological questions about the ground-truth event segmentation. First, I’m trying to understand the motivation and implications of the 5-second margin surrounding each button press. I think I understand the motivation: people may simply perceive event boundaries at slightly different times (in a way that is cognitively uninteresting), and we want to tolerate that noise. But can’t this effectively smooth out the consensus peaks into “plateaus” with multiple time steps at similarly high consensus? Do you simply choose the highest point as the boundary itself? Second, a related question: what do you mean by “statistically significant participant agreement” at line 136? Significantly greater than zero consensus? Is there some correction for multiple tests across time points? I understand this is largely derived from the paper by Michelmann and colleagues (2021), but a little more detail here would be helpful.
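To make the “plateau” worry concrete, here’s a toy simulation (the sampling rate, window width, and all numbers are invented for illustration and are not drawn from your methods):

```python
# Toy illustration of the "plateau" concern (all values are made up).
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_timepoints = 30, 600  # e.g., 1 Hz resolution, 10-minute story

# each participant presses near a "true" boundary at t = 300, give or take a few seconds
presses = np.zeros((n_subjects, n_timepoints))
press_times = 300 + rng.integers(-3, 4, size=n_subjects)
presses[np.arange(n_subjects), press_times] = 1

# tolerate timing noise by spreading each press over a +/- 2 s window
window = np.ones(5)
consensus = np.array([np.convolve(p, window, mode='same') for p in presses]).sum(axis=0)

print(consensus[294:307])  # several adjacent time points reach similarly high consensus
```

In a case like this, the time point with the single highest consensus can be nearly tied with its neighbors, so it would help to spell out how the final boundary is chosen.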

At line 135, you describe a permutation test to determine whether an event is highly recalled across participants. I don’t fully understand how this permutation is performed. You say you “permuted across participants (i.e. within each column)”, but won’t this result in the same consensus for each event? You then “averaged across events within each story”. What is the permuted test statistic that populates your null distribution? Anyway, I think what you’re doing probably makes sense; I’m just having a hard time reconstructing it from the text.
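For what it’s worth, here is the version of the permutation that would make sense to me, as a rough sketch with fabricated data (the array shapes, names, and shuffling scheme are my assumptions, not necessarily what you did):

```python
# Sketch of one plausible permutation scheme for per-event recall (toy data).
import numpy as np

rng = np.random.default_rng(0)
n_participants, n_events = 50, 20
recall = rng.random((n_participants, n_events)) < 0.3  # who recalled which event

observed = recall.mean(axis=0)  # per-event recall proportion ("consensus")

n_perm = 1000
null = np.empty((n_perm, n_events))
for i in range(n_perm):
    # shuffle each participant's recalls across events (within rows), which
    # preserves how many events each participant recalled but breaks which ones
    shuffled = np.stack([rng.permutation(row) for row in recall])
    null[i] = shuffled.mean(axis=0)

p_values = (null >= observed).mean(axis=0)  # one-sided p-value per event
```

If something like this is what you did, a sentence clarifying what gets shuffled (rows vs. columns) and what statistic enters the null distribution would resolve my confusion.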

I did some spot checks in the data repository at OSF and everything looks complete and reasonably organized. One thought occurs to me: for non-linguists (like me), the TextGrid file structure seems very esoteric; as a neuroimaging/BIDS person, for example, I would feel more comfortable with a CSV containing, e.g., words, onsets, and durations. It may be worth either including a little further explanation of the TextGrid structure, or pointing to a script or code snippet that serves as an example of how to process/extract words from a TextGrid.
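As an example of the kind of snippet I have in mind (a minimal sketch assuming the third-party `textgrid` Python package and a word-level interval tier; the file and tier names here are hypothetical placeholders, not the actual names in the dataset):

```python
# Minimal sketch: convert a word-level TextGrid tier to a BIDS-style CSV.
# Assumes the third-party `textgrid` package (pip install textgrid).
import csv
import textgrid

tg = textgrid.TextGrid.fromFile('recall_transcript.TextGrid')

with open('recall_transcript.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'onset', 'duration'])
    for tier in tg.tiers:
        if tier.name.lower() != 'words':  # keep only the word-level interval tier
            continue
        for interval in tier:
            if interval.mark.strip():  # skip empty/silence intervals
                writer.writerow([interval.mark,
                                 interval.minTime,
                                 interval.maxTime - interval.minTime])
```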

I glanced through the GitHub repo: the README provides a useful outline, and the code seems well-organized and sufficiently commented. It might be worth adding a brief note on which software versions you used (or an environment.yml file) and a suggestion for how best to install the required packages (via the requirements.txt? conda? a particular Python version? maybe it’s straightforward, but sometimes it’s not).

At line 36, you mention “phenomena that may occur less frequently or reliably under naturalistic conditions”—the review by Hamilton and Huth (2020) articulates this concern well and may be worth citing here.

If I understand correctly, most of the analyses in the Technical Validation focus on group-level effects of the flavor “what events in this stimulus are most memorable?” Can the authors speculate on how these tools, or other datasets collected in a similar fashion, might be used to study individual differences in memory?

Line 173: “for each participants” > “for each participant”

References:

Bruner, J. S. (1986). Actual Minds, Possible Worlds. Harvard University Press.

Hamilton, L. S., & Huth, A. G. (2020). The revolution will not be controlled: natural stimuli in speech neuroscience. Language, Cognition and Neuroscience, 35(5), 573–582. DOI

Michelmann, S., Price, A. R., Aubrey, B., Strauss, C. K., Doyle, W. K., Friedman, D., Dugan, P. C., Devinsky, O., Devore, S., Flinker, A., Hasson, U., & Norman, K. A. (2021). Moment-by-moment tracking of naturalistic learning and its underlying hippocampo-cortical interactions. Nature Communications, 12, 5394. DOI