Open reviews

April 14, 2026 — Open review of Tepfer & Thornton:

Tepfer, L., & Thornton, M. (2026). Access to others’ internal monologues aligns neural processing of narratives and trait impressions. PsyArXiv. DOI


Tepfer and Thornton develop a naturalistic paradigm where they either omit or retain internal monologues (describing the characters’ internal mental states) in a film, then use intersubject correlation (ISC) to measure how knowledge of these mental states impacts processing later in the film. They show that the presence or absence of internal monologues reshapes later processing in temporal areas, and that these changes track with a separate behavioral dataset of event-by-event trait ratings in temporal cortex and ventrolateral prefrontal cortex. This is a very clever design for probing mental state inference in a more naturalistic context and I generally like the manuscript. My “major” comments below are mostly clarifying questions and some suggestions about the framing and analysis.

Major comments:

This sentence from the Abstract, “However, such studies have faced a tradeoff between internal and external validity, which has made it difficult to differentiate the inference of mental states from the process of using mental state knowledge,” was a bit hard for me to unpack (as a “not a real psychologist” reader), especially for an abstract. I think I understand what you’re going for after reading the Introduction and thinking about it for a bit, but keep in mind that this might not be obvious to plenty of readers of this journal. For example, is it really the tradeoff between internal and external validity that makes it difficult to differentiate between inferring and using inferred mental states, as your sentence in the abstract implies? Or is it just that the tradeoff (internal vs external) and the difference (inferring vs using) are two inter-related challenges? Another thought on this topic: Are we certain that inferring mental states of others and using those inferences are cleanly separable processes? Not much use making the inference if we’re not going to use it. Or, put differently, if the act of constructing the inference impacts our downstream thoughts or behaviors in any way, isn’t the inference being “used” immediately and almost by definition? The authors unpack what they mean at greater length in the Introduction, but even after the second paragraph of the Discussion, I was still feeling a bit unsure about the distinction they’re trying to make here. My main point here is that I think the authors can introduce this motivation in a more accessible way. My latter point about the separability (“in principle”) of these processes (inferring vs using) remains a concern, but I don’t want to belabor the point.

I’m not familiar with either of the films/episodes used in this experiment. Can the authors give us some general idea why they chose these stimuli? Are they particularly rich in internal monologue? Does the internal monologue have a particularly strong impact on the interpretation of these stimuli? Maybe a brief description of the themes/content of each film would also help.

In the Methods (page 8), you say: “After scanning, participants rated their impressions of the main characters in each video on the traits bossy, easygoing, nosy, rebellious, conscientious, and humble.” What’s the motivation for using these descriptors? Is there some precedent in the literature for using these terms? Are they related to these particular stimuli?

On page 9: “The neural timeseries were then divided into 25 segments—12 segments that corresponded to the IM moments, and 13 segments that corresponded to the NIM moments.” Seems like some of the segments might be quite short. Should we be worried about extremely short segments (IM or NIM) for, e.g., the downstream ISC analyses?
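To make the concern about short segments concrete, here is a minimal simulation sketch. It assumes leave-one-out ISC (each participant against the mean of the others) and white-noise time series with no autocorrelation; I don't know which ISC variant the authors use, and the function names and noise level here are hypothetical. Real fMRI data are autocorrelated, which would make short segments even noisier than this suggests.

```python
import numpy as np


def leave_one_out_isc(segment):
    """Leave-one-out ISC for one parcel and one segment.

    segment: array of shape (n_subjects, n_trs).
    Returns one Pearson r per subject (that subject vs. the mean of the others).
    """
    segment = np.asarray(segment, dtype=float)
    n_subjects = segment.shape[0]
    total = segment.sum(axis=0)
    iscs = np.empty(n_subjects)
    for s in range(n_subjects):
        others_mean = (total - segment[s]) / (n_subjects - 1)
        iscs[s] = np.corrcoef(segment[s], others_mean)[0, 1]
    return iscs


def simulate_isc_spread(n_trs, n_subjects=20, n_sims=200, seed=0):
    """Standard deviation of the group-mean ISC across simulated experiments.

    Each simulated subject is a shared signal plus independent noise
    (illustrative noise level, not taken from the manuscript).
    """
    rng = np.random.default_rng(seed)
    means = np.empty(n_sims)
    for i in range(n_sims):
        shared = rng.standard_normal(n_trs)
        data = shared + 2.0 * rng.standard_normal((n_subjects, n_trs))
        means[i] = leave_one_out_isc(data).mean()
    return means.std()
```

Comparing `simulate_isc_spread(10)` against `simulate_isc_spread(100)` gives a rough sense of how much less stable the ISC estimate becomes for a segment of only a few TRs.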

Also on page 9: “To ensure that no information from the IM clips was contaminating the ISC in the subsequent NIM clips (e.g., due to hemodynamic effects) we regressed the IM ISC values out of the ISC values from the subsequent NIM clip.” This is a nice trick and I get the reasoning behind it. Can you add a few more words describing how exactly you do this? I think you extract the ISC values for each participant at a given parcel, then run this regression across the N participant-wise ISC values? Then you submit the residuals of this regression to subsequent analysis?
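For reference, this is a minimal sketch of my reading of that step, assuming one ISC value per participant per clip at a given parcel and an ordinary least-squares fit with an intercept. The function name is hypothetical and the authors' actual procedure may differ (for example, they may Fisher z-transform the ISC values first).

```python
import numpy as np


def residualize_nim_isc(im_isc, nim_isc):
    """Regress IM ISC out of the subsequent NIM clip's ISC across participants.

    im_isc, nim_isc: arrays of shape (n_participants,) for a single parcel.
    Returns the residuals of the model nim_isc ~ intercept + im_isc.
    """
    im_isc = np.asarray(im_isc, dtype=float)
    nim_isc = np.asarray(nim_isc, dtype=float)
    # Design matrix: a column of ones (intercept) and the IM ISC values.
    design = np.column_stack([np.ones_like(im_isc), im_isc])
    beta, *_ = np.linalg.lstsq(design, nim_isc, rcond=None)
    return nim_isc - design @ beta
```

Under this reading, the residuals are by construction uncorrelated with the IM ISC values across participants, and they would be what gets passed to the later analyses.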

Despite the previous comment, I still worry that low-level auditory differences between the IM being present versus absent for the two different groups could drive some of the ISC results we see in Figure 3, particularly in early auditory areas and STG. For the group where IM is absent, there will be an absence of speech that could be followed by a (potentially abrupt) onset of speech when the IM segment ends and the subsequent NIM segment begins. For the other group, there won’t be this off/on transient if the IM is present and leads seamlessly into the NIM segment. This off/on transient will tend to create a big bump of brain activity that could last for 10+ seconds into the subsequent segment and will drive up ISC in a boring sensory way that is not closely related to the narrative content of the segments. I’m not sure how best to deal with this, but two ideas come to mind: (1) try trimming off the first 10 seconds or so of each NIM segment before computing ISC in hopes of discarding the off/on transient; (2) set up some kind of boxcar regressors that mark the IM/NIM segments, convolve them with an HRF to account for the bleedover from IM into NIM, regress those out of the time series, then compute ISC on the residuals. My hope is that this kind of approach would somewhat clean up the results and give you a brain map that is more specific to the actual content of the narrative.
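Both ideas can be sketched roughly as follows. This is only an illustration under my own assumptions: a generic double-gamma HRF with textbook-style parameters, a single boxcar marking the IM segments (an NIM boxcar would be its complement and so collinear with the intercept), and onsets and durations given in seconds. All function names are hypothetical.

```python
import math

import numpy as np


def double_gamma_hrf(tr, duration=32.0):
    """Generic double-gamma HRF sampled every `tr` seconds (illustrative parameters)."""
    t = np.arange(0.0, duration, tr)
    peak = t**5 * np.exp(-t) / math.gamma(6)           # gamma density, shape 6
    undershoot = t**15 * np.exp(-t) / math.gamma(16)   # gamma density, shape 16
    hrf = peak - undershoot / 6.0
    return hrf / hrf.sum()


def build_im_regressor(n_trs, im_onsets, im_durations, tr):
    """Boxcar that is 1 during IM segments, convolved with the HRF."""
    boxcar = np.zeros(n_trs)
    for onset, dur in zip(im_onsets, im_durations):
        start = int(round(onset / tr))
        stop = int(round((onset + dur) / tr))
        boxcar[start:stop] = 1.0
    # Truncate the convolution back to the length of the scan.
    return np.convolve(boxcar, double_gamma_hrf(tr))[:n_trs]


def regress_out_im_boxcar(timeseries, im_onsets, im_durations, tr):
    """Idea (2): remove the HRF-convolved IM boxcar from one parcel time series."""
    timeseries = np.asarray(timeseries, dtype=float)
    n_trs = timeseries.shape[0]
    regressor = build_im_regressor(n_trs, im_onsets, im_durations, tr)
    design = np.column_stack([np.ones(n_trs), regressor])
    beta, *_ = np.linalg.lstsq(design, timeseries, rcond=None)
    return timeseries - design @ beta


def trim_segment_start(segment, tr, trim_seconds=10.0):
    """Idea (1): drop the first `trim_seconds` of a segment along the time axis."""
    n_drop = int(math.ceil(trim_seconds / tr))
    return np.asarray(segment)[..., n_drop:]
```

ISC would then be computed on the residual or trimmed time series. Note that trimming makes the short-segment issue worse, so the regression approach may be the safer of the two.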

In the behavioral results (page 12), the authors report that “Ratings at the final timepoints were not significantly different across versions in scanner participants.” Was this something the authors had hypothesized would differ? I can imagine a kind of stimulus or narrative where missing pieces of the IM heavily impact trait judgments at the end, but I could also imagine a situation where it doesn’t really make a difference, or where the differences might fluctuate at intermediate points in the narrative and even out by the end. Another way of putting this: Is it just due to a lack of power that we see a null result here? Or should we not have expected a difference in the first place? The explanation at the end of that paragraph where NIM segments seem to pull people back together helps a lot.

The pointers with brain area labels in Figures 3 and 4 look a bit off to me, particularly the A1 labels. For example, in Figure 3, the A1 labels seem to point to two different cortical locations in the left and right hemispheres. The right hemisphere pointer looks like it’s pointing at the frontal operculum (superior to the Sylvian fissure) rather than transverse temporal cortex / Heschl’s gyrus. These regions can get smoothed together when projecting volumetric data onto a cortical surface, so part of the issue here may just be surface resampling and inflation. But still: I worry that someone looking closely at these labels will say “hey, that’s not A1.”

Minor comments:

The design of this experiment reminds me of the kind of designs developed by Yaara Yeshurun; e.g., where you induce a context difference across two groups, then measure ISC for the same stimulus (Yeshurun et al., 2017a), or where you introduce small changes to the stimulus and see how these ramify using ISC (Yeshurun et al., 2017b). Could be worth citing these in terms of framing/motivation or discussion.

Line 35: “correlated” → “correlated with”?

Line 90: “By providing direct knowledge of characters’ thoughts, IM narration allows us to separate the process of mental state inference from the process of using knowledge of mental states to inform other social cognitive processes.” Isn’t part of the idea here that this is also happening in a passive viewing task, where participants aren’t required to overtly “use” the inferences as part of the task? If I’m understanding correctly, then maybe this should be stated explicitly.

Figure 1: There’s no explanation of “SEND” prior to its appearance in Figure 1; the explanation only comes later in the Methods. Maybe just mention it in the figure caption?

Line 150: The “NIM” acronym is not defined yet at this point.

Line 160: For “followed by further confound regression (e.g., head motion, motion outliers),” please list the actual confound regressors/parameters used.

Line 196: “analogous to using a cross-validated distance metric (50) in a more traditional condition-rich RSA design”; I think this approach is more closely analogous to using “split-data RDMs” as in Henriksson et al. (2015).
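By a split-data RDM I mean something like the following minimal sketch, where dissimilarities are computed between patterns from two independent halves of the data rather than within one dataset. It assumes correlation distance and two splits of matching shape; how the splits map onto the authors' design (for example, across participants or groups) is my guess, and the function name is hypothetical.

```python
import numpy as np


def split_data_rdm(patterns_a, patterns_b):
    """Correlation-distance RDM computed across two independent data splits.

    patterns_a, patterns_b: arrays of shape (n_conditions, n_features).
    Entry [i, j] is 1 - Pearson r between condition i in split A and
    condition j in split B, so the diagonal is itself informative.
    """
    a = np.asarray(patterns_a, dtype=float)
    b = np.asarray(patterns_b, dtype=float)
    # Center and unit-normalize each pattern so the dot product is Pearson r.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T
```

Unlike a conventional RDM, this matrix is not necessarily symmetric and its diagonal is not fixed at zero, because noise is independent across the two splits.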

Figure 3: Are there any significant results on the medial surface here? If so, can you show us? If not, maybe mention that at the end of the figure caption.

Figure 3: “Partietal” → “Parietal”

Line 271: “Average patterns of brain activity for NIM and IM clips”—averaged across what?

Line 273: “characters traits” → “characters’ traits”

Line 281: I don’t think you unpack the abbreviation “LOC”; please check that all abbreviations are introduced properly.

Line 336: “hint a” → “hint at a”

Line 384: “this work how” → “this work shows how”?

Discussion: There’s lots of work suggesting that these superior and lateral temporal areas are critical for language processing—including rich, contextualized meaning in natural language. Given that IM is essentially a language stimulus, the downstream social processing is presumably riding on language as an entry point. But the word “language” is not mentioned a single time in this manuscript! :) Could be worth situating the present results also in terms of the social/communicative function of language that’s at the core of this design as well as the cortical language network.

The precuneus result in Figure 4 is neat! This reminds me of work by Zadbood and colleagues (2022), where we found that precuneus (among other default regions) tracks with updates to prior narrative events in a movie. (No pressure to cite this, just wanted to mention it.)

References

Henriksson, L., Khaligh-Razavi, S. M., Kay, K., & Kriegeskorte, N. (2015). Visual representations are dominated by intrinsic fluctuations correlated between areas. NeuroImage, 114, 275–286. DOI

Yeshurun, Y., Swanson, S., Simony, E., Chen, J., Lazaridi, C., Honey, C. J., & Hasson, U. (2017a). Same story, different story: the neural representation of interpretive frameworks. Psychological Science, 28(3), 307–319. DOI

Yeshurun, Y., Nguyen, M., & Hasson, U. (2017b). Amplification of local changes along the timescale processing hierarchy. Proceedings of the National Academy of Sciences of the United States of America, 114(35), 9475–9480. DOI

Zadbood, A., Nastase, S., Chen, J., Norman, K. A., & Hasson, U. (2022). Neural representations of naturalistic events are updated as our understanding of the past changes. eLife, 11, e79045. DOI