Open reviews

October 24, 2025 — Open review of Leipold and colleagues:

Leipold, S., Ravi Rao, R., Schoffelen, J. M., Bögels, S., & Toni, I. (2024). Between-movie variability severely limits generalizability of “naturalistic” neuroimaging. bioRxiv. DOI


Leipold and colleagues analyze a fairly large-N fMRI dataset where subjects watch eight ~4-minute animated movies. They show that intersubject correlation (ISC) values vary modestly from movie to movie. These movie fMRI data are accompanied by a behavioral dataset where subjects provide semantic ratings/descriptions of abstract objects. The authors use an intersubject representational similarity analysis (IS-RSA) to compare these behavioral data to the brain data, and find that different movie stimuli, after statistical thresholding, yield somewhat scattered/different maps. This could be a useful paper; the statistical modeling seems solid and the question of variability in ISC across stimuli is interesting. That said, the tone comes across as a bit inflammatory (e.g., the scare quotes around “naturalistic”). The authors frame their findings as some kind of “gotcha”, when in fact I suspect many researchers in naturalistic neuroimaging are already well aware that ISCs may vary somewhat across stimuli and try to address that concern in good faith. The behavioral component of the present dataset is very specific and a bit odd, and the overall results don’t make a very convincing case. Critically, I worry that even if the present results were stronger, they would not provide much support for the ideological criticisms the authors want to level at the field of naturalistic neuroimaging. It seems like the paper is written in hopes of setting off the “crisis” alarm bells, but the actual data and arguments fall short. Below, I try to outline some points where I believe the arguments can be improved.

Major comments:

The core issue with this manuscript, in my reading, is that the empirical results of this study do not provide strong support for the broad-strokes criticisms the authors want to articulate. The authors show that ISCs and IS-RSA correlations are (modestly!) different from one movie to another, in their sample of movies, with their particular behavioral task; that is, the ISC values don’t “generalize” across stimuli as well as the authors would like. I think we can correctly infer from this result that certain ISC-based findings (perhaps IS-RSA-based findings more egregiously) do not generalize perfectly well across naturalistic stimuli. Great! This is a useful point that can be made constructively. I think most people in the naturalistic neuroimaging community are already aware of this, however, and this concern is equally valid for any experimental paradigm. The issue is that they then attempt to use this result to prop up a screed against naturalistic neuroscience more broadly. To be completely clear: the observation that ISC values differ slightly from one naturalistic stimulus to another does not in any way undermine the value of naturalistic neuroimaging more broadly. Their results (1) do not inform the question of whether “Movies are more akin to tasks than to rest” (page 24 heading); (2) do not have any bearing on whether “movies are not a naturalistic depiction of reality” (of course they’re not, but they’re better than some of the alternatives! page 25 heading); (3) do not give us any reason to “[rethink] the use of ‘naturalistic’ in movie fMRI research” (page 26 heading). These seem to be ideological commitments of the authors’, but this is not a perspective piece. I’ll try to unpack some of these issues in the following comments.

Showing that ISC values differ somewhat across different naturalistic stimuli does not in any way undermine arguments that naturalistic paradigms have greater ecological validity or generalizability than more controlled experimental stimuli. I would consider removing the Discussion section titled “Rethinking the use of ‘naturalistic’ in movie fMRI research”. No one advocating for the use of more naturalistic stimuli would argue that any particular naturalistic stimulus or paradigm is “perfectly” ecological or generalizable—whatever that means. That is a straw man. We are advocating for researchers to strive for greater ecological validity or more naturalistic stimuli in a relative sense. The authors express concern that the term “naturalistic” is used differently in different subfields, but this is normal. Some scientists use static images of natural scenes as a clear improvement in ecological validity over Gabor patches (e.g., Olshausen & Field, 1996); other scientists use dynamic videos as an improvement over static images (e.g., Haxby et al., 2020). The fact that “naturalistic” is used in this relative way does not, in my opinion, undermine its utility as a term or as a research program.

A deeper issue here is that ecological validity or generalizability, at least in my reading, is fundamentally about how well a scientific claim, or ideally a model, will generalize beyond a given experimental paradigm to contexts that better reflect everyday life. The audiovisual features of a movie are intercorrelated in a way that more closely resembles natural perception (relative to rudimentary auditory and visual experimental stimuli; e.g., oriented gratings); the relative strength or frequency of these features better matches their distribution in natural perception (relative to, e.g., matched features across stimuli, balanced stimulus frequencies); these audiovisual features are embedded in an evolving temporal structure that more closely resembles our everyday experience (relative to, e.g., randomized trial orders, static image stimuli); in many cases, all of these features are embedded in higher-level narrative or social structures that resemble the ways we interact with one another (relative to, e.g., isolated words/sentences, disembodied faces, etc). To put it succinctly, all of these features occur in a broader context that is more representative of the way humans typically interact with the world. A foundational paper in naturalistic neuroscience (David et al., 2004), for example, shows that the receptive field properties of V1 neurons shift in response to naturalistic stimuli versus gratings; critically, models trained on natural stimuli predict different natural stimuli dramatically better than models trained on gratings. Haxby et al. (2011) show that models trained on naturalistic stimuli generalize both to different naturalistic stimuli and to experimental stimuli, whereas models trained on experimental stimuli do not generalize well to naturalistic stimuli and only generalize to experiments of the same kind. The authors’ arguments do not really make contact with these ideas at the core of ecological generalizability.

The first core result of the paper is the observation that ISC values/maps are different across different naturalistic stimuli (Figure 3). The abstract states that “ISC varied substantially across movies”. The authors find a statistically significant main effect of movie on average ISC across the whole brain—but the ISC values all range between ~.17 and ~.23? That seems pretty similar! Maybe I’m biased, but I don’t think many people working in the field of naturalistic neuroimaging would be surprised by this result. I mean it’s right there in Figure 2 of Hasson et al., 2010. Different movies yield different magnitudes of ISC. Can the authors report the correlations between the unthresholded ISC maps across all pairs of movies? I understand that correlation will discard this overall difference in ISC, but I suspect the cortical/spatial distribution of ISCs will be fairly similar across movies.
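
For concreteness, here's the kind of comparison I have in mind: a minimal sketch, assuming the unthresholded ISC maps are stored as a parcels-by-movies array (variable names are illustrative).

```python
import numpy as np

def pairwise_map_similarity(isc_maps):
    """Spatial similarity of unthresholded ISC maps across movies.

    isc_maps: array of shape (n_parcels, n_movies), e.g., mean
    leave-one-out ISC per parcel for each movie.
    Returns the movie-by-movie correlation matrix of the spatial maps.
    """
    return np.corrcoef(isc_maps.T)
```

Reporting the upper triangle of this matrix (or its mean) would give readers a direct sense of how similar the spatial distribution of ISC is across movies, independent of the overall shift in magnitude.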

This modest variability in ISC across stimuli is presented in a very negative light, with the implication that researchers using ISC and naturalistic stimuli may have never thought of this. This feels a bit out of touch, given that researchers in naturalistic neuroimaging have been articulating this concern in good faith for years; e.g., “The major concern with collapsing across multiple movies is that each movie varies in innumerable ways, confounding the interpretation of results.” (Vanderwal et al., 2019); “Of course, these paradigms require researchers to choose a specific stimulus or set of stimuli; this choice should be informed by the research questions, planned analyses, and populations under study” (Finn, 2021); “Here, we aim to fill that gap by providing a practical guide for human neuroscientists outlining what to consider from a media perspective when selecting a preexisting media stimulus for a given experimental goal” (Grall & Finn, 2022). In many cases, a particular kind of naturalistic stimulus is chosen for a theoretically motivated reason; for example, Finn et al., 2018, state: “The narrative described a main character faced with a complex social scenario that was deliberately ambiguous with respect to the intentions of certain characters; it was designed such that different individuals would interpret the events as more nefarious and others as less so.” This choice also implies that the authors understand that their results may not fully generalize to different kinds of movies—and I suspect many papers admit this limitation in their Discussion sections. Sure, plenty of studies are limited to one or a handful of naturalistic stimuli, but this is typically a practical limitation; I’m sure most authors also wish they had more data and more diverse stimuli (but data are expensive). This limitation of generalizability is obviously true of more controlled paradigms as well. In some ways, the sensitivity of ISC to different qualities of different stimuli could also be viewed as a strength. For example, in the context of functional connectivity, Simony and colleagues (2016) show that within-subject functional connectivity matrices are highly insensitive to different stimuli (e.g., intact and shuffled narratives), whereas intersubject functional connectivity matrices are more sensitive to the differences between stimuli (good!).

The authors spend a chunk of the text arguing that “Movies are not a naturalistic depiction of reality”. Again, I think this is an argument against a straw man. I don’t really think anybody working in naturalistic neuroimaging sees cultural artifacts like movies or stories as somehow veridical slices of human experience. Obviously, a Hollywood movie is shot in a particular way to capture our attention, manipulate our emotions, convey a particular story. Granted, “natural scenes” like an unedited, fixed-viewpoint video of Washington Square Park (Hasson et al., 2010), despite how they may have been described in previous publications, are not veridical depictions of our “reality” either; I don’t often find myself just standing around on campus passively watching random people walk by for an extended period of time. These critiques have already been made more constructively by Grall & Finn, 2022, and I suspect that many working in the field of naturalistic neuroimaging are already aware of and appreciate these concerns. There’s a nuance here: part of the utility of naturalistic stimuli like movies is that they are deliberately crafted to elicit a wide range of brain states very effectively, to elicit more genuine emotions (relative to, e.g., static images of fearful faces), to engage processes of social cognition, etc. This is presumably part of the reason why we actually seek out and enjoy naturalistic stimuli. I don’t listen to pure tones with the occasional oddball on my drive to work, I listen to music or a podcast. I spend all day at work reading papers, writing grants, talking to my colleagues, etc. When I come home from work, I use a narrative to tell my partner about my day; I use language to express meaning, affect, social dynamics and events. I don’t make popcorn and sit down on the couch with my partner to watch two hours of random dot stereograms; we watch a movie or a TV show. I don’t read lists of random words before I go to bed, I read a book. We are surrounded by “cultural artifacts” in our daily lives and I suspect they capture so much of our time and energy because they key into much richer neural processes for perception, motivation, language, social cognition, etc., than do typical experimental stimuli.

I found the dataset used for this study to be frankly kind of odd. The authors use an open dataset of adults watching eight different ~4-minute animated movies with no language. I assume the images in Figure 1 are AI-generated stand-in images rather than actual frames from the movies, but it’s hard to judge how realistic or engaging these movie stimuli are. Do these relatively short movies have narrative structure? Are they geared toward children or adults? Next, the subjects judged the semantic similarity of 16 meaningless cup-shaped “greeble”-like objects by either naming them or rating them along features like pointiness, symmetry, etc. The authors mention that the clips were constructed to include objects frequently named in the behavioral task (e.g., humans, plants, tools, toys, food), but otherwise the behavioral task and movie task seem very weakly related. Should we expect strong individual differences in this behavioral task? Do the individual differences we do observe in this task really say anything more general about conceptual-semantic representation for any given individual? Should we really expect any individual differences in this behavioral task to be related to individual differences in the perception of these short movies? Just based on the design alone, I would expect the IS-RSA effects to be fairly weak and/or unstable. Unfortunately, this appears to be borne out by the results: in Figure 4, although we can’t see the actual effect sizes (e.g., correlations or coefficients between neural and behavioral RDMs), the thresholded maps seem fairly weak and scattered.

Following on the previous comment, a large part of the authors’ argument hinges on the IS-RSA maps in Figure 4 looking fairly inconsistent across movies. In addition to the nature of the tasks (previous comment), I wonder how much of this apparent inconsistency stems from analytical choices. Here’s a silly example: the authors use a 6-mm FWHM smoothing kernel, but surely a larger kernel would superficially “strengthen” the similarity of these maps. (I’m not suggesting the authors pursue this; a 6-mm kernel is totally reasonable.) Here’s a more serious example: How much of the dissimilarity between these maps is driven by statistical thresholding? Am I right that, for the maps in Figures 4 and 5, you are using Bonferroni correction for multiple tests across parcels? This is a fairly stringent correction; I’m curious whether the maps would look less spotty with a less conservative threshold (e.g., FDR < .05). Currently the authors color the maps using a t-value, but the coefficient (or even just a simple correlation) would give us a better grasp on the strength of association between the neural and behavioral RDMs. I understand that the mixed effects model is more rigorous here, but I’d be curious to know what the actual correlation values are between the neural and behavioral RDMs. Lastly, I would ask the authors to directly compute the similarity of the unthresholded maps. For a simple pass at this, what’s the correlation between the vectorized coefficient maps for each pair of movies?
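
To make this concrete, here's a minimal sketch of the kind of check I have in mind, assuming per-parcel p-values from the IS-RSA model for each movie; a hand-rolled Benjamini-Hochberg procedure stands in for whatever implementation the authors prefer.

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg FDR correction across parcels; returns a
    boolean mask of parcels that survive at the given q threshold."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    # Step-up procedure: reject all tests up to the largest rank k
    # for which p_(k) <= q * k / n
    below = pvals[order] <= q * np.arange(1, n + 1) / n
    passed = np.zeros(n, dtype=bool)
    if below.any():
        passed[order[:np.max(np.nonzero(below)[0]) + 1]] = True
    return passed
```

Comparing the thresholded maps under Bonferroni versus FDR, and correlating the unthresholded coefficient maps across movie pairs (as in the sketch above for the ISC maps), would help separate genuine inconsistency from thresholding artifacts.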

I worry that panel D (and, indirectly, the spatial layout of panel B) of Figure 3 could result from trivial properties of the signal (e.g., signal-to-noise ratio). The authors show that regions with higher ISC, i.e., areas more directly driven by the stimulus itself, also tend to show higher variability across stimuli (quantified by ANOVA F-values). The authors admit that “this relationship might not be surprising from a statistical standpoint”—but it’s hard to tell whether the result is really meaningful at all. To make this point more clearly, I wrote a very rudimentary demo: https://github.com/snastase/isc-var-correlation/blob/main/isc_var_correlation.ipynb. In short, you will get a correlation similar to the result in panel D simply by adding different amounts of random noise to different ROIs.
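
Here's a condensed version of the notebook's logic, with made-up simulation parameters: every ROI shares the same kind of stimulus-driven signal and ROIs differ only in their noise level, so no true movie effect is built in.

```python
import numpy as np
from scipy.stats import f_oneway, pearsonr

rng = np.random.default_rng(0)
n_subjects, n_trs, n_rois, n_movies = 20, 200, 100, 8
noise_sd = np.linspace(0.5, 5.0, n_rois)  # ROIs differ only in noise level

mean_isc = np.zeros(n_rois)
f_vals = np.zeros(n_rois)
for roi in range(n_rois):
    iscs = np.zeros((n_movies, n_subjects))
    for movie in range(n_movies):
        signal = rng.standard_normal(n_trs)  # shared stimulus-driven signal
        data = signal + noise_sd[roi] * rng.standard_normal((n_subjects, n_trs))
        for s in range(n_subjects):
            # Leave-one-out ISC: correlate each subject with the mean of others
            others = np.delete(data, s, axis=0).mean(axis=0)
            iscs[movie, s] = pearsonr(data[s], others)[0]
    mean_isc[roi] = iscs.mean()
    f_vals[roi] = f_oneway(*iscs)[0]  # "movie effect" on subject-level ISCs

# A positive ISC-to-F correlation emerges despite no true movie effect
print(pearsonr(mean_isc, f_vals))
```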

The text in the section titled “Movies are more akin to tasks than to rest” doesn’t really speak to its title. The argument seems to be that movies are different from rest because different movies evoke different spatiotemporal patterns of neural activity. But this logic also applies to resting state! The point is that with resting state, we can’t even begin to compare all the little self-driven “tasks” the resting brain engages in from moment to moment because we have no external reference point. If you directly compare the time series from one scan to another, you will get zero “generalizability”, as the authors put it. This is why resting state relies on a completely different analytic framework, where we use the covariance of neural activity to abstract away from the (unknown) timing and content of neural events. Relatedly, if you apply (within-subject) functional connectivity analysis to movies, you will find that the resulting connectivity matrices are quite similar across different movies (e.g., Vanderwal et al., 2017). If we were to use intersubject functional connectivity (ISFC), I would expect to see larger differences between different movies because ISFC isolates stimulus-driven connectivity and is therefore more sensitive to differences in the content of the movies (Simony et al., 2016)—which I would interpret as a good thing!
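
For concreteness, a minimal sketch of the two analyses, assuming data in a subjects-by-regions-by-timepoints array; names are illustrative, and I'm omitting the symmetrization and statistical steps from Simony et al., 2016.

```python
import numpy as np

def within_subject_fc(data):
    """Within-subject functional connectivity, averaged across subjects.
    Retains idiosyncratic signals (e.g., motion, respiration)."""
    return np.mean([np.corrcoef(subj) for subj in data], axis=0)

def isfc(data):
    """Leave-one-out intersubject functional connectivity: correlate each
    subject's regional time series with the average time series of the
    remaining subjects, isolating the stimulus-driven component."""
    n_subjects, n_regions, _ = data.shape
    mats = []
    for s in range(n_subjects):
        others = np.delete(data, s, axis=0).mean(axis=0)
        full = np.corrcoef(np.vstack([data[s], others]))
        mats.append(full[:n_regions, n_regions:])  # subject-vs-others block
    return np.mean(mats, axis=0)
```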

On page 23, if the authors want to discuss “what features of a [naturalistic stimulus] drive neural synchronization”, I would suggest they highlight the large body of work using encoding models to address precisely this question (e.g., Nishimoto et al., 2011; Huth et al., 2012). In our recent work, for example, we explicitly point out that ISC, as a data-driven/model-free method, is content-agnostic, and we use encoding models to isolate what features of the stimulus are driving activity (e.g., Zada et al., 2024; Samara et al., 2025). ISC can tell us the “where” and “how much” of stimulus-driven neural activity, but we need encoding models to quantify explicitly “what” stimulus features are driving neural activity. Relatedly, on page 24, the authors want to know “what exactly are we generalizing across?”—the out-of-sample prediction methods typically used in encoding analyses, where you train a model on a subset of stimuli and evaluate its generalization on left-out test stimuli, provide a principled way to quantify the generalization the authors are seeking here.
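
For example, here's a minimal sketch of that cross-stimulus generalization scheme, assuming per-movie stimulus feature matrices and response matrices; the names and the ridge penalty are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def cross_movie_encoding(features, responses, test_movie, alpha=10.0):
    """Train an encoding model on all movies except test_movie and return
    per-voxel correlations between predicted and actual responses on the
    held-out movie.

    features: list of (TRs x features) arrays, one per movie
    responses: list of (TRs x voxels) arrays, one per movie
    """
    train = [m for m in range(len(features)) if m != test_movie]
    X = np.vstack([features[m] for m in train])
    Y = np.vstack([responses[m] for m in train])
    model = Ridge(alpha=alpha).fit(X, Y)
    Y_pred = model.predict(features[test_movie])
    Y_test = responses[test_movie]
    # Per-voxel correlation between predicted and actual time series
    Y_pred = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
    Y_test = (Y_test - Y_test.mean(0)) / Y_test.std(0)
    return (Y_pred * Y_test).mean(0)
```

The per-voxel test correlations directly quantify how well a model of stimulus features generalizes to new stimuli, which seems to be exactly the kind of generalization the authors are after.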

In the first sentence of their Discussion, the authors say “Here, we quantified the extent of variability in ISC between different movies, and we assessed the consequences of this variability for investigations of trait-like or state-like differences in neural synchronization during movie watching”—but they omit that they only investigate “trait-like or state-like differences” in a very particular set of (short, animated) movies combined with a very peculiar and largely unrelated behavioral task! Apologies for being pedantic here, but the authors interpret their results with very general language—general enough to make claims about the entire field of naturalistic neuroimaging!—in much the same way they would censure in the work of other authors.

Minor comments:

The Introduction section is a bit sprawling; the focus and precision of the writing can be improved.

Page 2, line 28: I don’t think anybody would claim that naturalistic paradigms fully “address” the limited ecological validity of traditional experiments, as is implied by this sentence; the goal is to improve ecological validity. Straw-manning in this way is misrepresentative and unconstructive.

Page 2, line 32: “the generalizability of findings from naturalistic paradigms has been largely overlooked”—is this really true?

Page 2, line 48: Note, however, that subsequent viewings of a naturalistic stimulus are not fully the same as the first viewing (e.g., Aly et al., 2018; Lee et al., 2021; Michelmann et al., 2021); the initial novelty and sense-making processes may not carry over to subsequent exposures, whereas memory and prediction processes may increase after the first viewing.

Page 3, line 54: “However, it remains unclear whether different naturalistic materials, for example different movie clips, evoke consistent patterns of ISC”—is this really true? I feel like people have run ISC analysis on many different movie stimuli, occasionally multiple movies within a given study, and the pattern of results is fairly well understood; that is, ISC patterns are fairly consistent across movies, but with some potentially interesting deviations. Maybe it would be fairer to say that the differences in patterns of ISC across movies haven’t been “systematically” studied?

Page 4, line 92: “Considering that the main measure of neural synchronization, ISC, is strongly stimulus driven”—I’m not sure what this is supposed to mean; ISC strictly measures stimulus-driven activity and stimulus-driven activity alone; anything that is not in some (potentially indirect) way driven by the stimulus will not be time-locked to the stimulus in different subjects scanned at different times and therefore will not be detected by ISC.
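
A toy illustration of this point, with arbitrary simulation parameters: only the component of activity that is time-locked to the stimulus across subjects survives the intersubject correlation.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
t = 300
stimulus_driven = rng.standard_normal(t)          # shared, stimulus-locked
subj1 = stimulus_driven + rng.standard_normal(t)  # plus idiosyncratic activity
subj2 = stimulus_driven + rng.standard_normal(t)
print(pearsonr(subj1, subj2)[0])  # ~.5: reflects only the shared component

intrinsic1 = rng.standard_normal(t)  # no stimulus-locked component at all
intrinsic2 = rng.standard_normal(t)
print(pearsonr(intrinsic1, intrinsic2)[0])  # ~0: invisible to ISC
```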

Pages 7–9: I would suggest trimming down the fMRIPrep boilerplate a bit to exclude pieces that are not relevant at all to your downstream analysis; for example, if you don’t use the tCompCor time series in your confound regression model (page 10), there’s not much reason to spend your word count describing how fMRIPrep extracted it.

Page 12: Can you specify whether the subjects watched the movies first then performed the behavioral task second or vice versa?—I might have just missed this in the text.

Page 16, line 436: “In prior work, such coarse similarities have been interpreted as evidence that different naturalistic materials evoke broadly comparable synchronization across the brain (Nastase et al., 2021).” Is the similarity really “coarse”? Is there an implication here that the interpretation of the evidence in Nastase et al., 2021, was unjustified or misleading? Speaking as the author of that paper :) —when I look back at the ISCs for different stories in the Narratives dataset in Figure 4, I still think “wow, those are remarkably similar maps!” given that they’re different speakers telling completely different stories with different content, tone, duration, etc.

Page 27, line 758: “Within-subject functional connectivity measures may be more consistent across different movies than ISC-based measures (Tian et al., 2021). This relative stability is likely because functional connectivity is largely driven by intrinsic network architecture (Buckner et al., 2013)”—sure, intrinsic network architecture, but also idiosyncratic noise! Any idiosyncratic blip of head motion or respiration will show up in within-subject functional connectivity. ISFC, on the other hand, isolates the stimulus-driven component of connectivity and discards idiosyncratic noise (Simony et al., 2016).

Figure 3: Why are there correlations greater than r = 1 along the y-axis of panels A and C? I assume these are Fisher z-transformed values? Can you inverse-transform these values back to correlations for visualization?
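
For reference, the forward and inverse transforms are trivial to apply for visualization:

```python
import numpy as np

r = np.array([0.2, 0.5, 0.8])
z = np.arctanh(r)    # Fisher z transform; unbounded, can exceed 1
r_back = np.tanh(z)  # inverse transform, bounded in (-1, 1)
```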

References:

Aly, M., Chen, J., Turk-Browne, N. B., & Hasson, U. (2018). Learning naturalistic temporal structure in the posterior medial network. Journal of Cognitive Neuroscience, 30(9), 1345–1365. DOI

David, S. V., Vinje, W. E., & Gallant, J. L. (2004). Natural stimulus statistics alter the receptive field structure of v1 neurons. Journal of Neuroscience, 24(31), 6991–7006. DOI

Finn, E. S. (2021). Is it time to put rest to rest? Trends in Cognitive Sciences, 25(12), 1021–1032. DOI

Finn, E. S., Corlett, P. R., Chen, G., Bandettini, P. A., & Constable, R. T. (2018). Trait paranoia shapes inter-subject synchrony in brain activity during an ambiguous social narrative. Nature Communications, 9, 2043. DOI

Grall, C., & Finn, E. S. (2022). Leveraging the power of media to drive cognition: a media-informed approach to naturalistic neuroscience. Social Cognitive and Affective Neuroscience, 17(6), 598–608. DOI

Hasson, U., Malach, R., & Heeger, D. J. (2010). Reliability of cortical activity during natural stimulation. Trends in Cognitive Sciences, 14(1), 40–48. DOI

Haxby, J. V., Guntupalli, J. S., Connolly, A. C., Halchenko, Y. O., Conroy, B. R., Gobbini, M. I., Hanke, M., & Ramadge, P. J. (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2), 404–416. DOI

Haxby, J. V., Gobbini, M. I., & Nastase, S. A. (2020). Naturalistic stimuli reveal a dominant role for agentic action in visual representation. NeuroImage, 216, 116561. DOI

Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210–1224. DOI

Lee, C. S., Aly, M., & Baldassano, C. (2021). Anticipation of temporally structured events in the brain. eLife, 10, e64972. DOI

Michelmann, S., Price, A. R., Aubrey, B., Strauss, C. K., Doyle, W. K., Friedman, D., Dugan, P. C., Devinsky, O., Devore, S., Flinker, A., Hasson, U., & Norman, K. A. (2021). Moment-by-moment tracking of naturalistic learning and its underlying hippocampo-cortical interactions. Nature Communications, 12(1), 5394. DOI

Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., & Gallant, J. L. (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19), 1641–1646. DOI

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609. DOI

Samara, A., Zada, Z., Vanderwal, T., Hasson, U., & Nastase, S. A. (2025). Cortical language areas are coupled via a soft hierarchy of model-based linguistic features. bioRxiv. DOI

Simony, E., Honey, C. J., Chen, J., Lositsky, O., Yeshurun, Y., Wiesel, A., & Hasson, U. (2016). Dynamic reconfiguration of the default mode network during narrative comprehension. Nature Communications, 7, 12141. DOI

Vanderwal, T., Eilbott, J., & Castellanos, F. X. (2019). Movies in the magnet: naturalistic paradigms in developmental functional neuroimaging. Developmental Cognitive Neuroscience, 36, 100600. DOI

Vanderwal, T., Eilbott, J., Finn, E. S., Craddock, R. C., Turnbull, A., & Castellanos, F. X. (2017). Individual differences in functional connectivity during naturalistic viewing conditions. NeuroImage, 157, 521–530. DOI

Zada, Z., Goldstein, A. Y., Michelmann, S., Simony, E., Price, A., Hasenfratz, L., Barham, E., Zadbood, A., Doyle, W., Friedman, D., Dugan, P., Melloni, L., Devore, S., Flinker, A., Devinsky, O., Nastase, S. A., & Hasson, U. (2024). A shared model-based linguistic space for transmitting our thoughts from brain to brain in natural conversations. Neuron, 112(18), 3211–3222. DOI