Open reviews

December 1, 2025 — Open review of Small and colleagues:

Small, H., Lee Masson, H., Wodka, E., Mostofsky, S., & Isik, L. (2025). Ubiquitous cortical sensitivity to visual information during naturalistic, audiovisual movie viewing. PsyArXiv. DOI


Small and colleagues compare a number of advanced vision and language models (as well as lower-level features and human annotations) in terms of how well they capture variance in fMRI activity during a naturalistic audiovisual movie stimulus (part of a BBC Sherlock episode). They evaluate these models across cortex and within individually-defined functional regions of interest (fROIs) derived from localizer tasks. Surprisingly, they find that a neural network model for vision captures the largest proportion of variance in neural activity across the vast majority of cortex. While the language models perform well in a punctate set of language areas, they show that, critically, the visual model still matches or outperforms the language models even in functionally-defined language ROIs. This is a very interesting result, and likely to be somewhat controversial given that the language network has been argued to be quite selective for language processing (e.g., Fedorenko et al., 2024). I think there’s real scientific value in comparing these different classes of features and evaluating their relative contributions under naturalistic conditions.

I’m honestly a bit surprised (like the authors were, I suspect) by how well the vision models perform, and I want to make sure there’s no bias against the language models in the authors’ modeling approach. The authors really don’t want readers interested in language to look at this and simply conclude that the language features weren’t given a fair shake at “winning” against the vision models. So, my top-level comments are geared toward ensuring that the comparison between vision and language models is as well-matched as possible (without violating the “balance” of these features as they occur in a naturalistic stimulus). That said, I think the methods are rigorous, I generally believe the results, and I think this is worth publishing.

Major comments:

My first thought when digesting the results was: Is the superiority of the vision model just a matter of volume—i.e., differences in the number of samples/TRs? Visual input in a movie is essentially continuous (with dramatic changes at camera cuts, scene changes, etc.), whereas language is relatively punctate. Utterances, conversations, narration, etc. surely occur frequently over the course of the film, but I doubt they occur as continuously as the visual inputs; the structure of language is more on-and-off than that of vision. Two questions/suggestions: First, is it possible that the on-and-off structure of language, where you may have zeros in the language embeddings for several TRs at a time (while the visual input proceeds continuously), dilutes or overwhelms the contribution of word embeddings? For example, in some of our past work (Kumar, Sumers et al., 2024), we observed that just the on-and-off structure of language can drive what looks like strong encoding performance for LLM embeddings (particularly in early auditory areas), but it’s really just the onsets/offsets of language driving the prediction, rather than word-by-word meaning. Is there anything else the authors can do to reassure us that this on-and-off structure of speech isn’t contaminating the language model performance? Second, I’d be curious to see how the language models fare against the vision model if you simply discard all TRs where there’s no language occurring. I understand this somewhat violates the frequency/structure at which visual and linguistic inputs co-occur in natural contexts, but this would provide a stringent test of vision models against language models and may allay concerns that there’s simply less language than there is vision naturally present in the stimulus.
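
To make the second suggestion concrete, here’s a minimal sketch (with hypothetical variable names, and assuming the language feature matrices are zero-filled during silent TRs) of restricting evaluation to language-containing TRs:

```python
import numpy as np

def language_present_mask(X_lang):
    """Boolean mask over TRs containing any language, assuming the language
    feature matrix (n_TRs, n_features) is zero-filled during silence."""
    return np.any(X_lang != 0, axis=1)

def voxelwise_corr(Y_true, Y_pred, mask=None):
    """Voxelwise Pearson r between data and model predictions, optionally
    restricted to the masked (language-containing) test TRs."""
    if mask is not None:
        Y_true, Y_pred = Y_true[mask], Y_pred[mask]
    Yt = Y_true - Y_true.mean(axis=0)
    Yp = Y_pred - Y_pred.mean(axis=0)
    return (Yt * Yp).sum(axis=0) / (
        np.linalg.norm(Yt, axis=0) * np.linalg.norm(Yp, axis=0))
```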

Related to the previous comment, I wonder if the visual content of the stimulus is natively more “dense” or high-dimensional than the language input. Visual input has extremely high bandwidth, whereas speech is presumably “thinner”—can the authors quantify this? I have two ideas here as well: First, can you quantify the intrinsic dimensionality of both the vision embeddings and the language embeddings over the course of this particular stimulus? Something very simple here would suffice, I think: for example, run PCA on the stack of vision embeddings across the entire stimulus and tell us how many dimensions are needed to capture 90% of variance in visual content for Sherlock; repeat this same procedure for the language embeddings. If the visual features have higher intrinsic dimensionality over the course of the stimulus, this could provide (part of) a straightforward explanation as to why the vision model outperforms the other models—i.e., because the stimulus simply has more visual stuff going on in it than it has linguistic stuff. Second, this thought process leads to a slightly more ambitious idea: Maybe you just need more language (i.e., more words/samples/TRs) for it to reach a similar dimensionality/richness to vision. I wonder if a subsampling procedure would allow the authors to quantify how much more language would be needed to match the visual dimensionality; e.g., you could try cutting the vision embeddings down to 50%, 25%, or 10% of TRs (i.e., discarding 50%, 75%, or 90% of the TRs) and see at which volume the visual dimensionality matches the language model dimensionality. I don’t think the authors necessarily need to follow through on this analysis, but I wanted to suggest it in case it might help to clarify the interpretation of their findings.
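
Something along the lines of the following sketch (assuming each model’s embeddings are stacked into an (n_TRs, n_features) matrix; variable names are hypothetical) is all I have in mind:

```python
import numpy as np
from sklearn.decomposition import PCA

def dims_for_variance(X, threshold=0.90):
    """Number of principal components needed to capture `threshold` of the
    variance in an (n_TRs, n_features) embedding matrix for the stimulus."""
    pca = PCA().fit(X)
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum_var, threshold)) + 1

# Compare, e.g., dims_for_variance(X_vision) against dims_for_variance(X_sbert);
# for the subsampling idea, repeat on X_vision[::2], X_vision[::4], X_vision[::10]
# (i.e., 50%, 25%, 10% of TRs) to see where the visual dimensionality drops to
# the language models' level.
```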

I’m very surprised that the AlexNet results are so uniform in their values across the selected voxels. In the ISC map, we see a large range of values, with very high values in vision and language areas and quite low values in other areas. In other words, we see a nice “dynamic range” in the ISC map with strong spatial structure across brain areas. On the other hand, the results for the joint model and AlexNet all sit in a similar range of modestly high values with very little spatial structure. Why is this the case? This makes me worry that there may be some kind of “bug” in the model fitting or evaluation. It also worries me that in many areas where ISC is fairly low, the encoding performance values are still quite high. I understand that this is certainly possible given that the encoding models are trained and tested within participants, but I wouldn’t expect the difference to be this large. For instance, at the bottom of page 4, the authors report that the full model accounts for 20–30% of a noise ceiling estimated using ISC in individually-defined ROIs. This seems reasonable to me! However, based on Figure 2, I would expect the full model to account for well over 100% of the ISC-based noise ceiling across many voxels outside those ROIs—yikes. This difference implies that, for all of these voxels, the models are able to capture something for most subjects that is unique to each subject and not captured in the mean time series of other subjects. Again, this is certainly possible due to idiosyncrasies in functional topography across individuals (among other reasons), but I don’t think I’d expect the maps to look quite like this. If there were some such “bug”, I’m afraid I don’t know what it would be—but I’ll outline a couple of ideas in the next comment.
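
In the meantime, a quick way to visualize this mismatch (hedged, since it depends on whether the ceiling is defined on r or r²) would be to map the voxelwise ratio of encoding performance to the ISC-based ceiling and see where it exceeds 1:

```python
import numpy as np

def ceiling_normalized(enc_r, isc_r, eps=1e-6):
    """Voxelwise ratio of encoding performance to an ISC-based noise ceiling
    (both passed as correlations here; use squared values instead if the
    ceiling is defined on explained variance). Ratios well above 1 in voxels
    outside the fROIs would flag the mismatch described above."""
    return enc_r / np.maximum(isc_r, eps)
```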

The models are highly flexible, so overfitting or leakage of some kind between training and test could lead to maps that look uniformly high across most voxels. I assume AlexNet is the widest model, so if overfitting is a problem, it will likely win the overfitting competition. If I understand correctly, the authors are splitting the time series into 20-TR (30-second) chunks, then assigning a random 80% to training and the remaining 20% to test (apologies if I’m misunderstanding the cross-validation structure). This is fairly reasonable, but it is not the most stringent approach and may still allow the models to capitalize on temporal autocorrelation. This approach falls somewhere between (1) the most egregious mistake of selecting random subsets of TRs, allowing for strong temporal autocorrelation across training/test splits (as pointed out by Hadidi, Feghhi et al., 2025, with regard to the influential paper by Schrimpf et al., 2021) and (2) the most stringent approach of splitting the time series into five temporally contiguous chunks. I wonder how these 20-TR chunks interact with the delayer of up to 7.5 seconds. I would suggest the authors replicate their core analysis using cross-validation with five temporally contiguous chunks to ensure the main findings hold. I was also wondering if the smoothing procedure or poor surface interpolation (nilearn’s vol_to_surf is not ideal) could lead to the uniformly high AlexNet performance in Figure 2, but my money would be on the cross-validation procedure as the main culprit.
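
For what it’s worth, here’s a minimal sketch of the more stringent scheme, using scikit-learn’s KFold with shuffle=False to produce five temporally contiguous folds that could be dropped into the existing fitting pipeline:

```python
import numpy as np
from sklearn.model_selection import KFold

n_trs = 1921  # number of TRs reported in the paper
cv = KFold(n_splits=5, shuffle=False)  # contiguous blocks, no shuffling

for train_idx, test_idx in cv.split(np.arange(n_trs)):
    # Fit the banded ridge model on train_idx and evaluate on the held-out
    # contiguous block; consider trimming a handful of TRs at each fold
    # boundary to respect the ~7.5 s delayer and residual autocorrelation.
    pass
```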

I’m having a hard time understanding what the feature matrices end up looking like for the language models. I think I understand the HuBERT and word2vec matrices, but am I correct in understanding that the downsampling procedure for the sBERT embeddings will result in a brief impulse vector at the end of each sentence? This seems pretty weird. This will result in relatively few distinct vectors (i.e., the number of sentences) sparsely distributed over the course of the scan. I’m also not sure this really makes sense “cognitively”—i.e., I suspect the brain constructs a sentence word by word rather than having a burst of activity marking sentence completion. (For example, I could also imagine broadcasting the sBERT embedding for a given sentence along the full duration of the sentence.) I have another relatively large concern following on this comment: After explaining this procedure with sBERT, the authors simply say “In follow-up analyses, we also extracted embeddings in the same way from GPT-2.” I don’t really understand what this means, given that GPT-2 generates embeddings word by word. Are you extracting some kind of sentence embedding out of GPT-2? I think the only way to “do justice” to language processing in these analyses would be to use word-by-word GPT-2 embeddings. I hope the authors are using word-by-word GPT-2 embeddings in the same way as prior research; if not, I would like to see the findings updated with this approach.
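
To illustrate the alternative I have in mind (a hypothetical helper, not the authors’ pipeline), each sentence’s embedding could be broadcast across the TRs the sentence spans rather than placed as an impulse at sentence offset:

```python
import numpy as np

def broadcast_sentence_embeddings(embeddings, onsets, offsets, n_trs, tr=1.5):
    """Assign each sentence's embedding (n_sentences, n_dims) to every TR the
    sentence spans, given onsets/offsets in seconds; embeddings are averaged
    where multiple sentences overlap a TR. A TR of 1.5 s is assumed from the
    20-TR / 30-second chunks described above."""
    X = np.zeros((n_trs, embeddings.shape[1]))
    counts = np.zeros(n_trs)
    for emb, onset, offset in zip(embeddings, onsets, offsets):
        start = int(onset // tr)
        stop = min(int(np.ceil(offset / tr)), n_trs)
        X[start:stop] += emb
        counts[start:stop] += 1
    nonzero = counts > 0
    X[nonzero] /= counts[nonzero][:, None]
    return X
```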

I have a couple questions about the model dimensionality: If I understand correctly, the vision model embeddings were reduced to 6480 dimensions using sparse random projections. Is this the dimensionality per layer? This seems pretty wide for a single layer, but it’s hard to judge without knowing the original dimensionality. It may be helpful to create a supplementary table where you report the original and reduced dimensionality of each model and layer. I don’t really understand the Johnson-Lindenstrauss lemma, but I’m also surprised that the “optimal” dimensionality (6480) is still 3x the number of samples (1921)—is this correct? I believe that all of these models are fit simultaneously with himalaya; how wide exactly is the final model? Back-of-the-envelope math is telling me something on the scale of (6000 * 6 + 700 * 12) * 5 delays = ~220,000+ features? With over a dozen different bands? We’re often using quite wide models with ridge regression in our work, but damn this is aggressively wide (for 1921 TRs) and presents a difficult hyperparameter optimization problem. Are the authors worried that they might be flying a bit too close to the sun in terms of dimensionality? I wonder if the results would look similar if all the widest models were reduced to, say, 100 dimensions using PCA, although I understand this analysis isn’t necessarily optimal/fair.
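
For that last suggestion, a rough sketch of what I mean by equalizing band widths (PCA fit on training TRs only to avoid leakage; again, not necessarily an optimal or “fair” comparison):

```python
from sklearn.decomposition import PCA

def reduce_band(X_train, X_test, n_components=100):
    """Reduce one feature band to a fixed dimensionality using PCA fit on the
    training TRs only, then apply the same projection to the test TRs."""
    pca = PCA(n_components=n_components).fit(X_train)
    return pca.transform(X_train), pca.transform(X_test)
```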

At the top of page 12, you say “Data was smoothed…”—isn’t this redundant with the same sentence two paragraphs up? Further down the page, you say “Group maps were created using Nilearn’s second level modeling with a smoothing FWHM of 3.0mm.” Similarly for the encoding analyses. Does this mean the same data were smoothed multiple times? I.e., time series are smoothed prior to regression, then coefficients or encoding performance values are smoothed again prior to group-level analyses? Is this standard practice? I wonder if this could contribute to the strangely smooth model performance maps in Figure 2. I’m not asking the authors to re-run anything here because I don’t think this is a super important issue, just looking for clarification. (Also: “Data was smoothed” > “Data were smoothed”)

In the final interpretational analysis (Figure 5), the authors say “we extracted the 25 highest weighted units from AlexNet layer 6.” Just to check: the magnitude of these weights could depend arbitrarily on the original scale of the features (e.g., a feature with arbitrarily high variance could get a small weight despite being important overall). Do you z-score the features within each training set (e.g., StandardScaler)? In general, I understand that this interpretation effort is necessary—but I didn’t find the outcome particularly convincing. I could see Figure 5 living in the supplement instead of the main text, but I’ll leave it up to the authors.
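
For reference, the kind of per-fold standardization I have in mind, if it’s not already part of the pipeline:

```python
from sklearn.preprocessing import StandardScaler

def zscore_within_fold(X_train, X_test):
    """Standardize features using training-set statistics only, so that the
    ridge weights are on a comparable scale across features when ranking the
    'highest weighted' units."""
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)
```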

In Figure S4, I don’t immediately see any black-outlined language fROIs in subjects 1 through 9. Were they not localizable or is this a mistake of some kind?

Minor comments:

I think the authors could probably make an even stronger argument for work of this kind in the Introduction. In the vast majority of experiments in cognitive neuroscience, data collection is designed to target specific features at the expense of other features; experiments are often designed to isolate a particular kind of feature from other features. But this makes it extremely difficult to gauge the relative contributions of different features in more natural contexts. Outside the laboratory, many features tend to co-occur, are often entangled, and vary in their importance to the brain and behavior. By using this model comparison approach with a naturalistic stimulus, the authors can more faithfully quantify which features the brain cares about most in a context much more closely resembling everyday life.

Interesting that the vision and language embeddings have fairly low correlations! I suppose it wouldn’t make much sense from a filmmaking perspective for vision and language to be highly redundant. They are usually “congruent” presumably, but separately/complementarily informative.

Figure 1: In panel B, the vertical spacing of the vision, motion, etc labels is a bit janky.

Page 2: I was expecting to also see a citation to Güçlü & van Gerven (2015) alongside the reference to Eickenberg et al. at the bottom of the page.

Figure 2: I find this color map a bit confusing. For example, the low (but significant) values in the ISC map are overly similar to the background color of the brain. The sBERT results are almost entirely in the purple range of the color bar, making the resulting brain map look like it’s on an almost entirely different color map than the AlexNet results.

Figure 2 caption: Extra space after second sentence. Missing reference to panel D.

Page 4: “DICE” > “Dice”—I assume this isn’t all caps if it’s someone’s name

Discussion: The authors could consider discussing how multimodal vision/language models (e.g., CLIP) might fit into this puzzle.

Page 8: “We also showed for the first time that social interaction perception and language regions, as identified with standard controlled experiments, are largely non-overlapping, spatially and functionally.”—annoying comment: doesn’t this depend a good bit on where you draw the statistical thresholds for defining these regions?

Page 9: “Finally, this work is unique because we localized”—this sentence is awkward

Page 10–11: I think you can probably remove any pieces of the fMRIPrep boilerplate that correspond to derivatives you don’t end up using; e.g., if you don’t actually use the tCompCor confounds, there’s not much point in using lines of text to explain how fMRIPrep extracts them.

Page 13: How did the authors decide on sentence boundaries in the transcript?

Page 15: Cite the BrainIAK paper if you used it: Kumar et al., 2021

Reference 55, Doerig et al.: This one came out in Nature Machine Intelligence recently.

Figure S4: I’m actually surprised how well the orange language-preferring regions match with the black-outlined language fROIs—cool!

References:

Doerig, A., Kietzmann, T. C., Allen, E., Wu, Y., Naselaris, T., Kay, K., & Charest, I. (2025). High-level visual representations in the human brain are aligned with large language models. Nature Machine Intelligence, 7, 1220–1234. DOI

Fedorenko, E., Ivanova, A. A., & Regev, T. I. (2024). The language network as a natural kind within the broader landscape of the human brain. Nature Reviews Neuroscience, 25(5), 289–312. DOI

Güçlü, U., & van Gerven, M. A. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27), 10005–10014. DOI

Hadidi, N.*, Feghhi, E.*, Song, B. H., Blank, I. A., & Kao, J. C. (2025). Illusions of alignment between large language models and brains emerge from fragile methods and overlooked confounds. bioRxiv. DOI

Kumar, M., Anderson, M. J., Antony, J. W., Baldassano, C., Brooks, P. P., Cai, M. B., Chen, P.-H. C., Ellis, C. T., Henselman-Petrusek, G., Huberdeau, D., Hutchinson, J. B., Li, P. Y., Lu, Q., Manning, J. R., Mennen, A. C., Nastase, S. A., Richard, H., Schapiro, A. C., Schuck, N. W., Shvartsman, M., Sundaraman, N., Suo, D., Turek, J. S., Turner, D. M., Vo, V. A., Wallace, G., Wang, Y., Williams, J. A., Zhang, H., Zhu, X., Capota, M., Cohen, J. D., Hasson, U., Li, K., Ramadge, P. J., Turk-Browne, N. B., Willke, T. L., & Norman, K. A. (2021). BrainIAK: The Brain Imaging Analysis Kit. Aperture Neuro, 1. DOI

Kumar, S.*, Sumers, T. R.*, Yamakoshi, T., Goldstein, A., Hasson, U., Norman, K. A., Griffiths, T. L., Hawkins, R. D., & Nastase, S. A. (2024). Shared functional specialization in transformer-based language models and the human brain. Nature Communications, 15, 5523. DOI

Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45), e2105646118. DOI