Open reviews
June 7, 2025 — Open review of Hadidi, Feghhi and colleagues:
Hadidi, N.*, Feghhi, E.*, Song, B. H., Blank, I. A., & Kao, J. C. (2025). Illusions of alignment between large language models and brains emerge from fragile methods and overlooked confounds. bioRxiv. DOI
Hadidi, Feghhi and colleagues show that certain results from a particular thread of papers fall victim to a methodological misstep: using cross-validation with shuffled train and test splits. For many types of data (e.g., fMRI data, natural language, or basically any time series data), using shuffled train and test splits means that samples in the test sets will be correlated with samples in the training sets due to autocorrelation. They show that two core results from a landmark paper by Schrimpf et al., 2021, using shuffled splits, can be captured by a simple “orthogonal autocorrelated sequences” model (OASM). These two results, (1) that unidirectional / autoregressive models (like GPT) outperform bidirectional / masked models (like BERT) and (2) that brain predictions correlate with next-word prediction, both effectively disappear in the Schrimpf et al., 2021, datasets when using properly contiguous cross-validation splits instead of shuffled splits. They also show that, under certain conditions, a simple “position and word rate” (PWR) model provides surprisingly high correlations on par with the language models.
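To make the core issue concrete for readers less familiar with it, here is a minimal sketch (simulated data and illustrative variable names, not the authors' pipeline) contrasting shuffled and contiguous cross-validation splits: with shuffled splits, nearly every test sample sits immediately adjacent to a training sample, so autocorrelation alone can inflate out-of-sample performance.

```python
# Minimal sketch: contiguous vs. shuffled cross-validation splits for an
# autocorrelated time series. Variable names and sizes are illustrative.
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1000

# Contiguous folds: each test fold is a temporally contiguous block,
# so most test samples are far from any training sample in time.
contiguous_cv = KFold(n_splits=5, shuffle=False)

# Shuffled folds: test samples are interleaved with training samples,
# so autocorrelation lets the model "peek" at temporally adjacent points.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)

for name, cv in [("contiguous", contiguous_cv), ("shuffled", shuffled_cv)]:
    train_idx, test_idx = next(iter(cv.split(np.arange(n_samples))))
    # Distance from each test sample to its nearest training sample:
    # typically 1 sample for shuffled splits, much larger for contiguous splits.
    min_dist = np.array([np.abs(train_idx - t).min() for t in test_idx])
    print(f"{name}: median distance to nearest training sample = "
          f"{np.median(min_dist):.0f} samples")
```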
This is a very timely paper, and likely to be somewhat controversial—although I think this paper is unequivocally correct in pointing out certain methodological missteps that undermine core results in prior work. I have two high-level concerns: (1) that some readers will overgeneralize these findings and “throw the baby out with the bathwater”, and (2) that some of these “gotchas” are also due in part to the relatively odd structure of the original datasets (as well as certain analytic choices that remain unexplored here). This creates a certain tension between accurately reproducing (suboptimal) analytic choices applied to the particular datasets used by Schrimpf et al., 2021, and making more general recommendations for the field. I unpack these comments below, followed by a number of more specific methodological questions and clarifications. Overall, I think this is a useful paper, and correctives of this kind are part of a healthy science—but, for a paper like this, the authors need to be extremely judicious and precise in their application of the corrective to avoid causing more harm than good.
Major comments:
1a. This is my highest-priority comment: I think it’s extremely important that the authors be very precise about which papers—and which associated results—are known to be affected by certain methodological missteps, and explicitly which studies (and results) are not. There are already plenty of people out there who would love ammunition to argue that (1) encoding models are generally useless for neuroscience (not true at all) and (2) that neural network models for natural language (e.g., LLMs) are not useful for modeling the human neural machinery for natural language (also not true). There is a real risk that the important correctives this paper highlights will be overinterpreted as an indictment of the entire research program—which includes many labs and papers that have in fact been very careful and have managed to avoid these methodological missteps. (This kind of pattern—where a legitimate corrective is introduced, but then overinterpreted—occurs repeatedly in the fMRI literature; for example “voodoo correlations”, “cluster failure”, etc; Vul et al., 2009; Eklund et al., 2016.) I think the authors are already making a good-faith effort to do this, but the more judicious and explicit the better in this respect.
1b. Concretely, I would first suggest that the authors more explicitly and comprehensively enumerate which papers are subject to particular criticisms. For example, as a spot-check for shuffled cross-validation, I confirmed that Caucheteux et al., 2021a, 2021b, 2022, 2023, Caucheteux & King, 2022, all explicitly state in writing that they’re using temporally contiguous splits for cross-validation. Similarly, work from Huth and colleagues typically uses a completely held-out test stimulus; e.g., Antonello et al., 2021, 2023; Antonello & Huth, 2024. Toneva et al., 2022, cross-validate across separate scanner runs. All work from Hasson and colleagues (that I have been personally involved with) has used contiguous cross-validation folds: Goldstein et al., 2022, 2025; Hong, Wang et al., 2024; Kumar, Sumers et al., 2024; Zada et al., 2024, 2025b; and some studies, like Goldstein et al., 2024, put more stringent tests of generalization at the core of the work. This is of course a biased selection based on my own personal familiarity, but a spot-check of this kind suggests to me that a majority of published work does not directly fall victim to the methodological issue of shuffled cross-validation splits. I know it’s quite difficult to verify and list out all non-offenders in this way, but I think it’s incumbent on the authors to ensure that specific criticisms, e.g., about missteps in cross-validation, do not bleed over into studies that put in the effort to do things right.
1c. I would also suggest that the authors be very clear that failing to replicate certain results from Schrimpf et al., 2021, does not necessarily mean those results won’t hold up elsewhere. For example, the correlation between encoding performance and perplexity in Figure 4 has been found by Caucheteux & King, 2022, in both fMRI and MEG, using much larger samples (fMRI N = 100, MEG N = 95). Antonello & Huth, 2024, also replicate this result in a small-N dataset where each subject listened to 5 hours of naturalistic story stimuli (despite using this result to illustrate a different point). On the other hand, the sample sizes here are quite small (N = 10, N = 5, N = 5) and the stimuli are generally non-naturalistic and more simplistic / impoverished (at least in Pereira2018 and Fedorenko2016), which may undercut the power of LLMs. This suggests that certain results under consideration here are not invalid generally, just that they cannot be replicated using the specific datasets from Schrimpf et al., 2021 (and here) and the specific methodology used by Schrimpf et al., 2021. There are many possible reasons for this that do not entirely hinge on the methodological correctives discussed here; for example, these results may not replicate because the structure induced by the experimental design (e.g., stimulus on/off blocks, matched/isolated sentences, etc.) may overpower the structure of the actual language stimuli; or the language stimuli may not be sufficiently diverse or engaging or meaningful to subjects.
1d. By “specific methodology”, the first thing that comes to mind is that Schrimpf et al., 2021, use a very restrictive set of voxels based on individual-specific functional localizer tasks. This analytic choice is imported directly into the current work and otherwise unexplored. These tasks simply contrast actual sentences versus nonword sentences, and will tend to focus on regions that have uniformly high activation for the actual sentences (greater than the nonword sentences). On the other hand, encoding performance in an analysis of this kind will be driven by reliable variance across words/sentences with different meanings (the model-based predictions are mapped onto up-and-down fluctuations in the signal across TRs / words / sentences; if the signal is uniformly high, it would result in poor encoding performance). Other research groups have compiled a large body of results indicating that different components of natural language comprehension extend beyond these areas (e.g., Huth et al., 2016; Nastase et al., 2021), and many voxels outside these regions exhibit strong encoding performance with LLMs (e.g., Caucheteux et al., 2022; Kumar, Sumers et al., 2024). I don’t know if this is true, but it’s certainly possible that this particular way of defining language regions yields a set of voxels that are unusually well modeled by OASM- or PWR-style models. My point here is that, as written, the current set of non-replication results are limited to this method for selecting voxels of interest. If the authors want to make broader claims, they would have to reproduce these null results across a wider set of voxels.
2. My second high-level comment (related to comment 1c): One of the sources of my confusion when reading this is that I hadn’t realized quite how (let’s say) “weird” the datasets used by Schrimpf et al., 2021, were. As someone who primarily uses longer-form, naturalistic stories/narratives as stimuli, I found myself wondering how many of these methodological and interpretational gotchas are partly symptoms of using these oddball datasets. For example, the authors say these results “depend somewhat unpredictably on the choice of dataset, models, and methods.” This is patently true of many results, but not all datasets are created equal—the cases where some of these results do hold up come from much larger, more diverse datasets (see comment 1c). If the authors’ goal is to show that this result does not hold up in the oddball Schrimpf et al., 2021, datasets alone, then that’s fine, and they should be very explicit about that. If the authors want to claim that this result does not hold up in general, they would need to reproduce the null result on much larger, more diverse, more naturalistic datasets used by other groups; for example, LeBel et al., 2023, Nastase et al., 2021, Zada et al., 2025a, or Li et al., 2021. At the very least, the authors should discuss the limitations of the datasets used by Schrimpf et al., 2021, versus larger, more naturalistic datasets.
3. Related to the previous comment, I think the authors should unpack the structure of the datasets in a bit more detail, as this will impact the interpretation of their findings. For example, I stumbled when I got to line 58: “participants read short passages presented one sentence at a time, and a single fMRI volume (TR) was acquired after presentation of each sentence.” That’s strange. I understand that during a reading task, the timing of fixations will be variable / idiosyncratic. Does this mean participants read the sentence, pressed the button to indicate they were finished, and the data acquired after the button press were used for analysis? The scanner is presumably acquiring TRs continuously across an entire run, so I assume “a single fMRI volume (TR) was acquired” actually means that all other TRs acquired during sentence reading and the intervening gap were discarded for analysis. What is going into the context window when extracting embeddings for a dataset like this? In a paper like this, I don’t think it’s sufficient to simply redirect the reader to the original papers for further detail.
4. Following on the previous comment, I’m curious exactly how these datasets were preprocessed and whether the particularities of the preprocessing may impact the results observed here. For example, if linear drifts were not factored out of the fMRI time series, this would certainly contribute to the success of the OASM model. If the on/off structure of stimulus blocks was not accounted for in some way, this could both (a) hurt LLM encoding performance (due to large stimulus on/off signal fluctuations that are not related to the linguistic content), or (b) enhance the performance of the PWR model (e.g., if the signal ramps up at the beginning of a stimulus block, then tapers off over the course of the sentence). For example, in Zada et al., 2025b, we explicitly include low-level confounds as well as task-level on/off block regressors to rule out these factors.
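To illustrate what I have in mind (a hedged sketch with simulated data and hypothetical variable names, not a claim about how these datasets were actually preprocessed), one can either regress confounds like linear drift and stimulus on/off blocks out of the signal beforehand, or include them as additional columns in the design so that the variance they explain is not attributed to the embeddings:

```python
# Minimal sketch of handling nuisance structure (linear drift, stimulus on/off
# blocks) alongside model features; all variable names and sizes are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

n_trs, n_features = 500, 128
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((n_trs, n_features))  # model features of interest
bold = rng.standard_normal((n_trs, 1))                  # one voxel's time series

# Nuisance regressors: a linear drift term and a boxcar for stimulus-on TRs.
drift = np.linspace(-1, 1, n_trs)[:, None]
stim_on = np.zeros((n_trs, 1))
stim_on[100:200] = 1.0  # hypothetical stimulus block
confounds = np.hstack([drift, stim_on])

# Option A: regress confounds out of the BOLD signal before the encoding analysis.
beta, *_ = np.linalg.lstsq(confounds, bold, rcond=None)
bold_cleaned = bold - confounds @ beta

# Option B: include confounds as additional columns in the design matrix.
design = np.hstack([embeddings, confounds])
model = Ridge(alpha=1.0).fit(design, bold.ravel())
```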
5. It should be explicitly noted that, at least since Huth et al., 2016, many fMRI studies in this vein have included low-level confound regressors like word rate. For example, Huth et al., 2016, used word rate, phoneme rate, and phonemes alongside their semantic model of interest; similarly, in Kumar, Sumers et al., 2024, we include phonemes, phoneme rate, word rate, and a silence indicator in a separate nuisance band using banded ridge regression (linguistic features of interest will be correlated with these low-level confounds in most cases of natural language).
6. At lines 555 and 608, you say “For R-squared, we clip results to be at least 0 at the voxel / electrode / fROI level to prevent noisy values from biasing the mean downwards.” Is there any precedent for doing this fairly aggressive data imputation? In my experience, out-of-sample R-squared values in this kind of analysis can be very low, with many going strongly negative (i.e., providing poorer predictions than the mean on the test set). These negative values can be a useful diagnostic for whether the modeling pipeline (cross-validation, normalization, hyperparameter selection, etc.) is working properly. I’m curious about what proportion of voxels end up getting clipped in these analyses… For example, if you only have 243 voxels to begin with (Pereira2018, Experiment 3) and half of them disappear due to negative R-squared values, that would give me major pause when interpreting the current results.
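As a simple diagnostic (a sketch with placeholder values, not the authors’ actual numbers), it would be easy to report the proportion of negative R-squared values and how much clipping shifts the mean:

```python
# Minimal sketch: report how many voxels would be clipped before averaging
# out-of-sample R-squared. `r2_per_voxel` is a hypothetical array of per-voxel
# cross-validated R-squared values (placeholder values only).
import numpy as np

rng = np.random.default_rng(0)
r2_per_voxel = rng.normal(loc=0.02, scale=0.05, size=243)

prop_negative = np.mean(r2_per_voxel < 0)
print(f"{prop_negative:.1%} of voxels have negative out-of-sample R^2")

# Clipping at zero (as described at lines 555/608) shifts the mean upward:
print("mean R^2 (raw):    ", r2_per_voxel.mean())
print("mean R^2 (clipped):", np.clip(r2_per_voxel, 0, None).mean())
```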
7. I was a bit confused about the statistical tests used for model comparison. For example, at line 126, you say: “Third, because Ω is defined at the participant level, we use a one-sided paired t-test across square error values to determine the percentage of voxels / electrodes / fROIs where OASM+GPT2XL explained significantly more neural variance than OASM alone. We applied FDR correction for each participant separately for this analysis.” What are the samples going into this t-test if you’re doing the test within each participant? (i.e., what are the replicates determining the degrees of freedom in the test?) If I’m reading the Methods correctly, starting at line 617, you’re treating voxels (e.g., N = 384) within the language ROIs as samples? This is fairly nonstandard, and I’m not really sure what inferential interpretation it has. I’m also not sure how that relates to the term “percentage” in your description of the test. Alternatively, are you supplying sentences or test folds as the samples in the statistical test? A different approach to performing a within-subject statistical test here would be to randomize the mapping between the time series and the embeddings somehow; e.g., phase randomization, random circular rotation of the time series, or block-wise permutation (as in LeBel et al., 2023). When you say “FDR correction for each participant separately”, FDR correction across what multiplicity of tests exactly? Across different model comparisons? I don’t necessarily think any of this is wrong; but please clarify these points in both the Results and the Methods.
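For concreteness, here is a minimal sketch of the circular-rotation alternative mentioned above (simulated data, hypothetical variable names): the neural time series is randomly rotated to build a null distribution that preserves autocorrelation while breaking its alignment with the model predictions.

```python
# Minimal sketch of a within-subject permutation test via random circular
# rotation of the neural time series; `y_pred` and `y_true` are hypothetical
# held-out predictions and observed data for one voxel.
import numpy as np

rng = np.random.default_rng(0)
n_trs = 400
y_true = rng.standard_normal(n_trs)   # observed voxel time series (test set)
y_pred = rng.standard_normal(n_trs)   # model predictions on the test set

observed = np.corrcoef(y_pred, y_true)[0, 1]

n_perms = 1000
null = np.empty(n_perms)
for i in range(n_perms):
    shift = rng.integers(1, n_trs)    # random circular shift
    # Rotation preserves the autocorrelation structure of the neural data
    # while breaking its alignment with the predictions.
    null[i] = np.corrcoef(y_pred, np.roll(y_true, shift))[0, 1]

p_value = (np.sum(null >= observed) + 1) / (n_perms + 1)
print(f"p = {p_value:.3f}")
```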
8. At line 92, you indicate that Schrimpf et al., 2021, used ordinary least squares (OLS) to map GPT-2 XL embeddings onto the brain activity. Is this correct? I don’t understand how this worked. Aren’t the GPT-2 XL embeddings 1,600 features wide, with many fewer samples (TRs) in the fMRI data? (i.e., 384 and 243 samples in Pereira2018, 416 samples in Fedorenko2016, 1,317 samples in Blank2014…) Wouldn’t such a regression be underdetermined for OLS? Were Schrimpf et al. doing PCA on the embeddings prior to regression? Or some pseudo-inverse trick? My understanding is that the authors somehow replicated the same analysis using OLS (lines 92 and 533). How? In any case, I agree that the authors should focus on ridge for the core results, as this has become the standard in the field.
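For what it’s worth, the “pseudo-inverse trick” I’m speculating about would look something like the following sketch (simulated data, illustrative dimensions): with more features than samples, np.linalg.lstsq returns a minimum-norm solution that fits the training data essentially perfectly, which is exactly why unregularized OLS is problematic in this regime.

```python
# Minimal sketch of why OLS is underdetermined when features outnumber samples:
# np.linalg.lstsq returns the minimum-norm (pseudo-inverse) solution, which
# interpolates the training data. Dimensions are illustrative (e.g., 243
# samples vs. 1,600 embedding features).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 243, 1600
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimum-norm least squares
train_resid = np.linalg.norm(X @ beta - y)
print(f"training residual: {train_resid:.2e}")   # effectively zero
```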
9. At line 542, you say “We z-score all predictors across samples before training regressions.” I’m not demanding the authors re-run their models, but this is not best practice. Z-scoring the entire dataset prior to cross-validation will introduce leakage between training and test sets. For a paper like this that hinges on methodological rigor, ideally you should z-score within the training set of each cross-validation fold; e.g., use scikit-learn’s StandardScaler in a Pipeline to .fit_transform() each training set, then .transform() the corresponding test set using the mean and standard deviation of the training set.
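Something along these lines (a sketch with simulated data and an illustrative estimator, not a prescription for the authors’ exact pipeline):

```python
# Minimal sketch of fold-wise standardization to avoid train/test leakage,
# using scikit-learn's Pipeline; data and estimator are illustrative.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 128))   # e.g., embeddings
y = rng.standard_normal(500)          # e.g., one voxel's time series

# The scaler is fit on each training fold only, then applied to the test fold.
pipeline = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-2, 5, 8)))
cv = KFold(n_splits=5, shuffle=False)  # temporally contiguous folds
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")
print(scores)
```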
10. How are you accounting for hemodynamic delays in the fMRI data? Note that much current work uses a finite impulse-response model in which lagged duplicates of the model features (embeddings) are horizontally stacked, following Huth et al., 2016. This is a more sensitive approach, as it allows the model to estimate a linear combination of lags for each voxel / electrode in each subject.
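A minimal sketch of what I mean (simulated data; the 1–4 TR lags are illustrative):

```python
# Minimal sketch of a finite impulse-response (FIR) design: horizontally stack
# lagged copies of the features so the regression can estimate a per-voxel
# combination of hemodynamic delays.
import numpy as np

def make_lagged(features, lags=(1, 2, 3, 4)):
    """Stack delayed copies of `features` (n_trs x n_features) along columns."""
    n_trs, n_features = features.shape
    lagged = np.zeros((n_trs, n_features * len(lags)))
    for i, lag in enumerate(lags):
        lagged[lag:, i * n_features:(i + 1) * n_features] = features[:n_trs - lag]
    return lagged

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((500, 128))
X_fir = make_lagged(embeddings)   # shape: (500, 128 * 4)
```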
11. Do the authors have an intuition as to whether this problem of autocorrelation when using shuffled splits is worse for fMRI data? My intuition is that fMRI data is by nature much more temporally autocorrelated than ECoG data. For example, I agree that it’s not ideal for Mischler et al., 2024, to use shuffled splits, but with the high temporal resolution and signal properties of ECoG, it might not be as egregious as for fMRI.
12. The discussion of using shuffled splits (which, I agree with the authors, should basically never be used in the context of autocorrelated time series like fMRI data or natural language) could be developed into a broader point about generalization in model performance. Using cross-validation with temporally contiguous splits is certainly a good baseline recommendation, but it may be worth indicating to readers that other, even more stringent generalization tests can be theoretically interesting. For example, evaluating model generalization to entirely separate, fully held-out stimuli (e.g., Huth et al., 2016); evaluating model generalization across strictly non-overlapping sets of words (e.g., Goldstein et al., 2024); ensuring that context windows are reset at cross-validation folds (e.g., de Varda et al., 2025); and evaluating model generalization across subjects (e.g., Zada et al., 2024).
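For example, holding out entire stimuli can be implemented with grouped cross-validation (a sketch with simulated data and a hypothetical per-TR stimulus label):

```python
# Minimal sketch of a more stringent generalization test: hold out entire
# stimuli (e.g., stories) rather than time points, using GroupKFold with a
# hypothetical per-TR stimulus label.
import numpy as np
from sklearn.model_selection import GroupKFold

n_trs = 600
stimulus_id = np.repeat(np.arange(6), 100)   # e.g., 6 stories, 100 TRs each
rng = np.random.default_rng(0)
X = rng.standard_normal((n_trs, 128))
y = rng.standard_normal(n_trs)

cv = GroupKFold(n_splits=6)
for train_idx, test_idx in cv.split(X, y, groups=stimulus_id):
    held_out = np.unique(stimulus_id[test_idx])
    # Fit the encoding model on the training stories and evaluate on the
    # fully held-out story; any LLM context window should also be reset here.
    print(f"held-out stimulus: {held_out}")
```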
Minor comments:
Line 238: Can you summarize the paragraph starting at line 238 with a take-home message? Is the point here that this result is conceptually congruent with the paper by Kauf et al., 2024?
Line 343: “On the other hand, ?, Hong et al. [2024]”—missing citation (probably Antonello & Huth, 2023?)
Line 375: “contextual syntactic representation”—this doesn’t seem like very standard terminology to me (e.g., many people who care a lot about syntax work on “context-free grammars”; might be worth unpacking a bit more).
In section 4.2 of the Methods, can you add a little more detail about GPT-2 XL? The main thing that occurred to me here is that I couldn’t find the embedding dimensionality for XL… If I recall correctly, it’s 1,600?
Line 439: “L2 penalty [Dupré la Tour et al., 2022] ??.” —missing citation?
Line 451: “ane expanded” > “an expanded”
Line 630: “We note that squared error values from a model are correlated, which means that the t-test is biased towards positive results which show that an LLM contributes significant variance over a set of simpler models.” I didn’t follow this sentence. You mean squared error values will be correlated across cross-validation folds? Or squared error values will be correlated with model complexity?
Line 745: Hawinkel reference is missing the date.
I include many references below because I think it’s critical for the authors to be comprehensive about who’s done what in a piece like this.
References:
Antonello, R., Turek, J. S., Vo, V., & Huth, A. (2021). Low-dimensional structure in the space of language representations is reflected in brain responses. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, & J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems: Vol. 34 (pp. 8332–8344). Curran Associates, Inc. link
Antonello, R., Vaidya, A., & Huth, A. (2023). Scaling laws for language encoding models in fMRI. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems: Vol. 36 (pp. 21895–21907). Curran Associates, Inc. link
Antonello, R., & Huth, A. (2024). Predictive coding or just feature discovery? An alternative account of why language models fit brain data. Neurobiology of Language, 5(1), 64–79. DOI
Caucheteux, C., Gramfort, A., & King, J.-R. (2021a). Disentangling syntax and semantics in the brain with deep networks. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning: Vol. 139 (pp. 1336–1348). PMLR. link
Caucheteux, C., Gramfort, A., & King, J.-R. (2021b). Model-based analysis of brain activity reveals the hierarchy of language in 305 subjects. In M.-F. Moens, X. Huang, L. Specia, & S. W.-T. Yih (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 3635–3644). Association for Computational Linguistics. DOI
Caucheteux, C., Gramfort, A., & King, J.-R. (2022). Deep language algorithms predict semantic comprehension from brain activity. Scientific Reports, 12, 16327. DOI
Caucheteux, C., Gramfort, A., & King, J.-R. (2023). Evidence of a predictive coding hierarchy in the human brain listening to speech. Nature Human Behaviour, 7(3), 430–441. DOI
Caucheteux, C., & King, J.-R. (2022). Brains and algorithms partially converge in natural language processing. Communications Biology, 5, 134. DOI
de Varda, A. G., Malik-Moraleda, S., Tuckute, G., & Fedorenko, E. (2025). Multilingual computational models reveal shared brain responses to 21 languages. bioRxiv. DOI
Eklund, A., Nichols, T. E., & Knutsson, H. (2016). Cluster failure: why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, 113(28), 7900–7905. DOI
Goldstein, A., Grinstein-Dabush, A., Schain, M., Wang, H., Hong, Z., Aubrey, B., Nastase, S. A., Zada, Z., Ham, E., Hong, Z., Feder, A., Gazula, H., Buchnik, E., Doyle, W., Devore, S., Dugan, P., Reichart, R., Friedman, D., Brenner, M., Hassidim, A., Devinsky, O., Flinker, A., & Hasson, U. (2024). Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns. Nature Communications, 15, 2768. DOI
Goldstein, A., Wang, H., Niekerken, L., Zada, Z., Aubrey, B., Sheffer, T., Nastase, S. A., Gazula, H., Schain, M., Singh, A., Rao, A., Choe, G., Kim, C., Doyle, W., Friedman, D., Devore, S., Dugan, P., Hassidim, A., Brenner, M., Matias, Y., Devinsky, O., Flinker, A., & Hasson, U. (2025). A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations. Nature Human Behaviour, 9, 1041–1055. DOI
Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey, B., Nastase, S. A., Feder, A., Emanual D., Cohen, A., Jensen, A., Gazula, H., Choe, G., Rao, A., Kim, C., Casto, C., Lora, F., Flinker, A., Devore, S., Doyle, W., Dugan, P., Friedman, D., Hassidim, A., Brenner, M., Matias, Y., Norman, K. A., Devinsky, O., & Hasson, U. (2022). Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25, 369–380. DOI
Hong, Z., Wang, K., Zada, Z., Gazula, H., Turner, D., Aubrey, B., Niekerken, L., Doyle, W., Devore, S., Dugan, P., Friedman, D., Devinsky, O., Flinker, A., Hasson, U., Nastase, S. A., & Goldstein, A. (2024). Scale matters: large language models with billions (rather than millions) of parameters better match neural representations of natural language. eLife, 13, RP101204. DOI
Huth, A. G., De Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453–458. DOI
Kauf, C., Tuckute, G., Levy, R., Andreas, J., & Fedorenko, E. (2024). Lexical-semantic content, not syntactic structure, is the main contributor to ANN-brain similarity of fMRI responses in the language network. Neurobiology of Language, 5(1), 7–42. DOI
Kumar, S., Sumers, T. R., Yamakoshi, T., Goldstein, A., Hasson, U., Norman, K. A., Griffiths, T. L., Hawkins, R. D., & Nastase, S. A. (2024). Shared functional specialization in transformer-based language models and the human brain. Nature Communications, 15, 5523. DOI
LeBel, A., Wagner, L., Jain, S., Adhikari-Desai, A., Gupta, B., Morgenthal, A., Tang, J., Xu, L., & Huth, A. G. (2023). A natural language fMRI dataset for voxelwise encoding models. Scientific Data, 10, 555. DOI
Li, J., Bhattasali, S., Zhang, S., Franzluebbers, B., Luh, W. M., Spreng, R. N., Brennan, J. R., Yang, Y., Pallier, C., & Hale, J. (2021). Le Petit Prince: a multilingual fMRI corpus using ecological stimuli. bioRxiv. DOI
Mischler, G., Li, Y. A., Bickel, S., Mehta, A. D., & Mesgarani, N. (2024). Contextual feature extraction hierarchies converge in large language models and the brain. Nature Machine Intelligence, 6, 1467–1477. DOI
Nastase, S. A., Liu, Y.-F., Hillman, H., Zadbood, A., Hasenfratz, L., Keshavarzian, N., Chen, J., Honey, C. J., Yeshurun, Y., Regev, M., Nguyen, M., Chang, C. H. C., Baldassano, C., Lositsky, O., Simony, E., Chow, M. A., Leong, Y. C., Brooks, P. P., Micciche, E., Choe, G., Goldstein, A., Vanderwal, T., Halchenko, Y. O., Norman, K. A., & Hasson, U. (2021). The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehension. Scientific Data, 8, 250. DOI
Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N., … & Fedorenko, E. (2021). The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences, 118(45), e2105646118. DOI
Toneva, M., Mitchell, T. M., & Wehbe, L. (2022). Combining computational controls with natural text reveals aspects of meaning composition. Nature Computational Science, 2(11), 745–757. DOI
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290. DOI
Zada, Z., Goldstein, A. Y., Michelmann, S., Simony, E., Price, A., Hasenfratz, L., Barham, E., Zadbood, A., Doyle, W., Friedman, D., Dugan, P., Melloni, L., Devore, S., Flinker, A., Devinsky, O., Nastase, S. A., & Hasson, U. (2024). A shared model-based linguistic space for transmitting our thoughts from brain to brain in natural conversations. Neuron, 112(18), 3211–3222. DOI
Zada, Z., Nastase, S. A., Aubrey, B., Jalon, I., Michelmann, S., Wang, H., Hasenfratz, L., Doyle, W., Friedman, D., Dugan, P., Melloni, L., Devore, S., Flinker, A., Devinsky, O., Goldstein, A., & Hasson, U. (2025a). The “Podcast” ECoG dataset for modeling neural activity during natural language comprehension. bioRxiv. DOI
Zada, Z., Nastase, S. A., Speer, S., Mwilambwe-Tshilobo, L., Tsoi, L., Burns, S., Falk, E., Hasson, U., & Tamir, D. (2025b). Linguistic coupling between neural systems for speech production and comprehension during real-time dyadic conversations. bioRxiv. DOI