Open reviews

April 21, 2026 — Open review of Singh, Antonello and colleagues:

Singh, C.*, Antonello, R. J.*, Guo, S., Mischler, G., Gao, J., Mesgarani, N., & Huth, A. G. (2025). Evaluating scientific theories as predictive models in language neuroscience. bioRxiv. DOI


Singh, Antonello and colleagues use AI to answer a series of natural-language questions time-point-by-time-point across a wide range of naturalistic spoken narrative stimuli used in an fMRI experiment. They show that a larger question-answering (QA) model can be boiled down to a smaller 35-question (QA-35) model that predicts fMRI activity surprisingly well, on par with higher-dimensional embeddings from a large language model (LLM). They use the encoding weights learned for this QA-35 model to create interpretable brain maps for each question, which roughly correspond to contrast maps derived across a number of targeted experimental localization paradigms. Crowdsourced expert opinions on the importance of different features only roughly correspond with the prediction performance for matching questions. To the authors’ credit, they also test their model against an open ECoG dataset, and find that their QA model does not hold up as well as it does in fMRI. Overall, I like this manuscript, even if I find the framing a bit frustrating. The work is solid and interesting, and I generally believe the results; I hope to see it published soon. In the first few comments below, I try to articulate my frustration with the framing/tone; I hope the authors (and anyone else!) reading these comments take them with a grain of salt!—as they’re pretty subjective and I’m playing devil’s advocate to some extent. In the remaining comments, I lay out a few questions that I hope the authors can use to strengthen and clarify their claims. Sorry for the long-winded review!

Major comments:

1a. Throughout the manuscript, the success of the QA-35 model is framed as a cautionary tale about the use of more complex, less-easily-interpretable LLM embeddings. In my opinion, this wasn’t super convincing and comes across as a bit regressive. To my reading, the success of the QA-35 model serves just as much as a lesson about the experimental localizationist approach as it does about more recent efforts to leverage LLMs against the brain. This approach feels like the logical culmination of an explanatory program where the researcher points at each brain area and says “this brain area does X”, where X is some idea like “time” or “numbers” or “actions” or “people” that comes intuitively to the human experimenter. Historically, this localizationist approach was grounded in experimental paradigms motivated by all sorts of theories of cortical functional organization (e.g., Kanwisher, 2010; Martin, 2016). The QA-35 model effectively boils down all of these theories into a battery of simple yes-or-no questions: similar selectivity can be distilled by annotating the presence of intuitive features as they occur in naturalistic spoken narratives. In my opinion, this work does a great job of deflating that sprawling experimental program, nearly to the point of absurdity. In some sense, this feels like the ultimate “word model” as Kendrick Kay (2018) calls it: pick your favorite intuitive idea (e.g. “a journey”), formulate it into a question to pose at each moment of the stimulus (e.g. “does this part of the stimulus describe a journey?”), and with enough natural-language fMRI data, voila!—here are the “journey” parts of the brain. The way the authors’ approach implicitly trivializes the localizationist program is an interesting contribution of the manuscript (whether that was the authors’ intent or not).

1b. I guess the deeper question is: Is this the kind of explanation we should seek? The methodology developed in this manuscript puts a premium on human-interpretable, natural-language descriptions of brain activity. Does a map of positive and negative “journey” voxels help us understand the brain? The authors correctly characterize their set of questions as one particular basis set that predicts brain activity well in aggregate. This set of features is surely privileged from the perspective of the human experimenter, as each feature is expressed as a simple question in natural language. I’m not sure this implies that these are privileged bases for the brain. Surely some voxels will track coarsely with the presence of a “journey”, but I’m not sure the word “journey” helps us understand why that happens or what these voxels are doing. I find myself asking: What could this methodology possibly tell us about early visual areas during audio-only story-listening? Visual areas seem to have positive weight for questions 9 (sensation/feeling), 12 (texture/sensation), 23 (reflection/introspection)—what does this even mean? As the authors are well aware, rotating the basis would yield the same prediction performance, but arbitrarily change the weights (and their interpretation). In other words, this particular set of questions is just one view onto the underlying features encoded at each voxel, one that is particularly convenient to the experimenter seeking to interpret the underlying features in terms of simple yes-or-no questions.
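To make the rotation point concrete, here is a minimal numpy sketch (toy data, hypothetical dimensions) showing that an orthogonal rotation of the feature basis leaves ridge predictions untouched while completely changing the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 35
X = rng.standard_normal((n, d))   # toy stand-in for the QA-35 design matrix
y = rng.standard_normal(n)        # toy stand-in for one voxel's time course
lam = 1.0

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation

w = ridge(X, y, lam)
w_rot = ridge(X @ Q, y, lam)

print(np.allclose(X @ w, X @ Q @ w_rot))  # True: predictions are identical
print(np.allclose(w, Q @ w_rot))          # True: same solution, different basis
```

Any such rotated basis predicts exactly as well as the 35 questions; the questions are privileged only in that a human can read them.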

1c. My main gripe with the overall framing of the paper is that the authors interpret their approach as undermining the utility of LLM embeddings for modeling human brain activity. If a simple list of 35 yes-or-no questions can predict brain activity just as well as 1000+-dimensional LLM embeddings, why bother with the LLM embeddings? (Maybe I’m overreading here!) Sure, the embeddings derived from an LLM only correlate with latent features encoded at each voxel, but this is equally true of the 35 questions (and historical experimental manipulations). The authors frequently describe LLM embeddings as “black-box” models, which frankly I think plays into a silly stereotype. LLM embeddings are complex and not-so-intuitive to interpret, sure, but they’re also not really black boxes: they are fully transparent in the sense that they’re just a bunch of numbers; we can open them up and inspect every weight and every unit activation for every token; we can causally manipulate them in ways we can’t manipulate a human brain; we can project words into the embedding space and observe their geometry (similarly to the authors’ prior work; Huth et al., 2016) and we can use them to simulate brain responses to novel stimuli (as the authors have previously argued; Jain et al., 2024). They are much less of black boxes than the human minds that formulated these 35 questions. There are rapidly advancing subfields developing around efforts to better understand and manipulate these models. The risk of disparaging these models as black boxes is that it stymies good-faith work aiming to use this family of models responsibly.

1d. As mentioned in the previous comment, the manuscript seems to value a model based on its folk-intuitive, natural-language interpretability. While this is certainly not a strength of LLMs, LLM embeddings have other qualities that may be theoretically desirable: they are expressed numerically instead of in words; they are learned directly from the structure of language; they play a mechanistic role in an explicit model that processes and produces natural language; they are encoded across simple neuron-like units in a way that was historically inspired by neural population codes, and so on. I think I would fully agree with the authors that it’s difficult to assess/interpret how these qualities of the model relate to brain activity in terms of prediction performance alone (e.g., Antonello & Huth, 2024). But there’s plenty of good work aiming to adjudicate questions about these qualities of LLMs using clever model comparisons in both vision and language. My point in saying all this is that I think the authors can present the insights of their work without framing it in such a way as to discourage other avenues of research. I understand that this is just my subjective (i.e. biased) reading of the manuscript, so I’ll leave it up to the authors to decide how to proceed.

2a. There are a few analytic choices that I worry could slightly favor a lower-dimensional model like QA-35; I unpack each of these in the following handful of comments. I want to emphasize that I don’t think any of these issues will dramatically impact the overall results, nor will they undermine the interpretational utility of a model like QA-35. I would be totally comfortable with the authors making the claim that “QA-35 performs on par with LLM embeddings, but is more interpretable” given the present results, but it feels like they want to claim something closer to “QA-35 performs better than LLM embeddings, and this is because it’s interpretable,” which feels shakier to me for the following reasons. First, in the main text, the authors compare models in terms of their average test performance across voxels and subjects. Does the method for aggregating across voxels (averaging in this case) bias the model comparison at all? When I was looking at the aggregate results in Figure 1f, I really wanted to see the whole-brain model comparison maps reported in Figure A6. Certain voxels are better predicted by QA-35, others are better predicted by Llama. These whole-brain maps don’t tell a very clean story: Llama clearly performs better than QA-35 in S03; QA-35 has an advantage in S01 and S02, but the advantage disappears in S02 when all voxels are considered rather than just 100 PCs (see follow-up comment below). This inconsistency makes the aggregate model comparison results a bit difficult to interpret; I would suggest the authors not center the aggregate comparison if it doesn’t clearly replicate across subjects.

2b. The authors have a fairly unconventional approach here where they use PCA to reduce each subject’s ~100,000 voxels down to 100 PCs, then fit their encoding models to the PCA-reduced voxel data. I assume they use this step to expedite the model-fitting computation. First of all, we need to know how much variance in the original voxel-level activity these 100 PCs capture (e.g., is it 90% or 40%?). I understand that the authors project the model predictions back onto the original voxel space before evaluating the models in terms of voxelwise test correlations. I still worry that retaining only the first 100 PCs of voxel activity could reduce the diversity of voxel activity available to serve as a target when training encoding models. The test-set model predictions at each voxel can only be reconstructed from a linear combination of the model predictions on the first 100 PCs. Is it possible that reducing the diversity of voxel-level activity will tend to favor a lower-dimensional encoding model? I can’t think of any way to be sure other than refitting the models of interest (e.g., QA-35 and Llama) on voxel-level data (instead of on the PCs) and re-comparing aggregate performance.
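To make sure we’re on the same page, here is my understanding of the pipeline as a toy sketch (the sizes and the sklearn implementation are my assumptions, not the authors’ code), including the variance-retained number I’d like to see reported:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_tr, n_te, n_feat, n_vox = 1000, 200, 35, 5000   # hypothetical sizes
X_tr, X_te = rng.standard_normal((n_tr, n_feat)), rng.standard_normal((n_te, n_feat))
Y_tr, Y_te = rng.standard_normal((n_tr, n_vox)), rng.standard_normal((n_te, n_vox))

# reduce the voxels to 100 PCs and fit the encoding model on the PCs
pca = PCA(n_components=100).fit(Y_tr)
print("variance retained:", pca.explained_variance_ratio_.sum())  # 90%? 40%?

model = Ridge(alpha=1.0).fit(X_tr, pca.transform(Y_tr))
Y_pred = pca.inverse_transform(model.predict(X_te))  # back to voxel space

# voxelwise test correlations; note the predictions live in a 100-dimensional
# subspace of voxel space by construction
r = np.array([np.corrcoef(Y_pred[:, v], Y_te[:, v])[0, 1] for v in range(n_vox)])
```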

2c. The authors provide the LLMs with only a 10-word context window to ensure a “fair” comparison with the 10-word windows used in the QA model. I don’t think the LLMs will see a huge benefit from longer context windows, but this is clearly a handicap for the LLM embeddings, so I’m not sure I’d describe it as “fair”.

2d. Can you please compute and report the core model performance results for the first presentation of each of the test stimuli across the main models of interest? Averaging brain activity across 5 or 10 presentations of the same story to serve as the test dataset will tend to wash out processing having to do with novelty and sense-making, may introduce unusual processes having to do with memory or prediction (e.g., Michelmann et al., 2021), and will tend to emphasize any processes that are invariant to repetition, habituation, or boredom. Could using an averaged test dataset bias the results toward a simpler, less context-sensitive model—hard to say. I understand the performance values will tend to be reduced overall for a “noisier” test set, but this seems like a worthwhile tradeoff to capture the actual process of comprehension (instead of whatever weird stuff might be happening on the repetitions).

2e. If the authors want to make strong claims about one model having superior performance over the other, I’m surprised they didn’t jointly fit the QA-35 model alongside the LLM embeddings using banded ridge, then evaluate split correlations or unique variance explained. Even if the LLM embeddings won out in this scenario, it wouldn’t undermine the interpretational utility of the QA-35 model.
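Concretely, something like the following (a crude variance-partitioning sketch with a single shared ridge penalty; proper banded ridge would fit separate penalties per feature space, which matters when the bands differ this much in dimensionality):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, n_vox = 1200, 100
X_qa = rng.standard_normal((n, 35))      # stand-in for QA-35 features
X_llm = rng.standard_normal((n, 1024))   # stand-in for LLM embeddings
Y = rng.standard_normal((n, n_vox))
tr, te = slice(0, 1000), slice(1000, None)

def test_r2(X):
    m = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X[tr], Y[tr])
    resid = Y[te] - m.predict(X[te])
    return 1 - resid.var(axis=0) / Y[te].var(axis=0)

r2_joint = test_r2(np.hstack([X_qa, X_llm]))
unique_qa = r2_joint - test_r2(X_llm)    # variance only QA-35 accounts for
unique_llm = r2_joint - test_r2(X_qa)    # variance only the embeddings account for
```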

3. I don’t find it hugely surprising that a well-crafted set of semantic features performs well in predicting brain activity in aggregate. I’m curious whether a 35-dimensional model derived from the LLM embeddings using the same elastic net procedure would yield similar encoding performance. I understand that the features resulting from this analysis would not have the benefit of being interpretable like QA-35, but it’s possible that a reduced set of good features in fact generalizes slightly better than whatever ridge regression can distill from a much wider embedding, especially on datasets of limited training size, as on the left side of Figure 1f. To my mind, this is a useful control to contextualize the argument that QA-35 outperforms LLM embeddings in smaller training samples: if a 35-dimensional LLM model (derived using the same elastic net procedure on the full training set) yields the same performance as QA-35 in smaller training samples, then it’s just the dimensionality, not the interpretability, that matters.
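One way to build that control, sketched below (the row-sparse MultiTaskElasticNet and the cutoff at 35 dimensions are my guess at an analogous procedure, not the authors’ exact method):

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet, Ridge

rng = np.random.default_rng(0)
n, d_llm, n_vox = 1000, 1024, 200            # hypothetical sizes
X_llm = rng.standard_normal((n, d_llm))      # stand-in for LLM embeddings
Y = rng.standard_normal((n, n_vox))

# row-sparse elastic net: selects embedding dimensions jointly across voxels
enet = MultiTaskElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_llm, Y)
row_norms = np.linalg.norm(enet.coef_, axis=0)   # coef_ is (n_vox, d_llm)
top35 = np.argsort(row_norms)[-35:]              # keep the 35 strongest dimensions

# an "LLM-35" model: same dimensionality as QA-35, none of the interpretability
llm35 = Ridge(alpha=1.0).fit(X_llm[:, top35], Y)
```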

4. Following on the previous comment, am I correct in understanding that the 35 questions are selected from the full training set, then applied to smaller training sets shown in Figure 1f? Is this fair? I assume that applying the question selection procedure to a smaller training set would be more likely to yield a biased set of questions with poorer generalization performance to different datasets.

5. Reporting the results in terms of percentages feels a bit misleading for what are relatively low correlations and small absolute differences overall. Please report the two correlation values from which the percentages are derived when you describe the results in terms of percentages; e.g., (line 92) “For example, when models were trained with only 5 stories (roughly 75 minutes of data), the relative improvement over the best baseline was 43% (from r = X to r = Y).” In this particular example, it looks like the values are going from ~.015 to ~.025.
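To spell out why the raw values matter (the numbers below are my rough eyeball of Figure 1f, not taken from the paper):

```python
r_base, r_qa35 = 0.015, 0.025   # rough eyeballed correlations, not the authors' numbers
abs_gain = r_qa35 - r_base      # 0.01 in correlation units
rel_gain = abs_gain / r_base    # ~67%: a big-sounding percentage from a tiny gap
```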

6. When interpreting the ECoG results, the authors largely focus on the QA + Spectral model rather than the QA model alone (referring to QA + Spectral as the “interpretable” model…). Is this really fair? The QA-35 model alone performs pretty poorly on the ECoG data; it’s among the lowest performing models alongside syntactic features and lexical embeddings. The QA + Spectral model performs considerably better than QA-35 alone, but this suggests that Spectral is doing a large amount of work in the combined model. Note that the percentage by which Llama outperforms QA-35 (or Llama + Spectral outperforms QA + Spectral) for this relatively small stimulus set in ECoG is not reported in the same way it is for Figure 1f, but is likely fairly large…

7. At line 320, you say that “further work on the accuracy of auditory QA will likely substantially improve the performance of our method on modalities with higher temporal resolution like ECoG, which tend to be more phonological and acoustic.” I’m not sure this is a fair interpretation of the results, nor that ECoG is “more” phonological/acoustic. In fact, the authors’ results in Figure 4b contradict this interpretation, given that Llama (green line) and GPT (blue line) embeddings outperform the phonological/acoustic features in the ECoG data, and the Spectral features only add a small performance boost to Llama (red line). Rather, I suspect that fMRI is particularly insensitive to these features and that ECoG has higher SNR overall.

Minor comments:

Line 118: “whereas communication and speech information heavily predicts activations in inferior temporal (IT) lobe”—looking at the cortical maps, I don’t really see a lot of heat in IT for the communication questions… Would we expect to find communication and speech encoding in IT?

Line 257: I suppose the negatively correlated features between fMRI and ECoG could also be due to biased (under)sampling in the Podcast stimulus, as well as biased sampling of electrode locations across cortex.

Line 577: In the Baselines section of the Methods, please tell us the dimensionality of the LLM embeddings used in the main-text analyses.

Line 585: “Embeddings for baseline embeddings”—typo here?

Line 593: What statistical results does this description of a permutation test refer to? I’m having a hard time understanding how randomly selecting TRs from the training data results in a permutation test? Are you retraining the models at each permutation? What exactly is getting permuted and how does that result in a null distribution for a statistical test?
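For contrast, here is the kind of block permutation null I would normally expect for encoding-model test correlations, where contiguous chunks of the measured time course are shuffled against fixed model predictions (a sketch of my expectation, not the authors’ procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def block_permute(y, block_len=10):
    # shuffle contiguous blocks: breaks the stimulus pairing, keeps autocorrelation
    n = (len(y) // block_len) * block_len
    blocks = y[:n].reshape(-1, block_len)
    return blocks[rng.permutation(len(blocks))].ravel()

def perm_test(y_true, y_pred, n_perm=1000, block_len=10):
    n = (len(y_true) // block_len) * block_len
    y_true, y_pred = y_true[:n], y_pred[:n]
    observed = np.corrcoef(y_true, y_pred)[0, 1]
    null = np.array([np.corrcoef(block_permute(y_true, block_len), y_pred)[0, 1]
                     for _ in range(n_perm)])
    return observed, (null >= observed).mean()   # one-sided p-value
```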

References

Antonello, R., & Huth, A. (2024). Predictive coding or just feature discovery? An alternative account of why language models fit brain data. Neurobiology of Language, 5(1), 64–79. DOI

Huth, A. G., De Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453–458. DOI

Jain, S., & Huth, A. (2018). Incorporating context into language encoding models for fMRI. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 31 (pp. 6628–6637). Curran Associates, Inc. link

Kanwisher, N. (2010). Functional specificity in the human brain: a window into the functional architecture of the mind. Proceedings of the National Academy of Sciences of the United States of America, 107(25), 11163–11170. DOI

Kay, K. N. (2018). Principles for models of neural information processing. NeuroImage, 180, 101–109. DOI

Martin, A. (2016). GRAPES—Grounding representations in action, perception, and emotion systems: How object properties and categories are represented in the human brain. Psychonomic Bulletin & Review, 23(4), 979–990. DOI

Michelmann, S., Price, A. R., Aubrey, B., Strauss, C. K., Doyle, W. K., Friedman, D., … & Norman, K. A. (2021). Moment-by-moment tracking of naturalistic learning and its underlying hippocampo-cortical interactions. Nature Communications, 12, 5394. DOI