Open reviews

April 28, 2025 — Open review of de Varda and colleagues:

de Varda, A. G., Malik-Moraleda, S., Tuckute, G., & Fedorenko, E. (2025). Multilingual computational models reveal shared brain responses to 21 languages. bioRxiv. DOI


de Varda and colleagues show that the internal representations learned by multilingual large language models can accurately predict human neural activity across a number of different languages. In other words, an encoding model trained to predict brain activity in response to one language (using a multilingual language model's embeddings for that language) generalizes zero-shot to predicting brain activity in response to a different language (using embeddings for that second language). The manuscript is very thorough, with a large number of diverse languages (as well as many supporting analyses), and I found the results to be compelling overall. In general, I'm excited about this paper! I have a couple of questions/concerns about the processing pipeline and whether certain analytic choices will tend to bias the results one way or another.

Major comments:

My highest-level comment is that I felt there was a good deal of tension arising from the fact that testing the generalization of encoding models to different languages also requires testing that they generalize across individual subjects. This tension jumped out to me in two places (points 1 and 2 below):

1a. First, this generalization test demands a particular cross-validation structure for predicting left-out languages in left-out subjects. The details of this cross-validation process, particularly the subject-level structure, were not sufficiently clear in either the text or Figure 1B. Am I correct in understanding that the within-subject models are trained and tested within the same small number of subjects for a given language, whereas the across-language models are necessarily evaluated across subjects? Shouldn't the within-language performance also be evaluated across subjects to provide a fair comparison? For example, is it possible that the correlations with next-word prediction (and potentially cross-lingual embedding similarity) differ between conditions simply because one condition is evaluated within subjects and the other is evaluated across subjects?

1b. In the same way, the across-language models appear to be trained on N – 1 languages and tested on the left-out language. I couldn't understand how exactly you trained a single encoding model across languages. For example, on page 21, you say "for each individual language li ∈ L, we fitted an encoding model in all the remaining languages lj ∈ L \ {li}." Do you just concatenate the time series across languages (e.g., as in the sketch below)? Do you average somehow? In any case, this means that the training data for the across-language encoding models will be qualitatively different (e.g., much larger volume and more diverse) from the training data used to train the within-subject encoding models. I worry that differences between within- and across-language performance, e.g., in Figure 2B, C, and D, could be attributed to relatively uninteresting differences in the training/test structure of the cross-validation procedure. In sum, if the authors want to draw important conclusions about differences between within- and across-language encoding models (e.g., next-word prediction correlates with within- but not across-language performance), then the cross-validation structure should be more carefully matched.
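
For concreteness, here is one way I could imagine the across-language training working; this is purely a guess on my part, and the function and variable names are hypothetical, not from the paper: stack the design matrices and language-network time series from the N – 1 training languages and fit a single ridge model on the concatenation.

```python
# Hypothetical sketch of leave-one-language-out training by concatenation
# (my guess at the procedure, not taken from the paper).
import numpy as np
from sklearn.linear_model import Ridge

def fit_across_languages(X_by_lang, y_by_lang, held_out_lang, alpha=1.0):
    """X_by_lang: dict mapping language -> (n_TRs, n_features) embedding matrix;
    y_by_lang: dict mapping language -> (n_TRs,) language-network time series."""
    train_langs = [lang for lang in X_by_lang if lang != held_out_lang]
    X_train = np.vstack([X_by_lang[lang] for lang in train_langs])
    y_train = np.concatenate([y_by_lang[lang] for lang in train_langs])
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # Evaluate zero-shot on the held-out language:
    r = np.corrcoef(model.predict(X_by_lang[held_out_lang]),
                    y_by_lang[held_out_lang])[0, 1]
    return model, r
```

Even if this is roughly what was done, the concatenated training set will still be much larger and more heterogeneous than any within-language training set, which is the crux of my concern.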

2a. Testing encoding models across subjects presents a tricky problem because there is no obvious voxel-level functional correspondence across individual brains (especially for areas like the language network). The authors are committed to precisely defining fROIs in the language network within each subject using functional localizers (which I generally agree with!). To square their subsequent analyses with this commitment, however, they make two particular choices:

First, they functionally define each subject’s language network using an English-language localizer, then estimate and evaluate encoding models within this set of English-defined voxels. The authors provide some justification for this (page 17): “We used the English localizer to ensure comparability with previous studies and because it has been established that the localizer for a language works well as long as participants are fluent in that language.” In general, I don’t think this is a bad simplification step, and I suspect using a localizer in the language of interest instead would yield a similar set of fROIs. That said, I worry that using English-defined voxels across all subjects and languages—if it introduces a bias at all—will tend to bias the results toward something shared across languages. To put it naively, you’re using voxels that are more likely to be “multilingual” somehow or sit at the intersection between the language of interest and English—the language shared across all subjects. Do the authors have localizer data for the separate languages of interest? If so, it could be worth running a supplementary analysis in hopes of showing that the results turn out similarly using language-specific localizers.

2b. Second, as far as I can tell, they collapse data across voxels in each fROI, across fROIs in the language network, and across subjects into a single time series for modeling. One time series for the entire language network. I understand that the authors expect relatively homogeneous results across regions of the language network (and are testing a lot of different models here), but that’s a lot of averaging. I was hoping to see some brain maps or at least some ROI-level results, but perhaps they don’t turn out particularly interesting.

3a. I was a bit surprised by the "time series extraction" exclusion step of the analysis. My understanding is that the authors excluded subjects for whom the response time series was not positively correlated across the 2–3 participants listening to the same language of interest (Figure 4A, B). This is just an intersubject correlation (ISC) analysis, right? I understand that individual-subject fMRI data are noisy, but I'm pretty surprised that you would get such low ISCs for pairs of subjects listening to the same language; roughly a third are numerically negative, some strongly negative. Do the authors have any idea why this might be the case? Could this be a result of the fROI definition somehow? Or a timing issue? I'd be curious where in the brain (if anywhere) those subjects do have positive ISCs. I'm worried that readers will find it troubling that you're excluding two-thirds(?) of the original languages/subjects from Malik-Moraleda, Ayyash, et al., 2022.
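
Just to make sure I'm describing the same computation, here is a minimal sketch of what I mean by pairwise ISC (my reading of the exclusion criterion, with my own variable names, not the authors' code): correlate the language-network time series between each pair of participants who listened to the same materials in the same language.

```python
# Minimal pairwise ISC sketch (my reading of the exclusion criterion,
# not the authors' code).
import numpy as np
from itertools import combinations

def pairwise_isc(timeseries):
    """timeseries: (n_subjects, n_TRs) array for one language."""
    return np.array([np.corrcoef(timeseries[i], timeseries[j])[0, 1]
                     for i, j in combinations(range(len(timeseries)), 2)])

# e.g., flag a language for exclusion if any (or the mean?) pairwise r <= 0
```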

3b. Related to the previous point, am I correct in understanding that for the left-side bars in Supplementary Figure 8 (pages 46–47), you're correlating time series responses to different passages in different languages? This is a "sanity check" in the sense that the passages don't correspond across subjects, so there should be no meaningful ISC, right? Don't the different passages have different numbers of TRs? Do you trim them to a common length in order to compute the ISC? I guess I don't really understand how you're implementing this ISC-based sanity check.

4. Can you unpack the randomization baseline a bit more? First, how exactly are you randomizing the time series? I would recommend a random circular shift (or phase randomization) in order to retain the temporal autocorrelation of the signal; simply shuffling TRs will disrupt the temporal autocorrelation, confounding the comparison with the intact time series. Second, are you computing only a single randomization, as opposed to, e.g., building a null distribution from many randomizations?
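
For example, here is a minimal sketch of the kind of circular-shift null I have in mind, assuming a 1-D language-network time series in TRs; `score_fn` is a stand-in for whatever encoding-performance metric is being evaluated:

```python
# Circular-shift null distribution: each permutation rotates the time series
# by a random offset with wraparound, preserving its autocorrelation.
import numpy as np

def circular_shift_null(y, score_fn, n_perm=1000, min_shift=10, seed=0):
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for p in range(n_perm):
        shift = int(rng.integers(min_shift, len(y) - min_shift))
        null[p] = score_fn(np.roll(y, shift))  # score the shifted time series
    return null

# One-sided p-value against the observed score:
# p = (np.sum(null >= observed) + 1) / (n_perm + 1)
```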

5. In the Introduction (e.g., page 3), you set us up to expect results regarding (a) the correlation between brain score and next-word prediction, and (b) the superiority of autoregressive models over bidirectional models. I didn't personally find this angle on the results to be particularly compelling. Don't the results reported here still generally run up against the issues discussed by Antonello & Huth, 2024? In other words, there are a number of possible reasons (e.g., more training data) why the autoregressive models might end up better at predicting both brain data and the next word. Maybe there's a more convincing argument to be made given that next-word prediction doesn't appear to correlate with across-language performance, but cross-language embedding similarity does? (Although I'm still worried that this difference could be partly driven by differences in the training/test structure of the different cross-validation schemes.)

6. I had a bit of trouble understanding exactly how these cross-language generalization results square with the authors' fairly circumscribed functional role for the language network. My understanding from their previous work is that the language network carries out the core syntactic/semantic compositional processes for encoding and decoding language between peripheral perceptual/motor systems and higher-level knowledge (e.g., Fedorenko et al., 2024). I would expect that peripheral perceptual/motor components are quite different across languages, whereas the higher-level knowledge would be largely shared across languages—but this makes me wonder where exactly the language network (as defined) would sit. Presumably some syntactic/compositional properties are shared across languages, but many are not? Should we expect even stronger generalization for areas concerned with higher-level meaning? One possibility is that the multilingual LMs are essentially discarding any language-specific structure by the intermediate layers to arrive at a more general representation of meaning. Should we be surprised that this supra-modal representation of meaning partly captures activity in the language network? I'm honestly not sure this is a very substantive issue, but just wanted to convey that it didn't fully crystallize for me as a reader.

Minor comments:

Love the examples in the first paragraph!

Abstract: “We found evidence of cross-lingual robustness in the alignment between language representations in artificial and biological neural networks.” —This sentence was a bit difficult to parse.

Page 3: “we investigated our core question: whether linguistic representations are similar across languages” —You might consider re-wording this core question; as stated, it was unclear whether this really relates to the brain, or just language models.

Figure 1A: Are the activity vectors coming out of the brain in Fig. 1A supposed to be time series for individual ROIs? You might add a label like “TRs” (to match the predictor matrix below) so people don’t interpret them as patterns of activity across voxels at a single time point.

Figure 1B: Readers might benefit from a few more labels in this cross-validation schematic. It wasn't immediately clear to me that the x-axis along these vectors is time/TRs. Similarly, I had a hard time following the red and blue segments in the background. For example, in the top "within" schematics, it wasn't clear to me why the red test segments cycle through positions 1, 2, etc., in the background for the brain data (top), but stay on the 10th segment for the model (bottom).

Page 5: “fitted” → “fit” (and elsewhere) (“fit” when used as a verb, “fitted” when used as an adjective, right?)

Figure 2C: It wasn’t clear what corresponds to the “within” and “across” models in the small subplots for each model. I would label this directly so people don’t have to search through the figure caption to get the answer.

Page 11: “in light with” → “in line with”

Page 16: “normalization (estimated for the mean image using trilinear interpolation)” —What preprocessing step does this refer to? Spatial normalization to the MNI template? (linear? nonlinear? what algorithm?) Or some other kind of normalization?

Page 17: “we shifted the time-series by 6 seconds” —Shifted the time series relative to what? I assume you mean relative to the stimulus/embeddings for the encoding analysis. I was a little confused because in the following sentence you mention the ISC analysis, which should not require any time-shifting.

Page 20: "To get a sense of the similarity in the embedding representations across languages, we calculated the average cosine similarity between the contextualized embeddings corresponding to all word pairs across languages" —I had a hard time understanding what this sentence is telling us. Is this the full description of how the cross-lingual embedding similarity metrics (cosine similarity, reciprocal rank, etc.) were computed?

Page 21: The description of the regression model in “Brain encoding models” is very clear—great! One note: using the default leave-one-TR-out inner cross-validation in scikit-learn’s RidgeCV might not yield an optimal regularization parameter for generalizing across languages or subjects (because the regression model might be able to capitalize on autocorrelation with only one left-out TR). Passing in a split-half or leave-one-segment-out cross-validator for the inner loop might yield a parameter with better generalization performance. No re-analysis required here, just worth considering for the future. (We’ve seen some speed-ups and quality-of-life improvements using the himalaya package rather than scikit-learn for analyses like this.)
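
For example, here is a minimal sketch of what I mean by swapping in contiguous-segment splits for the inner loop (not the authors' pipeline; the shapes and segment length are made up):

```python
# Tune the ridge alpha on held-out contiguous segments rather than the default
# efficient leave-one-out over single TRs.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GroupKFold

n_trs, n_features = 300, 512          # hypothetical dimensions
rng = np.random.default_rng(0)
X = rng.standard_normal((n_trs, n_features))
y = rng.standard_normal(n_trs)

segments = np.arange(n_trs) // 30     # ~30-TR contiguous chunks as CV groups
inner_cv = list(GroupKFold(n_splits=5).split(X, y, groups=segments))

model = RidgeCV(alphas=np.logspace(-2, 6, 9), cv=inner_cv)
model.fit(X, y)
print(model.alpha_)                   # selected regularization strength
```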

Page 23: Can you add another sentence here describing what "perplexity" means as a classification scoring metric, for readers who might be unfamiliar?
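
For what it's worth, the definition I have in mind (the authors should confirm this matches their usage) is the exponentiated average negative log-probability the model assigns to each held-out token:

$$\mathrm{perplexity} = \exp\!\left(-\frac{1}{N}\sum_{t=1}^{N}\log p(w_t \mid w_{<t})\right)$$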

Page 25: How were language fROIs defined in Study II? Were functional localizers in English used as in Study I? It wasn’t clear to me from the Methods text.

Page 43: “in the right-hemisphere language areas is r = 0.34, compared to r = 0.35 in the left-hemispheric language network, showing only a slight 3.86% average decrease” —I found this surprising, and I wouldn’t even report this as a “decrease”, as they are surely roughly equivalent from a statistical perspective.

Page 45: “an arbitrary shift corresponding to 2 TRs” —I would describe this in terms of seconds, not TRs, since the TR is fairly arbitrary (e.g., if your TR is ~700 ms like in HCP, 2 TRs will not be an adequate delay).

Page 48: Does the context-reset control analysis also reduce performance on the across-language generalization test? If we see a similar drop for the across-language test, wouldn’t this argue that it’s not an issue of context contamination?

I was expecting to see certain citations, but didn’t find them until the appendix (page 37). Obviously I have a biased reading of the literature, but the paper by Chen et al., 2024 seems relevant enough for the main text; possibly Dehghani et al., 2017, and/or Honey et al., 2012, as well.

Nice that the code is on OSF!

References:

Antonello, R., & Huth, A. (2024). Predictive coding or just feature discovery? An alternative account of why language models fit brain data. Neurobiology of Language, 5(1), 64–79. DOI

Chen, C., Gong, X. L., Tseng, C., Klein, D. L., Gallant, J. L., & Deniz, F. (2024). Bilingual language processing relies on shared semantic representations that are modulated by each language. bioRxiv. DOI

Dehghani, M., Boghrati, R., Man, K., Hoover, J., Gimbel, S. I., Vaswani, A., Zevin, J. D., Immordino-Yang, M. H., Gordon, A. S., Damasio, A., & Kaplan, J. T. (2017). Decoding the neural representation of story meanings across languages. Human Brain Mapping, 38(12), 6096–6106. DOI

Fedorenko, E., Ivanova, A. A., & Regev, T. I. (2024). The language network as a natural kind within the broader landscape of the human brain. Nature Reviews Neuroscience, 25(5), 289–312. DOI

Honey, C. J., Thompson, C. R., Lerner, Y., & Hasson, U. (2012). Not lost in translation: neural responses shared across languages. Journal of Neuroscience, 32(44), 15277–15283. DOI

Malik-Moraleda, S., Ayyash, D., Gallée, J., Affourtit, J., Hoffmann, M., Mineroff, Z., Jouravlev, O., & Fedorenko, E. (2022). An investigation across 45 languages and 12 language families reveals a universal language network. Nature Neuroscience, 25(8), 1014–1019. DOI