Open reviews

April 22, 2024 — Open review of Mischler, Li and colleagues:

Mischler, G., Li, Y. A., Bickel, S., Mehta, A. D., & Mesgarani, N. (2024). Contextual feature extraction hierarchies converge in large language models and the brain. arXiv. DOI


Mischler, Li and colleagues survey a number of large language models (LLMs) and compare their layerwise feature spaces to neural activity measured using intracranial (iEEG) recordings. Unlike recent work comparing scaling effects with increasingly large models (Antonello et al., 2024), this work takes a cross-sectional approach, comparing 12 different models of roughly the same size. The authors find that models with higher performance on language comprehension and reasoning benchmarks tend to better predict brain activity—with model size held constant. More specifically, they show that better-performing models learn a layerwise hierarchy of feature spaces that better maps onto a cortical language processing hierarchy (operationalized as distance from primary auditory cortex). This “hierarchy alignment” is strongest when models are provided with relatively long context windows. They also present a novel result where better-performing models seem to pack more brain-like features into earlier layers. These results are not completely groundbreaking given the recent flurry of papers on this topic, but the work is thorough and I think there’s plenty of novel elements here. The manuscript is well-written, the figures are well-designed, and the results are compelling. I have some relatively high-level theoretical comments, as well as some questions about the statistical reporting.

Major comments:

My first two comments focus on how we should interpret the observed results with respect to timing and dynamics in the brain. The authors extract the average high-gamma envelope from a 100-ms window around the center of each word, and supply these word-by-word response magnitudes to the encoding analysis. This seems reasonable, and in Figure S2 they show similar results for 50-ms and 150-ms windows. But I’m curious about the motivation for choosing this fairly tight window. I wonder if we’re missing anything by excluding neural activity further in time from word articulation (or perhaps looking at maximal encoding within a wider window). For example, previous work (e.g. Goldstein et al., 2022) has shown examples of electrode-wise encoding performance peaking at 300–400 ms after word onset, and I suspect these dynamics may differ across electrodes. I understand that expanding more on the time/lag question is likely to over-inflate the number of analyses and may be beyond the scope of this paper.

This leads to a more theoretical question. As the analyses are currently formulated, we’re observing a match in hierarchical processing pathways—at the same time relative to word articulation in the brain (50–150 ms around the center of each word). Shouldn’t we expect hierarchical processing to also unfold over time? Surely if linguistic information is flowing “up” the cortical hierarchy, lags in synaptic transmission or recurrent activity will introduce lags and other temporal dynamics. The work on temporal receptive windows that the authors cite (e.g. Lerner et al., 2011) would also suggest fairly complex dynamics, especially in natural speech with longer-form narrative structures (Honey et al., 2012; Chang et al., 2022). I very much appreciate their Figure S3, which replicates the core result on hierarchy alignment via response lags rather than distance from primary auditory cortex. Anyway, I think it would be worth discussing the complexity of this issue, potentially as a limitation or direction for future work.

A core question I had while reading through the results was: what actually makes these models different? I assume they have minor architectural differences, differences in the training procedure, and different diets of training data. For example, initially it wasn’t clear to me that they even have the same number of layers (but I can infer from the x-axis in Figure 2A that they all have 32.) Do the authors have any intuitions as to what drives their differences in performance? I was thinking it might be nice to have a supplementary table describing the models in more detail with an eye toward differences, but I suppose much of that information may not be openly available? Alternatively, maybe just a couple sentences on potential differences would suffice.

The authors quantify depth along the cortical processing hierarchy as physical distance (in millimeters) from primary auditory cortex in posteromedial Heschl’s gyrus (pmHG). I’m surprised at how well this seems to work (Figure 2C), given that this is a fairly coarse metric of “depth” along the language processing pathways. Is there a precedent in the field for operationalizing processing depth in this way? The Methods indicate that the authors are simply computing the Euclidean distance (in fsaverage space) between electrodes, rather than, for example, geodesic distance along the cortical surface (e.g. Oosterhof et al., 2011). But even a cortical surface-based metric wouldn’t take into account the complex white matter pathways connecting these areas. For example, posterior temporal electrodes and inferior frontal electrodes may be fairly distant from each other in terms of Euclidean distance, despite being tightly coupled by the arcuate fasciculus. In any case, I don’t there’s anything particularly wrong with their current operationalization of distance (given that more anatomically or functionally detailed alternatives are likely difficult to formulate)—but a bit more discussion of the limitations in (functional/anatomical) interpretability of this coarse metric would be much appreciated. Also, can the authors also provide a supplementary figure where electrodes are colored according to their distance from pmHG, so we can more easily visualize what these distances mean?

For Figure 2B, you report a correlation of .92 with a very low p-value. Is this correlation computed across only 12 samples (i.e. the 12 models)? Or are you including individual electrodes in the correlation computation? I have the same question about the correlation reported in Figure 2D. What is the statistical interpretation of these p-values—the observed relationship would be very unlikely (under a null distribution of no relationship) when sampling from a “population” of language models?

I have a related question about Figure 3A: Is it statistically permissible to report correlations across binned samples and layer indices? These are clearly not independent samples, given that there are a fixed number of sequential bins and layers. My intuition is that binning will also reduce variability in the original samples, thus inflating the correlation. I don’t think this necessarily undermines the relationship observed in Figure 3B, I’m just not sure how best to quantify “hierarchy alignment”.

Minor comments:

One challenge with work of this kind is that the stimuli presented to the human subjects may be too limited to really push the expressivity of these models. Even though 20–30 minutes of speech is a relatively rich, naturalistic language stimulus, it will have low power for detecting relatively infrequent linguistic structures that may further differentiate or tax particular models (Hamilton & Huth, 2020).

Line 162: “Mistral’s tokenizer averages 1.15 tokens per word in our stimulus corpus”—this wording was a bit hard to parse.

Line 240: When you say “delay[ed]”, I would make sure you specify that you mean across layers (not time).

Line 291: This final comment on “AGI” may rub some readers the wrong way; I would make sure this bit of the discussion is measured and not overly speculative.

Line 351: Ideally, PCA should be estimated within each training set and then the transformation applied (without refitting) to the corresponding test set (e.g. using StandardScaler’s .fit_transform(X_train) and .transform(X_test) to avoid train/test leakage. I don’t think this will make any difference for the central claims of this paper.

Supplementary Figure 3: Can you elaborate a bit on how this analysis was performed in the figure caption? For example, the line describing how you “estimated electrode lag using the peak of a 1D temporal receptive field fitted for each electrode” was pretty opaque to me.

References:

Antonello, R., Vaidya, A., & Huth, A. (2024). Scaling laws for language encoding models in fMRI. Advances in Neural Information Processing Systems, 36. link

Chang, C. H., Nastase, S. A., & Hasson, U. (2022). Information flow across the cortical timescale hierarchy during narrative construction. Proceedings of the National Academy of Sciences, 119(51), e2209307119. DOI

Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey, B., Nastase, S. A., Feder, A., Emanual D., Cohen, A., Jensen, A., Gazula, H., Choe, G., Rao, A., Kim, C., Casto, C., Lora, F., Flinker, A., Devore, S., Doyle, W., Dugan, P., Friedman, D., Hassidim, A., Brenner, M., Matias, Y., Norman, K. A., Devinsky, O., & Hasson, U. (2022). Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25, 369–380. DOI

Hamilton, L. S., & Huth, A. G. (2020). The revolution will not be controlled: natural stimuli in speech neuroscience. Language, Cognition and Neuroscience, 35(5), 573–582. DOI

Honey, C. J., Thesen, T., Donner, T. H., Silbert, L. J., Carlson, C. E., Devinsky, O., Doyle, W. K., Rubin, N., Heeger, D. J., & Hasson, U. (2012). Slow cortical dynamics and the accumulation of information over long timescales. Neuron, 76(2), 423–434. DOI

Lerner, Y., Honey, C. J., Silbert, L. J., & Hasson, U. (2011). Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. Journal of Neuroscience, 31(8), 2906–2915. DOI

Oosterhof, N. N., Wiestler, T., Downing, P. E., & Diedrichsen, J. (2011). A comparison of volume-based and surface-based multi-voxel pattern analysis. NeuroImage, 56(2), 593–600. DOI