Open reviews

June 19, 2024 — Open review of He & Toyoizumi:

He, Z., & Toyoizumi, T. (2023). Causal graph in language model rediscovers cortical hierarchy in human narrative processing. arXiv. DOI


He and Toyoizumi develop a method for quantifying causal connections from one layer to another in a large language model (LLM), and show that LLM features that integrate more features from a prior layer better predict brain activity higher in the cortical processing hierarchy. They validate this relationship by showing that the causal analysis correlates with increasing temporal autocorrelation along the cortical hierarchy. I like how this work connects the literature on temporal receptive windows with LLMs, and the causal analysis seems to map onto the cortical processing hierarchy in a fairly novel and intuitive way. However, I don’t think the causal analysis is described in enough detail for a broad readership to interpret the results; I’m not aware of a precedent for this analysis in the field, so the authors will have to make a convincing case themselves. I also outline a couple of concerns about the statistical assessment.

Major comments:

My main comment is that I worry readers will need more information about the causal perturbation analysis to feel confident in interpreting the results. Is there any precedent in the NLP literature for using this method to quantify causal dependencies between features across LLM layers? (I don’t see any pointers to other literature in this section.) If not, then any further validation would be helpful. I’ll list out a few questions I had while trying to understand and interpret this analysis:

1. The described relationship between the causal analysis and PCA was unclear to me. Do you simply mean to say that the causal analysis was applied to the same PCA-reduced features initially used to predict neural activity?
2. From an NLP perspective, does this causality analysis have the same interpretation when applied to PCs as when applied to the original embedding features? For example, the PCA will give you linear combinations of embeddings maximizing variance in this particular stimulus set. Does this limit the interpretability of the observed causal relationships?
3. Do you estimate (or apply) the same PCA transformation to both the input (layer 4) and output (layer 9) of the causality analysis? Or do you run separate PCA analyses for each layer?
4. I didn’t fully understand how the time delay tau plays into the causal analysis. Are the “time” delays in units of tokens?
5. Pinning the core analyses entirely on the causal graph from layer 4 to layer 9 feels fairly arbitrary; I understand that layer 9 is the best-performing layer for predicting brain data, but why layer 4? Shouldn’t we expect each layer to inherit some contextual structure from many if not all previous layers? Figures 9, 10, and 11 in the Appendix indicate that the results do shift around somewhat for different layers in different models. I suspect that the results hold up across different combinations of layers, but it would be reassuring if this were investigated more systematically (rather than a handful of spot checks).
6. How and why was a particular threshold applied to the causality matrix? How does this threshold relate to the low in-degree and high in-degree groups of features?

In general, without a more thorough description (or looking at the code), I feel like I don’t really understand how this causal perturbation analysis works.

We need more information on how the statistical tests are performed. For example, on page 5 (similarly for page 7), you mention that you adopt “a shuffling approach for the language features”—but we need more information to understand what is being shuffled and what the statistical test is actually evaluating. Are you shuffling embeddings across TRs? This will abolish the temporal autocorrelation across embeddings and will likely yield an overly permissive null distribution. For this kind of within-subject randomization analysis, I would recommend phase randomizing (or circular-shifting) the time series prior to fitting the model, or block-permuting temporally contiguous chunks (e.g. LeBel et al., 2023), both of which largely preserve autocorrelation. However, this kind of within-subject permutation does not license population inference over subjects. You could instead consider using something like a Wilcoxon signed-rank test to assess significance across subjects (e.g. Caucheteux et al., 2023). Are you conducting this statistical test per voxel? If so, then you should control for multiple tests somehow. Or is your test statistic the mean encoding performance across all voxels? (The former is fairly standard, the latter is not.) In any case, the authors should better motivate their choice of statistical tests and more explicitly describe how they perform them.
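To make this concrete, here is a minimal sketch of the kind of block permutation and across-subject test I have in mind (all names, shapes, and numbers here are hypothetical placeholders, not taken from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

def block_permute(X, block_size=10):
    """Shuffle temporally contiguous blocks of rows (TRs) of a feature
    matrix X; this largely preserves within-block autocorrelation."""
    starts = np.arange(0, X.shape[0], block_size)
    blocks = [X[s:s + block_size] for s in starts]
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order], axis=0)

# Hypothetical within-subject null: refit the encoding model on
# block-permuted features many times, e.g.
# null_scores = [fit_and_score(block_permute(X), y) for _ in range(1000)]
# where fit_and_score stands in for the authors' encoding pipeline.

# For population inference, compare per-subject scores against chance
# (here zero) with a signed-rank test across subjects.
observed_scores = rng.normal(0.05, 0.02, size=8)  # placeholder scores for 8 subjects
stat, p = wilcoxon(observed_scores)
print(f"Wilcoxon signed-rank: W = {stat:.1f}, p = {p:.3f}")
```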

For the analyses accompanying Figures 6 and 7, you report p-values for the Spearman correlations. Note that these p-values do not support population inference as formulated. In Figure 6, the degrees of freedom in computing the p-value correspond to the number of parcels (where the parcel-level values are averaged across subjects); in Figure 7, the degrees of freedom correspond to the number of tokens. In both cases, I would consider de-emphasizing the p-values so as to avoid misinterpretation (or reformulate the statistical test somehow to yield a more meaningful p-value).
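To illustrate the degrees-of-freedom point with made-up numbers: a Spearman correlation computed over ~180 parcel-averaged values yields a p-value that reflects the number of parcels, not the number of subjects, so it speaks to this particular parcellation rather than to the population of subjects.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical parcel-level values (one value per Glasser parcel),
# each already averaged across subjects.
n_parcels = 180
in_degree = rng.normal(size=n_parcels)
time_constant = 0.3 * in_degree + rng.normal(size=n_parcels)

rho, p = spearmanr(in_degree, time_constant)
# This p-value is driven by n_parcels, not by the number of subjects,
# so it does not license inference to new subjects.
print(f"rho = {rho:.2f}, p = {p:.2e}")
```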

In Figure 3, am I correct in understanding that the x- and y-axes correspond to the 20 PCs for layer 4 and layer 9? If so, I would adjust the x- and y-axis tick values to indicate integers, rather than values like 17.5. What value does the color bar capture here and how should we interpret it?

On page 7, you mention that the scrambling manipulation used by Lerner et al. (2011) “cannot be reproduced in our case.” While this is true, work by Vo, Jain et al. (2023) has shown that manipulating the model inputs instead can replicate a similar hierarchy of temporal receptive fields. I don’t think it’s necessary for the authors to reproduce those results here, but this work should be cited at least.

I’m not sure the map revealed by Figure 12 really tells us anything meaningful about the relationship between model layer and the cortical hierarchy. Layer 9 is simply a better basis than layer 1 for predicting language processing in the brain, so the difference map reveals (in red) that layer 9 better predicts essentially the entire cortical language network.

In Figure 5, I’m not sure what you mean when you say that “the map is thresholded at 1.5 s” (i.e. 1 TR). Can you give us a better intuitive understanding of what this autocorrelation time constant means? How should we interpret the color bar? Wouldn’t it be meaningful to also show voxels with a time constant lower than 1.5 s? I’m curious as to why earlier auditory areas do not show a shorter (more blue) autocorrelation time constant.
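For what it’s worth, my working assumption is that the time constant reflects something like the decay rate of each voxel’s empirical autocorrelation function; a hedged sketch of that interpretation (not necessarily the authors’ implementation) would be:

```python
import numpy as np
from scipy.optimize import curve_fit

def autocorrelation(x, max_lag=10):
    """Empirical autocorrelation of a voxel time series up to max_lag TRs."""
    x = x - x.mean()
    return np.array([1.0 if lag == 0 else np.corrcoef(x[:-lag], x[lag:])[0, 1]
                     for lag in range(max_lag + 1)])

def autocorrelation_time_constant(x, tr=1.5, max_lag=10):
    """Fit an exponential decay exp(-t / tau) to the autocorrelation function;
    tau (in seconds) is the autocorrelation time constant."""
    lags_s = np.arange(max_lag + 1) * tr
    acf = autocorrelation(x, max_lag)
    (tau,), _ = curve_fit(lambda t, tau: np.exp(-t / tau), lags_s, acf, p0=[tr])
    return tau

# Under this reading, tau < 1.5 s would mean the signal decorrelates within
# a single TR, which is exactly why I'd still want to see those voxels mapped.
```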

In Figures 2, 4, and 5, I’m not sure the Glasser atlas labels really make this brain map easier to interpret. At first I had the impression that you were running the encoding analysis per parcel, but I understand that you’re running voxelwise models then simply visualizing the parcels on top. In any case, I would keep the parcel boundaries for visualization and use the parcel labels in describing the results in the text, but I would consider removing the labels from the visualization.

I’m not sure it’s very useful to focus on the average encoding performance (correlation) across all voxels when reporting the results, as this will include many voxels not involved in language processing, which we wouldn’t expect to have high scores. I think you could omit this averaging step entirely when summarizing the results, or otherwise report the average within some set of brain areas; e.g. Fedorenko’s language ROIs (Lipkin et al., 2022) or a subset of voxels with high intersubject correlation (Nastase et al., 2021).
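As a concrete alternative, something along these lines, assuming a probabilistic language atlas (or ISC map) resampled to the same voxel space; the threshold here is an arbitrary placeholder:

```python
import numpy as np

def summarize_performance(encoding_r, atlas_prob, prob_threshold=0.2):
    """Report mean encoding performance within a language mask rather than
    across the whole brain.

    encoding_r: (n_voxels,) voxelwise encoding performance (correlations)
    atlas_prob: (n_voxels,) probabilistic language atlas values in the same space
    """
    language_mask = atlas_prob > prob_threshold  # placeholder threshold
    return {
        "mean_all_voxels": float(np.nanmean(encoding_r)),
        "mean_language_voxels": float(np.nanmean(encoding_r[language_mask])),
        "n_language_voxels": int(language_mask.sum()),
    }
```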

Minor comments:

The writing could use a bit of work; right now, it feels geared toward an ML conference audience. For example, the Related Work section reads a bit too much like an unstructured list of related papers.

I would slightly reword the title to “Causal graphs in language models rediscover the cortical hierarchy in human narrative processing”.

Page 3: I don’t think “surfaced” is standard terminology; I would say “resampled to the cortical surface” instead.

The sentence with “correlate it to a specific TR” isn’t very clear; maybe “To downsample the embeddings to match the temporal resolution of the fMRI data, we assigned each token to a TR based on the associated timestamps, then averaged the corresponding embeddings within each TR.”
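In code, I imagine something like the following sketch, assuming token onset times and a fixed TR (the function and variable names are mine, just to illustrate the suggested phrasing):

```python
import numpy as np

def downsample_embeddings(embeddings, token_onsets, n_trs, tr=1.5):
    """Average token embeddings within each TR based on token timestamps.

    embeddings: (n_tokens, n_dims) array of LLM features
    token_onsets: (n_tokens,) token onset times in seconds
    """
    downsampled = np.zeros((n_trs, embeddings.shape[1]))
    tr_index = np.floor(token_onsets / tr).astype(int)
    for t in range(n_trs):
        in_tr = tr_index == t
        if in_tr.any():
            downsampled[t] = embeddings[in_tr].mean(axis=0)
    return downsampled
```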

Page 4: Are you using five delays? 2 TRs (3 s), 3 TRs (4.5 s), 4 TRs (6 s), 5 TRs (7.5 s), and 6 TRs (9 s)?
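If these are finite impulse response delays applied to the downsampled features, I picture something like this (the specific delays are my guess from the text):

```python
import numpy as np

def make_lagged_features(X, delays=(2, 3, 4, 5, 6)):
    """Horizontally stack copies of the feature matrix X (TRs x features),
    each shifted by a delay (in TRs), to account for hemodynamic lag."""
    n_trs, n_feats = X.shape
    lagged = np.zeros((n_trs, n_feats * len(delays)))
    for i, d in enumerate(delays):
        lagged[d:, i * n_feats:(i + 1) * n_feats] = X[:n_trs - d]
    return lagged
```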

Page 4: “, such as OPT” > “ (OPT-125m)”

References:

Caucheteux, C., Gramfort, A., & King, J. R. (2023). Evidence of a predictive coding hierarchy in the human brain listening to speech. Nature Human Behaviour, 7, 430–441. DOI

LeBel, A., Wagner, L., Jain, S., Adhikari-Desai, A., Gupta, B., Morgenthal, A., Tang, J., Xu, L., & Huth, A. G. (2023). A natural language fMRI dataset for voxelwise encoding models. Scientific Data, 10, 555. DOI

Lerner, Y., Honey, C. J., Silbert, L. J., & Hasson, U. (2011). Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. Journal of Neuroscience, 31(8), 2906–2915. DOI

Lipkin, B., Tuckute, G., Affourtit, J., Small, H., Mineroff, Z., Kean, H., Jouravlev, O., Rakocevic, L., Pritchett, B., Siegelman, M., Hoeflin, C., Pongos, A., Blank, I. A., Struhl, M. K., Ivanova, A., Shannon, S., Sathe, A., Hoffman, M., Nieto-Castañón, A., & Fedorenko, E. (2022). Probabilistic atlas for the language network based on precision fMRI data from >800 individuals. Scientific Data, 9, 529. DOI

Nastase, S. A., Liu, Y.-F., Hillman, H., Zadbood, A., Hasenfratz, L., Keshavarzian, N., Chen, J., Honey, C. J., Yeshurun, Y., Regev, M., Nguyen, M., Chang, C. H. C., Baldassano, C., Lositsky, O., Simony, E., Chow, M. A., Leong, Y. C., Brooks, P. P., Micciche, E., Choe, G., Goldstein, A., Vanderwal, T., Halchenko, Y. O., Norman, K. A., & Hasson, U. (2021). The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehension. Scientific Data, 8, 250. DOI

Vo, V. A., Jain, S., Beckage, N., Chien, H. Y. S., Obinwa, C., & Huth, A. G. (2023). A unifying computational account of temporal context effects in language across the human cortex. bioRxiv. DOI