Open reviews

October 5, 2023 — Open review of preprint by Vo, Jain and colleagues:

Vo, V. A.*, Jain, S.*, Beckage, N., Chien, H. Y. S., Obinwa, C., & Huth, A. G. (2023). A unifying computational account of temporal context effects in language across the human cortex. bioRxiv. DOI


Vo, Jain and colleagues build a language model with a fixed distribution of interpretable processing timescales (a multi-timescale recurrent neural network or MT-RNN) to capture brain activity at different timescales during natural language comprehension. First, they use this model (specifically the encoding weights learned across units with varying timescales at each voxel) to map out cortical gradients of processing timescales. Next, they show that this model (trained only on intact, naturalistic narrative stimuli) can reproduce three different experimental results from the literature: (1) a cortical hierarchy of increasing temporal receptive windows that was originally revealed by scrambling story stimuli at different timescales; (2) increasing divergence in the activity of higher-level cortical regions for a story stimulus where relatively unobtrusive word substitutions change the overall meaning of the story; (3) processing timescales for context construction and forgetting. I think this is a great paper and I’m looking forward to seeing it published. The manuscript is thorough and well-written, the figures are highly informative (albeit pretty dense!), and the findings are compelling. Neuroscientists are often stuck with off-the-shelf language models developed in industry, so it’s exciting to see researchers build their own model to ask more specific questions with greater interpretability.

Major comments:

Having worked on similar efforts, I can foresee how a paper like this could be accused of “not showing anything new”—but I do not think that’s a valid criticism here. First, the authors develop a methodological framework for building sophisticated neural network models that can test targeted hypotheses about cortical language processing—I think this is exactly where the field is going, and this paper really sets a precedent on this front: the motivation, development, and evaluation of the model are a tour de force. Second, the model they develop is itself fairly novel: it mimics the contextual representation and next-word prediction of popular large language models, but—while those models are quite difficult to interpret—this model is built with a predefined range of processing timescales in the middle layer. This injects a much-needed level of interpretability into the encoding analysis and results. Not to mention, building and training a model like this is no easy feat. Third, the authors develop a neat “in silico” approach to show that their model reproduces several important experimental findings from the literature. Their encoding model is trained entirely on intact story-listening stimuli/data; they feed experimentally manipulated stories into the model, use the fitted encoding model to generate model-based predictions of brain activity, and evaluate these predicted brain maps against the original results from the literature. Unlike the original findings (which relied on spatial normalization and intersubject correlation analyses), these model-based predictions are higher-resolution and individual-specific; matching these individual-specific predictions to the literature required some finesse. The prior findings were (for the most part) not based on any explicit model at all, but rather derived from experimental manipulations; the move to an explicit model that can holistically account for several of these effects is a major advance.

I appreciate that the authors use classification/probing analyses to validate that their MT-RNN model does in fact capture different linguistic structures in units with different timescales (Figure 6). For example, low-level part-of-speech features are more strongly represented in short timescale units whereas higher-level topic features are more strongly represented in long timescale units. I wonder if it would be better to present this result/figure earlier in the text, rather than last, to build the reader’s confidence that this model is capturing something meaningful (Figure 1 is already dense enough, though). This leads to my main question: How well does this model actually capture the structure of language? Can we infer anything from the actual magnitude of the decoding performances in Figure 6, rather than just the relative differences in performance across timescale bins? How accurately is this model able to predict upcoming words (e.g. top-1, top-10, top-50 accuracy)? The authors state that “MT-RNN is marginally worse at language modeling than contemporaneous transformer architectures” (page 4), but how much worse is marginally worse? How does this model perform compared to more widely used “benchmark” large language models like GPT-2? Of course we should fully expect this model to perform a bit worse; it doesn’t take advantage of the transformer architecture, maybe it was trained on a smaller corpus(?), and it has a predetermined distribution of timescales—this is the necessary price we pay for interpretability. But I still think it would be worth quantifying the model’s performance on the stimuli at hand relative to a more standard model like GPT-2 as a supplementary analysis (see the sketch below).
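
To make this request concrete, here is a minimal sketch of the kind of supplementary benchmark I have in mind, using an off-the-shelf GPT-2 from the Hugging Face transformers library. Note that this operates on GPT-2’s subword tokens rather than whole words (so it is not an apples-to-apples comparison with a word-level MT-RNN), and stimulus_text is just a placeholder for a story transcript:

```python
# Rough sketch (not the authors' pipeline) of top-k next-token accuracy for an
# off-the-shelf GPT-2 model; "stimulus_text" is a hypothetical placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

stimulus_text = "..."  # transcript of a held-out story (placeholder)
ids = tokenizer(stimulus_text, return_tensors="pt").input_ids[:, :1024]

with torch.no_grad():
    logits = model(ids).logits          # [1, n_tokens, vocab_size]

targets = ids[0, 1:]                    # tokens 1..n-1 are the prediction targets
preds = logits[0, :-1]                  # logits at positions 0..n-2 predict them
for k in (1, 10, 50):
    topk = preds.topk(k, dim=-1).indices             # [n_tokens-1, k]
    hits = (topk == targets.unsqueeze(-1)).any(-1)   # True if target is in top k
    print(f"top-{k} accuracy: {hits.float().mean().item():.3f}")
```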

I had a couple questions about the model while reading. First, what’s the motivation for choosing 3–4 words as the timescale for the first layer? Second, is the model taking in full words or subword tokens like many large language models? My impression from reading the Methods was that it takes in full words. I wonder if this also means the model won’t learn certain structures—e.g. morphology—that large language models with subword tokenizers do learn. Third, on my first readthrough, I was asking myself “what are the input vectors in the first layer of this model?” But, as far as I understand from the Methods, the model is initialized with random vectors assigned to each word in the vocabulary and learns word embeddings from the training corpus—is that correct? Lastly, are the layers fully connected?
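
For what it’s worth, this is my reading of the input setup, expressed as a purely illustrative sketch (the layer sizes and the generic LSTM stack are my placeholders, not the authors’ actual architecture): a word-level vocabulary whose embeddings are randomly initialized and then learned jointly with the recurrent weights during next-word prediction training.

```python
# Purely illustrative sketch of my reading of the input layer (made-up sizes;
# the LSTM stack is a generic stand-in for the multi-timescale recurrent layers).
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 50_000, 400, 1150   # hypothetical dimensions

embedding = nn.Embedding(vocab_size, embed_dim)  # one randomly initialized vector per word, learned end-to-end
rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=3, batch_first=True)
decoder = nn.Linear(hidden_dim, vocab_size)      # next-word prediction head
```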

After reading the Methods, I’m still a little confused about how the results from prior studies were reproduced for comparison with the current results. Did you fully reproduce the ISC analyses used in all three of these studies? Where did the data come from? Or did you somehow obtain finished-product “results” maps from the original studies?

It wasn’t very clear to me how you arrived at the correlations in Figure 5. The authors want to show that their modeling framework “enables [them] to directly compare across experimental paradigms” by evaluating “the relationship between each replicated effect and the model-based timescale estimate T_v”. I guess I don’t exactly understand what it means to correlate T_v with e.g. the timescale categories from the Lerner analysis. You’re correlating whole-brain maps across voxels—right? Which voxels contribute to this correlation? All of them or some subset? If I understand correctly, a high correlation here would indicate that voxels with longer model-based timescale estimates (T_v) tend to correspond to the higher levels of the Lerner hierarchy (or larger divergence for Yeshurun, etc.)—is that right? Anyway, I think this makes sense, but it took me a couple of reads, so any effort to smooth out the reasoning or better explain Figure 5 would be appreciated.
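
If I’m reading this right, the computation amounts to something like the following sketch (the file names and voxel-selection criterion are hypothetical, and I’m assuming a rank correlation, which may not match the authors’ exact procedure):

```python
# Sketch of how I'm interpreting the Figure 5 correlations (hypothetical file
# names; the voxel-selection criterion is my assumption).
import numpy as np
from scipy.stats import spearmanr

T_v = np.load("timescale_estimates.npy")        # model-based timescale per voxel
lerner_level = np.load("lerner_hierarchy.npy")  # e.g. word/sentence/paragraph level per voxel

# Restrict to voxels with valid values in both maps (well-predicted voxels only?)
mask = ~np.isnan(T_v) & ~np.isnan(lerner_level)
rho, p = spearmanr(T_v[mask], lerner_level[mask])
print(f"rank correlation between T_v and Lerner level across {mask.sum()} voxels: {rho:.2f}")
```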

It wasn’t clear to me in either the Results or Methods exactly what kind of “linear classifiers” the authors are using to probe the language model (particularly page 31). The hyperparameter optimization, cross-validation structure, and model evaluation (scoring metric) could all be laid out a bit more explicitly in the Methods.
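
As a concrete illustration of the level of detail I’m hoping for (this is a generic nested cross-validation setup on synthetic data, not a guess at the authors’ actual choices), something like the following would answer all three questions at a glance:

```python
# Generic example of a fully specified linear probe (regularization grid,
# nested CV structure, scoring metric); synthetic data stand in for the unit
# activations of one timescale bin and their word-level labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 128))   # stand-in: unit activations (n_words x n_units in a bin)
y = rng.integers(0, 5, size=2000)      # stand-in: e.g. part-of-speech label per word

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
inner = GridSearchCV(                  # inner loop: tune the L2 penalty
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="balanced_accuracy",
)
# Outer loop: held-out probe performance (the number one would report per bin)
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="balanced_accuracy")
print(f"probe accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```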

I have a question that I’m not quite sure how to articulate: Is there some way to determine how much of the timescale effect this model captures in the brain comes from the predetermined distribution of timescales built into the model’s architecture, versus how much comes from the statistical structure of language in the training set? Obviously, if the model were trained on scrambled language, it wouldn’t matter that it has built-in timescales, and it would predict the brain poorly. I also don’t really know how to go about evaluating a question like this, but my assumption is that there must be some kind of match or “fit” between the timescales the model can express and the statistical structure of real-world language. A partly related question: how prevalent is structure at different timescales in the present stimulus set of Moth stories? Does the stimulus set engage units across the full range of timescales, or is it biased toward some intermediate range of timescales?
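
As one crude way to get at that last question (purely a sketch of the idea, using generic pretrained word vectors and a made-up file name; I’m not asking the authors to run this), one could ask how quickly word-embedding similarity decays with lag in the stimulus transcripts:

```python
# Crude sketch: how much structure do the stimuli contain at different lags?
# (Generic word vectors, e.g. GloVe, and a hypothetical file name.)
import numpy as np

emb = np.load("story_word_embeddings.npy")   # [n_words, dim]: one vector per word in the transcript

def lag_similarity(emb, lag):
    """Mean cosine similarity between words separated by `lag` positions."""
    a, b = emb[:-lag], emb[lag:]
    cos = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return cos.mean()

for lag in (1, 4, 16, 64, 256):
    print(f"lag {lag:>3} words: mean similarity {lag_similarity(emb, lag):.3f}")
```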

In linking the model-based predictions to the original ISC results, the authors “measured the variance of the predicted response for each condition in each voxel of each participant” because “response variance monotonically varies with ISC.” I had a hard time following this argument (even after reading the Methods on page 28). I understand how ISC will scale with “response variance”—but only variance in the response that’s temporally synchronized across subjects due to the shared stimulus; subjects may have idiosyncratic, individual-specific variance for a given stimulus that won’t be factored into the ISC estimate. But can’t a flexible, individualized encoding model potentially capture these idiosyncratic responses to the stimulus? That is, can’t the response variance of the model predictions include (potentially meaningful!) individual-specific components that stretch the analogy to ISC? To be clear, I believe what the authors are telling us here (and the follow-up ISC-style analysis helps), but it might be helpful to hold the reader’s hand a bit more in explaining the logic.
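
To make the worry concrete, here is a toy simulation (nothing to do with the authors’ actual data): every simulated subject gets the same stimulus-locked signal plus a large idiosyncratic component, so total response variance is dominated by the idiosyncratic part even though only the shared part drives ISC.

```python
# Toy simulation of the concern: ISC only reflects the stimulus-locked variance
# shared across subjects, while total response variance (the proxy being used)
# also includes idiosyncratic components. Illustrative numbers only.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trs = 10, 1000
shared = rng.standard_normal(n_trs)              # stimulus-locked signal common to all subjects

# Each subject = shared signal + sizable idiosyncratic component
responses = np.array([shared + 2.0 * rng.standard_normal(n_trs)
                      for _ in range(n_subjects)])

# Leave-one-out ISC: correlate each subject with the mean of the others
isc = np.mean([np.corrcoef(responses[s],
                           responses[np.arange(n_subjects) != s].mean(0))[0, 1]
               for s in range(n_subjects)])

print(f"leave-one-out ISC: {isc:.2f}")                                # driven only by the shared component
print(f"mean response variance: {responses.var(axis=1).mean():.2f}")  # shared (~1) + idiosyncratic (~4)
```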

I think the model’s inability to reproduce the forgetting result is particularly interesting. For me, it brings to mind the “now-or-never bottleneck” from Christiansen and Chater (2016), where the current linguistic content must be processed (and passed on to the next level?) relatively quickly just to “make room” for more incoming content; and this must happen especially rapidly in lower-level areas to accommodate the rapid influx of words. Does this framework suggest some kind of memory limitation that, when implemented in a future model, might better reproduce the brain activity? I also wonder whether the cadence or timing of speech could partly drive this effect; this model and most other large language models only care about the sequence of words, not their actual timing. We also have a recent paper suggesting that gradual accumulation of linguistic information over time, culminating in rapid flushing at “chunk” boundaries at each level of the hierarchy, best captures response lags across cortical areas during naturalistic story-listening (Chang et al., 2022). Anyway, the authors already speculate a bit about this on page 13, and I don’t want to push them to speculate too wildly; just curious about any further thoughts on this.

The map of estimated voxel timescales (Figure 1E) seems to show a fairly continuous gradient in prefrontal cortex, progressing anteriorly from shorter timescales near somatomotor cortex toward longer timescales approaching the frontal pole. Do the authors think this is a robust/stable result? It reminds me of the posterior–anterior hierarchy of control signals with increasingly longer timescales proposed by Koechlin & Summerfield (2007). On the other hand, I wonder whether this result creates any tension with strong claims that the prefrontal cortex is fractionated into dedicated language areas and not-strictly-linguistic control areas (e.g. Diachek et al., 2020). I’d be interested to hear how the authors interpret this prefrontal gradient in light of the existing literature.

Minor comments:

When you first introduced the MT-RNN, my first thought was “what objective function?” It might be worth mentioning earlier on that the model is autoregressive, i.e. trained on next-word prediction.

In Figure 1E (and similar plots, e.g. Figure S2), can you explicitly state the units of the timescale T_v with the colorbar? I assume “7.7” means something like 7.7 words—is that correct?

Page 3: In reference to Figure 1E, you mention longer timescales in angular gyrus—but we can’t really see angular gyrus in Figure 1E as far as I can tell.

Page 30: “the average paragraph length”—What heuristic do the authors use to determine where paragraph breaks occur?

Page 30: “We also repeated this analysis by dividing the MT-RNN units into different timescale bins, and correlating across units within each bin instead of correlating across paragraph chunks.” This more closely follows the method used by Chien and Honey (2020), right? I’m kind of surprised that the voxelwise results behave similarly enough to the spatial pattern ISCs.

I understand that the authors have used repeated test sets in many other papers, which is clearly useful for estimating reliability or noise ceilings—but I’m curious what the authors think: Does repeated presentation and averaging of the test story/data potentially wash out novelty responses, or “wash in” predictive, memory-based activity (e.g. Aly et al., 2018; Michelmann et al., 2021)? Have the authors compared prediction performance for the first presentation of the test story versus subsequent presentations? This isn’t pertinent to the current findings and I’m not demanding any further analyses—just curious.

References:

Aly, M., Chen, J., Turk-Browne, N. B., & Hasson, U. (2018). Learning naturalistic temporal structure in the posterior medial network. Journal of Cognitive Neuroscience, 30(9), 1345–1365. DOI

Chang, C. H., Nastase, S. A., & Hasson, U. (2022). Information flow across the cortical timescale hierarchy during narrative construction. Proceedings of the National Academy of Sciences, 119(51), e2209307119. DOI

Christiansen, M. H., & Chater, N. (2016). The now-or-never bottleneck: a fundamental constraint on language. Behavioral and Brain Sciences, 39, e62. DOI

Diachek, E., Blank, I., Siegelman, M., Affourtit, J., & Fedorenko, E. (2020). The domain-general multiple demand (MD) network does not support core aspects of language comprehension: a large-scale fMRI investigation. Journal of Neuroscience, 40(23), 4536–4550. DOI

Koechlin, E., & Summerfield, C. (2007). An information theoretical approach to prefrontal executive function. Trends in Cognitive Sciences, 11(6), 229–235. DOI

Michelmann, S., Price, A. R., Aubrey, B., Strauss, C. K., Doyle, W. K., Friedman, D., Dugan, P. C., Devinsky, O., Devore, S., Flinker, A., Hasson, U., & Norman, K. A. (2021). Moment-by-moment tracking of naturalistic learning and its underlying hippocampo-cortical interactions. Nature Communications, 12(1), 5394. DOI