November 28, 2020 — Open review of preprint by Toneva and colleagues:
Toneva, M., Mitchell, T. M., & Wehbe, L. (2020). Combining computational controls with natural text reveals new aspects of meaning composition. bioRxiv.
Toneva and colleagues use ELMo embeddings to disentangle contextual and non-contextual semantic representations in fMRI and MEG during a naturalistic reading task. They apply a regression analysis to the contextual embeddings to isolate the “residual context,” then use a forward encoding analysis to fit these embeddings to the brain data. They demonstrate separate but overlapping contextual and non-contextual responses in posterior and anterior temporal cortex in fMRI, but fail to detect residual context in MEG. The analysis is cutting-edge (difficult given how rapidly these language models evolve), the writing is good, and the figures are excellent. I have a few comments on their analysis choices and interpretation. In general, I think this is a great paper and look forward to seeing it published.
The authors should be cautious in how they interpret and portray the null result for residual context in MEG. For example, in the abstract they write “Surprisingly, we cannot detect supra-word meaning in magnetoencephalography, which suggests that composed meaning is maintained through a different neural mechanism than synchronized firing.” This may set off the “absence of evidence is not evidence of absence” alarm bells for many readers. In some cases (e.g. lines 190–191), you frame this as a simple comparison between the fMRI and MEG results; i.e. we observe the effect in fMRI, but not in MEG. Anytime you’re describing this result, I think it’s probably worth taking the time to explicitly say that this is actually an interesting interaction effect: you in fact do observe strong effects for full embeddings and residual word embeddings in MEG (and residual word embeddings perform poorly in fMRI), while residual context is observed only in fMRI. This should help convince readers that the null result for residual embeddings in MEG isn’t purely an issue of power, data quality, etc. That said, I would also be careful in trying to pin this null result on the underlying physiology (interesting as that may be), since this is one of several explanations. For example, at line 194, you write: “These findings suggest a difference in the underlying brain processes that fMRI and MEG capture”—but this result could also stem from differences in preprocessing, differences in the subject sample (assuming they’re not the same participants), or the fact the fMRI data are smoothed over the span of seconds while the MEG models are fit to each time point at high temporal resolution. The interaction I described above may make these explanations less likely, but doesn’t rule them out.
After reading this, I’m still a bit unsure how to interpret the residual context vectors (maybe due to unfamiliarity with ELMo). I think the analytic interpretation is clear: it’s the residual vector after regressing the full words out of the full context vector. But can you give us a more intuitive example of what this residual context vector “means” or “looks like”? In your example of “Mary finished the apple,” should we expect the residual context vector to have some similarity to the isolated word vector for “eating”? Or is it completely unlike any given word vector? I also have a bit of a hard time understanding the mapping of your exposition on “composition” in the Introduction (which is good) to the full vs residual context embeddings. In some sense, it seems like the full context embedding is the unique and meaningful instance of composition, whereas the residual context embedding (the “supra-word meaning”) is some abstract or experimental construct.
Related to the previous comment, does this notion of “residual context”—where you regress the non-contextualized word vector(s) out of the contextualized vector—have an analog in transformer-based models (e.g. GPT-2; Radford et al., 2019). I assume you could similarly pass single words or groups of words through GPT-2 to achieve the same effect. The way ELMo combines a static or global word vector with a local context vector is very intuitive from the perspective of linguistics, but it’s not obvious that the brain will compartmentalize things so nicely in naturalistic/non-experimental contexts (your PTL result may speak to this). In general, my question is: To what extent do you think the conceptual results about context are ELMo-specific or apply more generally?
In your abstract, and on line 17, I’m not sure I understand the emphasis on the word “correlates.” Seems like what you’re focusing on here is the fact that classical neuroimaging studies on this topic tend to use subtraction paradigms (e.g. composition > non-composition) or parametrically vary the degree of composition required (e.g. manipulating sentence complexity). These sorts of studies rely on very well-engineered (non-naturalistic) experimental manipulations and typically focus on the “process” of compositions—whereas here your analysis instead focuses on “content” of composition. But even then, I would still refer to these neural representations of the content of composition as “correlates” of compositional meaning. My point here is I think drawing the process vs. content (and non- vs naturalistic) distinction is important, but I found the framing in terms of “correlates” to be confusing.
Why do the authors report results in terms of the proportion of significant voxels (or sensors) in Figures 2B, 3, 4? I assume the goal here is to provide a simple summary, but why not (also) report the mean? For example, I wonder whether the binarization step of the significance test when computing the proportion could make these results jump around. Or maybe the proportion better summarizes across voxels/sensors with different ridge regularization parameters? I’m curious what the motivation is here.
Figure S1 suggests that the results are pretty heterogeneous across subjects. For example, S2 has widespread significant prediction for both models, S7 is the only one with an appreciable amount of significant voxels for residual context only, whereas S5 and S8 have hardly any significant residual context at all. Not to mention there’s a decent amount of variability in the spatial topographies of well-predicted voxels across subjects. Some of this might be due simply to discretization in the significance threshold. Regardless, I think the authors should mention and discuss this in greater detail.
I wonder if the 5-TR (10-second, 20-word for MEG) block permutation you report at line 391 might result in a permissive null distribution. This is fewer words than the 25-word context window, so will likely underestimate the temporal autocorrelation in the embeddings. Not to mention, 10 seconds is well within the spread of the hemodynamic response, so the null distribution will be based on time series with less temporal autocorrelation than the original data. I think there is even some evidence for long-range temporal autocorrelation in electrophysiology in the context of temporally-evolving naturalistic narratives (Honey et al., 2012), although I’m not sure to what extent this applies to MEG and this particular reading task. Other approaches to randomization that better preserve temporal autocorrelation might be phase randomization or circular time-shift randomization (e.g. Nastase et al., 2019). To be clear, I don’t think this will really make a qualitative difference, so I’m not asking authors to re-run analysis with different randomization—but wanted to point this out. In describing the randomization tests, I would also say explicitly that any p-values reflect within-subject permutation and do not license population inference; whereas the confidence intervals on the proportions do.
I’m not familiar with the ARCH package used to compute confidence intervals. Can you provide a little more detail? For example, when computing confidence intervals (e.g. for the fMRI sample), you’re pulling 1000(?) random samples of 9 participants with replacement from the original 9 participants, and then computing the median of each bootstrap sample?
Line 45: “computer” > “computational” (or “formal”, “quantitative”, etc)
Line 73–76: Are these two sentences intended to explain roughly the same thing? If so, you might start the second sentence with “That is,” or something like that. I read the second sentence a couple times trying to figure out how it was different from the first sentence.
Figure 2. How do you order the voxels for visualization in the spatial generalization matrices? Are they clustered for visualization? Organized by spatial location in the brain?
Figure 2. Why does the normalized Pearson correlation scale go up to 1.5? I understand that the across-voxel correlations are normalized by within-voxel correlation (across subjects); so this implies that for some voxels there are neighboring voxels in other subjects where the model performs better?
Line 131: “salience” seems odd; maybe “discrepancy”?
Line 156–157: add space before reference to figure
Figure 4: What part of the brain are these fMRI results coming from? The whole brain? Wasn’t obvious from the figure caption.
Line 262: Were the data sampled onto the cortical surface reconstruction? Or was FreeSurfer just used to get a volumetric gray matter mask? In Figure 2A, it looks like the subject-specific data are projected onto a surface template (sulci appear identical across subjects), whereas Figure 2D seems to use subject-specific surface reconstructions. For 2A, which template is used, and which normalization procedure?
Line 263: “25000−31000 cortical voxels were kept”—based on what? Is this just how many gray matter voxels were returned by FreeSurfer?
Line 266: Are the subject samples for the fMRI and MEG datasets separate or overlapping?
Line 267: “This” > “These”
Line 326: remove an “indeed”
Line 340: I assume the train/test splits are temporally continuous segments of TRs (i.e. not groups of randomly-selected TRs)?
Line 352: Does estimating a separate ridge regularization parameter for every time point at a sensor create any issues for interpretation?
Line 363: “voxel” > “sensor”
Line 397: This “ROI/timewindow” terminology seems to come out of nowhere… (would also switch to “time-window” like line 406)
The decision to exclude the backward LSTM (integrating information from future words) makes sense for this analysis. But generally I’m curious whether the brain may in fact also retroactively adjust prior word representations based on the current word (at least on short time scales). For example, I can certainly imagine this happening within the span of an fMRI TR: the word “banana” arrives at the beginning of the TR, and either the word “split” or “republic” arrives 300 ms later—maybe the “banana” representation shifts within the scope of that single TR. Also reminds me of “retroactive construction” and Dennett’s “multiple drafts” model (Dennett & Kinsbourne, 1992).
Dennett, D., & Kinsbourne, M. (1992). Time and the observer: the where and when of consciousness in the brain. Behavioral and Brain Sciences, 15(2), 183–247.
Honey, C. J., Thesen, T., Donner, T. H., Silbert, L. J., Carlson, C. E., Devinsky, O., Doyle, Werner K., Rubin, N., Heeger, D. J., & Hasson, U. (2012). Slow cortical dynamics and the accumulation of information over long timescales. Neuron, 76(2), 423–434.
Nastase, S. A., Gazzola, V., Hasson, U., & Keysers, C. (2019). Measuring shared responses across subjects using intersubject correlation. Social Cognitive and Affective Neuroscience, 14(6), 667–685.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog.