Open reviews
September 25, 2022 — Open review of preprint by Yang and colleagues:
Yang, E., Milisav, F., Kopal, J., Holmes, A., Mitsis, G. D., Misic, B., Finn, E. S., & Bzdok, D. (2022). Bringing language to dynamic brain states: the default network dominates neural responses to evolving movie stories. bioRxiv.
Yang and colleagues use hidden Markov models (HMMs) to identify event-like functional network configurations in the public StudyForrest movie-watching dataset, then characterize the semantic content of these network configurations by applying latent semantic analysis (LSA) to an annotation of the movie. This is a methodologically complex paper with a lot of moving parts. There are a couple of choice points in the analysis that don’t feel very convincing. I outline a handful of clarifying questions in hopes of ultimately increasing readers’ confidence in the results.
Major comments:
The first three figures are intended to build up an intuition for the analysis pipeline and results by presenting example results for a handful of semantic contexts—for a particular network in a particular subject. I think there’s value to this approach, but on the other hand I think many readers will worry that the analysis pipeline effectively cherry-picks these examples (i.e., visualizing the handful of highest correlations and most easily interpretable semantic contexts). The later figures aim for aggregate, group-level effects, but these are less convincing.
My core set of comments revolves around making a convincing case for the HMM solution used in the manuscript. When you introduce the HMM analysis on page 5, I would explain a bit more thoroughly what the HMM algorithm is doing. For readers less familiar with HMMs, does a four-state solution mean the entire movie is divided into only four long-duration network configurations, or that the algorithm finds four states in total that alternate many times across the length of the movie? Why would we expect different networks to all have the same number of states? For example, you show that lower-level systems have shorter dwell times, but I might expect a network including lower-level sensory areas to prefer a larger diversity of states than a higher-level network.
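For what it’s worth, here’s a toy sketch of the canonical HMM behavior (my own illustration, assuming the hmmlearn package and random data standing in for one network’s time series): a single four-state model labels every time point, and the four states can recur and alternate many times rather than carving the movie into four contiguous chunks.

```python
import numpy as np
from hmmlearn import hmm

# Hypothetical stand-in for one network's data: time points x regions
rng = np.random.default_rng(0)
network_timeseries = rng.standard_normal((3500, 20))

# Fit a four-state Gaussian HMM to the network time series
model = hmm.GaussianHMM(n_components=4, covariance_type="diag",
                        n_iter=100, random_state=0)
model.fit(network_timeseries)

# Most likely (Viterbi) state label at each time point
states = model.predict(network_timeseries)

# The four states can recur and alternate many times, rather than
# dividing the movie into only four contiguous chunks
n_transitions = int(np.sum(np.diff(states) != 0))
print(f"{n_transitions} state transitions across {len(states)} time points")
```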
Choosing an HMM solution with the highest correlations to the semantic contexts (Figure S1A)—then examining those correlations in the main analyses—feels a bit circular. In Figure S1B, I don’t really understand why lower similarity of HMM solutions across subjects would be a good thing in the context of stimulus-driven brain activity.
Figure 5A indicates that HMM dwell times are highly variable across subjects. This is concerning, as it suggests that the HMM solution may not be closely tied to the movie stimulus itself—which is what drives shared responses across subjects. One way of evaluating the quality of an HMM solution for a given stimulus is to estimate the HMM from one set of subjects and test the model fit on left-out subjects (e.g. https://naturalistic-data.org/content/Event_Segmentation.html); however, Figure 5A suggests the current HMM would not generalize well to new subjects. How can the authors convince readers that this variability in dwell times across subjects is meaningful and not just reflective of a poor HMM solution?
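As a rough sketch of what such a generalization test could look like (my own illustration, assuming the hmmlearn package and random arrays in place of the real per-subject data; the tutorial linked above uses a dedicated event segmentation model, but the logic is similar):

```python
import numpy as np
from hmmlearn import hmm

# Hypothetical data: one (time points x regions) array per subject
rng = np.random.default_rng(0)
subject_data = [rng.standard_normal((3500, 20)) for _ in range(15)]

heldout_ll = []
for test_idx in range(len(subject_data)):
    # Estimate the HMM on all but one subject...
    train = [d for i, d in enumerate(subject_data) if i != test_idx]
    model = hmm.GaussianHMM(n_components=4, covariance_type="diag",
                            n_iter=100, random_state=0)
    model.fit(np.concatenate(train), lengths=[len(d) for d in train])

    # ...then evaluate the log-likelihood of the left-out subject
    heldout_ll.append(model.score(subject_data[test_idx]))

print(f"mean held-out log-likelihood: {np.mean(heldout_ll):.1f}")
```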
In Figure 5B, we see that most networks have highly variable dwell times, excepting the Vis&AM and Vis&HC networks. Given this variability, it doesn’t seem to me that the Default network would show significantly longer dwell times than the other networks. In the text (page 12), the authors state that “the highly associative DN showed the longest dwell times across subjects and across brain states”—but they only provide a statistic comparing the Default and Vis networks, and I suspect the significance here owes more to the low variance of the Vis network than to anything about the Default network. Furthermore, no correction for multiple tests is performed. This interpretive claim should be reined in to conform to the actual statistical tests performed.
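For example, a corrected set of pairwise comparisons could look something like the following sketch (entirely hypothetical dwell-time values; paired Wilcoxon tests with Bonferroni correction, assuming scipy and statsmodels):

```python
import numpy as np
from itertools import combinations
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

# Hypothetical mean dwell times: subjects x networks
rng = np.random.default_rng(0)
networks = ["DN", "Vis", "Vis&AM", "Vis&HC"]
dwell = rng.gamma(shape=2.0, scale=5.0, size=(15, len(networks)))

# Paired tests for all network pairs, not just Default vs. Vis
pairs = list(combinations(range(len(networks)), 2))
pvals = [wilcoxon(dwell[:, i], dwell[:, j]).pvalue for i, j in pairs]

# Bonferroni correction across all pairwise comparisons
reject, p_corrected, _, _ = multipletests(pvals, alpha=.05,
                                          method="bonferroni")
for (i, j), p, sig in zip(pairs, p_corrected, reject):
    print(f"{networks[i]} vs. {networks[j]}: "
          f"corrected p = {p:.3f}{' *' if sig else ''}")
```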
On page 9, the authors describe the contributions of various brain areas to HMM state 1, which was ostensibly linked with positive emotional content in the movie stimulus. Does this particular configuration of brain areas correspond to previous findings in the field? Should we see any recognizable structure in the network states plotted in Figure S3? In the Figure S3 caption, can you add a bit more explanation as to how “mean contribution signatures” were computed? I’m not asking the authors to over-interpret (via reverse inference) these brain states; but they’re difficult to interpret and I worry readers won’t be confident that they reflect meaningful network configurations.
I don’t understand why we end up with 200 semantic contexts. How was this “sweet spot” (page 25) selected?
Other than the “given recent updates in anatomical understanding” comment on page 3, I don’t see a very compelling motivation for including the amygdala and hippocampus in these networks. I don’t think there’s any problem with including them; I’m just curious about the thought process behind this.
In Figure 4 and on page 10, you describe using a permutation test to evaluate the significance of correlations between HMM states and semantic contexts. We need a bit more detail than just “shuffling the state presence timeseries” to understand what this null distribution means. For example, are you shuffling the actual states while retaining their identity and temporal duration? Are you shuffling the state labels while retaining the boundaries? Are you shuffling the time points and thus disrupting the temporal contiguity of the states? In the latter case, I don’t think this would yield a very meaningful null distribution. Also, the inset panel in Figure 4 seems to show one null distribution, but shouldn’t we have a separate null distribution for each subject (and each HMM state / semantic context link)? The statement about “statistical significance at p < 1e-14” is not meaningful without somehow controlling for the number of tests: 15 subjects × 14 networks × 200 semantic contexts × 4 states comes to 168,000 tests (these numbers may not be exactly correct, but you get the idea).
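To make the distinction between these candidate nulls concrete, here’s a toy sketch of my own (a hypothetical state label series; none of this is the authors’ actual procedure) contrasting a shuffle that preserves segment identities and durations with a naive time-point shuffle:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy state label time series with contiguous segments
states = np.repeat(rng.integers(0, 4, size=50), 70)

def shuffle_segments(states, rng):
    """Permute the order of contiguous state segments, preserving
    each segment's identity and duration (and thus autocorrelation)."""
    boundaries = np.where(np.diff(states) != 0)[0] + 1
    segments = np.split(states, boundaries)
    rng.shuffle(segments)
    return np.concatenate(segments)

def shuffle_timepoints(states, rng):
    """Permute individual time points, destroying temporal
    contiguity; this likely yields an overly liberal null."""
    return rng.permutation(states)

for name, null in [("segment shuffle", shuffle_segments(states, rng)),
                   ("time-point shuffle", shuffle_timepoints(states, rng))]:
    print(f"{name}: {int(np.sum(np.diff(null) != 0))} transitions")
```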
The authors use a 20-fold cross-validation procedure where they split each subject’s time series into 19 training segments and a 20th test segment to evaluate their partial least squares (PLS) analysis. I have a couple of questions: How many time points per subject are in a left-out test segment? Do the authors z-score the segments from different subjects to ensure there aren’t any subject-level mean/variance shifts after concatenation?
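On the latter point, I have in mind something like this minimal sketch (random arrays standing in for the per-subject data):

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical data: one (time points x features) array per subject
rng = np.random.default_rng(0)
subject_data = [rng.standard_normal((3500, 20)) for _ in range(15)]

# Z-score each subject's time series along the time axis before
# concatenating, removing subject-level mean/variance shifts
X = np.concatenate([zscore(d, axis=0) for d in subject_data])
```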
In the introduction, you mention that “some state-of-the-art NLP models contain billions of parameters, which may outnumber the >80 million neurons of the human brain.” This seems like a sloppy comparison. I think the estimated number of neurons in the human brain is actually approximately 86 billion (Herculano-Houzel, 2012), with on the order of 10^15 (1 quadrillion) synapses (DeWeerdt, 2019)—which are probably the more appropriate comparison for adjustable model parameters. This is still quite a bit larger than today’s large language models; e.g. GPT-3, BLOOM, Gopher, and PaLM range from ~100 to ~500 billion parameters. In any case, the authors are not using embeddings from large language models in this paper, but rather the more classical latent semantic analysis (LSA).
While reading, I felt like there were several important and closely related references that were overlooked. For example, one of the canonical examples of using word embeddings to understand the semantic content of a movie—from Huth and colleagues (2012)—is not cited (not to mention more recent, lower-profile work; e.g. Nishida & Nishimoto, 2018; Vodrahalli et al., 2018). Another example that jumped out to me: Heusser and colleagues (2021) also use HMMs for event segmentation in conjunction with topic modeling.
Both the Introduction and Discussion are quite long (6 and 12 paragraphs, respectively); each could benefit from more concise language that focuses more closely on the material of the current study. Make sure to avoid overly general or grandiose claims and conclusions.
Minor comments:
There are certain points in the paper where the writing is overwrought to the point that it might hinder the reader’s understanding (e.g. terms like “deep associative brain network layers”; or the sentence “We seamlessly integrate… curated human annotations… algorithmically derived elements… evolving movie narrative”—that’s a lot of extra words!); I would recommend cutting down on the number of adjectives/adverbs and simplifying sentences for the sake of clarity.
Abstract: “The spatiotemporal dynamics of brain states explicitly captured subject-level responses throughout the brain network hierarchy.”—I’m having a hard time understanding what this sentence conveys.
Page 3: I’m not sure I follow the logic around the reference to Buzsáki and Mesulam in the introduction; my understanding is that Buzsáki opposes the paradigm of trying to understand the brain in terms of folk psychological and linguistic terms, so I’m not sure the present approach of using more nuanced linguistic terms really provides a qualitatively different alternative that better tracks with Buzsáki’s goals. This isn’t to say the present approach isn’t interesting; it’s just that motivating it from the perspective of Buzsáki / Mesulam doesn’t seem very compelling.
Figure 2B: Make clear in the figure caption that you’re coloring the 10 semantic contexts most strongly correlated with each of the four states.
Page 6: “NLP tools from natural language processing” is redundant—maybe “tools from natural language processing (NLP)”
Figure 6: Words like “deepest” and “intimate” make this hard to parse
Page 14: “dwarfted” > “dwarfed”
References:
DeWeerdt, S. (2019). How to map the brain. Nature, 571(7766), S6.
Herculano-Houzel, S. (2012). The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. Proceedings of the National Academy of Sciences, 109, 10661-10668.
Heusser, A. C., Fitzpatrick, P. C., & Manning, J. R. (2021). Geometric models reveal behavioural and neural signatures of transforming experiences into memories. Nature Human Behaviour, 5(7), 905-919.
Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210-1224.
Nishida, S., & Nishimoto, S. (2018). Decoding naturalistic experiences from human brain activity via distributed representations of words. NeuroImage, 180, 232-242.
Vodrahalli, K., Chen, P. H., Liang, Y., Baldassano, C., Chen, J., Yong, E., Honey, C., Hasson, U., Ramadge, P., Norman, K. A., & Arora, S. (2018). Mapping between fMRI responses to movies and their natural language annotations. NeuroImage, 180, 223-231.