Open reviews

May 11, 2021 — Open review of preprint by Dominey: PubPeer

Dominey, P. F. (2021). Narrative event segmentation in the cortical reservoir. bioRxiv. DOI


Dominey uses a recurrent neural network model to capture several temporal dynamics observed in fMRI data during naturalistic narrative comprehension: (1) context construction at varying times scales, (2) rapid context flushing, and (3) slower-evolving events at higher-level cortical areas. The reservoir network model essentially compresses a sequence of prior semantic vectors into a discourse vector that can be deployed immediately during ongoing processing. This is an interesting (and challenging) paper. Most of my comments pertain to the first half of results validating the event segmentations of the reservoir network; namely, I think the statistical assessment of event segmentations can be improved, and we need more details about certain analytic choices (e.g. the number of events). I also think a more thorough explanation of the reservoir network early on would help readers interpret the results (e.g. how is the network trained? how do the different time constants arise?). I include a handful of minor comments and typos at the end.

Major comments:

As someone who’s not very familiar with this kind of network model, I think the manuscript would benefit from a bit more exposition early on about how the reservoir network works. We get some details in the Discussion and the Methods section located at the end of the paper, and at some points the author points to a previous paper (Uchida et al., 2021); but there should be sufficient “standalone” details about the reservoir network in this paper, and these should come early enough in the Introduction/Results that the reader can more easily interpret the findings. For example, you say “the reservoir was trained to generate the discourse vector”—but how exactly? What’s the training objective? Is the reservoir network previously trained on the text corpus somehow? You mention in the Discussion that in the “current modeling effort we do not train the reservoir,” but this comes very late (and is still a bit opaque to me). Another example: How do the temporal dynamics (time constants) of the network arise? Do they emerge out of some sort of training, or are they user-specified parameter settings when wiring the reservoir? (Apologies if these things are obvious and I’m just missing it!)

The choice of applying event segmentation to downsampled whole-brain fMRI data doesn’t seem to have a strong precedent in the related literature. For example, Baldassano et al., 2017, and 2018, use searchlights and/or focus on high-level association areas like angular gyrus, posterior medial cortex, and medial prefrontal cortex. In this case, putative event representations are encoded in finer-grained response patterns within a given region of interest (although not too fine-grained via Chen et al., 2017). It’s not clear to me that these sort of response patterns seen in the literature contribute meaningfully to the coarse, whole-brain response patterns used here, and I’m somewhat surprised the event structure is evident at the whole-brain level (e.g. Fig. 4A). This non-localized approach also seems to be in tension with the idea of a cortical hierarchy with different temporal dynamics. Replicating this analysis with more localized cortical areas implicated in high-level event representation could strengthen the argument; otherwise, better motivation should be provided for using coarse whole-brain data.

There are several places where the choice of the predefined number of events k for the HMM seems arbitrary or insufficiently motivated. For example, why is a HMM with k = 5 used when applying the reservoir network to the Wikipedia-based test narrative generated from four Wikipedia articles? Were multiple values of k assessed (which values?) and k = 5 chosen (based on what criterion?)? Or did you try k = 4, but the result fit poorly? Other examples: when running the HMM on the fMRI data, you specify k = 10; when you examine the fast and slow reservoir subnetworks, you use k = 25 and k = 8—why? If you’re trying multiple values of k here, it should be a systematic comparison and we need to know the criteria for selecting k; for example, I would consider using the t-distance introduced by Geerligs et al., 2021; there’s also a nice demo here.

I’m having a hard time understanding how event boundaries are statistically compared here. For example, you report for the reservoir network that the “ground truth and HMM segment boundaries are highly correlated, with the Pearson correlation r = 0.99, p < 0.0001.” What exactly is being correlated here? Are you correlating a time series of zeros with ones where an event boundary is found? In this case, the degrees of freedom is the number of (autocorrelated) time points. I’m not sure this sort of statistical test is adequate and would advocate for a nonparametric randomization-based approach. For example, Baldassano et al., 2017, use a randomization procedure where they shuffle the boundaries (e.g. 1000 times) while preserving the duration of the events (in conjunction with some metric like the t-distance mentioned above).

In line with the previous comment, when comparing the boundaries (at k = 5) found for the NYT and Wikipedia test narratives, you say you “normalized the resulting event boundaries into a common range, and made a pairwise comparison of the segment boundaries for the two texts.” I’m not really sure what this means. You compared the index of the time points on which each of the four boundaries landed? But your t-value has 5 degrees of freedom, suggesting 6 boundaries were compared… including the first and final time point? Again, I think a nonparametric statistical approach for comparing segmentations (e.g. adapted from Baldassano et al., 2017) would make this more convincing.

In Figure 11, you show the pre-reservoir time-point correlation matrix for Wikipedia2Vec embeddings that serve as input to the network. The lack of slow, event-like structure seems obvious here, but it could be useful to treat this as a more formal “control” model. In other words, if you want to show that the reservoir network captures narrative structure above and beyond the word-level embeddings, it might be worthwhile to show that it provides statistically better event segmentations than the pre-reservoir embeddings.

This paper demonstrates that relatively straightforward recurrent dynamics can reproduce several of the temporal dynamics observed in fMRI data during narrative comprehension. However, the modeling work here doesn’t really touch on the actual content of those high-level event or narrative representations. For example, Baldassano and colleagues relate event representations to situation models. Do we have any interpretation of the discourse vectors represented by the reservoir network (other than summarizing the prior semantic vectors)? You touch on this in the Discussion on page 16, but it might deserve an additional sentence or two.

The figures are described in the main text, but I don’t see any figure captions in the copy of the manuscript provided by the journal or on bioRxiv. Standalone figure captions would be helpful.

Minor comments:

For figures with time-point similarity matrices (e.g. Figs. 2, 4), it would be helpful to see color bars so we have an intuition about the scale of correlations(?) observed.

Abstract: not sure you need “awake” in the first line here

Abstract: expand “HMM” acronym

Page 4: “Wikipedia2ec” > “Wikipedia2Vec”

Page 5: “has to with” > “has to do with”

Page 7: “Braniak” > “BrainIAK”

Figure 7: “Alignement” > “alignment”

Page 8: I would use the full story name and credit the storyteller here: “It’s Not the Fall that Gets You” by Andy Christie (https://themoth.org/stories/its-not-the-fall-that-gets-you)

Page 9: You note the 10 events assigned to the fMRI data and say “Likewise for the Narrative Integration Reservoir, 10 instances were created…” This wording implies to me that the number of reservoir instances relates to the number of events—but I think you want to say that the number of instances is matched to the number of subjects (also 10). Also, you say you “summary results of this segmentation”—but how do you summarize across instances?

Page 14: “remain” > “remaining”

Page 15 “reservoir correlations displayed an abrupt response in the forgetting context” due to “immediate responses to input in the reservoir”—this still wasn’t very intuitive to me… why?

Great to see the code on GitHub!

References:

Baldassano, C., Hasson, U., & Norman, K. A. (2018). Representation of real-world event schemas during narrative perception. Journal of Neuroscience, 38(45), 9689-9699. DOI

Chen, J., Leong, Y. C., Honey, C. J., Yong, C. H., Norman, K. A., & Hasson, U. (2017). Shared memories reveal shared structure in neural activity across individuals. Nature Neuroscience, 20, 115-125. DOI

Geerligs, L., van Gerven, M., & Güçlü, U. (2021). Detecting neural state transitions underlying event segmentation. NeuroImage, 236, 118085. DOI