Open reviews

December 11, 2022 — Open review of preprint by Samara and colleagues:

Samara, A., Eilbott, J., Margulies, D. S., Xu, T., & Vanderwal, T. (2022). Cortical gradients during naturalistic processing are hierarchical and modality-specific. bioRxiv. DOI


Samara and colleagues compute low-dimensional gradients in cortical functional connectivity during naturalistic movie-watching and compare these to the well-studied gradients discovered in resting-state connectivity data. They find that movie-watching evokes cortical gradients that differ from resting-state gradients in several interesting ways; for example, frontoparietal and default-mode networks sit surprisingly close to each other at the heteromodal pole of multiple gradients, and language areas exhibit more structure than seen in resting-state. They show that these gradients are reliable across movie stimuli and more strongly correlated with behavior than resting-state gradients. Overall, I think the analyses are convincing, the results are interesting, and the manuscript is well-written. Most of my comments/questions are relatively minor points about clarifying the methodology.

Major comments:

One of the supporting results in this paper is that movie-gradients are reliable across runs containing different movie stimuli. This is alternatively interpreted as “movie-gradients are not sensitive to differences in the content/structure of different movies.” However, we know that traditional within-subject functional connectivity (WSFC) analysis is driven by intrinsic fluctuations that largely obscure stimulus-driven connectivity; intersubject functional connectivity (ISFC) analysis, on the other hand, effectively filters out idiosyncratic fluctuations and isolates stimulus-driven connectivity (Simony et al., 2016). I’m very curious what these gradients would look like (and how reliable they would be across stimuli) if the connectivity matrices were computed using ISFC rather than WSFC. I don’t think this analysis is critical for the findings of the current paper, mainly because these ISFC-based gradients wouldn’t be as directly comparable to the resting-state gradients as the current WSFC-based gradients. However, it could make for an interesting supplementary analysis at the authors’ discretion.
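For concreteness, here’s a minimal numpy sketch of the leave-one-out ISFC computation I have in mind (following Simony et al., 2016). The function and variable names are my own, and the input is assumed to be the same spliced/concatenated regional time series that go into the WSFC matrices:

```python
import numpy as np

def isfc(data):
    """Leave-one-out ISFC (after Simony et al., 2016).

    data : array, shape (n_subjects, n_timepoints, n_regions)
    Returns the symmetrized (n_regions, n_regions) group ISFC matrix.
    """
    n_subj, n_time, _ = data.shape
    # z-score each subject's regional time series over time
    z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    acc = 0.0
    for s in range(n_subj):
        # average time series of all *other* subjects, re-z-scored over time
        others = (z.sum(axis=0) - z[s]) / (n_subj - 1)
        others = (others - others.mean(axis=0)) / others.std(axis=0)
        # correlate subject s's regions with the leave-one-out average
        acc += z[s].T @ others / n_time
    m = acc / n_subj
    # symmetrize so the matrix can enter the same gradient pipeline
    return (m + m.T) / 2
```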

In panel B of Figure 4, are the maps for gradients 2 and 3 switched? In the text and in e.g. Figs. 1C and 2E, we can very clearly see that movie-gradient 2 is occipital/visual and movie-gradient 3 is superior temporal/auditory/language. But the correlations with “cognition” for gradient 2 seem to correspond to superior temporal cortex while the correlations for gradient 3 seem to correspond to occipital visual cortex. If these aren’t flipped, I’d be curious to understand how the authors interpret this apparent mismatch.

The movie scans comprise short ~1–4-minute clips separated by 20 seconds of rest. The authors splice out the movie segments, discarding the rest periods and the 10 seconds following each rest period. I think this is definitely the right choice to avoid the movie responses being contaminated by a widespread stimulus-nonspecific transient response to the onset of each clip. The authors also concatenate volumes across runs before computing connectivity. Do the authors think this splicing/concatenating procedure introduces any artificial/abrupt changes in the signal at the joints? I wonder if it would help to e.g. z-score each segment before re-concatenating the segments.
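Something like the following is what I have in mind (a hypothetical sketch; `splice_and_concatenate` and `segment_slices` are my own stand-ins for however the authors index the retained movie segments):

```python
import numpy as np
from scipy.stats import zscore

def splice_and_concatenate(run, segment_slices):
    """Z-score each movie segment before concatenating.

    run : array, shape (n_timepoints, n_regions), one scan run
    segment_slices : list of slices marking the retained movie segments
        (rest periods and the 10 s following each already excluded)
    """
    # z-scoring each segment removes mean/scale offsets at the joints,
    # though it won't remove abrupt discontinuities in the signal itself
    return np.concatenate([zscore(run[sl], axis=0) for sl in segment_slices],
                          axis=0)
```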

I didn’t fully follow the logic behind the PCA analysis. Was the goal here just to show that you get similar results with a simpler dimensionality reduction technique? When I read the Methods section where you plan to “compare diffusion embedding to a different dimensionality reduction method,” I thought there was going to be a more substantive comparison and was expecting an interesting difference between the two approaches. To be clear, I don’t think a more substantive comparison is really necessary; if the point of the PCA is to provide a control analysis with similar results, I would frame it more explicitly that way in the Methods section.
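For example, if the gradients were computed with BrainSpace (an assumption on my part; I’m not certain which implementation the authors used), I imagine the control analysis amounts to swapping the embedding approach while holding the rest of the pipeline fixed, along these lines:

```python
import numpy as np
from brainspace.gradient import GradientMaps

# toy stand-in for the group connectivity matrix
rng = np.random.default_rng(0)
fc = rng.random((100, 100))
fc = (fc + fc.T) / 2

# diffusion map embedding (the main analysis)
gm_dm = GradientMaps(n_components=10, approach='dm', kernel='normalized_angle')
gm_dm.fit(fc)

# PCA control: identical pipeline, different dimensionality reduction step
gm_pca = GradientMaps(n_components=10, approach='pca', kernel='normalized_angle')
gm_pca.fit(fc)
```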

It may be worth noting (e.g. in the Discussion) that all of the movie clip stimuli are relatively short in duration (less than 5 minutes). This will preclude any temporal dynamics that may emerge over durations on par with a television episode (20–60 minutes) or feature film (1.5+ hrs). There’s nothing the authors can do about this with the current dataset; my point is that perhaps even more interesting structures would emerge from a stimulus with narrative information evolving over longer time-scales.

At line 216, you mention that thresholding the connectivity matrices “results in sparsity and asymmetry in the thresholded matrices.” My understanding is that the connectivity matrices are already symmetric around the diagonal, so I don’t really understand how thresholding could introduce asymmetry. Relatedly, when you say “threshold,” do you mean that you (1) binarize the matrices or (2) threshold the matrices but retain the actual correlation values for the above-threshold cells?
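For what it’s worth, one way I could imagine asymmetry arising is if the threshold is applied row-wise (i.e., retaining each region’s top connections, which I believe is the BrainSpace default); a toy illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
fc = rng.random((5, 5))
fc = (fc + fc.T) / 2                        # symmetric toy connectivity matrix

# row-wise thresholding: keep each row's top 2 values
row_cutoffs = np.sort(fc, axis=1)[:, [-2]]
sparse = np.where(fc >= row_cutoffs, fc, 0)

# row i may retain fc[i, j] while row j drops fc[j, i]
print(np.allclose(sparse, sparse.T))        # typically False
```

If that’s what’s happening here, it might be worth stating explicitly in the text.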

At line 247, you use Neurosynth to retrieve terms associated with “regions of interest created from 10-percentile bins (i.e., the lowest scores) of the first 3 group-level gradients.” I had a hard time following this given the arbitrary polarity of the dimensions. Does this mean the 10th percentile furthest from the heteromodal frontoparietal/default pole for each of the first three gradients?
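To make my confusion concrete: is the ROI for each gradient built roughly like the sketch below (`gradient_map` is a hypothetical array of group-level gradient scores), and if so, what fixes the sign?

```python
import numpy as np

# hypothetical group-level gradient: one score per parcel/vertex
rng = np.random.default_rng(0)
gradient_map = rng.standard_normal(400)

cutoff = np.percentile(gradient_map, 10)
roi = gradient_map <= cutoff   # the "lowest scores" decile
# ...but which pole counts as "lowest" depends on the arbitrary sign
# of the embedding, hence my question about polarity
```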

In the paragraph starting at 292, you describe your cross-validation procedure with grid search for the ridge penalty term lambda implemented in glmnet. Am I correct in understanding that for a given fold of the 10-fold outer cross-validation loop, you performed nested cross-validation where you search across 100 settings of lambda within the 90% training set for that fold? Then, the best-performing lambda parameter from within that nested cross-validation loop is used to re-estimate the model and predict the left-out 10% test set for that outer fold? So you’ll get a single correlation value or MAE (between predicted and actual values for the test set) for each of the 10 outer folds, and then you average these 10 values? Then, you repeat this entire procedure 100 times, shuffling samples each time to ensure you get different cross-validation splits?
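In scikit-learn terms, my reading of the procedure is sketched below (this is my reconstruction, not the authors’ glmnet code; `X` and `y` are toy stand-ins for the gradient features and behavioral scores):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))               # toy feature matrix
y = rng.standard_normal(200)                     # toy behavioral scores

param_grid = {'alpha': np.logspace(-3, 3, 100)}  # 100 candidate lambdas
repeat_scores = []

for repeat in range(100):                        # 100 shuffled repetitions
    outer = KFold(n_splits=10, shuffle=True, random_state=repeat)
    fold_scores = []
    for train, test in outer.split(X):
        # nested CV: grid search over lambda within the 90% training set
        inner = GridSearchCV(Ridge(), param_grid, cv=10)
        inner.fit(X[train], y[train])
        # the best lambda is refit on the full training fold (refit=True
        # is GridSearchCV's default), then evaluated on the held-out 10%
        pred = inner.predict(X[test])
        fold_scores.append(np.corrcoef(pred, y[test])[0, 1])
    repeat_scores.append(np.mean(fold_scores))   # average over the 10 folds
```

If this matches what was done, a sentence or two to this effect in the Methods would remove the ambiguity.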

At line 324, you mention that “the first 10 gradients explained similar variance in Rest and Movie.” It might be worth actually including the percentage of variance accounted for by the first 10 PCs in each condition; it could also be worth providing the numbers in the text on variance accounted for by the first three gradients (corresponding to Figure 1A).
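Assuming a BrainSpace-style fit, these numbers should be straightforward to pull from the eigenvalues of the embedding (with the caveat that for diffusion embedding these are eigenvalue ratios rather than variance explained in the strict PCA sense):

```python
import numpy as np

# gm_dm is a fitted GradientMaps object, as in the earlier sketch
var_ratio = gm_dm.lambdas_ / np.sum(gm_dm.lambdas_)
print(var_ratio[:3].sum(), var_ratio[:10].sum())
```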

Minor comments:

I found the observation that both frontoparietal control and default-mode networks coexist at the heteromodal pole of the first few gradients very interesting. I think the authors’ discussion of this topic (starting line 507) is great. My only additional thought here was that both of these networks may contain regions near the top of the cortical processing hierarchy with relatively long temporal receptive windows (Hasson et al., 2015). It’s really neat that this similarity between the two networks may only emerge in naturalistic scenarios containing relatively slowly-evolving narratives.

Line 260: When you say “aligned,” do you mean aligned using Procrustes as you explain at line 239?

Line 270: “principle” > “principal”? (not sure what this is intended to mean here)

Line 297: “accounting for” > “accounted for”

Line 703: “exhibit” > “exhibits”?

References:

Hasson, U., Chen, J., & Honey, C. J. (2015). Hierarchical process memory: memory as an integral component of information processing. Trends in Cognitive Sciences, 19(6), 304–313. DOI

Simony, E., Honey, C. J., Chen, J., Lositsky, O., Yeshurun, Y., Wiesel, A., & Hasson, U. (2016). Dynamic reconfiguration of the default mode network during narrative comprehension. Nature Communications, 7, 12141. DOI