# Open reviews

### July 23, 2020 — Open review of preprint by Hawco and colleagues: PubPeer

Hawco, C. S., Dickie, E. W., Jacobs, G., Daskalakis, Z. J., & Voineskos, A. N. (2020). Moving beyond the mean: subgroups and dimensions of brain activity and cognitive performance across domains. bioRxiv. DOI

Hawco and colleagues use HCP task data to demonstrate that there is considerable heterogeneity among individual-participant contrast maps—this rich individual variability usually gets collapsed into a single group-level average map. The authors begin with an exploratory cluster analysis, then use a resampling analysis to demonstrate that the clusters are fuzzy, and that individual variability may be better understood in terms of gradients (using dimensionality reduction). The narrative of the analysis is a bit convoluted, with some tension between discrete clustering and the gradient perspective; and the principal distinction between “deactivation” and “hyperactivation” subjects is difficult to interpret. I try to provide some suggestions in hopes of clearing up these sticking points. That said, this is a rigorous and interesting manuscript, and I look forward to seeing it published.

I have a hard time understanding what it could mean for participants to vary in global (roughly whole-brain) activation/deactivation for a given contrast. Is there a neurobiological interpretation for this? Or is it the result of statistical procedures used in this manuscript or in the preceding HCP pipeline? My first thought was that this may have to do with asynchronous hemodynamic responses relative to the design matrices used in modeling, but the authors provide convincing evidence this is not the case (at least for the WM task; Figure 9). My second thought has to do with using t-values (rather than “beta” coefficients) as input: the t-value is scaled by the variance over TRs, meaning that subjects who are more noisy over time (particularly in a “global” way, e.g. due to head motion) might yield generally lower t-values for a given contrast. Is this a possibility? And if so, can the authors rule it out? I think the later pseudo-simulation where individual contrast maps are de-meaned may relate to this, but in a roundabout way. Another way to approach this might be to examine to what extent these clusters relate to available confound variables (e.g. FD). In any case, I appreciate that the authors show that the clusters are related to behavioral performance and test scores, and that the seemingly global differences are more “dense” in task-relevant areas. I still “buy” the results—just trying to rule out any other alternative explanations.

There’s a particular narrative arc to the analyses in the manuscript: the authors start in the clustering mindset, but then discover that the clusters are pretty fuzzy and that the heterogeneity among subjects may be better described in terms of dimensionality reduction or gradients. I assume this reflects the thought process of the authors as they explored the data and I appreciate this transparency. But the downstream “gradient”-oriented analyses seem to inherit baggage from the preceding cluster analyses. For example, what if instead of these somewhat more difficult-to-interpret N x N cluster probability matrices, we explored the original N x N contrast (dis)similarity matrix? By this I mean the N x N matrix where each value is the Euclidean distance (for example) between the contrast maps for a given pair of subjects in a given task. Isn’t this matrix implied by the initial hierarchical cluster analysis using Euclidean distance? It would be interesting to directly visualize these N x N (dis)similarity matrices, and I wonder whether we might get some complementary insights if the later dimensionality reduction or gradient-oriented analyses were applied to these matrices.

The structure of the ANOVAs was not immediately clear to me when you introduce them on page 8 or 13. I would make it clear (give an example) that for a given cluster solution at e.g. k = 4, you’re implementing a one-way ANOVA where cluster assignment is a factor (IV) with 4 levels and the outcome variable (DV) is a given cognitive test score.

I’m a little confused about how the matrices are reordered in Figure 3. You say “each matrix was individually reordered to maximize the sum of the similarity between adjacent participants.” If I understand correctly, this will de-cluster the participants—right? Is this the desired effect? Could you reorder them in this way while maintaining cluster assignments (e.g. within each cluster)? When first looking at these matrices I thought it might be nice to overlay outlines of the actual diagonal cluster blocks, but then I realized the samples within a cluster may no longer be adjacent. I understand that the point is to see the relatively smooth gradient of similarity among participants, but it might be helpful to explicitly say that this disrupts the cluster assignments.

When working through the methods, a few of the methodological choices read as somewhat arbitrary. For example: Why use Ward linkage? Why k = 2–10? Why 75% of subjects in each resampling iteration? And so on. In most of these cases, I think your choices are probably reasonable defaults or there’s some precedent in the literature (e.g. Ward linkage), or they cover a tractable range of options (e.g. k = 2–10). But it still might help the reader if you better motivate these choices, even if it’s just a few words or a reference indicating this choice is commonly used or whatever.

Some of the language in the Discussion about manifolds and low-dimensionality gives the sense that the individual variability in contrast maps is intrinsically low-dimensional. I agree that it probably is, but I’m not sure this fully justified by the analyses in the manuscript. Most of the analyses simply select a low-dimensional solution a priori for visualization (3 dimensions; cf. Supplemental Figure 14). Have the authors systematically evaluated dimensionality reduction—ideally directly on the actual N x N contrast (dis)similarity matrices—to evaluate variance accounted for or using some cross-validated fit metric at varying dimensionality? I’d be curious if the first few PCs account for most of the variance in the N x N contrast (dis)similarity matrix, or if there’s an obvious elbow in the scree plot. Some analysis along these lines would provide more direct evidence that the individual variability is intrinsically low-dimensional. Alternatively, the authors might opt to be more careful with this language in the Discussion.

This is a very pedantic comment, but is it correct to refer to this resampling analysis as a “bootstrap” if you’re randomly subsampling (75% of subjects per sample) without replacement? My understanding was that the bootstrap entails random samples with replacement (where typically each random sample is the same N as the original sample). To be clear, it doesn’t make sense to resample with replacement in the current context—the current approach seems fine. I’m just curious about the terminology and will leave it up to the authors whether a more generic term like “resampling” might be more appropriate than “bootstrap.”

Page 3, line 8: When you first introduce this, it’s not obvious to me what you mean by a “‘positive’ to ‘negative’ pattern”—like activation vs deactivation? Global or localized?

Page 6, line 18: “For each participant, the HCP provided t-maps for each task were used.” This sentence reads awkwardly to me.

Page 6, line 19: Might be worthing making very explicit here that you’re using first-level t-values (not group-level t-values) where “variance” is across time points (e.g. df = number of TRs).

Page 6, line 24: It might be helpful for some readers if you describe the input dimensions of the data here (and in other passages where it might be helpful).

Page 9, line 24: “age-adjusted”? How?

Page 10, line 19: “forth” > “fourth”

Page 16, lines 1–3: Can you unpack this “maximized the shared variance” bit for motivating the choice of k = 4?

Figure 6: The network outlines aren’t really very visible here… I wonder if they’re really necessary, or if you could just include a legend brain with the networks labeled?

Page 21, line 15: What “distribution” are you referring to here?

This is beyond the scope of the current manuscript, but I’m curious how much of this heterogeneity among subjects might be due to these artificial laboratory tasks. Recent work comes to mind suggesting that these cognitive tasks have poor reliability (Elliott et al., 2020) and do not capture self-report or real-world behavior (Eisenberg et al., 2019). I wonder whether more naturalistic stimuli would yield lower heterogeneity (although they don’t contain obvious “contrasts”).

When you introduce the classical method of averaging across subjects, our work on hyperalignment as a framework for modeling both common and idiosyncratic functional topographies came to mind (e.g. Haxby et al., 2020). Hyperalignment is geared toward accommodating fine-grained functional idiosyncrasies, so I’m not entirely how it would handle the relatively smooth global activations/deactivations observed here.

### References:

Eisenberg, I. W., Bissett, P. G., Enkavi, A. Z., Li, J., MacKinnon, D. P., Marsch, L. A., & Poldrack, R. A. (2019). Uncovering the structure of self-regulation through data-driven ontology discovery. Nature Communications, 10(1), 1-13. DOI

Elliott, M. L., Knodt, A. R., Ireland, D., Morris, M. L., Poulton, R., Ramrakha, S., Sison, M. L., Moffitt, T. E., Caspi, A., & Hariri, A. R. (2020). What is the test-retest reliability of common task-functional MRI measures? New empirical evidence and a meta-analysis. Psychological Science, 0956797620916786. DOI

Haxby, J. V., Guntupalli, J. S., Nastase, S. A., & Feilong, M. (2020). Hyperalignment: modeling shared information encoded in idiosyncratic cortical topographies. eLife, 9, e56601. DOI