# Open reviews

## August 29, 2019 — Open review of preprint by Chen and colleagues: PubPeer

Chen, G., Taylor, P. A., Qu, X., Molfese, P. J., Bandettini, P. A., Cox, R. W., & Finn, E. S. (2019). Untangling the relatedness among correlations, part III: inter-subject correlation analysis through Bayesian multilevel modeling for naturalistic scanning. bioRxiv, 655738.

This manuscript by Chen and colleagues is the third installment in a series of authoritative statistical treatments of intersubject correlation (ISC) analysis and represents the culmination of several years of work to establish best practices in ISC analysis. In general, the article is well written and the technical quality is already quite high. In most cases, my comments indicate places where I think further elaboration would be helpful for the less statistically savvy among us. That said, although I’m fairly familiar with simple mixed-effects modeling, I’m not an expert in Bayesian statistics. Some of my comments may stem from general confusion or acute misunderstanding; take this as an indication of where additional explanation might be helpful to readers.

My main comment is to encourage the authors to make the (fairly esoteric) statistical content as accessible/digestible as possible for mere mortals, and thus increase adoption of the method. In the minor comments below, I point out several occasions where I think an additional explanatory sentence would be helpful (e.g., to convey the intuition behind a complex equation). In writing the “BML applied to ISC data” section, I would take the perspective of a graduate student working through this analysis for the first time. For example, you make a series of analysis choices (e.g., comparing models using LOOIC). What is the motivation or precedent for each choice? (Were there alternatives?) Hopefully this can read like a recipe. In the process, I would encourage the authors to take any opportunity they can find to be explicit and prescriptive in making recommendations to the reader in terms of both analysis choices and how to report results.

One of the benefits of abandoning circular time-shift or phase randomization and focusing on the ISC values themselves for statistical analysis is that the same statistical tests can be directly extended to, e.g., spatial ISCs (i.e., intersubject pattern correlations). Do the authors agree? If so, this might be worth pointing out to readers, as it widens the scope of use cases for your procedure. In this case, each ISC value results from a response pattern correlation in an ROI for a given condition or time point (although you may want to introduce an additional random effect for condition/time point). We discuss some of these ISC variants from an introductory perspective in Nastase et al., 2019. (Relatedly, but outside the context of this paper, I suspect many of these analyses are also applicable with minor modification to, e.g., pairwise correlations in representational similarity analysis. I suppose this is related to the authors’ Chen et al., 2019, paper on matrix-based analysis, and may also relate to Cai et al., 2019.)
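To make the temporal/spatial distinction concrete, here is a minimal sketch of pairwise ISC computed on either response time series or response patterns. The function name `pairwise_isc`, the array shapes, and the simulated ROI data are my own illustrative assumptions and are not taken from the manuscript.

```python
import numpy as np

def pairwise_isc(data):
    """Correlate every pair of subjects.

    data: array of shape (n_subjects, n_features), where the features are
    time points (temporal ISC) or voxels within an ROI (spatial ISC).
    Returns the n_subjects * (n_subjects - 1) / 2 pairwise correlations.
    """
    corr = np.corrcoef(data)                      # subjects x subjects matrix
    upper = np.triu_indices(data.shape[0], k=1)   # count each pair once
    return corr[upper]

rng = np.random.default_rng(0)
n_subjects, n_timepoints, n_voxels = 5, 200, 50

# Hypothetical ROI data: a shared signal plus subject-specific noise
shared = rng.standard_normal((n_timepoints, n_voxels))
roi = shared + rng.standard_normal((n_subjects, n_timepoints, n_voxels))

temporal_isc = pairwise_isc(roi.mean(axis=2))   # ROI-average time series
spatial_isc = pairwise_isc(roi[:, 0, :])        # response pattern at one time point
```

Either way, the output is one correlation per subject pair, so the same pairwise values could in principle be passed to the same downstream model.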

In the introduction, you closely associate naturalistic paradigms and ISC analysis. In general, this is fine, as ISC is suitable for naturalistic stimuli and widely used, but I wouldn’t push the association between them too hard, as there are other perfectly suitable non-ISC analyses—e.g., Bartels & Zeki, 2004, Huth et al., 2012, de la Vega et al., 2019, come to mind.

In the introduction, I would also make it obvious that the voxel-/region-wise multiple comparisons problem is not in any way ISC-specific, but does fit nicely into this unified framework.

From the LME perspective, when you introduce random effects and discuss “Further extensions” (line 269), how does this relate to random intercepts vs. slopes? Which of the many possible random-effects structures are most sensible in the context of neuroimaging, and how does this translate to BML? I’m thinking of the recommendation for maximal random-effects structure from Barr et al., 2013. I assume that, given reasonable priors in BML, the convergence problems encountered with LME are less of a concern here. Along these lines, under what circumstances do you expect these models to become computationally intractable (e.g., if instead of 268 ROIs, I foolishly include 50,000 voxels as a random effect)? You discuss this a bit at line 275 (equation 19), but the relationship between number of model parameters and number of samples was confusing to me. This may relate to practical concerns about runtime at lines 360 and 415.

When introducing explanatory variables (e.g., lines 156, 288), I would make it very explicit that this also encompasses simple categorical experimental designs like condition A vs. B or group A vs. B. Is there any discussion of within-subject or paired designs? E.g., I’m imagining a scenario where each subject listens to the same stimulus under two different attentional conditions or task contexts. How does this sort of design fit into the current framework?

Should we be worried about the Gaussian assumption of the LME model (or the Gaussian prior in BML)? I wonder whether we should expect ISCs to be Gaussian-distributed across ROIs; for example, I might expect strong ISCs in a sparse set of ROIs and weak or no effects everywhere else. You discuss this (e.g., line ~528), and the similarity between the LME and BML results suggests there’s no major issue, but I’m still not confident this makes sense in all cases.

The authors mention an AFNI implementation of this procedure in the abstract, but there is no further mention of this later in the manuscript. Does this implementation exist yet? If so, can the authors point to it? If the implementation is finalized, a short appendix providing an example code snippet could be a great addition (at the authors’ discretion).

Line 126: Might be worth also mentioning the time-series phase randomization approach, as this is used in several Hasson Lab papers (e.g., Lerner et al., 2011).
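For readers unfamiliar with phase randomization, a minimal sketch follows. It assumes a single 1-D real-valued time series; the function name `phase_randomize` is hypothetical, and this is not the implementation used in the cited papers.

```python
import numpy as np

def phase_randomize(ts, rng):
    """Return a surrogate with the same amplitude spectrum as `ts`
    but with a random phase shift applied to each frequency."""
    n = ts.shape[0]
    spectrum = np.fft.rfft(ts)
    shifts = rng.uniform(0, 2 * np.pi, spectrum.shape[0])
    shifts[0] = 0.0            # leave the mean (DC component) untouched
    if n % 2 == 0:
        shifts[-1] = 0.0       # Nyquist bin must stay real when n is even
    return np.fft.irfft(spectrum * np.exp(1j * shifts), n=n)

rng = np.random.default_rng(0)
ts = rng.standard_normal(200)        # hypothetical response time series
surrogate = phase_randomize(ts, rng)
```

Recomputing ISC on many such surrogates gives a null distribution that preserves each time series' power spectrum (and hence its autocorrelation) while disrupting its temporal alignment with other subjects.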

Line 133: If I recall correctly, the leave-one-out ISC approach was not evaluated in Chen et al. (2016). Despite the non-independence issue, I’m not sure to what extent it yields inflated FPRs in all contexts (e.g., I would expect it to be largely fine for a paired test).
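For reference, this is roughly what I mean by the leave-one-out approach: each subject is correlated with the average of all remaining subjects. The sketch below is illustrative only; the function name `leave_one_out_isc` and the simulated data are my assumptions.

```python
import numpy as np

def leave_one_out_isc(data):
    """data: array of shape (n_subjects, n_timepoints).
    Correlate each subject with the mean time series of all other subjects."""
    n_subjects = data.shape[0]
    total = data.sum(axis=0)
    iscs = np.empty(n_subjects)
    for s in range(n_subjects):
        others_mean = (total - data[s]) / (n_subjects - 1)
        iscs[s] = np.corrcoef(data[s], others_mean)[0, 1]
    return iscs

rng = np.random.default_rng(0)
shared = rng.standard_normal(200)                  # hypothetical shared signal
data = shared + rng.standard_normal((6, 200))      # six simulated subjects
loo = leave_one_out_isc(data)
```

This yields one value per subject rather than one per pair, but the values are still not independent, because every subject's time series contributes to the averages that the other subjects are compared against.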

Line 154: “therefor” > “therefore”

Lines 155, 274: An additional sentence explaining the relevance of intraclass correlation (ICC) here (or a pointer to an authoritative/explanatory reference) would be appreciated.

Line 188: What does “false assumption” refer to here? False assumption of independence among voxel-/region-wise tests?

Line 243: Can you provide one additional sentence intuitively explaining why this upper bound increases from .5 to 1.0?

Line ~243: “ISC effect each ROI” > “ISC effect at(?) each ROI”

Line 247: Again, can you provide an additional sentence giving us more of an intuitive understanding of the “paradigm shift” here?

Line 252: An additional sentence briefly describing what “multi-membership modeling” means could be useful early on (in addition to the reference), as this seems important and recurs later in the manuscript.

Line ~280: When describing BML1, you introduce the acronym “RP”. Does this indicate “region-pair”? I’m a bit confused trying to relate this back to the use of “SP” in describing LME1. (On second thought, this may be due to my misunderstanding of how LME and BML differently represent the subject pairs.)

Line ~289: “It light” > “In light”

Line 415: Runtime of three weeks seems pretty long; I’m trying to relate this back to the previous comment linking the number of MCMC chains to CPU. Should we be concerned about this compute time?

Line 536: “dichotomize” > “dichotomizes”

Lines 563, 581: “Figures ??” > “Figures 3 and 4”(?)

Line 591: Sentence starting with “For example” could be reworded for clarity.

Line 611: The coarseness of averaging within each ROI could potentially be mitigated by using, e.g., PCA or, better yet, hyperalignment or SRM transformations (derived from independent data) to identify multiple (orthogonal) response time series shared across subjects. This would complicate visualization, but would provide finer-grained information per ROI. (Just a random thought, not intended for inclusion in the manuscript.)

Line 621: This seems particularly problematic for both neighboring ROIs and ROIs that comprise a larger-scale functional network. (I assume that, in the frequentist sense, this roughly corresponds to an assumption that ROIs are independent samples…?) Is there evidence in the current data (e.g., comparable results between BML and LME) that suggests this is not a severe problem?

### References:

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Bartels, A., & Zeki, S. (2004). Functional brain mapping during free viewing of natural scenes. Human Brain Mapping, 21(2), 75–85.

Cai, M. B., Schuck, N. W., Pillow, J. W., & Niv, Y. (2019). Representational structure or task structure? Bias in neural representational similarity analysis and a Bayesian method for reducing bias. PLOS Computational Biology, 15(5), e1006299.

Chen, G., Bürkner, P. C., Taylor, P. A., Li, Z., Yin, L., Glen, D. R., Kinnison, J., Cox, R. W., & Pessoa, L. (2019). An integrative Bayesian approach to matrix‐based analysis in neuroimaging. Human Brain Mapping, 40(14), 4072–4090.

de la Vega, A., Blair, R., Razavi, P., & Yarkoni, T. (2019). neuroscout/neuroscout: 0.3.0.

Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210–1224.

Lerner, Y., Honey, C. J., Silbert, L. J., & Hasson, U. (2011). Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. Journal of Neuroscience, 31(8), 2906–2915.

Nastase, S. A., Gazzola, V., Hasson, U., & Keysers, C. (2019). Measuring shared responses across subjects using intersubject correlation. Social Cognitive and Affective Neuroscience, 14(6), 667–685.