Open reviews

June 18, 2021 — Open review of preprint by Chen and colleagues: PubPeer

Chen, A. A., Beer, J. C., Tustison, N. J., Cook, P. A., Shinohara, R. T., Shou, H. & the Alzheimer’s Disease Neuroimaging Initiative (2020). Removal of scanner effects in covariance improves multivariate pattern analysis in neuroimaging data. bioRxiv, 858415. DOI


Chen and colleagues introduce their CovBat method for harmonizing scanner effects embedded in the covariance structure across brain features with the goal of improving multivariate analyses. Using both public ADNI cortical thickness data and simulated data, they demonstrate that CovBat mitigates scanner-specific covariance structure, reduces scanner identifiability, and does not negatively affect prediction of clinical/demographic variables (Alzheimer’s disease, sex, age). The manuscript is well-written and the methods seem solid. I have a couple of questions about methodological details and interpretation/applicability.

Major comments:

One of the core results here is the following: CovBat reduces scanner identifiability and improves over no harmonization, but does not improve the prediction of clinical/demographic variables over ComBat. This is not always clear in the text. For example, at line 104, you say CovBat “improves detection of meaningful associations”, but it doesn’t really improve over ComBat, just over no harmonization at all—right? Why do the authors think this is the case? If the scanner-based covariance effects are contaminating the signal, shouldn’t removing them improve prediction? Or is it that the scanner-based covariance effects do not interact with whatever clinically-relevant signals the multivariate analysis picks up on?

When you introduce multivariate analyses in the Introduction (as opposed to “mass univariate testing”), it occurs to me that another distinction here is that mass univariate analyses are typically within-sample estimation analyses, whereas multivariate analyses typically rely on out-of-sample generalization (e.g. cross-validation; Poldrack et al., 2020). How does this distinction factor into the current method? For example, does simply cross-validating across scanners similarly mitigate the problem of scanner-specific covariance?
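To make concrete what I mean by cross-validating across scanners, here is a minimal sketch of a leave-one-scanner-out scheme (assuming a scikit-learn-style workflow; the feature matrix, labels, and scanner IDs below are hypothetical placeholders, not the ADNI data):

```python
# Minimal sketch of leave-one-scanner-out cross-validation (placeholder data).
# Each fold trains on all scanners but one and tests on the held-out scanner,
# so scanner-specific structure in the training set cannot inflate the test score.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_subjects, n_features, n_scanners = 120, 62, 6          # e.g., 62 cortical thickness features
X = rng.normal(size=(n_subjects, n_features))             # features (placeholder)
y = rng.integers(0, 2, size=n_subjects)                   # e.g., diagnosis labels (placeholder)
scanner = rng.integers(0, n_scanners, size=n_subjects)    # scanner ID per subject (placeholder)

logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=scanner, cv=logo)
print(f"Mean out-of-scanner accuracy: {scores.mean():.2f}")
```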

The definition of “scanner” (e.g. at line 128) comprises a unique combination of site, manufacturer, model, head coil, etc.—and is of course conflated with subject sample. I’m not overly familiar with the harmonization literature, but is this the typical way of defining “scanner”? For example, this yields 142(!) different “scanners” in the ADNI dataset, and even when dropping “scanners” with fewer than three subjects, leaves us with 64 “scanners” across 505 subjects. I assume this is the standard approach, but I found it striking. Again, this makes me wonder whether this generalization problem could also be addressed by cross-validating across scanners.

At line 214, you capture the covariance structure using 37 PCs that account for 95% of the variance. When I first saw this, the choice seemed fairly arbitrary; why not simply use all PCs? Then at line 328, the number of PCs capturing 95% of the variance switches to 23, but I’m not entirely sure what about the data changed… maybe this is due to constraining the analysis to the three largest groups? At line 346 and Figure 4, you vary the number of PCs to capture from 50% to 100% of the variance (in simulated data) and everything seems to behave systematically. Is there some benefit to using PCs accounting for 95% of the variance (e.g. dropping noisy low-eigenvalue PCs)? The line of reasoning through these points in the manuscript could be made a bit clearer.
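For what it’s worth, a 95% threshold amounts to keeping the smallest number of leading PCs whose cumulative explained variance reaches 0.95, so the retained count will naturally shift whenever the sample (and hence the eigenspectrum) changes. A minimal sketch of that selection rule, on placeholder data rather than the authors’ pipeline:

```python
# Minimal sketch: keep the smallest number of PCs whose cumulative
# explained variance reaches a target threshold (here 95%), placeholder data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(505, 62))      # placeholder for subjects x cortical-thickness features

pca = PCA().fit(X)                  # fit all components
cumvar = np.cumsum(pca.explained_variance_ratio_)
k_95 = int(np.searchsorted(cumvar, 0.95) + 1)
# (equivalently, PCA(n_components=0.95) selects this count automatically)
print(f"{k_95} of {X.shape[1]} PCs account for 95% of the variance")
```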

Starting with section 3.2, the move from the restricted dataset comprising three scanners to the full ADNI dataset wasn’t totally clear to me. Do you restrict yourself to the three largest subdatasets just for the purposes of visualizing covariance matrices? When you move to the full dataset, you’re estimating CovBat for each (of many) scanners, right? This could be made a bit clearer in the narrative of the Results section.

It might be worth including a bit more pragmatic advice in the “Software” section. For example, how well does this scale to larger samples in terms of runtime, parallelization, and memory requirements?

Minor comments:

Figure 1: The row/column labels here are pretty small (i.e. unreadable without a 300% zoom); not sure there’s any easy way to fix this… any way to group or summarize labels?

Line 69: Might be worth adding a few more references here so readers understand the breadth of applications; for example, I was expecting to see Wager et al. (2013) here. It might also be worth differentiating functional (e.g. Wager et al., 2013) and structural (e.g. Koutsouleris et al., 2014, and possibly others) applications.

Line 87: “to protect for clinical effects”—I didn’t fully understand this wording.

Line 101: “manufacturer” is not strictly what’s being differentiated here.

Line 123: Where exactly do the “62 cortical thickness values” come from? I assume these correspond to brain areas, but is this based on some particular atlas or parcellation?

Line 134: Are the groups of subjects assigned the AD and LMCI labels non-overlapping? (I assume so.)

Line 235: Can you provide one sentence to motivate the use of a gamma distribution here? (for math-impaired readers like me)

Line 390: What does this distance “3.14” mean? (i.e. are there units? sum squared error?)

Line 413: Why does CovBat not “entirely remove the severe covariance site effect”?

References:

Poldrack, R. A., Huckins, G., & Varoquaux, G. (2020). Establishment of best practices for evidence for prediction: a review. JAMA Psychiatry, 77(5), 534-540. DOI

Wager, T. D., Atlas, L. Y., Lindquist, M. A., Roy, M., Woo, C. W., & Kross, E. (2013). An fMRI-based neurologic signature of physical pain. New England Journal of Medicine, 368(15), 1388-1397. DOI