August 20, 2019 — Open review of preprint by Nili and colleagues:
Nili, H., Walther, A., Alink, A., & Kriegeskorte, N. (2016). Inferring exemplar discriminability in brain representations. bioRxiv, 080580.
Nili and colleagues provide a detailed statistical treatment of methods for measuring exemplar discriminability and support their claims with both simulations and real fMRI data. This metric, the “exemplar discriminability index” (EDI, originally coined in Henriksson et al., 2015), summarizes the general discriminability of conditions (rows/columns) in a representational dissimilarity matrix (RDM) without reference to, e.g., a model RDM indexing categories. I think the statistical arguments are generally sound. My main concerns are with framing. The article writes itself into a very niche methodological space when in fact the method is broadly applicable; it just needs to be better situated among related methods and better motivated. The implication of statistical malpractice (which, in practice, turns out not to matter much) is a bit off-putting. Some parts of the manuscript (sections 3.1.4–3.1.7) read a bit like a laundry list, and some parts need to be cleaned up (e.g., missing references, typos). I provide several higher-level comments, as well as a list of more specific items. In general, I think there’s a lot of very useful information here worthy of publication, but the methods might see wider adoption if framed in a more digestible way.
Throughout the manuscript there’s a sort of back-and-forth tension: occasionally the authors are tempted to imply that previous work has committed statistical malpractice, since assumptions of the t-test are violated when applied to EDIs (bad! panic!); but in practice the t-test is sufficiently robust, and these violations have no obvious deleterious effect and do not inflate false positive rates (phew! okay then…). For example, the first sentence of the discussion describes the paper as “Motivated by theoretically invalid t-test that was common in the literature”. This makes for difficult reading and I don’t think statistical doom-saying (when everything is probably fine) is a useful narrative tool. I think it’s great to show that EDIs can violate assumptions under certain conditions (mostly edge cases), but that should be framed as didactic and cautionary. If you motivate this paper as redressing some statistical malpractice in a handful of papers, then show the malpractice was benign anyway, the reader asks: what’s the point of this paper? (It almost comes across as a “null result.”) Instead, from the beginning, I would just frame this as an authoritative treatment of a particular analysis with statistical recommendations.
I was initially unsure about the term “exemplar” (and “category,” for that matter). These are loaded terms. From the perspectives of psychology and neuroscience, I wondered whether “exemplar” referred to specific stimuli, specific objects, or basic-level categories; moreover, what counts as an exemplar of a given category depends on the scope of the category. I also thought of recent work by Clarke and Tyler (2014) that compares object-specific and category-level representations (using RSA in fact, but perhaps not this method specifically). However, what “exemplar” means in this context is specifically each row and column of an RDM, which, psychologically, could correspond to subordinate categories or particular stimuli, or to face identities, viewpoints, or identity–viewpoint combinations, depending on how the RDM is constructed. Framing EDI in contrast to categories or categorical RDMs was a bit confusing, because EDI more generally contrasts with any RSA implementation comparing a neural RDM to a (potentially non-categorical, continuous-valued) RDM based on a computational model, behavior, etc. This should be clarified very explicitly early on. For example, in the discussion, the authors say “Note also that the EDI can detect equidistant representational geometries, to which tests of RDM models using correlation coefficients are not sensitive (since there is no variability in the predicted distances that could be explained by a model RDM)”. I’m not sure about “equidistant” per se, but this distinction was striking because it provided a better framing than much of the introduction. In other words, EDI is a sort of “model-free” or “omnibus” way of characterizing representational geometries. Rather than testing a particular model RDM (whether categorical or not), EDI simply tells us “there exist differences among these conditions”.
I think the readers would appreciate more practical, concrete examples of when and how this analysis could be useful. Again, framing this as a retrospective on a niche method used in a handful of existing papers undercuts the importance of the article. What new types of questions does this method afford us?
The typical EDI you describe relies on split-data RDMs as described by Henriksson and colleagues (2015). It might be worth highlighting the experimental design considerations to keep in mind if you plan to compute split-data RDMs. For example, you may want more presentations of a given stimulus/condition per run since you cannot aggregate across all runs in a single GLM. Furthermore, if you’re willing to sacrifice individual-specific response topographies (or if you’re using hyperalignment!), you can compute split-data RDMs across subjects instead of across runs (as we mention in Nastase et al., 2019).
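To make the design point concrete, here is a minimal sketch of a split-data RDM and the EDI computed from it. It assumes correlation distance, two independent splits (runs, or subjects if response topographies are aligned), and the EDI taken as the mean between-exemplar distance minus the mean within-exemplar distance. The function names `split_data_rdm` and `edi` and the simulated data are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def split_data_rdm(patterns_a, patterns_b):
    """Correlation-distance RDM between two independent data splits.

    Both inputs are (n_exemplars, n_voxels) arrays with rows in the same
    exemplar order. Entry (i, j) is 1 - Pearson r between exemplar i in
    split A and exemplar j in split B.
    """
    a = np.asarray(patterns_a, dtype=float)
    b = np.asarray(patterns_b, dtype=float)
    # Center and scale each pattern so the dot product equals Pearson r.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def edi(rdm):
    """Mean off-diagonal (between-exemplar) minus mean diagonal (within-exemplar) distance."""
    n = rdm.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    return rdm[off_diagonal].mean() - np.diag(rdm).mean()

# Simulated example (assumption): 8 exemplars, 50 voxels, one shared
# exemplar-specific signal plus independent noise in each split.
rng = np.random.default_rng(0)
signal = rng.standard_normal((8, 50))
split_a = signal + rng.standard_normal((8, 50))
split_b = signal + rng.standard_normal((8, 50))

rdm = split_data_rdm(split_a, split_b)
print("EDI:", edi(rdm))
```

Because each split has to yield its own pattern estimate per exemplar, the number of presentations per exemplar within a split bounds how noisy the diagonal of this matrix will be.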
Aside from condition-label randomization tests, the more typical t-test on the EDI treats exemplar/stimulus as a fixed effect and can be skewed by a handful of outstanding exemplars or pairs. I’m wondering if a stimulus bootstrap, as you discuss in Nili et al., 2014—i.e., resampling rows/columns of the RDM with replacement prior to computing EDI—could provide some traction for inference over stimuli. For example, a tight stimulus-bootstrap CI that doesn’t span zero would suggest generalization across stimuli; whereas a tight subject-bootstrap CI suggests generalization to the population of subjects. This is somewhat different from condition-label randomization, which tests H0: there is no systematic association between conditions and labels.
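A stimulus bootstrap of this kind could be sketched as follows. This is only an illustration under stated assumptions: the split-data RDM is square with within-exemplar distances on the diagonal, cells that pair an exemplar with a duplicate of itself are excluded from the between-exemplar average, and a percentile interval is used. The function name `bootstrap_edi` and the toy RDM are hypothetical.

```python
import numpy as np

def bootstrap_edi(rdm, n_boot=1000, seed=0):
    """Bootstrap distribution of the EDI obtained by resampling exemplars."""
    rng = np.random.default_rng(seed)
    n = rdm.shape[0]
    within = np.diag(rdm)
    estimates = np.full(n_boot, np.nan)
    for b in range(n_boot):
        # Resample exemplars (rows and columns together) with replacement.
        idx = rng.integers(0, n, size=n)
        resampled = rdm[np.ix_(idx, idx)]
        # A cell pairing an exemplar with a duplicate of itself is a
        # within-exemplar distance, so it is left out of the between average.
        between = idx[:, None] != idx[None, :]
        if between.any():  # stays NaN if one exemplar filled the whole sample
            estimates[b] = resampled[between].mean() - within[idx].mean()
    return estimates

# Toy split-data RDM (assumption): between-exemplar distances near 1.0,
# within-exemplar distances near 0.5.
rng = np.random.default_rng(1)
n_exemplars = 10
toy_rdm = 1.0 + 0.1 * rng.standard_normal((n_exemplars, n_exemplars))
toy_rdm[np.diag_indices(n_exemplars)] = 0.5 + 0.1 * rng.standard_normal(n_exemplars)

estimates = bootstrap_edi(toy_rdm)
ci_low, ci_high = np.nanpercentile(estimates, [2.5, 97.5])
print("95% stimulus-bootstrap CI:", ci_low, ci_high)
```

The same loop applied over subjects (resampling subject-level EDIs with replacement) would give the subject-bootstrap interval mentioned above.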
The effect of multivariate noise normalization (in section 3.2 and figure 10) seems huge. I didn’t find any very satisfying explanation for this. Does your method of counting significant results (proportion of p < .05 out of 42 tests) inflate the apparent benefit of noise normalization? Is there a way to visualize the effect sizes rather than the discretized significance count?
I found Section 3.1.4 a bit confusing. I can’t tell whether we want a metric that’s sensitive to distributional variance or not; I get that there may be a neural interpretation for this and some people want maximal sensitivity, but I’m having a hard time envisioning what this neural interpretation would be. Also, does this argument hold for distance metrics, like correlation distance, that scale each pattern to unit variance?
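As a small illustration of why the last question arises: correlation distance is unchanged when one pattern is rescaled, whereas Euclidean distance is not. The sketch below uses simulated patterns and hypothetical helper names; it makes no claim about the specific simulations in section 3.1.4.

```python
import numpy as np

def correlation_distance(x, y):
    """1 - Pearson correlation between two response patterns."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

def euclidean_distance(x, y):
    """Euclidean distance between two response patterns."""
    return np.linalg.norm(x - y)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = rng.standard_normal(100)
x_scaled = 3.0 * x  # same pattern shape, larger variance across voxels

print("correlation:", correlation_distance(x, y), correlation_distance(x_scaled, y))
print("Euclidean:  ", euclidean_distance(x, y), euclidean_distance(x_scaled, y))
```

If the argument in section 3.1.4 depends on differences in pattern variance, a metric that removes those differences by construction would presumably behave differently.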
Page 2, line 28: The reference to Kay et al., 2008, and Mitchell et al., 2008, with regard to exemplar-specific pattern estimation seems odd, as these papers both exemplify forward encoding models and it requires some mental gymnastics to interpret them in the framework of exemplar-level RSA. Also, note that these references are not included in the reference list.
Page 2, line 31: You say “within-category effects are often subtle”—is this necessarily true?
Page 2, line 41: I would expand the list of references on the topic of individual face (identity) representation to better reflect the evolving literature on this topic, e.g., Natu et al., 2010; Goesaert & Op de Beeck, 2013; Verosky et al., 2013; Axelrod & Yovel, 2015; Guntupalli et al., 2017; Visconti di Oleggio Castello et al., 2017.
Page 3, line 31: “use” > “used”
Page 3, line 55: “Hendriksson” > “Henriksson”
Page 3, line 55: Alink et al., 2015, is not in the reference list. I’ve included what I think is the correct reference below.
Page 4, line 53: What does “group-level inference with subject as fixed effect” mean? Do you mean inference on, e.g., stimuli while treating subject as a fixed effect? To me, the term “group-level inference” implies “inference on the population from which subjects are sampled” and entails treating subject as a random effect—but I think this is my misunderstanding.
Page 4, figure 1: I would include the term “split-data RDM” in the figure itself (not just the caption) to aid interpretation.
Page 7, line 20: Are these fMRI data previously published? Are they published in Alink et al., 2013? Note that this reference is not in the reference list—and I’m not sure what it’s intended to refer to.
Page 8, line 22: “Geometricaly” > “Geometrically”
Page 13, line 12: “unsignificant” > “non-significant”
Page 13, equation 3: When computing p-values from a permutation-based Monte Carlo null distribution, I would recommend adding 1 to the numerator and denominator to account for the observed value and ensure you don’t get a p-value of zero, according to Phipson & Smyth, 2010.
Page 14, line 24: I didn’t completely follow your method of assigning p-values starting with “We obtain p-values…”
Page 14, line 44: I would also include the terms “false positive” and “false negative” in parentheses here for the sake of clarity.
Page 15, line 47: “allones” > “all-ones”?
Page 18, line 18: I would say “subject bootstrapping (resampling subjects with replacement)”
Page 21, figure 6: The blue dots in the top panel (and to a lesser extent the red dots in the bottom panel) are too small and difficult to see.
Page 27, line 57: Add date for Allefeld citation (2016).
Page 30, line 24: I’m not sure I agree with the use of “subtle” here. EDI is, in the most general case, an omnibus summary statistic across all pairs in the RDM—this doesn’t strike me as subtle. However, I understand that constraining EDI to particular cells of the RDM can get at subtler exemplar distinctions.
Page 32, line 32: The title of section 4.4 suggests that multivariate noise normalization is part of the “test” (i.e., t-test, signed-rank test), when it seems like part of the distance measurement.
Page 34, line 41: “linient” > “lenient”
Page 34, line 47: When using linear regression to model a neural RDM as a weighted sum of model RDMs, we have used rank regression to better approximate the motivation for using Spearman correlation when comparing two RDMs (Nastase et al., 2017).
Page 34, line 48: “similar” > “Similar”
Page 35, line 53: I would also point to work by Aguirre (2007) on, e.g., Type 1 Index 1 sequences for avoiding trial order effects in RSA (if the number of exemplars is small enough).
Page 37, line 54: I’m not sure I’d say “assumption-free” even in the context of a randomization test.
Page 38, line 17: Is Allefeld et al., 2014, cited in the text?
Page 38, line 47: I don’t think Diedrichsen & Kriegeskorte, 2017 is cited in the text.
Page 41, line 17: Add updated issue, pages, etc., to the Walther reference.
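The correction suggested above for equation 3 (Phipson & Smyth, 2010) can be sketched as follows. This assumes a one-sided test in which the null distribution is built by shuffling exemplar labels in one data split, i.e., permuting the rows of a split-data RDM; the function names and the toy RDM are hypothetical and may differ from the authors’ exact procedure.

```python
import numpy as np

def edi(rdm):
    """Mean off-diagonal (between-exemplar) minus mean diagonal (within-exemplar) distance."""
    n = rdm.shape[0]
    return rdm[~np.eye(n, dtype=bool)].mean() - np.diag(rdm).mean()

def randomization_p_value(rdm, n_perm=999, seed=0):
    """One-sided Monte Carlo p-value with the +1 correction."""
    rng = np.random.default_rng(seed)
    n = rdm.shape[0]
    observed = edi(rdm)
    # Assumed null: shuffle exemplar labels in one split (permute RDM rows).
    null = np.array([edi(rdm[rng.permutation(n), :]) for _ in range(n_perm)])
    n_extreme = np.sum(null >= observed)
    # Adding 1 to numerator and denominator counts the observed value as
    # one draw from the null, so the p-value can never be exactly zero.
    return (n_extreme + 1) / (n_perm + 1)

# Toy split-data RDM (assumption): within-exemplar distances clearly
# smaller than between-exemplar distances.
rng = np.random.default_rng(1)
n_exemplars = 10
toy_rdm = 1.0 + 0.1 * rng.standard_normal((n_exemplars, n_exemplars))
toy_rdm[np.diag_indices(n_exemplars)] = 0.5 + 0.1 * rng.standard_normal(n_exemplars)

p = randomization_p_value(toy_rdm)
print("p =", p)
```

With 999 permutations the smallest attainable p-value is 1/1000, which also makes explicit how the number of permutations limits the resolution of the test.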
Aguirre, G. K. (2007). Continuous carry-over designs for fMRI. NeuroImage, 35(4), 1480–1494.
Alink, A., Walther, A., Krugliak, A., van den Bosch, J. J., & Kriegeskorte, N. (2015). Mind the drift - improving sensitivity to fMRI pattern information by accounting for temporal pattern drift. bioRxiv, 032391.
Axelrod, V., & Yovel, G. (2015). Successful decoding of famous faces in the fusiform face area. PLOS One, 10(2), e0117126.
Clarke, A., & Tyler, L. K. (2014). Object-specific semantic coding in human perirhinal cortex. Journal of Neuroscience, 34(14), 4766–4775.
Henriksson, L., Khaligh-Razavi, S. M., Kay, K., & Kriegeskorte, N. (2015). Visual representations are dominated by intrinsic fluctuations correlated between areas. NeuroImage, 114, 275–286.
Natu, V. S., Jiang, F., Narvekar, A., Keshvari, S., Blanz, V., & O’Toole, A. J. (2010). Dissociable neural patterns of facial identity across changes in viewpoint. Journal of Cognitive Neuroscience, 22(7), 1570–1582.
Goesaert, E., & Op de Beeck, H. P. (2013). Representations of facial identity information in the ventral visual stream investigated with multivoxel pattern analyses. Journal of Neuroscience, 33(19), 8549–8558.
Guntupalli, J. S., Wheeler, K. G., & Gobbini, M. I. (2017). Disentangling the representation of identity from head view along the human face processing pathway. Cerebral Cortex, 27(1), 46–53.
Nastase, S. A., Connolly, A. C., Oosterhof, N. N., Halchenko, Y. O., Guntupalli, J. S., Visconti di Oleggio Castello, M., Gors, J., Gobbini, M. I., & Haxby, J. V. (2017). Attention selectively reshapes the geometry of distributed semantic representation. Cerebral Cortex, 27(8), 4277–4291.
Nastase, S. A., Gazzola, V., Hasson, U., & Keysers, C. (2019). Measuring shared responses across subjects using intersubject correlation. Social Cognitive and Affective Neuroscience, 14(6), 667–685.
Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLOS Computational Biology, 10(4), e1003553.
Phipson, B., & Smyth, G. K. (2010). Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 9(1), 1585.
Verosky, S. C., Todorov, A., & Turk-Browne, N. B. (2013). Representations of individuals in ventral temporal cortex defined by faces and biographies. Neuropsychologia, 51(11), 2100–2108.
Visconti di Oleggio Castello, M., Halchenko, Y. O., Guntupalli, J. S., Gors, J. D., & Gobbini, M. I. (2017). The neural representation of personally familiar and unfamiliar faces in the distributed system for face perception. Scientific Reports, 7(1), 12237.