Open reviews

April 13, 2024 — Open review of Borovska & de Haas:

Borovska, P., & de Haas, B. Individual gaze shapes diverging representations in inferior temporal cortex. PsyArXiv. DOI


Borovska and de Haas presented subjects with a naturalistic movie stimulus under both free-viewing and central-fixation tasks. They found that, while the free-viewing task yields higher response magnitudes in ventral temporal (VT) cortex, the central-fixation task yields better aligned response patterns across subjects. This suggests that individual differences in gaze allocation yield divergent information represented in VT. They support this claim by showing that individual differences in gaze trajectories correlate with individual differences in VT representational alignment. This is a concise manuscript and an interesting result. In general, I find the results convincing, but I outline some potential hangups below.

Major comments:

In both the main text and the Supplementary Methods, it wasn’t clear to me how exactly you’re doing the “nearest neighbor” classification. We need to know what kind of distance metric you use to compute nearest neighbors. For example, in the canonical example from Haxby et al., 2011 (page 415), they use a correlation-based one-nearest neighbor classifier; that is, they compute the correlation between all pairs of segments across subjects, then score a “hit” for a given segment if the “predicted” pattern has the highest correlation with the corresponding pattern in the test subject. Note that the choice of distance metric has some important implications. For example, correlation internally mean-centers and scales the two input patterns, effectively ruling out overall (regional-average) response magnitude; Euclidean distance on the other hand does not normalize the input patterns in this way, meaning classification can be driven by overall response magnitude (Misaki et al., 2010). This seems critical to be very clear about given that the free-viewing and fixation conditions yield different response magnitudes (Figure 1c).

I’m having a hard time understanding your operationalization of “representational divergence” via the “(negative) cross-brain decoding accuracy in the fixation condition, after regressing out the corresponding accuracy in the free-viewing condition”. What exactly is being regressed out of what here? You’re regressing out the 171 (negative) free-viewing accuracies (corresponding to pairs of subjects) out of the 171 (negative) fixation accuracies? What’s the reasoning behind regressing out the free-viewing condition? I was wondering if there might be any more direct way to construct the intersubject representational divergence matrix based on continuous distance between response vectors rather than accuracies, but I’m not sure it’ll make much difference. Note that this approach is reminiscent of “intersubject representational similarity analysis” or IS-RSA (Finn et al., 2020).

In the main text, I had a hard time understanding the structure of the central-fixation versus free-viewing task structure, and how/when it relates to the separate eye-tracking sessions. Should we be worried about potential habituation effects between viewing the movie in the scanner and viewing the movie for eye-tracking? For example, there may be memory-related processing on the second viewing of a naturalistic stimulus (Michelmann et al., 2021). What is the within-subject reliability of eye movements for repeated viewing of a naturalistic movie stimulus? I think the design largely controls for these kinds of concerns, but it wasn’t completely clear to me in the writing.

I’m not sure it’s statistically valid to run a t-test across pairs of subjects in this way. Given that there are truly only N = 19 subjects, the 171 pairs are highly non-independent. Supplying the pairs to a statistical test as if they were independent samples with N = 171 will artificially inflate the degrees of freedom and increase false positive rates.

I’m curious about what the between-subject movie-segment classification performance looks like without any hyperalignment. The free-viewing classification with hyperalignment here is at 36% accuracy, while Haxby et al., 2011, report an accuracy close to 70% with hyperalignment (whereas no hyperalignment yields ~30%). I don’t think there’s any straightforward way to compare your accuracies to the Haxby paper, given the difference in hyperalignment implementation (e.g. pairwise vs common space) and other parameters (e.g. number of segments). Providing the non-hyperaligned between-subject accuracy could provide a useful point of reference; probably not be necessary for the main text figure, but could be worth reporting in the supplement text.

Enforcing central fixation dramatically reduces the ecological validity of the movie-viewing task—clearly this is not how we watch movies or perceive the world around us. How does this factor into the authors’ interpretation of the results? Could this be driving the lower response magnitudes for central fixation?

Minor comments:

I would make sure to always notate the chance accuracy as “0.24%” to ensure nobody confuses it with a proportion of 0.24 (i.e. 24%). In Figure 1b and the caption, you simply use 0.24 instead of 0.24% and I had to do a double-take.

What are the black outlines in posterior visual cortex in Figure 1c?

I’m not sure the histogram in Figure 1c is really telling us much more than we already get from the brain map.

Supplementary Methods: I would specify that the acceleration factor is “multiband” or “simultaneous multislice” (I assume) to differentiate from in-plane acceleration like GRAPPA.

Supplementary Methods: I’m not sure 4 mm smoothing is really necessary with hyperalignment, but I think there’s some precedent for doing some minimal smoothing for MVPA analyses.

The Grill-Spector & Weiner (2014) reference 13 has the wrong journal name (“Bone”).

References:

Finn, E. S., Glerean, E., Khojandi, A. Y., Nielson, D., Molfese, P. J., Handwerker, D. A., & Bandettini, P. A. (2020). Idiosynchrony: from shared responses to individual differences during naturalistic neuroimaging. NeuroImage, 215, 116828. DOI

Grill-Spector, K., & Weiner, K. S. (2014). The functional architecture of the ventral temporal cortex and its role in categorization. Nature Reviews Neuroscience, 15(8), 536–548. DOI

Haxby, J. V., Guntupalli, J. S., Connolly, A. C., Halchenko, Y. O., Conroy, B. R., Gobbini, M. I., Hanke, M., & Ramadge, P. J. (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2), 404-416. DOI

Michelmann, S., Price, A. R., Aubrey, B., Strauss, C. K., Doyle, W. K., Friedman, D., Dugan, P. C., Devinsky, O., Devore, S., Flinker, A., Hasson, U., & Norman, K. A. (2021). Moment-by-moment tracking of naturalistic learning and its underlying hippocampo-cortical interactions. Nature Communications, 12, 5394. DOI

Misaki, M., Kim, Y., Bandettini, P. A., & Kriegeskorte, N. (2010). Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. NeuroImage, 53(1), 103–118. DOI