January 6, 2021 — Open review of preprint by Bazeille, DuPre, and colleagues:
Bazeille, T., DuPre, E., Poline, J. B., & Thirion, B. (2020). An empirical evaluation of functional alignment using inter-subject decoding. bioRxiv.
Bazeille, DuPre, and colleagues evaluate several functional alignment algorithms on public datasets. They use both whole-brain and ROI-based inter-subject decoding on independent data to measure the quality of alignment (alongside several other supplementary analyses). Broadly, they demonstrate that functional alignment algorithms improve inter-subject decoding, but to varying degrees under different circumstances. I have a few comments about the appropriate (and representative) application of each algorithm to ensure fair comparisons, as well as some questions about the implementation details. Sorry for such a long-winded review! In general, I think this is a good paper on a difficult topic, and the accompanying software is an excellent contribution.
I’m not sure framing the problem of functional alignment in terms of pairwise alignments—i.e. aligning one single subject to another single subject—is sufficiently motivated in the text. I can understand if this is a simplifying step necessary to make benchmarking more tractable, but it doesn’t really seem representative of how hyperalignment algorithms are used in the wild. Most use-cases I can think of in the literature (e.g. almost all papers following on Haxby et al., 2011, Chen et al., 2015, etc.) use hyperalignment to transform all subjects into a single common (i.e. group, consensus, shared) space (excepting e.g. Bazeille et al., 2019, Jiahui et al., 2020). Even with regular old anatomical alignment, the goal is typically to align all subjects to a template, not to align one subject to another. This choice raises some issues. For example, estimating a hyperalignment solution that transforms each subject into a consensus shared space (rather than pairwise alignments) may regularize the alignment process in useful ways. This means that the benchmark performance of some algorithms may not really be representative of how they’re typically used.
Following on the previous point, I think it’s important to explain that the simplified implementation of Procrustes-based hyperalignment here is not the same as the PyMVPA implementation from prior publications. For example, the canonical Procrustes-based algorithm described in Haxby et al., 2011, is applied to a group of subjects using an iterative three-stage alignment-and-averaging procedure (described in the “Hyperalignment” section on page 414): subjects are initially aligned to an arbitrary reference subject, averaged with the reference, realigned to the reference, then a global average is computed, and the final alignment is computed with respect to this averaged reference. (Not to mention bells and whistles like the voxel selection step described on page 18 of the Supplemental Information.) The current manuscript uses a single Procrustes alignment to register one subject to another. In general, I think it’s fine to use a simplified Procrustes-based implementation of hyperalignment—but this should be made explicit (and it may underestimate performance due to some implicit regularization in the iterative approach to deriving a common space).
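To make the contrast concrete, here’s a minimal numpy sketch of the difference between a one-shot pairwise Procrustes alignment and the iterative alignment-and-averaging idea; the function names (`procrustes`, `pairwise_align`, `iterative_template_align`) are mine, and this deliberately omits the voxel selection and other bells and whistles of the PyMVPA implementation:

```python
import numpy as np

def procrustes(source, target):
    """Orthogonal transformation R (voxels x voxels) such that
    source @ R best approximates target in the least-squares sense."""
    u, _, vt = np.linalg.svd(source.T @ target, full_matrices=False)
    return u @ vt

def pairwise_align(source, target):
    # one-shot pairwise alignment (as in the current manuscript, if I read it right)
    return source @ procrustes(source, target)

def iterative_template_align(subjects, n_iter=2):
    """Rough sketch of the iterative alignment-and-averaging scheme:
    align everyone to a running reference, average, and repeat, so the
    final transformations map each subject into an averaged template."""
    template = subjects[0]
    for _ in range(n_iter):
        aligned = [s @ procrustes(s, template) for s in subjects]
        template = np.mean(aligned, axis=0)
    return [procrustes(s, template) for s in subjects], template
```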
I’m having a hard time squaring the relatively poor performance of searchlight Procrustes-based hyperalignment here with results from Guntupalli et al., 2016. For example, Figure 6 in Guntupalli et al., 2016 (right panel) shows that estimating hyperalignment from a Hollywood movie yields a pretty sizable improvement in decoding performance for a separate animal species classification experiment. Why do we see such a discrepancy here? One potential explanation is the different implementation of searchlight hyperalignment used here. For example, the current implementation uses pairwise alignment rather than the iterative alignment to a single common space described in my previous comment. Regarding line 204, I believe the PyMVPA searchlight hyperalignment algorithm actually sums overlapping transformation matrices, rather than averaging them.
Following on the previous comment, I also wonder if using whole-brain decoding as an evaluation metric obfuscates interesting things happening more locally. I’d be curious to see parcel-wise decoding maps. We might expect to see quite high decoding performance in particular parcels of certain sizes for certain types of stimuli, even if whole-brain decoding seems relatively stable across different parameters. I’m also surprised how stable the whole-brain decoding results are across parcellations (Figure S2; although I think IBC is a suboptimal benchmark dataset). One of the potential downsides of the piecewise approach is that the quality of hyperalignment now hinges on the quality of the parcellation—and the parcel boundaries are estimated prior to hyperalignment, so they’re only as good as the initial (anatomical) alignment. (Not to mention the parcellation is usually derived from averaged data, which we know doesn’t capture the details of individual functional organization; Gordon et al., 2017). The point of searchlight hyperalignment is to find a hyperalignment transformation for the whole cortex that is locally constrained but agnostic to the variability of these predetermined functional boundaries.
At line 552, you mention that the decoding data were smoothed with a 5 mm kernel. Doesn’t this partly defeat the purpose of hyperalignment? If the hyperalignment transformation is estimated from unsmoothed data, then applied to smoothed data, any fine-grained re-mixing across voxels will likely be rendered ineffective (Figure 5 in Guntupalli et al., 2016, suggests pretty spatially fine-grained tuning). The detriment of this smoothing to hyperalignment performance may not be obvious in whole-brain decoding.
From your description of Experiment 2, I was expecting you to both estimate hyperalignment transformations and evaluate decoding within a predefined ROI. But looking at the results for Experiment 2 and Figure 7, I’m confused: what do the terms “piecewise” and “searchlight” mean in the context of a predefined ROI? Do you perform whole-brain (piecewise or searchlight) hyperalignment and then decode only from the predefined ROI? Why not just estimate hyperalignment transformations within the predefined ROI itself? Presumably SRM is run only within the ROI itself, but then why not do the same for the other algorithms?
I have a couple questions about the application of SRM. First, at lines 342 and 437, you imply that SRM only functions on group data, and is not appropriate for subject pairs. I don’t see why this is really true (at least in principle). You can run SRM on a stack of only two subjects; it may not work as well as running SRM on a group of subjects, but this may be true of other alignment algorithms as well. Second, and more importantly, at line 228, you imply that the regularization in SRM makes it applicable to whole-brain data; but I don’t think most SRM users would agree. In most examples from the literature (that I’m aware of), SRM has been applied to ROIs (and not to the whole brain). With this in mind, I don’t think applying SRM to the whole brain is the appropriate (or fair) comparison for Experiment 1 and Figure 5. In my mind, it makes much more sense to apply SRM in piecewise fashion to cortical parcels as you do for piecewise Procrustes and piecewise optimal transport (it should be fast!). I suspect your concern is that following piecewise SRM, you’ll end up with lower-dimensional vectors for each ROI (i.e. not matching the original number of voxels in the whole brain). There are two ways of dealing with this: (1) ignore it and run the decoding analysis on the lower-dimensional whole-brain vectors; or (2) inverse-transform the low-dimensional vectors from shared space into the target subject’s voxel space, thus recovering the original dimensionality, but aligning (and cleaning) the data (BrainIAK’s FastSRM has an “inverse_transform” method for this purpose).
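As a sketch of what piecewise SRM could look like, here’s a minimal deterministic SRM in numpy (in the spirit of Chen et al., 2015, but not the BrainIAK implementation; `fit_srm` and the helper names are hypothetical). In a piecewise application you’d fit this per parcel, then either decode from the k-dimensional shared space (option 1) or inverse-transform back into a target subject’s voxel space (option 2):

```python
import numpy as np

def fit_srm(subjects, k=50, n_iter=10, seed=0):
    """Deterministic SRM sketch: orthonormal maps W_i (voxels x k) and a
    shared response S (samples x k) minimizing sum_i ||X_i - S @ W_i.T||^2,
    fit by alternating Procrustes and averaging steps."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((subjects[0].shape[0], k))
    for _ in range(n_iter):
        Ws = []
        for X in subjects:
            # best orthonormal-column W for this subject given current S
            u, _, vt = np.linalg.svd(X.T @ S, full_matrices=False)
            Ws.append(u @ vt)
        # shared response is the average of the projected subjects
        S = np.mean([X @ W for X, W in zip(subjects, Ws)], axis=0)
    return Ws, S

def transform(X, W):
    return X @ W       # subject voxel space -> k-dim shared space

def inverse_transform(S_new, W):
    return S_new @ W.T # shared space back to this subject's voxel space
```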
Following on the previous comment, I’m not sure it’s fair to compare decoding after SRM with dimensionality reduction to decoding after another hyperalignment algorithm at the original dimensionality (this is also a valid criticism of some of the comparisons in Chen et al., 2015). For example, I wonder if running PCA with k = 50 dimensions on the Procrustes-based hyperaligned data would yield similar results to SRM with k = 50 dimensions. (This comment is based on a conversation with Feilong Ma.)
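Such a matched-dimensionality baseline might look something like this minimal sketch (assuming data arrays of shape samples × voxels; `pca_reduce` is a hypothetical name), applied to the Procrustes-aligned data before decoding:

```python
import numpy as np

def pca_reduce(data, k=50):
    """Project (samples x voxels) data onto its top-k principal components,
    putting a full-dimensional alignment on the same footing as a k-dim SRM."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T  # (samples x k) component scores
```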
At line 696 and Figure S1, you make the general claim that ROI-based decoding is lower than whole-brain decoding. This claim cannot be made in general because it depends heavily on several factors, e.g.: (1) what is being decoded (brain representations for different tasks or stimuli may be spatially local or global); (2) how the ROI is defined (e.g. your visual ROI for BOLD5000 does not extend very far into ventral temporal cortex); and (3) how the decoder is regularized (here you’re using SVM with default hyperparameters). In fact, your Figure S1 suggests that ROI-based decoding may be better than whole-brain decoding for BOLD5000 (orange dots).
In Table 1, it would be helpful to see how many time points or samples are used to estimate the hyperalignment transformations. For example, as far as I understand, the IBC dataset has only 53 samples (contrasts) available to drive alignment(?), whereas the StudyForrest movie dataset has (I believe) 3599 time points. This is a huge difference. I would expect training hyperalignment on a naturalistic movie with lots of dynamic, engaging narrative content to yield a more robust, generalizable alignment than training on a handful of experimental conditions or contrasts (as they show in Haxby et al., 2011, Figure 4).
In my experience, functional alignment algorithms are pretty sensitive to standardization (z-scoring); that is, both the data used to estimate the hyperalignment model and the data to which the hyperalignment transformations are applied should be in the same region of response space, or performance will be poor. At line 549, you mention standardizing the data as part of your preprocessing pipeline. I just want to confirm that this means z-scoring the time series per voxel, and that you z-score both the alignment and decoding data in this way. For the decoding data, this means the beta maps to which the hyperalignment transformations are applied inherit the scaling from the standardization step—right? If the beta maps end up being scaled differently, I would expect performance to suffer. (You could, in theory, apply the hyperalignment transformations to the z-scored decoding time series data prior to estimating betas if this is a concern).
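For clarity, by per-voxel z-scoring I mean something like the following (a minimal numpy sketch; `zscore_voxels` is my name for it), applied identically to the alignment data and to the data the transformations are applied to:

```python
import numpy as np

def zscore_voxels(ts):
    """Z-score a (timepoints x voxels) array per voxel (i.e., per column)."""
    return (ts - ts.mean(axis=0)) / ts.std(axis=0)

# The same standardization should be applied both to the data used to
# estimate the alignment and to the data the transformations are applied
# to, so that both sit in the same region of response space.
```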
A note on my use of terminology: I’m using the term “hyperalignment” to refer to the family of algorithms that recast the inter-subject correspondence problem into a high-dimensional vector space where each dimension corresponds to a voxel and the goal is to align information encoded in this vector space (e.g. align local representational geometry). I consider algorithms such as Procrustes, SRM, and optimal transport to be “hyperalignment” algorithms because they rely on this reformulation of the problem introduced by Haxby and colleagues in 2011 (which stems from representational similarity analysis popularized by Kriegeskorte and colleagues in 2008). The point is to differentiate this family of algorithms from functional (or multimodal) alignment algorithms that, despite using functional data to guide anatomical alignment (e.g. Sabuncu et al., 2010; Conroy et al., 2013; Robinson et al., 2014), are fundamentally grounded in 2- or 3-dimensional brain space (rather than a high-dimensional response space).
Line 117: You say “time-segment matching relies on the same stimulus class to train and test the alignment,” but this is not strictly true. Time-segment matching was introduced as a way to perform classification on a continuous naturalistic stimulus (e.g. a movie) that doesn’t have obvious “classes.” But I can certainly estimate hyperalignment on one movie and evaluate the quality of alignment using time-segment matching on another movie. Or I can estimate hyperalignment on an experimental image-viewing task and evaluate the quality of alignment using time-segment matching in a movie (e.g. Haxby et al., 2011, Figure 4); here the improvement is diminished because image-viewing tasks are impoverished relative to the richness of movie-viewing.
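For reference, a toy version of time-segment matching between one subject and another (aligned) subject might look like this; `time_segment_matching` is a hypothetical name, and I’ve omitted the usual exclusion of temporally adjacent (overlapping) candidate segments for brevity:

```python
import numpy as np

def time_segment_matching(data_a, data_b, window=5):
    """For each window-length segment of subject A's (timepoints x voxels)
    data, find the most-correlated segment in subject B's data; return the
    fraction of segments matched to the correct time point."""
    n = data_a.shape[0] - window + 1
    segs_a = np.stack([data_a[i:i + window].ravel() for i in range(n)])
    segs_b = np.stack([data_b[i:i + window].ravel() for i in range(n)])
    # Pearson correlation of every segment in A with every segment in B
    a = (segs_a - segs_a.mean(1, keepdims=True)) / segs_a.std(1, keepdims=True)
    b = (segs_b - segs_b.mean(1, keepdims=True)) / segs_b.std(1, keepdims=True)
    corr = a @ b.T / segs_a.shape[1]
    return np.mean(np.argmax(corr, axis=1) == np.arange(n))
```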
Line 132: I’m not sure the reference to information mapping helps here. I think Kriegeskorte would argue that focusing on representational geometry circumvents the problem of intersubject alignment entirely (because the similarity structure is rotation-invariant).
Line 209: Make clear that this parcellation is applied to aggregate data prior to functional alignment, meaning the parcel boundaries are only as good as the initial (anatomical) alignment.
Line 212: The atlas doesn’t strictly have to be “functional”—anatomically-defined ROIs may work as well.
Line 239: List these parameters (e.g. searchlight radius) explicitly somewhere since you’re using a homebrew implementation of searchlight hyperalignment.
Line 255: Why is “connectivity hyperalignment” included in this list? Connectivity hyperalignment (esp. with ISFC; Nastase et al., 2020) seems to satisfy these criteria. That said, I wouldn’t include it in this benchmark, simply because it opens a can of worms.
Line 297: In describing the orthogonality of Procrustes, I would mention that the motivation for this constraint is to estimate an alignment that preserves representational geometry. (The aggregation of searchlight transformations and the dimensionality reduction of SRM will warp representational geometry a bit—but this effect is very small in my experience).
Line 428: What does “standard anatomical alignment” mean?
Lines 567, 572: Might be worth expanding acronyms like BASC and MSDL.
Figure 5: What are the empty dots (black rings) in the right panels of Figures 5 and 7? In general, the box plots in these right panels are a bit tricky to interpret. I would also make it clear in the title of these plots that they are “relative” to piecewise Procrustes.
Line 623: re-word “Although between dataset variance yields large boxplot”
Lines 216 and 845: You mention that the piecewise approach may introduce artificial discontinuities or staircase effects at parcel boundaries. Do you observe that effect here? This potential drawback seems underappreciated (the searchlight approach should return spatially smooth transformations).
Update the Vázquez-Rodríguez reference from the preprint to the PNAS version.
Bazeille, T., Richard, H., Janati, H., & Thirion, B. (2019). Local optimal transport for functional brain template estimation. In International Conference on Information Processing in Medical Imaging (pp. 237–248).
Chen, P. H. C., Chen, J., Yeshurun, Y., Hasson, U., Haxby, J., & Ramadge, P. J. (2015). A reduced-dimension fMRI shared response model. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (pp. 460–468).
Conroy, B. R., Singer, B. D., Guntupalli, J. S., Ramadge, P. J., & Haxby, J. V. (2013). Inter-subject alignment of human cortical anatomy using functional connectivity. NeuroImage, 81, 400–411.
Gordon, E. M., Laumann, T. O., Gilmore, A. W., Newbold, D. J., Greene, D. J., Berg, J. J., Ortega, M., Hoyt-Drazen, C., Gratton, C., Sun, H., Hampton, J. M., Coalson, R. S., Nguyen, A. L., McDermott, K. B., Shimony, J. S., Snyder, A. Z., Schlaggar, B. L., Petersen, S. E., Nelson, S. M., & Dosenbach, N. U. (2017). Precision functional mapping of individual human brains. Neuron, 95(4), 791–807.
Guntupalli, J. S., Hanke, M., Halchenko, Y. O., Connolly, A. C., Ramadge, P. J., & Haxby, J. V. (2016). A model of representational spaces in human cortex. Cerebral Cortex, 26(6), 2919–2934.
Haxby, J. V., Guntupalli, J. S., Connolly, A. C., Halchenko, Y. O., Conroy, B. R., Gobbini, M. I., Hanke, M., & Ramadge, P. J. (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2), 404–416.
Jiahui, G., Feilong, M., di Oleggio Castello, M. V., Guntupalli, J. S., Chauhan, V., Haxby, J. V., & Gobbini, M. I. (2020). Predicting individual face-selective topography using naturalistic stimuli. NeuroImage, 216, 116458.
Nastase, S. A., Liu, Y. F., Hillman, H., Norman, K. A., & Hasson, U. (2020). Leveraging shared connectivity to aggregate heterogeneous datasets into a common response space. NeuroImage, 217, 116865.
Robinson, E. C., Jbabdi, S., Glasser, M. F., Andersson, J., Burgess, G. C., Harms, M. P., Smith, S. M., Van Essen, D. C., & Jenkinson, M. (2014). MSM: a new flexible framework for multimodal surface matching. NeuroImage, 100, 414–426.
Sabuncu, M. R., Singer, B. D., Conroy, B., Bryan, R. E., Ramadge, P. J., & Haxby, J. V. (2010). Function-based intersubject alignment of human cortical anatomy. Cerebral Cortex, 20(1), 130–140.