Open reviews

August 14, 2021 — Open review of preprint by Shahdloo and colleagues: PubPeer

Shahdloo, M., Çelik, E., Urgen, B. A., Gallant, J. L., & Çukur, T. (2021). Task-dependent warping of semantic representations during search for visual action categories. bioRxiv. DOI


Shahdloo and colleagues demonstrate that attending to two different action categories (“locomotion” and “communication”; via a visual search task during naturalistic movie-viewing) shifts voxelwise semantic tuning toward the attended action category. This manuscript extends work by Çukur et al., 2013, on attention in object representation to the domain of actions. The authors do a commendable job of controlling for low-level perceptual features that may covary with action semantics, and provide several auxiliary analyses elaborating on the attentional tuning result with respect to nontarget actions and intrinsic action tuning. The methods are very rigorous and thoroughly described, and the figures are excellent. In general, I think this manuscript is already at a pretty high level, and most of my comments boil down to clarifications and curiosity.

Major comments:

I’m trying to understand what exactly the template feature vector looks like for a concept like “locomotion” in the WordNet hierarchy. For example, you say “idealized category responses were constructed as 109-dimensional vectors that contained ones for target categories (either communication or locomotion categories) and zeros elsewhere” (line 659). As far as I understand, “locomotion” is a high-level category in WordNet, so the 1 assigned to “locomotion” also propagates downward to subordinate categories like “walk” and “run”—right? So ultimately, the “locomotion” actions are indexed by a 109-dimensional indicator vector containing 1s for “locomotion” as well as all subordinate words? The authors list all the action categories in Supplementary Methods (SM), but it might be useful to indicate which action words correspond to locomotion, which correspond to communication, and which correspond to neither. I’m also curious how many locomotion/communication words actually occur in the stimuli used here—for example, I can’t imagine words like “canter” occur very many times. In any case, better describing how the WordNet-based feature vectors capture these action categories, and some summary statistics describing how frequently occur in the stimulus would be helpful.

When you use the term “semantic selectivity” in the main text (e.g. the first time you describe computing the tuning shift index at line 252), it wasn’t clear to me what this means. Much later on, in the Experimental Procedures section, you make clear that “selectivity” is the correlation between the estimated semantic tuning for a given voxel and an idealized template vector. These vectors are first projected into the 12-dimensional PCA-based semantic space (which means you’re computing a correlation across 12 values, right?). This way of quantifying “tuning” or “selectivity” seems pivotal enough to be worth describing (at least one sentence) early on.

Related to the previous point, can you spend a sentence or two unpacking/motivating the equation for computing your tuning shift index (TSI) at line 674 (e.g. normalizing to ensure the index spans -1 to 1). In the same section, I don’t really understand how the nontarget TSI is computed. When you say that “action category responses were masked to select the nontarget categories” (line 682), do you mean that time points containing the target categories were excluded? (Or do you mean that model features related to the target categories were excluded? Not sure how you’d compute the correlation between estimated tuning and the templates in this latter case.)

At line 554, you say that “Subjects were instructed to covertly search for target categories in the movies while maintaining fixation.” Does this mean there was no behavioral component to the visual search task (e.g. press a button when you see a “locomotion” action)? I don’t think any overt response or “behavioral results” are strictly necessary—the results stand on their own—but I think this point should be made explicit in the text.

From the theoretical framing in the current manuscript, I don’t really understand how shifting voxelwise tuning toward a target category necessarily expands multivariate representational space. Maybe this is more thoroughly explained in Çukur et al., 2013, but it might be worth adding a sentence to clarify the relationship here. For example, Figure 1b suggests that attention expands multivariate representational space such that items in the attended region of space are more dissimilar (i.e. more discriminable) from each other. How exactly does shifting voxelwise tuning such that the tuning curves of individual voxels become more similar to an idealized tuning curve expand the attended region of multivariate representational space?

This work analyzes attentional shifts in voxelwise semantic tuning for two high-level attentional targets: locomotion and communication. This distinction seems to be captured by variation across PCs 1 and 3, meaning it accounts for a considerable amount of variance in the full dataset (although it may covary with other dimensions of action representation such as “sociality”). To what extent do the authors think these results would extend to other action categories? Obviously expanding the same experimental design to many different action categories would not really be feasible, but I think it would be useful for the authors to discuss limitations of studying only two targets.

The authors put a considerable amount of effort into controlling for lower-level (e.g. motion energy and STIP) perceptual features (which is great!). However, I wonder if this partly impacts how we should interpret certain results. For example, at line 467, the authors mention that attentional shifts are larger in areas like AG and SMG relative to areas like LOC. But if LOC also encodes low-level kinematic features (line 25), maybe the nuisance models will attenuate potential effects observed in this area? I’m somewhat curious what the attentional tuning results would look like if you didn’t control for STIP (or motion energy) features, but I don’t think this is critical for the manuscript.

Minor comments:

How much do the authors think these results depend on the discrete WordNet hierarchy? For example, I’m wondering how these results would extend to the kind of continuous semantic space learned by models like word2vec or GloVe or BERT (e.g. Schrimpf et al., 2020). In that context, it’s not so obvious which words are subordinate categories of other words; but on the other hand, computing tuning as the similarity between the embedding for a given word and the embedding of a target word like “locomotion” is very straightforward.

I’m not sure the authors have performed this analysis, and I don’t think it’s necessary for the manuscript, but: do you think the attentional task also shifts object (i.e. non-action) category tuning? I’m not sure this makes sense given the way that WordNet structures semantic relations, but I can imagine that in a continuous semantic space (e.g. word2vec) object representations may also covary to some extent with actions. Another related question might be: while subjects are performing visual search for particular action categories, does performing an “action task” more generally deemphasize object representation?

In the “Stimuli and experimental design section,” you describe that there were three 10-minute movie clips and six total movie-watching runs. I assume that subjects performed the “locomotion” and “communication” attention tasks in each of the same three runs—i.e. subjects performed the two tasks for the same exact stimuli—but I don’t see this stated explicitly; is this true?

Line 569: The same sample of subjects was used in the passive-viewing task and the passive models were fit separately for each subject—right?

References: I was expecting to see reference to Lingnau & Downing, 2015, when discussing lateral occipitotemporal cortex. For example, at line 25, when you mention that LOC encodes low-level kinematics, this paper came to mind as suggesting that LOC may encode more than just low-level perceptual features.

SM line 98: Aside from the neighborhood parameter, only one regularization parameter is estimated across the combined semantic, STIP, and motion energy feature spaces, right? In other words, banded ridge regression is not used here (Nunez-Elizalde et al., 2019)? (I suspect it doesn’t make much difference, just want to confirm.)

SM line 106: should this be “ln” rather than “lf”?

References:

Lingnau, A., & Downing, P. E. (2015). The lateral occipitotemporal cortex in action. Trends in Cognitive Sciences, 19(5), 268–277. DOI

Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). Voxelwise encoding models with non-spherical multivariate normal priors. NeuroImage, 197, 482–492. DOI

Schrimpf, M., Blank, I. A., Tuckute, G., Kauf, C., Hosseini, E. A., Kanwisher, N. G., Tenenbaum, J. B., & Fedorenko, E. (2020). The neural architecture of language: integrative reverse-engineering converges on a model for predictive processing. bioRxiv. DOI