# Open reviews

### December 31, 2020 — Open review of preprint by Kaiser and colleagues: PubPeer

Kaiser, D., Haeberle, G., & Cichy, R. M. (2020). Coherent natural scene structure facilitates the extraction of task-relevant object information in visual cortex. bioRxiv. DOI

Kaiser and colleagues present a concise set of behavioral and fMRI results demonstrating that presenting objects in non-jumbled scenes facilitates task-relevant object representation. They replicate classical behavioral results showing that jumbling a scene disrupts object and scene categorization tasks, and that non-jumbled scenes yield increased response magnitude in downstream visual areas. They then use response pattern templates for isolated objects from a localizer task to demonstrate that objects are more robustly represented in intact scenes, but only when subjects are performing an object categorization task. I have a couple comments about framing/interpretation, as well as some questions about methodological details.

I have a hard time understanding what exactly we can infer from this jumbling manipulation as a means of evaluating the impact of scene “coherence” or “structure.” I realize this manipulation is inherited from classic work by Biederman, but the jumbling manipulation introduces a variety of distortions that result in very non-naturalistic scenes. For example, jumbling will introduce sharp vertical and horizontal discontinuities at the edges of the quadrants, change low-level spatial frequency content, disrupt light sources/shadows, flip the sky/ground over the horizon line, etc. In other words, the jumbled images are not only “incoherent” (or unusual, surprising), they are not “possible,” “realistic,” “ecological,” etc; they are clearly artificial contrivances. Any of these words are a valid way of describing the manipulation. Terms like “coherence” and “structure” seem to understate the level of artificial distortion introduced into otherwise natural images. Does the full argument about “coherence” still have the same appeal if we instead describe the jumbled images as simply “unrealistic” or “impossible” or “artificial” or “non-naturalistic”? To be fair, the authors touch on this in the Discussion starting at line 481, but I don’t think the “artificiality” of the jumbled scenes is adequately addressed.

The main results of the manuscript hinge on a correlation-based pattern analysis—but we know that Pearson correlation can misbehave in particular situations (Walther et al., 2016). For example, could differences in the baseline estimation (i.e. across runs) drive the correlation effect? Could mean-centering the response patterns (“cocktail blank” removal) impact the correlation effect (Garrido et al., 2013)? To be clear, I trust the results in the manuscript, and this kind of correlation-based approach is widely used. But it would be worthwhile for the authors to discuss (or if possible mitigate) these concerns.

The motivation for backward masking in the design wasn’t immediately obvious to me (presumably this is common practice in the surrounding literature?). I would add a sentence explaining why it makes sense to mask the scene stimuli (e.g. flush the visual system in preparation for the upcoming trial? minimize longer-term conscious processing of the stimulus? whatever it may be).

At line 219, I don’t follow “the T1 image was segmented to obtain transformation parameters to standard MNI-305 space.” Do you mean the T1 was skull-stripped and segmented according to tissue types, then a (linear? nonlinear?) transformation to the MNI-305 template was estimated? Or do you mean surface reconstruction using FreeSurfer (I’m wondering based on the choice of MNI-305)? A little more detail here would be helpful.

Related to the previous comment: Do the ROIs each have the same number of voxels across subjects when projected back from the template into subject space? Is it standard to refer to a big “early visual cortex” ROI like this as “V1”? Are you specifically using V1v and V1d from the Wang et al. 2015 atlas or are other parcels included? (If so, maybe use “EVC” as the acronym/abbreviation.) Seems from Figure 2 that it encompasses more than V1, but maybe that’s just a visualization quirk or due to the inaccuracies of alignment.

I would probably report effect sizes for the behavioral results in the text in addition to the statistics (i.e. the mean accuracy and response times for both intact and jumbled, or at least the differences; qualitatively visible in Figure 1d). Also, at lines 310 and 313, are the t- and p-values for response times for the object and scene tasks actually identical or is this a typo?

When describing the pattern correlation analysis (Methods 2.8 and Results 3.2), I would remind readers that each scene either has a car (cars) or a person (people) in it whether it’s jumbled or not (like you describe at line 124). Anything that can help smooth out the logic for arriving at “object information” when reading through the results would be appreciated.

The authors don’t really discuss the univariate increase in response magnitude for intact scenes (from Figure 2; unless I missed something). How should we interpret this result, both on its own, and with respect to the multivariate results? (Presumably they tell us somewhat complementary things since the correlation-based pattern analysis implicitly z-scores the response patterns, thus ignoring any regional-average difference in response magnitude; Misaki et al., 2010).

I realize there is no significant effect for object information during the scene task (blue bars in Figure 3b)—and you actually mention this in the Discussion at line 535—so I don’t want to read too much into it. But, in addition to the suppression interpretation, do you think this could result from some uninteresting step in the analysis pipeline, like mean-centering response patterns across conditions?

Line 138: Maybe add a brief parenthetical explicitly stating what you mean by “crisscrossed,” e.g: “crisscrossed (i.e. top-left was swapped with bottom-right, and top-right was swapped with bottom-left).”

Line 170: When you say the trial order within a run was “fully randomized,” do you mean each run for each subject had a different randomized trial order? Or did you create four random trial orders (one for each run) and all participants got the same four runs?

Lines 213–215: I would fill in the acquisition parameters with a bit more detail in keeping with OHBM’s COBIDAS report: http://www.humanbrainmapping.org/files/2016/COBIDASreport.pdf

Line 261: What does “baseline” mean here? For example, for the “people vs. baseline” in the t-contrast for the localizer task, does baseline include only fixation or does it also include car images? (This is fairly important for correlation-based pattern analysis.)

Line 266–267: You normalize by “subtracting the average t-value across conditions.” For the localizer run, does this mean the average of the two contrasts (people, cars)?

Lines 272, 377: I’m not sure I’d use the term “cross-correlation” here, since it means something different with signal processing/time series.

Line 382: “each of the scenes” > “each scene condition” (since you’re not doing an exemplar level analysis here)

Line 391: “each the” > “each of the”

Great that the data are available on OSF! I would encourage the authors to share their code as well (via e.g. OSF, GitHub).

### References:

Garrido, L., Vaziri-Pashkam, M., Nakayama, K., & Wilmer, J. (2013). The consequences of subtracting the mean pattern in fMRI multivariate correlation analyses. Frontiers in Neuroscience, 7, 174. DOI

Misaki, M., Kim, Y., Bandettini, P. A., & Kriegeskorte, N. (2010). Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. NeuroImage, 53(1), 103–118. DOI

Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., & Diedrichsen, J. (2016). Reliability of dissimilarity measures for multi-voxel pattern analysis. NeuroImage, 137, 188–200. DOI