Open reviews

January 16, 2024 — Open review of Im and colleagues:

Im, E. J., Shirahatti, A., & Isik, L. (2023). Early neural development of social perception: evidence from voxel-wise encoding in young children and adults. PsyArXiv. DOI


Im and colleagues use voxelwise encoding models to capture the visual and social content of brain activity during movie-watching in an open dataset of subjects ranging in age from 3 years to adulthood. They find fairly strong encoding of social content across cortex in all age groups. In more targeted analyses of areas MT and STS, they find that while the visual motion energy model performs strongly and stably across age groups, the social model yields significantly higher encoding performance in adults than in 3-year-olds in both regions. Analyzing the developmental trajectory of specific social indicator variables suggests that adult-like theory of mind (ToM) responses emerge quite late. The findings seem fairly novel and I think the methods are solid enough, although I pose a number of clarifying questions below. Whatever the authors can do to increase the reader’s confidence that the reduced encoding performance in young children is not simply a byproduct of lower data quality will help.

Major comments:

I suspect that, for many readers, the main concern will be that the lower encoding performance observed for the social model in the youngest age groups could be due to issues in data quality that covary with age (e.g. head motion, task compliance) rather than a “true” developmental trend in neural activity/cognition. Anything the authors can do to strengthen their argument otherwise would really help. In fact, the authors already have a strong counterargument, but it feels understated: at line 315, you point out that strong and stable encoding results for the motion energy model indicate that the reduced encoding performance for the social model in the youngest age groups “cannot simply be attributed to lower quality data in children.” Yes! I would try to unpack this important point a bit more in the Discussion, and maybe even reiterate it more explicitly in the Abstract, Results, etc. Alternatively, the authors could try to more explicitly rule out possible age-varying confounds like head motion in the statistical analyses. For example, you could run the group-level statistical tests for a difference across age groups with a covariate quantifying subject-level framewise displacement (FD).
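To make that concrete, a minimal sketch of an ANCOVA-style test with an FD covariate might look like the following; the CSV file and column names are hypothetical placeholders for the subject-level values the authors would supply, and a parametric model stands in here for the permutation-based tests used in the paper:

```python
# Minimal sketch: test for age-group differences in subject-level encoding
# performance while controlling for head motion (mean framewise displacement).
# The CSV and column names below are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# One row per subject: ROI-averaged encoding performance, age group, mean FD
df = pd.read_csv("subject_level_encoding.csv")  # columns: encoding_r, age_group, mean_fd

# ANCOVA-style model: age-group effect on encoding performance covarying for FD
model = smf.ols("encoding_r ~ C(age_group) + mean_fd", data=df).fit()
print(model.summary())
```

A nonparametric version of the same idea (e.g. permuting age-group labels after regressing out FD) would also be in keeping with the permutation tests used elsewhere in the paper.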

In a similar vein of building readers’ confidence about data quality, I think it’s probably worth including/reproducing more details about the preprocessing from Richardson and colleagues (e.g. line 73). For a dataset including young children like this, things like head motion could be a real confound and preprocessing details are important. For example, “artifact timepoint”—do you mean scrubbing/censoring TRs with large FD values? What’s the threshold for censoring a frame? Does “PCA-based noise regressors” refer to aCompCor? If so, are they extracted from CSF and/or white matter masks, and how many PCs?

At line 152, you briefly describe the cross-validation procedure: “We trained our model using a random subset of eighty percent of the movie data.” Is there any particular motivation for training on a random subset of time points? This seems like it could be problematic, as both the movie and particularly the BOLD data are temporally autocorrelated. Splicing out random time points will mean that (a) the training and test sets may not retain the temporal autocorrelation of the original data; and (b) neighboring time points that are highly autocorrelated will get split apart into the training and test sets. It might make more sense to split the movie into temporally contiguous segments for cross-validation (e.g. for one fold, train on the first 80% of the movie and test on the last 20%).
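For illustration, contiguous folds fall out directly from scikit-learn’s KFold with shuffle=False; in this sketch the stimulus-feature matrix X and BOLD matrix Y are random placeholders, and plain ridge regression stands in for the banded ridge model actually used in the paper:

```python
# Minimal sketch: contiguous-segment cross-validation over movie time points,
# instead of a random 80/20 split of individual TRs. X (TRs x features) and
# Y (TRs x voxels) are placeholders; RidgeCV stands in for banded ridge.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))   # placeholder stimulus features
Y = rng.standard_normal((1000, 50))   # placeholder voxel time series

kf = KFold(n_splits=5, shuffle=False)  # shuffle=False keeps each fold temporally contiguous
scores = []
for train_idx, test_idx in kf.split(X):
    model = RidgeCV(alphas=np.logspace(-2, 5, 8)).fit(X[train_idx], Y[train_idx])
    preds = model.predict(X[test_idx])
    # Voxelwise correlation between predicted and actual test time series
    r = [np.corrcoef(preds[:, v], Y[test_idx, v])[0, 1] for v in range(Y.shape[1])]
    scores.append(np.mean(r))
print(np.mean(scores))
```

One could also leave a small buffer of TRs between the training and test segments to further limit leakage from temporal autocorrelation.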

I wonder if it would be worth running a group-level post hoc contrast testing for an increasing linear trend in encoding performance across the age groups from 3–4-year-olds to adults. I think this would be relatively straightforward to implement via linear regression or ANOVA (e.g. with ordered contrast weights of -2, -1, 0, +1, +2 across the age groups). It might be a nice way to summarize the results, it seems decently well-motivated a priori, and the STS social encoding performance seems to follow a trend like this.
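A minimal version of that trend test could simply regress subject-level encoding performance onto the ordinal contrast weights; the group labels, weights, and data values below are illustrative placeholders and assume five ordered groups:

```python
# Minimal sketch: linear trend test across ordered age groups. Group labels,
# weights, and data values are illustrative placeholders.
import numpy as np
from scipy import stats

trend_weights = {"3yo": -2, "4yo": -1, "5yo": 0, "7yo": 1, "adult": 2}

# One entry per subject: age-group label and ROI-averaged encoding performance
age_group = ["3yo", "3yo", "4yo", "5yo", "7yo", "adult", "adult"]   # placeholder
encoding_r = [0.02, 0.04, 0.05, 0.07, 0.10, 0.14, 0.12]            # placeholder

x = np.array([trend_weights[g] for g in age_group])
y = np.array(encoding_r)

# A slope significantly greater than zero indicates an increasing linear trend with age
result = stats.linregress(x, y)
print(result.slope, result.pvalue)
```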

At line 229, the authors report no significant differences between encoding maps in children and adults based on a permutation test combined with controlling FDR. I suspect this is a true null result, but I wanted to highlight a possible “gotcha” here. With 5,000 permutations, if you compute the permutation-based p-value as the proportion of samples from the null distribution that exceed the observed test statistic, the smallest p-value you can get is 1 / 5001 ≈ .0002 (Phipson & Smyth, 2010). Depending on the number of voxels in your whole-brain maps, this p-value may simply not be low enough to survive FDR correction. Note, however, that computing the p-value in this way does not necessarily capture how far your test statistic is into the tail of the null distribution; even if the observed difference is really far into the tail, your p-value will still just be 1 over the number of permutations using this count-based approach. One way to deal with this is to treat your permutation distribution as a normal distribution and compute the z-value and corresponding p-value of the observed difference relative to the null distribution. (That is, concatenate your 5,000 permutation-based differences to your observed difference, z-score all of them, then extract the z-score for that observed difference, and compute a p-value from that z-value.) I just want to rule out the possibility that there’s actually a significant difference here, but we’re not seeing it due to some weird interaction between permutation tests and FDR.
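Concretely, the z-score approach amounts to something like the following for a single voxel; null_diffs and obs_diff are hypothetical placeholders for the permutation distribution and the observed group difference:

```python
# Minimal sketch of the z-score approach described above for a single voxel.
# null_diffs and obs_diff are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
null_diffs = rng.standard_normal(5000)   # placeholder permutation-based differences
obs_diff = 3.5                           # placeholder observed group difference

# Concatenate the observed difference onto the null distribution and z-score everything
all_diffs = np.concatenate([null_diffs, [obs_diff]])
z_all = (all_diffs - all_diffs.mean()) / all_diffs.std()
z_obs = z_all[-1]

# Two-sided p-value from the normal distribution (no longer bounded by 1 / n_permutations)
p_value = 2 * stats.norm.sf(np.abs(z_obs))
print(z_obs, p_value)
```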

For the ROI analyses, you average the voxelwise model performance values within the ROI, right? Or do you average the time series across the ROI and refit models on the average time series? I assume it’s the former (and I think that’s the better approach); I would make sure to state that explicitly somewhere (sorry if I missed it!).

Minor comments:

Line 49: Is “inter-region correlation (IRC)” a widely-used term in the field? Is this just referring to conventional “functional connectivity”?

Line 73: I suspect 5 mm spatial smoothing may not be ideal for voxelwise encoding analyses; but I think it’s probably best to still use the “official” preprocessed data (as the authors currently do).

Line 86: cite Adelson & Bergen, 1985

Line 111: “Intersubject Brain Correlation”—I don’t think the “Brain” is necessary here

Line 114: Instead of “stable neural activity over time”, I would say something like “reliable, stimulus-evoked responses”

Line 118: cite the BrainIAK paper by Kumar et al., 2021

Line 123: Do you use the “himalaya” implementation of banded ridge regression? If so, I would make sure to cite Dupre la Tour et al., 2022. (If not, I recommend it! It was a major improvement over tikreg in our work.)

Line 141: I think the typical approach here (e.g. Huth et al., 2012) is to use a finite impulse response (FIR) model where the encoding model is duplicated at several lags to allow ridge regression to flexibly fit different hemodynamic lags across voxels/subjects. This will considerably widen the model and probably won’t make a qualitative difference in the results, but could be worth considering for future work.
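For reference, a minimal sketch of that lagged design matrix is below; the delay values and the placeholder feature matrix are just illustrative:

```python
# Minimal sketch: duplicate the stimulus features at several delays (in TRs) so that
# ridge regression can flexibly weight different hemodynamic lags per voxel.
# The delays chosen here are illustrative.
import numpy as np

def make_lagged_design(X, delays=(2, 3, 4, 5)):
    """Horizontally stack copies of X shifted forward in time by each delay (in TRs)."""
    n_trs, n_features = X.shape
    lagged = np.zeros((n_trs, n_features * len(delays)))
    for i, delay in enumerate(delays):
        lagged[delay:, i * n_features:(i + 1) * n_features] = X[:n_trs - delay]
    return lagged

X = np.random.default_rng(0).standard_normal((1000, 20))  # placeholder feature matrix
X_lagged = make_lagged_design(X)
print(X_lagged.shape)  # (1000, 80): original features duplicated at four delays
```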

Figure 1: For the masking step in Figure 1, I would adjust the “ISC whole brain” label to include “mask” somewhere. Also, what data are plotted in the “MT or STS” masks? I would consider plotting both actual ROI masks so we can see exactly where they are.

Line 155: Do we want this error term epsilon in the prediction equation?

Line 182: When you say “feature correlations”, you’re using RSA (i.e. comparing time-point similarity matrices), right? I would make sure to say that explicitly here.

Figure 2: I like this summary of the features/annotations! It might be worth resizing things a bit to align the left and right panels, and to normalize the font sizes.

Line 233: I’m not sure the ROI analysis technically gives you more “statistical power”; this might be better stated as “a more targeted analysis not subject to correction for voxelwise tests” or something like that.

References: Vanderwal and colleagues (2019) also provide good arguments for the utility of naturalistic stimuli in developmental neuroimaging.

References:

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2(2), 284–299. DOI

Dupre la Tour, T., Eickenberg, M., Nunez-Elizalde, A. O., & Gallant, J. L. (2022). Feature-space selection with banded ridge regression. NeuroImage, 264, 119728. DOI

Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210–1224. DOI

Kumar, M., Anderson, M. J., Antony, J. W., Baldassano, C., Brooks, P. P., Cai, M. B., Chen, P.-H. C., Ellis, C. T., Henselman-Petrusek, G., Huberdeau, D., Hutchinson, J. B., Li, P. Y., Lu, Q., Manning, J. R., Mennen, A. C., Nastase, S. A., Richard, H., Schapiro, A. C., Schuck, N. W., Shvartsman, M., Sundaraman, N., Suo, D., Turek, J. S., Turner, D. M., Vo, V. A., Wallace, G., Wang, Y., Williams, J. A., Zhang, H., Zhu, X., Capota, M., Cohen, J. D., Hasson, U., Li, K., Ramadge, P. J., Turk-Browne, N. B., Willke, T. L., & Norman, K. A. (2021). BrainIAK: The Brain Imaging Analysis Kit. Aperture Neuro, 1(4). DOI

Phipson, B., & Smyth, G. K. (2010). Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 9(1), 39. DOI

Vanderwal, T., Eilbott, J., & Castellanos, F. X. (2019). Movies in the magnet: naturalistic paradigms in developmental functional neuroimaging. Developmental Cognitive Neuroscience, 36, 100600. DOI