Open reviews

October 18, 2025 — Open review of Visconti di Oleggio Castello and colleagues:

Visconti di Oleggio Castello, M., Dupré la Tour, T., & Gallant, J. L. (2025). Individual differences shape conceptual representation in the brain. bioRxiv. DOI


Visconti di Oleggio Castello and colleagues introduce a modeling framework for quantifying and interpreting individual differences in brain function in naturalistic paradigms. This framework allows them to “fingerprint” individual subjects in terms of their neural tuning (i.e., regression weights) along features of an explicit computational model. As an example, they quantify rich individual differences in neural representation using a lexical-semantic model in a sample of subjects listening to ~2 hours of naturalistic spoken stories. This is a pretty monumental piece of work, and feels like a very important milestone in a methodology that is becoming increasingly dominant in the field.

Crucially, the framework the authors are developing here should generalize to essentially any complex naturalistic task where we have a decent model (e.g., CNNs for a visual task, LLMs for a listening or speech task, etc.). In my understanding, this framework provides a qualitative advance over widely used methodologies on several fronts. First, whereas many methods for examining individual differences in naturalistic paradigms are essentially data-driven (e.g., correlating one subject’s brain data with the brain data of other subjects), this approach is fundamentally model-driven. Individual differences are quantified in reference to an explicit set of model features, which, even for complex models, provides a level of interpretability that data-driven methods cannot easily provide. Second, a great deal of work on individual differences has focused on resting-state paradigms; but these paradigms have no stimulus or overt behavior, and therefore nothing outside the brain to leverage for interpreting the brain activity. Third, in typical fMRI task designs, we use highly controlled experimental manipulations to essentially enforce interpretability at the stage of data acquisition; but this means we can only measure variability along the handful of dimensions we build into the paradigm (not to mention the poor ecological generalizability of these paradigms). Here, the model gives us the interpretational traction we need to make sense of a complex, naturalistic phenomenon, while allowing us to capture higher-dimensional, more expressive individual differences in a way that is explicit and quantitative.

This manuscript is already in a pretty good state: it’s well-written, the figures are excellent, the methods are rigorous, and the results are interesting. In the following, I try to propose some actionable ideas for improving the manuscript. Most of my comments are clarification questions aimed at making the manuscript more accessible to a broader audience.

Major comments:

My foremost comment relates to the authors’ novel statistical framework for modeling individual differences. First off, the mathematical contribution of this work feels a bit understated: the authors’ framework for quantifying distributional Explained Variance (dEV), developed on pages 20–24, is really quite interesting. I’m wary of suggesting the authors spend too much more of the main text unpacking this framework, however; there’s a risk that a manuscript like this could be perceived as purely “methodological” (derogatory)—and it’s much deeper and more interesting than that. That said, my main worry with this manuscript is that many readers will not fully “get” the methodology, and so uptake in the field will be slow.

Let me just list out some of the questions I was asking myself as I was reading the Results and the Methods. For example, the authors say (page 21): “When this transformation is applied to voxelwise lexical-semantic model weights, the individual weights are transformed in the high-dimensional lexical-semantic feature space, but their anatomical location is preserved.” I still don’t have a great intuition for what this means or how optimal transport preserves anatomical information. First of all, I think this does not mean that you retain any one-to-one voxel correspondence between the source and target distributions, right? (For example, if you shuffled voxel indices in the target distribution, nothing would change.) Rather, you move all voxels in the source to match the overall distribution of the target, but track how far each voxel has to move—is that correct? This doesn’t take into account any anatomical spatial relations between source voxels, right? A source voxel in occipital cortex could end up in a location where the nearest voxel in the target distribution is from prefrontal cortex—right? Is it true that the “movement” of one source voxel will depend on the tuning/movement of all other source voxels? Should we be worried about this? Would it be beneficial, for example, to estimate the dEV for neighboring voxels in a more local region of interest? (Whether this methodology could be effectively applied to voxels in a particular region, rather than across the entire cortex, may be a question of interest to other readers, and the authors could provide a comment or recommendation here.)

The group distribution of model weights corresponds to all voxels across all N – 1 subjects, right? This way, you avoid having to average potentially mismatching voxels across subjects to create the group distribution (good); but does this scale? The number of samples will get larger with each added subject; will this make the dEV estimation more and more computationally intensive? I think I understand, intuitively, that we’re trying to get an estimate of how far a source voxel has to “move” in the model feature space to match the target, but how that happens in bulk, where voxels are coming from all over cortex, is difficult for me to grasp. Figure 2 is already very helpful, particularly panel C—but maybe some additional labels here highlighting what’s going on for an individual voxel would be helpful? Anyway, my main point here is that any additional hand-holding the authors can provide to answer these kinds of questions as they arise for the reader—to guide their intuition—would be really helpful for both understanding and adoption of this framework.
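To make my own reading concrete, here is a minimal sketch of the kind of per-voxel transport bookkeeping I have in mind, using the POT library; all shapes and variable names here are hypothetical, and this is my reconstruction of the intuition rather than the authors’ implementation:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install POT)

rng = np.random.default_rng(0)
n_voxels, n_features = 1_000, 985  # hypothetical sizes
source = rng.standard_normal((n_voxels, n_features))  # one subject's voxel weights
target = rng.standard_normal((n_voxels, n_features))  # group voxel weights

# Uniform mass on each voxel; costs live entirely in the model feature
# space, so anatomical position never enters the cost matrix.
a = np.full(n_voxels, 1 / n_voxels)
b = np.full(n_voxels, 1 / n_voxels)
M = ot.dist(source, target)  # pairwise squared Euclidean costs

# The optimal plan couples source and target voxels; in general this is
# a soft coupling, not a one-to-one voxel correspondence.
plan = ot.emd(a, b, M)

# Per-source-voxel transport cost: how far each voxel's weight vector
# "moves" to match the target distribution. Each source voxel keeps its
# row index, so this cost can be painted back onto that voxel's
# anatomical location.
voxel_cost = (plan * M).sum(axis=1) * n_voxels
```

If something like this is right, then noting explicitly that anatomy enters only through the voxel (row) indices, and never through the cost matrix, might be all the hand-holding needed on this point.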

I suspect that many readers, as they work through the paper, will be asking themselves “how does this relate to behavior?” I understand that it is not trivial to behaviorally measure fine-grained individual differences in the kind of lexical-semantic representation the authors are targeting in the brain. I think the manuscript can stand alone without this component, given that I’m not even sure what I’d ask the authors to do to satisfy this behavioral angle. That said, I think the authors should be prepared (in the text) for readers to get hung up on what they perceive as a missing piece. They might consider dedicating a paragraph of the Discussion to this limitation. Better yet, they could expand on how future behavioral work could be linked up with their statistical framework for brain data.

I found the negative values in Figure 3—voxels that are well-modeled, but where dEV is less than the mean—difficult to interpret. What’s the motivation for quantifying this in a bipolar fashion relative to the mean rather than, for example, just plotting this on a unipolar colorbar from lowest variability to highest variability? I worry that this kind of colorbar, combined with the authors’ method for scaling alpha by overall model performance, could lead readers to think that areas not engaged by the task or poorly captured by the model (like visual cortex) end up with “mean” individual variability in the middle of the range. I’m curious what the actual values look like for a region that is poorly modeled; my guess is that they would have relatively low variability. In any case, the way these warm values perfectly follow the STS up into the TPJ is a really neat result.

The authors’ core results focus on a lexical-semantic model that has previously been shown to be a fairly strong model for predicting neural activity throughout much of cortex during natural language comprehension (e.g., Huth et al., 2016). I wonder how a poorer model would fare. Presumably, we shouldn’t see such rich individual differences for a poorer (i.e., less suited to the task, overly simplistic, lower-level, etc.) model. I wonder if it would be worth running the individual differences analysis using a weaker model (e.g., the phonemic features?) as a supplementary control analysis or counterexample. If the individual differences are much weaker, or more (or differently) anatomically localized, using a low-level control model of this kind, it would highlight the importance of the model itself in this methodology. In general, seeing a failure mode of this methodology (how can this be made to not work well?) might increase some readers’ confidence in the methodology overall.

The authors’ methodology hinges on voxelwise regression weights learned from large portions of naturalistic data. One question I had when reading: how reliable are these weight vectors within a given subject across stimuli? For example, the authors could estimate weight vectors from one half of their stimuli and correlate those with weight vectors learned from the other half (i.e., split-half reliability). I assume these weight vectors will become more reliable with larger volumes of (and more diverse) training stimuli. With a relatively wide model like this, the weight vectors will be increasingly overfit for small volumes of data or highly constrained stimuli. It could be worth quantifying this reliability in the supplement across subsets of their data of increasing size, to provide “how much data do we need to collect?” guidelines for others interested in adopting this methodology.
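Something like the following sketch is what I have in mind; here, estimate_weights is a hypothetical stand-in for the authors’ voxelwise encoding pipeline, not their actual code:

```python
import numpy as np

def split_half_reliability(story_data, estimate_weights, seed=0):
    """Correlate weight vectors estimated from two random halves of the stories.

    story_data: list of per-story (features, responses) pairs.
    estimate_weights: hypothetical stand-in for the voxelwise encoding
        pipeline; returns weights of shape (n_voxels, n_features).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(story_data))
    half1 = [story_data[i] for i in order[: len(order) // 2]]
    half2 = [story_data[i] for i in order[len(order) // 2 :]]
    w1, w2 = estimate_weights(half1), estimate_weights(half2)
    # Per-voxel Pearson correlation between the two weight estimates
    z1 = (w1 - w1.mean(1, keepdims=True)) / w1.std(1, keepdims=True)
    z2 = (w2 - w2.mean(1, keepdims=True)) / w2.std(1, keepdims=True)
    return (z1 * z2).mean(1)  # one reliability value per voxel
```

Repeating this over nested subsets of stories of increasing size would yield the data-requirement curves I’m suggesting.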

There are some methodological choices in the “Model weight adjustments” section of the Methods that could be unpacked a bit more. I think, in general, the authors’ choices here seem reasonable. My main question is: do these choices have ramifications for their ultimate goal of quantifying individual differences? For example, the authors average weights across the finite impulse response (FIR) lags of 1, 2, 3, 4, and 5 TRs. Doesn’t this mean that potential individual differences in hemodynamic lag and/or processing latency will be effectively discarded in this processing step? Processing lags could be fairly large for spoken stories with hierarchical linguistic and narrative structures unfolding over multiple words (Chang et al., 2022); I could imagine there might be some individual differences here that may be lost. Another example: the authors re-scale each voxel’s weight vector “by the cross-validated split prediction accuracy in the training set.” (Just to confirm, you really do mean “training” here, not “test”—right? I.e., you mean the R-squared on the validation run for a given fold of the inner cross-validation loop; not the within-sample R-squared on the inner training runs, and not the R-squared on the outer test story. This could be described a bit more clearly.) Does this mean that the scale of prediction accuracy itself could drive individual differences? E.g., for a subject with low prediction accuracy in a particular voxel, that voxel’s weight vector will get scaled toward zero overall, which may reduce(?) the transport distance relative to the group mean. Unpacking the implications a bit more would be helpful here.
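As a toy illustration of the two adjustments as I understand them (shapes and names here are hypothetical, not the authors’ code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_delays, n_voxels, n_features = 5, 1_000, 985  # hypothetical sizes
weights = rng.standard_normal((n_delays, n_voxels, n_features))

# (1) Averaging over the FIR delays (1-5 TRs) collapses any per-subject
# latency structure into a single weight vector per voxel.
weights_avg = weights.mean(axis=0)  # (n_voxels, n_features)

# (2) Re-scaling each voxel's weight vector by its cross-validated
# prediction accuracy shrinks poorly predicted voxels toward zero.
r2 = rng.uniform(0.0, 0.3, size=n_voxels)  # hypothetical per-voxel accuracy
weights_scaled = weights_avg * r2[:, None]
```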

There are certain broad-strokes claims about the field in the Introduction—like, for example, “individual differences are rarely studied in cognitive neuroscience”—that might annoy some readers unnecessarily. For example, I can imagine some cognitive neuroscientists reading that and then saying “wait, I’ve been studying individual differences my entire career” or “what are the authors talking about?—my subfield focuses almost entirely on individual differences” (examples like Fedorenko et al., 2010, Dubois & Adolphs, 2016, Finn et al., 2020 come to mind). Aren’t the many, many papers correlating a neural variable with a behavioral or demographic variable (e.g., in social neuroscience) studying individual differences in some way? I wonder if there’s a way to modify the language here to make sure it doesn’t come across as dismissive to certain parts of the audience you want to address; maybe point to some examples, or hone the language to focus on methodological limitations of common approaches.

Minor comments:

The narrative of the Results is quite good; each section introduces a question well and then addresses it.

Figure 4A: These bootstrap 95% confidence intervals look very tight, with the bottom of the interval well above zero for all bars—but only ~half the bars are significantly greater than zero? Just looking at the CIs, I would have guessed that all of these bars would be highly significant. Can the authors clarify?
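To articulate my confusion a bit more precisely: naively, a 95% bootstrap CI that excludes zero corresponds to a two-sided, uncorrected p < .05, as in this toy sketch (the bootstrap samples here are made up):

```python
import numpy as np

# Hypothetical bootstrap distribution for one bar in Figure 4A
rng = np.random.default_rng(0)
boot = rng.normal(loc=0.02, scale=0.008, size=10_000)

ci = np.percentile(boot, [2.5, 97.5])  # 95% CI, here well above zero
# Two-sided bootstrap p-value against a null of zero
p = 2 * min((boot <= 0).mean(), (boot >= 0).mean())
```

So my guess is that the discrepancy reflects something else, e.g., a multiple-comparisons correction across bars or a different (permutation-based?) null; whatever the answer, stating it explicitly in the caption would help.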

Overall, the Methods section is very thorough and very clearly written—nice!

Page 15: In the “Experimental procedure” section, maybe give us the total duration as well; it looks like about 2 hours of stories per subject? (I now see you have this on page 17, but maybe mention it earlier as well.)

Page 20: Do you want the “Quantifying functional individual differences…” section heading to be nested inside “Group-level cortical map of lexical-semantic tuning”? Might be worth popping this out to the higher heading level.

Page 20: When you say “the set of all lexical-semantic model weights can be modeled as a multivariate Gaussian distribution…”, the elements of the “set” you’re referring to here are voxels, correct? I.e., the data points in this Gaussian distribution are individual voxels, each contributing one weight vector in the model feature space. Could be worth stating this more explicitly throughout.
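In other words, something like the following, where the rows are the samples (a hypothetical sketch with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_features = 80_000, 985  # hypothetical sizes
weights = rng.standard_normal((n_voxels, n_features))  # one row per voxel

# The Gaussian is fit over voxels: each voxel contributes one sample
# in the n_features-dimensional model feature space.
mu = weights.mean(axis=0)              # mean weight vector, (n_features,)
Sigma = np.cov(weights, rowvar=False)  # feature covariance, (n_features, n_features)
```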

References:

Chang, C. H., Nastase, S. A., & Hasson, U. (2022). Information flow across the cortical timescale hierarchy during narrative construction. Proceedings of the National Academy of Sciences of the United States of America, 119(51), e2209307119. DOI

Dubois, J., & Adolphs, R. (2016). Building a science of individual differences from fMRI. Trends in Cognitive Sciences, 20(6), 425–443. DOI

Fedorenko, E., Hsieh, P. J., Nieto-Castañón, A., Whitfield-Gabrieli, S., & Kanwisher, N. (2010). New method for fMRI investigations of language: defining ROIs functionally in individual subjects. Journal of Neurophysiology, 104(2), 1177–1194. DOI

Finn, E. S., Glerean, E., Khojandi, A. Y., Nielson, D., Molfese, P. J., Handwerker, D. A., & Bandettini, P. A. (2020). Idiosynchrony: from shared responses to individual differences during naturalistic neuroimaging. NeuroImage, 215, 116828. DOI

Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453–458. DOI