Open reviews

October 7, 2025 — Open review of Peraza and colleagues:

Peraza, J. A., Kent, J. D., Nichols, T. E., Poline, J. B., de la Vega, A., & Laird, A. R. (2025). NiCLIP: Neuroimaging contrastive language-image pretraining model for predicting text from brain activation images. bioRxiv. DOI


Peraza and colleagues present NiCLIP, a CLIP-based neural network model that learns to decode cognitive terms from coordinate-based brain activation maps. The model outputs predictions based on task, concept, and domain labels from the Cognitive Atlas. This represents a qualitative advance over previous reverse-inference models (like NeuroSynth) by learning a nonlinear mapping between brain images and words, taking advantage of the rich contextual-semantic representations encoded by large language models (LLMs). The manuscript is very timely and already in pretty good shape. The methodology seems solid and I believe the results. That said, I found certain bits of the text difficult to follow; for example, I had a hard time understanding the distinguishing features of the models, which comparisons are most important, and which elements of NiCLIP are driving the improved performance. Most of the following comments are clarification questions or suggestions that I think the authors can readily address.

In my first read through the manuscript, I had some difficulty following the narrative. I found myself asking questions like “wait, what exactly are the differences between the NiCLIP model and the CLIP model they were discussing in the previous section?” and “what exactly is this model trained and tested on? and is that different from the previous model?” Is the distinguishing feature of the “CLIP” model that it’s text-to-brain, whereas the “NiCLIP” model is brain-to-text (and also trained on CogAtlas)? Couldn’t you theoretically also decode caption-style text directly from the CLIP model trained on brain images, without the CogAtlas? A good deal of this becomes somewhat clearer upon reading the Methods (at the end), but I think readers would benefit from a little more hand-holding throughout the Results. To be clear, I don’t think this is done poorly at all even in the current version—it’s just that this whole methodology is a complex beast, and very few readers will be familiar enough with all the different components to fully “get it” on the first read. Maybe introducing each section with a question or motivation sentence would help.

Following up on the previous comment about training/testing, should readers be worried about potential leakage between the PubMed Central data used for training and the HCP data used for testing the models? Isn’t it possible that some of the PMC training articles report coordinates derived from exactly the same HCP data you use to test the model? I assume the authors ensure these are non-overlapping somehow (or maybe I just don’t fully understand the structure of the data), but I think this could be made more explicit. Related thought: I assume the articles are effectively randomized with respect to topic, so that you don’t end up holding out a large chunk of articles on a single topic for a particular test set?
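To make the concern concrete, here is a minimal sketch of the kind of overlap check I have in mind; the file names and ID lists are hypothetical placeholders, not anything taken from the manuscript or its repository:

```python
# Hypothetical sketch: check whether any article in the PMC training corpus
# is also an article known to report results derived from HCP data.

# IDs (e.g., PMCIDs) of articles whose coordinates went into the training corpus
with open("pmc_training_ids.txt") as f:
    training_ids = {line.strip() for line in f if line.strip()}

# IDs of articles known to report HCP-derived coordinates
with open("hcp_derived_ids.txt") as f:
    hcp_ids = {line.strip() for line in f if line.strip()}

overlap = training_ids & hcp_ids
print(f"{len(overlap)} potentially leaky articles, e.g.: {sorted(overlap)[:10]}")
```

Even a one-sentence statement that a check like this was performed (or is unnecessary given how the splits are constructed) would reassure readers.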

I had some difficulty understanding how exactly the language component of the model encodes the text. For example, in my own work with LLMs, we often use the time series of word-by-word embeddings to capture the meaning of text. Does a single embedding for an entire article comprising thousands of words really capture all the nuances of meaning (the “deep semantic relationships” the authors advertise in the Introduction) in that article? I could understand how a whole trajectory of word-by-word embeddings could capture the narrative of an article in a fairly rich, context-sensitive way—but wouldn’t you lose a good bit of this meaning and structure in collapsing the article into a single embedding?
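I don’t know exactly how the authors pool the text (the manuscript may use a CLS-style token or some other scheme), but here is a toy illustration of what worries me, using random vectors as stand-ins for contextual embeddings:

```python
import numpy as np

# Toy illustration (random vectors standing in for contextual embeddings):
# a word-by-word trajectory for a ~3,000-word article vs. one pooled vector.
rng = np.random.default_rng(0)
token_embeddings = rng.standard_normal((3000, 768))  # (n_tokens, embedding_dim)

article_embedding = token_embeddings.mean(axis=0)    # collapses to shape (768,)

# The trajectory preserves ordering and local context (3000 x 768 numbers);
# the pooled vector keeps only 768, so narrative structure is necessarily lost.
print(token_embeddings.shape, article_embedding.shape)
```

A sentence or two in the text about what information survives this collapse (and why it is sufficient for the decoding task) would help.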

Table 2 is very dense. Can you hold the reader’s hand a bit more as to which numbers we should be comparing? For example, am I correct in understanding that GC-LDA task Recall@4 (17.14) outperforms NiCLIP task Recall@4 (10.71)? Isn’t this comparison a bit surprising?

The authors mention bag-of-words methods like TF-IDF in the Introduction, suggesting that LLMs will improve on this method. This set me up to expect a comparison to TF-IDF—but then I didn’t see it directly mentioned. Is the NeuroSynth baseline model effectively using TF-IDF?
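For reference, the comparison I was expecting is against something like the following bag-of-words representation (a rough sketch with made-up example documents); if the NeuroSynth baseline is effectively doing this, it would help to say so explicitly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy TF-IDF representation of the kind I assumed would serve as the
# bag-of-words baseline (the documents below are made up for illustration).
docs = [
    "working memory task with n-back paradigm",
    "reward anticipation during monetary incentive delay",
    "episodic memory retrieval task",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix
print(X.shape, vectorizer.get_feature_names_out()[:5])
```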

Can you say a little bit more about the metrics (e.g., Recall@k and Mix & Match, introduced in section 2.2) when they first appear, without requiring the reader to refer to the Methods? For example, in section 2.3 you say, “In decoding, Recall@k represents…”—a nice, concise definition like this would also be useful earlier on.
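For what it’s worth, here is how I understood Recall@k from the text; this is my reconstruction, not the authors’ code, and the example labels are invented:

```python
import numpy as np

def recall_at_k(true_labels, ranked_predictions, k):
    """Fraction of samples whose true label appears in the top-k predictions.
    This matches my reading of 'Recall@k' in the manuscript, but it is my
    reconstruction, not the authors' implementation."""
    hits = [true in ranked[:k] for true, ranked in zip(true_labels, ranked_predictions)]
    return float(np.mean(hits))

# Hypothetical example: 3 test maps, each with a ranked list of predicted terms.
true = ["n-back", "stroop", "reward"]
ranked = [
    ["working memory", "n-back", "attention", "reward"],
    ["reward", "emotion", "stroop", "inhibition"],
    ["reward", "punishment", "gambling", "loss"],
]
print(recall_at_k(true, ranked, k=4))  # 1.0: every true label appears in the top 4
```

An upfront sentence along these lines would save the reader a trip to the Methods.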

On page 9, you refer to “The reduced and enhanced versions of Cognitive Atlas.” What does “enhanced” mean here? Are you referring to two different versions (1 = reduced, 2 = enhanced) of the CogAtlas, or do you just mean the “reduced” version is also “enhanced”?

On page 9, you say “The predictions of domains consistently showed higher recall rates than tasks and concepts across all models and configurations…” Could this difference just be due to differences in the structure of these target variables? For example, maybe the set of domains has fewer distinct elements than the set of tasks, so it makes for an easier decoding problem? Some kind of shuffled/permuted/null baseline could provide a useful point of comparison here.
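To make the suggestion concrete, here is a quick sketch of the kind of null I have in mind; the label-set sizes are made up for illustration, not the actual counts of domains, tasks, or concepts:

```python
import numpy as np

# Rough sketch of a shuffled baseline: chance-level Recall@k depends only on
# the size of the label set, so smaller label sets look "easier" by default.
def chance_recall_at_k(n_labels, k, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_samples):
        true = rng.integers(n_labels)
        top_k = rng.choice(n_labels, size=k, replace=False)  # random ranking
        hits += true in top_k
    return hits / n_samples  # approaches k / n_labels

print(chance_recall_at_k(n_labels=12, k=4))   # small label set -> high chance recall
print(chance_recall_at_k(n_labels=400, k=4))  # large label set -> low chance recall
```

Reporting recall relative to such a chance level (or an equivalent analytic correction) would make the domain-versus-task/concept comparison easier to interpret.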

For the impact of this paper, I think it’s important to ask: How can people actually use this model? Can you provide a little more logistical detail (or a recipe) for how others might use the trained model and code for their own research? (I see there are some pointers on the GitHub repo… Are the authors planning to build a NeuroSynth-style website for this?)

Page 3: “Finally, we examined the extent to which NiCLIP’s capability in predicting subject-level activation maps.”—seems like there’s a word missing here.

5.1.2: “and may not always be factual”—I’m not sure what “factual” would mean in this context, but I get your point; maybe “widely agreed upon” is better?

5.2: “The text encoder is characterized by a projection head and two residual heads, while the image encoder comprises three residual heads.” Is “head” the typical terminology here? I’m not super familiar with CLIP, but I’m getting interference between this use of “head” and the attention heads within each layer.