Open reviews

October 20, 2025 — Open review of Ke and colleagues:

Ke, J., Chamberlain, T. A., Song, H., Corriveau, A., Zhang, Z., Martinez, T., Sams, L., Chun, M. M., Leong, Y. C., & Rosenberg, M. D. (2025). Ongoing thoughts at rest reflect functional brain organization and behavior. bioRxiv. DOI


Ke and colleagues use a paradigm where participants rest for short intervals and then report the content of their preceding thoughts, allowing the authors to link patterns of resting-state functional connectivity (rsFC) with patterns of thought. They quantify these thought patterns using behavioral ratings on dimensional scales (thought dimensions), verbal reports converted into sentence embeddings (thought content), and human annotations of the verbal reports (thought topics). They first outline some of the patterns of thought observed over the course of the experiment, both within individual subjects and across subjects. Next, they show that rsFC correlates with reported thought patterns using both representational similarity analysis (RSA) and connectome-based predictive modeling (CPM). They include several supporting analyses showing, for example, that models trained to predict thought patterns from rsFC also predict pupil dilation and differentiate imagery-related thoughts in an aphantasic individual from those of their non-aphantasic twin. Finally, they show that models trained to associate rsFC patterns with thought patterns in their dataset generalize to subject-level behavioral variables in open rsFC data from the Human Connectome Project (HCP).

This is a really neat paper that pulls together a variety of different threads to get some traction on the content of spontaneous thought during rest. The manuscript is well-written and most of the figures are clear. My main comments relate to the interpretation of the design (is this really “rest”?) and what I felt to be a tension between variance in thoughts across trials and across subjects. I hope that articulating some of the points where I felt unsure can help the authors solidify the narrative of the manuscript.

Major comments:

This is a clever design, but one of my first thoughts when reading this manuscript was: “is this really rest?” Is a 30-second period of downtime really enough for people to engage in the kind of cognition we see in, e.g., a 10-minute rest period? Is 30 TRs of neural activity enough to pick up on the dynamics driving connectivity observed in longer rest periods? I wonder if subjects may have perceived this “rest” period as more of a task prompt to “make up your own task and then describe it”. Subjects will presumably habituate to the 30-second intervals and might simply be preparing to answer the very predictable set of upcoming questions in a particular way. (Perhaps the authors have an argument against this “demand characteristics” interpretation given the widespread involvement of DMN across thought processes.) My point here is that the authors should write in a way that will reassure and/or convince readers who might be thinking “this isn’t really rest” all the way through the manuscript. For example, I think their generalization analysis to standard-issue resting-state data from HCP can also provide an argument that the connectivity they measure in these 30-second chunks is something akin to rest. To be clear, I don’t think this is a design flaw; it’s a necessary compromise for interrogating the subject’s thoughts. There’s not much the authors can do other than be very judicious in motivating their design and interpreting their results.

Probably my most central comment on this manuscript is that it feels like there’s a tension in the narrative between within-subject variance in thoughts across trials and trait-like variance in thoughts across subjects. In my mind, capturing the first type of variance is much more interesting and novel. I understand this is kind of a chicken-and-egg problem and that both types of structure are present in the data: a given subject’s thoughts vary over time, but the frequencies of certain thoughts are overall higher in some subjects than others. But the results often hop back and forth between these two types of findings, without fully unpacking the implications, making the narrative a bit hard to follow. For example, in the first set of rsFC results (page 7), the authors show that (1) when averaging FC matrices across trials, similarity in FC matrices tracks with similarity in thought content across subjects (a trait-like effect); and (2) trial-by-trial similarity of FC matrices correlates with trial-by-trial similarity in thought content within subjects. To my reading, the latter of these two findings is much more germane to the question of the dynamic stream of thought during rest. Note that this does not have to strictly be computed within subjects; the authors could also assess whether trial-by-trial similarity between rsFC matrices correlates with trial-by-trial similarity in thought content across subjects, right? I think the CPM analysis asks roughly this question, but I’d be curious to see whether this effect holds up in the simpler RSA-style analysis. On the other hand, I think the HCP analysis can only inform trait-like variance—right? Any effort to clarify how these different results fit into this tension between variability across trials versus variability across subjects would help the narrative.
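To make that suggestion concrete, here is a rough sketch of the kind of trial-by-trial RSA I have in mind, restricted to cross-subject trial pairs. The variable names (fc_trials holding one vectorized FC matrix per trial, thought_trials holding one thought embedding per trial, subject_ids labeling each trial's subject) are hypothetical, and this is not meant to mirror the authors' pipeline:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def trialwise_rsa_between_subjects(fc_trials, thought_trials, subject_ids):
    """Correlate trial-by-trial rsFC similarity with trial-by-trial thought
    similarity, keeping only pairs of trials from different subjects."""
    # Trial-by-trial similarity matrices (1 minus correlation distance)
    fc_sim = 1 - squareform(pdist(fc_trials, metric="correlation"))
    th_sim = 1 - squareform(pdist(thought_trials, metric="correlation"))

    # Upper-triangle mask of cross-subject trial pairs (diagonal excluded)
    subj = np.asarray(subject_ids)
    mask = np.triu(subj[:, None] != subj[None, :], k=1)

    return spearmanr(fc_sim[mask], th_sim[mask])
```

Inference would of course need a permutation scheme that respects the subject structure (e.g., shuffling trial order within subjects), but something along these lines would speak directly to whether trial-level rsFC–thought correspondence holds across subjects.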

The authors average the behavioral thought dimension ratings and thought content embeddings in different ways to ask different questions. I’m wondering how to interpret some of these averages. For example, to assess the reliability of thought dimensions across sessions, they average thought vectors across trials within each session. These thought dimensions are already z-scored “within participants and dimensions” (page 4, line 12); doesn’t this mean that the averages will tend toward zero (or the midpoint of the scale)? Even aside from the impact of z-scoring, won’t much of the interesting dynamic structure of the ratings over trials collapse toward zero in the average? If I understand correctly, the resulting average, a sort of static “pancake” of the ratings over time, will mainly capture stable biases: perhaps subject X, in the average, tends to have above-midpoint ratings on “My thoughts are in the form of images”. It seems like this will discard much of the interesting structure in subject X’s thoughts. I wonder whether you could, for example, concatenate all the ratings over trials rather than averaging them, but then the sequence of thoughts would not align across sessions or across subjects. As a reader, my uncertainty about how to interpret these averages feeds into my uncertainty about the tension between the within-subject dynamics of thoughts and the across-subject, trait-like fingerprints of thoughts (previous comment).
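To illustrate what I mean with made-up numbers (assuming 32 trials per session, two sessions, and an arbitrary number of rating dimensions; this is purely a toy, not the authors’ data):

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.normal(loc=3.5, scale=1.2, size=(64, 13))   # 2 sessions x 32 trials, 13 dims (arbitrary)
z = (ratings - ratings.mean(0)) / ratings.std(0)           # z-score within participant and dimension

sess1, sess2 = z[:32].mean(0), z[32:].mean(0)              # per-session averages
print(np.abs(sess1).mean(), np.abs(sess2).mean())          # these hover near zero...
print(np.abs(z).mean())                                    # ...while trial-level deviations are ~0.8 SD
```

Any trial-to-trial structure is gone from those session averages; only the stable offsets survive.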

I’m wondering how much temporal autocorrelation we would see in both thought patterns (e.g., dimensional ratings, content embeddings) and rsFC across trials within a session. I imagine that the content of thought on trial t will tend to bleed over into trial t + 1, especially since subjects are talking about and rating their thoughts from trial t immediately before trial t + 1 begins. Is there any simple way to quantify this? (I also wonder whether the magnitude of autocorrelation would vary from person to person; some people might be more “steady” and others more “jumpy” in their thinking.) Relatedly, the authors used nested mixed effects models to test whether trial number (from 1 to 32) predicts thought dimensions, and find that it does. I wonder whether temporal autocorrelation should be accounted for in that model somehow.
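One simple way to quantify this, sketched here with an assumed data layout rather than the authors’ actual variables, would be the lag-1 autocorrelation of each thought dimension across a session’s trials, computed per subject:

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D series of trial-wise ratings."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def per_subject_autocorr(ratings_by_subject):
    """ratings_by_subject: hypothetical dict of subject ID -> (n_trials x n_dims) array."""
    return {sub: np.array([lag1_autocorr(r[:, d]) for d in range(r.shape[1])])
            for sub, r in ratings_by_subject.items()}
```

The spread of these values across subjects would also speak to the “steady” versus “jumpy” question, and a lagged rating (trial t − 1) could be entered as a covariate in the mixed model, or the trial-number effect compared against circularly shifted surrogate series, to account for the autocorrelation.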

I find it difficult to get much out of the brain network plots in Figure 4 panel a—if I squint, maybe I can see some particular connections (or laterality effects?), but it mostly looks like spaghetti. I understand this is likely just the true nature of the data, but is there any way the authors could threshold these maps more stringently, or stretch the scale of line thickness, or anything else to make it easier for readers to get more of an intuition of what’s connected to what in these plots? I understand that the matrices in panel b answer this question more clearly, but if I’m supposed to rely on the matrices for interpretation, then you might consider moving the brain connectivity plots to the supplement entirely. Sidenote about panel b: Could you include a little brain visualization of the networks as part of the labels/key? I don’t know what differentiates, e.g., Visual I and Visual II in this particular atlas, and would have to refer to the atlas paper to find out.
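For what it’s worth, a more aggressive edge threshold is easy to apply at plotting time; here is a quick sketch with made-up data, where adjacency and coords stand in for the paper’s edge weights and atlas node coordinates:

```python
import numpy as np
from nilearn import plotting

# Hypothetical stand-ins for the paper's edge weights and atlas coordinates
rng = np.random.default_rng(0)
n_nodes = 100
adjacency = rng.normal(size=(n_nodes, n_nodes))
adjacency = (adjacency + adjacency.T) / 2              # symmetrize
coords = rng.uniform(-70, 70, size=(n_nodes, 3))       # stand-in MNI coordinates

# Keep only the strongest 1% of edges so individual connections are legible
plotting.plot_connectome(adjacency, coords, edge_threshold="99%", node_size=10)
plotting.show()
```

Keeping only a small fraction of the strongest edges (or scaling line width by weight) might make the key connections legible without changing any results.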

On page 5, you describe how thought patterns become more self-similar within subjects and more distinct from other subjects over time. I’m struggling with the intuition for this. How can each subject become increasingly different from all others? Is it possible that this is some kind of statistical artifact of averaging or regression to the mean? (Not to mention that the effect sizes are very small: −.004 and −.001.)

Do the authors take any special care to mitigate the effects of increased head motion during the verbal response periods immediately following rest? For example, I imagine the head motion measurements (e.g., FD) have much higher magnitude and variance during the speaking periods than during the rest periods. (Plotting the FD trace across an entire run, I would expect it to alternate between high-variance and low-variance stretches.) Will regressing head motion time series like these out of the full-run time series actually capture noise due to head motion in the rest periods, where variance in the confound variable(s) is much lower? My worry is that it will regress out the big head motion in the speaking periods and effectively ignore the small head motion in the rest periods. We’ve encountered similar issues with paradigms that switch between passive listening (low head motion) and speaking (high head motion) within a single run. One alternative might be to include a simple binary “task” regressor to account for mean signal changes between the rest and speaking periods. Perhaps a better alternative: split the head motion parameters into two separate sets of regressors, where one set captures the rest periods (with zeros during the speaking periods) and the other captures the speaking periods (with zeros during the rest periods); a linear combination of these fully recapitulates the single original set of regressors, but with the option to assign different coefficients to rest versus speaking confounds. I suspect this isn’t a big problem and the results are likely sound, but I wanted to flag this possible complication.
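To sketch that second suggestion concretely (hypothetical variable names; not meant as the authors’ pipeline):

```python
import numpy as np

def split_confounds_by_period(confounds, is_speaking):
    """confounds: (n_TRs x n_regressors) motion parameters / FD;
    is_speaking: boolean (n_TRs,) marking speaking-period TRs."""
    is_speaking = np.asarray(is_speaking, dtype=bool)[:, None]
    rest_set = np.where(~is_speaking, confounds, 0.0)    # motion regressors active only at rest
    speak_set = np.where(is_speaking, confounds, 0.0)    # motion regressors active only while speaking
    task_regressor = is_speaking.astype(float)           # binary rest-vs-speaking regressor
    return np.hstack([rest_set, speak_set, task_regressor])
```

Since the two sets sum to the original regressors, this nests the standard approach while letting the model weight the large speaking-period excursions separately from the subtler rest-period motion.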

Minor comments:

Figure 1: I got confused by the mention of “Twitter-roBERTa” in the figure, given that it’s not mentioned in the figure caption or in the text until very briefly on page 8 and in the Methods on page 23; maybe add a brief mention in the caption/text, or simply remove this annotation from the figure if it’s not important.

Page 4, line 20: “participants tended to think in the form of images when reflecting on the past, others or positive content, but in words when thinking about the future”—this is such a neat little result!

Figure 3: Why is the null distribution so high in panel d? I was expecting something like 1/9. I suppose this is due to the imbalanced class frequencies across samples (as you mention on page 24)? I would explicitly state the number of classes in the figure caption, and mention in the Methods that empirical chance is higher than naively expected due to the imbalance. (A metric like balanced accuracy could also be used to correct for this in both the actual data and the null distribution.)
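As a sketch of that last suggestion (hypothetical variable names, and note that a fully rigorous null would refit the classifier on permuted labels rather than just shuffling labels against fixed predictions):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def balanced_accuracy_with_null(y_true, y_pred, n_perm=1000, seed=0):
    """Balanced accuracy (mean per-class recall) with a quick label-shuffling null;
    chance balanced accuracy is ~1/n_classes even under class imbalance."""
    rng = np.random.default_rng(seed)
    observed = balanced_accuracy_score(y_true, y_pred)
    null = np.array([balanced_accuracy_score(rng.permutation(y_true), y_pred)
                     for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, null, p_value
```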

Page 19, line 16: “whether individual’s” > “whether individuals’”

Page 21, line 36: Is it typical to use a nonlinear SVR for CPMs? Seems pretty aggressive to use a nonlinear model like this with relatively few samples. How many samples total are going into the model, N = 50 × 20+ trials?

Page 23, line 32: “Twitter-roBerta-based” > “Twitter-roBERTa-base”

Page 24, line 12: “We assumed a conservative permutation testing”—odd phrasing

Page 25, line 6: “if they have” > “if they had”