Open reviews

September 25, 2023 — Open review of preprint by Kewenig and colleagues:

Kewenig, V., Edwards, C., DEstalenx, Q. L., Rechardt, A., Skipper, J. I., & Vigliocco, G. (2023). Evidence of human-like visual-linguistic integration in multimodal large language models during predictive language processing. arXiv. link


Kewenig and colleagues compare the multimodal vision/language model CLIP—which learns to fuse images with text captions—to human judgments of the predictability of a word given the preceding visual context. They find that human judgments are modestly correlated with CLIP’s predictability scores across 100 movie clips. Interestingly, they also show that CLIP’s attention mechanism “fixates” on certain parts of the visual input in a similar way to human gaze patterns measured by eye-tracking. This is a neat study and I generally believe the results, but I think it’ll require a bit more work to fortify some of the conclusions. I have some methodological questions, as well as some questions about the controls and the resulting interpretations.

Major comments:

I don’t fully understand how the authors extract a “predictability score” from CLIP. I’m not very familiar with the CLIP architecture and objective, so this is mostly just confusion on my side. My understanding is that CLIP’s image encoder is effectively a vision transformer, which means that, as an image passes through the transformer layers, the attention heads attend to different features in particular image patches. Is this correct? Can the authors provide a little more detail about this architecture; e.g. how is an image broken into patches and how big are they? I believe that CLIP’s text encoder is an autoregressive/causal transformer similar to GPT. I vaguely understand that the model is trying to learn an embedding space that effectively fuses image representations paired with the correct text captions—but I don’t fully understand what the model is trying to “predict”. Is the language encoder trying to predict the next word in a given caption like GPT (the authors mention “next-word prediction” in the abstract)? Does the language encoder’s context window see previous captions from the movie as well, or only the current caption? Is the image encoder trying to predict the corresponding text embedding? The authors compute their “predictability score” as “the [softmaxed] dot product of the image feature vector and each text feature vector.” Is this the objective the model is trying to optimize? Is there a precedent in the literature for computing this score? This makes sense to me intuitively; i.e. how closely does the output of the image encoder match the output of the text encoder. But I don’t really understand how this interacts with the objective function of the language encoder. Anyway, a little more guidance here would be helpful for readers (like me) who are not particularly familiar with CLIP.
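
For what it’s worth, here is a minimal sketch of how I currently understand the predictability score to be computed, assuming the standard openai/clip package (the image file and candidate captions below are invented for illustration, and this may well not match what the authors actually did—which is exactly the kind of detail I’d like spelled out):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical movie frame and candidate completions for the target word
image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
candidates = ["she picked up the knife",
              "she picked up the phone",
              "she picked up the book"]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # one embedding per image
    text_features = model.encode_text(text)      # one embedding per candidate caption
    # scaled cosine similarities between the image and each caption...
    logits_per_image, _ = model(image, text)
    # ...softmaxed into a distribution over candidates
    predictability = logits_per_image.softmax(dim=-1).squeeze()
```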

I don’t follow the reasoning behind the “LSTM + ResNet” control model. If I understand the authors’ intentions correctly, they put this model forward as “a comparable model [to CLIP] without attention”. I think the authors can outline the motivation for this choice more clearly and explicitly. What hypothesis is this model testing? My assumption is: we hypothesize that a comparable model—but without any attention mechanism—will yield lower correspondence to human behavior. But I worry that there are many other potential differences that could drive the decrement. For example, does this model actually perform captioning well? Shouldn’t the LSTM be able to mimic the attention effect to a certain extent by retaining certain visual features in memory? But it might simply not learn to do this well based on the objective function, the training data, and so forth. The model may learn a set of weights that simply perform poorly overall, which means we can’t easily ascribe any difference (with respect to CLIP) to the attention mechanism alone (Antonello & Huth, 2022, comes to mind). With all this in mind, I don’t really find this to be a convincing control, and I would consider relegating it to supplementary materials and downplaying any conclusions drawn from this comparison.

There were several points in the Results section where I couldn’t really tell what exactly was going on without digging through the Methods. I would try to make this a little clearer for a results-first manuscript. (1) I would add one more sentence to the beginning of the Results section describing how the words/clips were chosen (you explain this in the Methods, but the reader shouldn’t have to sift through the Methods to understand the motivation). (2) At the beginning of the Results section, when describing the human experiment, I would make clear whether participants are just seeing the visual content, or whether they’re also hearing dialogue, seeing captions, etc. (3) When introducing the GPT-2 analysis in the Results section, can you make it clear what you input to GPT-2? Is it only the transcript of the language in the movie clip? Is it the transcript as well as captions?

How exactly did you shuffle the attention weights for the perturbation analysis with CLIP? I assume you took all the weights across all heads and shuffled these? (Rather than, e.g. shuffling weights within heads, or shuffling heads while keeping head-specific weights intact?) I agree with the concern that this perturbation may too aggressively “break” the network’s intrinsic structure. Are there any other control analyses that might corroborate this result? For example, have you tried shuffling the pairing between video clips and text (within movie) so that the model receives mismatching image and text inputs?
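
To make the distinction concrete, here is a toy illustration of the three perturbations I have in mind (the tensor shapes are invented; this is not meant to reflect the authors’ actual implementation):

```python
import torch

# Toy attention tensor for one layer: (n_heads, n_queries, n_keys)
attn = torch.softmax(torch.rand(12, 50, 50), dim=-1)

# (a) Shuffle all weights jointly, across heads and positions
flat = attn.flatten()
shuffled_all = flat[torch.randperm(flat.numel())].reshape(attn.shape)

# (b) Shuffle weights within each head, preserving head identity
shuffled_within = torch.stack([
    h.flatten()[torch.randperm(h.numel())].reshape(h.shape) for h in attn
])

# (c) Shuffle whole heads, keeping each head's weight pattern intact
shuffled_heads = attn[torch.randperm(attn.shape[0])]
```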

Can you explain the DBSCAN procedure in a bit more detail? First, what was the motivation for using smoothing and clustering (and this specific clustering algorithm) rather than quantifying similarity between the raw heatmaps? Another question: what exactly were the samples/inputs to DBSCAN? My understanding is that you supplied each heatmap separately to DBSCAN, which then clusters the heatmap pixel values in the two-dimensional image space? Does DBSCAN ultimately yield discrete, hard clusters? That is, do the heatmap pixels get discretized into “looking” and “not looking”?
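
For reference, here is roughly the pipeline I am picturing, sketched with scikit-learn’s DBSCAN (the heatmap size, threshold, and clustering parameters are all placeholders); if the actual procedure differs, a description at about this level of detail would help:

```python
import numpy as np
from sklearn.cluster import DBSCAN

heatmap = np.random.rand(224, 224)           # placeholder attention/gaze heatmap

# Threshold the smoothed heatmap, then cluster the surviving pixel coordinates
threshold = heatmap.mean() + heatmap.std()   # arbitrary illustrative threshold
ys, xs = np.where(heatmap > threshold)
coords = np.column_stack([xs, ys])

labels = DBSCAN(eps=5, min_samples=10).fit_predict(coords)

# Hard, binary "looking" mask: pixels assigned to any cluster (label != -1)
mask = np.zeros(heatmap.shape, dtype=bool)
mask[ys[labels != -1], xs[labels != -1]] = True
```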

Following on the previous comment, I’m a little worried about the permutation test used to assess the significance of the Jaccard score for alignment between human gaze and model attention. First, why is the permutation distribution so high? Is it because Jaccard scores are positively biased? The distribution seems fairly narrow, but it’s still worrisome that the mean of the null distribution is .174 and the observed effect is .180. Second, I’m worried that randomly shuffling heatmaps is a weak control, because this perturbation will disrupt the intrinsic spatial autocorrelation structure of the original maps. Is there some other way to construct a null distribution? For example, more simply, you could just recompute the Jaccard scores after shuffling paired maps between humans and CLIP; i.e. compute Jaccard scores with mismatching heatmaps.
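
Concretely, the alternative null I have in mind looks something like the following (the masks, clip count, and iteration count are placeholders, not the authors’ data):

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index between two binary masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

rng = np.random.default_rng(0)

# Placeholder binary masks; in practice, the clustered gaze/attention maps per clip
human_masks = [rng.random((224, 224)) > 0.9 for _ in range(100)]
model_masks = [rng.random((224, 224)) > 0.9 for _ in range(100)]

observed = np.mean([jaccard(h, m) for h, m in zip(human_masks, model_masks)])

# Null: shuffle which model map gets paired with which human map
null = np.array([
    np.mean([jaccard(h, model_masks[j])
             for h, j in zip(human_masks, rng.permutation(len(model_masks)))])
    for _ in range(1000)
])
p = (np.sum(null >= observed) + 1) / (null.size + 1)
```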

I think the authors should be a little more careful about how they interpret results from the language-only GPT-2 control model. For example, in the Abstract, the authors report “no alignment for a unimodal state-of-the-art LLM”—but there is some alignment, only weaker. But this makes sense; we should expect some level of alignment, given that (1) the humans are also hearing the language that GPT-2 receives as input (right?), and (2) presumably the linguistic content of the film partly correlates with the visual content. I’m also wondering how much of the discrepancy here is driven by the task instructions in the human experiment. If the human subjects were explicitly told to rate how predictable the target word was given the visual content of the video, this may divert them away from GPT-2-like prediction. (As a counter-example, if the human subjects were told to focus on the language in the video, we would expect more GPT-2-like predictions—right?)

One of the core results—the correlation between human predictability estimates and CLIP’s predictability estimates—is r = .22. This isn’t particularly high, and I’m curious if the authors have any speculative ideas as to why this correlation isn’t higher. Is this due in part to the authors’ selection of words/clips? One thought that occurs to me is that CLIP only operates on one frame at a time—right? Humans may be sensitive to dynamic information evolving across frames that CLIP does not have access to; we see a similar divergence in face recognition networks with dynamic video clips of faces (Jiahui et al., 2022).

Minor comments:

You say: “However, no study has yet compared natural language comprehension in multimodal large language models and humans in [a] quantifiable way.”—is this really true?

Some of the figures don’t really look “publication quality” yet. I would increase (and standardize) font sizes across all figures to make the labels/axes more readable.

Figure 2. Why do you think the human scores are so bimodal? I would center the x-axis on all three of these panels around zero. Is it worth showing the scatterplot for the correlation between CLIP and human predictability judgments?

Figure 3: Is there any metric of variance we can include as error bars on this bar plot? For example, 95% confidence intervals around the mean correlation computed by resampling human subjects with replacement?
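
Something along these lines, for example (the per-subject correlations below are simulated placeholders, just to show the percentile bootstrap I mean):

```python
import numpy as np

rng = np.random.default_rng(0)
per_subject_r = rng.normal(0.2, 0.1, size=30)   # placeholder per-subject correlations

# Percentile bootstrap: resample subjects with replacement, recompute the mean
boot_means = [rng.choice(per_subject_r, size=per_subject_r.size, replace=True).mean()
              for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```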

You average attention maps across 5 frames. This corresponds to ~200 ms segments (assuming ~25 fps), right? Can you spell out the timing a little more explicitly?

It might be worth discussing the difference between the tasks the humans and CLIP are solving. For example, you say the human task corresponds to a “process [that] mirrors the model’s computations”; but is the human task really comparable to the task CLIP is solving? For example, humans know the target word beforehand—unlike CLIP—and therefore might be looking at particular parts of the image based on this foreknowledge that the model doesn’t have. To be clear, I think the task is fine; it’s an interesting way to probe the alignment between human and model. I just think the authors should make sure the text acknowledges the difference as well.

The fact that CLIP’s attention mechanism can look at multiple places at once, unlike the human, might be worth further discussion. The authors “averaged the attention heatmaps across all attention heads to create a single attention heatmap,” which, combined with the clustering, may dilute the more “inhuman” qualities of multi-headed attention. (This also makes me wonder if there’s any work in machine learning constraining the model to attend to a sequence of singular locations more in keeping with human foveation/saccades; sometimes these kinds of limitations can push the model to learn something more interesting.)

I don’t see the relevant code on the GitHub repository provided (https://github.com/ViktorKewenig)—maybe the repository was unintentionally left private?

References:

Antonello, R., & Huth, A. (2022). Predictive coding or just feature discovery? An alternative account of why language models fit brain data. Neurobiology of Language. DOI

Jiahui, G., Feilong, M., di Oleggio Castello, M. V., Nastase, S. A., Haxby, J. V., & Gobbini, M. I. (2022). Modeling naturalistic face processing in humans with deep convolutional neural networks. bioRxiv. DOI