Rorschach Test

One of the more fascinating aspects of current generative video models is their ability to imagine the intervening frames of a sequence. Give them a first and last frame, and a prompt, and the model attempts to create a clip that moves from A to B in the desired fashion.

To do this the model must imagine new frames – an act of imagining driven by what the model ‘sees’ in the images and by what it can extrapolate from its internal representations. This is of course a boon for creating transition shots for AI movies, but I am more intrigued by what it can tell us about the subconscious of the model.

The model will attempt this feat no matter what it is presented with. Indeed, the most fascinating results can arise when the model is ‘un-prompted’, with no textual description of the transition at all.

In this work I have used this technique to transition between the ten original ink blot images from the infamous Rorschach test, developed in 1921 by the Swiss psychiatrist Hermann Rorschach. It was originally designed as a test for schizophrenia, before becoming more widely adopted as a free-association tool.
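As a rough illustration of the setup, here is a minimal sketch of how the ten images might be paired into first/last-frame requests with an empty prompt. The file names, the `TransitionJob` structure and the choice to pair each plate with the next one are all assumptions made for the example, not a description of the actual pipeline or of any particular model's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionJob:
    """One clip request: move from first_frame to last_frame (hypothetical structure)."""
    first_frame: str
    last_frame: str
    prompt: str = ""  # empty string stands for 'un-prompted'


def build_jobs(frames, prompt=""):
    """Pair each image with the next one in the sequence."""
    return [TransitionJob(a, b, prompt) for a, b in zip(frames, frames[1:])]


# Hypothetical file names for the ten inkblot plates.
plates = [f"plate_{n:02d}.png" for n in range(1, 11)]
jobs = build_jobs(plates)

for job in jobs:
    print(f"{job.first_frame} -> {job.last_frame} (prompt={job.prompt!r})")
```

Ten plates give nine transitions. Each job would then be handed to whichever first/last-frame video model is in use, a step deliberately left out here.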

In the human test each image is presented to the subject and they are asked to describe what they see.

From Wikipedia:

The general goal of the test is to provide data about cognition and personality variables such as motivations, response tendencies, cognitive operations, affectivity, and personal/interpersonal perceptions. The underlying assumption is that an individual will class external stimuli based on person-specific perceptual sets, and including needs, base motives, conflicts, and that this clustering process is representative of the process used in real-life situations.

What the subject perceives when looking at an inkblot is supposed to reveal aspects of themselves which may not be revealed by direct inquisition.

By presenting the inkblots to an AI model, the inner imaginings of the system are revealed. And we, as observers, are left to interpret the images created by the model’s mind. If the target images have no obvious representational form (e.g. ink blots), the model must first convert the blot into the realm of the real (or at least the model’s version of reality) and then calculate how that image should move. Both the model producing the image and we observing it are involved in a similar kind of sense-making – what the hell is that thing?

What gets created in these intervening frames can reveal the biases of the model. For example, the huge proportion of social media video in the training data means that the model loves to produce human forms.

Just as Large Language Models derive their ‘intelligence’ from hoovering up millions of documents, these models derive their understanding of how things move from watching an awful lot of ‘content’. By performing a form of ‘psychological test’ on the model we get a glimpse of how it sees the world, and perhaps it reveals some of the horrors it’s seen before…