

Found in Translation – American Psycho

As I write, it is early 2024, and it feels like “AI” has very much gone overground. For a lot of people, this means playing with generative systems that produce images (and rudimentary videos) from text prompts, such as DALL·E 3 or Midjourney, or ‘chatting’ with Large Language Models (LLMs) like ChatGPT or Google Gemini.

The introduction of everyday human language as an interface to computer systems has been revolutionary. No longer are our interactions with technology constrained by rigidly designed interfaces or by writing computer programs, with all the associated limitations and complexities. Now we can just talk to them. This brings obvious affordances in terms of accessibility and simplicity, but it also brings a great deal of potential confusion.

Language is fuzzy. Language is metaphorical. Language is part of a much larger sphere of understanding, of culture, of what it is to be human. The languages we use as humans are very different to the language encoded in contemporary AI models.

For the entire history of human culture, the only things we have been able to converse with have been other humans. As a result, we inevitably treat ‘things that talk’ as being like us – we have no other experience of non-human language users, so when chatting with a language model it is easy to be seduced into thinking we are conversing with a peer.

But LLMs are not like us: they are statistical simulacra of our written selves, created via the ingestion of billions of documents.

One way to make them more-like-us is to introduce more sensory modalities. As well as reading all the books, they can also look at all the pictures! This has obvious potential advantages, since we are a primarily visual species. But it also introduces another layer of ambiguity and its associated ‘hallucinations’ – the mismatch between our understanding of the world and the eager-to-please pronouncements of our new AI friends.

As a way to investigate this, I arranged a game of Chinese whispers between AI systems. I first extracted the keyframes from a scene in American Psycho and asked a multimodal LLM (LLaVA) to describe what it saw. I then took these descriptions and used them as prompts for a Stable Diffusion image generator. Finally, I passed the resulting images to Stable Video Diffusion to turn the stills into motion.
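For anyone who wants to try something similar, here is a minimal sketch of such a pipeline in Python, assuming the Hugging Face transformers and diffusers libraries. The specific checkpoints and prompts are illustrative assumptions rather than a record of exactly what I used, and memory management is left out for brevity.

```python
# A minimal sketch of the keyframe -> caption -> image -> video pipeline.
# Assumes Hugging Face `transformers` and `diffusers`; checkpoint IDs and
# the captioning prompt are illustrative choices, not the originals.
import subprocess
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
from diffusers import StableDiffusionPipeline, StableVideoDiffusionPipeline
from diffusers.utils import export_to_video

device = "cuda"

# 1. Pull the keyframes (I-frames) out of the source scene with ffmpeg.
Path("keyframes").mkdir(exist_ok=True)
subprocess.run([
    "ffmpeg", "-i", "scene.mp4",
    "-vf", "select='eq(pict_type,I)'", "-vsync", "vfr",
    "keyframes/%04d.png",
], check=True)

# 2. Ask a multimodal LLM (LLaVA) to describe each keyframe.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
llava = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
).to(device)

def describe(image: Image.Image) -> str:
    prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
    inputs = processor(images=image, text=prompt,
                       return_tensors="pt").to(device, torch.float16)
    out = llava.generate(**inputs, max_new_tokens=120)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

captions = [describe(Image.open(p).convert("RGB"))
            for p in sorted(Path("keyframes").glob("*.png"))]

# 3. Feed each description back in as a Stable Diffusion prompt.
#    (On limited VRAM you would load and unload one model at a time.)
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
stills = [sd(caption).images[0] for caption in captions]

# 4. Hand each still to Stable Video Diffusion to put it in motion.
svd = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to(device)
for i, still in enumerate(stills):
    frames = svd(still.resize((1024, 576)), decode_chunk_size=8).frames[0]
    export_to_video(frames, f"clip_{i:04d}.mp4", fps=7)
```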

Each step involves a wander into the conceptual space of the respective model, as determined by its own learned representation of language.

Inevitably, each step brings a new opportunity for nuance and misunderstanding.

Below is a selection of text interpretations and their resulting generated images.