A designer sketches a rough shape on a tablet while saying "make this more organic, like coral growing." The AI interprets both the visual input and the verbal description, generating a suggestion that reflects the spatial intent of the sketch and the conceptual direction of the speech. This multimodal interaction — where hand, voice, and machine collaborate simultaneously — represents a fundamentally different creative paradigm from typing prompts into a text box. And the evidence suggests it produces fundamentally different creative outcomes.
The Problem with Text-Only Prompting
Shi, Upadhyay, Quek, and Choo (2025), researchers at the Singapore University of Technology and Design and at Carnegie Mellon University, ground their work in a formative study of how designers actually use generative AI during ideation. Their finding is blunt: text-based prompting disrupts creative flow. Designers who are sketching, working in a spatial, embodied, improvisational mode, must stop, switch to a verbal-analytical mode to compose a text prompt, wait for a response, evaluate it against their spatial intention, and then switch back to sketching. Each mode switch incurs a cognitive cost and interrupts the continuous stream of creative thought.
Their solution, TalkSketch, is an embedded multimodal AI sketching system that integrates freehand drawing with real-time speech input. Designers draw while talking, and the system captures verbal descriptions during the sketching process to generate context-aware AI suggestions. The speech input provides conceptual direction ("more angular," "like a bird in flight," "the texture should feel rough") while the sketch provides spatial constraints. The AI responds to both channels simultaneously, eliminating the mode-switching penalty that text-based prompting imposes.
The design philosophy is that AI should engage the creative process itself — the continuous, embodied, exploratory activity of making — rather than operating as a separate service that delivers finished outputs on request. When the AI is embedded in the sketching experience rather than adjacent to it, the creative loop tightens from minutes to seconds, and the designer maintains the flow state that produces the most exploratory work.
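To make that pattern concrete, the sketch below shows one way such an embedded loop could be structured: drawing never blocks, and each finished utterance, rather than a submit button, triggers a suggestion request that pairs the current canvas with the speech that accompanied it. This is a minimal Python illustration, not TalkSketch's published implementation; `render_canvas`, `generate_suggestion`, and the ten-second speech window are all assumptions.

```python
import time
from dataclasses import dataclass, field

# Hypothetical stand-ins for the real subsystems; names are illustrative.
def render_canvas(strokes):
    """Rasterize the accumulated strokes into an image payload (stubbed)."""
    return {"stroke_count": len(strokes)}

def generate_suggestion(request):
    """Call a multimodal generative model (stubbed)."""
    return f"suggestion from {request['image']['stroke_count']} strokes + '{request['text']}'"

@dataclass
class SketchSession:
    strokes: list = field(default_factory=list)      # spatial channel
    transcript: list = field(default_factory=list)   # (timestamp, text) speech segments

    def add_stroke(self, points):
        # Drawing never blocks on the AI: strokes simply accumulate.
        self.strokes.append(points)

    def on_speech_segment(self, text, window_s=10.0):
        # A finished utterance triggers generation. The request fuses the
        # current canvas with speech from the last window_s seconds, so
        # concepts stay grounded in the strokes they accompanied.
        self.transcript.append((time.time(), text))
        now = time.time()
        recent = " ".join(t for ts, t in self.transcript if now - ts <= window_s)
        request = {"image": render_canvas(self.strokes), "text": recent}
        return generate_suggestion(request)

session = SketchSession()
session.add_stroke([(0, 0), (40, 25), (80, 10)])   # the designer keeps drawing
print(session.on_speech_segment("more organic, like coral growing"))
```

The key property is that the request is assembled entirely from state the designer was already producing; no step in the loop asks them to stop and compose anything.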
Embodying Prompts as Instruments
The concept of AI-Instruments — systems that embody AI prompts as physical or gestural instruments rather than typed commands — extends this logic beyond sketching. When a prompt becomes a physical action (a gesture, a movement of an object, a manipulation of a tangible interface), the interaction draws on spatial cognition, proprioception, and motor memory rather than purely linguistic processing.
This matters because creative thinking is not exclusively verbal. Architects think spatially. Musicians think temporally and harmonically. Choreographers think kinesthetically. When AI creative tools force all input through a single linguistic bottleneck, the text prompt, they exclude the non-verbal cognitive modalities that many creative domains rely on. Interfaces that accept spatial, gestural, and embodied input open AI collaboration to those ways of thinking.
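One way to picture an AI-Instrument is as a continuous mapping from physical input to prompt parameters, so that the prompt is played rather than typed. The snippet below is a hypothetical illustration of that idea, not code from a published system; the tilt and speed inputs, the normalization constants, and the weighted terms are all assumptions.

```python
def instrument_to_prompt(tilt_deg, speed):
    """Map continuous physical input to continuous prompt weights.
    The mappings and constants are illustrative assumptions, not
    taken from a published AI-Instruments implementation."""
    angularity = max(0.0, min(1.0, tilt_deg / 90.0))  # tilt: organic <-> angular
    energy = max(0.0, min(1.0, speed / 2.0))          # motion speed: calm <-> energetic
    return {
        "organic": round(1.0 - angularity, 2),
        "angular": round(angularity, 2),
        "energetic": round(energy, 2),
    }

# Tilting the object a third of the way while moving it briskly:
print(instrument_to_prompt(tilt_deg=30, speed=1.2))
# {'organic': 0.67, 'angular': 0.33, 'energetic': 0.6}
```

Because the mapping is continuous, small adjustments of the hand produce graded changes in the output, exactly the kind of motor-memory control that a typed prompt cannot offer.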
Public Co-Creation and Embodied Engagement
Nimi, Lu, and Martinez Quintero (2025) provide evidence from a different context: a public interactive art installation that uses real-time generative AI to co-create Ukiyo-e-style artwork through embodied inputs. Participants influence AI-generated visuals through physical presence, object manipulation, body poses, and gestures — no text prompting required.
Their observational analysis across 13 sessions reveals several patterns relevant to embodied creativity. Tangible objects with salient visual features (color, pattern) emerged as the primary and most intuitive input method, suggesting that physical manipulation of meaningful objects is a more natural creative interface than abstract linguistic commands. Bodily poses and hand gestures served as compositional modifiers — adjusting spatial relationships and emotional tone. The system's immediate feedback loop enabled rapid learning: participants understood how to influence the output within seconds, without instruction.
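The paper reports which embodied channels mattered (objects first, then poses and gestures as modifiers) rather than a specific pipeline, but that division of labor can be sketched as a per-frame mapping from detections to generation parameters. Everything below, from the detection format to the parameter names, is an assumed illustration.

```python
def frame_to_params(detections):
    """Translate one camera frame's detections into generation parameters.
    The detection format and every mapping here are assumptions made for
    illustration; the study reports which inputs mattered, not this pipeline."""
    params = {}
    for obj in detections.get("objects", []):
        # Salient object features (color, pattern) were the primary input,
        # so they drive the dominant visual property: the palette.
        params["palette"] = obj["dominant_color"]
    if "pose" in detections:
        # Poses acted as compositional modifiers rather than subjects.
        params["composition"] = "open" if detections["pose"] == "arms_raised" else "centered"
    if "wave" in detections.get("gestures", []):
        # Gestures nudge emotional tone.
        params["mood"] = "lively"
    return params

# One frame: a vermilion patterned object held up, arms raised, a wave.
print(frame_to_params({
    "objects": [{"dominant_color": "vermilion"}],
    "pose": "arms_raised",
    "gestures": ["wave"],
}))
# {'palette': 'vermilion', 'composition': 'open', 'mood': 'lively'}
```

Re-running such a mapping on every camera frame is what makes the feedback loop immediate: participants see the consequence of a gesture within seconds and calibrate without instruction.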
The most significant finding concerns cognitive barriers. Embodied interaction dramatically lowers the threshold for creative engagement with AI. Participants who would struggle to compose an effective text prompt — children, elderly visitors, people unfamiliar with AI — interacted fluently through physical gesture. The interface democratized access to AI-assisted creative expression by removing the linguistic skill requirement that text prompting imposes.
From Text to Touch
The convergence of these studies suggests that the text prompt — currently the dominant interface for generative AI — is an artifact of AI's origins in natural language processing, not a reflection of how creative work actually happens. Creative thinking is multimodal, embodied, and continuous. It involves sketching, speaking, gesturing, manipulating materials, and moving through space, often simultaneously. An AI creative tool that accepts input only through typed text is like a musical instrument that can only be played by writing sheet music — it works, but it misses the embodied intelligence that produces the most interesting results.
The design direction is clear: generative AI tools for creative work should support the full range of human expressive modalities — drawing, speaking, gesturing, manipulating — and they should do so in real time, embedded in the creative process rather than separated from it. The question is not how to write better prompts but how to move beyond prompting entirely.