Note: Iris voice mode was launched last year, and the demo above was filmed on June 24, 2024. This article is a retrospective about how we designed it.
Your co-worker is looking at your computer over your shoulder as you ask them for help. "Hey, you see this data? I made this chart from it, do you think there’s a better way to visualize it?" Relatively quickly, you can communicate the problem at hand by pointing out the relevant details on your screen while you talk. When designing voice mode for Iris, we wanted to create a digital analogue for this kind of high-bandwidth communication.
Just like you would with a human co-worker, when you speak to Iris you can "gesture" to certain parts of your screen with the capture tool (quickly accessed with the ⌥S hotkey). This lets you direct Iris' attention in a way most chat-based LLM apps don't.
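For the curious, here’s a rough sketch of what that capture hotkey could look like in an Electron-style shell. This is not our actual implementation: the real tool lets you drag out a region, while this stand-in grabs the whole primary display, and `attachImageToPendingMessage` is a hypothetical hook into the message you’re dictating.

```typescript
// Sketch only, assuming an Electron-style desktop shell.
import { app, globalShortcut, desktopCapturer, screen } from "electron";

function attachImageToPendingMessage(png: Buffer): void {
  // Hypothetical: queue the capture as an image block on the in-progress
  // voice-mode message so the model sees exactly what was pointed at.
  console.log(`captured ${png.length} bytes for the pending message`);
}

async function captureScreen(): Promise<Buffer> {
  // Stand-in for the real region-selection overlay: grab the primary display.
  const { width, height } = screen.getPrimaryDisplay().size;
  const [source] = await desktopCapturer.getSources({
    types: ["screen"],
    thumbnailSize: { width, height },
  });
  return source.thumbnail.toPNG();
}

app.whenReady().then(() => {
  // ⌥S: "gesture" at the screen mid-sentence without breaking your flow.
  globalShortcut.register("Alt+S", async () => {
    attachImageToPendingMessage(await captureScreen());
  });
});

app.on("will-quit", () => globalShortcut.unregisterAll());
```

The point is that the capture lands on the in-progress voice message, so the screenshot and the words arrive together.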
Visual confirmation that Iris acknowledges your screenshots in voice mode.
The best implementation of voice mode on desktop I’ve seen so far is an unreleased feature in ChatGPT that lets you “share your screen” with it. But even then, there's no way to make it pay attention to a specific portion of the screen except with extra words. "Hey, look at the data in the second to fourth column of the table. I made the chart at the top right of the screen based on it…” The wordiness makes the interaction feel cumbersome.
We drew a lot of inspiration from real-world interaction with people, but there are cases where we deliberately diverged from that approach. You might be surprised that unlike a human co-worker (and unlike ChatGPT), Iris doesn’t speak back to you. LLMs can output large quantities of text very quickly, often generating responses faster than a person could verbally deliver them. A text response ends up being quicker to read, easier to scan, and less intrusive than speech. So we didn't give Iris a voice at all.
We also intentionally don’t allow Iris to interject as you’re talking. LLMs are not good enough at judging when to jump into the conversation (yet). When speaking with ChatGPT, I often pause to collect my thoughts, but it takes the pause as a cue to start talking, and I lose my train of thought. With Iris, we went back to the Stone Age and added a simple toggle to start and stop voice input. You get to talk as fast or as slow as you want and think as long as you need, without any distractions.
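Sketched out, the toggle is almost embarrassingly simple. The types below are placeholders rather than our real code, but they capture the behavior: nothing gets sent until you explicitly toggle off.

```typescript
// Minimal sketch of a manual start/stop voice toggle (no voice activity
// detection). Pauses to think never trigger a response; only the explicit
// toggle-off sends the accumulated transcript.
type SendMessage = (text: string) => void;

class VoiceToggle {
  private recording = false;
  private transcript: string[] = [];

  constructor(private send: SendMessage) {}

  // Called by the speech-to-text engine as partial transcripts arrive.
  onTranscript(chunk: string): void {
    if (this.recording) this.transcript.push(chunk);
  }

  // One hotkey, one button: flip between "listening" and "send it".
  toggle(): void {
    this.recording = !this.recording;
    if (!this.recording && this.transcript.length > 0) {
      this.send(this.transcript.join(" ").trim());
      this.transcript = [];
    }
  }
}

// Usage: long pauses do nothing; only the second toggle() sends the message.
const voice = new VoiceToggle((text) => console.log("to Iris:", text));
voice.toggle();                                      // start listening
voice.onTranscript("Hey, look at this chart");
voice.onTranscript("...actually, the table next to it");
voice.toggle();                                      // stop and send
```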
Notably, we do give you the power to interrupt Iris as it’s streaming a response: you can send a new message at any point to cut it off. That’s much more agency and control than you’d have with a human co-worker.
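Conceptually, the interruption model is just "last message wins": anything still streaming gets aborted the moment a new message goes out. A hedged sketch, assuming a hypothetical streaming /respond endpoint:

```typescript
// Sketch of the interruption model: sending a new message aborts whatever
// is still streaming and starts a fresh response. The /respond endpoint is
// hypothetical.
let inFlight: AbortController | null = null;

async function sendToIris(text: string, onToken: (t: string) => void) {
  // Cut off the previous response, if any, before starting a new one.
  inFlight?.abort();
  inFlight = new AbortController();

  const res = await fetch("/respond", {
    method: "POST",
    body: JSON.stringify({ text }),
    signal: inFlight.signal,
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      onToken(decoder.decode(value, { stream: true }));
    }
  } catch (err) {
    // An AbortError here just means the user moved on mid-stream.
    if ((err as Error).name !== "AbortError") throw err;
  }
}
```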
There’s a lot of untapped value in drawing inspiration from human communication in chat-based LLM apps—our “point and talk” design is just one example, but there’s much more to explore. What would it look like to collaborate with Iris the same way you would with a teammate on Zoom? What if you could doodle or highlight parts of your screen while talking? What if Iris could doodle back? Could Iris request temporary access to your mouse or keyboard, like a co-worker during a screen share? Like with the existing implementation of Iris’ voice mode, a successful design likely involves some hybrid mix of replicating human-to-human communication as well as inventing new patterns unique to interacting with LLMs.