/case-study

what a voice interview with an agent should feel like

a guided screensharing voice interview where the agent sees what you're doing and asks about it, and the trust mechanics that had to come with it.

> stat ./case-study

  • year: 2026
  • role: Design Engineer
  • team: 1 design engineer + 2 product engineers
  • duration: ~1.5 weeks · prototype → production
  • stack: [claude-code][voice-ui][webgl][three.js][realtime][trust]

End-to-end session: onboarding, screensharing, agent's first question, plasma strip mid-turn.

Before: a voice box that couldn't see the work

Duvo's process interview asks people to walk through how they actually do their work (open the CRM, narrate the steps, talk through the edge cases) so an AI agent can capture it. The first version was a centred chat box with a microphone. It had no way to see the screen, so when someone said "then I check the order history", the narration skipped three clicks, two filters, and the reason they skip the third tab.

The previous voice-only interview: a centred chat box and a microphone, with no way to see the screen.

Trust was the second problem. Support feedback wasn't about features; it was about how the thing felt:

  • The agent jumped in mid-sentence.
  • It hallucinated when the user paused to think.
  • There was no real pause, only mute.
  • People weren't sure why they were narrating in the first place: the onboarding didn't carry the purpose.

Not a redesign. A category shift: a new primitive (screensharing) and the new model, interaction approach, and trust mechanics around it.

A call, not a chatbot

People don't have a mental model for "AI watches my screen and interviews me." They do for a video call. So the surface borrows the conventions of one: live preview, control bar with mute and pause, transcript on the side, a soft cue when the other side is speaking. Familiarity is a trust mechanic.

Why not Figma

A static frame of a screensharing voice interview is a still photo of a conversation. You can't tell from it whether the conversation is good. Code wasn't a preference. It was the only way to find out if the design worked.

The pivot moment

The reframe didn't come from a sketch. One of the Product Engineers wired up a rough prototype with the realtime model and a call-shaped layout, and tested it with a few people in the office. It was half-styled and broken in places, but the moment people used it the dynamic was completely different. They leaned in. They paused. They acted like they were on a call.

Prototype to production: about a week and a half.

Onboarding: earning permission to start

Without framing, a screenshare prompt and a mic request read as "you are being recorded and assessed." The onboarding modal exists for one reason: name what's about to happen, scope it, make it consensual. By the time you click Start, the strangeness has been said out loud.

The smallest possible shape that does that job. Not visual richness. Calm.

The onboarding modal: naming what's about to happen before the screenshare prompt fires.
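A minimal sketch of that gating, assuming the capture prompts are only ever triggered from the modal's Start action (the element id and the handoff into the realtime pipeline are hypothetical):

```typescript
// Nothing is requested on page load. The screenshare and mic prompts fire
// only after the onboarding modal has named what's about to happen and the
// user has clicked Start.
async function startInterviewCapture(): Promise<{ screen: MediaStream; mic: MediaStream }> {
  // Screenshare prompt first: the thing the modal just explained.
  const screen = await navigator.mediaDevices.getDisplayMedia({ video: true });
  // Microphone next; declining either leaves the session unstarted.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  return { screen, mic };
}

// Hypothetical wiring: the modal's Start button is the only entry point.
document.querySelector<HTMLButtonElement>('#start-interview')?.addEventListener('click', async () => {
  try {
    const { screen, mic } = await startInterviewCapture();
    // hand the streams to the realtime pipeline from here
  } catch {
    // the user declined a prompt: return to the modal instead of failing silently
  }
});
```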

The voice-reactive strip

A thin band between the video preview and the control bar reacts to the agent's voice. When it speaks or thinks, it blooms; otherwise it's almost invisible. Not a core feature. It solves one specific problem: when the agent is processing, silence reads as a freeze, and freezes are where trust breaks. The strip says something is alive.

The plasma strip across one full agent turn: idle, speaking, processing, idle.

To get there, I built a tuner: a playground page wired to fake audio input with knobs for every parameter that mattered (speed, density, glow, colour, response curve, the lot). Turning the knobs while watching the strip react to volume is how plasma won. Organic without being decorative, dialled in by ear and eye, not by guessing in code. Lightweight WebGL shader with a Three.js helper for canvas wiring, so it survives at 60fps next to the real video stream.

The tuner: knobs for speed, density, glow, colour, and response curve, dialled live against the strip.
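As a rough sketch of the shape of that setup (not the shipped shader; the uniform names, the plasma math, and the smoothing constant are illustrative), the strip is a Three.js quad whose fragment shader takes the tuner knobs as uniforms plus a voice level read from a WebAudio analyser:

```typescript
import * as THREE from 'three';

// Tuner knobs and the voice level live in uniforms so they can be changed live.
const uniforms = {
  uTime:    { value: 0 },
  uLevel:   { value: 0 },                          // smoothed voice level, 0..1
  uSpeed:   { value: 0.6 },                        // tuner knob
  uDensity: { value: 3.0 },                        // tuner knob
  uGlow:    { value: 1.2 },                        // tuner knob
  uColor:   { value: new THREE.Color('#7aa2ff') }, // tuner knob
};

const material = new THREE.ShaderMaterial({
  uniforms,
  transparent: true,
  vertexShader: /* glsl */ `
    varying vec2 vUv;
    void main() { vUv = uv; gl_Position = vec4(position, 1.0); }
  `,
  fragmentShader: /* glsl */ `
    varying vec2 vUv;
    uniform float uTime, uLevel, uSpeed, uDensity, uGlow;
    uniform vec3 uColor;
    void main() {
      // cheap plasma: layered sines, brightness scaled by the voice level
      float t = uTime * uSpeed;
      float p = sin(vUv.x * uDensity + t) + sin((vUv.x + vUv.y) * uDensity * 1.7 - t);
      float a = uLevel * uGlow * (0.5 + 0.25 * p);
      gl_FragColor = vec4(uColor, clamp(a, 0.0, 1.0));
    }
  `,
});

// The strip is a full-clip-space quad rendered into a thin canvas.
const scene = new THREE.Scene();
scene.add(new THREE.Mesh(new THREE.PlaneGeometry(2, 2), material));
const camera = new THREE.OrthographicCamera(-1, 1, 1, -1, 0, 1);
const renderer = new THREE.WebGLRenderer({ alpha: true });

// Voice level from a WebAudio analyser on the agent's (or the tuner's fake) audio stream.
function makeLevelReader(stream: MediaStream) {
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  ctx.createMediaStreamSource(stream).connect(analyser);
  const bins = new Uint8Array(analyser.frequencyBinCount);
  return () => {
    analyser.getByteFrequencyData(bins);
    const avg = bins.reduce((sum, v) => sum + v, 0) / bins.length / 255;
    // smooth so pauses fade out instead of cutting to black
    uniforms.uLevel.value += (avg - uniforms.uLevel.value) * 0.15;
  };
}

function animate(readLevel: () => void) {
  uniforms.uTime.value = performance.now() / 1000;
  readLevel();
  renderer.render(scene, camera);
  requestAnimationFrame(() => animate(readLevel));
}
```

In this shape, the tuner page can be little more than sliders bound to those uniforms, which is what makes dialling in by ear and eye cheap.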

Pause, properly

In the previous version there was no pause, only mute, and after a long silence the agent would ask "are you there?", exactly when people needed silence to think.

In this one, pause is a coordinated stop across the whole session: screen capture, audio recording, voice client, and turn loop all transition together and resume together. The design answers the original complaint directly instead of working around it.
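A sketch of that coordination, with hypothetical subsystem handles standing in for the real ones:

```typescript
// Anything that can be paused exposes the same two transitions.
interface Pausable {
  pause(): Promise<void>;
  resume(): Promise<void>;
}

class InterviewSession {
  constructor(
    private screenCapture: Pausable, // screen recording
    private audioRecorder: Pausable, // user mic recording
    private voiceClient: Pausable,   // realtime voice connection
    private turnLoop: Pausable,      // agent turn-taking loop
  ) {}

  private get parts(): Pausable[] {
    return [this.turnLoop, this.voiceClient, this.audioRecorder, this.screenCapture];
  }

  // One state transition for the whole session: nothing keeps listening,
  // recording, or prompting while the user thinks.
  async pause(): Promise<void> {
    await Promise.all(this.parts.map((p) => p.pause()));
  }

  async resume(): Promise<void> {
    await Promise.all(this.parts.map((p) => p.resume()));
  }
}
```

The detail that matters is the invariant, not the class: there is never a state where the mic is still hot while the rest of the session thinks it's paused.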

What shipped, and what people did with it

Live and shipped. Initial testing has been clearly positive: people use it intuitively. They pause when they think (not mute themselves), glance at the strip to know whether the agent is processing, and don't ask "what is this thing." That, more than any metric, is the outcome I cared about: the interface stopped being the subject of the conversation and became the medium for it.

What I'd change

Not the design. Shorter feedback loops: more cycles to fine-tune copy, interaction beats, and the end-to-end flow against the bar we wanted. Most of the polish I'd do differently is at that level, not the level of which colour or which component.

How we worked

Clear split, shared table. The two Product Engineers built the backbone: the realtime agent, message streaming, transcript, and the screensharing pipeline that lets the model see the work. My job was the experience on top of it: turning the user frustrations above into the call-shaped surface, the onboarding that earns permission to start, the coordinated pause, the voice-reactive strip, and the timing and copy across the flow.

Not a solo design hand-off either. From first prototype to last touch in production we were pairing on ideation, debating model behaviour, watching each other's screens while we tuned timings. The category shift only worked because the design and the realtime stack were being shaped at the same table, on the same days. Credit for what shipped is shared.

Built with Claude Code

I built this with Claude Code as my pair. The call-shaped layout, the onboarding modal, the coordinated pause across capture/audio/voice client/turn loop, the plasma strip and its tuner page. All of it was authored in code with Claude Code reading the realtime stack the engineers had wired up, drafting components, and turning ideas into something I could feel in the browser within minutes. My job was deciding what the experience should be. Claude Code's job was getting it on screen fast enough that I could keep deciding.

What's next

Live is the start, not the end:

  • More user testing with real customers in real workflows.
  • Deeper Amplitude instrumentation alongside the next iteration: completion and screensharing-specific behaviours (pause, mute, early end, agent stalls) feeding the next round of decisions. A sketch of what that could look like follows below.
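Illustrative only, since the taxonomy isn't decided yet: with the Amplitude browser SDK, the screensharing-specific behaviours could land as events like these (event and property names are hypothetical):

```typescript
import * as amplitude from '@amplitude/analytics-browser';

amplitude.init('AMPLITUDE_API_KEY');

// Hypothetical events for the behaviours called out above.
export const track = {
  pause:      () => amplitude.track('interview_paused'),
  mute:       () => amplitude.track('interview_muted'),
  earlyEnd:   (secondsIn: number) => amplitude.track('interview_ended_early', { secondsIn }),
  agentStall: (stallMs: number) => amplitude.track('agent_stall', { stallMs }),
};
```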

What this changed about how I work

It's not about the tool. It's about the process, the thinking, and the iterating. I didn't open Figma once. I shipped a real screensharing voice interview, a coordinated pause, an onboarding flow that earned permission to start it, and a small visual touch that kept the surface feeling alive. The medium followed the question.
