Excellent teardown! As always, the devil is in the little details that make an interface feel simple but still sing.
I've been excited about voice-only interfaces like a phone call, but I think they're limiting because they make it hard to get visual context (missing all the nuance you discuss here). One pattern I've noticed is the voice-in, text-or-other-back interface when talking with voice agents (especially openclaw). In this modality, the user records open-ended voice memos, as long as you want, and the interface responds back with text or maybe voice.
Have you experimented with voice agents or hybrid voice/text agents at Yelp?
Hey, yes, I agree! Voice is powerful for getting context in (people talk roughly 3x faster than they type), but visual and written formats are often better for outputs.
The challenge I've seen with voice, from experiments we've run and from talking to others, is getting awareness and adoption. While consumers use voice to talk to in-home devices or in the car, most are still not using it for interactions in traditional apps. Sometimes people simply aren't aware of it; other times they're around other people while on the go and don't want to speak out loud. I know this is changing with developer workflows, so it could evolve fast, but I haven't seen it yet in mainstream consumer app use.
I think it will take some time and will require very prominently nudging users toward voice when it's most valuable. As an example, one of our more successful Yelp Assistant updates is letting users upload a photo. While the camera icon is always there in the input field, we don't see a lot of user uploads from it. Most photo uploads come after we trigger a question, "Do you want to add photos?", with a quick-reply "Add photo" button. This question is limited to certain categories where we know photos are helpful.
I'm definitely excited about the future being flexible so it can match modality to user needs and preferences.
Compelling article. Maybe the future of UI is user-bespoke GenUI.
Thanks! I'm excited for the future where experiences really feel customized to each user's needs.