When we decided to add a voice assistant to our Drupal website, we expected challenges. Natural language flow, AI responses, context management - all tricky, sure. But the one thing we assumed would just work? Turning spoken audio into text. Click mic, talk, get text. Easy, right?
It wasn’t.
The illusion of simplicity
Modern browsers let you capture audio using JavaScript, typically through getUserMedia and the MediaRecorder API. That sounds straightforward - and technically it is. But the format the browser produces (usually a WebM container with Opus-encoded audio) is not the plain, uncompressed WAV that speech-to-text APIs like Google’s handle most reliably.
To make things even more fun, the browser doesn’t give you proper WAV files, even if you ask for them. Naming the file ".wav" or labelling the recording as WAV changes nothing about the bytes inside: you end up with a file that says ".wav" but is secretly still WebM. Feed that to a transcription service and it either rejects the request or returns gibberish.
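One way to catch the disguised-WebM problem early is to look at the first few bytes of the upload instead of trusting its name. The sketch below is an illustration, not the code from our module: it is written in Python to keep it short (on a Drupal site the same check would live in PHP), and the function names sniff_container and make_tiny_wav are made up for this example. It relies only on well-known file signatures: WebM/Matroska files start with the EBML magic number, and real WAV files start with "RIFF" followed by "WAVE" at byte 8.

```python
import io
import wave


def sniff_container(data: bytes) -> str:
    """Guess the audio container from its leading bytes, ignoring the filename."""
    # WebM is a Matroska profile; both start with the EBML magic number.
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "matroska/webm"
    # A real WAV file is a RIFF container whose form type is "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # Ogg streams (another common home for Opus audio) start with "OggS".
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"


def make_tiny_wav() -> bytes:
    """Build a short, genuine WAV in memory to compare against."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 16-bit samples
        wav_file.setframerate(16000)   # 16 kHz
        wav_file.writeframes(b"\x00\x00" * 160)  # 10 ms of silence
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in for a browser upload that was merely renamed to ".wav".
    disguised_upload = b"\x1a\x45\xdf\xa3" + b"\x00" * 32
    print(sniff_container(disguised_upload))
    print(sniff_container(make_tiny_wav()))
```

A check like this gives you a clear error message on the server ("this is WebM, not WAV") instead of a confusing failure further down the line.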
Finding our footing
After some serious rabbit-holing, we realised the only reliable way was to accept the browser’s native recording format as it is, send it to the server, and convert it to a proper WAV file using FFmpeg - an old-school but mighty tool that developers either fear or adore. Once we had a clean, LINEAR16-encoded WAV file (uncompressed 16-bit PCM) in mono at 16 kHz, Google’s speech API finally stopped frowning at us.
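For the curious, the conversion step can be sketched roughly like this. Treat it as a minimal illustration under assumptions rather than our production code: it is in Python for brevity, the function names are invented for this post, and it assumes an ffmpeg binary is available on the server’s PATH. The FFmpeg options themselves are standard ones: -ac 1 for mono, -ar 16000 for 16 kHz, and -c:a pcm_s16le for 16-bit little-endian PCM, which is what LINEAR16 means.

```python
import subprocess
import wave
from typing import List


def build_ffmpeg_command(source_path: str, target_path: str) -> List[str]:
    """Return the ffmpeg arguments for a mono, 16 kHz, 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite the target if it already exists
        "-i", source_path,     # input: whatever the browser recorded
        "-vn",                 # ignore any video stream
        "-ac", "1",            # one audio channel (mono)
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit signed little-endian PCM
        "-f", "wav",           # write a WAV container
        target_path,
    ]


def convert_to_linear16_wav(source_path: str, target_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(
        build_ffmpeg_command(source_path, target_path),
        check=True,
        capture_output=True,
    )


def is_linear16_mono_16k(wav_path: str) -> bool:
    """Check that a file really is the WAV we intend to send on."""
    try:
        with wave.open(wav_path, "rb") as wav_file:
            return (
                wav_file.getnchannels() == 1
                and wav_file.getsampwidth() == 2      # 2 bytes = 16 bits
                and wav_file.getframerate() == 16000
            )
    except (wave.Error, EOFError):
        # Not a readable PCM WAV at all (for example, WebM in disguise).
        return False
```

Passing the arguments as a list, not as one shell string, keeps user-influenced file names out of the shell. Running is_linear16_mono_16k on the result before calling the speech API is a cheap sanity check.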
Doing it the Drupal way
Of course, we weren’t just building this in isolation - we wanted it clean, testable and native to Drupal 10. So we used Drupal’s private tempstore to hold transcripts per user, made sure our file paths were secure, and injected our dependencies the right way. We even structured transcripts as conversations - keeping user and assistant turns side by side.
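To give a feel for the conversation structure, here is a small sketch of per-user transcripts stored as ordered turns. It is an assumption-laden illustration, not our module: it is in Python rather than PHP, the class names Turn, Conversation and PerUserTranscriptStore are hypothetical, and a plain in-memory dictionary stands in for Drupal’s private tempstore.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ALLOWED_ROLES = ("user", "assistant")


@dataclass
class Turn:
    role: str  # either "user" or "assistant"
    text: str


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

    def add_turn(self, role: str, text: str) -> None:
        """Append one turn, keeping user and assistant turns in spoken order."""
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {role!r}")
        self.turns.append(Turn(role=role, text=text))


class PerUserTranscriptStore:
    """In-memory stand-in for per-user temporary storage (hypothetical)."""

    def __init__(self) -> None:
        self._by_user: Dict[str, Conversation] = {}

    def conversation_for(self, user_id: str) -> Conversation:
        # Each user gets their own conversation; nothing is shared between users.
        return self._by_user.setdefault(user_id, Conversation())


if __name__ == "__main__":
    store = PerUserTranscriptStore()
    chat = store.conversation_for("42")
    chat.add_turn("user", "What are your opening hours?")
    chat.add_turn("assistant", "We are open from nine to five on weekdays.")
    for turn in chat.turns:
        print(f"{turn.role}: {turn.text}")
```

Keeping both sides of the exchange in one ordered list makes it easy to hand the recent history back to the assistant as context for the next reply.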
Would we do it again?
Yes. But only because now we know the terrain. The real lesson? The web isn’t quite ready for seamless voice interactions out of the box. And Google’s developer experience - with multiple API versions and sprawling documentation - doesn’t help.
But with the right tools (and the right AI co-pilot), it’s entirely doable. And frankly, kind of magical when it works.
Final thoughts
If you’re building something similar, don’t feel bad if it takes longer than you expected. You’re not alone. Just keep pushing forward - browser quirks, format traps and all.
We did it. You will too.