The reason commercial voice assistants stream audio to the cloud is not that local recognition is impossible — it is that the cloud is convenient and lucrative. For wake-word detection specifically (the "Hey Computer" that activates a longer recognition flow), local processing has been viable on microcontrollers since around 2020 and is comfortable in 2026.
This article walks through the build: an ESP32-S3 with a digital microphone, listening continuously for a wake word and lighting an LED when it hears one. About $20 of hardware. Privacy-respecting, no internet required.
What "wake word detection" actually means
The system listens to audio continuously, runs a small neural network on each short window (~1 second), and emits a confidence score. When the score exceeds a threshold, the system "wakes" and either triggers an action directly or kicks off a fuller speech recognition flow (which may or may not run locally).
The keyword spotting (KWS) model is purpose-built for this single decision. It does not transcribe speech; it does not understand meaning; it just decides "is the wake word in this 1-second window?". Because the task is narrow, the model can be tiny — under 100 KB.
Bill of materials
- Seeed XIAO ESP32-S3 Sense ($14) — ESP32-S3 with onboard MEMS microphone and microSD slot.
- Alternative: ESP32-S3 DevKit + INMP441 I2S microphone ($10 + $4) — more wiring but more flexibility.
- LED for visual confirmation ($0.50)
The pipeline
```mermaid
flowchart LR
    Mic[I2S microphone] -->|16 kHz sample stream| Buffer["Ring buffer<br>~1 second"]
    Buffer --> FFT["Compute MFCC<br>or spectrogram features"]
    FFT --> NN["KWS neural network<br>~50 KB int8"]
    NN --> Score{"Score > threshold?"}
    Score -->|yes| Wake[Wake action]
    Score -->|no| Buffer
```
The wake-word pipeline. The audio stream is continuously analysed; only a positive detection triggers downstream action.
The audio capture
The INMP441 (and the XIAO Sense's onboard mic) is an I2S MEMS microphone. The ESP32-S3's I2S peripheral handles the digital audio stream — you configure sample rate (16 kHz), word length (16-bit), and the I2S driver delivers samples through a DMA-backed ring buffer.
```cpp
#include <driver/i2s.h>

const i2s_port_t I2S_PORT = I2S_NUM_0;
const int SAMPLE_RATE = 16000;
const int FRAME_SIZE = 1024;  // ~64 ms at 16 kHz

void setupI2S() {
  i2s_config_t cfg = {
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
    .sample_rate = SAMPLE_RATE,
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
    .dma_buf_count = 4,
    .dma_buf_len = FRAME_SIZE,
  };
  i2s_driver_install(I2S_PORT, &cfg, 0, NULL);

  i2s_pin_config_t pins = {
    .bck_io_num = 4,    // bit clock
    .ws_io_num = 5,     // word select
    .data_in_num = 6,   // microphone data
    .data_out_num = I2S_PIN_NO_CHANGE,
  };
  i2s_set_pin(I2S_PORT, &pins);
}

int16_t audio[FRAME_SIZE];

void readFrame() {
  size_t bytes_read;
  i2s_read(I2S_PORT, audio, sizeof(audio), &bytes_read, portMAX_DELAY);
}
```

Feature extraction
Raw audio at 16 kHz is too dense to feed directly into a neural network. The standard pre-processing converts the audio into a spectrogram or MFCC (Mel-Frequency Cepstral Coefficients) representation, which captures the relevant information in far fewer numbers.
For each ~1 second of audio:
- Slice into ~30 ms frames with 10 ms overlap.
- Apply FFT to each frame.
- Compute power spectrum.
- Map frequencies onto the Mel scale (frequency bins log-spaced like human hearing).
- Take the log and (for MFCC) the discrete cosine transform.
The result is a 2D feature map (time × frequency) typically 49×40 or similar. This is what the neural network sees.
The TFLM example for keyword spotting includes the feature extraction code; you do not write it from scratch.
The model
A small CNN. The TensorFlow micro-speech example uses a 4-layer model with depthwise separable convolutions. Trained on the Google Speech Commands dataset (a public corpus of 30 keywords), it recognises a target word + "unknown" + "silence" with around 90% accuracy.
Model size: 18 KB int8. Inference time on ESP32-S3: ~30 ms per window. With 100 ms hop between windows, the chip runs at 30% inference duty cycle, leaving plenty of headroom.
Custom wake words
The Google Speech Commands dataset is fixed. To recognise a custom phrase like "Hey Builder" or "Computer", you have two paths:
Edge Impulse
Free for hobby projects. Record 100–500 samples of your wake word (multiple speakers, multiple environments). Edge Impulse trains a model and emits a TFLM-compatible deployment package. By far the easiest path; we recommend starting here.
Roll your own
Use TensorFlow's training scripts with your own data. Significantly more work; full control. Useful when Edge Impulse does not match your specific needs.
Either way, expect to record a lot of negative samples (random conversation, ambient noise, similar-sounding words) so the model learns to discriminate, not just to recognise.
The full firmware
Combining I2S, feature extraction, and inference is a multi-hundred-line project. We recommend starting with the official micro_speech example in the TensorFlow examples repository, then swapping in your own model and adapting the feature extraction as needed.
The skeleton:
```cpp
void loop() {
  readFrame();                            // 64 ms of audio
  appendToCircularBuffer(audio, FRAME_SIZE);

  if (millis() - last_inference > 100) {  // every 100 ms
    last_inference = millis();
    computeFeatures(buffer, features);    // ~10 ms
    runInference(features, output);       // ~30 ms

    if (output[WAKE_WORD_INDEX] > THRESHOLD) {
      digitalWrite(LED, HIGH);
      delay(2000);                        // visual debounce
      digitalWrite(LED, LOW);
    }
  }
}
```

Power considerations
Always-on listening is power-hungry. ESP32-S3 with I2S microphone running at 16 kHz draws roughly 30 mA continuous. A 2500 mAh battery lasts about 80 hours.
Two patterns help:
- Voice activity detection (VAD) gate. A simple threshold detector that gates the heavyweight model. The model only runs when actual sound is present; the chip mostly idles.
- Hardware accelerator. Some newer chips (ESP32-P4, certain Cortex-M55) have neural network accelerators that drop inference power dramatically. Not yet hobby-priced; coming.
For battery products, expect to sacrifice continuous listening for periodic listening (wake briefly every few seconds, listen for a moment, sleep again).
Privacy and the meaningful difference
The point of local wake-word detection is that audio never leaves the device unless the wake word fires. Even then, what gets sent is the action (lights on, timer set), not the audio. This is the meaningful privacy difference vs. cloud assistants.
For a fully cloud-free voice assistant, the wake word is the easy part. Speech-to-text after the wake word still typically uses a cloud service for full vocabulary; on-device alternatives (Whisper.cpp, Vosk) work but require significantly more compute than an ESP32 has.
Frequently Asked Questions
How well does it work in noisy environments?
Less well. The Google Speech Commands dataset is moderately noise-augmented; the resulting models work in normal homes but degrade in cafes, kitchens with running water, etc. For specific noisy environments, retrain on noise-augmented data from that environment.
Share your thoughts
Worked with this in production and have a story to share, or disagree with a tradeoff? Email us at support@mybytenest.com — we read everything.