TinyML

Voice Activation Without the Cloud

The reason commercial voice assistants stream audio to the cloud is not that local recognition is impossible — it is that the cloud is convenient and lucrative. For wake-word detection specifically (the "Hey Computer" that activates a longer recognition flow), local processing has been viable on microcontrollers since around 2020 and is comfortably within reach in 2026.

This article walks through the build: an ESP32-S3 with a digital microphone, listening continuously for a wake word, lighting an LED when it hears one. About $20 of hardware. Privacy-respecting, no internet required.

What "wake word detection" actually means

The system listens to audio continuously, runs a small neural network on each short window (~1 second), and emits a confidence score. When the score exceeds a threshold, the system "wakes" and either triggers an action directly or kicks off a fuller speech recognition flow (which may or may not run locally).

The keyword spotting (KWS) model is purpose-built for this single decision. It does not transcribe speech; it does not understand meaning; it just decides "is the wake word in this 1-second window?". Because the task is narrow, the model can be tiny — under 100 KB.
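
As a sketch of that decision, many implementations also average the confidence over a few consecutive windows before comparing against the threshold, so a single noisy window cannot wake the device. The smoothing and the score_for_window() helper below are illustrative assumptions, not part of any specific library:

// Illustrative: average the wake-word confidence over the last few windows.
// score_for_window() is a hypothetical helper that runs the KWS model on the
// current 1-second audio buffer and returns a confidence in 0.0 .. 1.0.
#define SCORE_HISTORY 3

float recent_scores[SCORE_HISTORY] = {0};
int score_slot = 0;

bool wakeWordDetected(float threshold) {
    recent_scores[score_slot] = score_for_window();
    score_slot = (score_slot + 1) % SCORE_HISTORY;

    float sum = 0;
    for (int i = 0; i < SCORE_HISTORY; i++) sum += recent_scores[i];
    return (sum / SCORE_HISTORY) > threshold;
}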

Bill of materials

  • Seeed XIAO ESP32-S3 Sense ($14) — ESP32-S3 with onboard MEMS microphone and microSD slot.
  • Alternative: ESP32-S3 DevKit + INMP441 I2S microphone ($10 + $4) — more wiring but more flexibility.
  • LED for visual confirmation ($0.50)

The pipeline

flowchart LR
    Mic[I2S microphone] -->|16 kHz audio sample stream| Buffer[Ring buffer, ~1 second]
    Buffer --> FFT[Compute MFCC or spectrogram features]
    FFT --> NN[KWS neural network, ~50 KB int8]
    NN --> Score{Score > threshold?}
    Score -->|yes| Wake[Wake action]
    Score -->|no| Buffer

The wake-word pipeline. The audio stream is continuously analysed; only a positive detection triggers downstream action.

The audio capture

The INMP441 (and the XIAO Sense's onboard mic) is an I2S MEMS microphone. The ESP32-S3's I2S peripheral handles the digital audio stream — you configure sample rate (16 kHz), word length (16-bit), and the I2S driver delivers samples through a DMA-backed ring buffer.

#include <driver/i2s.h>

const i2s_port_t I2S_PORT = I2S_NUM_0;
const int SAMPLE_RATE     = 16000;
const int FRAME_SIZE      = 1024;  // ~64 ms at 16 kHz

void setupI2S() {
    i2s_config_t cfg = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
        .sample_rate = SAMPLE_RATE,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
        .dma_buf_count = 4,
        .dma_buf_len = FRAME_SIZE,
    };
    i2s_driver_install(I2S_PORT, &cfg, 0, NULL);

    i2s_pin_config_t pins = {
        .bck_io_num   = 4,   // bit clock
        .ws_io_num    = 5,   // word select
        .data_in_num  = 6,   // microphone data
        .data_out_num = I2S_PIN_NO_CHANGE,
    };
    i2s_set_pin(I2S_PORT, &pins);
}

int16_t audio[FRAME_SIZE];
void readFrame() {
    size_t bytes_read;
    i2s_read(I2S_PORT, audio, sizeof(audio), &bytes_read, portMAX_DELAY);
}
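
Two details worth noting about this configuration: with dma_buf_count = 4 and dma_buf_len = 1024 samples, the driver buffers roughly a quarter second of audio in DMA memory, so brief stalls in the main loop (up to about that long) do not drop samples; and because readFrame() passes portMAX_DELAY, it simply blocks until a full 64 ms frame is available. Call setupI2S() once from setup(), then call readFrame() repeatedly from the main loop.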

Feature extraction

Raw audio at 16 kHz is too dense to feed directly into a neural network. The standard pre-processing converts the audio into a spectrogram or MFCC (Mel-Frequency Cepstral Coefficients) representation, which captures the relevant information in far fewer numbers.

For each ~1 second of audio:

  • Slice into ~30 ms frames with 10 ms overlap.
  • Apply FFT to each frame.
  • Compute power spectrum.
  • Map frequencies onto the Mel scale (frequency bins log-spaced like human hearing).
  • Take the log and (for MFCC) the discrete cosine transform.

The result is a 2D feature map (time × frequency) typically 49×40 or similar. This is what the neural network sees.

The TFLM example for keyword spotting includes the feature extraction code; you do not write it from scratch.
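
As a sanity check on those shapes, the framing arithmetic for one 1-second window looks roughly like this (the exact constants in the micro_speech front end may differ slightly):

// Framing arithmetic for one 1-second window of 16 kHz audio,
// using the frame sizes described above.
const int kSampleRate    = 16000;
const int kWindowMs      = 30;   // ~30 ms analysis frame
const int kStrideMs      = 20;   // 10 ms overlap => 20 ms hop
const int kMelBins       = 40;   // Mel-spaced frequency bins

const int kWindowSamples = kSampleRate * kWindowMs / 1000;                        // 480
const int kStrideSamples = kSampleRate * kStrideMs / 1000;                        // 320
const int kNumFrames     = (kSampleRate - kWindowSamples) / kStrideSamples + 1;   // 49

// Feature map fed to the model: kNumFrames x kMelBins = 49 x 40 values.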

The model

A small CNN. The TensorFlow micro-speech example uses a 4-layer model with depthwise separable convolutions. Trained on the Google Speech Commands dataset (a public corpus of 30 keywords), it recognises a target word + "unknown" + "silence" with around 90% accuracy.

Model size: 18 KB int8. Inference time on ESP32-S3: ~30 ms per window. With 100 ms hop between windows, the chip runs at 30% inference duty cycle, leaving plenty of headroom.
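
For a sense of what wiring the model into firmware involves, here is a minimal TFLM setup sketch. The operator list, arena size, and model array name are assumptions based on the micro_speech model, and exact API details shift between TFLM releases:

#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model_data.h"   // your int8 KWS model exported as a C array (assumed name)

constexpr int kArenaSize = 30 * 1024;          // scratch memory for activations
alignas(16) static uint8_t tensor_arena[kArenaSize];

static tflite::MicroInterpreter* interpreter = nullptr;

void setupModel() {
    const tflite::Model* model = tflite::GetModel(g_model_data);

    // Register only the operators the KWS graph actually uses.
    static tflite::MicroMutableOpResolver<4> resolver;
    resolver.AddDepthwiseConv2D();
    resolver.AddFullyConnected();
    resolver.AddSoftmax();
    resolver.AddReshape();

    static tflite::MicroInterpreter static_interpreter(
        model, resolver, tensor_arena, kArenaSize);
    interpreter = &static_interpreter;
    interpreter->AllocateTensors();
}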

Custom wake words

The Google Speech Commands dataset is fixed. To recognise a custom phrase like "Hey Builder" or "Computer", you have two paths:

Edge Impulse

Free for hobby projects. Record 100–500 samples of your wake word (multiple speakers, multiple environments). Edge Impulse trains a model and emits a TFLM-compatible deployment package. By far the easiest path; we recommend starting here.

Roll your own

Use TensorFlow's training scripts with your own data. Significantly more work; full control. Useful when Edge Impulse does not match your specific needs.

Either way, expect to record a lot of negative samples (random conversation, ambient noise, similar-sounding words) so the model learns to discriminate, not just to recognise.

The full firmware

Combining I2S capture, feature extraction, and inference is a multi-hundred-line project. We recommend starting with the official micro_speech example in the TensorFlow examples repository, then porting the feature extraction and swapping in your model as needed.

The skeleton:

unsigned long last_inference = 0;

void loop() {
    readFrame();                               // 64 ms of audio
    appendToCircularBuffer(audio, FRAME_SIZE); // keep the last ~1 second

    if (millis() - last_inference > 100) {     // every 100 ms
        last_inference = millis();
        computeFeatures(buffer, features);     // ~10 ms
        runInference(features, output);        // ~30 ms
        if (output[WAKE_WORD_INDEX] > THRESHOLD) {
            digitalWrite(LED, HIGH);
            delay(2000);                       // visual debounce; blocks capture for 2 s
            digitalWrite(LED, LOW);
        }
    }
}

Power considerations

Always-on listening is power-hungry. An ESP32-S3 with an I2S microphone sampling at 16 kHz draws roughly 30 mA continuously, so a 2500 mAh battery lasts about 80 hours (2500 mAh ÷ 30 mA ≈ 83 h) — a little over three days.

Two patterns help:

  • Voice activity detection (VAD) gate. A simple threshold detector that gates the heavyweight model. The model only runs when actual sound is present; the chip mostly idles. (A minimal sketch follows this list.)
  • Hardware accelerator. Some newer chips (ESP32-P4, certain Cortex-M55) have neural network accelerators that drop inference power dramatically. Not yet hobby-priced; coming.
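
An energy-based gate can be as simple as the following; the RMS floor is something you tune empirically for your microphone and room, and frameHasVoice() is an illustrative helper, not part of the TFLM example:

#include <math.h>

// Run the heavyweight KWS model only when the current frame's RMS energy
// clears a floor; otherwise skip inference and let the chip idle.
bool frameHasVoice(const int16_t* samples, int n, int32_t rms_floor) {
    int64_t sum_sq = 0;
    for (int i = 0; i < n; i++) {
        sum_sq += (int32_t)samples[i] * (int32_t)samples[i];
    }
    int32_t rms = (int32_t)sqrt((double)(sum_sq / n));
    return rms > rms_floor;
}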

For battery products, expect to sacrifice continuous listening for periodic listening (wake briefly every few seconds, listen for a moment, sleep again).
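
On the ESP32, the periodic pattern is usually built on the timer wake-up API. A rough sketch — the listen duration and sleep interval are illustrative, and listenForWakeWord() is a hypothetical helper:

#include "esp_sleep.h"

void loop() {
    listenForWakeWord(3000);                        // listen for ~3 s (hypothetical helper)
    esp_sleep_enable_timer_wakeup(5ULL * 1000000);  // wake again in 5 s (microseconds)
    esp_deep_sleep_start();                         // CPU halts; chip reboots through setup() on wake
}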

Privacy and the meaningful difference

The point of local wake-word detection is that audio never leaves the device unless the wake word fires. Even then, what gets sent is the action (lights on, timer set), not the audio. This is the meaningful privacy difference vs. cloud assistants.

For a fully cloud-free voice assistant, the wake word is the easy part. Speech-to-text after the wake word still typically uses a cloud service for full vocabulary; on-device alternatives (Whisper.cpp, Vosk) work but require significantly more compute than an ESP32 has.

Frequently Asked Questions

How well does it work in noisy environments?

Less well. The Google Speech Commands dataset is moderately noise-augmented; the resulting models work in normal homes but degrade in cafes, kitchens with running water, etc. For specific noisy environments, retrain on noise-augmented data from that environment.

Can it recognise multiple wake words?

Yes. The model output has a class per word; you train it with N+2 classes (N wake words plus "unknown" plus "silence"). Each adds modest training data and slightly more model complexity.

What is the false positive rate?

Highly variable. With a well-tuned threshold, expect 1–5 false positives per day in normal use. Poor tuning can produce dozens.

Could I run it in deep sleep?

Not directly — deep sleep stops the CPU. Some chips (ESP32-S3, with the ULP coprocessor; nRF5340 with the dedicated audio core) can run a smaller VAD or simpler keyword spotter while the main CPU sleeps. Not yet trivial; an active area in TinyML.

Is the on-device model accurate enough for a real product?

For wake-word detection, yes — this is roughly how the Echo and HomePod do their wake-word stage. For full speech understanding, no — that still happens in the cloud or on a phone, after the wake word fires.

Share your thoughts

Worked with this in production and have a story to share, or disagree with a tradeoff? Email us at support@mybytenest.com — we read everything.