The phrase "machine learning on a microcontroller" sounded absurd in 2018 and routine in 2026. The shift was not a single breakthrough but the convergence of three things: TensorFlow Lite Micro's tiny runtime (16 KB), modern microcontrollers with vector instructions (ESP32-S3, Cortex-M4, Cortex-M7), and a workflow that lets you train a model in Python on your laptop and deploy it as a header file you compile into your firmware.
This article is the practical introduction: the mental model, the workflow, the constraints, and a working hello-world that runs an actual neural network on an ESP32-S3.
What TinyML actually is
TinyML is the practice of running ML inference on microcontroller-class hardware: kilobytes of RAM, tens to hundreds of MHz of CPU, no operating system. Training does not happen on the device — that still happens on a laptop, in Colab, or on a cluster. The trained model gets quantised (typically to 8-bit integers), converted to a flatbuffer format, and embedded in firmware.
Train the model (laptop or cloud, TensorFlow / Keras) → Convert (tflite_convert, quantise to int8) → Generate C array (xxd or convert tool) → Build firmware (link TFLM runtime) → Deploy to MCU → Run inference at the edge.
The TinyML workflow: train where compute is plentiful, deploy where it is scarce.
Why bother
Three concrete reasons:
- Latency. Cloud inference round-trip: 100–300 ms. On-device inference: 10–100 ms. Critical for wake-word detection and gesture recognition.
- Power. Continuously transmitting audio or images to the cloud kills batteries. Running inference locally and only transmitting on a positive event extends battery life by orders of magnitude.
- Privacy. Voice and video never leave the device. Increasingly important for consumer products.
What ESP32-S3 brings
The ESP32-S3 (and the newer ESP32-P4) added vector instructions specifically for accelerating neural network inference. A model that runs at 5 inferences per second on the original ESP32 runs at 30+ on the S3. Combined with 512 KB of RAM and external PSRAM support, it has become the default hobby platform for TinyML.
The Cortex-M alternative: STM32H7 (Cortex-M7) and the newer STM32U5 are also capable, with the CMSIS-NN library doing similar acceleration. Performance is comparable to ESP32-S3.
The workflow
Step 1: train a model
Train as usual in TensorFlow / Keras. Make the model small — depth-wise separable convolutions, fewer layers, smaller widths than a server-class model. The general guideline: aim for under 100 KB of weights for an MCU model.
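As a concrete illustration, a keyword-spotting-style network built from depth-wise separable convolutions stays well under that budget; the input shape and layer sizes below are placeholders for the example, not a reference architecture:
import tensorflow as tf

# A small classifier over e.g. 49x40 MFCC spectrogram frames (about 1 s of audio)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),
    tf.keras.layers.Conv2D(8, 3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.SeparableConv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.SeparableConv2D(32, 3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation='softmax'),  # e.g. four keyword classes
])
model.summary()  # at int8, the weight footprint is roughly one byte per parameter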
Step 2: convert to TFLite
import tensorflow as tf
# Convert the trained Keras model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.representative_dataset = lambda: ([sample] for sample in calibration_data)
tflite_model = converter.convert()
open('model.tflite', 'wb').write(tflite_model)
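Before flashing anything, it is worth sanity-checking the quantised model on the laptop with tf.lite.Interpreter; if the output is wrong here, no amount of firmware debugging will fix it. A minimal sketch, reusing the calibration_data from the conversion step above:
import numpy as np
import tensorflow as tf

# Load the freshly converted model and run one calibration sample through it
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

scale, zero_point = inp['quantization']
sample = np.asarray(calibration_data[0], dtype=np.float32)
q_in = np.round(sample / scale + zero_point).astype(np.int8)
interpreter.set_tensor(inp['index'], q_in.reshape(inp['shape']))
interpreter.invoke()

out_scale, out_zero = out['quantization']
q_out = interpreter.get_tensor(out['index'])
print((q_out.astype(np.float32) - out_zero) * out_scale)
print(f"model size: {len(tflite_model) / 1024:.1f} KB")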
Step 3: convert to C array
xxd -i model.tflite > model_data.h
# Or use a script that produces a slightly cleaner output
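One wrinkle with xxd -i is that it names the array after the file name (model_tflite for model.tflite), so either rename the symbol or generate the header yourself. A minimal sketch of such a script, emitting an aligned array called model_data to match the firmware below:
# Hypothetical convert script: writes model.tflite as an aligned C array
with open('model.tflite', 'rb') as f:
    data = f.read()

with open('model_data.h', 'w') as f:
    f.write('#pragma once\n\n')
    f.write('alignas(8) const unsigned char model_data[] = {\n')
    for i in range(0, len(data), 12):
        line = ', '.join(f'0x{b:02x}' for b in data[i:i + 12])
        f.write(f'    {line},\n')
    f.write('};\n')
    f.write(f'const unsigned int model_data_len = {len(data)};\n')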
Step 4: include in firmware
#include <TensorFlowLite_ESP32.h>
#include "model_data.h"
// Roughly 50 KB of memory for tensor arena
constexpr int kArenaSize = 50 * 1024;
uint8_t tensor_arena[kArenaSize];
const tflite::Model* model;
tflite::MicroInterpreter* interpreter;
TfLiteTensor* input;
TfLiteTensor* output;
void setupModel() {
  model = tflite::GetModel(model_data);
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kArenaSize);
  interpreter = &static_interpreter;
  // AllocateTensors fails if the tensor arena is too small
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    return;
  }
  input = interpreter->input(0);
  output = interpreter->output(0);
}
float runInference(const float* features, size_t count) {
  // Quantise input from float to int8
  for (size_t i = 0; i < count; i++) {
    input->data.int8[i] = (int8_t)(features[i] / input->params.scale
                                   + input->params.zero_point);
  }
  interpreter->Invoke();
  // Dequantise the int8 output back to float
  int8_t out_q = output->data.int8[0];
  return (out_q - output->params.zero_point) * output->params.scale;
}
The hello world: sine wave prediction
The classic introductory model. A tiny neural net trained to predict sin(x). The model fits in under 3 KB; inference takes microseconds. Useless in itself, but demonstrates the entire pipeline working.
Train in Colab in 5 minutes. Convert. Deploy. The ESP32 runs the model and prints predicted sine values that match the ground truth. The first time it works, it feels improbable that 3 KB of weights are reproducing a sine function.
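A minimal training sketch (layer sizes and epoch count are illustrative, not the official example's exact values):
import numpy as np
import tensorflow as tf

# Training data: x in [0, 2*pi], y = sin(x)
x = np.random.uniform(0, 2 * np.pi, 1000).astype(np.float32)
y = np.sin(x)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=300, batch_size=32, verbose=0)
# From here, the conversion steps above apply unchanged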
The official TensorFlow examples include this as hello_world. Start there before anything ambitious.
Realistic models you can run
- Wake word detection — "Hey Computer" activation. ~50 KB model. 95% accuracy on trained vocabulary.
- Person detection — binary "is there a person in this image" on 96×96 grayscale frames. ~250 KB. Roughly 500 ms per inference on ESP32-S3.
- Gesture classification — from accelerometer data, recognise "swipe left", "swipe right", "rotate". Tens of KB.
- Anomaly detection — an autoencoder trained on normal motor vibration; flags abnormal patterns. ~30 KB.
- Keyword spotting — recognise a small vocabulary (10–30 words). 100–500 KB.
What you cannot fit on an MCU: full speech-to-text, real image classification (ImageNet-scale), large language models. The line is roughly 1 MB of weights; below that, a model is plausible on a Cortex-M7 or an ESP32-S3 with PSRAM.
Quantisation
Models trained in float32 are way too large for MCUs. Quantisation reduces precision (typically to int8) at modest accuracy cost. A 1 MB float32 model becomes a 250 KB int8 model.
Two flavours:
- Post-training quantisation: train normally, quantise after. Easy. Accuracy loss usually 1–3%.
- Quantisation-aware training: simulate quantisation during training. More complex; accuracy loss often under 1%.
Start with post-training. Move to quantisation-aware only if accuracy loss matters.
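If you do go quantisation-aware, the TensorFlow Model Optimization toolkit (tensorflow_model_optimization) wraps an existing Keras model with fake-quantisation nodes for a short fine-tune. A minimal sketch, where model, train_x and train_y stand in for whatever you built in Step 1 and the loss depends on your task:
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Insert fake-quantisation nodes so the weights adapt to int8 during fine-tuning
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
q_aware_model.fit(train_x, train_y, epochs=5, validation_split=0.1)

# Conversion then proceeds exactly as in Step 2, on q_aware_model
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()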
The friendlier path: Edge Impulse
Edge Impulse is a hosted platform that handles the entire workflow: data collection, model training, optimisation, deployment. Free tier for small projects; paid for production. Particularly good for sensor-data classification (accelerometer, audio).
For your first non-trivial project, Edge Impulse is dramatically faster than rolling your own. Once you understand the workflow, you can move to raw TensorFlow if you need more control.
Things that go wrong
- Model does not fit. The compiled binary including model + TFLM runtime exceeds flash. Reduce model size, use PSRAM for weights, or pick a bigger chip.
- Tensor arena too small. Inference fails at AllocateTensors. Increase the arena size by trial and error; the library tells you the minimum after a successful allocation.
- Output looks like noise. Quantisation parameters not handled correctly. Verify the input is properly scaled to the model's expected input range (a quick desktop check follows after this list).
- Inference is slow. Compiler optimisations are off, or operations are not reaching the accelerated kernels. The ESP32-S3's ESP-NN library provides those kernels when it is correctly linked.
- Accuracy worse on device than on the laptop. Almost always quantisation. Check the calibration dataset is representative of real input distribution.
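For the quantisation-related failures, a quick desktop check prints the scale and zero point the firmware must apply (the same values that appear as input->params.scale and zero_point on device):
import tensorflow as tf

# Inspect the quantisation parameters baked into the converted model
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
for detail in interpreter.get_input_details() + interpreter.get_output_details():
    scale, zero_point = detail['quantization']
    print(detail['name'], 'scale =', scale, 'zero_point =', zero_point)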
Frequently Asked Questions
Can I train on the device?
Generally no. Training requires float-precision arithmetic and significant memory. A handful of frameworks support on-device fine-tuning (small adaptations to a pre-trained model), but full training stays on a laptop.