The phrase "machine learning on a microcontroller" sounded absurd in 2018 and routine in 2026. The shift was not a single breakthrough but the convergence of three things: TensorFlow Lite Micro's tiny runtime (16 KB), modern microcontrollers with vector instructions (ESP32-S3, Cortex-M4, Cortex-M7), and a workflow that lets you train a model in Python on your laptop and deploy it as a header file you compile into your firmware.
This article is the practical introduction: the mental model, the workflow, the constraints, and a working hello-world that runs an actual neural network on an ESP32-S3.
What TinyML actually is
TinyML is the practice of running ML inference on microcontroller-class hardware: kilobytes of RAM, tens to hundreds of MHz of CPU, no operating system. Training does not happen on the device — that still happens on a laptop, in Colab, or on a cluster. The trained model gets quantised (typically to 8-bit integers), converted to a flatbuffer format, and embedded in firmware.
Train the model (laptop or cloud, TensorFlow / Keras) → Convert (tflite_convert, quantise to int8) → Generate C array (xxd or convert tool) → Build firmware (link TFLM runtime) → Deploy to MCU → Run inference at the edge.
The TinyML workflow: train where compute is plentiful, deploy where it is scarce.
Why bother
Three concrete reasons:
- Latency. Cloud inference round-trip: 100–300 ms. On-device inference: 10–100 ms. Critical for wake-word detection and gesture recognition.
- Power. Continuously transmitting audio or images to the cloud kills batteries. Running inference locally and only transmitting on a positive event extends battery life by orders of magnitude.
- Privacy. Voice and video never leave the device. Increasingly important for consumer products.
What ESP32-S3 brings
The ESP32-S3 (and the newer ESP32-P4) added vector instructions specifically for accelerating neural network inference. A model that runs at 5 inferences per second on the original ESP32 runs at 30+ on the S3. Combined with 512 KB of RAM and external PSRAM support, it has become the default hobby platform for TinyML.
The Cortex-M alternative: STM32H7 (Cortex-M7) and the newer STM32U5 are also capable, with the CMSIS-NN library doing similar acceleration. Performance is comparable to ESP32-S3.
The workflow
Step 1: train a model
Train as usual in TensorFlow / Keras. Make the model small — depth-wise separable convolutions, fewer layers, smaller widths than a server-class model. The general guideline: aim for under 100 KB of weights for an MCU model.
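As a concrete illustration, a keyword-spotting-style network built from depth-wise separable convolutions stays well under that budget; the input shape and layer sizes below are placeholders for the example, not a reference architecture:
import tensorflow as tf

# A small classifier over e.g. 49x40 MFCC spectrogram frames (about 1 s of audio)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),
    tf.keras.layers.Conv2D(8, 3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.SeparableConv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.SeparableConv2D(32, 3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation='softmax'),  # e.g. four keyword classes
])
model.summary()  # at int8, the weight footprint is roughly one byte per parameter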
Step 2: convert to TFLite
import tensorflow as tf
# Convert the trained Keras model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.representative_dataset = lambda: ([sample] for sample in calibration_data)
tflite_model = converter.convert()
open('model.tflite', 'wb').write(tflite_model)
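Before flashing anything, it is worth sanity-checking the quantised model on the laptop with tf.lite.Interpreter; if the output is wrong here, no amount of firmware debugging will fix it. A minimal sketch, reusing the calibration_data from the conversion step above:
import numpy as np
import tensorflow as tf

# Load the freshly converted model and run one calibration sample through it
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

scale, zero_point = inp['quantization']
sample = np.asarray(calibration_data[0], dtype=np.float32)
q_in = np.round(sample / scale + zero_point).astype(np.int8)
interpreter.set_tensor(inp['index'], q_in.reshape(inp['shape']))
interpreter.invoke()

out_scale, out_zero = out['quantization']
q_out = interpreter.get_tensor(out['index'])
print((q_out.astype(np.float32) - out_zero) * out_scale)
print(f"model size: {len(tflite_model) / 1024:.1f} KB")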
Step 3: convert to C array
xxd -i model.tflite > model_data.h
# Or use a script that produces a slightly cleaner output
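One wrinkle with xxd -i is that it names the array after the file name (model_tflite for model.tflite), so either rename the symbol or generate the header yourself. A minimal sketch of such a script, emitting an aligned array called model_data to match the firmware below:
# Hypothetical convert script: writes model.tflite as an aligned C array
with open('model.tflite', 'rb') as f:
    data = f.read()

with open('model_data.h', 'w') as f:
    f.write('#pragma once\n\n')
    f.write('alignas(8) const unsigned char model_data[] = {\n')
    for i in range(0, len(data), 12):
        line = ', '.join(f'0x{b:02x}' for b in data[i:i + 12])
        f.write(f'    {line},\n')
    f.write('};\n')
    f.write(f'const unsigned int model_data_len = {len(data)};\n')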
Step 4: include in firmware
#include <TensorFlowLite_ESP32.h>
#include "model_data.h"
// Roughly 50 KB of memory for tensor arena
constexpr int kArenaSize = 50 * 1024;
uint8_t tensor_arena[kArenaSize];
const tflite::Model* model;
tflite::MicroInterpreter* interpreter;
TfLiteTensor* input;
TfLiteTensor* output;
void setupModel() {
  model = tflite::GetModel(model_data);
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kArenaSize);
  interpreter = &static_interpreter;
  // AllocateTensors fails if the tensor arena is too small
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    return;
  }
  input = interpreter->input(0);
  output = interpreter->output(0);
}
float runInference(const float* features, size_t count) {
  // Quantise input from float to int8
  for (size_t i = 0; i < count; i++) {
    input->data.int8[i] = (int8_t)(features[i] / input->params.scale
                                   + input->params.zero_point);
  }
  interpreter->Invoke();
  // Dequantise the int8 output back to float
  int8_t out_q = output->data.int8[0];
  return (out_q - output->params.zero_point) * output->params.scale;
}
The hello world: sine wave prediction
The classic introductory model. A tiny neural net trained to predict sin(x). The model fits in under 3 KB; inference takes microseconds. Useless in itself, but demonstrates the entire pipeline working.
Train in Colab in 5 minutes. Convert. Deploy. The ESP32 runs the model and prints predicted sine values that match the ground truth. The first time it works, it feels improbable that 3 KB of weights are reproducing a sine function.
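A minimal training sketch (layer sizes and epoch count are illustrative, not the official example's exact values):
import numpy as np
import tensorflow as tf

# Training data: x in [0, 2*pi], y = sin(x)
x = np.random.uniform(0, 2 * np.pi, 1000).astype(np.float32)
y = np.sin(x)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=300, batch_size=32, verbose=0)
# From here, the conversion steps above apply unchanged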
The official TensorFlow examples include this as hello_world. Start there before anything ambitious.
Realistic models you can run
- Wake word detection — "Hey Computer" activation. ~50 KB model. 95% accuracy on trained vocabulary.
- Person detection — binary "is there a person in this image" on 96×96 grayscale frames. ~250 KB. Roughly 500 ms per inference on ESP32-S3.
- Gesture classification — from accelerometer data, recognise "swipe left", "swipe right", "rotate". Tens of KB.
- Anomaly detection — an autoencoder trained on normal motor vibration; flags abnormal patterns. ~30 KB.
- Keyword spotting — recognise a small vocabulary (10–30 words). 100–500 KB.
What you cannot fit on an MCU: full speech-to-text, real image classification (ImageNet-scale), large language models. The line is roughly 1 MB of weights; below that, a model is plausible on a Cortex-M7 or an ESP32-S3 with PSRAM.
Quantisation
Models trained in float32 are way too large for MCUs. Quantisation reduces precision (typically to int8) at modest accuracy cost. A 1 MB float32 model becomes a 250 KB int8 model.
Two flavours:
- Post-training quantisation: train normally, quantise after. Easy. Accuracy loss usually 1–3%.
- Quantisation-aware training: simulate quantisation during training. More complex; accuracy loss often under 1%.
Start with post-training. Move to quantisation-aware only if accuracy loss matters.
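If you do go quantisation-aware, the TensorFlow Model Optimization toolkit (tensorflow_model_optimization) wraps an existing Keras model with fake-quantisation nodes for a short fine-tune. A minimal sketch, where model, train_x and train_y stand in for whatever you built in Step 1 and the loss depends on your task:
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Insert fake-quantisation nodes so the weights adapt to int8 during fine-tuning
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
q_aware_model.fit(train_x, train_y, epochs=5, validation_split=0.1)

# Conversion then proceeds exactly as in Step 2, on q_aware_model
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()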
The friendlier path: Edge Impulse
Edge Impulse is a hosted platform that handles the entire workflow: data collection, model training, optimisation, deployment. Free tier for small projects; paid for production. Particularly good for sensor-data classification (accelerometer, audio).
For your first non-trivial project, Edge Impulse is dramatically faster than rolling your own. Once you understand the workflow, you can move to raw TensorFlow if you need more control.
Things that go wrong
- Model does not fit. The compiled binary including model + TFLM runtime exceeds flash. Reduce model size, use PSRAM for weights, or pick a bigger chip.
- Tensor arena too small. Inference fails at AllocateTensors. Increase the arena size by trial and error; the library tells you the minimum after a successful allocation.
- Output looks like noise. Quantisation parameters not handled correctly. Verify the input is properly scaled to the model's expected input range (a quick desktop check follows after this list).
- Inference is slow. Compiler optimisations are off, or operations are not reaching the accelerated kernels. The ESP32-S3's ESP-NN library provides those kernels when it is correctly linked.
- Accuracy worse on device than on the laptop. Almost always quantisation. Check the calibration dataset is representative of real input distribution.
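For the quantisation-related failures, a quick desktop check prints the scale and zero point the firmware must apply (the same values that appear as input->params.scale and zero_point on device):
import tensorflow as tf

# Inspect the quantisation parameters baked into the converted model
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
for detail in interpreter.get_input_details() + interpreter.get_output_details():
    scale, zero_point = detail['quantization']
    print(detail['name'], 'scale =', scale, 'zero_point =', zero_point)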
Frequently Asked Questions
Can I train on the device?
Generally no. Training requires float-precision arithmetic and significant memory. A handful of frameworks support on-device fine-tuning (small adaptations to a pre-trained model), but full training stays on a laptop.