Run-of-the-mill PIR motion sensors are useful but indiscriminate. They trigger on a warm dog, a sun-warmed leaf, the air conditioning kicking on. The next level up — a camera that captures whenever something moves — produces hundreds of useless images for every meaningful one.
Person detection on a microcontroller solves this. The camera sees a frame, the chip runs a tiny neural network on it, and reports yes-or-no on whether a person is present. Only positive detections trigger an alert or capture. The result: an order-of-magnitude reduction in false positives at the cost of a few hundred milliseconds of inference time.
This article walks through the build with an ESP32-S3 plus an OV2640 camera: hardware, model, code, and realistic performance numbers.
What we are building
flowchart LR
    PIR[PIR sensor<br>cheap pre-filter] -->|wake| MCU[[ESP32-S3]]
    MCU -->|capture| Cam[OV2640 camera<br>96 x 96 grayscale]
    Cam -->|frame| Inference[TFLite Micro<br>person detection model<br>~250 KB]
    Inference -->|score > 0.7| Alert[Send alert<br>save image<br>turn on light]
    Inference -->|low score| Sleep[Back to sleep]
The pipeline. PIR is a cheap pre-filter that wakes the chip; the chip captures and runs inference; only confident detections cause action.
Hardware
- Seeed XIAO ESP32-S3 Sense ($14) — integrated camera, microphone, microSD slot. The easiest starting point.
- Alternative: ESP32-S3 DevKit + OV2640 module ($15–20) — more wiring but works the same.
- PIR sensor (HC-SR501) ($2) — pre-filters motion events to save inference cost.
- 2500 mAh LiPo ($8) — enough for hundreds of detection cycles.
About $25 total. The XIAO ESP32-S3 Sense is the path of least resistance because its camera ships physically integrated, so there is no camera wiring to get wrong.
The model
The pre-trained person-detection model from TensorFlow's example repo: a MobileNet-V1 variant trained on a binary "person / no person" dataset (Visual Wake Words), quantised to int8. Roughly 250 KB. Input: 96×96 grayscale image. Output: 2 values (probability of no person at index 0, probability of person at index 1).
Training your own model is possible but rarely necessary. The pre-trained model is good enough for the typical use cases (security cameras, presence detection, pedestrian counting).
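If you are unsure whether your copy of the model matches those numbers, you can check the tensor shapes at startup, using the interpreter that the firmware below sets up. A minimal sanity-check sketch (the function name is mine, not part of any API):

// Verify the loaded model matches this article's assumptions:
// an int8 96x96x1 input and an int8 two-class output.
bool modelLooksRight(tflite::MicroInterpreter* interp) {
  TfLiteTensor* in = interp->input(0);
  TfLiteTensor* out = interp->output(0);
  return in->type == kTfLiteInt8 &&
         in->dims->data[1] == kInputH &&
         in->dims->data[2] == kInputW &&
         out->type == kTfLiteInt8 &&
         out->dims->data[out->dims->size - 1] == 2;
}

Call it once after AllocateTensors(); if it returns false, you are holding a different model than this article assumes.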
Inference performance
On the ESP32-S3:
- Single inference: ~500 ms with the stock TFLite Micro kernels, ~250 ms with the ESP-NN optimised kernels (which use the S3's vector instructions).
- Frame capture (OV2640 to grayscale 96×96): ~80 ms.
- Total wake-to-decision time: ~600 ms with stock kernels, roughly 330 ms with ESP-NN.
Not real-time video, but plenty fast for triggered captures (PIR fires, the system wakes, runs one inference, decides), and viable for continuous monitoring at 1–2 fps.
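To get the numbers for your own board and build flags, time the capture and the inference separately with esp_timer_get_time(). A minimal sketch, reusing detectPerson() from the firmware below:

#include <esp_timer.h>  // esp_timer_get_time(): microseconds since boot

// Time one capture and one inference, printing both in milliseconds.
void benchmarkOnce() {
  int64_t t0 = esp_timer_get_time();
  camera_fb_t* frame = esp_camera_fb_get();  // capture
  if (!frame) return;
  int64_t t1 = esp_timer_get_time();
  int score = detectPerson(frame);           // inference
  int64_t t2 = esp_timer_get_time();
  esp_camera_fb_return(frame);
  Serial.printf("capture %lld ms, inference %lld ms (score=%d)\n",
                (t1 - t0) / 1000, (t2 - t1) / 1000, score);
}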
The firmware
#include <esp_camera.h>
#include <TensorFlowLite_ESP32.h>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"  // tflite::GetModel()
#include "person_detect_model_data.h"

#define PIR_PIN 2  // GPIO wired to the HC-SR501 output; adjust to your wiring.
                   // Must be an RTC-capable pin for ext0 deep-sleep wake.

void handlePersonDetected(camera_fb_t* frame);  // your action: save, alert, ...

constexpr int kInputW = 96;
constexpr int kInputH = 96;
constexpr int kArenaSize = 130 * 1024;        // scratch memory for TFLite Micro
alignas(16) uint8_t tensor_arena[kArenaSize];
const tflite::Model* model;
tflite::MicroInterpreter* interpreter;
TfLiteTensor* input;
TfLiteTensor* output;
void setupCamera() {
  camera_config_t config = {};  // zero-init so unset fields are well defined
  // ... XIAO ESP32-S3 Sense camera pin configuration ...
  config.frame_size = FRAMESIZE_96X96;
  config.pixel_format = PIXFORMAT_GRAYSCALE;
  config.fb_count = 1;
  if (esp_camera_init(&config) != ESP_OK) {
    Serial.println("Camera init failed");
  }
}
void setupModel() {
  model = tflite::GetModel(g_person_detect_model_data);
  // Register only the five ops the person-detection model actually uses.
  static tflite::MicroMutableOpResolver<5> resolver;
  resolver.AddDepthwiseConv2D();
  resolver.AddConv2D();
  resolver.AddAveragePool2D();
  resolver.AddReshape();
  resolver.AddSoftmax();
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kArenaSize);
  interpreter = &static_interpreter;
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    Serial.println("AllocateTensors failed: is the arena large enough?");
    return;
  }
  input = interpreter->input(0);
  output = interpreter->output(0);
}
int detectPerson(camera_fb_t* frame) {
  // The camera delivers unsigned 8-bit pixels (0..255); the int8 model input
  // expects signed values (-128..127), so shift the zero point while copying.
  for (int i = 0; i < kInputW * kInputH; i++) {
    input->data.int8[i] = (int8_t)(frame->buf[i] - 128);
  }
  if (interpreter->Invoke() != kTfLiteOk) return -1;
  int8_t person_score = output->data.int8[1];     // index 1 = "person"
  int8_t no_person_score = output->data.int8[0];  // index 0 = "no person"
  return person_score - no_person_score;  // positive = person
}
void setup() {
  Serial.begin(115200);
  setupCamera();
  setupModel();
  pinMode(PIR_PIN, INPUT);
  // Arm the PIR pin as a deep-sleep wake source (used once you add the
  // deep-sleep duty cycle described under "Practical caveats").
  esp_sleep_enable_ext0_wakeup((gpio_num_t)PIR_PIN, 1);
}
void loop() {
  if (digitalRead(PIR_PIN) == HIGH) {
    camera_fb_t* frame = esp_camera_fb_get();
    if (frame) {
      int score = detectPerson(frame);
      Serial.printf("score=%d\n", score);
      if (score > 50) {
        // Person detected: save image, send alert, etc.
        handlePersonDetected(frame);
      }
      esp_camera_fb_return(frame);
    }
  }
  delay(100);
}

Threshold tuning
The score is the difference between "person" and "no person" class outputs. Higher score = more confident person detection.
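If you want an actual probability rather than a raw int8 difference, dequantise the output tensor using its own quantisation parameters (for int8 softmax outputs these are conventionally scale = 1/256 and zero point = -128). A small sketch against the firmware's globals:

// Convert the raw int8 "person" output into a float probability in [0, 1].
float personProbability() {
  float scale = output->params.scale;
  int zero_point = output->params.zero_point;
  return (output->data.int8[1] - zero_point) * scale;  // index 1 = "person"
}

Under that convention, the firmware's score > 50 corresponds to a person probability of roughly 0.6, and the 0.7 in the pipeline diagram to a score of about 100.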
Calibration: capture 100 frames in your specific deployment (lighting, camera angle, scene) and label them. Find the score threshold that gives you the right precision-recall tradeoff. For most security applications, prefer high precision (few false alarms) over high recall (catch every person). A threshold of 50 is a starting point; tune from there.
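One way to run that sweep once you have labelled scores. This is a host-side sketch, not MCU code, and the sample data is hypothetical:

#include <cstdio>
#include <vector>

// Sweep candidate thresholds over labelled calibration scores and print
// precision/recall so you can pick the tradeoff you want.
int main() {
  struct Sample { int score; bool person; };
  // Hypothetical data: detectPerson() score plus the hand-assigned label.
  std::vector<Sample> samples = {
      {120, true}, {85, true}, {30, false}, {64, true}, {-40, false},
      {55, false}, {110, true}, {-10, false}, {70, true}, {20, false},
  };
  for (int threshold = 0; threshold <= 120; threshold += 10) {
    int tp = 0, fp = 0, fn = 0;
    for (const auto& s : samples) {
      bool predicted = s.score > threshold;
      if (predicted && s.person) tp++;
      if (predicted && !s.person) fp++;
      if (!predicted && s.person) fn++;
    }
    double precision = (tp + fp) ? (double)tp / (tp + fp) : 1.0;
    double recall = (tp + fn) ? (double)tp / (tp + fn) : 0.0;
    printf("threshold %3d: precision %.2f  recall %.2f\n",
           threshold, precision, recall);
  }
}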
Practical caveats
- The model is trained on adults in normal poses. Children, partial bodies, and unusual postures are detected less reliably.
- Low light kills accuracy. The OV2640 is a poor low-light sensor. For night-time use, adding IR illumination and removing the IR-cut filter gives better results.
- Camera position matters. The model expects roughly head-and-shoulders view. Looking straight down or far away reduces accuracy.
- The pre-trained model has biases. It was trained on a specific dataset in which certain demographics are under-represented. For products at scale, consider retraining on representative data or using a paid service like Edge Impulse to fine-tune.
- Running continuously drains the battery. Pair with PIR pre-filtering or a duty cycle (one inference every 30 seconds when idle, faster once something is detected); see the sketch after this list.
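A minimal duty-cycle sketch, assuming the setupCamera()/setupModel()/detectPerson() functions from the firmware above: wake on either the PIR pin or a 30-second timer, run one detection, then drop back into deep sleep (tens of microamps on the ESP32-S3, versus tens of milliamps awake):

// Deep-sleep duty cycle: the chip resets into setup() on every wake,
// runs one detection, re-arms its wake sources, and powers down again.
void setup() {
  Serial.begin(115200);
  setupCamera();
  setupModel();

  camera_fb_t* frame = esp_camera_fb_get();
  if (frame) {
    if (detectPerson(frame) > 50) handlePersonDetected(frame);
    esp_camera_fb_return(frame);
  }

  esp_sleep_enable_ext0_wakeup((gpio_num_t)PIR_PIN, 1);  // PIR wake
  esp_sleep_enable_timer_wakeup(30ULL * 1000 * 1000);    // 30 s idle poll
  esp_deep_sleep_start();                                // never returns
}

void loop() {}  // unreachable: deep sleep restarts the sketch in setup()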
Going further
- Multi-class detection. Replace the binary model with a multi-class one that distinguishes person, vehicle, and animal. This is more commonly done via Edge Impulse than built from scratch.
- Bounding boxes. The binary model only says that a person is present; bounding-box detection says where, which is significantly harder on an MCU but possible with TinyML object detection models.
- Custom training. Use the existing model architecture, retrain on your own data, deploy. Edge Impulse handles this end-to-end.
- Streaming inference. Run inference continuously at 1–2 fps and accumulate confidence over multiple frames before alerting; this suppresses single-frame false positives. A sketch follows this list.
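The multi-frame idea in miniature: smooth the per-frame score with an exponential moving average so that one noisy frame cannot fire an alert but a run of confident frames can. The smoothing factor and trigger level are assumptions to tune, not values from this build:

// Exponential moving average over detection scores.
constexpr float kAlpha = 0.3f;          // per-frame weight (assumed; tune)
constexpr float kTriggerLevel = 50.0f;  // alert when the average stays high

float smoothed_score = 0.0f;

bool updateAndCheck(int frame_score) {
  smoothed_score = kAlpha * frame_score + (1.0f - kAlpha) * smoothed_score;
  return smoothed_score > kTriggerLevel;
}

// In loop(): if (updateAndCheck(detectPerson(frame))) handlePersonDetected(frame);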
Frequently Asked Questions
How accurate is the pre-trained model?
On the test data: ~90% accuracy. In real deployments it varies widely with lighting, camera angle, and subject. Plan to tune the threshold for your specific scene.
Share your thoughts
Worked with this in production and have a story to share, or disagree with a tradeoff? Email us at support@mybytenest.com — we read everything.