Run-of-the-mill PIR motion sensors are useful but indiscriminate. They trigger on a warm dog, a sun-warmed leaf, the air conditioning kicking on. The next level up — a camera that captures whenever something moves — produces hundreds of useless images for every meaningful one.
Person detection on a microcontroller solves this. The camera sees a frame, the chip runs a tiny neural network on it, and reports yes-or-no on whether a person is present. Only positive detections trigger an alert or capture. The result: an order-of-magnitude reduction in false positives at the cost of a few hundred milliseconds of inference time.
This article walks through the build with an ESP32-S3 plus an OV2640 camera: hardware, model, code, and realistic performance numbers.
What we are building
flowchart LR
    PIR[PIR sensor<br>cheap pre-filter] -->|wake| MCU[[ESP32-S3]]
    MCU -->|capture| Cam[OV2640 camera<br>96 x 96 grayscale]
    Cam -->|frame| Inference[TFLite Micro<br>person detection model<br>~250 KB]
    Inference -->|score > 0.7| Alert[Send alert<br>save image<br>turn on light]
    Inference -->|low score| Sleep[Back to sleep]
The pipeline. PIR is a cheap pre-filter that wakes the chip; the chip captures and runs inference; only confident detections cause action.
Hardware
- Seeed XIAO ESP32-S3 Sense ($14) — integrated camera, microphone, microSD slot. The easiest starting point.
- Alternative: ESP32-S3 DevKit + OV2640 module ($15–20) — more wiring but works the same.
- PIR sensor (HC-SR501) ($2) — pre-filters motion events to save inference cost.
- 2500 mAh LiPo ($8) — enough for hundreds of detection cycles.
About $25 total. The XIAO ESP32-S3 Sense is the path of least resistance because its camera ships physically integrated, so there is no camera wiring to get wrong.
The model
The pre-trained person-detection model from TensorFlow's example repo: a MobileNet-V1 variant trained on a binary "person / no person" dataset (Visual Wake Words), quantised to int8. Roughly 250 KB. Input: 96×96 grayscale image. Output: 2 values (probability of no person at index 0, probability of person at index 1).
Training your own model is possible but rarely necessary. The pre-trained model is good enough for the typical use cases (security cameras, presence detection, pedestrian counting).
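If you are unsure whether your copy of the model matches those numbers, you can check the tensor shapes at startup, using the interpreter that the firmware below sets up. A minimal sanity-check sketch (the function name is mine, not part of any API):

// Verify the loaded model matches this article's assumptions:
// an int8 96x96x1 input and an int8 two-class output.
bool modelLooksRight(tflite::MicroInterpreter* interp) {
  TfLiteTensor* in = interp->input(0);
  TfLiteTensor* out = interp->output(0);
  return in->type == kTfLiteInt8 &&
         in->dims->data[1] == kInputH &&
         in->dims->data[2] == kInputW &&
         out->type == kTfLiteInt8 &&
         out->dims->data[out->dims->size - 1] == 2;
}

Call it once after AllocateTensors(); if it returns false, you are holding a different model than this article assumes.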
Inference performance
On the ESP32-S3:
- Single inference: ~500 ms with the stock TFLite Micro kernels, ~250 ms with the ESP-NN optimised kernels (which use the S3's vector instructions).
- Frame capture (OV2640 to grayscale 96×96): ~80 ms.
- Total wake-to-decision time: ~600 ms with stock kernels, roughly 330 ms with ESP-NN.
Not real-time video, but plenty fast for triggered captures (PIR fires, the system wakes, runs one inference, decides), and viable for continuous monitoring at 1–2 fps.
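To get the numbers for your own board and build flags, time the capture and the inference separately with esp_timer_get_time(). A minimal sketch, reusing detectPerson() from the firmware below:

#include <esp_timer.h>  // esp_timer_get_time(): microseconds since boot

// Time one capture and one inference, printing both in milliseconds.
void benchmarkOnce() {
  int64_t t0 = esp_timer_get_time();
  camera_fb_t* frame = esp_camera_fb_get();  // capture
  if (!frame) return;
  int64_t t1 = esp_timer_get_time();
  int score = detectPerson(frame);           // inference
  int64_t t2 = esp_timer_get_time();
  esp_camera_fb_return(frame);
  Serial.printf("capture %lld ms, inference %lld ms (score=%d)\n",
                (t1 - t0) / 1000, (t2 - t1) / 1000, score);
}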
The firmware
#include <esp_camera.h>
#include <TensorFlowLite_ESP32.h>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"  // tflite::GetModel()
#include "person_detect_model_data.h"

#define PIR_PIN 2  // GPIO wired to the HC-SR501 output; adjust to your wiring.
                   // Must be an RTC-capable pin for ext0 deep-sleep wake.

void handlePersonDetected(camera_fb_t* frame);  // your action: save, alert, ...

constexpr int kInputW = 96;
constexpr int kInputH = 96;
constexpr int kArenaSize = 130 * 1024;        // scratch memory for TFLite Micro
alignas(16) uint8_t tensor_arena[kArenaSize];
const tflite::Model* model;
tflite::MicroInterpreter* interpreter;
TfLiteTensor* input;
TfLiteTensor* output;
void setupCamera() {
  camera_config_t config = {};  // zero-init so unset fields are well defined
  // ... XIAO ESP32-S3 Sense camera pin configuration ...
  config.frame_size = FRAMESIZE_96X96;
  config.pixel_format = PIXFORMAT_GRAYSCALE;
  config.fb_count = 1;
  if (esp_camera_init(&config) != ESP_OK) {
    Serial.println("Camera init failed");
  }
}
void setupModel() {
  model = tflite::GetModel(g_person_detect_model_data);
  // Register only the five ops the person-detection model actually uses.
  static tflite::MicroMutableOpResolver<5> resolver;
  resolver.AddDepthwiseConv2D();
  resolver.AddConv2D();
  resolver.AddAveragePool2D();
  resolver.AddReshape();
  resolver.AddSoftmax();
  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kArenaSize);
  interpreter = &static_interpreter;
  if (interpreter->AllocateTensors() != kTfLiteOk) {
    Serial.println("AllocateTensors failed: is the arena large enough?");
    return;
  }
  input = interpreter->input(0);
  output = interpreter->output(0);
}
int detectPerson(camera_fb_t* frame) {
  // The camera delivers unsigned 8-bit pixels (0..255); the int8 model input
  // expects signed values (-128..127), so shift the zero point while copying.
  for (int i = 0; i < kInputW * kInputH; i++) {
    input->data.int8[i] = (int8_t)(frame->buf[i] - 128);
  }
  if (interpreter->Invoke() != kTfLiteOk) return -1;
  int8_t person_score = output->data.int8[1];     // index 1 = "person"
  int8_t no_person_score = output->data.int8[0];  // index 0 = "no person"
  return person_score - no_person_score;  // positive = person
}
void setup() {
  Serial.begin(115200);
  setupCamera();
  setupModel();
  pinMode(PIR_PIN, INPUT);
  // Arm the PIR pin as a deep-sleep wake source (used once you add the
  // deep-sleep duty cycle described under "Practical caveats").
  esp_sleep_enable_ext0_wakeup((gpio_num_t)PIR_PIN, 1);
}
void loop() {
  if (digitalRead(PIR_PIN) == HIGH) {
    camera_fb_t* frame = esp_camera_fb_get();
    if (frame) {
      int score = detectPerson(frame);
      Serial.printf("score=%d\n", score);
      if (score > 50) {
        // Person detected: save image, send alert, etc.
        handlePersonDetected(frame);
      }
      esp_camera_fb_return(frame);
    }
  }
  delay(100);
}

Threshold tuning
The score is the difference between "person" and "no person" class outputs. Higher score = more confident person detection.
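If you want an actual probability rather than a raw int8 difference, dequantise the output tensor using its own quantisation parameters (for int8 softmax outputs these are conventionally scale = 1/256 and zero point = -128). A small sketch against the firmware's globals:

// Convert the raw int8 "person" output into a float probability in [0, 1].
float personProbability() {
  float scale = output->params.scale;
  int zero_point = output->params.zero_point;
  return (output->data.int8[1] - zero_point) * scale;  // index 1 = "person"
}

Under that convention, the firmware's score > 50 corresponds to a person probability of roughly 0.6, and the 0.7 in the pipeline diagram to a score of about 100.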
Calibration: capture 100 frames in your specific deployment (lighting, camera angle, scene) and label them. Find the score threshold that gives you the right precision-recall tradeoff. For most security applications, prefer high precision (few false alarms) over high recall (catch every person). A threshold of 50 is a starting point; tune from there.
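One way to run that sweep once you have labelled scores. This is a host-side sketch, not MCU code, and the sample data is hypothetical:

#include <cstdio>
#include <vector>

// Sweep candidate thresholds over labelled calibration scores and print
// precision/recall so you can pick the tradeoff you want.
int main() {
  struct Sample { int score; bool person; };
  // Hypothetical data: detectPerson() score plus the hand-assigned label.
  std::vector<Sample> samples = {
      {120, true}, {85, true}, {30, false}, {64, true}, {-40, false},
      {55, false}, {110, true}, {-10, false}, {70, true}, {20, false},
  };
  for (int threshold = 0; threshold <= 120; threshold += 10) {
    int tp = 0, fp = 0, fn = 0;
    for (const auto& s : samples) {
      bool predicted = s.score > threshold;
      if (predicted && s.person) tp++;
      if (predicted && !s.person) fp++;
      if (!predicted && s.person) fn++;
    }
    double precision = (tp + fp) ? (double)tp / (tp + fp) : 1.0;
    double recall = (tp + fn) ? (double)tp / (tp + fn) : 0.0;
    printf("threshold %3d: precision %.2f  recall %.2f\n",
           threshold, precision, recall);
  }
}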
Practical caveats
- The model is trained on adults in normal poses. Children, partial bodies, and unusual postures are detected less reliably.
- Low light kills accuracy. The OV2640 is a poor low-light sensor. For night-time use, adding IR illumination and removing the IR-cut filter gives better results.
- Camera position matters. The model expects roughly head-and-shoulders view. Looking straight down or far away reduces accuracy.
- The pre-trained model has biases. It was trained on a specific dataset in which certain demographics are under-represented. For products at scale, consider retraining on representative data or using a paid service like Edge Impulse to fine-tune.
- Running continuously drains the battery. Pair with PIR pre-filtering or a duty cycle (one inference every 30 seconds when idle, faster once something is detected); see the sketch after this list.
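A minimal duty-cycle sketch, assuming the setupCamera()/setupModel()/detectPerson() functions from the firmware above: wake on either the PIR pin or a 30-second timer, run one detection, then drop back into deep sleep (tens of microamps on the ESP32-S3, versus tens of milliamps awake):

// Deep-sleep duty cycle: the chip resets into setup() on every wake,
// runs one detection, re-arms its wake sources, and powers down again.
void setup() {
  Serial.begin(115200);
  setupCamera();
  setupModel();

  camera_fb_t* frame = esp_camera_fb_get();
  if (frame) {
    if (detectPerson(frame) > 50) handlePersonDetected(frame);
    esp_camera_fb_return(frame);
  }

  esp_sleep_enable_ext0_wakeup((gpio_num_t)PIR_PIN, 1);  // PIR wake
  esp_sleep_enable_timer_wakeup(30ULL * 1000 * 1000);    // 30 s idle poll
  esp_deep_sleep_start();                                // never returns
}

void loop() {}  // unreachable: deep sleep restarts the sketch in setup()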
Going further
- Multi-class detection. Replace the binary model with a multi-class one that distinguishes person, vehicle, and animal. This is more commonly done via Edge Impulse than built from scratch.
- Bounding boxes. The binary model only says that a person is present; bounding-box detection says where, which is significantly harder on an MCU but possible with TinyML object detection models.
- Custom training. Use the existing model architecture, retrain on your own data, deploy. Edge Impulse handles this end-to-end.
- Streaming inference. Run inference continuously at 1–2 fps and accumulate confidence over multiple frames before alerting; this suppresses single-frame false positives. A sketch follows this list.
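The multi-frame idea in miniature: smooth the per-frame score with an exponential moving average so that one noisy frame cannot fire an alert but a run of confident frames can. The smoothing factor and trigger level are assumptions to tune, not values from this build:

// Exponential moving average over detection scores.
constexpr float kAlpha = 0.3f;          // per-frame weight (assumed; tune)
constexpr float kTriggerLevel = 50.0f;  // alert when the average stays high

float smoothed_score = 0.0f;

bool updateAndCheck(int frame_score) {
  smoothed_score = kAlpha * frame_score + (1.0f - kAlpha) * smoothed_score;
  return smoothed_score > kTriggerLevel;
}

// In loop(): if (updateAndCheck(detectPerson(frame))) handlePersonDetected(frame);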
Frequently Asked Questions
How accurate is the pre-trained model?
On the test data: ~90% accuracy. In real deployments it varies widely with lighting, camera angle, and subject. Plan to tune the threshold for your specific scene.
Share your thoughts
Worked with this in production and have a story to share, or disagree with a tradeoff? Email us at support@mybytenest.com — we read everything.