The smart doorbell market is roughly Ring vs Nest vs everyone else, and all of them want a monthly subscription to keep your own video accessible. The DIY version of the same thing is genuinely competitive: an ESP32-CAM, a momentary button, a small enclosure, and a Telegram bot or your own server. No subscription, no data going to a cloud you don't control, and the parts cost is around $20.
This is different from the motion-activated security camera we built earlier. That one uses PIR to wake on motion. This one fires on a button press, a much narrower trigger, and it runs on continuous wired power, so it can stay online without worrying about a battery. It is also designed to identify visitors quickly, with a single still image plus a short audio clip if you add a microphone, rather than to record general activity.
What we are building
Diagram: the ESP32 captures an 800x600 JPEG from the camera (plus an optional 5-second clip from an I2S microphone) and sends it by HTTPS POST to the Telegram bot API, which delivers the image and caption to your phone; a GPIO HIGH can optionally drive an indoor chime relay.
Press button, capture image, send via Telegram. Optional indoor chime via relay. The whole flow takes 3–5 seconds end-to-end.
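The "send via Telegram" step is a single multipart/form-data POST to the Bot API's `sendPhoto` method at `https://api.telegram.org/bot<token>/sendPhoto`, carrying `chat_id`, `caption`, and the JPEG in the `photo` field. A minimal sketch of building that request body, assuming the JPEG bytes are already in memory (on the ESP32-CAM they would come from the camera frame buffer). The name `buildSendPhotoBody`, the `door.jpg` filename, and the boundary string are our own choices, and the TLS transport is left out so the code compiles on a desktop for testing.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: builds the multipart/form-data body for the Telegram
// Bot API sendPhoto method (fields chat_id, caption, photo). Send it over
// HTTPS to https://api.telegram.org/bot<token>/sendPhoto with the header
//   Content-Type: multipart/form-data; boundary=<boundary>
// The boundary must not appear anywhere inside the JPEG bytes.
std::string buildSendPhotoBody(const std::string& chatId,
                               const std::string& caption,
                               const std::uint8_t* jpeg, std::size_t jpegLen,
                               const std::string& boundary) {
  const std::string dash = "--" + boundary;
  std::string body;
  body.reserve(jpegLen + 512);  // JPEG plus a few hundred bytes of headers

  // Text field: destination chat.
  body += dash + "\r\nContent-Disposition: form-data; name=\"chat_id\"\r\n\r\n";
  body += chatId + "\r\n";

  // Text field: caption shown under the image.
  body += dash + "\r\nContent-Disposition: form-data; name=\"caption\"\r\n\r\n";
  body += caption + "\r\n";

  // File field: the raw JPEG bytes.
  body += dash + "\r\nContent-Disposition: form-data; name=\"photo\"; "
                 "filename=\"door.jpg\"\r\nContent-Type: image/jpeg\r\n\r\n";
  body.append(reinterpret_cast<const char*>(jpeg), jpegLen);
  body += "\r\n" + dash + "--\r\n";  // closing boundary
  return body;
}
```

Building the whole body in one string briefly holds two copies of the image; writing the three parts straight to the TLS client (for example `WiFiClientSecure` in the ESP32 Arduino core) is the lower-memory alternative.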
Hardware
- ESP32-CAM (AI Thinker) — $8
- FTDI USB-TTL adapter for flashing — $5 (one-time)
- Momentary push button (waterproof if outdoor) — $3
- 5V 2A power supply (run wired from inside) — $5
- 3D-printed or off-the-shelf doorbell enclosure — $5–15
- Optional: INMP441 I2S microphone for audio — $4
- Optional: relay module + indoor chime/buzzer — $3
About $25 base, $30 with audio. Outdoor enclosure quality matters more than the electronics — the ESP32-CAM is fine in moderate weather inside a sealed box; direct rain or freeze-thaw cycles will kill it within a year.
Why wired power, not battery
The motion-activated security camera in our earlier project ran on 18650 cells because PIR triggers are infrequent. A doorbell button press is also infrequent, but a doorbell needs to respond fast — within 1–2 seconds — and that means staying connected to WiFi continuously. Continuous WiFi is around 80–120 mA on the ESP32-CAM, which kills any reasonable battery in a day or two.
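The "day or two" estimate is just capacity divided by average current. A sketch of the arithmetic, assuming a single 3000 mAh 18650 and an ideal regulator (both our assumptions; real cells deliver less than their rating):

```cpp
// Back-of-envelope battery runtime for an always-connected ESP32-CAM.
// Assumptions (ours): cells wired in parallel, ideal regulator, and the full
// rated capacity is usable. Real-world runtime will be shorter.
constexpr double runtimeHours(double cellCapacityMah, int cellsInParallel,
                              double averageCurrentMa) {
  return cellCapacityMah * cellsInParallel / averageCurrentMa;
}

// Assumed 3000 mAh 18650 at the 80-120 mA continuous-WiFi draw quoted above.
constexpr double bestCaseHours = runtimeHours(3000.0, 1, 80.0);    // 37.5 h
constexpr double worstCaseHours = runtimeHours(3000.0, 1, 120.0);  // 25 h
```

Even two cells in parallel only reach 50 to 75 hours at the same draw, which is why this build goes wired.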
Most existing doorbells already have wired power for the chime — usually 16–24V AC from a transformer. You can convert that to 5V DC with a small AC-DC module ($3) and reuse the existing wire run. If you have to run new wire, USB-C cable through a hidden conduit works.
What goes wrong
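- WiFi range. Doorbells live near doors, which are often the worst spot for WiFi. Add an external antenna (the AI Thinker board has a u.FL connector, but you have to move a small 0 Ω resistor on the board to switch it over from the onboard antenna) or place a WiFi extender nearby.
- Mechanical button bounce. A 3-second debounce window in the ISR (count the first edge, ignore every edge for the next 3 seconds) absorbs contact bounce, which lasts only milliseconds, and also matches typical visitor behaviour (no normal person mashes the button repeatedly). Shorten it for a hyperactive household.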
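The 3-second window can be written as a lockout check on a millisecond counter. A minimal sketch, assuming a `millis()`-style 32-bit counter; `acceptPress` and `kDoorbellLockoutMs` are our own names, and the ISR wiring appears only as a comment because it needs the ESP32 Arduino core:

```cpp
#include <cstdint>

// Lockout-style debounce: accept a press only if at least lockoutMs has
// passed since the last accepted press. The unsigned 32-bit subtraction stays
// correct when millis() wraps around (about every 49.7 days).
constexpr bool acceptPress(std::uint32_t nowMs, std::uint32_t lastAcceptedMs,
                           bool anyAccepted, std::uint32_t lockoutMs) {
  return !anyAccepted ||
         static_cast<std::uint32_t>(nowMs - lastAcceptedMs) >= lockoutMs;
}

constexpr std::uint32_t kDoorbellLockoutMs = 3000;  // 3 s window

// On the ESP32 (sketch only, not compiled here):
//   volatile bool pressed = false, anyPress = false;
//   volatile uint32_t lastPress = 0;
//   void IRAM_ATTR onButton() {
//     uint32_t now = millis();
//     if (acceptPress(now, lastPress, anyPress, kDoorbellLockoutMs)) {
//       lastPress = now; anyPress = true; pressed = true;
//     }
//   }
//   // loop() checks `pressed`, clears it, then captures and uploads.
```

Keeping the ISR down to a timestamp check and a flag, with the capture and upload in `loop()`, matters because an interrupt handler should not block on the camera or WiFi.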
- Cold weather. The ESP32-CAM is rated to −40°C, but the lithium battery in any battery-powered version is not. Wired builds are fine.
- The bot's response delay. Telegram is fast (sub-second) but only after WiFi is connected. From cold start the chain is: button press → WiFi authenticate → TLS handshake → Telegram POST → notification. Total around 3 seconds in good conditions, 6–8 in poor ones.
Going further
- Two-way audio. Add an INMP441 microphone and a small speaker. Stream voice via WebRTC.
- Face recognition for known visitors. The ESP32-S3 (different chip, but available in CAM-board variants) has enough horsepower for face detection via TensorFlow Lite Micro.
- Local recording. Add a microSD card; save every visitor image with timestamp.
- Integration with smart home. Replace the Telegram bot with MQTT to Home Assistant.
- Better camera. The OV2640 is mediocre in low light. Swap for an OV5640.
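For the local-recording idea, a timestamped filename that sorts chronologically keeps the card easy to browse. A sketch, assuming the clock has been set (for example via NTP) before the first capture; `visitorFilename` and the `/visitors/YYYYMMDD_HHMMSS.jpg` layout are our own conventions:

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: builds a sortable microSD path such as
// /visitors/20240131_142530.jpg from a broken-down local time.
// Returns an empty string if the formatted path does not fit the buffer.
std::string visitorFilename(const std::tm& t) {
  char buf[40];
  std::size_t n =
      std::strftime(buf, sizeof(buf), "/visitors/%Y%m%d_%H%M%S.jpg", &t);
  return std::string(buf, n);
}
```

On the ESP32 the `std::tm` would typically come from `getLocalTime()` after `configTime()` has synced with an NTP server. With a 3-second debounce window, two captures cannot share the same second, so names stay unique.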
Frequently Asked Questions
How is this different from a Ring doorbell?
Cheaper (one-time $25 vs $100–200 + $5/mo subscription), fully owned (your video, your bot, your storage), but less polished — no slick app, no community-detected porch pirates, no facial-recognition alerts out of the box. Tradeoff is yours.
Can I use this without Telegram?
Yes. Replace the Telegram POST with an HTTP POST to your own webhook, an MQTT publish, or a signal-cli bridge. The image is just a JPEG; any service that accepts uploads will do.
How long does the SD card last for recording?
A full-quality JPEG from the OV2640 at 800×600 is roughly 50–100 KB. With one capture per visitor and about 10 visitors a day, that is at most about 1 MB per day, so a 32 GB card holds decades of doorbell history.
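The retention figure is easy to check. A sketch of the arithmetic in decimal units (1 GB = 1,000,000 KB), ignoring filesystem overhead, which is our simplification:

```cpp
#include <cstdint>

// Back-of-envelope microSD retention in days, decimal units, no FS overhead.
constexpr std::uint64_t daysOfHistory(std::uint64_t cardGb,
                                      std::uint64_t kbPerImage,
                                      std::uint64_t capturesPerDay) {
  return (cardGb * 1000000ULL) / (kbPerImage * capturesPerDay);
}

constexpr std::uint64_t worstCaseDays = daysOfHistory(32, 100, 10);  // 32000
constexpr std::uint64_t bestCaseDays = daysOfHistory(32, 50, 10);    // 64000
```

Even the worst case, 32,000 days, is about 87 years, so capacity is never the limiting factor for a doorbell archive.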
Get the complete project package
The article above shows the core firmware and the principles behind it. The complete project package — assembled, tested, and ready to flash — is available by email request. We send it manually, and we read every request.
- Complete Arduino sketch (.ino) with full error handling
- List of required libraries with version numbers
- Printable wiring diagram (PDF)
- Bill of materials with current part numbers
- Build guide and troubleshooting tips
- Configuration template (WiFi, MQTT, etc.)
Share your thoughts
Worked with this in production and have a story to share, or disagree with a tradeoff? Email us at support@mybytenest.com — we read everything.