Memory Management on Microcontrollers

There is a particular kind of firmware bug that ships happily, passes every test, and then, months later, starts sending units back from the field with symptoms like "device reboots randomly" or "the BLE connection drops after 72 hours of uptime". Nine times out of ten, it is a memory bug: stack corruption, heap fragmentation, or a forgotten volatile that causes a pointer to read stale data.

Memory management on a microcontroller is not an advanced topic you can defer. It is the topic you need to understand before you write your first driver, because every design decision you make either respects memory constraints or creates a latent bug.

The three regions you have to care about

A microcontroller's RAM is typically partitioned into three regions by the linker script:

.data and .bss (static allocation)

Global and static variables live here. .data holds initialised variables (static int x = 5;). .bss holds zero-initialised variables (static int y;). Both are fixed in size and laid out at link time, which means the compiler knows exactly how much RAM they consume. This is the safest region: you cannot overflow it at runtime.

The stack

Local variables, function parameters, and return addresses live on the stack. The stack grows downward (on virtually every mainstream architecture) from a fixed top address. Its size is set in the linker script. If you run out of stack, you crash: silently, in most cases, by overwriting whatever is adjacent in memory.

The heap

Dynamically allocated memory (malloc, new) lives here. The heap grows upward from a fixed bottom address and is managed at runtime by the allocator. On a desktop, the heap is effectively unlimited. On a microcontroller, it is a small, fragile region that can fragment and become unusable even when there is technically free memory available.

On a typical Cortex-M, the layout looks like this:

flowchart TB
    subgraph RAM [RAM layout]
        direction TB
        S[Stack<br/>grows downward<br/>from high memory]
        F[... free space ...]
        H[Heap<br/>grows upward]
        B[.bss<br/>zero-initialised statics]
        D[.data<br/>initialised statics]
    end
    S -.->|if stack overflows<br/>it corrupts these| F
    H -.->|if heap grows too far<br/>it collides with stack| F

RAM layout on a typical Cortex-M. Static data sits at the low end, heap grows up, stack grows down. When they meet, you crash.


Why you should avoid malloc in hard real-time

The general-purpose malloc shipped with your toolchain is a good allocator for desktops, a mediocre allocator for servers, and an actively dangerous choice for a hard real-time microcontroller. There are two reasons.

Unbounded latency

A call to malloc has no upper bound on how long it takes. A typical free-list implementation may walk the list looking for a suitable block, split blocks, coalesce free blocks, and update bookkeeping. On a fragmented heap, this can take milliseconds. If your interrupt handler needs to respond in 100 microseconds, malloc on that path is a bug waiting to happen.

Fragmentation

On a small heap with mixed allocation sizes, fragmentation makes the heap progressively less useful over time. You allocate 100 bytes, free it. Allocate 200, free. Allocate 50, keep. Now your free space is scattered in small fragments, and the next 256-byte allocation fails even though total free memory is 4 KB. The device runs for a week, then stops accepting new connections. You reboot it, which buys another week.

Fragmentation is not always catastrophic — there exist allocators (TLSF, Doug Lea's) designed to minimise it — but it is a risk that scales with uptime, and uptime on embedded devices is measured in months or years.

Static allocation patterns that replace malloc

In production firmware, most teams avoid the heap entirely and use static allocation patterns for everything dynamic. The three patterns you will see most often:

Object pools

Pre-allocate N instances of a struct at compile time. Maintain a free list. When you need one, grab it from the free list. When you are done, return it. No fragmentation, constant-time allocation and free, bounded worst-case behaviour.

#include <stdint.h>

typedef struct message {
    struct message *next;    /* free-list link, unused while allocated */
    uint8_t payload[32];     /* illustrative payload */
} message_t;

#define POOL_SIZE 16
static message_t pool[POOL_SIZE];
static message_t *free_list = NULL;

void pool_init(void) {
    for (int i = 0; i < POOL_SIZE; i++) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

message_t *pool_alloc(void) {
    if (!free_list) return NULL;
    message_t *m = free_list;
    free_list = m->next;
    return m;
}

void pool_free(message_t *m) {
    m->next = free_list;
    free_list = m;
}

This is what most RTOS queues, network buffers, and event systems look like under the hood.

Ring buffers

For streaming data (UART bytes, ADC samples, log messages), a fixed-size circular buffer beats everything else. Power-of-two sizes let you replace modulo with a bitmask, which matters on small parts without hardware divide.
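A minimal sketch of the pattern, assuming a single producer (say, a UART receive ISR) and a single consumer; the rb_* names are illustrative, not from any particular HAL. Free-running head and tail indices make the full/empty checks simple, and the bitmask replaces the modulo:

```c
#include <stdint.h>
#include <stdbool.h>

#define RB_SIZE 64u                 /* must be a power of two */
#define RB_MASK (RB_SIZE - 1u)

static uint8_t rb_buf[RB_SIZE];
static volatile uint32_t rb_head;   /* advanced by the producer (ISR) */
static volatile uint32_t rb_tail;   /* advanced by the consumer (main loop) */

bool rb_put(uint8_t byte) {
    if (rb_head - rb_tail == RB_SIZE) return false;   /* full */
    rb_buf[rb_head & RB_MASK] = byte;                 /* bitmask, not modulo */
    rb_head++;
    return true;
}

bool rb_get(uint8_t *byte) {
    if (rb_head == rb_tail) return false;             /* empty */
    *byte = rb_buf[rb_tail & RB_MASK];
    rb_tail++;
    return true;
}
```

Because head and tail wrap naturally in unsigned arithmetic, there is no need to reserve an empty slot, and each index is written by only one side of the pipe.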

Arena allocators

If you need to allocate objects of varying sizes with a known lifetime, an arena lets you bump a pointer forward for each allocation and free the whole arena at once when you are done. This works well for per-request or per-frame allocations. No fragmentation, no free list, constant time.
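A bump-pointer arena fits in a dozen lines; the arena_* names, the region size, and the 4-byte alignment rounding are assumptions for illustration:

```c
#include <stdint.h>
#include <stddef.h>

#define ARENA_SIZE 1024u
__attribute__((aligned(4)))
static uint8_t arena_buf[ARENA_SIZE];
static size_t arena_used;

void *arena_alloc(size_t n) {
    n = (n + 3u) & ~(size_t)3u;              /* round up to 4-byte alignment */
    if (n > ARENA_SIZE - arena_used) return NULL;
    void *p = &arena_buf[arena_used];
    arena_used += n;                         /* bump the pointer forward */
    return p;
}

void arena_reset(void) {
    arena_used = 0;                          /* frees everything at once */
}
```

Note there is no per-object free: the whole point is that all allocations share one lifetime and are released together by arena_reset.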

Stack sizing: the one thing most projects get wrong

The default stack size in your startup file is probably 2 KB or 4 KB. Nobody changes it until something breaks. When something does break, it breaks in a hard-to-debug way, because a stack overflow on bare metal silently corrupts whatever sits below the stack (often the heap or .bss), and the symptom shows up somewhere unrelated.

How to measure worst-case stack usage

Two techniques, used together:

  1. Static analysis with -fstack-usage. GCC emits a .su file per object file listing the stack size of each function. Tools like puncover aggregate these and build a call graph, telling you the deepest call chain. This gives you a guaranteed upper bound (assuming no indirect calls or recursion).
  2. Stack painting at runtime. Fill the stack with a known pattern at boot, run the system through its paces, then scan the stack to find the high-water mark. This tells you what actually happened as opposed to what could theoretically happen. Most RTOS implementations do this automatically.

If you only have time for one, paint the stack. It will catch 95% of issues and takes ten minutes to set up.
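Stack painting can be sketched as follows. To keep the example self-contained, a static array stands in for the stack region; on target you would paint between the linker-defined stack bottom and the current stack pointer instead:

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_PAINT 0xDEADBEEFu
#define STACK_WORDS 256u    /* region size in words; on target, derived
                               from linker symbols for the stack region */

/* Stand-in for the real stack region so the sketch runs anywhere. */
static uint32_t stack_region[STACK_WORDS];

/* Call once, very early at boot, before the region is in use. */
void stack_paint(void) {
    for (size_t i = 0; i < STACK_WORDS; i++)
        stack_region[i] = STACK_PAINT;
}

/* Scan upward from the bottom: the first unpainted word marks the
   high-water mark. Returns worst-case observed usage in bytes. */
size_t stack_high_water(void) {
    size_t i = 0;
    while (i < STACK_WORDS && stack_region[i] == STACK_PAINT) i++;
    return (STACK_WORDS - i) * sizeof(uint32_t);
}
```

Because the stack grows downward, usage eats the painted pattern from the top; everything still painted at the bottom is headroom you have never touched.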

Interrupts and stack

On Cortex-M, interrupt handlers always run on the main stack (MSP); an RTOS typically gives each thread its own process stack (PSP), which keeps interrupt usage out of the per-task budgets. Every interrupt pushes a hardware frame, and nested interrupts stack on top of each other. Your worst-case calculation for the main stack must include the deepest possible nesting, with each level contributing its hardware frame plus the handler's own stack usage.
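As a worked example, with every number hypothetical: the call-chain figure would come from -fstack-usage, the nesting depth from your NVIC priority configuration.

```c
#include <stddef.h>

#define DEEPEST_CALL_CHAIN 1200u   /* bytes, worst call chain (static analysis) */
#define HW_EXC_FRAME         32u   /* Cortex-M stacked frame, FPU context unused */
#define HANDLER_LOCALS      200u   /* largest handler's own stack usage */
#define MAX_NESTING           3u   /* distinct preemption priority levels */

size_t worst_case_stack(void) {
    /* each nesting level pushes a hardware frame plus handler locals */
    return DEEPEST_CALL_CHAIN
         + MAX_NESTING * (HW_EXC_FRAME + HANDLER_LOCALS);
}
```

With these numbers the budget is 1,896 bytes before any safety margin, which is why a 2 KB default stack is tighter than it looks.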

Memory-mapped hardware: why volatile matters

On a microcontroller, hardware peripherals appear as special memory addresses. Writing to address 0x40020000 might set a GPIO pin high. Reading from 0x40020010 might return the current ADC value. From the C language perspective, these are just pointers to integers — but the compiler has no way of knowing that the value can change between reads without any code writing to it.

If you forget volatile on a peripheral register pointer, the compiler may optimise away reads (reading the value once and caching it in a register) or reorder writes in ways that break timing-sensitive protocols. These bugs are almost always subtle, usually appear under optimisation, and usually disappear when you add printf debugging (which changes the compiler's scheduling).

The rule: any pointer to hardware memory is volatile. Any variable shared between an interrupt handler and main code is volatile (and, if it is wider than the CPU's native word, it also needs protection against torn reads). Vendor CMSIS headers already declare registers correctly; the mistake is usually in hand-rolled abstractions.
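A minimal illustration of both cases. The register address is made up, and tick stands for a counter incremented by an interrupt handler such as SysTick_Handler:

```c
#include <stdint.h>

/* Hardware register: the address is illustrative; real ones come from
   the reference manual or the vendor's CMSIS header. */
#define GPIOA_ODR (*(volatile uint32_t *)0x40020014u)

/* Shared with an ISR: without volatile, the compiler may cache the
   value in a register and a polling loop never sees the update. */
static volatile uint32_t tick;     /* incremented in an ISR */

uint32_t ticks_since(uint32_t start) {
    return tick - start;           /* unsigned wraparound is well defined */
}
```

A delay loop built on ticks_since re-reads tick on every iteration precisely because tick is volatile; drop the qualifier and an optimised build may spin forever on a stale copy.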

DMA and cache coherency

On smaller Cortex-M parts (M0, M3, M4) there is no data cache, and DMA just works. On Cortex-M7 there is a data cache, and DMA becomes a minefield. If the CPU writes to a buffer and then triggers a DMA transmit, the DMA controller may read stale data from main memory while the new data is still in the cache. If DMA writes incoming data to a buffer and then the CPU reads it, the CPU may see stale data from its cache instead of the fresh bytes in memory.

The fix: place DMA buffers in non-cacheable memory regions, or explicitly invalidate/clean the cache around DMA operations. The vendor's HAL usually provides functions for this. Ignoring it produces bugs that manifest only occasionally and only under certain timing conditions.
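In CMSIS-Core for the M7, the maintenance calls are SCB_CleanDCache_by_Addr (before a DMA transmit) and SCB_InvalidateDCache_by_Addr (after a DMA receive). Either way, buffers should be sized and aligned to the 32-byte D-cache line so a clean or invalidate never clips a neighbouring variable. A sketch, with ALIGN_UP as a generic helper rather than a HAL API:

```c
#include <stdint.h>
#include <stddef.h>

#define DCACHE_LINE 32u     /* Cortex-M7 D-cache line size */
#define ALIGN_UP(n, a) (((n) + (a) - 1u) & ~((a) - 1u))

/* 500 requested bytes become a 512-byte buffer on a line boundary,
   so cache maintenance covers exactly this buffer and nothing else. */
__attribute__((aligned(DCACHE_LINE)))
static uint8_t dma_rx_buf[ALIGN_UP(500u, DCACHE_LINE)];

size_t dma_rx_buf_size(void) { return sizeof dma_rx_buf; }
```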

The linker script is your friend

Most embedded engineers treat the linker script as a magical file they never touch. Learning to read it is one of the highest-leverage skills in firmware work. The linker script tells you exactly where everything lives, how big each region is, and what happens when one overflows.

When a new engineer joins the team, the first exercise we run is: open the linker script, identify RAM size, stack size, heap size, flash size. Map a symbol to its location. Add a new section. Once you understand the linker script, memory bugs stop being mysterious.

Frequently Asked Questions

How do I detect a stack overflow at runtime?

The simplest method is to set up an MPU (Memory Protection Unit) region at the bottom of the stack that traps on write. Cortex-M3 and up have an MPU. If you cannot use the MPU, place a known canary value at the stack boundary and check it periodically. If it is corrupted, you overflowed.
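The canary variant can be sketched as below. On target the canary word would sit at the linker-defined stack boundary; here a static word stands in for it, and the value is arbitrary:

```c
#include <stdint.h>

#define STACK_CANARY 0xC0DECAFEu
static volatile uint32_t stack_canary = STACK_CANARY;

/* Check periodically, e.g. from the task that feeds the watchdog. */
int stack_overflowed(void) {
    return stack_canary != STACK_CANARY;
}
```

The check is cheap enough to run every tick; the weakness, compared to an MPU guard, is that you only learn about the overflow after the corruption has already happened.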

When is it OK to use malloc on a microcontroller?

During initialisation, before real-time operation begins, to allocate long-lived buffers whose size is known only at runtime. After that, treat malloc as forbidden. Some teams link against a malloc that always returns NULL (or aborts) to enforce this at link time.
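With the GNU linker, one way to enforce the ban is ld's --wrap feature. A sketch, assuming you build with -Wl,--wrap=malloc so every call site resolves to the wrapper:

```c
#include <stddef.h>

/* With -Wl,--wrap=malloc, calls to malloc() land here instead. */
void *__wrap_malloc(size_t n) {
    (void)n;
    return NULL;    /* or call your assert/abort handler to fail loudly */
}
```

Returning NULL makes every post-init allocation fail visibly in testing; teams that prefer a hard stop call their fault handler here instead.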

How big should my heap be on a small MCU?

Zero, in most cases. If you are not using dynamic allocation, set the heap region to zero in your linker script. The RAM is more valuable as static allocation or stack. If you are using an RTOS, its kernel heap is a separate region from malloc's heap and is configured separately.

Is there a tool that shows me where my RAM is going?

Yes. arm-none-eabi-size on your ELF binary gives a high-level summary. The map file produced at link time shows every symbol and its location; search for the largest ones. puncover produces a web dashboard of this information. For runtime memory analysis, RTT-based tools or SEGGER Ozone are what most teams use.

Share your thoughts

Worked with this in production and have a story to share, or disagree with a tradeoff? Email us at support@mybytenest.com — we read everything.