DMA Explained: Why Your CPU Should Not Move Bytes

If you have ever looked at the CPU usage of an audio firmware or a high-bandwidth sensor system and watched it sit at 80% just doing memcpy, the likely culprit is missing DMA. A microcontroller without DMA is one that wastes most of its compute on shuffling bytes between memory and peripherals — a job that can be offloaded entirely to dedicated hardware.

This article is the practical introduction. What DMA is. When to reach for it. The patterns that work and the ones that produce mysterious crashes. Real code for the most common cases.

What DMA actually is

Direct Memory Access is a peripheral whose job is to move data between two memory locations without involving the CPU. The CPU configures it once with source, destination, length, and trigger, and then the DMA peripheral does the work autonomously while the CPU runs other code or sleeps.

flowchart LR
    subgraph Without_DMA [Without DMA]
        CPU1[CPU] -->|busy reading byte<br/>then writing byte<br/>repeat 1000 times| Peripheral1[Peripheral]
        CPU1 --> Memory1[(Memory)]
    end
    subgraph With_DMA [With DMA]
        CPU2[CPU] -.->|configure once| DMA[DMA<br/>controller]
        DMA -->|moves data autonomously| Peripheral2[Peripheral]
        DMA --> Memory2[(Memory)]
        CPU2 -->|free to do other work<br/>or sleep| OtherWork[Other work]
    end

Without DMA, the CPU is the bottleneck for every byte. With DMA, the CPU configures a transfer and walks away.

The CPU still has to set up the transfer (typically a few register writes) and react to its completion (typically an interrupt). That fixed cost is amortised across the entire transfer. For a 1024-byte transfer, the CPU does on the order of 10 instructions of work instead of roughly 3000 (a load, a store, and loop overhead for every byte). That difference matters.

When DMA is the right answer

ADC sampling at high rates

Reading a sample at 10 kHz means an ADC interrupt every 100 µs. Each interrupt has overhead (40–100 cycles), so 10 kHz costs at most 1 million cycles per second: 0.5% of a 200 MHz core and about 6% of a 16 MHz one. Push the rate to 100 kHz and the same 16 MHz core spends roughly 25% to 60% of its cycles on interrupt overhead alone, before any processing. DMA + circular buffer: the ADC writes samples directly into a buffer, and the CPU processes batches.
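
The interrupt budget for a given sample rate can be estimated with simple arithmetic. The sketch below assumes a fixed per-interrupt overhead and ignores the work done inside the handler; `irq_load_percent` is a hypothetical helper, not part of any vendor library.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: percentage of CPU cycles consumed by interrupt
// entry/exit overhead alone, ignoring the work done inside the handler.
static double irq_load_percent(uint32_t irq_rate_hz,
                               uint32_t cycles_per_irq,
                               uint32_t cpu_hz)
{
    assert(cpu_hz > 0);
    return 100.0 * (double)irq_rate_hz * (double)cycles_per_irq / (double)cpu_hz;
}
```

With 100 cycles per interrupt, `irq_load_percent(10000, 100, 16000000)` comes to 6.25 and `irq_load_percent(100000, 100, 16000000)` comes to 62.5.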

UART transmit and receive

Sending a 1 KB block over UART at 115200 baud takes about 87 ms (1000 bytes at 10 bits per byte with 8N1 framing). Without DMA, the CPU is interrupted for every byte. With DMA, the CPU configures the transfer once and is interrupted once at completion.
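
The transfer time follows directly from the baud rate and the frame format. A minimal sketch, assuming every byte is sent back to back with no idle gaps; `uart_transfer_ms` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: time on the wire for `bytes` bytes, in milliseconds.
// bits_per_frame is 10 for 8N1 (start bit + 8 data bits + stop bit).
static double uart_transfer_ms(uint32_t bytes, uint32_t baud,
                               uint32_t bits_per_frame)
{
    assert(baud > 0);
    return 1000.0 * (double)bytes * (double)bits_per_frame / (double)baud;
}
```

For 1000 bytes at 115200 baud with 8N1 this gives about 86.8 ms. A per-byte interrupt scheme takes 1000 interrupts in that window; a DMA transfer takes one.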

SPI for displays and SD cards

Pushing a frame to a 320×240 ILI9341 display at 16 bits per pixel is 153,600 bytes per frame. Without DMA, on a 100 MHz chip, you barely get 30 fps because each byte requires CPU intervention. With DMA, the CPU is free for game logic, UI, or sensor reading while the display update happens in parallel.

I2S audio

Playing 44.1 kHz stereo 16-bit audio is 176.4 KB/s (44,100 samples × 2 channels × 2 bytes). CPU-driven audio at that rate is impractical on most microcontrollers. DMA-driven audio is routine.

Memory-to-memory copies

Some DMA peripherals can do memory-to-memory transfers. Useful for copying large structures, but offers smaller wins than peripheral transfers because the CPU's memcpy is already fast.

DMA modes

Three patterns that cover most real use cases:

Single-shot

Configure source, destination, length. DMA performs N transfers and stops, raising a completion interrupt. The CPU receives the interrupt and either sets up the next transfer or processes the data.

Circular

DMA wraps around when it reaches the end of the buffer, continuing forever. Used for continuous data streams where the CPU consumes data in batches rather than as each sample arrives. Usually combined with half-buffer interrupts (one at 50% and one at 100% of buffer fill) so the CPU can process one half while the DMA fills the other.
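
The wrap-around and the two notification points can be modelled on a host machine without any hardware. The sketch below is only a model of the behaviour, not driver code; `circ_dma_t` and `circ_dma_push` are hypothetical names, and the buffer is kept tiny so the sequence is easy to trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 8  // small on purpose so the behaviour is easy to trace

// Host-side model of a circular DMA channel; all names are hypothetical.
typedef struct {
    uint16_t buf[BUF_LEN];
    size_t   write_idx;     // next slot the "DMA" will fill
    unsigned half_events;   // times the half-transfer point was reached
    unsigned full_events;   // times the end of the buffer was reached
} circ_dma_t;

// Model one peripheral request: store a sample, advance, raise events, wrap.
static void circ_dma_push(circ_dma_t *d, uint16_t sample)
{
    assert(d != NULL);
    d->buf[d->write_idx++] = sample;
    if (d->write_idx == BUF_LEN / 2) {
        d->half_events++;          // first half is now safe to process
    } else if (d->write_idx == BUF_LEN) {
        d->full_events++;          // second half is now safe to process
        d->write_idx = 0;          // wrap around and keep going
    }
}
```

Pushing 20 samples into a zero-initialised `circ_dma_t` raises three half events and two full events, and leaves the write index at 4. On real hardware the two events correspond to the half-transfer and transfer-complete interrupts.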

Double-buffer / ping-pong

Two buffers; DMA fills one while the CPU processes the other; they swap when the DMA finishes one. As long as the CPU finishes its buffer before the next swap, it never reads data the DMA is currently writing.
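
The ownership hand-off in ping-pong mode can be sketched as a small host-side model. This is an illustration of the swap logic only, with hypothetical names (`pingpong_t`, `pingpong_push`); on real hardware the swap is done by the DMA controller, not by software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PP_LEN 4  // samples per buffer, kept small for illustration

// Host-side model of double buffering; all names are hypothetical.
typedef struct {
    uint16_t bufs[2][PP_LEN];
    int      dma_buf;   // index of the buffer the "DMA" is filling (0 or 1)
    size_t   fill;      // samples written into that buffer so far
} pingpong_t;

// Store one sample. Returns a pointer to the buffer that just became
// complete (now owned by the CPU), or NULL if no buffer finished.
static const uint16_t *pingpong_push(pingpong_t *p, uint16_t sample)
{
    assert(p != NULL);
    p->bufs[p->dma_buf][p->fill++] = sample;
    if (p->fill < PP_LEN) {
        return NULL;
    }
    const uint16_t *done = p->bufs[p->dma_buf];
    p->dma_buf ^= 1;   // the DMA moves on to the other buffer
    p->fill = 0;
    return done;
}
```

The returned pointer stays valid for the CPU only until the other buffer fills, which is the deadline the processing code has to meet.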

Code: ADC + DMA on STM32

Continuous sampling of an ADC pin into a circular buffer. The CPU never reads the ADC directly.

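#include <stdbool.h>
#include <stdint.h>
#include "stm32f4xx_hal.h"  // assumption: STM32F4 part; use your family's HAL header

#define BUFFER_SIZE 1024

extern ADC_HandleTypeDef hadc1;     // ADC handle, defined with the omitted ADC setup
static DMA_HandleTypeDef hdma_adc;  // must outlive setup: the HAL uses it in the IRQ handler
static uint16_t adc_buffer[BUFFER_SIZE];
static volatile bool half_ready = false;
static volatile bool full_ready = false;

void setup_adc_dma(void) {
    // Enable clocks for ADC and DMA (GPIO setup belongs with the ADC setup)
    __HAL_RCC_ADC1_CLK_ENABLE();
    __HAL_RCC_DMA2_CLK_ENABLE();

    // Configure ADC1 for continuous conversion mode, single channel,
    // with DMAContinuousRequests enabled (HAL setup omitted for space)

    // Configure DMA: peripheral to memory, circular, 16-bit
    hdma_adc.Instance = DMA2_Stream0;
    hdma_adc.Init.Channel = DMA_CHANNEL_0;
    hdma_adc.Init.Direction = DMA_PERIPH_TO_MEMORY;
    hdma_adc.Init.PeriphInc = DMA_PINC_DISABLE;
    hdma_adc.Init.MemInc = DMA_MINC_ENABLE;
    hdma_adc.Init.PeriphDataAlignment = DMA_PDATAALIGN_HALFWORD;
    hdma_adc.Init.MemDataAlignment = DMA_MDATAALIGN_HALFWORD;
    hdma_adc.Init.Mode = DMA_CIRCULAR;
    hdma_adc.Init.Priority = DMA_PRIORITY_HIGH;
    hdma_adc.Init.FIFOMode = DMA_FIFOMODE_DISABLE;
    if (HAL_DMA_Init(&hdma_adc) != HAL_OK) {
        return;  // handle the error as your application requires
    }

    // Attach the DMA handle to the ADC handle so HAL_ADC_Start_DMA can use it
    __HAL_LINKDMA(&hadc1, DMA_Handle, hdma_adc);

    // Enable the stream interrupt so the half/full callbacks fire
    HAL_NVIC_SetPriority(DMA2_Stream0_IRQn, 5, 0);
    HAL_NVIC_EnableIRQ(DMA2_Stream0_IRQn);

    // Start DMA transfer; ADC will trigger DMA on each conversion
    HAL_ADC_Start_DMA(&hadc1, (uint32_t*)adc_buffer, BUFFER_SIZE);
}

// Route the stream interrupt into the HAL, which calls the callbacks below
void DMA2_Stream0_IRQHandler(void) {
    HAL_DMA_IRQHandler(&hdma_adc);
}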

// Half-transfer complete: process first half
void HAL_ADC_ConvHalfCpltCallback(ADC_HandleTypeDef* hadc) {
    half_ready = true;
}

// Full transfer complete: process second half
void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef* hadc) {
    full_ready = true;
}

void loop(void) {
    if (half_ready) {
        // Clear first so an interrupt that arrives during processing is not lost
        half_ready = false;
        process_samples(&adc_buffer[0], BUFFER_SIZE / 2);
    }
    if (full_ready) {
        full_ready = false;
        process_samples(&adc_buffer[BUFFER_SIZE / 2], BUFFER_SIZE / 2);
    }
}

The ADC samples continuously at whatever rate the timer/clock is configured for; DMA writes each sample to adc_buffer; halfway and full-buffer interrupts notify the main loop to process. The CPU is free between processing batches.
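
Two boolean flags cannot tell the main loop that it has fallen behind and the DMA has started overwriting data it has not processed yet. One possible refinement, sketched here under the assumption that 32-bit reads and writes are atomic (true on Cortex-M), is to count completed halves instead. The names `halves_produced`, `on_half_done`, and `claim_half` are hypothetical and not part of the STM32 HAL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical bookkeeping, not part of the STM32 HAL.
static volatile uint32_t halves_produced;  // written only from the callbacks
static uint32_t halves_consumed;           // touched only by the main loop

// Call from both the half-complete and the full-complete callback.
static void on_half_done(void)
{
    halves_produced = halves_produced + 1;
}

// Call from the main loop. Returns 0 if nothing is pending, 1 if one half
// is ready (its index, 0 or 1, is stored in *half_index), or -1 if the DMA
// has lapped the CPU and samples were lost.
static int claim_half(unsigned *half_index)
{
    assert(half_index != NULL);
    uint32_t pending = halves_produced - halves_consumed;
    if (pending == 0) {
        return 0;
    }
    if (pending > 1) {
        halves_consumed = halves_produced;  // resynchronise after the overrun
        return -1;
    }
    *half_index = halves_consumed & 1u;     // 0 = first half, 1 = second half
    halves_consumed++;
    return 1;
}
```

A return value of -1 is the signal to enlarge the buffer or shorten the processing; without the counter that condition passes silently.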

Cache coherency on Cortex-M7

The Cortex-M7 introduced data caches. They make the CPU faster but introduce a class of bug that does not exist on Cortex-M0 to M4: the cache and main memory can disagree about what a buffer contains.

Two scenarios where this bites:

CPU writes, DMA reads

You fill a buffer in code and then start a DMA transmit. The CPU's writes may still be sitting in the data cache, not yet flushed to main memory. The DMA reads from main memory and gets stale data.

Fix: clean (write back) the cache before starting the DMA.

SCB_CleanDCache_by_Addr((uint32_t*)tx_buffer, sizeof(tx_buffer));
start_dma_transmit();

DMA writes, CPU reads

DMA fills a buffer (e.g. UART receive). The CPU then reads the buffer, but the cache may still contain old data from before the DMA wrote. The CPU sees stale values.

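Fix: invalidate the cache for that buffer after the DMA has finished and before the CPU reads it. Align the buffer to the 32-byte cache line and size it in multiples of 32 bytes; otherwise the invalidate can discard unrelated data that shares a cache line with the buffer.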

SCB_InvalidateDCache_by_Addr((uint32_t*)rx_buffer, sizeof(rx_buffer));
process_received_data(rx_buffer);
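
Cache maintenance works on whole 32-byte lines on the Cortex-M7, so a DMA buffer is safest when it starts on a line boundary and spans whole lines. The sketch below shows one way to declare such a buffer in C11; `CACHE_ALIGN_UP` and `is_cache_aligned` are hypothetical helpers, and the 100-byte payload size is only an example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  // Cortex-M7 data cache line size in bytes

// Round a byte count up to a whole number of cache lines.
#define CACHE_ALIGN_UP(n) \
    (((n) + (CACHE_LINE - 1u)) & ~(size_t)(CACHE_LINE - 1u))

// Starts on a line boundary and occupies whole lines, so cache maintenance
// on this buffer cannot touch neighbouring variables.
static _Alignas(32) uint8_t rx_buffer[CACHE_ALIGN_UP(100)];

static_assert(sizeof(rx_buffer) % CACHE_LINE == 0,
              "rx_buffer must span whole cache lines");

// Runtime check for buffers that arrive through a pointer.
static int is_cache_aligned(const void *p, size_t len)
{
    return ((uintptr_t)p % CACHE_LINE == 0) && (len % CACHE_LINE == 0);
}
```

Here a 100-byte payload is rounded up to a 128-byte buffer. Passing `sizeof(rx_buffer)` to the clean or invalidate call then covers exactly the lines the buffer owns.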

Easier alternative: place DMA buffers in non-cacheable memory (a region configured as Normal, non-cacheable in the MPU). The buffers are slightly slower for the CPU to access but the cache coherency problem disappears.

Cortex-M0 to M4 cores have no built-in data cache, so on most of those chips none of this applies. If you eventually move to a Cortex-M7 (STM32H7, i.MX RT1062), expect to spend an afternoon debugging this exact issue once.

Things that quietly break

  • Buffer in stack memory. DMA configured with a buffer that lives on the stack of a function that has returned. The buffer is now part of the unused stack and is overwritten by subsequent function calls. Use static or global buffers, not stack.
  • Buffer not aligned. Some DMA controllers require the buffer to be aligned to its element size (32-bit transfers need 4-byte alignment). Misalignment either fails silently or causes a hard fault, depending on the chip.
  • DMA channel conflict. Two peripherals trying to use the same DMA channel. The chip's reference manual lists which channels can serve which peripherals; check before assigning.
  • Forgetting to clear the transfer-complete flag. The flag stays set; subsequent transfers never trigger the "done" callback because the chip thinks it is already done.
  • Length too short. The DMA stops at the configured length but the peripheral keeps generating data. Bytes are lost. Check that your buffer matches the expected data rate times the processing interval.
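
Two of the checks in the list above, buffer length and buffer alignment, are simple enough to put into code. A minimal sketch with hypothetical helper names (`min_buffer_samples`, `dma_aligned`); it assumes a constant sample rate and a known worst-case interval between processing passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: minimum number of samples a buffer must hold so that
// nothing is lost while the CPU is away for `interval_us` microseconds.
static uint32_t min_buffer_samples(uint32_t sample_rate_hz, uint32_t interval_us)
{
    // Round up: a partially elapsed sample period still produces a sample.
    uint64_t n = ((uint64_t)sample_rate_hz * interval_us + 999999u) / 1000000u;
    return (uint32_t)n;
}

// True if `p` is aligned to `elem_size`, which must be a power of two
// (1, 2 or 4 for typical DMA transfer widths).
static bool dma_aligned(const void *p, size_t elem_size)
{
    assert(elem_size != 0 && (elem_size & (elem_size - 1)) == 0);
    return ((uintptr_t)p & (elem_size - 1)) == 0;
}
```

At 44.1 kHz with a 10 ms processing interval the minimum is 441 samples. In circular mode with half-buffer interrupts each half has to cover the interval, so the whole buffer needs at least twice that.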

Frequently Asked Questions

Does ESP32 have DMA?

Yes, with caveats. The ESP32 classic has limited DMA support compared to a typical STM32 — mainly for SPI, I2S, and UART. The ESP32-S3 expanded this significantly with general-purpose DMA. The Arduino ESP32 core hides most of the configuration; ESP-IDF exposes it directly.

Can I do memory-to-memory DMA on Cortex-M0?

Many of the smallest Cortex-M0 chips do not have DMA at all. Larger Cortex-M0+ chips (STM32L0, SAMD21, RP2040) do have DMA, and some implementations support memory-to-memory transfers. Check the reference manual for your specific chip.

Is DMA always faster than the CPU?

For peripheral I/O, almost always yes. For pure memory copies, the CPU's memcpy is often comparable to DMA because both are limited by memory bandwidth. The win for DMA is in freeing the CPU to do other work in parallel, not raw transfer rate.

How do I debug DMA?

The two questions to ask: did the transfer happen at all (check the transfer-complete flag), and did the data arrive correctly (inspect the buffer with a debugger). Logic analyzers help when DMA drives a peripheral interface (SPI, UART) — you can see whether the bytes went out as expected.
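
One common aid when inspecting a receive buffer, offered here as a suggestion rather than something specific to any chip, is to pre-fill it with a recognisable pattern before starting the transfer. Whatever still holds the pattern afterwards was never written. `SENTINEL`, `fill_sentinel`, and `count_written` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENTINEL 0xA5A5u  // arbitrary pattern, unlikely in real samples

// Fill the buffer before starting the transfer.
static void fill_sentinel(uint16_t *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++) {
        buf[i] = SENTINEL;
    }
}

// Number of leading elements that were overwritten, judged by the last
// element that no longer holds the sentinel. Real data that happens to
// equal the sentinel at the very end would be under-counted.
static size_t count_written(const uint16_t *buf, size_t n)
{
    assert(buf != NULL);
    size_t i = n;
    while (i > 0 && buf[i - 1] == SENTINEL) {
        i--;
    }
    return i;
}
```

A count of zero after the transfer should have finished means the DMA never wrote at all; a count smaller than the expected length points at a length or trigger problem.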

Can DMA cause power consumption to increase?

Slightly. The DMA controller has its own clock and small power draw when active. The savings from CPU sleep usually outweigh this, but on extremely low-power designs it is worth measuring. Disable DMA clocks when not in use.
