Cache (in)coherency
Rationale
The world of microcontrollers was peaceful and predictable, until someone introduced advanced interconnect buses. Unhappy with that, someone else introduced caches.
Basically, “modern” microcontrollers have a bunch of internal buses that can be switched or routed. The CPU core has a bunch of buses, and other buses interconnect peripherals.
- Some peripherals are bus slaves, like RAM
- Some peripherals are bus masters, like DMA controllers
As, historically, accessing generic RAM through a bus has been slower than accessing special purpose RAM “tightly coupled” to the CPU, someone thought a cache, sitting between the CPU and memory, could improve this.
When reading, the cache gets populated on the first read, serving subsequent reads
When writing, there are two strategies:
- Write-through: data is written to the cache (allocating a line on a write miss or not: write-allocate vs. no-write-allocate) and to the memory. This last operation can be deferred (write-allocate only).
- Write-back: data is written to the cache only. To write to memory, the cache needs to be “flushed”. When a cache line holds data that has not yet been written to memory, it is said to be “dirty”, so another word for this operation is to “clean” the cache
When there is more than one bus master, be it another core or a DMA controller, things can get out of sync quickly.
CPU ←→ Cache ←→ Memory ←→ DMA
CPU writes, DMA reads:
CPU → Cache → Memory → DMA
- Write-through cache:
  - no-write-allocate: no problem, Tx just works
  - write-allocate: the CPU may miss the DMA marking the descriptor as available, as the next CPU read will be served from the cache → ”No descriptors available” (see DMA writes, CPU reads below)
- Write-back cache: as CPU-written data can still be in the cache, the DMA controller may read stale data. The cache needs to be flushed/cleaned before the DMA transfer starts
DMA writes, CPU reads:
CPU ← Cache ← Memory ← DMA
- as the memory space where the DMA controller writes can be cached, as this controller writes to memory, data in the cache gets stale, it contains data from a previous CPU read that is no longer valid. The cache needs to be invalidated when the DMA controller finishes, before the CPU starts reading.
- This is a simplification: things are like that only if data is FULLY aligned to a cache line and not shared with anything else. See Appendix C
Objective
We’d like to inject Mongoose into existing projects which may already have the MPU and I/D caching set up. Ideally, if the wizard just drops the “mongoose/” directory into such a project, it should just work. The exact mechanism is to be decided; perhaps it could be driven by a preprocessor definition in mongoose_config.h, but ideally it should just work without any extra manual definition.
Therefore, we do not control the MPU / caching settings of the project, but need to adapt to them.
Possible Strategies
As can be imagined, the above is not free. Caches work by holding lines of data, so invalidating an area requires iterating over every line involved, and areas should be aligned to the cache line size, too. The same happens for flushing operations. Working on the whole cache at once is a big penalty for everything else running, and if done frequently it is even worse than disabling the whole cache (we not only need to read/write memory anyway, but also act on the cache…)
Strategy #1: Avoid the cache
Depending on the internal buses, some memory sections can be non-cached; that is, the bus switch/matrix connects memory to the CPU bypassing the cache. We can place our buffers in those sections and relax. This is by far the best option, as it does not require any additional actions and doesn’t fiddle with hardware it doesn’t need to (the cache is a man in the middle no one asked for; it doesn’t serve any purpose here, particularly in our architecture, where DMA-written memory goes to a queue and is later copied to a buffer before being processed; or is written once and sent, in the other direction)
- Memory area: Unfortunately, Cube doesn’t seem to define an area for this purpose, so our efforts would fail when people liberally cut & paste. ST linker files have a single RAM block defined.
- Absolute memory positioning: Ugly, but should work. However, there’s a major caveat: there is no way to place a block of memory at an absolute location with GCC… it is possible with Clang, and even with Keil or IAR, but not with GCC. We can place a pointer, but to turn that into a block, we need to play with the linker file.
To reinforce this choice, there’s also the fact that in some cases the ETH-DMA is unable to access some RAM sections, so we need to craft our linker files to map to usable sections. We’ve also done things like this for NXP iMXRTs in the past.
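For illustration, a minimal GNU ld fragment of what “playing with the linker file” could look like: an output section pinned at an explicit address inside a non-cached region. The address, region name, and `NOLOAD` choice are assumptions (buffers need no load image); real scripts, like the recipes below, may instead use `AT > flash`.

```c
/* GNU ld fragment: pin .eth_ram at an absolute address (illustrative) */
.eth_ram 0x20000000 (NOLOAD) : {
  . = ALIGN(32);          /* align to the cache line size */
  *(.eth_ram .eth_ram*)
} > dtcm
```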
Strategy #2: Mark as non-cacheable
If the processor has an MPU, some areas of memory can be given specific attributes. We can mark them so the cache will not hold data coming from those locations. The problem with this is… that there are few MPU regions, and who are we to decide how our customers will use their MPU? Even though most won’t care, those who do need to be aware of what we’re doing.
The processor may not have an MPU, or an RTOS might want to make better use of its few regions… or… In fact, the Arm Cortex-M33 does not have a standard way to work this out, as caches are vendor extensions.
This seems to be ST’s preferred way. Their lwIP drivers work with pbufs allocated and managed by lwIP; aligning those to cache lines would probably be overkill.
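For reference, marking a region non-cacheable on an Armv7-M MPU (F7/H7 class; the Cortex-M33 MPU works differently) might look like the following CMSIS-style register-configuration sketch. The region number, base address, and size are illustrative; TEX=1, C=0, B=0 makes the region normal, non-cacheable memory.

```c
#include "mpu_armv7.h"  /* CMSIS MPU helpers for Armv7-M */

static void mpu_noncacheable_region(void) {
  ARM_MPU_Disable();
  /* Region 0: 32 KB at 0x30000000, full access, non-cacheable, non-bufferable */
  ARM_MPU_SetRegion(
      ARM_MPU_RBAR(0, 0x30000000),
      ARM_MPU_RASR(0 /* exec allowed */, ARM_MPU_AP_FULL, 1 /* TEX */,
                   0 /* shareable */, 0 /* cacheable */, 0 /* bufferable */,
                   0 /* subregions */, ARM_MPU_REGION_SIZE_32KB));
  ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);  /* keep default map as background */
}
```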
Strategy #3: Live with the cache
We invalidate the proper cache region before reading DMA-written memory, and flush/clean the proper cache region after writing memory that will be read by the DMA.
Some people like to re-configure the cache as write-through. From our perspective, how is this any different from just disabling it? It requires the same amount of intrusion and system configuration. Besides, most of the effort is on the CPU read side of things anyway, so…
A full Ethernet frame occupies 48 32-byte cache lines... every time a frame is received, that cache space is taken away from the tasks that could make use of it, just to be invalidated some microseconds later. If the cache has a way to detect frequently used lines and keep them, things are fine; otherwise, things may become slower than with the cache disabled. This is an argument in favor of strategy #2 when #1 is not possible and #2 is feasible and convenient.
Hidden gotchas
As there are a lot of buses, some processors might reorder accesses, and even compilers might be tempted to reorder instructions; some specific actions need to wait for others to have finished. This is the reason why Data Synchronization Barrier instructions are sprinkled over the code.
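As an example of the pattern, after enabling a memory or peripheral clock the write must take effect before the first access; the fragment below mirrors the H7 recipe further down (register names as in ST’s CMSIS headers, placement of the barrier is illustrative):

```c
RCC->AHB2ENR |= RCC_AHB2ENR_D2SRAM1EN;  // enable SRAM1 in domain D2
__DSB();  // make sure the write has completed before touching SRAM1
```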
Appendix A: Microcontroller data
| ST board in Wizard | non-cached areas (ETH-DMA accessible) | ETH-DMA accessible cached areas (instead) | cache line size |
|---|---|---|---|
| f207 | | | |
| f429 | | | |
| f746 | DTCM (64K@0x20000000) RM 2.1 Fig.1; 2.2.2 Fig.2 | | |
| f756 | “ | | |
| f767 | DTCM (128K@0x20000000) RM 2.1 Fig.1; 2.2.2 Fig.2 | | |
| h563 | SRAM3 (320K@0x20050000) SRAM1 (256K@0x20000000) SRAM2 (64K@0x20040000) RM 2.1 Fig.1; 2.3.2 Fig.2; 9.3 | | |
| h573 | “ | | |
| h723 | NONE | SRAM1 (16K@0x30000000) SRAM2 (16K@0x30004000) RM 2.1 Fig.1; 2.3.2 Table 6 | 8 32-bit words |
| h735 | NONE | “ | “ |
| h743 | NONE | SRAM1 (128K@0x30000000) SRAM2 (128K@0x30020000) SRAM3 (32K@0x30040000) RM 2.1 Fig.1; 2.3.2 Table 7 | “ |
| h745 | NONE | SRAM1 (128K@0x30000000) SRAM2 (128K@0x30020000) SRAM3 (32K@0x30040000) RM 2.1 Fig.1; 2.3.2 Table 6 | “ |
| h747 | NONE | “ | “ |
| h753 | NONE | = H743 | “ |
| h755 | NONE | = H745 | “ |
| h7s3l8 | NONE | SRAM1 (16K@0x30000000) SRAM2 (16K@0x30004000) RM 2.1 Fig.1; 2.3.2 Table 6 | “ |
| n657 | NONE | AXISRAM2 (1024K@0x34100000) RM 2.1.2 Fig.1; 2.3.2 Table 1; 3.5.1; 10.3; 17.2 Table 84; CubeIDE | “ |
| portenta-h7 | = H747 | | |
Appendix B: Microcontroller recipes
Test methodology
- Test it works as is…
- Do the changes, test it works. Do not enable I-Cache yet.
- Make it fail: comment out the eth_ram attribute line from mongoose_config.h
- Make it work again, now enable I-Cache and try again. The instruction cache may cause some actions to be performed before the underlying stuff through the internal buses has taken effect; in that case, we need to know where things crash or hang, understand why, and apply a fix (usually place a barrier or force a read)
Sometimes there’s no way to make it fail, e.g.: STM32H5
STM32F
Tested on STM32F746. DTCM size could be made smaller, to let the linker actually use what we don’t… that’s the designer’s call (other things can be done, too…).
mongoose_config.h
#define MG_ETH_RAM __attribute__((section(".eth_ram")))
link.ld
MEMORY {
flash(rx) : ORIGIN = 0x08000000, LENGTH = 1024k
dtcm(rwx) : ORIGIN = 0x20000000, LENGTH = 64k
sram(rwx) : ORIGIN = 0x20010000, LENGTH = 256k
}
_estack = ORIGIN(sram) + LENGTH(sram); /* stack points to end of SRAM */
SECTIONS {
.vectors : { KEEP(*(.isr_vector)) } > flash
.text : { *(.text* .text.*) } > flash
.rodata : { *(.rodata*) } > flash
.eth_ram : { *(.eth_ram .eth_ram*) } > dtcm AT > flash
main.c
int main(void) {
// Cross-platform hardware init
hal_init();
MG_INFO(("HAL initialised, starting firmware..."));
SCB_EnableDCache();
MG_INFO(("D-Cache enabled"));
SCB_EnableICache();
MG_INFO(("I-Cache enabled"));
mongoose_init();
STM32H5
RM 9.3:
The DCACHE1 is placed on Cortex-M33 S-AHB bus and caches only the external RAM
memory region (OCTOSPI and FMC), in address range [0x6000 0000:0x9FFF FFFF] of the
memory map.
Indeed, by placing a bus matrix demultiplexing node in front of the DCACHE1, S-AHB bus
memory requests addressing SRAM region or peripherals region (respectively in ranges
[0x2000 0000:0x3FFF FFFF] and [0x4000 0000:0x5FFF FFFF]) are routed directly to the
main AHB bus matrix, and the DCACHE1 is bypassed.
So, nothing is actually needed. The following just places buffers and descriptors in the best place to optimize bus length and access
Tested on STM32H563. SRAM3 size could be made smaller, to let the linker actually use what we don’t… that’s the designer’s call (other things can be done, too…).
mongoose_config.h
#define MG_ETH_RAM __attribute__((section(".eth_ram")))
link.ld
MEMORY {
flash(rx) : ORIGIN = 0x08000000, LENGTH = 2048k
sram(rwx) : ORIGIN = 0x20000000, LENGTH = 320K /* SRAM1 (256K) + SRAM2 (64K), contiguous */
sram3(rwx) : ORIGIN = 0x20050000, LENGTH = 320K
}
_estack = ORIGIN(sram) + LENGTH(sram); /* End of RAM. stack points here */
SECTIONS {
.vectors : { KEEP(*(.isr_vector)) } > flash
.text : { *(.text* .text.*) } > flash
.rodata : { *(.rodata*) } > flash
.eth_ram : { *(.eth_ram .eth_ram*) } > sram3 AT > flash
hal.h
static inline void hal_system_init(void) {
SCB->CPACR |= ((3UL << 20U) | (3UL << 22U)); // Enable FPU
__DSB();
__ISB();
DCACHE1->CR |= BIT(DCACHE_CR_EN);
}
STM32H7
Tested on STM32H723. SRAM1 size could be made smaller, to let the linker actually use what we don’t… that’s the designer’s call (other things can be done, too…).
Actually, we almost use it all, so if someone wants larger buffers, SRAM2 (contiguous) has to be enabled, too. Other devices have larger SRAM1s
mongoose_config.h
#define MG_ETH_RAM __attribute__((section(".eth_ram")))
link.ld
MEMORY {
flash(rx) : ORIGIN = 0x08000000, LENGTH = 1024k
sram(rwx) : ORIGIN = 0x24000000, LENGTH = 128k /* AXI SRAM in domain D1 */
sram1(rwx) : ORIGIN = 0x30000000, LENGTH = 16k /* SRAM in domain D2 */
/* 2.3.2: remaining SRAM is in other (non-contiguous) banks,
DTCM @0x20000000 is in domain D1 and not accessible by the ETH DMA controller in domain D2
@0x24020000 can be either AXI or ITCM (2.4 Table 8)
SRAM @0x30000000 is in domain D2 and not directly available at startup to be used as stack (8.5.9 page 366)
SRAM @0x38000000 is in domain D3 and not directly available at startup to be used as stack (8.5.9 page 366) */
}
_estack = ORIGIN(sram) + LENGTH(sram); /* stack points to end of SRAM */
SECTIONS {
.vectors : { KEEP(*(.isr_vector)) } > flash
.text : { *(.text* .text.*) } > flash
.rodata : { *(.rodata*) } > flash
.eth_ram : { *(.eth_ram .eth_ram*) } > sram1 AT > flash
hal.h
static inline void hal_system_init(void) {
SCB->CPACR |= ((3UL << 10 * 2) | (3UL << 11 * 2)); // Enable FPU
__DSB();
__ISB();
RCC->AHB2ENR |= RCC_AHB2ENR_D2SRAM1EN; // Enable SRAM1 in D2
}
main.c
int main(void) {
// Cross-platform hardware init
hal_init();
MG_INFO(("HAL initialised, starting firmware..."));
SCB_EnableDCache();
MG_INFO(("D-Cache enabled"));
SCB_EnableICache();
MG_INFO(("I-Cache enabled"));
mongoose_init();
STM32N6
Tested on STM32N657. Apparently Cube always uses AXISRAM2, so we don’t need to do any memory placement. The linker script declares a .noncacheable section, along with attributes to assign this tag and macros to extract the begin and end addresses of the section, because there are no placement rules for it. Probably Cube has, or will have, provisions to get this section’s address and mark it in the MPU.
Cube
Enable DCACHE
Appendix C: the long stories
Usually, things are not that simple; otherwise, who’d need engineers?
DMA writes, CPU reads, but the linker likes the room we left
If buffers or descriptors do not occupy a whole cache line, and something else is generously placed there by our friend the linker, things get tough.
- If the cache is WBWA (write-back write-allocate), there can be dirty lines that get flushed to memory while the DMA controller is doing its job, trashing the current transfer. E.g.: a buffer and some variable share a cache line. The DMA controller starts when a frame arrives, something we are not aware of. We write to that variable (or have written before), so that line is now dirty because of the variable, regardless of the rest of the line (the buffer), which is being written by the DMA controller. If the cache controller decides it is time to clean, it will flush that line to memory, trashing what the DMA controller has just written.
We can’t flush and invalidate the cache before the DMA starts, unless there is some “in-the-middle” IRQ triggered at frame start, before the DMA starts; though, IMHO, that would render the whole thing useless, because we’d delay processing the frame to clean cache contents that wouldn’t have been there in the first place…
- Even if we do, the cache needs to be invalidated, as said above, when the DMA controller finishes and before the CPU starts reading, because the CPU may have triggered a read of one of those lines while the DMA controller was writing (either because a variable is actually read in code, or because the CPU can do speculative memory accesses and wants to outsmart us). So, we need two expensive invalidations.
One more on alignment
As invalidation/flushing is done on a line basis, functions doing that iterate through the lines based on the starting address and length passed as their arguments. If this is not coincident with cache hardware boundaries, parts of the buffers may not get invalidated/cleaned.
DMA descriptors and cache lines
Usually, one DMA descriptor does not fit exactly in a cache line. If a second descriptor follows the first one in the same cache line, flushing the first descriptor after modifying it may trash changes made by the DMA controller to the second descriptor…
Avoid temptation and align usable units to cache lines, always.
But… what if descriptors are fixed size? Not a linked list, just an array; they have a fixed size, so we can’t avoid having more than one descriptor in a cache line?
Then… then the DMA engine needs to be stopped before we flush a descriptor. What if the DMA was in the middle of updating a descriptor? Then we may trash it… so this should be done when the DMA controller is idle. The cache should first be invalidated; then the second descriptor is read, repopulating the line with descriptor data; then we modify the first descriptor; the cache is then flushed, writing our changes to the first descriptor. Here, “first” and “second” are relative terms, “first” being the one we are working on, and “second” the other one (or ones…) in the cache line. Usually, the sanest thing is to work on the whole descriptor array.
This constantly stops DMA and is prone to accumulating frames in the controller FIFO, leading to frame loss. Keep reading.
Fortunately, the Synopsys IP in the H7 has a “Descriptor Skip Length” field that comes to the rescue.
Cache eviction
A dirty cache line can be flushed without our explicit request, to make room for another line… This means that, on a multi-threaded system, another thread might be scheduled while we are working on a descriptor, and flush part of it before we finish. This is similar to a write-through operation, so by itself it won’t harm; it just makes broken code look like it “sometimes works”.
BUT… this can couple with the above in an evil way: in a multi-threaded environment, the stop/invalidate/modify/flush operation should be atomic, or we must guarantee that there will never be a dirty cache line while a DMA controller may also be writing and the processor can be interrupted to switch to another task that might use the cache and cause an eviction.