Cache (in)coherency

Rationale

The world of microcontrollers was peaceful and predictable, until someone introduced advanced interconnect buses. Unhappy with that, someone else introduced caches.

Basically, “modern” microcontrollers have a bunch of internal buses that can be switched or routed. The CPU core has several buses of its own, and other buses interconnect the peripherals.
Some peripherals are slaves, like RAM.
Some peripherals are bus masters, like a DMA controller.
As, historically, accessing generic RAM through a bus has been slower than accessing special-purpose RAM “tightly coupled” to the CPU, someone thought that a cache, sitting between the CPU and memory, could improve this.

When reading, the cache gets populated on the first read and serves subsequent reads.
When writing, there are two strategies: write-through, where data is written to both the cache and memory, and write-back, where data is written to the cache only and reaches memory later, when the line is cleaned (flushed) or evicted.

When there is more than one bus master, be it another core or a DMA controller, things can get out of sync quickly.

CPU ←→ Cache ←→ Memory ←→ DMA

Objective

We’d like to inject Mongoose into existing projects, which may already have the MPU and I/D caches set up. Ideally, if the wizard just drops the “mongoose/” directory into such a project, it should just work. The exact mechanism is to be decided; perhaps it could be driven by a preprocessor definition in mongoose_config.h, but ideally it should just work without any extra manual definition.
Therefore, we do not control the MPU / caching settings of the project, but need to adapt to them.

Possible Strategies

As can be imagined, keeping things coherent by hand is not free. Caches work by holding lines of data, so invalidating an area requires doing it iteratively for every line involved; areas should be aligned to the cache line size, too. The same goes for flushing (cleaning) operations. Working on the whole cache at once is a big penalty for everything else running, and if done frequently it is even worse than disabling the cache altogether (we not only have to read/write memory anyway, but also act on the cache…)
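To make the per-line cost concrete, here is a minimal sketch of what an area invalidation looks like on a Cortex-M7, mirroring what the CMSIS SCB_InvalidateDCache_by_Addr() helper does internally (the 32-byte line size is the Cortex-M7 one; the function name is ours):

#include "stm32f7xx.h"  // any Cortex-M7 CMSIS device header works here

// Invalidate every D-cache line overlapping [addr, addr + size)
static void dcache_invalidate_area(uint32_t addr, uint32_t size) {
  uint32_t line = 32U;               // Cortex-M7 D-cache line size
  uint32_t p = addr & ~(line - 1U);  // align down to a line boundary
  __DSB();
  while (p < addr + size) {
    SCB->DCIMVAC = p;                // one register write per line
    p += line;
  }
  __DSB();
  __ISB();
}

One register write per 32 bytes: for big buffers, that adds up.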

Strategy #1: Avoid the cache

Depending on the internal buses, some memory sections can be non-cached, that is, the bus switch/matrix connects memory to the CPU bypassing the cache. We can place our buffers in those sections and relax. This is by far the optimum, as it requires no additional actions and doesn’t fiddle with hardware it doesn’t need to: the cache is a man in the middle no one asked for, and it serves no purpose here, particularly in our architecture, where DMA-written memory goes to a queue and is later copied to a buffer before being processed (or, the other way around, is written once and sent). The recipes in Appendix B show how this looks in practice.

To reinforce this choice, there’s also the fact that in some cases the ETH-DMA is unable to access some RAM sections, so we need to craft our linker files to map to usable sections. We’ve also done things like this for NXP iMXRTs in the past.

Strategy #2: Mark as non-cacheable

If the processor has an MPU, some areas of memory can be given specific attributes. We can mark them so the cache will not hold data coming from those locations. The problem with this is… that there are few MPU regions, and who are we to decide how our customers use their MPU? Even though most won’t care, those who do need to be aware of what we’re doing.
The processor may not have an MPU, or an RTOS might want to make better use of its few regions… or… In fact, the Arm Cortex-M33 has no standard way to express this at all, as its caches are vendor extensions.

This seems to be ST’s preferred way. Their lwIP drivers work with pbufs allocated and managed by lwIP, where aligning everything to cache lines would probably be overkill.
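For reference, a minimal sketch of what such marking looks like with the CMSIS v7-M MPU helpers on a Cortex-M7; region number, base address and size are illustrative assumptions (TEX=1, C=0, B=0 yields normal, shareable, non-cacheable memory):

#include "stm32f7xx.h"  // any Cortex-M7 CMSIS device header; pulls in mpu_armv7.h

static void eth_ram_mark_noncacheable(void) {
  ARM_MPU_Disable();
  ARM_MPU_SetRegion(
      ARM_MPU_RBAR(0U, 0x20010000U),  // region 0 (assumption), buffer area base
      ARM_MPU_RASR(1U,                // XN: no instruction fetches from here
                   ARM_MPU_AP_FULL,   // full read/write access
                   1U, 1U, 0U, 0U,    // TEX=1, shareable, non-cacheable, non-bufferable
                   0U,                // all subregions enabled
                   ARM_MPU_REGION_SIZE_16KB));
  ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);  // keep the default map for everything else
}

Note how this consumes one of the few (typically 8 or 16) regions, which is exactly the objection above.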

Strategy #3: Live with the cache

We invalidate the proper cache region before reading DMA-written memory, and flush/clean the proper cache region after writing memory that will be read by the DMA.
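In CMSIS terms (Cortex-M7 function names; buffer and length names are illustrative), that boils down to:

#include "stm32f7xx.h"  // any Cortex-M7 CMSIS device header works here

// TX: the CPU wrote the frame; push it from cache to memory before the DMA reads it
static void eth_tx_sync(uint32_t *tx_buf, uint32_t len) {
  SCB_CleanDCache_by_Addr(tx_buf, (int32_t) len);
  // now it is safe to hand tx_buf over to the DMA controller
}

// RX: the DMA wrote the frame; drop any stale cached copy before the CPU reads it
static void eth_rx_sync(uint32_t *rx_buf, uint32_t len) {
  SCB_InvalidateDCache_by_Addr(rx_buf, (int32_t) len);
  // now the CPU will read what the DMA actually wrote
}

Both buffers need to be cache-line aligned for this to be safe; more on that in Appendix C.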

Some people like to re-configure the cache to be write-through. From our perspective, how is this any different from just disabling it? It requires the same amount of intrusion and system configuration. Besides, write-through only removes the need for cleaning; most of the effort is on the CPU read (invalidate) side of things, so…

A full Ethernet frame takes 48 32-byte cache lines (1536 / 32)… every time a frame is received, you take the cache away from the tasks that might make use of it, just for those lines to be invalidated some microseconds later. If the cache has a way to detect frequently used lines and keep them, things are fine; otherwise, things may become slower than with the cache disabled. This is an argument in favor of strategy #2 when #1 is not possible and #2 is feasible and convenient.

Hidden gotchas

As there are a lot of buses, some processors may reorder accesses, and even compilers may be tempted to reorder instructions, so some actions need to wait for others to have finished. This is the reason why Data Synchronization Barrier (DSB) instructions are sprinkled over the code.
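A typical spot, sketched with hypothetical names (only the CMSIS __DSB() intrinsic is real; the descriptor layout and the tail-pointer register stand for whatever the actual controller uses):

#include <stdint.h>

#define OWN_BIT (1UL << 31)  // hypothetical "owned by DMA" descriptor flag

static void eth_tx_kick(volatile uint32_t *desc, volatile uint32_t *tail_reg) {
  desc[3] |= OWN_BIT;                       // hand the descriptor over to the DMA
  __DSB();                                  // wait until that write reached memory...
  *tail_reg = (uint32_t) (uintptr_t) desc;  // ...before poking the DMA tail pointer
}

Without the barrier, the tail-pointer write may travel down a different bus and arrive first, making the DMA read a descriptor it does not own yet.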

Appendix A: Microcontroller data

ST board in Wizard | non-cached areas (ETH-DMA accessible) | ETH-DMA accessible cached areas (instead) | RM reference | cache line size
f207 | | | |
f429 | | | |
f746 | DTCM (64K@0x20000000) | | RM 2.1 Fig.1; 2.2.2 Fig.2 |
f756 | | | |
f767 | DTCM (128K@0x20000000) | | RM 2.1 Fig.1; 2.2.2 Fig.2 |
h563 | SRAM3 (320K@0x20050000), SRAM1 (256K@0x20000000), SRAM2 (64K@0x20040000) | | RM 2.1 Fig.1; 2.3.2 Fig.2; 9.3 |
h573 | | | |
h723 | NONE | SRAM1 (16K@0x30000000), SRAM2 (16K@0x30004000) | RM 2.1 Fig.1; 2.3.2 Table 6 | 8 32-bit words
h735 | NONE | | |
h743 | NONE | SRAM1 (128K@0x30000000), SRAM2 (128K@0x30020000), SRAM3 (32K@0x30040000) | RM 2.1 Fig.1; 2.3.2 Table 7 |
h745 | NONE | SRAM1 (128K@0x30000000), SRAM2 (128K@0x30020000), SRAM3 (32K@0x30040000) | RM 2.1 Fig.1; 2.3.2 Table 6 |
h747 | NONE | | |
h753 | NONE | = H743 | |
h755 | NONE | = H745 | |
h7s3l8 | NONE | SRAM1 (16K@0x30000000), SRAM2 (16K@0x30004000) | RM 2.1 Fig.1; 2.3.2 Table 6 |
n657 | NONE | AXISRAM2 (1024K@0x34100000) | RM 2.1.2 Fig.1; 2.3.2 Table 1; 3.5.1; 10.3; 17.2 Table 84; CubeIDE |
portenta-h7 | = H747 | | |

Appendix B: Microcontroller recipes

Test methodology

  1. Test it works as is…
  2. Do the changes, test it still works. Do not enable the I-Cache yet.
  3. Make it fail: comment out the eth_ram attribute line in mongoose_config.h
  4. Make it work again, then enable the I-Cache and try again. The instruction cache may cause some actions to be performed before the underlying changes have taken effect through the internal buses; in that case, we need to find where things crash or hang, understand why, and apply a fix (usually place a barrier or force a read)

Sometimes there’s no way to make it fail, e.g. STM32H5.

STM32F

Tested on STM32F746. The dtcm region size could be made smaller, to let the linker actually use the part we don’t take… that’s the designer’s call (other arrangements are possible, too…).

mongoose_config.h

#define MG_ETH_RAM __attribute__((section(".eth_ram")))

link.ld

MEMORY {
  flash(rx) : ORIGIN = 0x08000000, LENGTH = 1024k
  dtcm(rwx) : ORIGIN = 0x20000000, LENGTH = 64k
  sram(rwx) : ORIGIN = 0x20010000, LENGTH = 256k
}
_estack = ORIGIN(sram) + LENGTH(sram); /* stack points to end of SRAM */

SECTIONS {
  .vectors : { KEEP(*(.isr_vector)) } > flash
  .text : { *(.text* .text.*) } > flash
  .rodata : { *(.rodata*) } > flash

  .eth_ram : { *(.eth_ram .eth_ram*) } > dtcm AT > flash

  /* .data, .bss and the rest of the sections follow, unchanged */
}

main.c

int main(void) {
  // Cross-platform hardware init
  hal_init();
  MG_INFO(("HAL initialised, starting firmware..."));
  SCB_EnableDCache();
  MG_INFO(("D-Cache enabled"));
  SCB_EnableICache();
  MG_INFO(("I-Cache enabled"));

  mongoose_init();
  // ... the rest of main() (event loop, etc.) is unchanged
}

STM32H5

RM 9.3:
The DCACHE1 is placed on Cortex-M33 S-AHB bus and caches only the external RAM
memory region (OCTOSPI and FMC), in address range [0x6000 0000:0x9FFF FFFF] of the
memory map.
Indeed, by placing a bus matrix demultiplexing node in front of the DCACHE1, S-AHB bus
memory requests addressing SRAM region or peripherals region (respectively in ranges
[0x2000 0000:0x3FFF FFFF] and [0x4000 0000:0x5FFF FFFF]) are routed directly to the
main AHB bus matrix, and the DCACHE1 is bypassed.

So, nothing is actually needed. The following just places buffers and descriptors in the best spot, to optimize bus path length and access.
Tested on STM32H563. The sram3 region size could be made smaller, to let the linker actually use the part we don’t take… that’s the designer’s call (other arrangements are possible, too…).

mongoose_config.h

#define MG_ETH_RAM __attribute__((section(".eth_ram")))

link.ld

MEMORY {
  flash(rx) : ORIGIN = 0x08000000, LENGTH = 2048k
  sram(rwx) : ORIGIN = 0x20000000, LENGTH = 320K /* SRAM1 + SRAM2 */
  sram3(rwx) : ORIGIN = 0x20050000, LENGTH = 320K
}
_estack = ORIGIN(sram) + LENGTH(sram); /* End of RAM. stack points here */

SECTIONS {
  .vectors : { KEEP(*(.isr_vector)) } > flash
  .text : { *(.text* .text.*) } > flash
  .rodata : { *(.rodata*) } > flash

  .eth_ram : { *(.eth_ram .eth_ram*) } > sram3 AT > flash

  /* .data, .bss and the rest of the sections follow, unchanged */
}

hal.h

static inline void hal_system_init(void) {
  SCB->CPACR |= ((3UL << 20U) | (3UL << 22U)); // Enable FPU
  __DSB();
  __ISB();
  DCACHE1->CR |= BIT(DCACHE_CR_EN);  // enable the data cache
}

STM32H7

Tested on STM32H723. The sram1 region size could be made smaller, to let the linker actually use the part we don’t take… that’s the designer’s call (other arrangements are possible, too…).
Actually, we almost use it all, so in case someone wants larger buffers, SRAM2 (contiguous) has to be enabled, too. Other devices have larger SRAM1 banks.

mongoose_config.h

#define MG_ETH_RAM __attribute__((section(".eth_ram")))

link.ld

MEMORY {
  flash(rx) : ORIGIN = 0x08000000, LENGTH = 1024k
  sram(rwx) : ORIGIN = 0x24000000, LENGTH = 128k /* AXI SRAM in domain D1 */
  sram1(rwx) : ORIGIN = 0x30000000, LENGTH = 16k /* SRAM in domain D2 */
  /* 2.3.2: the remaining SRAM is in other (non-contiguous) banks:
     DTCM @0x20000000 is in domain D1 and not accessible by the ETH DMA controller in domain D2;
     @0x24020000 can be either AXI or ITCM (2.4 Table 8);
     SRAM @0x30000000 is in domain D2 and not directly available at startup to be used as stack (8.5.9 page 366);
     SRAM @0x38000000 is in domain D3 and not directly available at startup to be used as stack (8.5.9 page 366) */
}

_estack = ORIGIN(sram) + LENGTH(sram); /* stack points to end of SRAM */

SECTIONS {
  .vectors : { KEEP(*(.isr_vector)) } > flash
  .text : { *(.text* .text.*) } > flash
  .rodata : { *(.rodata*) } > flash

  .eth_ram : { *(.eth_ram .eth_ram*) } > sram1 AT > flash

  /* .data, .bss and the rest of the sections follow, unchanged */
}

hal.h

static inline void hal_system_init(void) {
  SCB->CPACR |= ((3UL << 10 * 2) | (3UL << 11 * 2)); // Enable FPU
  __DSB();
  __ISB();
  RCC->AHB2ENR |= RCC_AHB2ENR_D2SRAM1EN; // Enable SRAM1 in D2
}

main.c

int main(void) {
  // Cross-platform hardware init
  hal_init();
  MG_INFO(("HAL initialised, starting firmware..."));
  SCB_EnableDCache();
  MG_INFO(("D-Cache enabled"));
  SCB_EnableICache();
  MG_INFO(("I-Cache enabled"));

  mongoose_init();
  // ... the rest of main() (event loop, etc.) is unchanged
}

STM32N6

Tested on STM32N657. Apparently Cube always uses AXISRAM2, so we don’t need to do any memory placement. There are declarations for a .noncacheable section in the linker script, along with attributes to assign this tag, and macros to extract the begin and end addresses of that section (there are no placement rules for it). Probably Cube has, or will have, provisions to get this section’s address range and mark it in the MPU.
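The pattern described above would look roughly like this; section and symbol names are illustrative, not necessarily Cube’s actual ones:

/* link.ld (sketch) */
.noncacheable (NOLOAD) : {
  __noncacheable_start__ = .;
  *(.noncacheable .noncacheable*)
  __noncacheable_end__ = .;
} > sram

/* C side: place buffers there, and let MPU setup code find the bounds */
__attribute__((section(".noncacheable"))) static uint8_t eth_bufs[4][1536];
extern uint32_t __noncacheable_start__, __noncacheable_end__;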

Cube

Enable DCACHE

Appendix C: the long stories

Usually, things are not that simple; otherwise, who’d need Engineers?

DMA writes, CPU reads, but the linker likes the room we left

If buffers or descriptors do not occupy a whole cache line, and something else is generously placed there by our friend the linker, things get tough.

We can’t flush and invalidate the cache before the DMA starts, unless there is some “in-the-middle” IRQ triggered at frame start, before the DMA writes; though, IMHO, that would render the whole thing useless, because we’d be delaying frame processing to clean a cache line that shouldn’t have been dirty in the first place… The sketch below shows the usual way out: give the buffers whole cache lines of their own.
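A minimal sketch of that, assuming 32-byte lines and GCC attributes (names and sizes are illustrative):

#include <stdint.h>

#define CACHE_LINE 32U

// Start address aligned to a line, and the element size (1536) a multiple
// of the line size, so nothing else can ever share a line with a buffer
static uint8_t rx_bufs[4][1536] __attribute__((aligned(CACHE_LINE)));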

One more on alignment

As invalidation/flushing is done on a line basis, the functions doing it iterate through the lines based on the starting address and length passed as arguments. If these do not coincide with cache line boundaries, parts of the buffers may not get invalidated/cleaned.
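A quick worked example, assuming 32-byte lines:

#include <stdint.h>

static uint8_t buf[60];  // suppose the linker places it at 0x30000010
// Lines touched: 0x30000010 & ~31 = 0x30000000, then 0x30000020, 0x30000040;
// that is 3 lines (96 bytes) for a 60-byte buffer. A maintenance routine
// that did not round both ends outwards to line boundaries would leave
// the tail of the buffer stale (or dirty, for the clean operation).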

DMA descriptors and cache lines

Usually, one DMA descriptor does not exactly fill a cache line. If a second descriptor follows the first one within the same cache line, flushing the first descriptor after modifying it may trash changes done by the DMA controller to the second descriptor…
Avoid temptation and align usable units to cache lines, always.
But… what if descriptors are of fixed size? Not a linked list, just an array: they have a fixed size, so we can’t avoid having more than one descriptor per cache line?
Then… then the DMA engine needs to be stopped before we flush a descriptor. What if the DMA was in the middle of updating a descriptor? Then we may trash it… so this must be done when the DMA controller is idle. The cache should first be invalidated, and the second descriptor read, so the line is repopulated with the descriptor data; then we modify the first descriptor; the cache should then be flushed, to write our changes to the first descriptor back. Here, “first” and “second” are relative terms: “first” is the one we are working on, and “second” the other one (or ones…) in the same cache line. Usually, the sanest thing is to work on the whole descriptor array, as sketched below.
This constantly stops the DMA and is prone to accumulating frames in the controller FIFO, leading to frame loss. Keep reading.
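Spelled out as code, heavily hedged: the descriptor layout (4 words each), dma_stop()/dma_start() and the array-wide maintenance are all placeholders for what a concrete controller needs:

#include "stm32f7xx.h"  // any Cortex-M7 CMSIS device header works here

extern void dma_stop(void);   // hypothetical: wait until the controller is idle
extern void dma_start(void);  // hypothetical: resume the controller

static void desc_update(uint32_t *descs, uint32_t ndescs, uint32_t i, uint32_t ctrl) {
  dma_stop();  // the DMA may be mid-update otherwise, and we would trash it
  // repopulate the cache with what the DMA last wrote (whole array: sanest)
  SCB_InvalidateDCache_by_Addr(descs, (int32_t) (ndescs * 4U * sizeof(uint32_t)));
  descs[i * 4U + 3U] = ctrl;  // modify "our" descriptor
  // write our change back to memory, together with the untouched neighbors
  SCB_CleanDCache_by_Addr(descs, (int32_t) (ndescs * 4U * sizeof(uint32_t)));
  dma_start();
}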

Fortunately, the Synopsys IP in the H7 has a “Descriptor Skip Length” field that comes to the rescue: it inserts a configurable gap between consecutive descriptors, so each one can sit in a cache line of its own.

Cache eviction

A dirty cache line can be flushed without our explicit request, to make room for another line… This means that, on a multi-threaded system, another thread might be scheduled while we are working on a descriptor, and part of it may be flushed before we finish. This is similar to a write-through operation, so by itself it does no harm; it just makes broken code look like it “sometimes works”.
BUT… this can couple with the above in an evil way: in a multi-threaded environment, the stop/invalidate/modify/flush operation should be atomic. Otherwise, we must guarantee that there is never a dirty cache line at a time when a DMA controller may also be writing, while the processor may be interrupted to switch to another task that uses the cache and causes an eviction.