Cache (in)coherency
Rationale
The world of microcontrollers was peaceful and predictable, until someone introduced advanced interconnect buses. Unhappy with that, someone else introduced caches.
Basically, “modern” microcontrollers have a bunch of internal buses that can be switched or routed. The CPU core has a bunch of buses, and other buses interconnect peripherals.
- Some peripherals are bus slaves, like RAM
- Some peripherals are bus masters, like DMA controllers
As, historically, accessing generic RAM through a bus has been slower than accessing special purpose RAM “tightly coupled” to the CPU, someone thought a cache, sitting between the CPU and memory, could improve this.
When reading, the cache gets populated on the first read, serving subsequent reads
When writing, there are two strategies:
- Write-through: data is written to the cache (allocating a line on a write miss or not: write-allocate vs. no-write-allocate) and to the memory. This last operation can be deferred (write-allocate only).
- Write-back: data is written to the cache only. To write to memory, the cache needs to be “flushed”. When a cache line holds data that has not yet been written to memory, it is said to be “dirty”, so another word for this operation is to “clean” the cache
When there is more than one bus master, be it another core or a DMA controller, things can get out of sync quickly.
CPU ←→ Cache ←→ Memory ←→ DMA
CPU writes, DMA reads:
CPU → Cache → Memory → DMA
- Write-through cache:
  - no-write-allocate: no problem, Tx just works
  - write-allocate: the CPU may miss the DMA marking the descriptor as available, as the next CPU read will be served from the cache → ”No descriptors available” (see DMA writes, CPU reads below)
- Write-back cache: as CPU-written data can still be in the cache, the DMA controller may read stale data. The cache needs to be flushed/cleaned before the DMA transfer starts
DMA writes, CPU reads:
CPU ← Cache ← Memory ← DMA
- as the memory space where the DMA controller writes can be cached, as this controller writes to memory, data in the cache gets stale, it contains data from a previous CPU read that is no longer valid. The cache needs to be invalidated when the DMA controller finishes, before the CPU starts reading.
- This is a simplification: things are like that only if data is FULLY aligned to a cache line and not shared with anything else. See Appendix C
Objective
We’d like to inject Mongoose into existing projects which may already have the MPU and I/D caching set up. Ideally, if the wizard just drops the “mongoose/” directory into such a project, it should just work. The exact mechanism is to be decided; perhaps it could be driven by a preprocessor definition in mongoose_config.h, but ideally it should just work without any extra manual definition.
Therefore, we do not control the MPU / caching settings of the project, but need to adapt to them.
Possible Strategies
As can be imagined, the above is not free. Caches work by holding lines of data, so invalidating an area requires iterating over every line involved, and areas should be aligned to the cache line size, too. The same happens for flushing operations. Working on the whole cache at once is a big penalty for everything else running, and if done frequently it is even worse than disabling the whole cache (we not only need to read/write memory anyway, but also act on the cache…)
Strategy #1: Avoid the cache
Depending on the internal buses, some memory sections can be non-cached; that is, the bus switch/matrix connects memory to the CPU bypassing the cache. We can place our buffers in those sections and relax. This is by far the best option, as it does not require any additional actions and doesn’t fiddle with hardware it doesn’t need to (the cache is a man in the middle no one asked for; it doesn’t serve any purpose here, particularly in our architecture, where DMA-written memory goes to a queue and is later copied to a buffer before being processed; or is written once and sent, in the other direction)
- Memory area: Unfortunately, Cube doesn’t seem to define an area for this purpose, so our efforts would fail when people liberally cut & paste. ST linker files have a single RAM block defined.
- Absolute memory positioning: Ugly, but should work. However, there’s a major caveat: there is no way to place a block of memory at an absolute location with GCC… it is possible with Clang, and even with Keil or IAR, but not with GCC. We can place a pointer, but to turn that into a block, we need to play with the linker file.
To reinforce this choice, there’s also the fact that in some cases the ETH-DMA is unable to access some RAM sections, so we need to craft our linker files to map to usable sections. We’ve also done things like this for NXP iMXRTs in the past.
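For illustration, a minimal GNU ld fragment of what “playing with the linker file” could look like: an output section pinned at an explicit address inside a non-cached region. The address, region name, and `NOLOAD` choice are assumptions (buffers need no load image); real scripts, like the recipes below, may instead use `AT > flash`.

```c
/* GNU ld fragment: pin .eth_ram at an absolute address (illustrative) */
.eth_ram 0x20000000 (NOLOAD) : {
  . = ALIGN(32);          /* align to the cache line size */
  *(.eth_ram .eth_ram*)
} > dtcm
```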
Strategy #2: Mark as non-cacheable
If the processor has an MPU, some areas of memory can be given specific attributes. We can mark them so the cache will not hold data coming from those locations. The problem with this is… that there are few MPU regions, and who are we to decide how our customers will use their MPU? Even though most won’t care, those who do need to be aware of what we’re doing.
The processor may not have an MPU, or an RTOS might want to make better use of its few regions… or… In fact, the Arm Cortex-M33 does not have a standard way to work this out, as caches are vendor extensions.
This seems to be ST’s preferred way. Their lwIP drivers work with pbufs allocated and managed by lwIP; aligning those to cache lines would probably be overkill.
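For reference, marking a region non-cacheable on an Armv7-M MPU (F7/H7 class; the Cortex-M33 MPU works differently) might look like the following CMSIS-style register-configuration sketch. The region number, base address, and size are illustrative; TEX=1, C=0, B=0 makes the region normal, non-cacheable memory.

```c
#include "mpu_armv7.h"  /* CMSIS MPU helpers for Armv7-M */

static void mpu_noncacheable_region(void) {
  ARM_MPU_Disable();
  /* Region 0: 32 KB at 0x30000000, full access, non-cacheable, non-bufferable */
  ARM_MPU_SetRegion(
      ARM_MPU_RBAR(0, 0x30000000),
      ARM_MPU_RASR(0 /* exec allowed */, ARM_MPU_AP_FULL, 1 /* TEX */,
                   0 /* shareable */, 0 /* cacheable */, 0 /* bufferable */,
                   0 /* subregions */, ARM_MPU_REGION_SIZE_32KB));
  ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);  /* keep default map as background */
}
```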
Strategy #3: Live with the cache
We invalidate the proper cache region before reading DMA-written memory, and flush/clean the proper cache region after writing memory that will be read by the DMA.
Some people like to re-configure the cache as write-through. From our perspective, how is this any different from just disabling it? It requires the same amount of intrusion and system configuration. Besides, most of the effort is on the CPU read side of things anyway, so…
A full Ethernet frame occupies 48 32-byte cache lines... every time a frame is received, that cache space is taken away from the tasks that could make use of it, just to be invalidated some microseconds later. If the cache has a way to detect frequently used lines and keep them, things are fine; otherwise, things may become slower than with the cache disabled. This is an argument in favor of strategy #2 when #1 is not possible and #2 is feasible and convenient.
Hidden gotchas
As there are a lot of buses, some processors might reorder accesses, and even compilers might be tempted to reorder instructions; some specific actions need to wait for others to have finished. This is the reason why Data Synchronization Barrier instructions are sprinkled over the code.
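As an example of the pattern, after enabling a memory or peripheral clock the write must take effect before the first access; the fragment below mirrors the H7 recipe further down (register names as in ST’s CMSIS headers, placement of the barrier is illustrative):

```c
RCC->AHB2ENR |= RCC_AHB2ENR_D2SRAM1EN;  // enable SRAM1 in domain D2
__DSB();  // make sure the write has completed before touching SRAM1
```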
Appendix A: Microcontroller data
| ST board in Wizard | non-cached areas (ETH-DMA accessible) | ETH-DMA accessible cached areas (instead) | cache line size |
|---|---|---|---|
| f207 | | | |
| f429 | | | |
| f746 | DTCM (64K@0x20000000) RM 2.1 Fig.1; 2.2.2 Fig.2 | | |
| f756 | “ | | |
| f767 | DTCM (128K@0x20000000) RM 2.1 Fig.1; 2.2.2 Fig.2 | | |
| h563 | SRAM3 (320K@0x20050000) SRAM1 (256K@0x20000000) SRAM2 (64K@0x20040000) RM 2.1 Fig.1; 2.3.2 Fig.2; 9.3 | | |
| h573 | “ | | |
| h723 | NONE | SRAM1 (16K@0x30000000) SRAM2 (16K@0x30004000) RM 2.1 Fig.1; 2.3.2 Table 6 | 8 32-bit words |
| h735 | NONE | “ | “ |
| h743 | NONE | SRAM1 (128K@0x30000000) SRAM2 (128K@0x30020000) SRAM3 (32K@0x30040000) RM 2.1 Fig.1; 2.3.2 Table 7 | “ |
| h745 | NONE | SRAM1 (128K@0x30000000) SRAM2 (128K@0x30020000) SRAM3 (32K@0x30040000) RM 2.1 Fig.1; 2.3.2 Table 6 | “ |
| h747 | NONE | “ | “ |
| h753 | NONE | = H743 | “ |
| h755 | NONE | = H745 | “ |
| h7s3l8 | NONE | SRAM1 (16K@0x30000000) SRAM2 (16K@0x30004000) RM 2.1 Fig.1; 2.3.2 Table 6 | “ |
| n657 | NONE | AXISRAM2 (1024K@0x34100000) RM 2.1.2 Fig.1; 2.3.2 Table 1; 3.5.1; 10.3; 17.2 Table 84; CubeIDE | “ |
| portenta-h7 | = H747 | | |
Appendix B: Microcontroller recipes
Test methodology
- Test it works as is…
- Do the changes, test it works. Do not enable I-Cache yet.
- Make it fail: comment out the eth_ram attribute line from mongoose_config.h
- Make it work again, now enable I-Cache and try again. The instruction cache may cause some actions to be performed before the underlying stuff through the internal buses has taken effect; in that case, we need to know where things crash or hang, understand why, and apply a fix (usually place a barrier or force a read)
Sometimes there’s no way to make it fail, e.g.: STM32H5
STM32F
Tested on STM32F746. DTCM size could be made smaller, to let the linker actually use what we don’t… that’s the designer’s call (other things can be done, too…).
mongoose_config.h
#define MG_ETH_RAM __attribute__((section(".eth_ram")))
link.ld
MEMORY {
flash(rx) : ORIGIN = 0x08000000, LENGTH = 1024k
dtcm(rwx) : ORIGIN = 0x20000000, LENGTH = 64k
sram(rwx) : ORIGIN = 0x20010000, LENGTH = 256k
}
_estack = ORIGIN(sram) + LENGTH(sram); /* stack points to end of SRAM */
SECTIONS {
.vectors : { KEEP(*(.isr_vector)) } > flash
.text : { *(.text* .text.*) } > flash
.rodata : { *(.rodata*) } > flash
.eth_ram : { *(.eth_ram .eth_ram*) } > dtcm AT > flash
main.c
int main(void) {
// Cross-platform hardware init
hal_init();
MG_INFO(("HAL initialised, starting firmware..."));
SCB_EnableDCache();
MG_INFO(("D-Cache enabled"));
SCB_EnableICache();
MG_INFO(("I-Cache enabled"));
mongoose_init();
STM32H5
RM 9.3:
The DCACHE1 is placed on Cortex-M33 S-AHB bus and caches only the external RAM
memory region (OCTOSPI and FMC), in address range [0x6000 0000:0x9FFF FFFF] of the
memory map.
Indeed, by placing a bus matrix demultiplexing node in front of the DCACHE1, S-AHB bus
memory requests addressing SRAM region or peripherals region (respectively in ranges
[0x2000 0000:0x3FFF FFFF] and [0x4000 0000:0x5FFF FFFF]) are routed directly to the
main AHB bus matrix, and the DCACHE1 is bypassed.
So, nothing is actually needed. The following just places buffers and descriptors in the best place to optimize bus length and access
Tested on STM32H563. SRAM3 size could be made smaller, to let the linker actually use what we don’t… that’s the designer’s call (other things can be done, too…).
mongoose_config.h
#define MG_ETH_RAM __attribute__((section(".eth_ram")))
link.ld
MEMORY {
flash(rx) : ORIGIN = 0x08000000, LENGTH = 2048k
sram(rwx) : ORIGIN = 0x20000000, LENGTH = 320K /* SRAM1 (256K) + SRAM2 (64K), contiguous */
sram3(rwx) : ORIGIN = 0x20050000, LENGTH = 320K
}
_estack = ORIGIN(sram) + LENGTH(sram); /* End of RAM. stack points here */
SECTIONS {
.vectors : { KEEP(*(.isr_vector)) } > flash
.text : { *(.text* .text.*) } > flash
.rodata : { *(.rodata*) } > flash
.eth_ram : { *(.eth_ram .eth_ram*) } > sram3 AT > flash
hal.h
static inline void hal_system_init(void) {
SCB->CPACR |= ((3UL << 20U) | (3UL << 22U)); // Enable FPU
__DSB();
__ISB();
DCACHE1->CR |= BIT(DCACHE_CR_EN);
}
STM32H7
Tested on STM32H723. SRAM1 size could be made smaller, to let the linker actually use what we don’t… that’s the designer’s call (other things can be done, too…).
Actually, we almost use it all, so if someone wants larger buffers, SRAM2 (contiguous) has to be enabled, too. Other devices have larger SRAM1s
mongoose_config.h
#define MG_ETH_RAM __attribute__((section(".eth_ram")))
link.ld
MEMORY {
flash(rx) : ORIGIN = 0x08000000, LENGTH = 1024k
sram(rwx) : ORIGIN = 0x24000000, LENGTH = 128k /* AXI SRAM in domain D1 */
sram1(rwx) : ORIGIN = 0x30000000, LENGTH = 16k /* SRAM in domain D2 */
/* 2.3.2: remaining SRAM is in other (non-contiguous) banks,
DTCM @0x20000000 is in domain D1 and not accessible by the ETH DMA controller in domain D2
@0x24020000 can be either AXI or ITCM (2.4 Table 8)
SRAM @0x30000000 is in domain D2 and not directly available at startup to be used as stack (8.5.9 page 366)
SRAM @0x38000000 is in domain D3 and not directly available at startup to be used as stack (8.5.9 page 366) */
}
_estack = ORIGIN(sram) + LENGTH(sram); /* stack points to end of SRAM */
SECTIONS {
.vectors : { KEEP(*(.isr_vector)) } > flash
.text : { *(.text* .text.*) } > flash
.rodata : { *(.rodata*) } > flash
.eth_ram : { *(.eth_ram .eth_ram*) } > sram1 AT > flash
hal.h
static inline void hal_system_init(void) {
SCB->CPACR |= ((3UL << 10 * 2) | (3UL << 11 * 2)); // Enable FPU
__DSB();
__ISB();
RCC->AHB2ENR |= RCC_AHB2ENR_D2SRAM1EN; // Enable SRAM1 in D2
}
main.c
int main(void) {
// Cross-platform hardware init
hal_init();
MG_INFO(("HAL initialised, starting firmware..."));
SCB_EnableDCache();
MG_INFO(("D-Cache enabled"));
SCB_EnableICache();
MG_INFO(("I-Cache enabled"));
mongoose_init();
STM32N6
Tested on STM32N657. Apparently Cube always uses AXISRAM2, so we don’t need to do any memory placement. The linker script declares a .noncacheable section, along with attributes to assign this tag and macros to extract the begin and end addresses of the section, because there are no placement rules for it. Probably Cube has, or will have, provisions to get this section’s address and mark it in the MPU.
Cube
Enable DCACHE
Appendix C: the long stories
Usually, things are not that simple; otherwise, who’d need engineers?
DMA writes, CPU reads, but the linker likes the room we left
If buffers or descriptors do not occupy a whole cache line, and something else is generously placed there by our friend the linker, things get tough.
- If the cache is WBWA (write-back write-allocate), there can be dirty lines that get flushed to memory while the DMA controller is doing its job, trashing the current transfer. E.g.: a buffer and some variable share a cache line. The DMA controller starts when a frame arrives, something we are not aware of. We write to that variable (or have written before), so that line is now dirty because of the variable, regardless of the rest of the line (the buffer), which is being written by the DMA controller. If the cache controller decides it is time to clean, it will flush that line to memory, trashing what the DMA controller has just written.
We can’t flush and invalidate the cache before the DMA starts, unless there is some “in-the-middle” IRQ triggered at frame start, before the DMA starts; though, IMHO, that would render the whole thing useless, because we’d delay processing the frame to clean cache contents that wouldn’t have been there in the first place…
- Even if we do, the cache needs to be invalidated, as said above, when the DMA controller finishes and before the CPU starts reading, because the CPU may have triggered a read of one of those lines while the DMA controller was writing (either because a variable is actually read in code, or because the CPU can do speculative memory accesses and wants to outsmart us). So, we need two expensive invalidations.
One more on alignment
As invalidation/flushing is done on a line basis, functions doing that iterate through the lines based on the starting address and length passed as their arguments. If this is not coincident with cache hardware boundaries, parts of the buffers may not get invalidated/cleaned.
DMA descriptors and cache lines
Usually, one DMA descriptor does not fit exactly in a cache line. If a second descriptor follows the first one in the same cache line, flushing the first descriptor after modifying it may trash changes made by the DMA controller to the second descriptor…
Avoid temptation and align usable units to cache lines, always.
But… what if descriptors are fixed size? Not a linked list, just an array; they have a fixed size, so we can’t avoid having more than one descriptor in a cache line?
Then… then the DMA engine needs to be stopped before we flush a descriptor. What if the DMA was in the middle of updating a descriptor? Then we may trash it… so this should be done when the DMA controller is idle. The cache should first be invalidated; then the second descriptor is read, repopulating the line with descriptor data; then we modify the first descriptor; the cache is then flushed, writing our changes to the first descriptor. Here, “first” and “second” are relative terms, “first” being the one we are working on, and “second” the other one (or ones…) in the cache line. Usually, the sanest thing is to work on the whole descriptor array.
This constantly stops DMA and is prone to accumulating frames in the controller FIFO, leading to frame loss. Keep reading.
Fortunately, the Synopsys IP in the H7 has a “Descriptor Skip Length” field that comes to the rescue.
Cache eviction
A dirty cache line can be flushed without our explicit request, to make room for another line… This means that, on a multi-threaded system, another thread might be scheduled while we are working on a descriptor, and flush part of it before we finish. This is similar to a write-through operation, so by itself it won’t harm; it just makes broken code look like it “sometimes works”.
BUT… this can couple with the above in an evil way: in a multi-threaded environment, the stop/invalidate/modify/flush operation should be atomic, or we must guarantee that there will never be a dirty cache line while a DMA controller may also be writing and the processor can be interrupted to switch to another task that might use the cache and cause an eviction.