MESI, MOESI, MESIF, NUMA
With the arrival of multi-core processors, the idea of multiple, shared resources has become widely accepted, both in hardware and in operating systems. In today’s processors, a single memory controller serves several cores. Between the cores and the controller sits a cache hierarchy that holds the most frequently used data; some levels are private to each core and others are shared, so that the controller can be shared efficiently while performance is maximised. From the operating system’s point of view, all the cores share the same memory.
The question of shared resources does however raise new problems: what happens when one core wants to access a piece of memory data used by another core? Which ‘version’ of the piece of data is the right one, that stored in the RAM or that stored in the cache?
Some coordination quickly becomes necessary…
Mechanisms have been created to handle these conflicts, most often the MESI protocol, which assigns one of four states (Modified, Exclusive, Shared, Invalid) to each cache line so as to provide a minimum of coordination between cores.
Used until Nehalem by Intel, MESI is a protocol that:
- ensures cache coherency and memory coherency
- enables cores to work together
Let’s take the example of a core A that needs to read a piece of data from memory. Under MESI, it first finds out whether any other core is using this data by sending a request to all of them. If no other core holds the data, the memory controller fetches it from memory and sends it to the cache that requested it. If, however, another core B has previously requested this data in order to read it, B will have marked the corresponding cache line as Exclusive. This is where interaction between the cores kicks in: core B moves the line to the Shared state to indicate that it is no longer alone in using the data, and sends the data straight to core A (a forward), which then holds its own copy (an instance). The system works particularly well when at most two cores access the same piece of data. If, however, the same piece of data is marked as Shared on several cores, all of these cores reply to the request! Several identical responses then travel between the cores, using up bandwidth for nothing. This is the limitation of a poorly managed Shared state.
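To make the read case concrete, here is a minimal Python sketch of the simplified behaviour described above. It is a toy model, not a hardware-accurate one: the `State` enum, the `mesi_read` function and the rule that every clean holder replies are assumptions made for illustration, and the Modified case is deliberately left out.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def mesi_read(states, requester):
    """Simulate core `requester` reading one cache line.

    `states` maps a core name to the State of the line in that core's cache
    and is updated in place. Returns the cores that answer the request; an
    empty list means the memory controller supplies the data.
    """
    others = {core: st for core, st in states.items() if core != requester}
    if State.MODIFIED in others.values():
        raise ValueError("the Modified case needs a writeback and is not modelled here")

    # Simplification taken from the text: every clean holder replies.
    responders = [core for core, st in others.items()
                  if st in (State.EXCLUSIVE, State.SHARED)]
    if not responders:
        states[requester] = State.EXCLUSIVE  # only copy, filled from memory
        return responders

    for core in responders:
        states[core] = State.SHARED          # the holder is no longer alone
    states[requester] = State.SHARED
    return responders


caches = {"A": State.INVALID, "B": State.INVALID, "C": State.INVALID}
print(mesi_read(caches, "B"))  # [] : B is filled from memory and becomes Exclusive
print(mesi_read(caches, "A"))  # ['B'] : B forwards the line, both become Shared
print(mesi_read(caches, "C"))  # ['A', 'B'] : two identical replies for one request
```

The third call shows the wasted traffic: the more cores hold the line as Shared, the more redundant replies each new reader triggers.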
For our second example, let’s say that core B hasn’t simply read the piece of data requested by core A but has read it and then modified it. In practice, core B will have moved this cache line from the Exclusive to the Modified state. This state indicates that the copy in main memory is stale (the cache line is said to be dirty) and that the only up-to-date copy of the data is the one in this cache. If core A then requests the data, core B has to carry out a whole series of operations to ensure coherency:
- Write the data back to the main memory so as to synchronise the changes (writeback)
- Change the state of the cache line to Shared
- Send the updated copy to core A (forward)
These operations are of course costly, first of all because the memory controller is involved!
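The three steps can be sketched as follows. This is only an illustration under simple assumptions: the function name `read_modified_line_mesi` and the dictionaries standing in for core B’s cache and for main memory are hypothetical.

```python
def read_modified_line_mesi(cache_b, memory, addr):
    """Core A reads `addr` while core B holds the line as Modified (MESI).

    `cache_b` maps an address to a (state, value) pair and `memory` maps an
    address to a value; both are updated in place. Returns the value sent to
    core A and the ordered list of operations core B had to perform.
    """
    state, value = cache_b[addr]
    if state != "M":
        raise ValueError("this sketch only covers a line held as Modified")

    operations = []
    memory[addr] = value            # 1. writeback: main memory is up to date again
    operations.append("writeback")
    cache_b[addr] = ("S", value)    # 2. the line is now Shared
    operations.append("M -> S")
    operations.append("forward")    # 3. the fresh copy goes to core A
    return value, operations


memory = {0x40: 1}                  # stale value still in RAM
cache_b = {0x40: ("M", 2)}          # core B holds the only up-to-date copy
value, operations = read_modified_line_mesi(cache_b, memory, 0x40)
print(value, operations, memory[0x40])  # 2 ['writeback', 'M -> S', 'forward'] 2
```

The write to `memory` is the expensive step: it is the only one that involves the memory controller.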
To correct the issues mentioned above, several extensions to MESI have been developed, with AMD adopting MOESI. This protocol addresses both of the points we brought up, by introducing the Owned state. In our first case, core B can move from Exclusive to Owned before sending the copy. So far there’s little difference, but suppose a third core wants to access this data: under MESI, cores A and B would both respond. MOESI avoids this: cores holding the line as Shared no longer respond to requests! Only the core holding it as Owned responds, which reduces traffic.
In the second case, core B, holding the line as Modified, will simply move it to Owned before sending the data to core A, instead of carrying out multiple operations (writeback, change to Shared, forward). This saves bandwidth, which is real progress.
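Here is a small sketch of the MOESI behaviour as this article describes it, where the holder of a line (clean or modified) becomes Owned and is the only one to reply. The function `moesi_read` and its single-letter state strings are hypothetical, and real implementations differ in their details.

```python
def moesi_read(states, requester):
    """Simulate core `requester` reading one cache line under MOESI.

    `states` maps a core name to "M", "O", "E", "S" or "I" and is updated in
    place. Returns the list of cores that reply (at most one); an empty list
    means the data comes from main memory.
    """
    others = {core: st for core, st in states.items() if core != requester}
    owner = next((core for core, st in others.items() if st in ("M", "O", "E")), None)
    if owner is None:
        # Shared copies stay silent, so memory supplies the line.
        states[requester] = "S" if "S" in others.values() else "E"
        return []

    states[owner] = "O"      # the holder keeps responsibility for the line
    states[requester] = "S"
    return [owner]           # no writeback to main memory is needed


caches = {"A": "I", "B": "M", "C": "I"}
print(moesi_read(caches, "A"))  # ['B'] : B goes from Modified to Owned and forwards
print(moesi_read(caches, "C"))  # ['B'] : A holds the line as Shared and stays silent
print(caches)                   # {'A': 'S', 'B': 'O', 'C': 'S'}
```

Note that main memory never appears in the function: the modified data travels from cache to cache without the writeback that MESI requires.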
As of Nehalem, Intel abandoned MESI in favour of the MESIF protocol, which adds a new state, Forwarding. In our first example, when core A requests the data, core B moves from Exclusive to Forwarding. In this precise case it behaves like the MOESI Owned state: it alone responds to requests, which reduces traffic.
In the second case, however, MESIF contributes nothing. Although on paper MESIF may seem less attractive, as always in computing it’s a question of striking the best compromise: a MOESI implementation can be more complex than a MESIF one.
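The differences between the three protocols, as presented in this article, can be summarised in a small lookup table. The dictionary `READ_HIT_BEHAVIOUR`, its keys and the helper `replies_to_third_reader` are hypothetical names chosen for this sketch; the values simply restate the two cases discussed above.

```python
# What happens when another core reads a line that is already cached,
# following the simplified description in the text.
READ_HIT_BEHAVIOUR = {
    "MESI":  {"clean_holder_becomes": "S", "shared_copies_reply": True,
              "dirty_line_written_back": True},
    "MOESI": {"clean_holder_becomes": "O", "shared_copies_reply": False,
              "dirty_line_written_back": False},
    "MESIF": {"clean_holder_becomes": "F", "shared_copies_reply": False,
              "dirty_line_written_back": True},
}


def replies_to_third_reader(protocol):
    """Replies a third core receives once two cores already hold the line clean."""
    return 2 if READ_HIT_BEHAVIOUR[protocol]["shared_copies_reply"] else 1


for name, behaviour in READ_HIT_BEHAVIOUR.items():
    print(name, replies_to_third_reader(name), behaviour["dirty_line_written_back"])
# MESI 2 True
# MOESI 1 False
# MESIF 1 True
```

Read this way, MESIF fixes the redundant replies of the first case but, unlike MOESI, still pays for the writeback in the second.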
What about with a multi-CPU system?
Here we have described a relatively simple situation: a single processor with several cores, a cache hierarchy and a single memory controller. What happens in a modern multi-CPU system where each processor has its own controller and its own memory modules?
There are two possibilities. The simplest, though not necessarily the most intuitive, consists in duplicating the data. As with RAID mirroring for hard drives, each memory controller stores a copy of the data. The available memory is therefore halved on a dual-socket platform. Reads are easy: the cores of each processor have a local copy of the data they’re interested in. For writes, each change has to be applied simultaneously to every copy.
The optimum mode is still thought to be aggregating each processor’s memory into one large common memory space that the operating system can use as such. In theory, processor A simply needs to be able to use the memory attached to processor B. This is what the processor’s two QPI links are for. On the SNB-Es, these links are clocked at 4 GHz, which gives us 32 GB/s of usable bandwidth in each direction.
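The 32 GB/s figure can be reconstructed with a little arithmetic. The sketch below assumes that QPI transfers data on both clock edges and carries 2 bytes of payload per transfer in each direction on each link; these two parameters are not stated in the text and are given here as assumptions, as is the function name `qpi_bandwidth_gb_s`.

```python
def qpi_bandwidth_gb_s(clock_ghz, links, transfers_per_clock=2, bytes_per_transfer=2):
    """Usable bandwidth in one direction, in GB/s, summed over all links.

    transfers_per_clock and bytes_per_transfer are assumptions about QPI
    (double data rate, 16 bits of payload per transfer and per direction).
    """
    return clock_ghz * transfers_per_clock * bytes_per_transfer * links


print(qpi_bandwidth_gb_s(4.0, links=1))  # 16.0 GB/s per direction for one link
print(qpi_bandwidth_gb_s(4.0, links=2))  # 32.0 GB/s per direction for both links
```

Under these assumptions, the 32 GB/s quoted corresponds to the two links taken together.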
This does however raise a practical question: with a single memory space as seen by the operating system, how is this space distributed between the two sockets? The traditional method consists of interleaving the memory of the two sockets. This means that at any moment an application will have half of its data on each socket, regardless of which processor it is running on.
The other possibility is to use a smarter scheme which requires the collaboration of the operating system. This is what’s known as NUMA, for Non-Uniform Memory Access. In NUMA mode, the operating system takes into account the fact that there are two distinct logical memory spaces, a bit like the way the kernel takes HyperThreading into account, or the modules of the AMD FX architecture. The operating system then allocates memory on the socket of the processor core executing the thread that is supposed to use it. Thanks to the MESIF protocol discussed earlier (or MOESI for AMD), when an application shares data between several threads, the data is transferred between sockets when necessary.
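The difference between the two placement policies can be illustrated with a toy model. The functions `home_socket` and `local_fraction` are hypothetical, and the page-by-page alternation used for the non-NUMA mode is an assumption chosen for simplicity; real platforms interleave at their own granularity.

```python
def home_socket(page, thread_socket, numa_on, sockets=2):
    """Socket whose memory holds `page` for a thread running on `thread_socket`."""
    if numa_on:
        return thread_socket      # the OS allocates next to the running thread
    return page % sockets         # assumption: pages alternate between sockets


def local_fraction(pages, thread_socket, numa_on, sockets=2):
    """Share of a thread's pages that sit in its own socket's memory."""
    local = sum(home_socket(p, thread_socket, numa_on, sockets) == thread_socket
                for p in range(pages))
    return local / pages


print(local_fraction(1000, thread_socket=0, numa_on=False))  # 0.5
print(local_fraction(1000, thread_socket=0, numa_on=True))   # 1.0
```

With NUMA off, half of the accesses of a thread cross the link between the sockets whatever the thread does; with NUMA on, only data genuinely shared with threads on the other socket has to travel.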
On paper, NUMA mode seems to be the best, but as is often the case things aren’t necessarily as simple as you might first think. We first compared latency and multithreaded memory bandwidth using RMMT and Aida64. Note that for these theoretical tests we turned off HyperThreading as well as four cores on each processor. The reason for this limitation is that RMMT doesn’t support more than eight threads at once, a problem we’ll come back to later. Eight 4 GB modules of registered DDR3 memory clocked at 1066 (CAS7) were installed for these tests:
To recap, in mirroring mode only 16 GB are available. In NUMA Off and NUMA On modes, 32 GB are available, but with NUMA off the memory space is interleaved across both sockets. Quite logically, mirroring mode is the least efficient in terms of memory writes: each write is sent to both memory controllers at the same time, saturating the QPI bus’s 32 GB/s.
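The capacity figures and the cost of mirrored writes follow from simple arithmetic, sketched below. The function names `usable_memory_gb` and `controller_writes` are hypothetical; the module count and size are those of the test platform.

```python
def usable_memory_gb(modules, module_gb, sockets, mirroring):
    """Memory visible to the operating system, in GB."""
    installed = modules * module_gb
    # With mirroring, every socket stores a full copy of the same data.
    return installed // sockets if mirroring else installed


def controller_writes(app_writes, sockets, mirroring):
    """Writes the memory controllers perform for `app_writes` application writes."""
    return app_writes * sockets if mirroring else app_writes


print(usable_memory_gb(8, 4, sockets=2, mirroring=True))   # 16
print(usable_memory_gb(8, 4, sockets=2, mirroring=False))  # 32
print(controller_writes(100, sockets=2, mirroring=True))   # 200
```

Every mirrored write has a remote copy that must cross the link between the sockets, which is why this mode is the one that loads the QPI bus most heavily on writes.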
If we turn mirroring off, write bandwidth climbs again. We’re still partially limited by the QPI bus, but the fact that the local controller and the remote controller are used alternately mitigates the problem. Turning NUMA on maximises performance, with a big gain in reads, writes and latency, as each thread then uses the local memory of the socket on which it is executing.
Of course theoretical performance has to face some practical counter examples! We measured performance in 7-Zip, the value given being compression time in seconds:
7-Zip is slowest with NUMA on. This is of course a particular case: here the software uses a data dictionary that is common to all threads, and access to it is shared. In this case NUMA can cause a slight loss in performance, which seems to be linked to the use of the MESIF coherency protocol. As always in computing, it’s about striking the right compromise: while turning NUMA on is advisable for general usage, it can, like HyperThreading, be slightly counterproductive in some cases. The memory controllers can therefore be configured according to the needs of the application being used. For the tests which follow we opted for the default configuration, which, 7-Zip excepted, is systematically the most effective.