As others have explained, nothing in particular on common hardware. However, there is a catch: The compiler must refrain from performing certain optimizations, unless it can prove that other threads don't access the memory locations in question, e.g.:
#include <array>
#include <cstdint>

std::array<std::uint8_t, 8u> c;

void f()
{
    c[0] ^= 0xfa;
    c[3] ^= 0x10;
    c[6] ^= 0x8b;
    c[7] ^= 0x92;
}
Here, in a single-threaded memory model, the compiler could emit code like the following (pseudo-assembly; assumes little-endian hardware):
load r0, *(std::uint64_t *) &c[0]
xor r0, 0x928b0000100000fa
store r0, *(std::uint64_t *) &c[0]
This is likely to be faster on common hardware than xor'ing the individual bytes. However, it also reads and writes the unaffected (and unmentioned) elements of c at indices 1, 2, 4 and 5. If other threads write to those memory locations concurrently, their updates could be silently overwritten by the wide store.
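To make the hazard concrete, here is a minimal sketch (the second thread and the value 0x5a are purely illustrative). In the C++ memory model, distinct array elements are distinct memory locations, so the following program is race-free, and the compiler must not break it by widening the accesses in f():

#include <array>
#include <cstdint>
#include <thread>

std::array<std::uint8_t, 8u> c;

void f()
{
    c[0] ^= 0xfa;
    c[3] ^= 0x10;
    c[6] ^= 0x8b;
    c[7] ^= 0x92;
}

int main()
{
    // One thread touches c[0], c[3], c[6] and c[7]; the other touches only c[1].
    // There is no data race, so after both joins c[1] must equal 0x5a.
    // A 64-bit read-modify-write in f() could silently undo the write to c[1].
    std::thread t1(f);
    std::thread t2([] { c[1] ^= 0x5a; });
    t1.join();
    t2.join();
}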
For this reason, optimizations like these are often unusable in a multi-threaded memory model. As long as the compiler performs only loads and stores of matching length, or merges accesses only when there is no gap (e.g. the accesses to c[6] and c[7] can still be merged), the hardware commonly already provides the necessary guarantees for correct execution.
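As a C++-level sketch of that still-permitted merge (f_merged is a hypothetical name, and the 16-bit constant assumes the same little-endian layout as the pseudo-assembly above): c[6] and c[7] are both written by the function, so any concurrent access to either of them by another thread would already be a data race, and combining them into one 16-bit access cannot break a race-free program.

#include <array>
#include <cstdint>
#include <cstring>

std::array<std::uint8_t, 8u> c;

void f_merged()
{
    c[0] ^= 0xfa;
    c[3] ^= 0x10;

    // c[6] and c[7] are adjacent with no gap, so they may be handled as one unit;
    // the bytes at indices 1, 2, 4 and 5 are still left untouched.
    std::uint16_t tail;
    std::memcpy(&tail, &c[6], sizeof tail);  // load c[6..7] as a single 16-bit value
    tail ^= 0x928b;                          // 0x8b into byte 6, 0x92 into byte 7
    std::memcpy(&c[6], &tail, sizeof tail);  // store it back as a single unit
}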
(That said, there are, or have been, architectures with weak and counterintuitive memory-ordering guarantees. DEC Alpha, for example, does not treat a pointer load as a data dependency the way other architectures do, so low-level code occasionally needs an explicit memory barrier there. There is a somewhat well-known rant by Linus Torvalds on this issue. However, a conforming C++ implementation is expected to shield you from such issues.)
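For completeness, here is a minimal sketch of how that shielding looks in portable code (the Payload type and the value 42 are made up): the writer publishes a pointer with a release store, the reader loads it with acquire, and on an architecture like Alpha the implementation emits whatever barrier is needed so the dependent read through the pointer sees the fully constructed object.

#include <atomic>
#include <thread>

struct Payload { int value; };

std::atomic<Payload*> ptr{nullptr};

void writer()
{
    ptr.store(new Payload{42}, std::memory_order_release);  // publish a fully constructed object
}

void reader()
{
    Payload* p;
    while ((p = ptr.load(std::memory_order_acquire)) == nullptr)
        ;  // spin until the pointer has been published
    // The acquire load orders the dependent read below even on architectures
    // (such as DEC Alpha) that would otherwise need an explicit barrier.
    int v = p->value;  // guaranteed to read 42
    (void)v;
    delete p;
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
}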