Hi,
NickJohnson wrote:
Modern processors definitely don't load memory on a byte-by-byte basis; the reality is much more complex. At the very least, the smallest unit of data transfer between main memory and the last level of cache (L2 or L3) is the size of a last level cache line, which is (e.g. on Haswell) 64 *bytes* per line, or 512 bits. This ignores prefetching, unaligned accesses, and other latency-hiding mechanisms that might increase how much data is transferred.
Yes - modern 80x86 typically loads cache lines.
NickJohnson wrote:
So, in reality, the processor is not capable of doing single-byte transfers or even 128-bit transfers to/from memory. The real transfers are much larger.
For "uncached" areas (e.g. memory mapped IO) the CPU will read/write individual bytes when software tells it to; including areas of RAM that are configured as "uncached" (e.g. the firmware's SMM area). In addition, it's possible to force the CPU to write a byte to RAM even for "write-back" cached areas (e.g. by using a MASKMOVDQU instruction with all bytes masked except one, followed by an SFENCE or MFENCE).
Columbus wrote:
Why is the smallest addressable data always 1 Byte or 8 Bits wide?
Why hasn't someone extended it to 16 Bits?
There are CPUs that aren't byte-addressable at all, that are only capable of accessing (e.g.) 16 bits or 32 bits of data from RAM at a time. The problem is that you end up emulating byte accesses in software (e.g. doing a "load, shift, mask" sequence instead of a 1-byte load, and a "load, mask, shift, or, store" sequence instead of a 1-byte store) and it ends up being significantly slower. The alternative is for software to never use anything smaller than the CPU's minimum access size (e.g. "CHAR_BIT == 16" in C), which can waste a lot of memory and make caches less efficient, so it also ends up being significantly slower.
Columbus wrote:
Wouldn't that reduce the size and/or complexity of some mechanisms?
Maybe one could introduce a 32-bit wide "Byte" (smallest addressable unit).
It would reduce the complexity of the CPU a little (and make the CPU slower in practice). However, CPU manufacturers are trying to do the opposite - they've got a budget of "many millions of transistors" and are trying to find ways of using those transistors to improve performance. For example, Intel's Haswell CPUs are using around 1.4 billion transistors, and Apple's A8 chip (which contains an ARMv8 core) is using around 2 billion transistors. They can afford to use a few extra transistors to improve the performance of byte accesses.
Cheers,
Brendan