devc1 wrote:
What is the memory access granularity of an x86_64 CPU ?
Typically 64 bytes (the L1 cache line size), at least for memory with write-back caching enabled.
devc1 wrote:
The result was a massive difference compared to load/store inside a single page with 450 ms against 1900 ms.
Yeah, the manuals warn about unaligned accesses crossing page boundaries. Inside a cache line, the effects are barely measurable; across cache lines, there is some effect; across page boundaries, latency spikes.
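To make the three cases concrete, here is a rough sketch that classifies an access by which boundaries it straddles. The 64-byte line and 4 KiB page sizes are assumptions (typical for x86_64, but not guaranteed, and huge pages change the page case), and `classify_access` and the enum names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes: typical for x86_64, not guaranteed for every CPU or mapping. */
#define CACHE_LINE_SIZE 64u
#define PAGE_SIZE_4K    4096u

enum access_kind { WITHIN_LINE, CROSSES_LINE, CROSSES_PAGE };

/* Classify an access of len bytes starting at addr (hypothetical helper). */
static enum access_kind classify_access(uintptr_t addr, size_t len)
{
    assert(len > 0);
    uintptr_t last = addr + len - 1;   /* address of the last byte touched */

    /* A page crossing is also a line crossing, so check the page first. */
    if (addr / PAGE_SIZE_4K != last / PAGE_SIZE_4K)
        return CROSSES_PAGE;
    if (addr / CACHE_LINE_SIZE != last / CACHE_LINE_SIZE)
        return CROSSES_LINE;
    return WITHIN_LINE;
}
```

For example, an 8-byte access at 0x1000 stays within one line, one at 0x103C spills into the next line, and one at 0x1FFC spills into the next page.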
devc1 wrote:
Sometimes (at first) accessing address 0x1000 is so slow but accessing 0x1010 is so fast.
Same L1 cache line. The access to 0x1000 has already created the TLB entry and filled the L1 cache line, so the access to 0x1010 then hits that same line.
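If you want to check whether two addresses land in the same line, masking off the low bits is enough. A minimal sketch, again assuming 64-byte lines; `line_base` and `same_line` are hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* assumed L1 line size; must be a power of two */

/* Round an address down to the start of its cache line. */
static uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(LINE_SIZE - 1);
}

/* True if both addresses fall inside the same cache line. */
static bool same_line(uintptr_t a, uintptr_t b)
{
    return line_base(a) == line_base(b);
}
```

With these, 0x1000 and 0x1010 both map to the line starting at 0x1000, while 0x1040 is the first byte of the next line.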