OSDev.org

The Place to Start for Operating System Developers
 Post subject: Optimized memory functions?
PostPosted: Sat Oct 04, 2008 8:17 pm 

Joined: Fri Jun 22, 2007 12:47 pm
Posts: 1598
Location: New Hampshire, USA
Hey,
I've been brainstorming up some ways to optimize core memory functions such as 'memset' and 'memcpy'.
I wanted to make it so that a call like 'memset(&cArray, 0x00, 128);' wouldn't have to loop through 128 single-char copies. Most memset functions I have seen just decrement the 'count' variable in a loop and copy one char per iteration. I figured that since I'm programming a 64-bit OS I might as well take advantage of it; the same idea applies to 32-bit OSs as well.

Here's a rough prototype that I'm testing right now; I'd welcome any feedback on these memset and memcpy functions.

Code:
void memcpy(void *dest, void *src, uint64 count)
{
  uint8 *d = (uint8*)dest; /* byte pointers: arithmetic on void* is only a GCC extension */
  uint8 *s = (uint8*)src;

  while(count >= 8){ *(uint64*)d = *(uint64*)s; memcpy_transfers_64++; d += 8; s += 8; count -= 8; }
  while(count >= 4){ *(uint32*)d = *(uint32*)s; memcpy_transfers_32++; d += 4; s += 4; count -= 4; }
  while(count >= 2){ *(uint16*)d = *(uint16*)s; memcpy_transfers_16++; d += 2; s += 2; count -= 2; }
  while(count >= 1){ *d = *s; memcpy_transfers_8++; d += 1; s += 1; count -= 1; }
}

void memset(void *dest, uint8 sval, uint64 count)
{
  uint8 *d = (uint8*)dest; /* byte pointer: arithmetic on void* is only a GCC extension */
  uint64 val = sval;       /* replicate 'sval' into all 8 bytes */
  val |= val << 8;
  val |= val << 16;
  val |= val << 32;

  while(count >= 8){ *(uint64*)d = val; memset_transfers_64++; d += 8; count -= 8; }
  while(count >= 4){ *(uint32*)d = (uint32)val; memset_transfers_32++; d += 4; count -= 4; }
  while(count >= 2){ *(uint16*)d = (uint16)val; memset_transfers_16++; d += 2; count -= 2; }
  while(count >= 1){ *d = sval; memset_transfers_8++; d += 1; count -= 1; }
}


Say you want to memset 128 bytes: this would perform 16 64-bit transfers instead of 128 8-bit transfers. Or say you wanted to memset 18 bytes: it would perform two 64-bit transfers and one 16-bit transfer, instead of 18 single-byte operations.

What do you think of this type of optimization? What would be the pitfalls?

[edit]By the way, the memxxx_transfers_xx variables are just logging counters; if I weren't so lazy I would have taken them out.[/edit]

_________________
Website: https://Joscor.com


 Post subject: Re: Optimized memory functions?
PostPosted: Sat Oct 04, 2008 11:21 pm 

Joined: Tue Oct 17, 2006 6:06 pm
Posts: 1437
Location: Vancouver, BC, Canada
You need to handle the case where the starting address is not aligned properly.

_________________
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 6:44 am 

Joined: Tue Sep 23, 2008 1:45 pm
Posts: 158
Location: Eindhoven, Netherlands
You want to optimise the code in the loop as much as possible, so you should reduce the number of variables changed inside the loop. For memset, in fact, all you need is a single int64 pointer that you increment every iteration. Before the loop, create a pointer to the point past the last 64-bit value to be written: int64* end = (int64*) dest + (count / 8); Then the while loop only has to check whether the loop pointer equals the end pointer, which comes down to just a 'cmp' and a 'je' instruction.

Also, is using inline assembly an option? If so, you might want to consider using the movsd/movsq instructions to automate the moving, and to further reduce the amount of code inside the loop.


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 7:11 am 

Joined: Wed Feb 07, 2007 1:45 pm
Posts: 1401
Location: Eugene, OR, US
I have seen memsets that do what you are doing -- as Colonel Kernel says: you need to do a few byte operations to align the pointer on a qword boundary to begin with -- then do the 64 bit ops -- then maybe a few more byte ops to finish up. Don't bother with the dword and word ops, I'd say.

Also, precalculate the number of 64-bit loops, at least, and count it down to 0. It is always more efficient to test against 0.

And there are no pitfalls.


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 7:31 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Some quick notes:
  • For copying or filling small areas of memory, the code to handle the first few bytes (alignment) and last few bytes will cause branch mis-predictions that will cost more than they're worth. For areas less than 128 bytes it's probably better to avoid these branches by moving bytes (e.g. REP MOVSB/STOSB).
  • For large areas, cache management is important - prefetch cache lines (or pre-load them if prefetch isn't supported) before you need them and flush cache lines as soon as you're finished with them to avoid polluting the cache (e.g. CLFLUSH instruction). For copying data, this includes the source and the destination. For example, if you write one dword in a cache line then the CPU may need to read the rest of the cache line's data before it can do the write (even if the rest of the cache line's data will be overwritten by later writes).
  • For very large areas, consider setting the destination area to "write combining" so that the CPU won't need to read any cache lines that are being modified. Unfortunately, setting pages to write combining involves invalidating the TLB entry for each page (INVLPG). Without using the PAT attributes in page tables you'd need to modify MTRRs and flush all caches (on all CPUs at the same time) which can be *expensive*. Don't forget to change back to "write-back" when you're done.
  • The CPU's string instructions (e.g. REP MOVSD/Q) are optimized so that under certain conditions they work on cache lines instead of individual dwords or qwords. The conditions depend on the CPU - for some source and destination must be aligned (but not for others). This also has some setup costs, and the CPU won't do this if the area is too small (to avoid the setup costs).
  • Using SSE can simplify cache management (non-temporal stores). SSE can also complicate things (alignment issues). For example, if source and destination aren't mutually aligned you can still get good performance by doing an aligned load, followed by a shift, followed by an aligned store.
  • A generic routine can't be optimized for any specific case. With several different special case routines a programmer can choose the best routine for the situation. For example, you could have a routine specifically designed for small memory areas that just uses "REP MOVSB" (and is always inlined - e.g. written as a macro) to minimize startup overheads.
  • Don't assume your code is better - always measure the performance on a variety of different computers. Something that's faster on one CPU may be slower on another CPU, and something that's faster on one CPU may be slower on a different computer with the exact same CPU because of different RAM speeds.
  • Avoiding the need to copy or fill memory is always going to be faster.

I'd also recommend finding out what the compiler does already. Most compilers have additional support for optimized memcpy() and memset() functions, and don't treat them like normal library functions. It's possible that your own "optimized" versions will make performance worse.

Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 1:15 pm 

Joined: Wed Jan 16, 2008 7:37 am
Posts: 47
Location: Wonsheim, Germany
These are my memcpy and memset functions. They are optimized for 32-bit.

Code:
void *memcpy(void *dest,const void *src,size_t n) {
  uint32_t num_dwords = n/4;
  uint32_t num_bytes = n%4;
  uint32_t *dest32 = (uint32_t*)dest;
  const uint32_t *src32 = (const uint32_t*)src; /* keep const; don't cast it away */
  uint8_t *dest8 = ((uint8_t*)dest)+num_dwords*4;
  const uint8_t *src8 = ((const uint8_t*)src)+num_dwords*4;
  uint32_t i;

  for (i=0;i<num_dwords;i++) {
    dest32[i] = src32[i];
  }
  for (i=0;i<num_bytes;i++) {
    dest8[i] = src8[i];
  }
  return dest;
}


Code:
void *memset(void *dest,int val,size_t n) {
  uint32_t num_dwords = n/4;
  uint32_t num_bytes = n%4;
  uint32_t *dest32 = (uint32_t*)dest;
  uint8_t *dest8 = ((uint8_t*)dest)+num_dwords*4;
  uint8_t val8 = (uint8_t)val; /* memset only uses the low byte of 'val' */
  uint32_t val32 = val8|((uint32_t)val8<<8)|((uint32_t)val8<<16)|((uint32_t)val8<<24);
  uint32_t i;

  for (i=0;i<num_dwords;i++) {
    dest32[i] = val32;
  }
  for (i=0;i<num_bytes;i++) {
    dest8[i] = val8;
  }
  return dest;
}


So what they basically do is handle 4 bytes at once, and then when fewer than 4 bytes are left they finish byte-wise.

Maybe it would be more architecture-independent to use int (or unsigned int) instead of uint32_t, and sizeof(int) wherever the size of the fastest integer is needed.

_________________
meinOS


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 3:51 pm 

Joined: Fri Mar 17, 2006 12:00 am
Posts: 500
Location: Napoli, Italy
Code:
static inline void *memcpy(void *to, const void *from, unsigned int n)
{
   void *ret = to; /* 'to' is advanced below, so remember the original */

   if(cpu.feature.flags.edx.sse)
   {
      unsigned int i;
      for(i=0; i<n/16; i++)
      {
         /* movups tolerates unaligned addresses; a non-temporal store
            (movntdq) would additionally require a 16-byte aligned destination */
         __asm__ __volatile__ ("movups (%0), %%xmm0\n" "movups %%xmm0, (%1)\n"::"r"(from), "r"(to) : "memory");

         from += 16;
         to += 16;
      }
      n &= 15; /* bytes left over after the 16-byte blocks */
   }
   else if(cpu.feature.flags.edx.mmx)
   {
      unsigned int i;
      for(i=0; i<n/8; i++)
      {
         __asm__ __volatile__ ("movq (%0), %%mm0\n" "movq %%mm0, (%1)\n"::"r"(from), "r"(to):"memory");
         from += 8;
         to += 8;
      }
      n &= 7; /* bytes left over after the 8-byte blocks */
      /* note: an emms is needed before any FPU code runs again */
   }
   if(n)
   {
      int d0, d1, d2;
      __asm__ __volatile__(
      "rep ; movsl\n\t"
      "testb $2,%b4\n\t"
      "je 1f\n\t"
      "movsw\n"
      "1:\ttestb $1,%b4\n\t"
      "je 2f\n\t"
      "movsb\n"
      "2:"
      : "=&c" (d0), "=&D" (d1), "=&S" (d2)
      :"0" (n/4), "q" (n),"1" ((long) to),"2" ((long) from)
      : "memory");
   }
   return ret;
}


This is my memcpy, optimized for SSE and MMX. I will also support SSE2 and SSE3 if possible.
This memcpy is really, really fast. When I implemented a GUI for my old OS, moving windows was impossibly slow without this optimized memcpy.

And this is memset:
Code:
static inline void *memset(void *s, char c, unsigned int count)
{
   int d0, d1;
   __asm__ __volatile__(
   "rep\n\t"
   "stosb"
   : "=&c" (d0), "=&D" (d1)
   :"a" (c),"1" (s),"0" (count)
   :"memory");
   return s;
}

_________________
Rewriting virtual memory manager - Working on ELF support - Working on Device Drivers Handling

http://sourceforge.net/projects/jeko - Jeko Operating System


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 5:46 pm 

Joined: Wed Jun 11, 2008 5:30 pm
Posts: 27
01000101 wrote:
I figured that, since I'm programming a 64-bit OS, that I might as well take advantage of it

You certainly should. All long-mode CPUs have SSE2, I think.

So here it goes: clearing allocated memory before handing it to the user process:
Code:
;input: rsi - 4KB aligned addr, eax - number of 4KB blocks

  shl  eax, 12
  pxor xmm0, xmm0
  neg  rax
  sub  rsi, rax
.zero:
  movaps [rsi+rax], xmm0
  movaps [rsi+rax+16], xmm0
  movaps [rsi+rax+32], xmm0
  movaps [rsi+rax+48], xmm0
  movaps [rsi+rax+64], xmm0
  movaps [rsi+rax+80], xmm0
  movaps [rsi+rax+96], xmm0
  movaps [rsi+rax+112], xmm0
  movaps [rsi+rax+128], xmm0
  movaps [rsi+rax+144], xmm0
  movaps [rsi+rax+160], xmm0
  movaps [rsi+rax+176], xmm0
  movaps [rsi+rax+192], xmm0
  movaps [rsi+rax+208], xmm0
  movaps [rsi+rax+224], xmm0
  movaps [rsi+rax+240], xmm0
  add    rax, 256
  jnz    .zero                 


Tip: having multiple movs (movaps, movq, ...) in a single loop pass does speed things up.

And a few links to more appropriate threads:
link 1
link2


 Post subject: Re: Optimized memory functions?
PostPosted: Wed Oct 08, 2008 6:04 pm 

Joined: Thu Nov 09, 2006 6:23 pm
Posts: 269
Personally, on a 64-bit machine, I would do 64-bit copies and implement Duff's device.

http://en.wikipedia.org/wiki/Duff%27s_device


 Post subject: Re: Optimized memory functions?
PostPosted: Wed Oct 08, 2008 7:21 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

geppy wrote:
Tip: having multiple mov(movaps,movq,...) in a single loop pass does speed up things


Especially for copying data, I wouldn't recommend using MOVAPS for arbitrary bytes, because MOVAPS works on single precision floating point numbers - the CPU may modify your data while converting from an unusual floating point encoding into a usual one (e.g. converting denormals into normals, or denormals to zero), and some combinations of arbitrary bytes may represent SNaNs (where SNaNs may trigger exceptions). (Note: for clearing memory it doesn't make much difference, because the integer 0x00000000 is the floating point encoding for positive zero.)

Instead, it'd be better to use MOVDQA (which moves an aligned double quadword of integer data). For filling many pages with zero it'd be even better to use MOVNTDQ (which also moves integer double quadwords, but includes a non-temporal hint) to avoid filling the caches with zeros... Note: MOVNTDQ won't improve the speed of the copy/fill operation itself, but will prevent lots of cache misses in code that runs after the copy/fill operation has completed.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Optimized memory functions?
PostPosted: Tue Nov 11, 2008 7:16 pm 

Joined: Mon Nov 03, 2008 7:42 pm
Posts: 34
Jeko wrote:
[the SSE/MMX memcpy and "rep stosb" memset quoted in full above]
Hi Jeko, I'd love to use this code in my MSVC kernel, but I can't use this inline assembly as it's not compatible with MSVC.

Could you possibly translate it into generic assembly? I've never encountered this type of inline assembly with numbers and string quotes everywhere :P Unless you know the MSVC inline assembly syntax:

Code:
__asm {
    mov eax, [myFunctionVar]
    int 0x10
    ; etc etc
}


It would be a great help, as I've never used any extended CPU features like SSE1/2/3DNow!...
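Not speaking for Jeko, but a rough, untested sketch of just the "rep movs" tail in 32-bit MSVC syntax might look like this (the function name is hypothetical; MSVC's inline assembler doesn't use GCC's constraint syntax, and Microsoft's documentation says you don't need to preserve ESI/EDI inside an __asm block):

Code:
void *memcpy_msvc(void *to, const void *from, unsigned int n)
{
    __asm {
        mov esi, from    ; source
        mov edi, to      ; destination
        mov ecx, n
        shr ecx, 2       ; dword count
        rep movsd
        mov ecx, n
        and ecx, 3       ; 0..3 leftover bytes
        rep movsb
    }
    return to;
}

The SSE parts would need a separate .asm file or intrinsics (e.g. from <emmintrin.h>), and note that 64-bit MSVC drops __asm entirely.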


 Post subject: Re: Optimized memory functions?
PostPosted: Mon Nov 17, 2008 12:00 pm 

Joined: Tue Apr 24, 2007 5:35 am
Posts: 14
iammisc wrote:
...Duff's device...

Thanks for this piece of information! I decided to try Duff's device in my kernel project, and the results were pretty good. The starting point was that I used the following code for my memory copy function:

Code:
    while (length != 0)
        {
        *(p_dst++) = *(p_src++);
        length--;
        }


I changed the implementation to Duff's device (as shown in the wiki, plus a small change so that the dst pointer is also incremented). On my 32-bit ARM7TDMI based platform running at 48 MHz, the SRAM-to-SRAM transfer rate was initially 1.29 MB/s; with Duff's device it is now 1.88 MB/s, so Duff's device gives ~45% better memory copy performance on this hardware platform.


 Post subject: Re: Optimized memory functions?
PostPosted: Mon Nov 17, 2008 3:01 pm 

Joined: Sat May 17, 2008 4:05 am
Posts: 263
Location: Cyperspace, Denmark
Well, if I remember right: why not use SIMD to transfer the big chunks (I don't know anything about SIMD) and rep movsd on the small parts?

But I feel I have missed something in this thread?

KMT dk

_________________
well, what to say, to much to do in too little space.
when it goes up hill, increase work, when it goes straight, test yourself but when going down, slow down.


 Post subject: Re: Optimized memory functions?
PostPosted: Mon Nov 17, 2008 7:22 pm 

Joined: Tue Sep 23, 2008 1:45 pm
Posts: 158
Location: Eindhoven, Netherlands
See Brendan's post. ;)


 Post subject: Re: Optimized memory functions?
PostPosted: Tue Nov 18, 2008 1:19 am 

Joined: Sat May 17, 2008 4:05 am
Posts: 263
Location: Cyperspace, Denmark
I see now; I think I must have overlooked it.

Sorry...



KMT dk

_________________
well, what to say, to much to do in too little space.
when it goes up hill, increase work, when it goes straight, test yourself but when going down, slow down.

