OSDev.org

The Place to Start for Operating System Developers
 Post subject: Optimized memory functions?
PostPosted: Sat Oct 04, 2008 8:17 pm 

Joined: Fri Jun 22, 2007 12:47 pm
Posts: 1598
Location: New Hampshire, USA
Hey,
I've been brainstorming up some ways to optimize core memory functions such as 'memset' and 'memcpy'.
I wanted to make it so that a call like 'memset(&cArray, 0x00, 128);' wouldn't have to loop through 128 single-char copies. Most memset functions I have seen just decrement the 'count' variable in a loop and copy one char per iteration. I figured that since I'm programming a 64-bit OS I might as well take advantage of it; the same idea applies to 32-bit OSs as well.

Here's a rough prototype that I'm testing right now; I'd welcome any feedback on these memset and memcpy functions.

Code:
void memcpy(void *dest, void *src, uint64 count)
{
  uint8 *d = (uint8*)dest; /* byte pointers: arithmetic on void* is only a GCC extension */
  uint8 *s = (uint8*)src;

  while(count >= 8){ *(uint64*)d = *(uint64*)s; memcpy_transfers_64++; d += 8; s += 8; count -= 8; }
  while(count >= 4){ *(uint32*)d = *(uint32*)s; memcpy_transfers_32++; d += 4; s += 4; count -= 4; }
  while(count >= 2){ *(uint16*)d = *(uint16*)s; memcpy_transfers_16++; d += 2; s += 2; count -= 2; }
  while(count >= 1){ *d = *s; memcpy_transfers_8++; d += 1; s += 1; count -= 1; }
}

void memset(void *dest, uint8 sval, uint64 count)
{
  uint8 *d = (uint8*)dest; /* byte pointer: arithmetic on void* is only a GCC extension */
  uint64 val = sval;       /* replicate 'sval' into all 8 bytes */
  val |= val << 8;
  val |= val << 16;
  val |= val << 32;

  while(count >= 8){ *(uint64*)d = val; memset_transfers_64++; d += 8; count -= 8; }
  while(count >= 4){ *(uint32*)d = (uint32)val; memset_transfers_32++; d += 4; count -= 4; }
  while(count >= 2){ *(uint16*)d = (uint16)val; memset_transfers_16++; d += 2; count -= 2; }
  while(count >= 1){ *d = sval; memset_transfers_8++; d += 1; count -= 1; }
}


Say you want to memset 128 bytes: this would perform 16 64-bit transfers instead of 128 8-bit transfers. Or say you wanted to memset 18 bytes: it would perform two 64-bit transfers and one 16-bit transfer, instead of 18 single-byte operations.

What do you think of this type of optimization? What would be the pitfalls?

[edit]By the way, the memxxx_transfers_xx variables are just logging counters; if I weren't so lazy I would have taken them out.[/edit]

_________________
Website: https://Joscor.com


 Post subject: Re: Optimized memory functions?
PostPosted: Sat Oct 04, 2008 11:21 pm 

Joined: Tue Oct 17, 2006 6:06 pm
Posts: 1437
Location: Vancouver, BC, Canada
You need to handle the case where the starting address is not aligned properly.

_________________
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 6:44 am 

Joined: Tue Sep 23, 2008 1:45 pm
Posts: 158
Location: Eindhoven, Netherlands
You want to optimise the code in the loop as much as possible, so you should reduce the number of variables changed inside the loop. For memset, in fact, all you need is a single int64 pointer that you increment every iteration. Before the loop, create a pointer to the point past the last 64-bit value to be written: int64* end = (int64*) dest + (count / 8); Then the while loop only has to check whether the loop pointer equals the end pointer, which comes down to just a 'cmp' and a 'je' instruction.

Also, is using inline assembly an option? If so, you might want to consider using the movsd/movsq instructions to automate the moving, and to further reduce the amount of code inside the loop.


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 7:11 am 

Joined: Wed Feb 07, 2007 1:45 pm
Posts: 1401
Location: Eugene, OR, US
I have seen memsets that do what you are doing -- as Colonel Kernel says: you need to do a few byte operations to align the pointer on a qword boundary to begin with -- then do the 64 bit ops -- then maybe a few more byte ops to finish up. Don't bother with the dword and word ops, I'd say.

Also, precalculate the number of 64-bit loops, at least, and count it down to 0. It is always more efficient to test against 0.

And there are no pitfalls.


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 7:31 am 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

Some quick notes:
  • For copying or filling small areas of memory, the code to handle the first few bytes (alignment) and last few bytes will cause branch mis-predictions that will cost more than they're worth. For areas less than 128 bytes it's probably better to avoid these branches by moving bytes (e.g. REP MOVSB/STOSB).
  • For large areas, cache management is important - prefetch cache lines (or pre-load them if prefetch isn't supported) before you need them and flush cache lines as soon as you're finished with them to avoid polluting the cache (e.g. CLFLUSH instruction). For copying data, this includes the source and the destination. For example, if you write one dword in a cache line then the CPU may need to read the rest of the cache line's data before it can do the write (even if the rest of the cache line's data will be overwritten by later writes).
  • For very large areas, consider setting the destination area to "write combining" so that the CPU won't need to read any cache lines that are being modified. Unfortunately, setting pages to write combining involves invalidating the TLB entry for each page (INVLPG). Without using the PAT attributes in page tables you'd need to modify MTRRs and flush all caches (on all CPUs at the same time) which can be *expensive*. Don't forget to change back to "write-back" when you're done.
  • The CPU's string instructions (e.g. REP MOVSD/Q) are optimized so that under certain conditions they work on cache lines instead of individual dwords or qwords. The conditions depend on the CPU - for some source and destination must be aligned (but not for others). This also has some setup costs, and the CPU won't do this if the area is too small (to avoid the setup costs).
  • Using SSE can simplify cache management (non-temporal stores). SSE can also complicate things (alignment issues). For example, if source and destination aren't mutually aligned you can still get good performance by doing an aligned load, followed by a shift, followed by an aligned store.
  • A generic routine can't be optimized for any specific case. With several different special case routines a programmer can choose the best routine for the situation. For example, you could have a routine specifically designed for small memory areas that just uses "REP MOVSB" (and is always inlined - e.g. written as a macro) to minimize startup overheads.
  • Don't assume your code is better - always measure the performance on a variety of different computers. Something that's faster on one CPU may be slower on another CPU, and something that's faster on one CPU may be slower on a different computer with the exact same CPU because of different RAM speeds.
  • Avoiding the need to copy or fill memory is always going to be faster.

I'd also recommend finding out what the compiler does already. Most compilers have additional support for optimized memcpy() and memset() functions, and don't treat them like normal library functions. It's possible that your own "optimized" versions will make performance worse.

Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 1:15 pm 

Joined: Wed Jan 16, 2008 7:37 am
Posts: 47
Location: Wonsheim, Germany
These are my memcpy and memset functions. They are optimized for 32-bit.

Code:
void *memcpy(void *dest,const void *src,size_t n) {
  uint32_t num_dwords = n/4;
  uint32_t num_bytes = n%4;
  uint32_t *dest32 = (uint32_t*)dest;
  const uint32_t *src32 = (const uint32_t*)src; /* keep const; don't cast it away */
  uint8_t *dest8 = ((uint8_t*)dest)+num_dwords*4;
  const uint8_t *src8 = ((const uint8_t*)src)+num_dwords*4;
  uint32_t i;

  for (i=0;i<num_dwords;i++) {
    dest32[i] = src32[i];
  }
  for (i=0;i<num_bytes;i++) {
    dest8[i] = src8[i];
  }
  return dest;
}


Code:
void *memset(void *dest,int val,size_t n) {
  uint32_t num_dwords = n/4;
  uint32_t num_bytes = n%4;
  uint32_t *dest32 = (uint32_t*)dest;
  uint8_t *dest8 = ((uint8_t*)dest)+num_dwords*4;
  uint8_t val8 = (uint8_t)val; /* memset only uses the low byte of 'val' */
  uint32_t val32 = val8|((uint32_t)val8<<8)|((uint32_t)val8<<16)|((uint32_t)val8<<24);
  uint32_t i;

  for (i=0;i<num_dwords;i++) {
    dest32[i] = val32;
  }
  for (i=0;i<num_bytes;i++) {
    dest8[i] = val8;
  }
  return dest;
}


So what they basically do is handle 4 bytes at once, and then when fewer than 4 bytes are left they finish byte-wise.

Maybe it would be more architecture-independent to use int (or unsigned int) instead of uint32_t, and sizeof(int) wherever the size of the fastest integer is needed.

_________________
meinOS


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 3:51 pm 

Joined: Fri Mar 17, 2006 12:00 am
Posts: 500
Location: Napoli, Italy
Code:
static inline void *memcpy(void *to, const void *from, unsigned int n)
{
   void *ret = to; /* 'to' is advanced below, so remember the original */

   if(cpu.feature.flags.edx.sse)
   {
      unsigned int i;
      for(i=0; i<n/16; i++)
      {
         /* movups tolerates unaligned addresses; a non-temporal store
            (movntdq) would additionally require a 16-byte aligned destination */
         __asm__ __volatile__ ("movups (%0), %%xmm0\n" "movups %%xmm0, (%1)\n"::"r"(from), "r"(to) : "memory");

         from += 16;
         to += 16;
      }
      n &= 15; /* bytes left over after the 16-byte blocks */
   }
   else if(cpu.feature.flags.edx.mmx)
   {
      unsigned int i;
      for(i=0; i<n/8; i++)
      {
         __asm__ __volatile__ ("movq (%0), %%mm0\n" "movq %%mm0, (%1)\n"::"r"(from), "r"(to):"memory");
         from += 8;
         to += 8;
      }
      n &= 7; /* bytes left over after the 8-byte blocks */
      /* note: an emms is needed before any FPU code runs again */
   }
   if(n)
   {
      int d0, d1, d2;
      __asm__ __volatile__(
      "rep ; movsl\n\t"
      "testb $2,%b4\n\t"
      "je 1f\n\t"
      "movsw\n"
      "1:\ttestb $1,%b4\n\t"
      "je 2f\n\t"
      "movsb\n"
      "2:"
      : "=&c" (d0), "=&D" (d1), "=&S" (d2)
      :"0" (n/4), "q" (n),"1" ((long) to),"2" ((long) from)
      : "memory");
   }
   return ret;
}


This is my memcpy, optimized for SSE and MMX. I will also support SSE2 and SSE3 if possible.
This memcpy is really, really fast. When I implemented a GUI for my old OS, moving windows was impossibly slow without this optimized memcpy.

And this is memset:
Code:
static inline void *memset(void *s, char c, unsigned int count)
{
   int d0, d1;
   __asm__ __volatile__(
   "rep\n\t"
   "stosb"
   : "=&c" (d0), "=&D" (d1)
   :"a" (c),"1" (s),"0" (count)
   :"memory");
   return s;
}

_________________
Rewriting virtual memory manager - Working on ELF support - Working on Device Drivers Handling

http://sourceforge.net/projects/jeko - Jeko Operating System


 Post subject: Re: Optimized memory functions?
PostPosted: Sun Oct 05, 2008 5:46 pm 

Joined: Wed Jun 11, 2008 5:30 pm
Posts: 27
01000101 wrote:
I figured that, since I'm programming a 64-bit OS, that I might as well take advantage of it

You certainly should. All long-mode CPUs have SSE2, I think.

So here it goes: clearing allocated memory before handing it to the user process:
Code:
;input: rsi - 4KB aligned addr, eax - number of 4KB blocks

  shl  eax, 12
  pxor xmm0, xmm0
  neg  rax
  sub  rsi, rax
.zero:
  movaps [rsi+rax], xmm0
  movaps [rsi+rax+16], xmm0
  movaps [rsi+rax+32], xmm0
  movaps [rsi+rax+48], xmm0
  movaps [rsi+rax+64], xmm0
  movaps [rsi+rax+80], xmm0
  movaps [rsi+rax+96], xmm0
  movaps [rsi+rax+112], xmm0
  movaps [rsi+rax+128], xmm0
  movaps [rsi+rax+144], xmm0
  movaps [rsi+rax+160], xmm0
  movaps [rsi+rax+176], xmm0
  movaps [rsi+rax+192], xmm0
  movaps [rsi+rax+208], xmm0
  movaps [rsi+rax+224], xmm0
  movaps [rsi+rax+240], xmm0
  add    rax, 256
  jnz    .zero                 


Tip: having multiple movs (movaps, movq, ...) in a single loop pass does speed things up.

And a few links to more appropriate threads:
link 1
link2


 Post subject: Re: Optimized memory functions?
PostPosted: Wed Oct 08, 2008 6:04 pm 

Joined: Thu Nov 09, 2006 6:23 pm
Posts: 269
Personally, on a 64-bit machine, I would do 64-bit copies and implement Duff's device.

http://en.wikipedia.org/wiki/Duff%27s_device


 Post subject: Re: Optimized memory functions?
PostPosted: Wed Oct 08, 2008 7:21 pm 

Joined: Sat Jan 15, 2005 12:00 am
Posts: 8561
Location: At his keyboard!
Hi,

geppy wrote:
Tip: having multiple mov(movaps,movq,...) in a single loop pass does speed up things


Especially for copying data, I wouldn't recommend using MOVAPS for arbitrary bytes, because MOVAPS works on single precision floating point numbers - the CPU may modify your data while converting from an unusual floating point encoding into a usual one (e.g. converting denormals into normals, or denormals to zero), and some combinations of arbitrary bytes may represent SNaNs (where SNaNs may trigger exceptions). (Note: for clearing memory it doesn't make much difference, because the integer 0x00000000 is the floating point encoding for positive zero.)

Instead, it'd be better to use MOVDQA (which moves an aligned double quadword of integer data). For filling many pages with zero it'd be even better to use MOVNTDQ (which also moves integer double quadwords, but includes a non-temporal hint) to avoid filling the caches with zeros... Note: MOVNTDQ won't improve the speed of the copy/fill operation itself, but will prevent lots of cache misses in code that runs after the copy/fill operation has completed.


Cheers,

Brendan

_________________
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.


 Post subject: Re: Optimized memory functions?
PostPosted: Tue Nov 11, 2008 7:16 pm 

Joined: Mon Nov 03, 2008 7:42 pm
Posts: 34
Jeko wrote:
[the SSE/MMX memcpy and "rep stosb" memset quoted in full above]
Hi Jeko, I'd love to use this code in my MSVC kernel, but I can't use this inline assembly as it's not compatible with MSVC.

Could you possibly translate it into generic assembly? I've never encountered this type of inline assembly with numbers and string quotes everywhere :P Unless you know the MSVC inline assembly syntax:

Code:
__asm {
    mov eax, [myFunctionVar]
    int 0x10
    ; etc etc
}


It would be a great help, as I've never used any extended CPU features like SSE1/2/3DNow!...
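Not speaking for Jeko, but a rough, untested sketch of just the "rep movs" tail in 32-bit MSVC syntax might look like this (the function name is hypothetical; MSVC's inline assembler doesn't use GCC's constraint syntax, and Microsoft's documentation says you don't need to preserve ESI/EDI inside an __asm block):

Code:
void *memcpy_msvc(void *to, const void *from, unsigned int n)
{
    __asm {
        mov esi, from    ; source
        mov edi, to      ; destination
        mov ecx, n
        shr ecx, 2       ; dword count
        rep movsd
        mov ecx, n
        and ecx, 3       ; 0..3 leftover bytes
        rep movsb
    }
    return to;
}

The SSE parts would need a separate .asm file or intrinsics (e.g. from <emmintrin.h>), and note that 64-bit MSVC drops __asm entirely.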


 Post subject: Re: Optimized memory functions?
PostPosted: Mon Nov 17, 2008 12:00 pm 

Joined: Tue Apr 24, 2007 5:35 am
Posts: 14
iammisc wrote:
...Duff's device...

Thanks for this piece of information! I decided to try Duff's device in my kernel project, and the results were pretty good. The starting point was that I used the following code for my memory copy function:

Code:
    while (length != 0)
        {
        *(p_dst++) = *(p_src++);
        length--;
        }


I changed the implementation to Duff's device (as shown in the wiki, plus a small change so that the dst pointer is also incremented). On my 32-bit ARM7TDMI based platform running at 48 MHz, the SRAM-to-SRAM transfer rate was initially 1.29 MB/s; with Duff's device it is now 1.88 MB/s, so Duff's device gives ~45% better memory copy performance on this hardware platform.


 Post subject: Re: Optimized memory functions?
PostPosted: Mon Nov 17, 2008 3:01 pm 

Joined: Sat May 17, 2008 4:05 am
Posts: 263
Location: Cyperspace, Denmark
Well, if I remember right: why not use SIMD to transfer the big chunks (I don't know anything about SIMD) and rep movsd on the small parts?

But I feel I have missed something in this thread?

KMT dk

_________________
well, what to say, to much to do in too little space.
when it goes up hill, increase work, when it goes straight, test yourself but when going down, slow down.


 Post subject: Re: Optimized memory functions?
PostPosted: Mon Nov 17, 2008 7:22 pm 

Joined: Tue Sep 23, 2008 1:45 pm
Posts: 158
Location: Eindhoven, Netherlands
See Brendan's post. ;)


 Post subject: Re: Optimized memory functions?
PostPosted: Tue Nov 18, 2008 1:19 am 

Joined: Sat May 17, 2008 4:05 am
Posts: 263
Location: Cyperspace, Denmark
I see now; I think I must have overlooked it.

Sorry...



KMT dk

_________________
well, what to say, to much to do in too little space.
when it goes up hill, increase work, when it goes straight, test yourself but when going down, slow down.

