OSDev.org

The Place to Start for Operating System Developers
It is currently Thu Mar 28, 2024 11:09 am

All times are UTC - 6 hours




Post new topic Reply to topic  [ 11 posts ] 
Author Message
 Post subject: How can a nop instruction speed up my test program?
PostPosted: Sun Oct 30, 2022 11:48 am 
Offline
User avatar

Joined: Thu Jul 13, 2017 9:52 am
Posts: 22
I write a test program to test cache performance on x86 and found an unexpected problem.

Code:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define BUF_SIZE 8192
#define ROUND 100000000UL
int main(int argc, char **argv)
{
   char *buf, *buf_newaddr, *buf_pageend;
   unsigned long i __attribute__((aligned(64)));
   int buf_realsize;
   unsigned long offset __attribute__((aligned(64)));
   struct timespec start={0,0}, end={0,0};
   double start_ns, end_ns;

   if (argc != 2) {
      printf("missing args\n");
      exit(-1);
   }

   offset = atoi(argv[1]);

again:
   buf = (void *)malloc(BUF_SIZE);
   buf_pageend = (void *)((unsigned long)(buf + 4095) & 0xfffffffffffff000UL);
   if (buf_pageend - buf < 1024) { // make sure we have enough space for negative 'offset'
      // don't free, occupt it in order to alloc another different block
      goto again;
   }
   memset(buf, 0, BUF_SIZE);

   printf("&i = %lx, &offset=%lx\n", &i, &offset);
   clock_gettime(CLOCK_MONOTONIC, &start);
   for (i = 0; i < ROUND; i++) {
      *((unsigned long *)(buf_pageend + offset)) = 0; // mark
   }
   clock_gettime(CLOCK_MONOTONIC, &end);
   start_ns = start.tv_sec*1000000000 + start.tv_nsec;
   end_ns = end.tv_sec*1000000000 + end.tv_nsec;
   printf("ns: %lf\n", (end_ns - start_ns)/ROUND);
}


The asm corresponding to the loop is as following:

Code:
        call    clock_gettime
        movq    $0, -112(%rbp)
        jmp     .L5
.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        movq    -112(%rbp), %rax
        addq    $1, %rax
        movq    %rax, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax
        jbe     .L6
        leaq    -208(%rbp), %rax
        movq    %rax, %rsi
        movl    $1, %edi
        call    clock_gettime


The program will print:

Code:
$ ./a.out 0
&i = 7fffd9811700, &offset=7fffd98116c0
ns: 3.217088


But if I insert a nop instruction:

Quote:
.L6:
movq -176(%rbp), %rdx
movq -64(%rbp), %rax
addq %rdx, %rax
nop
movq $0, (%rax)
movq -112(%rbp), %rax
addq $1, %rax
movq %rax, -112(%rbp)
.L5:


The result will be:

Code:
$ ./a.out 0
&i = 7ffef00bb380, &offset=7ffef00bb340
ns: 2.104905


My program is compiled with O0.
I can't understand how the nop instruction speed up my program. Thanks!


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Sun Oct 30, 2022 2:25 pm 
Offline
Member
Member
User avatar

Joined: Sat Mar 31, 2012 3:07 am
Posts: 4591
Location: Chichester, UK
I couldn’t say if it’s the case with your program, but nop instructions can alter execution times by altering the alignment of memory accesses.


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Sun Oct 30, 2022 4:06 pm 
Offline
Member
Member

Joined: Fri Feb 11, 2022 4:55 am
Posts: 435
Location: behind the keyboard
the added latency can be just from the host OS scheduler. Does it show the same result when you run it multiple times ?


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Sun Oct 30, 2022 4:15 pm 
Offline
Member
Member

Joined: Tue Apr 03, 2018 2:44 am
Posts: 401
haolee wrote:
I write a test program to test cache performance on x86 and found an unexpected problem.

My program is compiled with O0.
I can't understand how the nop instruction speed up my program. Thanks!


-O0 turns off most optimisations. When I use -O0, I do replicate your results, but see a much smaller effect, as much as 10% difference compared to your ~50% difference. Running 10 times in each binary, I get about 10% variation from fastest to slowest of each type, so the margin of error is probably quite high.

But once I turn optimisations on, your for loops is in fact invariant, so it's optimised away and replaced with a single store.

The difference with the nop is probably micro-architecture specific, relating to pipeline scheduling (my CPU is a Sandy Bridge based i5-2520M). I'd be surprised if it's instruction alignment based, Intel will have optimised that away in the 25+ years of the P6 derived architecture.

Your printf format specifiers are wrong as well. To print pointers, you should use "%p".


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Mon Oct 31, 2022 1:00 am 
Offline
User avatar

Joined: Thu Jul 13, 2017 9:52 am
Posts: 22
iansjack wrote:
I couldn’t say if it’s the case with your program, but nop instructions can alter execution times by altering the alignment of memory accesses.


This could be a reason. Do we have any methods to prove it?


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Mon Oct 31, 2022 1:01 am 
Offline
User avatar

Joined: Thu Jul 13, 2017 9:52 am
Posts: 22
devc1 wrote:
the added latency can be just from the host OS scheduler. Does it show the same result when you run it multiple times ?


I have run it many times and the result is very steady.


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Mon Oct 31, 2022 1:11 am 
Offline
User avatar

Joined: Thu Jul 13, 2017 9:52 am
Posts: 22
Quote:
-O0 turns off most optimisations. When I use -O0, I do replicate your results, but see a much smaller effect, as much as 10% difference compared to your ~50% difference. Running 10 times in each binary, I get about 10% variation from fastest to slowest of each type, so the margin of error is probably quite high.


Yes, the result may have a certain amount of error but the difference is still significant.

Quote:
But once I turn optimisations on, your for loops is in fact invariant, so it's optimised away and replaced with a single store.


O2 will eliminate the assignment statement and we can use
Code:
*((volatile unsigned long *)(buf_pageend + offset)) = 0;
to prevent it to be optimized out. And if we use O2 optimisation level, the code will be very fast and nop is useless.

Quote:
The difference with the nop is probably micro-architecture specific, relating to pipeline scheduling (my CPU is a Sandy Bridge based i5-2520M). I'd be surprised if it's instruction alignment based, Intel will have optimised that away in the 25+ years of the P6 derived architecture.


I also suspect nop instruction affects the pipeline but I can find any direct evidence.

Quote:
Your printf format specifiers are wrong as well. To print pointers, you should use "%p".


Yes, %p should be better. Thanks.


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Mon Oct 31, 2022 1:28 am 
Offline
User avatar

Joined: Thu Jul 13, 2017 9:52 am
Posts: 22
I use perf to collect some important counters. here are the results:

Without nop instruction:

Code:
$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./slow 0
&i = 7ffc07966f40, &offset=7ffc07966f00
ns: 3.728576

Performance counter stats for './slow 0':

       213,867,558      uops_executed.stall_cycles                                     (17.56%)
       491,423,003      uops_issued.stall_cycles                                      (23.45%)
       512,748,389      uops_retired.stall_cycles                                     (17.62%)
       883,538,689      cycle_activity.cycles_mem_any                                     (23.51%)
       204,676,274      icache_64b.iftag_hit                                          (23.55%)
             7,437      br_misp_retired.all_branches #    872.9 Branch_Misprediction_Cost  (23.55%)
           149,274      machine_clears.count                                          (23.55%)
     1,009,768,664      uops_issued.any                                               (23.55%)
     1,005,689,857      uops_retired.retire_slots                                     (23.56%)
         1,029,988      int_misc.recovery_cycles                                      (23.56%)
       889,352,008      cycles                                                        (29.44%)
       375,429,113      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.44%)
         1,303,779      int_misc.clear_resteer_cycles                                     (29.44%)
            24,653      baclears.any                                                  (29.44%)
     1,005,647,085      uops_retired.retire_slots #      0.9 UPI                      (29.44%)
     1,104,896,299      inst_retired.any                                              (29.44%)
     1,101,945,035      inst_retired.any          #      0.8 CPI                      (23.56%)
       889,131,234      cycles                                                        (23.56%)
     1,200,172,618      uops_executed.thread      #      1.7 ILP                      (17.60%)
       725,629,824      uops_executed.core_cycles_ge_1                                     (17.60%)

       0.374539772 seconds time elapsed

       0.370386000 seconds user
       0.000987000 seconds sys


With nop instruction:

Code:
$ $ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./fast 0
&i = 7ffd56fbeac0, &offset=7ffd56fbea80
ns: 2.212173

        91,068,788      uops_retired.stall_cycles                                     (18.01%)
       523,629,214      cycle_activity.cycles_mem_any                                     (23.87%)
         2,324,443      icache_64b.iftag_hit                                          (23.46%)
             4,823      br_misp_retired.all_branches #   1171.2 Branch_Misprediction_Cost  (23.42%)
           122,449      machine_clears.count                                          (23.43%)
     1,216,774,633      uops_issued.any                                               (23.42%)
     1,206,512,556      uops_retired.retire_slots                                     (23.42%)
           705,325      int_misc.recovery_cycles                                      (23.43%)
       528,397,938      cycles                                                        (29.28%)
        22,295,204      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.28%)
         1,151,776      int_misc.clear_resteer_cycles                                     (29.28%)
            15,162      baclears.any                                                  (29.28%)
     1,205,231,069      uops_retired.retire_slots #      1.0 UPI                      (29.28%)
     1,204,519,156      inst_retired.any                                              (29.28%)
     1,204,031,494      inst_retired.any          #      0.4 CPI                      (23.43%)
       528,232,797      cycles                                                        (23.43%)
     1,312,254,164      uops_executed.thread      #      2.6 ILP                      (17.57%)
       497,642,069      uops_executed.core_cycles_ge_1                                     (17.57%)

       0.222836590 seconds time elapsed

       0.217941000 seconds user
       0.002964000 seconds sys


We can see that the slower one has more cycle_activity.cycles_mem_any, uops_issued.stall_cycles, br_misp_retired.all_branches and icache_64b.iftag_hit. I don't know if this means the slower one has more misprediction in pipeline and thus have to re-fetch more instructions.


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Mon Oct 31, 2022 7:43 am 
Offline
Member
Member

Joined: Tue Apr 03, 2018 2:44 am
Posts: 401
haolee wrote:
I use perf to collect some important counters. here are the results:

Without nop instruction:

Code:
$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./slow 0
&i = 7ffc07966f40, &offset=7ffc07966f00
ns: 3.728576

Performance counter stats for './slow 0':

       213,867,558      uops_executed.stall_cycles                                     (17.56%)
       491,423,003      uops_issued.stall_cycles                                      (23.45%)
       512,748,389      uops_retired.stall_cycles                                     (17.62%)
       883,538,689      cycle_activity.cycles_mem_any                                     (23.51%)
       204,676,274      icache_64b.iftag_hit                                          (23.55%)
             7,437      br_misp_retired.all_branches #    872.9 Branch_Misprediction_Cost  (23.55%)
           149,274      machine_clears.count                                          (23.55%)
     1,009,768,664      uops_issued.any                                               (23.55%)
     1,005,689,857      uops_retired.retire_slots                                     (23.56%)
         1,029,988      int_misc.recovery_cycles                                      (23.56%)
       889,352,008      cycles                                                        (29.44%)
       375,429,113      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.44%)
         1,303,779      int_misc.clear_resteer_cycles                                     (29.44%)
            24,653      baclears.any                                                  (29.44%)
     1,005,647,085      uops_retired.retire_slots #      0.9 UPI                      (29.44%)
     1,104,896,299      inst_retired.any                                              (29.44%)
     1,101,945,035      inst_retired.any          #      0.8 CPI                      (23.56%)
       889,131,234      cycles                                                        (23.56%)
     1,200,172,618      uops_executed.thread      #      1.7 ILP                      (17.60%)
       725,629,824      uops_executed.core_cycles_ge_1                                     (17.60%)

       0.374539772 seconds time elapsed

       0.370386000 seconds user
       0.000987000 seconds sys


With nop instruction:

Code:
$ $ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./fast 0
&i = 7ffd56fbeac0, &offset=7ffd56fbea80
ns: 2.212173

        91,068,788      uops_retired.stall_cycles                                     (18.01%)
       523,629,214      cycle_activity.cycles_mem_any                                     (23.87%)
         2,324,443      icache_64b.iftag_hit                                          (23.46%)
             4,823      br_misp_retired.all_branches #   1171.2 Branch_Misprediction_Cost  (23.42%)
           122,449      machine_clears.count                                          (23.43%)
     1,216,774,633      uops_issued.any                                               (23.42%)
     1,206,512,556      uops_retired.retire_slots                                     (23.42%)
           705,325      int_misc.recovery_cycles                                      (23.43%)
       528,397,938      cycles                                                        (29.28%)
        22,295,204      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.28%)
         1,151,776      int_misc.clear_resteer_cycles                                     (29.28%)
            15,162      baclears.any                                                  (29.28%)
     1,205,231,069      uops_retired.retire_slots #      1.0 UPI                      (29.28%)
     1,204,519,156      inst_retired.any                                              (29.28%)
     1,204,031,494      inst_retired.any          #      0.4 CPI                      (23.43%)
       528,232,797      cycles                                                        (23.43%)
     1,312,254,164      uops_executed.thread      #      2.6 ILP                      (17.57%)
       497,642,069      uops_executed.core_cycles_ge_1                                     (17.57%)

       0.222836590 seconds time elapsed

       0.217941000 seconds user
       0.002964000 seconds sys


We can see that the slower one has more cycle_activity.cycles_mem_any, uops_issued.stall_cycles, br_misp_retired.all_branches and icache_64b.iftag_hit. I don't know if this means the slower one has more misprediction in pipeline and thus have to re-fetch more instructions.



I would turn off hyper-threading, start up in single user mode, and run the tests again. The numbers above look like the pipeline might be switching between threads of execution, and the nop is perhaps allowing that switch to happen at a more opportune time (note, this is just a guess, I'm no hyper-threading expert.)

With no hyper-threading, I suspect the difference will narrow. I'll also check my hyper-threading status.


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Mon Oct 31, 2022 1:14 pm 
Offline
Member
Member

Joined: Tue Apr 03, 2018 2:44 am
Posts: 401
thewrongchristian wrote:
haolee wrote:
I use perf to collect some important counters. here are the results:

[snip]



I would turn off hyper-threading, start up in single user mode, and run the tests again. The numbers above look like the pipeline might be switching between threads of execution, and the nop is perhaps allowing that switch to happen at a more opportune time (note, this is just a guess, I'm no hyper-threading expert.)

With no hyper-threading, I suspect the difference will narrow. I'll also check my hyper-threading status.


I did some tests using perf, and the consistent result I saw was a higher CPU clock correlated with higher performance (as you'd expect).

I didn't turn off hyper-threading nor run single user to do the test.

Just using: perf spec <cmd> 0

Code:
&i = 7ffcfc0a0d40, &offset=7ffcfc0a0d80
ns: 2.162476

Performance counter stats for './cachep 0':

            216.82 msec task-clock                #    0.997 CPUs utilized         
                 2      context-switches          #    9.224 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                62      page-faults               #  285.953 /sec                   
       630,072,269      cycles                    #    2.906 GHz                      (83.40%)
       328,427,135      stalled-cycles-frontend   #   52.13% frontend cycles idle     (83.40%)
         4,334,423      stalled-cycles-backend    #    0.69% backend cycles idle      (66.80%)
       999,062,021      instructions              #    1.59  insn per cycle         
                                                  #    0.33  stalled cycles per insn  (83.40%)
        99,788,133      branches                  #  460.238 M/sec                    (83.40%)
             2,898      branch-misses             #    0.00% of all branches          (83.01%)

       0.217446367 seconds time elapsed

       0.217450000 seconds user
       0.000000000 seconds sys


&i = 7ffc256532c0, &offset=7ffc25653300
ns: 2.017880

Performance counter stats for './cachepnop 0':

            202.43 msec task-clock                #    0.997 CPUs utilized         
                 5      context-switches          #   24.699 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                64      page-faults               #  316.152 /sec                   
       618,866,183      cycles                    #    3.057 GHz                      (82.22%)
       317,277,966      stalled-cycles-frontend   #   51.27% frontend cycles idle     (82.22%)
         4,277,784      stalled-cycles-backend    #    0.69% backend cycles idle      (67.89%)
     1,100,042,543      instructions              #    1.78  insn per cycle         
                                                  #    0.29  stalled cycles per insn  (84.20%)
       100,308,804      branches                  #  495.512 M/sec                    (84.22%)
             5,670      branch-misses             #    0.01% of all branches          (83.44%)

       0.203112777 seconds time elapsed

       0.203148000 seconds user
       0.000000000 seconds sys


So I think the nop is just preventing CPU clock throttling.

Retry your tests without -e flag to perf stat, and see what your CPU clock rate is for nop-less and nop tests.


Top
 Profile  
 
 Post subject: Re: How can a nop instruction speed up my test program?
PostPosted: Mon Oct 31, 2022 9:11 pm 
Offline
User avatar

Joined: Thu Jul 13, 2017 9:52 am
Posts: 22
Quote:
So I think the nop is just preventing CPU clock throttling.

Retry your tests without -e flag to perf stat, and see what your CPU clock rate is for nop-less and nop tests.


Unfortunately, on my Xeon Platinum 8260 CPU, the slower one and the faster one don't have difference on CPU clock rate which is both 2.394GHz.

Code:
# ./slow 0
772,752,989      cycles                    #    2.394 GHz
# ./fast 0
537,562,083      cycles                    #    2.394 GHz


More over, I found another strange thing. We first generate a asm file.

Code:
$ gcc -S test.c -o slow.s


Then edit this file.

Before:

Code:
.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        movq    -112(%rbp), %rax
        addq    $1, %rax
        movq    %rax, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax


After:

Code:
.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        addq    $1, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax


Then compile it and test.

Code:
$ gcc slow.s -o slow ; ./slow 0
&i = 7ffe81263600, &offset=7ffe812635c0
ns: 3.218585


It's slow as we expected. But now we use -ftree-ter to compile this program from .c file.

Code:
$ gcc -ftree-ter test.c -o fast; ./fast 0
&i = 7ffc6c8c72c0, &offset=7ffc6c8c7280
ns: 2.094179


It becomes faster.

And now we compare their asm code:

Code:
objdump slow -d > slow.dump
objdump fast -d > fast.dump
vim slow.dump # delete the address columns
vim fast.dump # delete the address columns
diff -u slow.dump fast.dump


Code:
-e8 b0 fd ff ff         callq  400620 <clock_gettime@plt>
+e8 b3 fd ff ff         callq  400620 <clock_gettime@plt>
48 c7 45 90 00 00 00   movq   $0x0,-0x70(%rbp)
00
-eb 1a                  jmp    400894 <main+0x12e>
+eb 1a                  jmp    400891 <main+0x12b>
48 8b 95 50 ff ff ff   mov    -0xb0(%rbp),%rdx
48 8b 45 c0            mov    -0x40(%rbp),%rax
48 01 d0               add    %rdx,%rax
@@ -204,11 +203,11 @@
48 83 45 90 01         addq   $0x1,-0x70(%rbp)
48 8b 45 90            mov    -0x70(%rbp),%rax
48 3d ff e0 f5 05      cmp    $0x5f5e0ff,%rax
-76 da                  jbe    40087a <main+0x114>
+76 da                  jbe    400877 <main+0x111>
48 8d 85 30 ff ff ff   lea    -0xd0(%rbp),%rax
48 89 c6               mov    %rax,%rsi
bf 01 00 00 00         mov    $0x1,%edi
-e8 6c fd ff ff         callq  400620 <clock_gettime@plt>
+e8 6f fd ff ff         callq  400620 <clock_gettime@plt>


They don't have any significant difference. This means their only differences are the instruction addresses. what a weird behavior...


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 11 posts ] 

All times are UTC - 6 hours


Who is online

Users browsing this forum: 8infy and 59 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group