How can a nop instruction speed up my test program?

haolee · **Joined:** Thu Jul 13, 2017 9:52 am **Posts:** 22

I write a test program to test cache performance on x86 and found an unexpected problem.

Code:

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define BUF_SIZE 8192
#define ROUND 100000000UL
int main(int argc, char **argv)
{
   char *buf, *buf_newaddr, *buf_pageend;
   unsigned long i __attribute__((aligned(64)));
   int buf_realsize;
   unsigned long offset __attribute__((aligned(64)));
   struct timespec start={0,0}, end={0,0};
   double start_ns, end_ns;

   if (argc != 2) {
      printf("missing args\n");
      exit(-1);
   }

   offset = atoi(argv[1]);

again:
   buf = (void *)malloc(BUF_SIZE);
   buf_pageend = (void *)((unsigned long)(buf + 4095) & 0xfffffffffffff000UL);
   if (buf_pageend - buf < 1024) { // make sure we have enough space for negative 'offset'
      // don't free, occupt it in order to alloc another different block
      goto again;
   }
   memset(buf, 0, BUF_SIZE);

   printf("&i = %lx, &offset=%lx\n", &i, &offset);
   clock_gettime(CLOCK_MONOTONIC, &start);
   for (i = 0; i < ROUND; i++) {
      *((unsigned long *)(buf_pageend + offset)) = 0; // mark
   }
   clock_gettime(CLOCK_MONOTONIC, &end);
   start_ns = start.tv_sec*1000000000 + start.tv_nsec;
   end_ns = end.tv_sec*1000000000 + end.tv_nsec;
   printf("ns: %lf\n", (end_ns - start_ns)/ROUND);
}

The asm corresponding to the loop is as following:

Code:

        call    clock_gettime
        movq    $0, -112(%rbp)
        jmp     .L5
.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        movq    -112(%rbp), %rax
        addq    $1, %rax
        movq    %rax, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax
        jbe     .L6
        leaq    -208(%rbp), %rax
        movq    %rax, %rsi
        movl    $1, %edi
        call    clock_gettime

The program will print:

Code:

$ ./a.out 0
&i = 7fffd9811700, &offset=7fffd98116c0
ns: 3.217088

But if I insert a nop instruction:

Quote:

.L6:
movq -176(%rbp), %rdx
movq -64(%rbp), %rax
addq %rdx, %rax
nop
movq $0, (%rax)
movq -112(%rbp), %rax
addq $1, %rax
movq %rax, -112(%rbp)
.L5:

The result will be:

Code:

$ ./a.out 0
&i = 7ffef00bb380, &offset=7ffef00bb340
ns: 2.104905

My program is compiled with O0.
I can't understand how the nop instruction speed up my program. Thanks!

iansjack · **Posted:** Sun Oct 30, 2022 2:25 pm

I couldn’t say if it’s the case with your program, but nop instructions can alter execution times by altering the alignment of memory accesses.

devc1 · **Posted:** Sun Oct 30, 2022 4:06 pm

the added latency can be just from the host OS scheduler. Does it show the same result when you run it multiple times ?

thewrongchristian · **Joined:** Tue Apr 03, 2018 2:44 am **Posts:** 402

haolee wrote:

I write a test program to test cache performance on x86 and found an unexpected problem.

My program is compiled with O0.
I can't understand how the nop instruction speed up my program. Thanks!

-O0 turns off most optimisations. When I use -O0, I do replicate your results, but see a much smaller effect, as much as 10% difference compared to your ~50% difference. Running 10 times in each binary, I get about 10% variation from fastest to slowest of each type, so the margin of error is probably quite high.

But once I turn optimisations on, your for loops is in fact invariant, so it's optimised away and replaced with a single store.

The difference with the nop is probably micro-architecture specific, relating to pipeline scheduling (my CPU is a Sandy Bridge based i5-2520M). I'd be surprised if it's instruction alignment based, Intel will have optimised that away in the 25+ years of the P6 derived architecture.

Your printf format specifiers are wrong as well. To print pointers, you should use "%p".

haolee · **Joined:** Thu Jul 13, 2017 9:52 am **Posts:** 22

iansjack wrote:

I couldn’t say if it’s the case with your program, but nop instructions can alter execution times by altering the alignment of memory accesses.

This could be a reason. Do we have any methods to prove it?

haolee · **Joined:** Thu Jul 13, 2017 9:52 am **Posts:** 22

devc1 wrote:

the added latency can be just from the host OS scheduler. Does it show the same result when you run it multiple times ?

I have run it many times and the result is very steady.

haolee · **Joined:** Thu Jul 13, 2017 9:52 am **Posts:** 22

Quote:

-O0 turns off most optimisations. When I use -O0, I do replicate your results, but see a much smaller effect, as much as 10% difference compared to your ~50% difference. Running 10 times in each binary, I get about 10% variation from fastest to slowest of each type, so the margin of error is probably quite high.

Yes, the result may have a certain amount of error but the difference is still significant.

Quote:

But once I turn optimisations on, your for loops is in fact invariant, so it's optimised away and replaced with a single store.

O2 will eliminate the assignment statement and we can use

Code:

*((volatile unsigned long *)(buf_pageend + offset)) = 0;

to prevent it to be optimized out. And if we use O2 optimisation level, the code will be very fast and nop is useless.

Quote:

The difference with the nop is probably micro-architecture specific, relating to pipeline scheduling (my CPU is a Sandy Bridge based i5-2520M). I'd be surprised if it's instruction alignment based, Intel will have optimised that away in the 25+ years of the P6 derived architecture.

I also suspect nop instruction affects the pipeline but I can find any direct evidence.

Quote:

Your printf format specifiers are wrong as well. To print pointers, you should use "%p".

Yes, %p should be better. Thanks.

haolee · **Joined:** Thu Jul 13, 2017 9:52 am **Posts:** 22

I use perf to collect some important counters. here are the results:

Without nop instruction:

Code:

$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./slow 0
&i = 7ffc07966f40, &offset=7ffc07966f00
ns: 3.728576

 Performance counter stats for './slow 0':

       213,867,558      uops_executed.stall_cycles                                     (17.56%)
       491,423,003      uops_issued.stall_cycles                                      (23.45%)
       512,748,389      uops_retired.stall_cycles                                     (17.62%)
       883,538,689      cycle_activity.cycles_mem_any                                     (23.51%)
       204,676,274      icache_64b.iftag_hit                                          (23.55%)
             7,437      br_misp_retired.all_branches #    872.9 Branch_Misprediction_Cost  (23.55%)
           149,274      machine_clears.count                                          (23.55%)
     1,009,768,664      uops_issued.any                                               (23.55%)
     1,005,689,857      uops_retired.retire_slots                                     (23.56%)
         1,029,988      int_misc.recovery_cycles                                      (23.56%)
       889,352,008      cycles                                                        (29.44%)
       375,429,113      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.44%)
         1,303,779      int_misc.clear_resteer_cycles                                     (29.44%)
            24,653      baclears.any                                                  (29.44%)
     1,005,647,085      uops_retired.retire_slots #      0.9 UPI                      (29.44%)
     1,104,896,299      inst_retired.any                                              (29.44%)
     1,101,945,035      inst_retired.any          #      0.8 CPI                      (23.56%)
       889,131,234      cycles                                                        (23.56%)
     1,200,172,618      uops_executed.thread      #      1.7 ILP                      (17.60%)
       725,629,824      uops_executed.core_cycles_ge_1                                     (17.60%)

       0.374539772 seconds time elapsed

       0.370386000 seconds user
       0.000987000 seconds sys

With nop instruction:

Code:

$ $ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./fast 0
&i = 7ffd56fbeac0, &offset=7ffd56fbea80
ns: 2.212173

        91,068,788      uops_retired.stall_cycles                                     (18.01%)
       523,629,214      cycle_activity.cycles_mem_any                                     (23.87%)
         2,324,443      icache_64b.iftag_hit                                          (23.46%)
             4,823      br_misp_retired.all_branches #   1171.2 Branch_Misprediction_Cost  (23.42%)
           122,449      machine_clears.count                                          (23.43%)
     1,216,774,633      uops_issued.any                                               (23.42%)
     1,206,512,556      uops_retired.retire_slots                                     (23.42%)
           705,325      int_misc.recovery_cycles                                      (23.43%)
       528,397,938      cycles                                                        (29.28%)
        22,295,204      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.28%)
         1,151,776      int_misc.clear_resteer_cycles                                     (29.28%)
            15,162      baclears.any                                                  (29.28%)
     1,205,231,069      uops_retired.retire_slots #      1.0 UPI                      (29.28%)
     1,204,519,156      inst_retired.any                                              (29.28%)
     1,204,031,494      inst_retired.any          #      0.4 CPI                      (23.43%)
       528,232,797      cycles                                                        (23.43%)
     1,312,254,164      uops_executed.thread      #      2.6 ILP                      (17.57%)
       497,642,069      uops_executed.core_cycles_ge_1                                     (17.57%)

       0.222836590 seconds time elapsed

       0.217941000 seconds user
       0.002964000 seconds sys

We can see that the slower one has more cycle_activity.cycles_mem_any, uops_issued.stall_cycles, br_misp_retired.all_branches and icache_64b.iftag_hit. I don't know if this means the slower one has more misprediction in pipeline and thus have to re-fetch more instructions.

thewrongchristian · **Joined:** Tue Apr 03, 2018 2:44 am **Posts:** 402

haolee wrote:

I use perf to collect some important counters. here are the results:

Without nop instruction:

Code:

$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./slow 0
&i = 7ffc07966f40, &offset=7ffc07966f00
ns: 3.728576

 Performance counter stats for './slow 0':

       213,867,558      uops_executed.stall_cycles                                     (17.56%)
       491,423,003      uops_issued.stall_cycles                                      (23.45%)
       512,748,389      uops_retired.stall_cycles                                     (17.62%)
       883,538,689      cycle_activity.cycles_mem_any                                     (23.51%)
       204,676,274      icache_64b.iftag_hit                                          (23.55%)
             7,437      br_misp_retired.all_branches #    872.9 Branch_Misprediction_Cost  (23.55%)
           149,274      machine_clears.count                                          (23.55%)
     1,009,768,664      uops_issued.any                                               (23.55%)
     1,005,689,857      uops_retired.retire_slots                                     (23.56%)
         1,029,988      int_misc.recovery_cycles                                      (23.56%)
       889,352,008      cycles                                                        (29.44%)
       375,429,113      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.44%)
         1,303,779      int_misc.clear_resteer_cycles                                     (29.44%)
            24,653      baclears.any                                                  (29.44%)
     1,005,647,085      uops_retired.retire_slots #      0.9 UPI                      (29.44%)
     1,104,896,299      inst_retired.any                                              (29.44%)
     1,101,945,035      inst_retired.any          #      0.8 CPI                      (23.56%)
       889,131,234      cycles                                                        (23.56%)
     1,200,172,618      uops_executed.thread      #      1.7 ILP                      (17.60%)
       725,629,824      uops_executed.core_cycles_ge_1                                     (17.60%)

       0.374539772 seconds time elapsed

       0.370386000 seconds user
       0.000987000 seconds sys

With nop instruction:

Code:

$ $ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./fast 0
&i = 7ffd56fbeac0, &offset=7ffd56fbea80
ns: 2.212173

        91,068,788      uops_retired.stall_cycles                                     (18.01%)
       523,629,214      cycle_activity.cycles_mem_any                                     (23.87%)
         2,324,443      icache_64b.iftag_hit                                          (23.46%)
             4,823      br_misp_retired.all_branches #   1171.2 Branch_Misprediction_Cost  (23.42%)
           122,449      machine_clears.count                                          (23.43%)
     1,216,774,633      uops_issued.any                                               (23.42%)
     1,206,512,556      uops_retired.retire_slots                                     (23.42%)
           705,325      int_misc.recovery_cycles                                      (23.43%)
       528,397,938      cycles                                                        (29.28%)
        22,295,204      idq_uops_not_delivered.cycles_0_uops_deliv.core                                     (29.28%)
         1,151,776      int_misc.clear_resteer_cycles                                     (29.28%)
            15,162      baclears.any                                                  (29.28%)
     1,205,231,069      uops_retired.retire_slots #      1.0 UPI                      (29.28%)
     1,204,519,156      inst_retired.any                                              (29.28%)
     1,204,031,494      inst_retired.any          #      0.4 CPI                      (23.43%)
       528,232,797      cycles                                                        (23.43%)
     1,312,254,164      uops_executed.thread      #      2.6 ILP                      (17.57%)
       497,642,069      uops_executed.core_cycles_ge_1                                     (17.57%)

       0.222836590 seconds time elapsed

       0.217941000 seconds user
       0.002964000 seconds sys

We can see that the slower one has more cycle_activity.cycles_mem_any, uops_issued.stall_cycles, br_misp_retired.all_branches and icache_64b.iftag_hit. I don't know if this means the slower one has more misprediction in pipeline and thus have to re-fetch more instructions.

I would turn off hyper-threading, start up in single user mode, and run the tests again. The numbers above look like the pipeline might be switching between threads of execution, and the nop is perhaps allowing that switch to happen at a more opportune time (note, this is just a guess, I'm no hyper-threading expert.)

With no hyper-threading, I suspect the difference will narrow. I'll also check my hyper-threading status.

thewrongchristian · **Joined:** Tue Apr 03, 2018 2:44 am **Posts:** 402

thewrongchristian wrote:

haolee wrote:

I use perf to collect some important counters. here are the results:

[snip]

I would turn off hyper-threading, start up in single user mode, and run the tests again. The numbers above look like the pipeline might be switching between threads of execution, and the nop is perhaps allowing that switch to happen at a more opportune time (note, this is just a guess, I'm no hyper-threading expert.)

With no hyper-threading, I suspect the difference will narrow. I'll also check my hyper-threading status.

I did some tests using perf, and the consistent result I saw was a higher CPU clock correlated with higher performance (as you'd expect).

I didn't turn off hyper-threading nor run single user to do the test.

Just using: perf spec <cmd> 0

Code:

&i = 7ffcfc0a0d40, &offset=7ffcfc0a0d80
ns: 2.162476

 Performance counter stats for './cachep 0':

            216.82 msec task-clock                #    0.997 CPUs utilized          
                 2      context-switches          #    9.224 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                62      page-faults               #  285.953 /sec                   
       630,072,269      cycles                    #    2.906 GHz                      (83.40%)
       328,427,135      stalled-cycles-frontend   #   52.13% frontend cycles idle     (83.40%)
         4,334,423      stalled-cycles-backend    #    0.69% backend cycles idle      (66.80%)
       999,062,021      instructions              #    1.59  insn per cycle         
                                                  #    0.33  stalled cycles per insn  (83.40%)
        99,788,133      branches                  #  460.238 M/sec                    (83.40%)
             2,898      branch-misses             #    0.00% of all branches          (83.01%)

       0.217446367 seconds time elapsed

       0.217450000 seconds user
       0.000000000 seconds sys


&i = 7ffc256532c0, &offset=7ffc25653300
ns: 2.017880

 Performance counter stats for './cachepnop 0':

            202.43 msec task-clock                #    0.997 CPUs utilized          
                 5      context-switches          #   24.699 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                64      page-faults               #  316.152 /sec                   
       618,866,183      cycles                    #    3.057 GHz                      (82.22%)
       317,277,966      stalled-cycles-frontend   #   51.27% frontend cycles idle     (82.22%)
         4,277,784      stalled-cycles-backend    #    0.69% backend cycles idle      (67.89%)
     1,100,042,543      instructions              #    1.78  insn per cycle         
                                                  #    0.29  stalled cycles per insn  (84.20%)
       100,308,804      branches                  #  495.512 M/sec                    (84.22%)
             5,670      branch-misses             #    0.01% of all branches          (83.44%)

       0.203112777 seconds time elapsed

       0.203148000 seconds user
       0.000000000 seconds sys

So I think the nop is just preventing CPU clock throttling.

Retry your tests without -e flag to perf stat, and see what your CPU clock rate is for nop-less and nop tests.

haolee · **Joined:** Thu Jul 13, 2017 9:52 am **Posts:** 22

Quote:

So I think the nop is just preventing CPU clock throttling.

Retry your tests without -e flag to perf stat, and see what your CPU clock rate is for nop-less and nop tests.

Unfortunately, on my Xeon Platinum 8260 CPU, the slower one and the faster one don't have difference on CPU clock rate which is both 2.394GHz.

Code:

# ./slow 0
772,752,989      cycles                    #    2.394 GHz
# ./fast 0
537,562,083      cycles                    #    2.394 GHz

More over, I found another strange thing. We first generate a asm file.

Code:

$ gcc -S test.c -o slow.s

Then edit this file.

Before:

Code:

.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        movq    -112(%rbp), %rax
        addq    $1, %rax
        movq    %rax, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax

After:

Code:

.L6:
        movq    -176(%rbp), %rdx
        movq    -64(%rbp), %rax
        addq    %rdx, %rax
        movq    $0, (%rax)
        addq    $1, -112(%rbp)
.L5:
        movq    -112(%rbp), %rax
        cmpq    $99999999, %rax

Then compile it and test.

Code:

$ gcc slow.s -o slow ; ./slow 0
&i = 7ffe81263600, &offset=7ffe812635c0
ns: 3.218585

It's slow as we expected. But now we use -ftree-ter to compile this program from .c file.

Code:

$ gcc -ftree-ter test.c -o fast; ./fast 0
&i = 7ffc6c8c72c0, &offset=7ffc6c8c7280
ns: 2.094179

It becomes faster.

And now we compare their asm code:

Code:

objdump slow -d > slow.dump
objdump fast -d > fast.dump
vim slow.dump # delete the address columns
vim fast.dump # delete the address columns
diff -u slow.dump fast.dump

Code:

-e8 b0 fd ff ff         callq  400620 <clock_gettime@plt>
+e8 b3 fd ff ff         callq  400620 <clock_gettime@plt>
 48 c7 45 90 00 00 00   movq   $0x0,-0x70(%rbp)
 00
-eb 1a                  jmp    400894 <main+0x12e>
+eb 1a                  jmp    400891 <main+0x12b>
 48 8b 95 50 ff ff ff   mov    -0xb0(%rbp),%rdx
 48 8b 45 c0            mov    -0x40(%rbp),%rax
 48 01 d0               add    %rdx,%rax
@@ -204,11 +203,11 @@
 48 83 45 90 01         addq   $0x1,-0x70(%rbp)
 48 8b 45 90            mov    -0x70(%rbp),%rax
 48 3d ff e0 f5 05      cmp    $0x5f5e0ff,%rax
-76 da                  jbe    40087a <main+0x114>
+76 da                  jbe    400877 <main+0x111>
 48 8d 85 30 ff ff ff   lea    -0xd0(%rbp),%rax
 48 89 c6               mov    %rax,%rsi
 bf 01 00 00 00         mov    $0x1,%edi
-e8 6c fd ff ff         callq  400620 <clock_gettime@plt>
+e8 6f fd ff ff         callq  400620 <clock_gettime@plt>

They don't have any significant difference. This means their only differences are the instruction addresses. what a weird behavior...

OSDev.org

How can a nop instruction speed up my test program?

Who is online