I used perf to collect some relevant counters. Here are the results:
Without nop instruction:
Code:
$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./slow 0
&i = 7ffc07966f40, &offset=7ffc07966f00
ns: 3.728576
Performance counter stats for './slow 0':
213,867,558 uops_executed.stall_cycles (17.56%)
491,423,003 uops_issued.stall_cycles (23.45%)
512,748,389 uops_retired.stall_cycles (17.62%)
883,538,689 cycle_activity.cycles_mem_any (23.51%)
204,676,274 icache_64b.iftag_hit (23.55%)
7,437 br_misp_retired.all_branches # 872.9 Branch_Misprediction_Cost (23.55%)
149,274 machine_clears.count (23.55%)
1,009,768,664 uops_issued.any (23.55%)
1,005,689,857 uops_retired.retire_slots (23.56%)
1,029,988 int_misc.recovery_cycles (23.56%)
889,352,008 cycles (29.44%)
375,429,113 idq_uops_not_delivered.cycles_0_uops_deliv.core (29.44%)
1,303,779 int_misc.clear_resteer_cycles (29.44%)
24,653 baclears.any (29.44%)
1,005,647,085 uops_retired.retire_slots # 0.9 UPI (29.44%)
1,104,896,299 inst_retired.any (29.44%)
1,101,945,035 inst_retired.any # 0.8 CPI (23.56%)
889,131,234 cycles (23.56%)
1,200,172,618 uops_executed.thread # 1.7 ILP (17.60%)
725,629,824 uops_executed.core_cycles_ge_1 (17.60%)
0.374539772 seconds time elapsed
0.370386000 seconds user
0.000987000 seconds sys
With nop instruction:
Code:
$ perf stat -e uops_executed.stall_cycles,uops_issued.stall_cycles,uops_retired.stall_cycles,cycle_activity.cycles_mem_any,icache_64b.iftag_hit -M Branch_Misprediction_Cost -M Pipeline ./fast 0
&i = 7ffd56fbeac0, &offset=7ffd56fbea80
ns: 2.212173
91,068,788 uops_retired.stall_cycles (18.01%)
523,629,214 cycle_activity.cycles_mem_any (23.87%)
2,324,443 icache_64b.iftag_hit (23.46%)
4,823 br_misp_retired.all_branches # 1171.2 Branch_Misprediction_Cost (23.42%)
122,449 machine_clears.count (23.43%)
1,216,774,633 uops_issued.any (23.42%)
1,206,512,556 uops_retired.retire_slots (23.42%)
705,325 int_misc.recovery_cycles (23.43%)
528,397,938 cycles (29.28%)
22,295,204 idq_uops_not_delivered.cycles_0_uops_deliv.core (29.28%)
1,151,776 int_misc.clear_resteer_cycles (29.28%)
15,162 baclears.any (29.28%)
1,205,231,069 uops_retired.retire_slots # 1.0 UPI (29.28%)
1,204,519,156 inst_retired.any (29.28%)
1,204,031,494 inst_retired.any # 0.4 CPI (23.43%)
528,232,797 cycles (23.43%)
1,312,254,164 uops_executed.thread # 2.6 ILP (17.57%)
497,642,069 uops_executed.core_cycles_ge_1 (17.57%)
0.222836590 seconds time elapsed
0.217941000 seconds user
0.002964000 seconds sys
We can see that the slower version has higher counts for cycle_activity.cycles_mem_any (883.5M vs. 523.6M), uops_issued.stall_cycles, br_misp_retired.all_branches (7,437 vs. 4,823) and icache_64b.iftag_hit (204.7M vs. 2.3M). I don't know whether this means the slower version suffers more branch mispredictions in the pipeline and therefore has to re-fetch more instructions.
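To put the raw counts in proportion, here is a rough sketch that turns some of the counters above into per-cycle and per-instruction ratios. The numbers are copied from the two runs; the dictionary layout and the `ratios` helper are my own naming, not anything perf provides. Keep in mind that the percentages in parentheses show the counters were multiplexed, so every value is a scaled estimate, and I am mixing counters from different multiplexing groups, which makes the ratios approximate.

```python
# Counter values copied from the two perf stat runs above.
# cycles and inst_retired.any are taken from the ~29% multiplexing group.
slow = {
    "cycles": 889_352_008,
    "inst_retired.any": 1_104_896_299,
    "br_misp_retired.all_branches": 7_437,
    "idq_uops_not_delivered.cycles_0_uops_deliv.core": 375_429_113,
    "cycle_activity.cycles_mem_any": 883_538_689,
}
fast = {
    "cycles": 528_397_938,
    "inst_retired.any": 1_204_519_156,
    "br_misp_retired.all_branches": 4_823,
    "idq_uops_not_delivered.cycles_0_uops_deliv.core": 22_295_204,
    "cycle_activity.cycles_mem_any": 523_629_214,
}


def ratios(c):
    """Hypothetical helper: normalize raw counts by cycles or instructions."""
    cycles = c["cycles"]
    insts = c["inst_retired.any"]
    return {
        # Cycles per retired instruction.
        "cpi": cycles / insts,
        # Retired mispredicted branches per 1000 retired instructions.
        "misp_per_kinst": 1000 * c["br_misp_retired.all_branches"] / insts,
        # Fraction of cycles in which the front end delivered 0 uops.
        "frontend_0uop_frac": c["idq_uops_not_delivered.cycles_0_uops_deliv.core"] / cycles,
        # Fraction of cycles with at least one memory load outstanding.
        "mem_any_frac": c["cycle_activity.cycles_mem_any"] / cycles,
    }


for name, counters in (("slow", slow), ("fast", fast)):
    print(name)
    for metric, value in ratios(counters).items():
        print(f"  {metric:20s} {value:.4f}")
```

Normalizing like this makes it easier to see which differences are large relative to the run as a whole and which are small in absolute terms, which may help narrow down where the extra cycles go.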