Linux Applications Debugging Techniques/Aiming for and measuring performance
Profiling an application with gprof
- Compile the code with -pg
- Link with -pg
- Run the application. This creates a gmon.out file in the application's current folder.
- At the prompt, in the folder where gmon.out is:
gprof path-to-application
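A minimal sketch of the whole workflow (main.cpp, myapp and profile.txt are placeholder names):
# Compile and link with -pg, run to produce gmon.out, then report.
g++ -pg -O2 main.cpp -o myapp
./myapp                              # writes gmon.out into the current folder
gprof ./myapp gmon.out > profile.txt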
The Performance Application Programming Interface (PAPI) offers programmers access to the performance counter hardware found in most major microprocessors. With a decent C++ wrapper, measuring branch mispredictions and cache misses (and much more) becomes a one-line job.
By default, these are the events a papi::counters<Print_policy> instance looks for:
static const events_type events = {
{PAPI_TOT_INS, "Total instructions"}
, {PAPI_TOT_CYC, "Total cpu cycles"}
, {PAPI_L1_DCM, "L1 load misses"}
// , {PAPI_L1_STM, "L1 store misses"}
, {PAPI_L2_DCM, "L2 load misses"}
// , {PAPI_L2_STM, "L2 store misses"}
, {PAPI_BR_MSP, "Branch mispredictions"}
};
The counters class is parameterized with a Print_policy indicating what to do with the counters once they go out of scope.
As an example, let's have a look at these lines of code:
const int nlines = 196608;
const int ncols = 64;
char ctrash[nlines][ncols];
{
int x;
papi::counters<papi::stdout_print> pc("by column"); // <== the famous one-liner
for (int c = 0; c < ncols; ++c) {
for (int l = 0; l < nlines; ++l) {
x = ctrash[l][c];
}
}
}
The code simply loops over an array, but in the wrong order: the innermost loop iterates over the outermost index. While the results are the same whether we loop first over the first index or over the last one, in theory the innermost loop should iterate over the innermost index, to preserve cache locality. This should make quite a difference in the time needed to traverse the array:
{
int x;
papi::counters<papi::stdout_print> pc("By line");
for (int l = 0; l < nlines; ++l) {
for (int c = 0; c < ncols; ++c) {
x = ctrash[l][c];
}
}
}
papi::counters is a wrapper class around the PAPI functionality. It takes a snapshot of a number of performance counters (in this case we are interested in cache misses and branch mispredictions) when the counter object is created, and another snapshot when the object is destroyed. It then prints out the differences.
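The book's wrapper itself is not listed here; below is a minimal RAII sketch in the same spirit, written against PAPI's low-level C API (the class name and the error handling choices are mine, not the library's or the book's):
#include <papi.h>
#include <cstdio>
#include <cstdlib>

// Minimal RAII counter in the spirit of papi::counters: counting starts at
// zero on construction, and the accumulated deltas are printed on destruction.
class scoped_counters {
    int evset_ = PAPI_NULL;
    const char* tag_;
public:
    explicit scoped_counters(const char* tag) : tag_(tag) {
        if (PAPI_is_initialized() == PAPI_NOT_INITED
            && PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            std::abort();                        // library/header version mismatch
        PAPI_create_eventset(&evset_);
        PAPI_add_event(evset_, PAPI_L1_DCM);     // L1 data cache misses
        PAPI_add_event(evset_, PAPI_BR_MSP);     // branch mispredictions
        PAPI_start(evset_);                      // counters start from zero here
    }
    ~scoped_counters() {
        long long v[2] = {};
        PAPI_stop(evset_, v);                    // counts accumulated since start
        std::printf("Delta %s:\n  L1 load misses: %lld\n  Branch mispredictions: %lld\n",
                    tag_, v[0], v[1]);
        PAPI_cleanup_eventset(evset_);
        PAPI_destroy_eventset(&evset_);
    }
};
Usage mirrors the one-liner above:
{
    scoped_counters pc("by column");
    // ... the loops to measure ...
}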
A first measurement, with unoptimized code (-O0), shows these results:
Delta by column:
PAPI_TOT_INS (Total instructions): 188744788 (380506167-191761379)
PAPI_TOT_CYC (Total cpu cycles): 92390347 (187804288-95413941)
PAPI_L1_DCM (L1 load misses): 28427 (30620-2193) <==
PAPI_L2_DCM (L2 load misses): 102 (1269-1167)
PAPI_BR_MSP (Branch mispredictions): 176 (207651-207475) <==
Delta By line:
PAPI_TOT_INS (Total instructions): 190909841 (191734047-824206)
PAPI_TOT_CYC (Total cpu cycles): 94460862 (95387664-926802)
PAPI_L1_DCM (L1 load misses): 403 (2046-1643) <==
PAPI_L2_DCM (L2 load misses): 21 (1081-1060)
PAPI_BR_MSP (Branch mispredictions): 205934 (207350-1416) <==
While cache misses did improve, branch mispredictions went through the roof. Not much of a trade-off. Within the processor's pipeline, comparison operations translate into branch operations, and the unoptimized code the compiler generates is a bit odd.
In general, branching machine code is generated directly from if/else statements and ternary operators; and indirectly from virtual calls and calls through function pointers.
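As a rough illustration (my example, not from the book), each construct below typically lowers to a branch:
struct Base {
    virtual int value() const { return 1; }  // overridable, dispatched at run time
    virtual ~Base() = default;
};

int branchy(int a, int b, const Base& obj) {
    int m = (a < b) ? a : b;   // ternary: conditional branch (or a cmov)
    if (a & 1)                 // if: conditional branch
        ++m;
    m += obj.value();          // virtual call: indirect branch through the vtable
    return m;
}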
Maybe optimized code (-O2) fares better? Or maybe not:
Delta by column:
PAPI_TOT_INS (Total instructions): 329 (229368-229039)
PAPI_TOT_CYC (Total cpu cycles): 513 (186217-185704)
PAPI_L1_DCM (L1 load misses): 2 (1523-1521)
PAPI_L2_DCM (L2 load misses): 0 (993-993)
PAPI_BR_MSP (Branch mispredictions): 7 (1287-1280)
Delta By line:
PAPI_TOT_INS (Total instructions): 330 (209614-209284)
PAPI_TOT_CYC (Total cpu cycles): 499 (173487-172988)
PAPI_L1_DCM (L1 load misses): 2 (1498-1496)
PAPI_L2_DCM (L2 load misses): 0 (992-992)
PAPI_BR_MSP (Branch mispredictions): 7 (1225-1218)
This time the compiler optimized the loops away! It figured out that we were not really using the data in the array, so it removed the loops. Completely!
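(A side note, not the approach taken below: a common way to keep such a loop from being eliminated is to make the accesses observable, for instance through a volatile sink.)
// Sketch: the volatile store must be performed, so the loads feeding it
// survive dead-code elimination even at -O2.
volatile int sink;
for (int c = 0; c < ncols; ++c)
    for (int l = 0; l < nlines; ++l)
        sink = ctrash[l][c];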
Let's see how this code fares instead:
{
int x;
papi::counters<papi::stdout_print> pc("by column");
for (int c = 0; c < ncols; ++c) {
for (int l = 0; l < nlines; ++l) {
x = ctrash[l][c];
ctrash[l][c] = x + 1;
}
}
}
Delta by column:
PAPI_TOT_INS (Total instructions): 62918492 (63167552-249060)
PAPI_TOT_CYC (Total cpu cycles): 224705473 (224904307-198834)
PAPI_L1_DCM (L1 load misses): 12415661 (12417203-1542)
PAPI_L2_DCM (L2 load misses): 9654638 (9655632-994)
PAPI_BR_MSP (Branch mispredictions): 14217 (15558-1341)
Delta By line:
PAPI_TOT_INS (Total instructions): 51904854 (115092642-63187788)
PAPI_TOT_CYC (Total cpu cycles): 25914254 (250864272-224950018)
PAPI_L1_DCM (L1 load misses): 197104 (12614449-12417345)
PAPI_L2_DCM (L2 load misses): 6330 (9662090-9655760)
PAPI_BR_MSP (Branch mispredictions): 296 (16066-15770)
Both cache misses and branch mispredictions improved by at least an order of magnitude. A run with unoptimized code would show improvements of the same order of magnitude.
OProfile offers access to the same hardware counters as PAPI, but without having to instrument the code:
- It is coarser-grained than PAPI: at function level.
- Some out-of-the-box kernels (RedHat) are not OProfile friendly.
- You need root access.
#!/bin/bash
#
# A script to OProfile a program.
# Must be run as root.
#
if [ $# -ne 1 ]
then
    echo "Usage: `basename $0` <for-binary-image>"
    exit 1
else
    binimg="$1"
fi

opcontrol --stop
opcontrol --shutdown

# Out of the box RedHat kernels are OProfile repellent.
opcontrol --no-vmlinux
opcontrol --reset

# List of events for the platform to be found in /usr/share/oprofile/<>/events
opcontrol --event=L2_CACHE_MISSES:1000
opcontrol --start

"$binimg"

opcontrol --stop
opcontrol --dump

rm -f "$binimg".opreport.log
opreport > "$binimg".opreport.log
rm -f "$binimg".opreport.sym
opreport -l > "$binimg".opreport.sym

opcontrol --shutdown
opcontrol --deinit
echo "Done"
perf
perf is a kernel-based subsystem that provides a framework for analyzing the performance impact running programs have on the kernel. It covers hardware features (CPU/PMU, Performance Monitoring Unit) as well as software features (software counters, tracepoints).
perf list
Lists the events available on a particular machine. These events vary, depending on the system's performance monitoring hardware and software configuration.
$ perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
minor-faults [Software event]
major-faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
L1-icache-prefetch-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-stores [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-prefetches [Hardware cache event]
dTLB-prefetch-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
branch-load-misses [Hardware cache event]
node-loads [Hardware cache event]
node-load-misses [Hardware cache event]
node-stores [Hardware cache event]
node-store-misses [Hardware cache event]
node-prefetches [Hardware cache event]
node-prefetch-misses [Hardware cache event]
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]
mem:<addr>[:access] [Hardware breakpoint]
Note: running as root will output an extended list of events; some of the events (tracepoints?) require root access.
perf stat
Gathers overall statistics for common performance events, including instructions executed and clock cycles consumed. Option flags allow gathering statistics on events other than the ones measured by default.
$ g++ -std=c++11 -ggdb -fno-omit-frame-pointer perftest.cpp -o perftest
$ perf stat ./perftest
Performance counter stats for './perftest':
379.991103 task-clock # 0.996 CPUs utilized
62 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
6,436 page-faults # 0.017 M/sec
984,969,006 cycles # 2.592 GHz [83.27%]
663,592,329 stalled-cycles-frontend # 67.37% frontend cycles idle [83.17%]
473,904,165 stalled-cycles-backend # 48.11% backend cycles idle [66.42%]
1,010,613,552 instructions # 1.03 insns per cycle
# 0.66 stalled cycles per insn [83.23%]
53,831,403 branches # 141.665 M/sec [84.14%]
401,518 branch-misses # 0.75% of all branches [83.48%]
0.381602838 seconds time elapsed
$ perf stat --event=L1-dcache-load-misses ./perftest
Performance counter stats for './perftest':
12,942,048 L1-dcache-load-misses
0.373719009 seconds time elapsed
perf record
Records performance data into a file which can later be analyzed using perf report.
$ perf record --event=L1-dcache-load-misses ./perftest
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (~1078 samples) ]
$ ls -al
...
-rw------- 1 amelinte amelinte 27764 Feb 17 17:23 perf.data
perf report
Reads the performance data from a file and analyzes the recorded data.
$ perf report --stdio
# ========
# captured on: Sun Feb 17 17:23:34 2013
# hostname : bear
# os release : 3.2.0-4-amd64
# perf version : 3.2.17
# arch : x86_64
# nrcpus online : 4
# nrcpus avail : 4
# cpudesc : Intel(R) Core(TM) i3 CPU M 390 @ 2.67GHz
# cpuid : GenuineIntel,6,37,5
# total memory : 3857640 kB
# cmdline : /usr/bin/perf_3.2 record --event=L1-dcache-load-misses ./perftest
# event : name = L1-dcache-load-misses, type = 3, config = 0x10000, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, id = {
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# ========
#
# Events: 274 L1-dcache-load-misses
#
# Overhead Command Shared Object Symbol
# ........ ........ ................. ..................
#
#     95.93%  perftest  perftest           [.] 0xd35
#      1.06%  perftest  [kernel.kallsyms]  [k] clear_page_c
#      0.82%  perftest  [kernel.kallsyms]  [k] run_timer_softirq
#      0.42%  perftest  [kernel.kallsyms]  [k] trylock_page
#      0.41%  perftest  [kernel.kallsyms]  [k] __rcu_pending
#      0.41%  perftest  [kernel.kallsyms]  [k] update_curr
#      0.33%  perftest  [kernel.kallsyms]  [k] do_raw_spin_lock
#      0.26%  perftest  [kernel.kallsyms]  [k] __flush_tlb_one
#      0.18%  perftest  [kernel.kallsyms]  [k] flush_old_exec
#      0.06%  perftest  [kernel.kallsyms]  [k] __free_one_page
#      0.05%  perftest  [kernel.kallsyms]  [k] free_swap_cache
#      0.05%  perftest  [kernel.kallsyms]  [k] zone_statistics
#      0.04%  perftest  [kernel.kallsyms]  [k] alloc_pages_vma
#      0.01%  perftest  [kernel.kallsyms]  [k] mm_init
#      0.01%  perftest  [kernel.kallsyms]  [k] vfs_read
#      0.00%  perftest  [kernel.kallsyms]  [k] __cond_resched
#      0.00%  perftest  [kernel.kallsyms]  [k] finish_task_switch
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
$ perf record -g ./perftest
$ perf report -g --stdio
...
# Overhead Command Shared Object Symbol
# ........ ........ ................. ......................
#
97.23% perftest perftest [.] 0xc75
|
--- 0x400d2c
0x400dfb
__libc_start_main
perf annotate
Reads the input file and displays an annotated version of the code. If the object file has debug symbols, the source code is displayed alongside the assembly code. If there is no debug info, the annotated assembly is displayed. Broken!?
$ perf annotate -i ./perf.data -d ./perftest --stdio -f
Warning:
The ./perf.data file has no samples!
perf top
Similar to the top tool. It generates and displays a performance counter profile in real time.
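For instance, to watch one of the cache events from the list above in real time (an illustrative invocation; it needs root or a permissive perf_event_paranoid setting):
$ perf top -e L1-dcache-load-misses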
perf archive
perf inject
perf probe
perf sched
perf script
Cachegrind
Cachegrind simulates a machine with two cache levels ([I1 & D1] and L2) and with branch (mis)prediction. It is useful for annotating code, as it can annotate down to line level. The simulated machine can be quite different from the actual CPU of the real machine. It will not get far on AMD64 CPUs (vex disassembler issues). It is extremely slow, typically slowing the application down 12-15 times.
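A typical session might look as follows (a sketch: perftest is reused from above, and <pid> stands for the process id Cachegrind appends to its output file):
$ valgrind --tool=cachegrind ./perftest
$ cg_annotate cachegrind.out.<pid>    # annotates down to source line level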
libmemleak can easily be modified to track calls to a particular spot in the code. Just insert an mtrace() call at that spot.
getloadavg(3)
Latency numbers every programmer should know:
L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
SSD random read ........................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
Round trip within same datacenter ...... 500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
Disk seek ........................... 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms
Operation                 Cost (ns)          Ratio
Clock period                    0.6            1.0
Best-case CAS                  37.9           63.2
Best-case lock                 65.6          109.3
Single cache miss             139.5          232.5
CAS cache miss                306.0          510.0
Comms Fabric                3,000           5,000
Global Comms          130,000,000     216,000,000
Table 2.1: Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System
- Source: Paul E. McKenney
Hyperthreading
- Assume a performance hit for floating-point-intensive computations (there is only one FPU and one ALU per core (two pipelines)).
- http://ayumilove.net/hyperthreading-faq/
- http://en.wikipedia.org/wiki/Hyper-threading