Linux Applications Debugging Techniques/Aiming for and measuring performance
Profiling an application with gprof
- Compile the code with -pg
- Link with -pg
- Run the application. This creates a gmon.out file in the application's current folder.
- At the prompt, in the folder where gmon.out is:
gprof path-to-application
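A minimal sketch of the whole workflow (main.cpp, myapp and profile.txt are placeholder names):
# Compile and link with -pg, run to produce gmon.out, then report.
g++ -pg -O2 main.cpp -o myapp
./myapp                              # writes gmon.out into the current folder
gprof ./myapp gmon.out > profile.txt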
The Performance Application Programming Interface (PAPI) offers programmers access to the performance counter hardware found in most major microprocessors. With a decent C++ wrapper, measuring branch mispredictions and cache misses (and much more) becomes a one-line job.
By default, these are the events a papi::counters<Print_policy> instance looks for:
static const events_type events = {
{PAPI_TOT_INS, "Total instructions"}
, {PAPI_TOT_CYC, "Total cpu cycles"}
, {PAPI_L1_DCM, "L1 load misses"}
// , {PAPI_L1_STM, "L1 store misses"}
, {PAPI_L2_DCM, "L2 load misses"}
// , {PAPI_L2_STM, "L2 store misses"}
, {PAPI_BR_MSP, "Branch mispredictions"}
};
The counters class is parameterized with a Print_policy indicating what to do with the counters once they go out of scope.
As an example, let's have a look at these lines of code:
const int nlines = 196608;
const int ncols = 64;
char ctrash[nlines][ncols];
{
int x;
papi::counters<papi::stdout_print> pc("by column"); // <== the famous one-liner
for (int c = 0; c < ncols; ++c) {
for (int l = 0; l < nlines; ++l) {
x = ctrash[l][c];
}
}
}
The code simply loops over an array, but in the wrong order: the innermost loop iterates over the outermost index. While the results are the same whether we loop first over the first index or over the last one, in theory the innermost loop should iterate over the innermost index, to preserve cache locality. This should make quite a difference in the time needed to traverse the array:
{
int x;
papi::counters<papi::stdout_print> pc("By line");
for (int l = 0; l < nlines; ++l) {
for (int c = 0; c < ncols; ++c) {
x = ctrash[l][c];
}
}
}
papi::counters is a wrapper class around the PAPI functionality. It takes a snapshot of a number of performance counters (in this case we are interested in cache misses and branch mispredictions) when the counter object is created, and another snapshot when the object is destroyed. It then prints out the differences.
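The book's wrapper itself is not listed here; below is a minimal RAII sketch in the same spirit, written against PAPI's low-level C API (the class name and the error handling choices are mine, not the library's or the book's):
#include <papi.h>
#include <cstdio>
#include <cstdlib>

// Minimal RAII counter in the spirit of papi::counters: counting starts at
// zero on construction, and the accumulated deltas are printed on destruction.
class scoped_counters {
    int evset_ = PAPI_NULL;
    const char* tag_;
public:
    explicit scoped_counters(const char* tag) : tag_(tag) {
        if (PAPI_is_initialized() == PAPI_NOT_INITED
            && PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            std::abort();                        // library/header version mismatch
        PAPI_create_eventset(&evset_);
        PAPI_add_event(evset_, PAPI_L1_DCM);     // L1 data cache misses
        PAPI_add_event(evset_, PAPI_BR_MSP);     // branch mispredictions
        PAPI_start(evset_);                      // counters start from zero here
    }
    ~scoped_counters() {
        long long v[2] = {};
        PAPI_stop(evset_, v);                    // counts accumulated since start
        std::printf("Delta %s:\n  L1 load misses: %lld\n  Branch mispredictions: %lld\n",
                    tag_, v[0], v[1]);
        PAPI_cleanup_eventset(evset_);
        PAPI_destroy_eventset(&evset_);
    }
};
Usage mirrors the one-liner above:
{
    scoped_counters pc("by column");
    // ... the loops to measure ...
}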
A first measurement, with unoptimized code (-O0), shows these results:
Delta by column:
PAPI_TOT_INS (Total instructions): 188744788 (380506167-191761379)
PAPI_TOT_CYC (Total cpu cycles): 92390347 (187804288-95413941)
PAPI_L1_DCM (L1 load misses): 28427 (30620-2193) <==
PAPI_L2_DCM (L2 load misses): 102 (1269-1167)
PAPI_BR_MSP (Branch mispredictions): 176 (207651-207475) <==
Delta By line:
PAPI_TOT_INS (Total instructions): 190909841 (191734047-824206)
PAPI_TOT_CYC (Total cpu cycles): 94460862 (95387664-926802)
PAPI_L1_DCM (L1 load misses): 403 (2046-1643) <==
PAPI_L2_DCM (L2 load misses): 21 (1081-1060)
PAPI_BR_MSP (Branch mispredictions): 205934 (207350-1416) <==
While cache misses did improve, branch mispredictions went through the roof. Not much of a trade-off. Within the processor's pipeline, comparison operations translate into branch operations, and the unoptimized code the compiler generates is a bit odd.
In general, branching machine code is generated directly from if/else statements and ternary operators; and indirectly from virtual calls and calls through function pointers.
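As a rough illustration (my example, not from the book), each construct below typically lowers to a branch:
struct Base {
    virtual int value() const { return 1; }  // overridable, dispatched at run time
    virtual ~Base() = default;
};

int branchy(int a, int b, const Base& obj) {
    int m = (a < b) ? a : b;   // ternary: conditional branch (or a cmov)
    if (a & 1)                 // if: conditional branch
        ++m;
    m += obj.value();          // virtual call: indirect branch through the vtable
    return m;
}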
Maybe optimized code (-O2) fares better? Or maybe not:
Delta by column:
PAPI_TOT_INS (Total instructions): 329 (229368-229039)
PAPI_TOT_CYC (Total cpu cycles): 513 (186217-185704)
PAPI_L1_DCM (L1 load misses): 2 (1523-1521)
PAPI_L2_DCM (L2 load misses): 0 (993-993)
PAPI_BR_MSP (Branch mispredictions): 7 (1287-1280)
Delta By line:
PAPI_TOT_INS (Total instructions): 330 (209614-209284)
PAPI_TOT_CYC (Total cpu cycles): 499 (173487-172988)
PAPI_L1_DCM (L1 load misses): 2 (1498-1496)
PAPI_L2_DCM (L2 load misses): 0 (992-992)
PAPI_BR_MSP (Branch mispredictions): 7 (1225-1218)
This time the compiler optimized the loops away! It figured out that we were not really using the data in the array, so it removed the loops. Completely!
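(A side note, not the approach taken below: a common way to keep such a loop from being eliminated is to make the accesses observable, for instance through a volatile sink.)
// Sketch: the volatile store must be performed, so the loads feeding it
// survive dead-code elimination even at -O2.
volatile int sink;
for (int c = 0; c < ncols; ++c)
    for (int l = 0; l < nlines; ++l)
        sink = ctrash[l][c];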
Let's see how this code fares instead:
{
int x;
papi::counters<papi::stdout_print> pc("by column");
for (int c = 0; c < ncols; ++c) {
for (int l = 0; l < nlines; ++l) {
x = ctrash[l][c];
ctrash[l][c] = x + 1;
}
}
}
Delta by column:
PAPI_TOT_INS (Total instructions): 62918492 (63167552-249060)
PAPI_TOT_CYC (Total cpu cycles): 224705473 (224904307-198834)
PAPI_L1_DCM (L1 load misses): 12415661 (12417203-1542)
PAPI_L2_DCM (L2 load misses): 9654638 (9655632-994)
PAPI_BR_MSP (Branch mispredictions): 14217 (15558-1341)
Delta By line:
PAPI_TOT_INS (Total instructions): 51904854 (115092642-63187788)
PAPI_TOT_CYC (Total cpu cycles): 25914254 (250864272-224950018)
PAPI_L1_DCM (L1 load misses): 197104 (12614449-12417345)
PAPI_L2_DCM (L2 load misses): 6330 (9662090-9655760)
PAPI_BR_MSP (Branch mispredictions): 296 (16066-15770)
Both cache misses and branch mispredictions improved by at least an order of magnitude. A run with unoptimized code would show improvements of the same order of magnitude.
OProfile offers access to the same hardware counters as PAPI, but without having to instrument the code:
- It is coarser-grained than PAPI: at function level.
- Some out-of-the-box kernels (RedHat) are not OProfile friendly.
- You need root access.
#!/bin/bash
#
# A script to OProfile a program.
# Must be run as root.
#
if [ $# -ne 1 ]
then
    echo "Usage: `basename $0` <for-binary-image>"
    exit 1
else
    binimg="$1"
fi

opcontrol --stop
opcontrol --shutdown

# Out of the box RedHat kernels are OProfile repellent.
opcontrol --no-vmlinux
opcontrol --reset

# List of events for the platform to be found in /usr/share/oprofile/<>/events
opcontrol --event=L2_CACHE_MISSES:1000
opcontrol --start

"$binimg"

opcontrol --stop
opcontrol --dump

rm -f "$binimg".opreport.log
opreport > "$binimg".opreport.log
rm -f "$binimg".opreport.sym
opreport -l > "$binimg".opreport.sym

opcontrol --shutdown
opcontrol --deinit
echo "Done"
perf
perf is a kernel-based subsystem that provides a framework for analyzing the performance impact running programs have on the kernel. It covers hardware features (CPU/PMU, Performance Monitoring Unit) as well as software features (software counters, tracepoints).
perf list
Lists the events available on a particular machine. These events vary, depending on the system's performance monitoring hardware and software configuration.
$ perf list
List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
minor-faults [Software event]
major-faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-stores [Hardware cache event]
L1-dcache-store-misses [Hardware cache event]
L1-dcache-prefetches [Hardware cache event]
L1-dcache-prefetch-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-prefetches [Hardware cache event]
L1-icache-prefetch-misses [Hardware cache event]
LLC-loads [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-stores [Hardware cache event]
LLC-store-misses [Hardware cache event]
LLC-prefetches [Hardware cache event]
LLC-prefetch-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-stores [Hardware cache event]
dTLB-store-misses [Hardware cache event]
dTLB-prefetches [Hardware cache event]
dTLB-prefetch-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
branch-load-misses [Hardware cache event]
node-loads [Hardware cache event]
node-load-misses [Hardware cache event]
node-stores [Hardware cache event]
node-store-misses [Hardware cache event]
node-prefetches [Hardware cache event]
node-prefetch-misses [Hardware cache event]
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]
mem:<addr>[:access] [Hardware breakpoint]
Note: running as root will output an extended list of events; some of the events (tracepoints?) require root access.
perf stat
Gathers overall statistics for common performance events, including instructions executed and clock cycles consumed. Option flags allow gathering statistics on events other than the ones measured by default.
$ g++ -std=c++11 -ggdb -fno-omit-frame-pointer perftest.cpp -o perftest
$ perf stat ./perftest
Performance counter stats for './perftest':
379.991103 task-clock # 0.996 CPUs utilized
62 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
6,436 page-faults # 0.017 M/sec
984,969,006 cycles # 2.592 GHz [83.27%]
663,592,329 stalled-cycles-frontend # 67.37% frontend cycles idle [83.17%]
473,904,165 stalled-cycles-backend # 48.11% backend cycles idle [66.42%]
1,010,613,552 instructions # 1.03 insns per cycle
# 0.66 stalled cycles per insn [83.23%]
53,831,403 branches # 141.665 M/sec [84.14%]
401,518 branch-misses # 0.75% of all branches [83.48%]
0.381602838 seconds time elapsed
$ perf stat --event=L1-dcache-load-misses ./perftest
Performance counter stats for './perftest':
12,942,048 L1-dcache-load-misses
0.373719009 seconds time elapsed
perf record
Records performance data into a file which can later be analyzed using perf report.
$ perf record --event=L1-dcache-load-misses ./perftest
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (~1078 samples) ]
$ ls -al
...
-rw------- 1 amelinte amelinte 27764 Feb 17 17:23 perf.data
perf report
Reads the performance data from a file and analyzes the recorded data.
$ perf report --stdio
# ========
# captured on: Sun Feb 17 17:23:34 2013
# hostname : bear
# os release : 3.2.0-4-amd64
# perf version : 3.2.17
# arch : x86_64
# nrcpus online : 4
# nrcpus avail : 4
# cpudesc : Intel(R) Core(TM) i3 CPU M 390 @ 2.67GHz
# cpuid : GenuineIntel,6,37,5
# total memory : 3857640 kB
# cmdline : /usr/bin/perf_3.2 record --event=L1-dcache-load-misses ./perftest
# event : name = L1-dcache-load-misses, type = 3, config = 0x10000, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, id = {
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# ========
#
# Events: 274 L1-dcache-load-misses
#
# Overhead Command Shared Object Symbol
# ........ ........ ................. ..................
#
#     95.93%  perftest  perftest           [.] 0xd35
#      1.06%  perftest  [kernel.kallsyms]  [k] clear_page_c
#      0.82%  perftest  [kernel.kallsyms]  [k] run_timer_softirq
#      0.42%  perftest  [kernel.kallsyms]  [k] trylock_page
#      0.41%  perftest  [kernel.kallsyms]  [k] __rcu_pending
#      0.41%  perftest  [kernel.kallsyms]  [k] update_curr
#      0.33%  perftest  [kernel.kallsyms]  [k] do_raw_spin_lock
#      0.26%  perftest  [kernel.kallsyms]  [k] __flush_tlb_one
#      0.18%  perftest  [kernel.kallsyms]  [k] flush_old_exec
#      0.06%  perftest  [kernel.kallsyms]  [k] __free_one_page
#      0.05%  perftest  [kernel.kallsyms]  [k] free_swap_cache
#      0.05%  perftest  [kernel.kallsyms]  [k] zone_statistics
#      0.04%  perftest  [kernel.kallsyms]  [k] alloc_pages_vma
#      0.01%  perftest  [kernel.kallsyms]  [k] mm_init
#      0.01%  perftest  [kernel.kallsyms]  [k] vfs_read
#      0.00%  perftest  [kernel.kallsyms]  [k] __cond_resched
#      0.00%  perftest  [kernel.kallsyms]  [k] finish_task_switch
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
$ perf record -g ./perftest
$ perf report -g --stdio
...
# Overhead Command Shared Object Symbol
# ........ ........ ................. ......................
#
97.23% perftest perftest [.] 0xc75
|
--- 0x400d2c
0x400dfb
__libc_start_main
perf annotate
Reads the input file and displays an annotated version of the code. If the object file has debug symbols, the source code is displayed alongside the assembly code. If there is no debug info, the annotated assembly is displayed. Broken!?
$ perf annotate -i ./perf.data -d ./perftest --stdio -f
Warning:
The ./perf.data file has no samples!
perf top
Similar to the top tool. It generates and displays a performance counter profile in real time.
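For instance, to watch one of the cache events from the list above in real time (an illustrative invocation; it needs root or a permissive perf_event_paranoid setting):
$ perf top -e L1-dcache-load-misses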
perf archive
perf inject
perf probe
perf sched
perf script
Cachegrind
Cachegrind simulates a machine with two cache levels ([I1 & D1] and L2) and with branch (mis)prediction. It is useful for annotating code, as it can annotate down to line level. The simulated machine can be quite different from the actual CPU of the real machine. It will not get far on AMD64 CPUs (vex disassembler issues). It is extremely slow, typically slowing the application down 12-15 times.
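A typical session might look as follows (a sketch: perftest is reused from above, and <pid> stands for the process id Cachegrind appends to its output file):
$ valgrind --tool=cachegrind ./perftest
$ cg_annotate cachegrind.out.<pid>    # annotates down to source line level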
libmemleak can easily be modified to track calls to a particular spot in the code. Just insert an mtrace() call at that spot.
getloadavg(3)
Latency numbers every programmer should know:
L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
SSD random read ........................ 150,000 ns = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
Round trip within same datacenter ...... 500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
Disk seek ........................... 10,000,000 ns = 10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms
Operation                 Cost (ns)          Ratio
Clock period                    0.6            1.0
Best-case CAS                  37.9           63.2
Best-case lock                 65.6          109.3
Single cache miss             139.5          232.5
CAS cache miss                306.0          510.0
Comms Fabric                3,000           5,000
Global Comms          130,000,000     216,000,000
Table 2.1: Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System
- Source: Paul E. McKenney
Hyperthreading
- Assume a performance hit for floating-point-intensive computations (there is only one FPU and one ALU per core (two pipelines)).
- http://ayumilove.net/hyperthreading-faq/
- http://en.wikipedia.org/wiki/Hyper-threading