PERF performance-counter for Odroid XU3/XU4

perf measures Linux performance using hardware counters, tracepoints, software performance counters, and dynamic probes. It is one of the two most commonly used performance-counter profiling tools on Linux. perf is mainly used to analyze bottlenecks from the CPU core up to the driver level. Linux supports many profiling and tracing tools, such as perf, trace-cmd, blktrace, strace, and oprofile.

Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots. perf provides rich generalized abstractions over hardware-specific capabilities. Among others, it provides per-task, per-CPU, and per-workload counters, sampling on top of these, and source code event annotation. With perf we can also monitor the performance of a device driver.

To build perf, first install the following packages.

$ sudo apt-get install flex bison libdw-dev libnewt-dev binutils-dev libaudit-dev libgtk2.0-dev libssl-dev python-dev systemtap-sdt-dev libiberty-dev libperl-dev liblzma-dev libpython-dev libunwind-* asciidoc xmlto

Check out the kernel source code and build the perf executable.

$ git clone --depth 1 https://github.com/hardkernel/linux -b odroidxu4-4.14.y
$ cd linux/tools/perf
$ make
$ sudo cp perf /usr/bin/perf

Note: PMU support for perf is built into the kernel, so you only need to build the perf binary to test it.

Check whether the kernel supports perf (kernel 4.14 or higher is required).

root@odroid:~# dmesg | grep PMU
[    0.250870] EXYNOS5420 PMU initialized
[    0.749038] hw perfevents: enabled with armv7_cortex_a7 PMU driver, 5 counters available
[    0.750030] hw perfevents: enabled with armv7_cortex_a15 PMU driver, 7 counters available
root@odroid:~#
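
The two "hw perfevents" lines above carry the PMU name and the number of counters for each CPU cluster. As a minimal sketch, a small helper can pull those two fields out of the log. The function name pmu_counters is made up for this example, and it assumes the message format is exactly the one shown above.

```shell
# Hypothetical helper: read dmesg output on stdin and print
# "<pmu-name> <counter-count>" for every "hw perfevents" line.
# Lines that do not match (e.g. "EXYNOS5420 PMU initialized") are skipped.
pmu_counters() {
    sed -n 's/.*hw perfevents: enabled with \([^ ]*\) PMU driver, \([0-9]*\) counters available.*/\1 \2/p'
}
```

On the board this could be used as "dmesg | pmu_counters". If it prints nothing, the kernel probably did not register a PMU driver.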

List the perf events that can be monitored.

root@odroid:~# perf list
 
List of pre-defined events (to be used in -e):
 
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
 
  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]
 
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
 
  armv7_cortex_a15/br_immed_retired/                 [Kernel PMU event]
  armv7_cortex_a15/br_mis_pred/                      [Kernel PMU event]
  armv7_cortex_a15/br_pred/                          [Kernel PMU event]
  armv7_cortex_a15/br_return_retired/                [Kernel PMU event]
  armv7_cortex_a15/bus_access/                       [Kernel PMU event]
  armv7_cortex_a15/bus_cycles/                       [Kernel PMU event]
  armv7_cortex_a15/cid_write_retired/                [Kernel PMU event]
  armv7_cortex_a15/cpu_cycles/                       [Kernel PMU event]
  armv7_cortex_a15/exc_return/                       [Kernel PMU event]
  armv7_cortex_a15/exc_taken/                        [Kernel PMU event]
  armv7_cortex_a15/inst_retired/                     [Kernel PMU event]
  armv7_cortex_a15/inst_spec/                        [Kernel PMU event]
  armv7_cortex_a15/l1d_cache/                        [Kernel PMU event]
  armv7_cortex_a15/l1d_cache_refill/                 [Kernel PMU event]
  armv7_cortex_a15/l1d_cache_wb/                     [Kernel PMU event]
  armv7_cortex_a15/l1d_tlb_refill/                   [Kernel PMU event]
  armv7_cortex_a15/l1i_cache/                        [Kernel PMU event]
  armv7_cortex_a15/l1i_cache_refill/                 [Kernel PMU event]
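
Each line of the listing ends with the event category in square brackets. A rough way to see how many events of each kind the board exposes is to count those labels. This is only a sketch: the function name count_event_types is invented here, and it assumes the label is always the last thing on the line, as in the output above.

```shell
# Hypothetical helper: read "perf list" output on stdin and print
# "<category>: <number of events>" for each bracketed category label.
count_event_types() {
    # Keep only the text inside the trailing [...] of each event line,
    # then count identical labels and sort the summary by category name.
    sed -n 's/.*\[\([^]]*\)\][[:space:]]*$/\1/p' |
        awk '{ n[$0]++ } END { for (t in n) printf "%s: %d\n", t, n[t] }' |
        sort
}
```

On the board this could be used as "perf list | count_event_types". perf list also accepts a category argument such as hw, sw, cache or pmu to print only one group.
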

Collect counter statistics for a command with perf stat

root@odroid:~/perf-examples# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.840694 s, 609 MB/s
 
 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':
 
        842.111288      task-clock (msec)         #    0.996 CPUs utilized
                 1      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                42      page-faults               #    0.050 K/sec
        1684203841      cycles                    #    2.000 GHz
        1435117503      instructions              #    0.85  insn per cycle
         311869004      branches                  #  370.342 M/sec
          11924108      branch-misses             #    3.82% of all branches
 
       0.845417981 seconds time elapsed
 
root@odroid:~/perf-examples#
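
The values in the right-hand column are simple ratios of the raw counts. As a quick check, the sketch below recomputes two of them from the numbers in the run above. The helper name ratio is made up for this example.

```shell
# Hypothetical helper: ratio NUM DEN [SCALE]
# Prints NUM / DEN * SCALE with two decimals (SCALE defaults to 1).
ratio() {
    awk -v n="$1" -v d="$2" -v s="${3:-1}" \
        'BEGIN { if (d == 0) { print "n/a"; exit } printf "%.2f\n", n / d * s }'
}

ratio 1435117503 1684203841      # instructions / cycles   -> 0.85
ratio 11924108 311869004 100     # branch-misses / branches -> 3.82 (percent)
```

Both results match the "insn per cycle" and "of all branches" figures that perf stat printed.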

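Note: The Exynos5422 is a big.LITTLE SoC, so the counters are collected separately for each CPU cluster. The runs below use taskset to pin dd to CPU 0 (Cortex-A7) and then to CPU 4 (Cortex-A15).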

root@odroid:~/perf-examples# perf stat -B  taskset -c 0 dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.65277 s, 310 MB/s
 
 Performance counter stats for 'taskset -c 0 dd if=/dev/zero of=/dev/null count=1000000':
 
       1655.839284      task-clock (msec)         #    0.999 CPUs utilized
                 7      context-switches          #    0.004 K/sec
                 1      cpu-migrations            #    0.001 K/sec
                77      page-faults               #    0.047 K/sec
           1773536      cycles                    #    0.001 GHz
            444207      instructions              #    0.25  insn per cycle
             93267      branches                  #    0.056 M/sec
              9169      branch-misses             #    9.83% of all branches
 
       1.657392774 seconds time elapsed
 
root@odroid:~/perf-examples# perf stat -B  taskset -c 4 dd if=/dev/zero of=/dev/null count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0.809315 s, 633 MB/s
 
 Performance counter stats for 'taskset -c 4 dd if=/dev/zero of=/dev/null count=1000000':
 
        811.520288      task-clock (msec)         #    0.998 CPUs utilized
                 6      context-switches          #    0.007 K/sec
                 1      cpu-migrations            #    0.001 K/sec
                77      page-faults               #    0.095 K/sec
        1622986577      cycles                    #    2.000 GHz
        1435747079      instructions              #    0.88  insn per cycle
         311780313      branches                  #  384.193 M/sec
           8700181      branch-misses             #    2.79% of all branches
 
       0.812844283 seconds time elapsed
 
root@odroid:~/perf-examples#
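
The CPU 0 run reports far fewer cycles than its run time suggests. One possible reason is that the generic cycles event is bound to only one of the two PMUs, so naming the PMU explicitly is a way to check. The sketch below only prints the command it would run. It assumes CPUs 0-3 are the Cortex-A7 cores and CPUs 4-7 the Cortex-A15 cores, and that the armv7_cortex_a7 PMU offers the same cpu_cycles and inst_retired event names as the armv7_cortex_a15 entries in the list above. The function names are made up for this example.

```shell
# Hypothetical helper: map a CPU number to its PMU name.
# Assumption: CPUs 0-3 are Cortex-A7, CPUs 4-7 are Cortex-A15.
pmu_for_cpu() {
    case "$1" in
        [0-3]) echo armv7_cortex_a7 ;;
        [4-7]) echo armv7_cortex_a15 ;;
        *) echo "unknown CPU: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: stat_cmd CPU COMMAND...
# Prints (does not run) a perf stat command line that pins COMMAND to CPU
# and counts cycles and retired instructions on that CPU's own PMU.
stat_cmd() {
    pmu=$(pmu_for_cpu "$1") || return 1
    cpu=$1
    shift
    echo "perf stat -e $pmu/cpu_cycles/,$pmu/inst_retired/ taskset -c $cpu $*"
}
```

For example, "stat_cmd 0 dd if=/dev/zero of=/dev/null count=1000000" prints a command line that can be reviewed and then pasted into the shell.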

perf record/report

perf record: By default, perf record uses the cycles event as the sampling event. This is a generic hardware event that the kernel maps to a hardware-specific PMU event.

perf report: Samples collected by perf record are saved into a binary file called, by default, perf.data. The perf report command reads this file and generates a concise execution profile. By default, samples are sorted by functions with the most samples first. It is possible to customize the sorting order and therefore to view the data differently.

root@odroid:~/perf-examples# perf record -a sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.103 MB perf.data (289 samples) ]
 
root@odroid:~/perf-examples#
root@odroid:~/perf-examples# perf report
Samples: 289  of event 'cycles:ppp', Event count (approx.): 28006656
Overhead  Command          Shared Object     Symbol
  40.33%  swapper          [kernel.vmlinux]  [k] arch_cpu_idle
   7.23%  swapper          [kernel.vmlinux]  [k] tick_nohz_idle_exit
   5.40%  swapper          [kernel.vmlinux]  [k] tick_nohz_idle_enter
   3.83%  swapper          [kernel.vmlinux]  [k] _raw_spin_unlock_irq
   3.27%  sleep            [kernel.vmlinux]  [k] filemap_map_pages
   3.25%  perf             [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
   2.14%  sleep            [kernel.vmlinux]  [k] page_remove_rmap
   1.82%  perf             [kernel.vmlinux]  [k] perf_event_ctx_lock_nested
   1.78%  swapper          [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
   1.70%  ksoftirqd/4      [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
   1.68%  sleep            libc-2.23.so      [.] 0x00050840
   1.67%  perf             [kernel.vmlinux]  [k] page_remove_rmap
   1.61%  perf             [kernel.vmlinux]  [k] remove_vma
   1.51%  kworker/u16:0    [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
   1.48%  perf             [kernel.vmlinux]  [k] ext4_da_write_begin
   1.44%  kworker/u16:0    [kernel.vmlinux]  [k] _find_opp_table_unlocked
   1.35%  swapper          [kernel.vmlinux]  [k] __exception_text_end
   1.33%  perf             [kernel.vmlinux]  [k] alloc_set_pte
   1.23%  kworker/0:1      [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
   1.22%  perf             [kernel.vmlinux]  [k] _test_and_set_bit
   1.06%  perf             [kernel.vmlinux]  [k] _raw_spin_lock
   1.03%  kworker/u16:0    [kernel.vmlinux]  [k] update_devfreq_passive
   0.83%  kworker/u16:0    [kernel.vmlinux]  [k] _raw_spin_unlock_irq
   0.80%  kworker/0:1      [kernel.vmlinux]  [k] memchr_inv
   0.79%  rs:main Q:Reg    [kernel.vmlinux]  [k] balance_dirty_pages_ratelimited
   0.79%  rs:main Q:Reg    rsyslogd          [.] 0x0002c8ae
   0.70%  sleep            [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
   0.68%  systemd-journal  systemd-journald  [.] 0x00015f1c
   0.61%  rs:main Q:Reg    [kernel.vmlinux]  [k] kmap_atomic
   0.54%  systemd-journal  systemd-journald  [.] 0x0002aeac
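
In the report, [k] marks kernel symbols and [.] marks user-space symbols. As a minimal sketch, the overhead column can be summed per marker to see how the samples split between kernel and user space. The function name split_overhead is made up here, and it assumes the text layout shown above (overhead percentage in the first column).

```shell
# Hypothetical helper: read perf report text output on stdin and print the
# total overhead of kernel ([k]) and user-space ([.]) symbols.
split_overhead() {
    awk '
        /\[k\]/  { k += $1 }   # a field like 40.33% is read as the number 40.33
        /\[\.\]/ { u += $1 }
        END { printf "kernel %.2f%% user %.2f%%\n", k, u }
    '
}
```

On the board this could be used as "perf report --stdio | split_overhead". The grouping inside perf report itself can be changed with --sort, for example "perf report --sort comm,dso".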