
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

11 votes
4 answers
10607 views
How to run linux perf without root
I want to benchmark an application of mine. Up to now I used GNU time, but perf yields much better stats. As a matter of principle I would like to go the route of a dedicated perf user instead of allowing all users to do security-related things, not because I am aware of a specific danger but because I don't understand the security implications. Therefore I'd like to avoid lowering the paranoid setting for perf as discussed in this question. Reading kernel.org on perf-security (note that the document seems to imply that this should work with Linux 5.9 or later), I did this:
# addgroup perf_users
# adduser perfer
# addgroup perfer perf_users
# cd /usr/bin
# chgrp perf_users perf
# chmod o-rwx perf
# setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
# setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
which returns perf: ok. # getcap perf then returns perf cap_sys_ptrace,cap_syslog,cap_perfmon=ep, which is different from the link, where they got perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep. My Linux is 5.10.0-5-amd64 #1 SMP Debian 5.10.24-1. If I now run perf with user perfer, I still get the error message:
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html 
perf_event_paranoid setting is 3:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = )
which I tried to circumvent with all of the above. Do any of you know how to get perfer to run perf without lowering the paranoid setting?
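For reference, a small diagnostic sketch (assuming /usr/bin/perf here is the real ELF binary rather than a distro wrapper script, and that the perfer session is a fresh login so the new group membership applies) to confirm the file capabilities actually reach the running process:
id                                        # run as perfer: perf_users must appear in the groups list
getcap /usr/bin/perf                      # expect perf cap_sys_ptrace,cap_syslog,cap_perfmon=ep
perf stat -e cycles -- sleep 1 &          # start a short profiled run in the background
grep -E 'Cap(Prm|Eff)' /proc/$!/status    # non-zero masks show the file capabilities took effect
wait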
Eike (221 rep)
Mar 27, 2021, 02:23 PM • Last activity: Jul 27, 2025, 05:07 PM
1 vote
0 answers
51 views
How to use linux perf in a conda environment
I'm trying to use the Linux perf program in a conda environment, but the perf program seems to ignore the conda environment. I installed the conda-forge::linux-perf package in my conda environment, but when I run it I get this error:
$ perf record ls
perf: error while loading shared libraries: libdebuginfod.so.1: cannot open shared object file: No such file or directory
The library is installed in my conda environment:
$ ls $CONDA_PREFIX/lib/*debuginfod*
/home/pcarter/anaconda3/envs/spec_density/lib/libdebuginfod-0.191.so  /home/pcarter/anaconda3/envs/spec_density/lib/libdebuginfod.so.1
/home/pcarter/anaconda3/envs/spec_density/lib/libdebuginfod.so
But perf is not looking in this directory for libraries. Using strace, I can see that it is only looking in the base system directories, not the conda ones.
execve("/home/pcarter/anaconda3/envs/spec_density/bin/perf", ["perf", "record", "ls"], 0x7ffe078205b0 /* 99 vars */) = 0
access("/etc/suid-debug", F_OK)         = -1 ENOENT (No such file or directory)
brk(NULL)                               = 0x556c19e15000
fcntl(0, F_GETFD)                       = 0
fcntl(1, F_GETFD)                       = 0
fcntl(2, F_GETFD)                       = 0
access("/etc/suid-debug", F_OK)         = -1 ENOENT (No such file or directory)
readlink("/proc/self/exe", "/home/pcarter/anaconda3/envs/spe"..., 4096) = 50
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=138259, ...}) = 0
mmap(NULL, 138259, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9bf8a59000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 l\0\0\0\0\0\0"..., 832) = 832
...
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/tls/haswell/x86_64/libdebuginfod.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/lib/x86_64-linux-gnu/tls/haswell/x86_64", 0x7fffe984d610) = -1 ENOENT (No such file or directory)
...
openat(AT_FDCWD, "/usr/lib/libdebuginfod.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
This is not what happens for other conda-installed programs. Looking at the strace output for the nasm program, it does look in the conda lib directory.
execve("/home/pcarter/anaconda3/envs/spec_density/bin/nasm", ["nasm"], 0x7ffffc123b60 /* 99 vars */) = 0
brk(NULL)                               = 0x189a000
readlink("/proc/self/exe", "/home/pcarter/anaconda3/envs/spe"..., 4096) = 50
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0
Is this a security feature of perf? If so, what do I need to do to fix this? FYI: I followed the instructions at this [page](https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html) to enable my conda perf program to have the capabilities it needs.
$ sudo getcap /home/pcarter/anaconda3/envs/spec_density/bin/perf
/home/pcarter/anaconda3/envs/spec_density/bin/perf cap_sys_ptrace,cap_syslog,cap_perfmon=ep
**Update** I found a workaround, but I still don't completely understand what is going on. Using sudo to run perf as *root* fixes the issue.
$ sudo $CONDA_PREFIX/bin/perf record ls                                                      
# Output of ls redacted here                   
[ perf record: Woken up 1 times to write data ]                                                                                                                           
[ perf record: Captured and wrote 0.021 MB perf.data (7 samples) ]
(I have to specify the path to perf here since I'm using sudo, which uses the *root* user's PATH.) It looks like conda sets up the library search path using RPATH in the executable. If you use readelf to look at the perf binary, it returns:
$ readelf -d $CONDA_PREFIX/bin/perf

Dynamic section at offset 0x948bc0 contains 45 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libelf.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdebuginfod.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdw.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libunwind-x86_64.so.8]
 0x0000000000000001 (NEEDED)             Shared library: [libunwind.so.8]
 0x0000000000000001 (NEEDED)             Shared library: [liblzma.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libcrypto.so.3]
 0x0000000000000001 (NEEDED)             Shared library: [libpython3.12.so.1.0]
 0x0000000000000001 (NEEDED)             Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libzstd.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libcap.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libnuma.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib]
where $ORIGIN represents the directory the executable is in. So RPATH does point to the conda lib directory. However, perf seems to be set up to ignore this depending on properties of the user running it. It does use it for *root* but not for my user account. I assume this is for security reasons. Is there a way to enable this for my user account?
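One hedged way to test whether the file capabilities are what disable the RPATH lookup (the guess being that the loader treats a capability-tagged binary as a secure-execution one and then refuses $ORIGIN-based RPATH entries) is to compare against a capability-free copy of the same binary:
cp "$CONDA_PREFIX/bin/perf" "$CONDA_PREFIX/bin/perf-nocaps"   # cp does not preserve file capabilities
getcap "$CONDA_PREFIX/bin/perf-nocaps"                        # prints nothing: the copy has no capabilities
"$CONDA_PREFIX/bin/perf-nocaps" record ls                     # should now resolve libdebuginfod via RPATH
                                                              # (the record itself may then fail on permissions instead)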
pcarter (111 rep)
Mar 3, 2025, 10:17 PM • Last activity: Mar 4, 2025, 11:37 PM
0 votes
0 answers
45 views
Can "perf mem" command detect remote memory access on CXL NUMA nodes?
I wonder whether perf mem can detect remote memory accesses on CXL NUMA nodes. I have an AMD-EPYC-9654 server, and the CXL memory is on NUMA node 2. I ran a task on node 0 that accessed the remote node 2 memory continuously. But unfortunately I could not test this on my machine, because perf mem doesn't work on AMD CPUs (https://community.amd.com/t5/server-processors/issues-with-perf-mem-record/m-p/95270). Can anyone help me?
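For reference, a minimal sketch of the intended experiment on hardware where perf mem does work (numactl, the node numbers from my setup, and ./remote_access_test are placeholders):
numactl --cpunodebind=0 --membind=2 perf mem record ./remote_access_test   # run on node 0, allocate on CXL node 2
perf mem report                                                            # samples include the memory level / data source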
SeenThrough (1 rep)
Feb 11, 2025, 02:01 AM
0 votes
0 answers
183 views
Clarifications on perf_events data collection
_Originally posted [here](https://stackoverflow.com/questions/79391172/clarifications-on-perf-events-data-collection) on Stack Overflow._
I have never used the perf command before (but I need it), hence I have been reading the (really useful) [PerfWiki](https://perfwiki.github.io/main/). The section devoted to the [Event-based sampling overview](https://perfwiki.github.io/main/tutorial/#sampling-with-perf-record) contains a number of statements that are not completely clear to me. As they are quite essential to precisely understand how data collection is carried out, here I am asking for your help. In the following I will quote paragraphs from that section and explain my doubts immediately after.
> Perf_events is based on event-based sampling. The period is expressed as the number of occurrences of an event, not the number of timer ticks. A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0. No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.
Q1. Consider a CPU working at a fixed frequency (no frequency scaling): what is the precise definition of a timer tick? Is it true that the _number_ of ticks in a second equals the _value_ of the frequency (e.g. 1_000_000_000 ticks/second for a CPU working at 1 GHz)?
Q2. perf does not use timer ticks; instead, it counts the number of times an event occurs and only "stops" the CPU to gather the relevant data once every period occurrences; e.g. if period=1 each occurrence of an event is registered, if period=2 it only registers half of the total number of occurrences, and so on... is that right? When period > 1, does perf automatically scale the final values and provide data as if all the events were registered?
Q3. The above section says that "A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0", which seems to contradict the measurement being taken once every period occurrences of an event... what am I missing? More generally, why does perf wait for a counter to overflow before gathering the information? Also, what happens when more than one event is being monitored?
> The way perf_events emulates 64-bit counter is limited to expressing sampling periods using the number of bits in the actual hardware counters. If this is smaller than 64, the kernel **silently** truncates the period in this case. Therefore, it is best if the period is always smaller than 2^31 if running on 32-bit systems.
Q4. I cannot truly understand the meaning of this paragraph (maybe I am missing some underlying knowledge). If the actual hardware counter has `N
> On counter overflow, the kernel records information, i.e., a sample, about the execution of the program. What gets recorded depends on the type of measurement. This is all specified by the user and the tool. But the key information that is common in all samples is the instruction pointer, i.e. where was the program when it was interrupted.
Ok, I got this.
> Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer stored in each sample designates the place where the program was interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e., where it was at the end of the sampling period. In some case, the distance between those two points may be several dozen instructions or more if there were taken branches. When the program cannot make forward progress, those two locations are indeed identical.
> **For this reason, care must be taken when interpreting profiles.**
Q5. I am aware I know way too little about how a CPU actually works to understand this section, but could you confirm that this paragraph is warning about the following potential chain of events (pun not intended):
1. A sample must be taken
2. The current value in the instruction pointer register is taken
3. A few more instructions are executed by the CPU
4. The sample is taken and the gathered data is saved using and associated with the instruction pointer taken at step (2)
If this is correct, could you give me a brief explanation (or point me to some external resource) about what may cause step (3)? More importantly, how can I monitor how many times skids have occurred? Is there anything on my side that I can do to mitigate this?
> By default, perf record uses the cycles event as the sampling event.
>
> [...]
>
> The perf_events interface allows two modes to express the sampling period:
>
> * the number of occurrences of the event (period)
> * the average rate of samples/sec (frequency)
>
> The perf tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means that the kernel is dynamically adjusting the sampling period to achieve the target average rate.
Q6. Does perf use the cycles event as a reference to compute the sampling period even if it is not among the set of events being monitored? What happens when multiple events are monitored? Does each event have its own period, or is there one event that counts for all? When a frequency is used to determine when samples must be taken, the event used as reference for sampling should be irrelevant, right?
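For concreteness, a small sketch of the two sampling modes and of the precise-event modifiers that reduce skid (./my_benchmark is a placeholder):
perf record -e cycles -c 100000 -- ./my_benchmark   # period mode: one sample every 100000 event occurrences
perf record -e cycles -F 1000 -- ./my_benchmark     # frequency mode (default): kernel adjusts the period for ~1000 samples/s
perf record -e cycles:pp -- ./my_benchmark          # :p/:pp/:ppp request hardware-assisted precise sampling where available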
Sirion (101 rep)
Jan 29, 2025, 08:23 AM
1 vote
1 answer
138 views
Failed to run some functions with perf tool in embedded Linux
I am working on an embedded Linux system (kernel-5.19.20), and I tried the perf tool on my SoC and found that some functions did NOT work. After running perf record /test/perf_test, I got the perf.data, and perf report also showed the function list. I cross-compiled objdump and addr2line for my target, then I ran
./perf annotate --addr2line /data/addr2line  --objdump /data/objdump
 Percent |      Source code & Disassembly of perfload for cpu-clock:ppp (33248 samples, percent: local period)
--------------------------------------------------------------------------------------------------------------
         :
         :
         :
         : 3      Disassembly of section .text:
         :
         : 5      0040060c :
         : 6      workload2():
    0.00 :   40060c: addiu   sp,sp,-24
    0.00 :   400610: sw      s8,20(sp)
    0.00 :   400614: move    s8,sp
    0.00 :   400618: sw      zero,12(s8)
    0.00 :   40061c: sw      zero,8(s8)
    0.00 :   400620: b       400698 
    0.00 :   400624: nop
   39.08 :   400628: lw      v1,8(s8)
    0.00 :   40062c: lw      v0,8(s8)
    2.07 :   400630: mul     v0,v1,v0
    0.00 :   400634: lw      v1,12(s8)
    4.08 :   400638: addu    v0,v1,v0
    0.00 :   40063c: sw      v0,12(s8)
....
Then I tried to make a flamegraph from perf.data with the following commands.
perf script -i perf.data &> perf.unfold

cat perf.unfold | head -20
        perfload    1822  1443.361889:     250000 cpu-clock:ppp:          80769534 _raw_spin_unlock+0x3c ([kernel.kallsyms])
        perfload    1822  1443.362138:     250000 cpu-clock:ppp:          77eb6a28 __mips_syscall5+0x8 (/lib/ld-linux-mipsn8.so.1)
        perfload    1822  1443.362388:     250000 cpu-clock:ppp:          800fe1e0 filemap_map_pages+0x118 ([kernel.kallsyms])
        perfload    1822  1443.362639:     250000 cpu-clock:ppp:          77eb75f8 memcpy+0x1c8 (/lib/ld-linux-mipsn8.so.1)
        perfload    1822  1443.362945:     250000 cpu-clock:ppp:          800eec8c perf_event_mmap_output+0x218 ([kernel.kallsyms])
        perfload    1822  1443.363196:     250000 cpu-clock:ppp:          8001ec38 enable_restore_fp_context+0x208 ([kernel.kallsyms])
        perfload    1822  1443.363440:     250000 cpu-clock:ppp:            4005d0 workload1+0x80 (/data/perfload)
        perfload    1822  1443.363690:     250000 cpu-clock:ppp:            400584 workload1+0x34 (/data/perfload)
        perfload    1822  1443.363939:     250000 cpu-clock:ppp:            4005e8 workload1+0x98 (/data/perfload)
        perfload    1822  1443.364189:     250000 cpu-clock:ppp:            4005ec workload1+0x9c (/data/perfload)
        perfload    1822  1443.364439:     250000 cpu-clock:ppp:            4005ec workload1+0x9c (/data/perfload)
        perfload    1822  1443.364689:     250000 cpu-clock:ppp:            400594 workload1+0x44 (/data/perfload)
        perfload    1822  1443.364939:     250000 cpu-clock:ppp:            400588 workload1+0x38 (/data/perfload)
        perfload    1822  1443.365189:     250000 cpu-clock:ppp:            400574 workload1+0x24 (/data/perfload)
        perfload    1822  1443.365439:     250000 cpu-clock:ppp:            4005d4 workload1+0x84 (/data/perfload)
        perfload    1822  1443.365689:     250000 cpu-clock:ppp:            40058c workload1+0x3c (/data/perfload)
        perfload    1822  1443.365939:     250000 cpu-clock:ppp:            40058c workload1+0x3c (/data/perfload)
        perfload    1822  1443.366189:     250000 cpu-clock:ppp:            40058c workload1+0x3c (/data/perfload)
        perfload    1822  1443.366439:     250000 cpu-clock:ppp:            40059c workload1+0x4c (/data/perfload)
        perfload    1822  1443.366689:     250000 cpu-clock:ppp:            400598 workload1+0x48 (/data/perfload)
I copied perf.unfold to my development host and ran the Perl scripts from Brendan Gregg's website.
stackcollapse-perf.pl ~/shared/perf.unfold &> ~/shared/perf.fold
flamegraph.pl ~/shared/perf.fold > ~/shared/perf.svg
Stack count is low (0). Did something go wrong?
ERROR: No stack counts found
I checked perf.fold, and its size is 0! perf was cross-compiled with the following features enabled.
Auto-detecting system features:
...                                   dwarf: [ OFF ]
...                      dwarf_getlocations: [ OFF ]
...                                   glibc: [ on  ]
...                                  libbfd: [ OFF ]
...                          libbfd-buildid: [ OFF ]
...                                  libcap: [ OFF ]
...                                  libelf: [ on  ]
...                                 libnuma: [ OFF ]
...                  numa_num_possible_cpus: [ OFF ]
...                                 libperl: [ OFF ]
...                               libpython: [ OFF ]
...                               libcrypto: [ OFF ]
...                               libunwind: [ on  ]
...                      libdw-dwarf-unwind: [ OFF ]
...                                    zlib: [ on  ]
...                                    lzma: [ OFF ]
...                               get_cpuid: [ OFF ]
...                                     bpf: [ OFF ]
...                                  libaio: [ on  ]
...                                 libzstd: [ OFF ]
The SoC is MIPS, and I am not sure why flamegraph.pl failed; it seems the stack info is empty in perf.data? Is there any feature I missed to make stack unwinding work? Or does perf NOT support MIPS stack analysis? Thanks,
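For reference, a sketch of a capture that explicitly records call-graph data, since stackcollapse-perf.pl only produces output when perf.data actually contains stacks (whether frame-pointer or DWARF unwinding works on this MIPS build is exactly the open question):
perf record -g /test/perf_test                    # -g records call graphs (frame-pointer unwinding by default)
perf record --call-graph dwarf /test/perf_test    # DWARF unwinding, if the perf build supports it
perf script -i perf.data > perf.unfold            # stack frames should now appear as indented lines under each sample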
wangt13 (631 rep)
Oct 25, 2024, 12:03 PM • Last activity: Oct 26, 2024, 02:06 AM
0 votes
0 answers
347 views
How to cross-compile Linux tools/perf for embedded system?
I am working on an embedded Linux system (kernel-5.19.20) on MIPS, and I want to build tools/perf for my system. I want to have the libelf feature enabled when cross-compiling perf, so I first cross-compiled and installed libelf to $(proj)/sysroot/usr/lib/. Then I tried the following command to cross-compile perf.
CC=mips-none-gnu-gcc make ARCH=mips CROSS_COMPILE=mips-none-gnu- EXTRA_CFLAGS="-I/home/t/proj/target/usr/include" LDFLAGS="-L/home/t/proj/target/usr/lib -Wl,-rpath-link=/home/t/proj/target/usr/lib"
And I got following feature list.
Auto-detecting system features:
...                         dwarf: [ OFF ]
...            dwarf_getlocations: [ OFF ]
...                         glibc: [ on  ]
...                        libbfd: [ OFF ]
...                libbfd-buildid: [ OFF ]
...                        libcap: [ OFF ]
...                        libelf: [ OFF ]
...                       libnuma: [ OFF ]
...        numa_num_possible_cpus: [ OFF ]
...                       libperl: [ OFF ]
...                     libpython: [ OFF ]
...                     libcrypto: [ OFF ]
...                     libunwind: [ OFF ]
...            libdw-dwarf-unwind: [ OFF ]
...                          zlib: [ OFF ]
...                          lzma: [ OFF ]
...                     get_cpuid: [ OFF ]
...                           bpf: [ on  ]
...                        libaio: [ on  ]
...                       libzstd: [ OFF ]
...        disassembler-four-args: [ OFF ]
I checked ../build/feature/test-libelf.make.output, and I got:
/home/t/proj/mips-none-gnu/bin/ld: warning: libz.so.1, needed by /home/t/proj/target/usr/lib/libelf.so, not found (try using -rpath or -rpath-link)
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflate'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `deflate'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `deflateInit_'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflateEnd'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `deflateEnd'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflateInit_'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflateReset'
collect2: error: ld returned 1 exit status
libz.so is built within Buildroot and already installed into /home/t/proj/target/usr/lib. So how can I make the perf build with libelf succeed in this case?
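A few hedged checks that usually narrow this kind of failure down (the readelf name follows the mips-none-gnu- toolchain prefix above; none of this is a guaranteed fix):
ls -l /home/t/proj/target/usr/lib/libz.so*                                       # is the libz.so.1 SONAME really present?
mips-none-gnu-readelf -d /home/t/proj/target/usr/lib/libelf.so | grep NEEDED     # what libelf itself links against
cat ../build/feature/test-libelf.make.output                                     # the full linker output from the feature probe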
wangt13 (631 rep)
Oct 17, 2024, 06:25 AM • Last activity: Oct 17, 2024, 07:04 AM
1 vote
1 answer
242 views
perf trace is not available on my machine
I'm on
Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-45-generic x86_64)
I installed the perf packages
sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
but when I try to run perf trace, it complains that the subcommand isn't available:
akhil@akhil-Inspiron-5559:~$ perf trace
perf: 'trace' is not a perf-command. See 'perf --help'.
akhil@akhil-Inspiron-5559:~$ sudo perf trace
[sudo] password for akhil:
perf: 'trace' is not a perf-command. See 'perf --help'.
However, I think it is installed, because when I run man perf trace I get the man page:
PERF-TRACE(1)                                                                     perf Manual                                                                    PERF-TRACE(1)

NAME
       perf-trace - strace inspired tool

SYNOPSIS
       perf trace
       perf trace record
Any ideas what could be happening? EDITS: In response to @Romeo's suggestion:
akhil@akhil-Inspiron-5559:~$ perf-trace
perf-trace: command not found
akhil@akhil-Inspiron-5559:~$ sudo perf-trace
[sudo] password for akhil:
sudo: perf-trace: command not found
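One hedged check: ask the installed binary what it was built with, since the trace subcommand is only compiled in when the tracing libraries were available at build time (which entries appear depends on the perf version):
which perf                                # on Ubuntu this is a wrapper for a versioned binary
perf version --build-options              # look for the tracing-related entries (e.g. libtraceevent)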
Rockstar5645 (113 rep)
Oct 9, 2024, 03:45 PM • Last activity: Oct 9, 2024, 07:25 PM
0 votes
0 answers
79 views
Linux Perf Stat -> results and intel meteor lake (p-cores only test)
Device: Asus Zenbook 14 (Intel Meteor Lake 155H), Ubuntu 24.04 LTS, uname -r == 6.8.0-45-generic.
From running perf stat, I can see entries for **cpu_atom**/instructions/, **cpu_atom**/cycles/ and **cpu_atom**/branches/. Are those the results from the Efficiency-cores (**cpu_atom**)? Can I run perf stat strictly on the Performance-cores (which are cpu_core 0 to 5)?
lscpu --all --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0     16:16:4:0    yes 4500.0000 400.0000 1731.5070
  1    0      0    1       8:8:2:0    yes 4800.0000 400.0000 1991.5780
  2    0      0    1       8:8:2:0    yes 4800.0000 400.0000  400.0000
  3    0      0    2     12:12:3:0    yes 4800.0000 400.0000 1684.4030
  4    0      0    2     12:12:3:0    yes 4800.0000 400.0000 1889.0811
  5    0      0    0     16:16:4:0    yes 4500.0000 400.0000 1271.9000
  6    0      0    3     20:20:5:0    yes 4500.0000 400.0000 1669.7340
  7    0      0    3     20:20:5:0    yes 4500.0000 400.0000  400.0000
  8    0      0    4     24:24:6:0    yes 4500.0000 400.0000  400.0000
  9    0      0    4     24:24:6:0    yes 4500.0000 400.0000 1838.1899
 10    0      0    5     28:28:7:0    yes 4500.0000 400.0000 1856.5320
 11    0      0    5     28:28:7:0    yes 4500.0000 400.0000  400.0000
 12    0      0    6       0:0:0:0    yes 3800.0000 400.0000 1537.6290
 13    0      0    7       2:2:0:0    yes 3800.0000 400.0000 1502.0179
 14    0      0    8       4:4:0:0    yes 3800.0000 400.0000 1341.5179
 15    0      0    9       6:6:0:0    yes 3800.0000 400.0000 1461.8240
 16    0      0   10           1:0    yes 3800.0000 400.0000  400.0000
 17    0      0   11     10:10:1:0    yes 3800.0000 400.0000  400.0000
 18    0      0   12           1:0    yes 3800.0000 400.0000  400.0000
 19    0      0   13     14:14:1:0    yes 3800.0000 400.0000  998.1960
 20    0      0   14       64:64:8    yes 2500.0000 400.0000  400.0000
 21    0      0   15       66:66:8    yes 2500.0000 400.0000  400.0000
perf stat results
             12.24 msec task-clock                #    0.693 CPUs utilized
               107      context-switches          #    8.742 K/sec
                23      cpu-migrations            #    1.879 K/sec
               329      page-faults               #   26.879 K/sec
        18,360,525      cpu_atom/cycles/          #    1.500 GHz                       (24.46%)
        21,167,060      cpu_core/cycles/          #    1.729 GHz                       (75.54%)
        19,495,339      cpu_atom/instructions/    #    1.06  insn per cycle            (24.46%)
        34,945,740      cpu_core/instructions/    #    1.90  insn per cycle            (75.54%)
         3,627,048      cpu_atom/branches/        #  296.327 M/sec                     (24.46%)
         6,609,493      cpu_core/branches/        #  539.991 M/sec                     (75.54%)
            76,918      cpu_atom/branch-misses/   #    2.12% of all branches           (24.46%)
            65,568      cpu_core/branch-misses/   #    1.81% of all branches           (75.54%)
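Two hedged ways to limit the measurement to the P-cores (the cpu_core PMU; CPU numbers 0-11 come from the lscpu output above, and ./workload is a placeholder):
taskset -c 0-11 perf stat -e cpu_core/cycles/,cpu_core/instructions/ ./workload   # pin the workload to the P-core CPUs
perf stat -a -C 0-11 -e cpu_core/cycles/,cpu_core/instructions/ -- sleep 5        # or count system-wide on those CPUs only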
Cedyangs279 (1 rep)
Oct 8, 2024, 07:37 AM
1 vote
0 answers
143 views
Cache or save results of perf report
For large files, perf report takes a while, e.g. around 5 minutes for a 500G perf.data. Is there a way to dump (certain states of) perf report to a file, or to tell perf to cache something that enables faster invocation of perf report later on? Note that I don't want to just save the first view of perf report by redirecting to a file: perf report > a.txt
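Short of a real cache, one hedged workaround is to pre-render the expensive views once into text files, so later inspection does not have to re-decode the huge perf.data every time:
perf report -i perf.data --stdio --sort=dso,symbol > report_by_symbol.txt
perf report -i perf.data --stdio --sort=comm,dso   > report_by_process.txt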
A. K. (136 rep)
Aug 25, 2024, 12:00 AM
0 votes
1 answer
159 views
see vdso function with perf report
On my perf report I see a bunch of lines with vdso modules and no function names. I am guessing these are for gettimeofday type calls. But how can I see actual function calls? My perf record command:
sudo perf record -a -g -k 1 --call-graph dwarf,16384 -F 99 -e cpu-clock:pppH -p $PID -- sleep 60
and using sudo perf report to view the report. It shows something like:
+    2.83%     0.00%  myprog  [vdso]              [.] 0x00007fff983f6738
I am on kernel version 5.10 using Amazon Linux 2.
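A hedged check: perf normally stores a copy of the vDSO in its build-id cache and resolves [vdso] symbols from there, so it is worth confirming that copy exists for this perf.data (with sudo recording, the cache is root's ~/.debug):
sudo perf buildid-list -i perf.data | grep -i vdso    # was a [vdso] DSO recorded with a build id?
sudo ls /root/.debug 2>/dev/null | grep -i vdso       # is there a cached copy in the build-id cache?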
Eee Zee (1 rep)
Jul 24, 2024, 05:01 AM • Last activity: Jul 24, 2024, 12:03 PM
0 votes
1 answer
736 views
Unable to find any information about TLB on my computer or obtain information about hardware counters about TLB
The Ubuntu version I am using is **Ubuntu 18.04.6 LTS**, and the kernel version is **5.4.0-148 generic**. My processor is **12th Gen Intel (R) Core (TM) i7-12700**. I would like to know the number of TLB entries in my CPU for different page sizes (1G, 2MB, 4KB), as well as the number of dTLB misses during program execution.
The cpuid -1 command told me they are 0:
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
   instruction # entries     = 0x0 (0)
   instruction associativity = 0x0 (0)
   data # entries            = 0x0 (0)
   data associativity        = 0x0 (0)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
   instruction # entries     = 0x0 (0)
   instruction associativity = 0x0 (0)
   data # entries            = 0x0 (0)
   data associativity        = 0x0 (0)
L1 data cache information (0x80000005/ecx):
   line size (bytes) = 0x0 (0)
   lines per tag     = 0x0 (0)
   associativity     = 0x0 (0)
   size (KB)         = 0x0 (0)
L1 instruction cache information (0x80000005/edx):
   line size (bytes) = 0x0 (0)
   lines per tag     = 0x0 (0)
   associativity     = 0x0 (0)
   size (KB)         = 0x0 (0)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
   instruction # entries     = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries            = 0x0 (0)
   data associativity        = L2 off (0)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
   instruction # entries     = 0x0 (0)
   instruction associativity = L2 off (0)
   data # entries            = 0x0 (0)
   data associativity        = L2 off (0)
L2 unified cache information (0x80000006/ecx):
   line size (bytes) = 0x40 (64)
   lines per tag     = 0x0 (0)
   associativity     = 0x7 (7)
   size (KB)         = 0x500 (1280)
L3 cache information (0x80000006/edx):
   line size (bytes)     = 0x0 (0)
   lines per tag         = 0x0 (0)
   associativity         = L2 off (0)
   size (in 512KB units) = 0x0 (0)
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-load-misses
shows not supported.
UPDATE:
$ cpuid -1 -l 2
CPU:
   0xff: cache data is in CPUID 4
   0xfe: unknown
   0xf0: 64 byte prefetching
$ cpuid -1 -l 0x18
CPU:
   0x00000018 0x00: eax=0x00000004 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
$ perf -v
perf version 5.4.231
$ perf list
List of pre-defined events (to be used in -e):
  branch-instructions OR branches        [Hardware event]
  branch-misses                          [Hardware event]
  bus-cycles                             [Hardware event]
  cache-misses                           [Hardware event]
  cache-references                       [Hardware event]
  cpu-cycles OR cycles                   [Hardware event]
  instructions                           [Hardware event]
  ref-cycles                             [Hardware event]
  alignment-faults                       [Software event]
  bpf-output                             [Software event]
  context-switches OR cs                 [Software event]
  cpu-clock                              [Software event]
  cpu-migrations OR migrations           [Software event]
  dummy                                  [Software event]
  emulation-faults                       [Software event]
  major-faults                           [Software event]
  minor-faults                           [Software event]
  page-faults OR faults                  [Software event]
  task-clock                             [Software event]
I'm completely confused about what's wrong with my machine.
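As a hedged follow-up: on recent Intel CPUs the TLB geometry is reported in the CPUID 0x18 sub-leaves (the 0x00 sub-leaf above mostly says how many sub-leaves exist), and the dTLB events depend on what the installed perf knows about this CPU model, so:
cpuid -1 | grep -A8 -i 'address translation'    # decoded 0x18 sub-leaves, if the cpuid tool is new enough
perf list 2>/dev/null | grep -i tlb             # which TLB events, if any, this perf build exposes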
citrusyi (1 rep)
May 24, 2023, 05:20 AM • Last activity: Jun 20, 2024, 05:05 AM
1 vote
1 answer
413 views
Where does `perf script` timestamps come from?
The third column from perf script seems to be close to, but not quite, the uptime; where is that timestamp coming from? Is there a way to access that timestamp other than by accessing sampled events?
$ perf record cat /proc/uptime
1392597.79 16669901.66
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]
$ perf script
             cat 902536 1392640.417831:          1 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417849:          1 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417863:          3 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417876:         10 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417889:         33 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417902:        108 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417915:        351 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417929:       1136 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417942:       3657 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417958:      11701 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.417998:      34018 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.418060:      55936 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.418117:      77496 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.418195:     110190 cycles:u:  ffffffffac000163 [unknown] ([unknown])
             cat 902536 1392640.418296:     140857 cycles:u:  ffffffffac000163 [unknown] ([unknown])
             cat 902536 1392640.418422:     166643 cycles:u:  ffffffffac000b47 [unknown] ([unknown])
             cat 902536 1392640.418589:     187277 cycles:u:  ffffffffac000163 [unknown] ([unknown])
             cat 902536 1392640.418763:     198929 cycles:u:      7fdb1734b9a8 _dl_addr+0x108 (/usr/lib/libc-2.31.so)
             cat 902536 1392640.418959:     209655 cycles:u:  ffffffffac000163 [unknown] ([unknown])
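A hedged sketch for getting a timestamp that can be correlated from userspace: perf record can be told which clock to stamp samples with, and CLOCK_MONOTONIC is readable with clock_gettime:
perf record -k CLOCK_MONOTONIC cat /proc/uptime    # -k/--clockid selects the clock used for sample timestamps
perf script | head -3                              # timestamps are now on the CLOCK_MONOTONIC timeline
python3 -c 'import time; print(time.clock_gettime(time.CLOCK_MONOTONIC))'   # the same timeline, read from userspace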
Frederik Deweerdt (3784 rep)
Apr 28, 2020, 09:58 PM • Last activity: Jun 15, 2024, 03:37 PM
0 votes
2 answers
129 views
INST_RETIRED.ANY no longer works as a performance counter with perf on linux 6.7
With the older kernels INST_RETIRED.ANY (as well as many of the other counters documented in https://perfmon-events.intel.com/ahybrid.htm) worked as counters for perf. I am now using perf with a 6.7 kernel running on a Sapphire Rapids {Golden Cove} processor. When I do the following
perf stat -e INST_RETIRED.ANY,cycles sleep 2
I get
event syntax error: 'INST_RETIRED.ANY,cycles'
                     \___ parser error
Is this the expected behavior?
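A hedged sketch of spellings worth trying, since newer perf versions resolve vendor event names through per-PMU aliases (whether each form works depends on the event tables shipped with that perf build):
perf list 2>/dev/null | grep -i inst_retired        # how this perf build names the event, if at all
perf stat -e inst_retired.any,cycles sleep 2        # lower-case alias form
perf stat -e cpu/inst_retired.any/,cycles sleep 2   # explicitly PMU-scoped form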
rkg125 (1 rep)
Apr 8, 2024, 05:31 AM • Last activity: Apr 11, 2024, 02:05 AM
3 votes
2 answers
2309 views
how to set capabilities (setcap) on perf
I'd like to use the perf utility. I was following instructions to set up a privileged group of users who are permitted to execute performance monitoring and observability without limits (as instructed here: https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html) . I added the group and limited access to users not in the group. I started having problems when assigning capabilities to the perf tool:
setcap cap_sys_admin,cap_sys_ptrace,cap_syslog=ep perf
I get an invalid argument error saying
fatal error: Invalid argument
usage: setcap [-q] [-v] [-n <rootid>] (-r|-|<caps>) <filename> [ ... (-r|-|<capsN>) <filenameN> ]

Note <filename> must be a regular (non-symlink) file.
But running stat on perf gives me this
File: ./perf
  Size: 1622      	Blocks: 8          IO Block: 4096   regular file
Device: 10307h/66311d	Inode: 35260925    Links: 1
Access: (0750/-rwxr-x---)  Uid: (    0/    root)   Gid: ( 1001/perf_users)
Access: 2021-12-03 13:08:48.923220351 +0100
Modify: 2021-11-05 17:02:56.000000000 +0100
Change: 2021-12-03 12:31:49.451991980 +0100
 Birth: -
which says the file is a regular file. What could be the problem? How can I set the capabilities for the perf tool? Linux distribution: Ubuntu 20.04. EDIT: Last 20 output lines of strace setcap cap_sys_admin,cap_sys_ptrace,cap_syslog=ep perf:
munmap(0x7f825054c000, 90581)           = 0
prctl(PR_CAPBSET_READ, CAP_MAC_OVERRIDE) = 1
prctl(PR_CAPBSET_READ, 0x30 /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x28 /* CAP_??? */) = 1
prctl(PR_CAPBSET_READ, 0x2c /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x2a /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x29 /* CAP_??? */) = -1 EINVAL (Invalid argument)
brk(NULL)                               = 0x55de3e858000
brk(0x55de3e879000)                     = 0x55de3e879000
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=1<
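One thing the 1622-byte size in the stat output hints at (a hedged guess): on Ubuntu, /usr/bin/perf is a small shell wrapper that execs the kernel-versioned binary, and capabilities set on the wrapper never reach the real ELF. A quick check, with the linux-tools path layout being an assumption:
file ./perf                                   # "POSIX shell script" would confirm the wrapper guess
ls /usr/lib/linux-tools-*/perf 2>/dev/null    # candidate real binaries installed by linux-tools packages
sudo setcap cap_sys_admin,cap_sys_ptrace,cap_syslog=ep /usr/lib/linux-tools-<version>/perf   # substitute the path found above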
levente.nas (133 rep)
Dec 3, 2021, 01:07 PM • Last activity: Mar 29, 2024, 07:25 PM
2 votes
0 answers
85 views
A simple memcpy loop with stride of 8-16-32 Bytes, no L1 cache misses, what stalls backend cycles?
I am trying to understand the CPU cache performance in a single producer single consumer queue algorithm, but cannot pinpoint the cause of performance degradation in some cases. The following simplified test program runs with practically no L1 cache misses, but nevertheless it spends many cycles stalled in the CPU backend when its memory access pattern is somewhat sparser. **What causes the CPU backend to stall in such cases, when there are practically no L1 misses?** What should be measured to pinpoint the cause? I understand that this question is not so much about Linux or perf_event, as it is more about the architecture of the CPU and caches. What would be a more appropriate stackexchange? Stackoverflow is sort of more about software? Serverfault or Superuser don't aim at these topics either. Electronics stackexchange does not quite target CPU architecture. Although, according to [this meta from 2013](https://meta.stackexchange.com/a/216253/194455) , Electronics may be the best fit. I post it here simply because all the tests were done on Linux, and from experience I think there are experts who may know what happens here, e.g. Gilles. Probably, the best is to post it on AMD forums. But I cannot get my draft posted there because of the error: "Post flooding detected (user tried to post more than 2 messages within 600 seconds)" when I have no posts actually posted. No wonder why their forums are so quiet. My CPU is AMD Ryzen 5 PRO 4650G, which is Zen 2 "Renoir" with 192KiB L1d cache. The memcpy test program:
// demo_memcpy_test_speed-gap.c
#include <stdio.h>   // printf
#include <stdint.h>  // uint8_t
#include <string.h>  // memcpy
#include <time.h>    // clock

#define PACKET_SIZE 8 // 16 32 // <--- it is really the stride of the memcpy over the mem array
#define SIZE_TO_MEMCPY 8 // memcpy only the first 8 bytes of the "packet"

const static long long unsigned n_packets = 512; // use few packets, to fit in L2 etc
static long long unsigned repeat    = 1000*1000 * 2 * 2; // repeat many times to get enough stats in perf

const static long long unsigned n_max_data_bytes = n_packets * PACKET_SIZE;

#define CACHE_LINE_SIZE 64 // align explicitly just in case
alignas(CACHE_LINE_SIZE) uint8_t data_in  [n_max_data_bytes];
alignas(CACHE_LINE_SIZE) uint8_t data_out [n_packets][PACKET_SIZE];

int main(int argc, char* argv[])
{
printf("memcpy_test.c standard\n");
printf("PACKET_SIZE %d   SIZE_TO_MEMCPY %d\n", PACKET_SIZE, SIZE_TO_MEMCPY);

//
// warmup the memory
// i.e. access the memory to make sure Linux has set up the virtual mem tables
...

{
printf("\nrun memcpy\n");

long long unsigned n_bytes_copied = 0;
long long unsigned memcpy_ops     = 0;
start_setup = clock();
for (unsigned rep=0; rep
It's built with -O1 to get a really efficient loop with memcpy:
g++ -g -O1 ./demo_memcpy_test_speed-gap.c
The instructions of the memcpy loop, as seen in the annotate view of perf report:
sudo perf record -F 999 -e stalled-cycles-backend -- ./a.out
sudo perf report
...select main
With PACKET_SIZE set to 8, the code is really efficient:
│     for (unsigned long long i_packet=0; i_packet
With PACKET_SIZE set to 1024 (the code for 256 is the same, except with add $0x100,... instead of $0x400):
│       lea      _end,%rsi
       │140:┌─→mov      %rbp,%rdx
       │    │
       │    │  lea      data_in,%rax
       │    │
       │    │__fortify_function void *
       │    │__NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
       │    │size_t __len))
       │    │{
       │    │return __builtin___memcpy_chk (__dest, __src, __len,
       │14a:│  mov      (%rax),%rcx
       │    │memcpy():
 96.31 │    │  mov      %rcx,(%rdx)
       │    │
  1.81 │    │  add      $0x400,%rax
  0.20 │    │  add      $0x400,%rdx
  1.12 │    │  cmp      %rsi,%rax
  0.57 │    │↑ jne      14a
       │    │  sub      $0x1,%edi
       │    └──jne      140
I run it with PACKET_SIZE set to 8, 16, 32, and other values. The perf counts for 8 and 32:
sudo perf stat -e task-clock,instructions,cycles,stalled-cycles-frontend,stalled-cycles-backend \
      -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches \
      -e l2_cache_accesses_from_dc_misses,l2_cache_hits_from_dc_misses,l2_cache_misses_from_dc_misses \
      -- ./a.out

PACKET_SIZE 8   SIZE_TO_MEMCPY 8
...
 Performance counter stats for './a.out':

            503.43 msec task-clock                       #    0.998 CPUs utilized
    10,323,618,071      instructions                     #    4.79  insn per cycle
                                                  #    0.01  stalled cycles per insn     (29.11%)
     2,154,694,815      cycles                           #    4.280 GHz                         (29.91%)
         5,148,993      stalled-cycles-frontend          #    0.24% frontend cycles idle        (30.70%)
        55,922,538      stalled-cycles-backend           #    2.60% backend cycles idle         (30.99%)
     4,091,862,625      L1-dcache-loads                  #    8.128 G/sec                       (30.99%)
            24,211      L1-dcache-load-misses            #    0.00% of all L1-dcache accesses   (30.99%)
            18,745      L1-dcache-prefetches             #   37.234 K/sec                       (30.37%)
            30,749      l2_cache_accesses_from_dc_misses #   61.079 K/sec                       (29.57%)
            21,046      l2_cache_hits_from_dc_misses     #   41.805 K/sec                       (28.78%)
             9,095      l2_cache_misses_from_dc_misses   #   18.066 K/sec                       (28.60%)

PACKET_SIZE 32   SIZE_TO_MEMCPY 8
...
 Performance counter stats for './a.out':

            832.83 msec task-clock                       #    0.999 CPUs utilized
    12,289,501,297      instructions                     #    3.46  insn per cycle
                                                  #    0.11  stalled cycles per insn     (29.42%)
     3,549,297,932      cycles                           #    4.262 GHz                         (29.64%)
         5,552,837      stalled-cycles-frontend          #    0.16% frontend cycles idle        (30.12%)
     1,349,663,970      stalled-cycles-backend           #   38.03% backend cycles idle         (30.25%)
     4,144,875,512      L1-dcache-loads                  #    4.977 G/sec                       (30.25%)
           772,968      L1-dcache-load-misses            #    0.02% of all L1-dcache accesses   (30.24%)
           539,481      L1-dcache-prefetches             #  647.767 K/sec                       (30.25%)
           532,879      l2_cache_accesses_from_dc_misses #  639.839 K/sec                       (30.24%)
           461,131      l2_cache_hits_from_dc_misses     #  553.690 K/sec                       (30.04%)
            14,485      l2_cache_misses_from_dc_misses   #   17.392 K/sec                       (29.55%)
**There is a slight increase in L1 cache misses: from 0% at 8 bytes PACKET_SIZE up to 0.02% at 32 bytes. But can it justify why the backend stalls jumped from 2.6% to 38%? If not, then what else stalls the CPU backend?**
I get that a larger stride means that the memcpy loop moves from one L1 cache line to another one sooner. But, if the lines are already in the cache, and there are practically no L1 miss events, as reported by perf, then **why would an access to a different cache line stall the backend?** Is it something about how the CPU issues the instructions in parallel? Maybe it cannot issue instructions that access different cache lines simultaneously?
Runs with PACKET_SIZE up to 1024 bytes are shown on the following somewhat busy plot:
[Plot: Mops vs. L1 cache miss percentage and CPU backend/frontend stall percentages for the memcpy test program at different PACKET_SIZE values]
The number on the data points shows the PACKET_SIZE parameter of the run, i.e. the stride of the memcpy access pattern. The X axis is millions of operations per second (Mops), an "operation" = 1 memcpy. The Y axis contains the metrics from perf: the percent of L1 accesses that missed, and the percent of cycles that are stalled in backend and frontend. In all these runs, the L2 accesses practically do not miss, i.e. the l2_cache_misses_from_dc_misses metric is always very low. Just for completeness, according to [anandtech](https://www.anandtech.com/show/14525/amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome/11) the Zen 2 architecture L1 latency is 4 cycles and L2 latency is 12 cycles.
I am not sure why the frontend gets stalled, but that's what perf reports, and I believe it is real, because the effect of a stalled frontend is different from the backend. If you compare the runs with PACKET_SIZE 256 and 1024 on the plot: they have about the same L1 misses; 256 has about 77% of cycles stalled in backend and 0% in frontend; 1024 is the opposite, 77% stalled in frontend and 0% in backend. Yet 1024 is much slower, because it has many fewer instructions issued per cycle: about 0.42 in the 1024 run, and 1.28 in the 256 run. So the CPU issues fewer instructions per cycle when it is stalled in the frontend rather than in the backend. I guess that's just how the frontend and backend work, i.e. the backend can run more in parallel. It would be well appreciated if someone could confirm or correct this guess.
However, a more important question: why does the frontend get stalled? The frontend is supposed to just decode the instructions, and the assembly does not really change with PACKET_SIZE set to 256 or 1024. **Then, what makes the frontend stall more at the 1024 stride than at 256?**
The plot of IPC per Mops for all the PACKET_SIZE runs:
[Plot: IPC vs. Mops for all the PACKET_SIZE values]
The run with PACKET_SIZE 8 is slightly off the line towards more Mops, i.e. faster than the trend of other values. That must be because of the more efficient instructions.
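One measurement detail that may be worth controlling for (a sketch, not a diagnosis): the (29.xx%) annotations in the perf stat output mean the counters were multiplexed, and putting the key events into a single group makes them count over exactly the same cycles:
perf stat -e '{cycles,stalled-cycles-backend,L1-dcache-loads,L1-dcache-load-misses}' ./a.out   # {...} forms one scheduling group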
xealits (2267 rep)
Feb 4, 2024, 11:08 PM • Last activity: Feb 5, 2024, 09:33 AM
0 votes
1 answer
215 views
Which tools for micro-benchmarking?
I'm not sure which tool to use for micro-benchmarking a C program. I would like to measure both:
- Memory usage, RSS (Resident Set Size)
- CPU cycles
I did use perf record -g and perf script piped into an awk script. This worked for finding the memory usage, but the CPU cycles weren't accurate because perf record gets the CPU cycles by sampling. perf stat is accurate but obviously doesn't give per-function stats. The perf_event library seems to be terribly documented and a meal of a task for simple benchmarking. Having briefly looked at:
- SystemTap
- DTrace
- LTTng
- gperftools
- likwid
- PAPI
which all seem like decent, well-documented tools: what would you recommend looking into the most? Or any other suggestions? Thank you for your time.
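For the per-function side specifically, one hedged option that stays within perf: uprobes give exact (non-sampled) hit counts on a chosen function, which can sit next to perf stat's counters (my_func and ./myprog are placeholders):
perf probe -x ./myprog --add my_func                 # creates a probe_myprog:my_func event
perf stat -e probe_myprog:my_func,cycles ./myprog    # exact call count alongside total cycles
perf probe --del 'probe_myprog:*'                    # remove the uprobe afterwards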
jinTgreater (1 rep)
Jan 19, 2024, 06:23 PM • Last activity: Jan 19, 2024, 11:20 PM
1 vote
0 answers
29 views
What is futex-default-S?
I can't find more info about futex-default-S. Is this a system call? A module? Or something else? https://github.com/torvalds/linux/blob/06dc10eae55b5ceabfef287a7e5f16ceea204aa0/tools/perf/Documentation/perf-lock.txt#L102
$ perf lock report -t -F acquired,contended,avg_wait

                    Name   acquired  contended   avg wait (ns)

                    perf     240569          9            5784
                 swapper     106610         19             543
                  :15789      17370          2           14538
            ContainerMgr       8981          6             874
                   sleep       5275          1           11281
         ContainerThread       4416          4             944
         RootPressureThr       3215          5            1215
             rcu_preempt       2954          0               0
            ContainerMgr       2560          0               0
                 unnamed       1873          0               0
         EventManager_De       1845          1             636
         futex-default-S       1609          0               0
Mark K (955 rep)
Oct 18, 2023, 08:24 AM
1 vote
0 answers
247 views
perf instruction count
So I've been playing with perf and assembly. I have the following program:
.intel_syntax noprefix

.global _start
_start:
  mov cl, 2
  mov ebx, 0b101
  shr ebx, cl
  and bl, 1 
  je do_stuff

  do_stuff:
  mov eax, 1
  mov ebx, 0
  int 0x80
and when I use it with perf -e instructions:u ./shift it shows 9 instructions instead of 8, and I could not find out why that is. Is there any way to find out which one is the +1 instruction? Is it just one of the program's instructions running in parallel that the CPU then retires? If that's the case, how can I observe how that works at a lower level? Compiling with: as -msyntax=intel -mnaked-reg shift.s -o shift.o && ld shift.o -o shift. /proc/sys/kernel/perf_event_paranoid is set to -1.
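A hedged way to probe this further: repeat the measurement and compare user-only against combined counts, which shows whether the extra instruction is a stable attribution artifact or just noise:
perf stat -e instructions:u -r 20 ./shift                 # -r repeats the run and prints mean +/- stddev
perf stat -e instructions:u,instructions -r 20 ./shift    # compare user-only vs. user+kernel attribution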
Joao Luca (11 rep)
Sep 11, 2023, 07:02 PM
7 votes
3 answers
11928 views
How do I generate the /sys/kernel/debug/tracing folder in kernel with yocto project?
I was trying to use perf on a Renesas target and I configured the Yocto "local.conf" as shown in [this link](https://wiki.yoctoproject.org/wiki/Tracing_and_Profiling#General_Setup).
#avoid stripping binaries
INHIBIT_PACKAGE_STRIP = "1"
#add the debug information
EXTRA_IMAGE_FEATURES= "debug-tweaks tools-debug dbg-pkgs tools-profile"
#format the debug info into a readable format for PERF
PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'
perf is working, but I need to monitor context switches, which requires using perf timechart and other commands that depend on perf events; however, the commands can't find the path "/sys/kernel/debug/tracing/events". What should I do in order to get this folder and its files compiled with my kernel?
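As a hedged first check before changing the image: the tracing directory needs both the ftrace-related kernel options and a mounted debugfs, so on the target:
mount -t debugfs debugfs /sys/kernel/debug                   # mount debugfs in case it is merely unmounted
zcat /proc/config.gz 2>/dev/null | grep -E 'CONFIG_(DEBUG_FS|FTRACE|FUNCTION_TRACER|TRACEPOINTS)='
# If those options are missing or =n, the usual Yocto fix is a kernel config fragment
# (e.g. a tracing.cfg added to the kernel recipe via SRC_URI in a bbappend) containing:
# CONFIG_DEBUG_FS=y
# CONFIG_FTRACE=y
# CONFIG_FUNCTION_TRACER=y
# CONFIG_TRACEPOINTS=y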
gemad (83 rep)
Jul 10, 2017, 08:58 AM • Last activity: Jun 2, 2023, 06:00 AM
0 votes
0 answers
117 views
Security of perf under root access
When I use perf, it asks for more permissions: "You may not have permission to collect stats." I was lazy, so I just ran sudo perf (on some short program I created myself), but I later realized you can give perf a little more access without full root. My question is: is it secure to use the command:
sudo perf
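For completeness, a hedged sketch of the usual narrower alternative, so the whole tool does not have to run as root (./your_program is a placeholder; the right paranoid value depends on what you need to measure):
sudo sysctl -w kernel.perf_event_paranoid=2   # 2 still allows profiling your own user-space code; 1/0/-1 open up more
perf stat ./your_program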
NewbMaster66 (1 rep)
Apr 7, 2023, 06:11 AM