Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
11
votes
4
answers
10607
views
How to run linux perf without root
I want to benchmark an application of mine. Up to now I used gnu time, but perf yields much better stats.
As a matter of principle I would like to go the route of a dedicated perf user instead of allowing all users some security-related things, not because I am aware of a specific danger but because I don't understand the security implications. Therefore I'd like to avoid lowering the paranoid setting for perf as discussed in this question.
Reading kernel.org on perf-security (note that the document seems to imply that this should work with Linux 5.9 or later), I did this:
# addgroup perf_users
# adduser perfer
# addgroup perfer perf_users
# cd /usr/bin
# chgrp perf_users perf
# chmod o-rwx perf
# setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
# setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
which returns `perf: ok`.
# getcap perf
returns
perf cap_sys_ptrace,cap_syslog,cap_perfmon=ep
which is different from the link, where they got
perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
My Linux is 5.10.0-5-amd64 #1 SMP Debian 5.10.24-1
If I now run `perf` with user `perfer`, I still get the error message:
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 3:
-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = )
which I tried to circumvent with all the above.
Do any of you know how to get `perfer` to run `perf` without lowering the paranoid setting?
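For reference, here is one way to check whether those file capabilities actually reach a process started by `perfer` (a sketch, run as root; the user, group, and path are the ones from the setup above, everything else, including the sleep duration, is just illustrative). Also note that the `perf_users` membership only applies to sessions started after the `addgroup` call.
```
# start a long-running perf as the perfer user...
su - perfer -c 'perf stat -e cycles -- sleep 30' &
sleep 1
# ...and inspect the capability sets that the perf process actually received;
# CapPrm and CapEff should include the cap_perfmon bit if the setcap took effect
grep Cap "/proc/$(pgrep -n -u perfer perf)/status"
# the hex masks can be decoded with, e.g.: capsh --decode=<CapEff value>
```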
Eike
(221 rep)
Mar 27, 2021, 02:23 PM
• Last activity: Jul 27, 2025, 05:07 PM
1
votes
0
answers
51
views
How to use linux perf in a conda environment
I'm trying to use the linux `perf` program in a conda environment, but the `perf` program seems to ignore the conda environment. I installed the `conda-forge::linux-perf` package in my conda environment, but when I run it I get this error:
$ perf record ls
perf: error while loading shared libraries: libdebuginfod.so.1: cannot open shared object file: No such file or directory
The library is installed in my conda environment:
$ ls $CONDA_PREFIX/lib/*debuginfod*
/home/pcarter/anaconda3/envs/spec_density/lib/libdebuginfod-0.191.so /home/pcarter/anaconda3/envs/spec_density/lib/libdebuginfod.so.1
/home/pcarter/anaconda3/envs/spec_density/lib/libdebuginfod.so
But `perf` is not looking in this directory for libraries. Using `strace`, I can see that it is only looking in the base system directories, not the conda ones.
execve("/home/pcarter/anaconda3/envs/spec_density/bin/perf", ["perf", "record", "ls"], 0x7ffe078205b0 /* 99 vars */) = 0
access("/etc/suid-debug", F_OK) = -1 ENOENT (No such file or directory)
brk(NULL) = 0x556c19e15000
fcntl(0, F_GETFD) = 0
fcntl(1, F_GETFD) = 0
fcntl(2, F_GETFD) = 0
access("/etc/suid-debug", F_OK) = -1 ENOENT (No such file or directory)
readlink("/proc/self/exe", "/home/pcarter/anaconda3/envs/spe"..., 4096) = 50
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=138259, ...}) = 0
mmap(NULL, 138259, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9bf8a59000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 l\0\0\0\0\0\0"..., 832) = 832
...
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/tls/haswell/x86_64/libdebuginfod.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/lib/x86_64-linux-gnu/tls/haswell/x86_64", 0x7fffe984d610) = -1 ENOENT (No such file or directory)
...
openat(AT_FDCWD, "/usr/lib/libdebuginfod.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
This is not what happens for other conda-installed programs. Looking at the `strace` output for running the `nasm` program, it is looking at the conda lib directory.
execve("/home/pcarter/anaconda3/envs/spec_density/bin/nasm", ["nasm"], 0x7ffffc123b60 /* 99 vars */) = 0
brk(NULL) = 0x189a000
readlink("/proc/self/exe", "/home/pcarter/anaconda3/envs/spe"..., 4096) = 50
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/haswell", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/tls", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/haswell", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/x86_64/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib/x86_64", 0x7ffe9d0a04f0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/pcarter/anaconda3/envs/spec_density/bin/../lib/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/home/pcarter/anaconda3/envs/spec_density/bin/../lib", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0
Is this a security feature of `perf`? If so, what do I need to do to fix this?
FYI: I followed the instructions at this [page](https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html) to enable my conda `perf` program to have the capabilities it needs.
$ sudo getcap /home/pcarter/anaconda3/envs/spec_density/bin/perf
/home/pcarter/anaconda3/envs/spec_density/bin/perf cap_sys_ptrace,cap_syslog,cap_perfmon=ep
**Update**
I found a workaround, but still don't completely understand what is going on. Using `sudo` to run `perf` as *root* fixes the issue.
$ sudo $CONDA_PREFIX/bin/perf record ls
# Output of ls redacted here
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.021 MB perf.data (7 samples) ]
(I have to specify the path to `perf` here since I'm using `sudo`, which uses the *root* user's `PATH`.)
It looks like conda sets up the library search path using `RPATH` in the executable. If you use `readelf` to look at the `perf` binary, it returns:
$ readelf -d $CONDA_PREFIX/bin/perf
Dynamic section at offset 0x948bc0 contains 45 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libelf.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdebuginfod.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdw.so.1]
0x0000000000000001 (NEEDED) Shared library: [libunwind-x86_64.so.8]
0x0000000000000001 (NEEDED) Shared library: [libunwind.so.8]
0x0000000000000001 (NEEDED) Shared library: [liblzma.so.5]
0x0000000000000001 (NEEDED) Shared library: [libcrypto.so.3]
0x0000000000000001 (NEEDED) Shared library: [libpython3.12.so.1.0]
0x0000000000000001 (NEEDED) Shared library: [libz.so.1]
0x0000000000000001 (NEEDED) Shared library: [libzstd.so.1]
0x0000000000000001 (NEEDED) Shared library: [libcap.so.2]
0x0000000000000001 (NEEDED) Shared library: [libnuma.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000f (RPATH) Library rpath: [$ORIGIN/../lib]
where `$ORIGIN` represents the directory the executable is in. So `RPATH` does point to the conda `lib` directory. However, `perf` seems to be set up to ignore this depending on properties of the user running it. It does use it for *root* but not for my user account. I assume this is for security reasons. Is there a way to enable this for my user account?
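For what it's worth, a possible workaround sketch (an assumption, not something verified against this exact conda build): since the binary carries file capabilities, the dynamic linker treats it as a secure-execution binary and, as far as I know, ignores `$ORIGIN` expansion in the rpath, so rewriting the rpath to an absolute path may sidestep that.
```
# install patchelf into the environment (it is packaged on conda-forge)
conda install -c conda-forge patchelf
# replace the $ORIGIN-relative rpath with the absolute conda lib path
patchelf --set-rpath "$CONDA_PREFIX/lib" "$CONDA_PREFIX/bin/perf"
readelf -d "$CONDA_PREFIX/bin/perf" | grep -i rpath
# re-apply the capabilities afterwards, in case rewriting the file dropped the xattr
sudo setcap cap_perfmon,cap_sys_ptrace,cap_syslog=ep "$CONDA_PREFIX/bin/perf"
```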
pcarter
(111 rep)
Mar 3, 2025, 10:17 PM
• Last activity: Mar 4, 2025, 11:37 PM
0
votes
0
answers
45
views
Can "perf mem" command detect remote memory access on CXL NUMA nodes?
I wonder whether `perf mem` can detect remote memory accesses on CXL NUMA nodes. I have an AMD EPYC 9654 server, and the CXL memory is on NUMA node 2. I ran a task on node 0 which accessed the remote node 2 memory continuously. But unfortunately I could not test it on my machine, because perf mem didn't work on AMD CPUs (https://community.amd.com/t5/server-processors/issues-with-perf-mem-record/m-p/95270).
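For reference, a minimal way to generate and sample that kind of remote traffic, should the AMD limitation get resolved (a sketch; the node numbers follow the setup above, and `./remote_access_task` is a placeholder):
```
# run the task's threads on node 0 while forcing all its allocations onto node 2 (the CXL node)
numactl --cpunodebind=0 --membind=2 ./remote_access_task &
# sample loads/stores system-wide for a while, then break the samples down by data source
perf mem record -a -- sleep 10
perf mem report --sort=mem
```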
Who can help me?
SeenThrough
(1 rep)
Feb 11, 2025, 02:01 AM
0
votes
0
answers
183
views
Clarifications on perf_events data collection
_Originally posted [here](https://stackoverflow.com/questions/79391172/clarifications-on-perf-events-data-collection) on stackoverflow_
I have never used the `perf` command before (but I need it), hence I have been reading the (really useful) [PerfWiki](https://perfwiki.github.io/main/).
The section devoted to the [Event-based sampling overview](https://perfwiki.github.io/main/tutorial/#sampling-with-perf-record) contains a number of statements that are not completely clear to me.
As they are quite essential to precisely understand how data collection is carried out, here I am asking for your help.
In the following I will quote paragraphs from that section and explain my doubts immediately after.
> Perf_events is based on event-based sampling. The period is expressed as the number of occurrences of an event, not the number of timer ticks. A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0. No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.
Q1. Consider a CPU working at a fixed frequency (no frequency scaling): what is the precise definition of a timer tick? Is it true that the _number_ of ticks in a second equals the _value_ of the frequency (e.g. 1_000_000_000 ticks/second for a CPU working at 1 GHz)?
Q2. `perf` does not use timer ticks; instead, it counts the number of times an event occurs and only "stops" the CPU to gather the relevant data once every `period` times; e.g. if `period=1` each occurrence of an event is registered, if `period=2` it only registers half of the total number of occurrences, and so on... is that right?
When `period > 1`, does `perf` automatically scale the final values and provide data as if all the events were registered?
Q3. The above section says that "A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0", which seems to contradict the measurement being taken once every `period` occurrences of an event... what am I missing?
More generally, why does `perf` wait for a counter to overflow before gathering the information?
Also, what happens when more than one event is being monitored?
> The way perf_events emulates 64-bit counter is limited to expressing sampling periods using the number of bits in the actual hardware counters. If this is smaller than 64, the kernel **silently** truncates the period in this case. Therefore, it is best if the period is always smaller than 2^31 if running on 32-bit systems.
Q4. I cannot truly understand the meaning of this paragraph (maybe I am missing some underlying knowledge). If the actual hardware counter has `N` …
> On counter overflow, the kernel records information, i.e., a sample, about the execution of the program. What gets recorded depends on the type of measurement. This is all specified by the user and the tool. But the key information that is common in all samples is the instruction pointer, i.e. where was the program when it was interrupted.
Ok, I got this.
> Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer stored in each sample designates the place where the program was interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e., where it was at the end of the sampling period. In some case, the distance between those two points may be several dozen instructions or more if there were taken branches. When the program cannot make forward progress, those two locations are indeed identical. **For this reason, care must be taken when interpreting profiles.**
Q5. I am aware I know way too little about how a CPU actually works to understand this section, but could you confirm that this paragraph is warning about the following potential chain of events (pun not intended):
1. A sample must be taken
2. The current value in the instruction pointer register is taken
3. A few more instructions are executed by the CPU
4. The sample is taken and the gathered data is saved and associated with the instruction pointer taken at step (2)
If this is correct, could you give me a brief explanation (or point me to some external resource) about what may cause step (3)?
More importantly, how can I monitor how many times skids have occurred? Is there anything on my side that I can do to mitigate this?
> By default, perf record uses the cycles event as the sampling event.
>
> [...]
>
> The perf_events interface allows two modes to express the sampling period:
>
> * the number of occurrences of the event (period)
> * the average rate of samples/sec (frequency)
>
> The perf tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means that the kernel is dynamically adjusting the sampling period to achieve the target average rate.
Q6. Does `perf` use the `cycles` event as a reference to compute the sampling period even if it is not among the set of events being monitored?
What happens when multiple events are monitored? Does each event have its own period, or is there one event that counts for all?
When a frequency is used to determine when samples must be taken, the event used as reference for sampling should be irrelevant, right?
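For context, the two sampling modes quoted above correspond to two `perf record` flags (a sketch; the event choice and numbers are arbitrary, and `./my_program` is a placeholder):
```
# period mode: take one sample every 100000 occurrences of the cycles event
perf record -e cycles -c 100000 -- ./my_program
# frequency mode (the default): aim for ~1000 samples/sec and let the kernel
# adjust the period dynamically to hit that rate
perf record -e cycles -F 1000 -- ./my_program
```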
Sirion
(101 rep)
Jan 29, 2025, 08:23 AM
1
votes
1
answers
138
views
Failed to run some functions with perf tool in embedded Linux
I am working on an embedded Linux system (kernel-5.19.20), and I tried the perf tool on my SoC and found that some functions did NOT work.
After running `perf record /test/perf_test`, I got the `perf.data`, and `perf report` also showed the function list.
I cross-compiled `objdump` and `addr2line` for my target, then I ran
./perf annotate --addr2line /data/addr2line --objdump /data/objdump
Percent | Source code & Disassembly of perfload for cpu-clock:ppp (33248 samples, percent: local period)
--------------------------------------------------------------------------------------------------------------
:
:
:
: 3 Disassembly of section .text:
:
: 5 0040060c <workload2>:
: 6 workload2():
0.00 : 40060c: addiu sp,sp,-24
0.00 : 400610: sw s8,20(sp)
0.00 : 400614: move s8,sp
0.00 : 400618: sw zero,12(s8)
0.00 : 40061c: sw zero,8(s8)
0.00 : 400620: b 400698
0.00 : 400624: nop
39.08 : 400628: lw v1,8(s8)
0.00 : 40062c: lw v0,8(s8)
2.07 : 400630: mul v0,v1,v0
0.00 : 400634: lw v1,12(s8)
4.08 : 400638: addu v0,v1,v0
0.00 : 40063c: sw v0,12(s8)
....
Then I tried to make a flamegraph from `perf.data` with the following commands.
perf script -i perf.data &> perf.unfold
cat perf.unfold | head -20
perfload 1822 1443.361889: 250000 cpu-clock:ppp: 80769534 _raw_spin_unlock+0x3c ([kernel.kallsyms])
perfload 1822 1443.362138: 250000 cpu-clock:ppp: 77eb6a28 __mips_syscall5+0x8 (/lib/ld-linux-mipsn8.so.1)
perfload 1822 1443.362388: 250000 cpu-clock:ppp: 800fe1e0 filemap_map_pages+0x118 ([kernel.kallsyms])
perfload 1822 1443.362639: 250000 cpu-clock:ppp: 77eb75f8 memcpy+0x1c8 (/lib/ld-linux-mipsn8.so.1)
perfload 1822 1443.362945: 250000 cpu-clock:ppp: 800eec8c perf_event_mmap_output+0x218 ([kernel.kallsyms])
perfload 1822 1443.363196: 250000 cpu-clock:ppp: 8001ec38 enable_restore_fp_context+0x208 ([kernel.kallsyms])
perfload 1822 1443.363440: 250000 cpu-clock:ppp: 4005d0 workload1+0x80 (/data/perfload)
perfload 1822 1443.363690: 250000 cpu-clock:ppp: 400584 workload1+0x34 (/data/perfload)
perfload 1822 1443.363939: 250000 cpu-clock:ppp: 4005e8 workload1+0x98 (/data/perfload)
perfload 1822 1443.364189: 250000 cpu-clock:ppp: 4005ec workload1+0x9c (/data/perfload)
perfload 1822 1443.364439: 250000 cpu-clock:ppp: 4005ec workload1+0x9c (/data/perfload)
perfload 1822 1443.364689: 250000 cpu-clock:ppp: 400594 workload1+0x44 (/data/perfload)
perfload 1822 1443.364939: 250000 cpu-clock:ppp: 400588 workload1+0x38 (/data/perfload)
perfload 1822 1443.365189: 250000 cpu-clock:ppp: 400574 workload1+0x24 (/data/perfload)
perfload 1822 1443.365439: 250000 cpu-clock:ppp: 4005d4 workload1+0x84 (/data/perfload)
perfload 1822 1443.365689: 250000 cpu-clock:ppp: 40058c workload1+0x3c (/data/perfload)
perfload 1822 1443.365939: 250000 cpu-clock:ppp: 40058c workload1+0x3c (/data/perfload)
perfload 1822 1443.366189: 250000 cpu-clock:ppp: 40058c workload1+0x3c (/data/perfload)
perfload 1822 1443.366439: 250000 cpu-clock:ppp: 40059c workload1+0x4c (/data/perfload)
perfload 1822 1443.366689: 250000 cpu-clock:ppp: 400598 workload1+0x48 (/data/perfload)
I copied the `perf.unfold` to my development host, and ran the Perl scripts from Brendan Gregg's website.
stackcollapse-perf.pl ~/shared/perf.unfold &> ~/shared/perf.fold
flamegraph.pl ~/shared/perf.fold > ~/shared/perf.svg
Stack count is low (0). Did something go wrong?
ERROR: No stack counts found
I checked the `perf.fold`, and its size is 0!!!
The `perf` is cross-compiled with the following features enabled.
Auto-detecting system features:
... dwarf: [ OFF ]
... dwarf_getlocations: [ OFF ]
... glibc: [ on ]
... libbfd: [ OFF ]
... libbfd-buildid: [ OFF ]
... libcap: [ OFF ]
... libelf: [ on ]
... libnuma: [ OFF ]
... numa_num_possible_cpus: [ OFF ]
... libperl: [ OFF ]
... libpython: [ OFF ]
... libcrypto: [ OFF ]
... libunwind: [ on ]
... libdw-dwarf-unwind: [ OFF ]
... zlib: [ on ]
... lzma: [ OFF ]
... get_cpuid: [ OFF ]
... bpf: [ OFF ]
... libaio: [ on ]
... libzstd: [ OFF ]
The SoC is a MIPS, and I am not sure why `flamegraph.pl` failed; it seems the stack info is empty in `perf.data`? Or is there any feature I missed to make the stack unwinding work? Or does perf NOT support MIPS stack analysis?
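One thing that may matter here (an assumption, not verified on this MIPS setup): the record step above did not ask for call graphs, and the `perf script` output shows exactly one frame per sample, which gives `stackcollapse-perf.pl` nothing to fold. A sketch of recording with stacks:
```
# -g records call graphs (frame-pointer unwinding by default);
# the workload should be built with frame pointers for this to produce full stacks
perf record -g /test/perf_test
# alternatively, DWARF-based unwinding, if libunwind/libdw support is compiled into perf
perf record --call-graph dwarf /test/perf_test
```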
Thanks,
wangt13
(631 rep)
Oct 25, 2024, 12:03 PM
• Last activity: Oct 26, 2024, 02:06 AM
0
votes
0
answers
347
views
How to cross-compile Linux tools/perf for embedded system?
I am working on an embedded Linux system (kernel-5.19.20) on MIPS, and I want to build `tools/perf` for my system.
I want to have the `libelf` feature enabled when cross-compiling perf, so I first cross-compiled and installed `libelf` to `$(proj)/sysroot/usr/lib/`.
Then I tried the following command to cross-compile `perf`.
CC=mips-none-gnu-gcc make ARCH=mips CROSS_COMPILE=mips-none-gnu- EXTRA_CFLAGS="-I/home/t/proj/target/usr/include" LDFLAGS="-L/home/t/proj/target/usr/lib -Wl,-rpath-link=/home/t/proj/target/usr/lib"
And I got the following feature list.
Auto-detecting system features:
... dwarf: [ OFF ]
... dwarf_getlocations: [ OFF ]
... glibc: [ on ]
... libbfd: [ OFF ]
... libbfd-buildid: [ OFF ]
... libcap: [ OFF ]
... libelf: [ OFF ]
... libnuma: [ OFF ]
... numa_num_possible_cpus: [ OFF ]
... libperl: [ OFF ]
... libpython: [ OFF ]
... libcrypto: [ OFF ]
... libunwind: [ OFF ]
... libdw-dwarf-unwind: [ OFF ]
... zlib: [ OFF ]
... lzma: [ OFF ]
... get_cpuid: [ OFF ]
... bpf: [ on ]
... libaio: [ on ]
... libzstd: [ OFF ]
... disassembler-four-args: [ OFF ]
I checked the `../build/feature/test-libelf.make.output`, and I got,
/home/t/proj/mips-none-gnu/bin/ld: warning: libz.so.1, needed by /home/t/proj/target/usr/lib/libelf.so, not found (try using -rpath or -rpath-link)
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflate'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `deflate'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `deflateInit_'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflateEnd'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `deflateEnd'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflateInit_'
/home/t/proj/target/usr/lib/libelf.so: undefined reference to `inflateReset'
collect2: error: ld returned 1 exit status
The libz.so is built within buildroot and already installed into `/home/t/proj/target/usr/lib`.
So, how do I make perf build with `libelf` successfully in this case?
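In case it is useful, two things worth checking here (a sketch; the paths and variables are taken from the command above, and whether the feature probe honours extra linker inputs this way is an assumption, not something verified against the 5.19 build system): that `libz.so.1` really sits next to `libelf.so` in the staging directory, and that zlib gets named explicitly on the link line.
```
# is the zlib runtime actually in the directory -rpath-link points at?
ls -l /home/t/proj/target/usr/lib/libz.so*

# retry the build with zlib linked explicitly, so the libelf feature probe can resolve it
CC=mips-none-gnu-gcc make ARCH=mips CROSS_COMPILE=mips-none-gnu- \
    EXTRA_CFLAGS="-I/home/t/proj/target/usr/include" \
    LDFLAGS="-L/home/t/proj/target/usr/lib -Wl,-rpath-link=/home/t/proj/target/usr/lib -lz"
```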
wangt13
(631 rep)
Oct 17, 2024, 06:25 AM
• Last activity: Oct 17, 2024, 07:04 AM
1
votes
1
answers
242
views
perf trace is not available on my machine
I'm on
Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-45-generic x86_64)
I installed the perf packages with
sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
but when I try to run `perf trace` it complains that the subcommand isn't available:
akhil@akhil-Inspiron-5559:~$ perf trace
perf: 'trace' is not a perf-command. See 'perf --help'.
akhil@akhil-Inspiron-5559:~$ sudo perf trace
[sudo] password for akhil:
perf: 'trace' is not a perf-command. See 'perf --help'.
However, I think it is installed, because when I run `man perf trace` I get the man page:
PERF-TRACE(1) perf Manual PERF-TRACE(1)
NAME
perf-trace - strace inspired tool
SYNOPSIS
perf trace
perf trace record
Any ideas what could be happening?
EDITS
In response to @Romeo's suggestion:
akhil@akhil-Inspiron-5559:~$ perf-trace
perf-trace: command not found
akhil@akhil-Inspiron-5559:~$ sudo perf-trace
[sudo] password for akhil:
sudo: perf-trace: command not found
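A quick way to see what the installed binary actually provides (a sketch; whether `trace` was compiled in depends on the distro's build options, e.g. whether libtraceevent was available at build time, which is an assumption here):
```
# list the subcommands this perf build knows about
perf --help
# show which optional features were enabled when this perf was built
perf version --build-options
```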
Rockstar5645
(113 rep)
Oct 9, 2024, 03:45 PM
• Last activity: Oct 9, 2024, 07:25 PM
0
votes
0
answers
79
views
Linux Perf Stat -> results and intel meteor lake (p-cores only test)
Device -> Asus Zenbook 14 (Intel Meteor Lake 155H)
ubuntu 24.04 lts
uname -r == 6.8.0-45-generic
From running perf stat, I can see entries for
**cpu_atom**/instructions/
**cpu_atom**/cycles/
**cpu_atom**/branches/
Are those the results from the Efficiency-cores (**cpu_atom**)?
Can I run perf stat strictly on the Performance-Cores? (which are cpu_core 0 to 5)
lscpu --all --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 16:16:4:0 yes 4500.0000 400.0000 1731.5070
1 0 0 1 8:8:2:0 yes 4800.0000 400.0000 1991.5780
2 0 0 1 8:8:2:0 yes 4800.0000 400.0000 400.0000
3 0 0 2 12:12:3:0 yes 4800.0000 400.0000 1684.4030
4 0 0 2 12:12:3:0 yes 4800.0000 400.0000 1889.0811
5 0 0 0 16:16:4:0 yes 4500.0000 400.0000 1271.9000
6 0 0 3 20:20:5:0 yes 4500.0000 400.0000 1669.7340
7 0 0 3 20:20:5:0 yes 4500.0000 400.0000 400.0000
8 0 0 4 24:24:6:0 yes 4500.0000 400.0000 400.0000
9 0 0 4 24:24:6:0 yes 4500.0000 400.0000 1838.1899
10 0 0 5 28:28:7:0 yes 4500.0000 400.0000 1856.5320
11 0 0 5 28:28:7:0 yes 4500.0000 400.0000 400.0000
12 0 0 6 0:0:0:0 yes 3800.0000 400.0000 1537.6290
13 0 0 7 2:2:0:0 yes 3800.0000 400.0000 1502.0179
14 0 0 8 4:4:0:0 yes 3800.0000 400.0000 1341.5179
15 0 0 9 6:6:0:0 yes 3800.0000 400.0000 1461.8240
16 0 0 10 1:0 yes 3800.0000 400.0000 400.0000
17 0 0 11 10:10:1:0 yes 3800.0000 400.0000 400.0000
18 0 0 12 1:0 yes 3800.0000 400.0000 400.0000
19 0 0 13 14:14:1:0 yes 3800.0000 400.0000 998.1960
20 0 0 14 64:64:8 yes 2500.0000 400.0000 400.0000
21 0 0 15 66:66:8 yes 2500.0000 400.0000 400.0000
perf stat results
12.24 msec task-clock # 0.693 CPUs utilized
107 context-switches # 8.742 K/sec
23 cpu-migrations # 1.879 K/sec
329 page-faults # 26.879 K/sec
18,360,525 cpu_atom/cycles/ # 1.500 GHz (24.46%)
21,167,060 cpu_core/cycles/ # 1.729 GHz (75.54%)
19,495,339 cpu_atom/instructions/ # 1.06 insn per cycle (24.46%)
34,945,740 cpu_core/instructions/ # 1.90 insn per cycle (75.54%)
3,627,048 cpu_atom/branches/ # 296.327 M/sec (24.46%)
6,609,493 cpu_core/branches/ # 539.991 M/sec (75.54%)
76,918 cpu_atom/branch-misses/ # 2.12% of all branches (24.46%)
65,568 cpu_core/branch-misses/ # 1.81% of all branches (75.54%)
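For what it's worth, two things that are commonly combined to restrict a measurement to the P-cores (a sketch; the CPU list 0-11 comes from the lscpu output above, and `./prog` is a placeholder):
```
# ask only the cpu_core PMU for the events, so no cpu_atom lines are reported
perf stat -e cpu_core/instructions/,cpu_core/cycles/,cpu_core/branches/ -- ./prog
# additionally pin the workload to the P-core CPUs so it never runs on an E-core
taskset -c 0-11 perf stat -e cpu_core/instructions/,cpu_core/cycles/ -- ./prog
```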
Cedyangs279
(1 rep)
Oct 8, 2024, 07:37 AM
1
votes
0
answers
143
views
Cache or save results of perf report
For large files, perf report takes a while, e.g. around 5 minutes for a 500G perf.data. Is there a way to dump (certain states of) perf report to a file, or tell perf to cache something that enables faster invocation of `perf report` later on?
Note that I don't want to just save the first view of perf report by redirecting it to a file with `perf report > a.txt`.
A. K.
(136 rep)
Aug 25, 2024, 12:00 AM
0
votes
1
answers
159
views
see vdso function with perf report
On my `perf` report I see a bunch of lines with vdso modules and no function names. I am guessing these are for `gettimeofday`-type calls. But how can I see the actual function calls?
My perf record command:
sudo perf record -a -g -k 1 --call-graph dwarf,16384 -F 99 -e cpu-clock:pppH -p $PID -- sleep 60
and using `sudo perf report` to view the report. It shows something like:
+ 2.83% 0.00% myprog [vdso] [.] 0x00007fff983f6738
I am on kernel version 5.10 using Amazon Linux 2.
Eee Zee
(1 rep)
Jul 24, 2024, 05:01 AM
• Last activity: Jul 24, 2024, 12:03 PM
0
votes
1
answers
736
views
Unable to find any information about TLB on my computer or obtain information about hardware counters about TLB
The Ubuntu version I am using is **Ubuntu 18.04.6 LTS**, and the kernel version is **5.4.0-148 generic**. My processor is **12th Gen Intel (R) Core (TM) i7-12700**.
I would like to know the number of TLB entries in my CPU for different page sizes (1G, 2MB, 4KB), as well as the number of dTLB misses during program execution.
The `cpuid -1` command told me they are 0:
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
instruction # entries = 0x0 (0)
instruction associativity = 0x0 (0)
data # entries = 0x0 (0)
data associativity = 0x0 (0)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
instruction # entries = 0x0 (0)
instruction associativity = 0x0 (0)
data # entries = 0x0 (0)
data associativity = 0x0 (0)
L1 data cache information (0x80000005/ecx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = 0x0 (0)
size (KB) = 0x0 (0)
L1 instruction cache information (0x80000005/edx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = 0x0 (0)
size (KB) = 0x0 (0)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
instruction # entries = 0x0 (0)
instruction associativity = L2 off (0)
data # entries = 0x0 (0)
data associativity = L2 off (0)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
instruction # entries = 0x0 (0)
instruction associativity = L2 off (0)
data # entries = 0x0 (0)
data associativity = L2 off (0)
L2 unified cache information (0x80000006/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x0 (0)
associativity = 0x7 (7)
size (KB) = 0x500 (1280)
L3 cache information (0x80000006/edx):
line size (bytes) = 0x0 (0)
lines per tag = 0x0 (0)
associativity = L2 off (0)
size (in 512KB units) = 0x0 (0)
`perf stat -e dTLB-loads,dTLB-load-misses,iTLB-load-misses` shows not supported.
UPDATE:
$ cpuid -1 -l 2
CPU:
0xff: cache data is in CPUID 4
0xfe: unknown
0xf0: 64 byte prefetching
$ cpuid -1 -l 0x18
CPU:
0x00000018 0x00: eax=0x00000004 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
$ perf -v
perf version 5.4.231
$ perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
I'm completely confused about what's wrong with my machine.
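A hedged note that may help here: on CPUs that report TLB details through CPUID leaf 0x18, the actual entries live in the subleaves, and the `cpuid -1 -l 0x18` output above only shows subleaf 0, whose `eax=0x00000004` just reports the maximum subleaf index. A sketch, assuming a cpuid build that accepts `-s`:
```
# query the non-zero subleaves of leaf 0x18; each populated one describes a TLB level/page size
for s in 1 2 3 4; do cpuid -1 -l 0x18 -s "$s"; done
```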
citrusyi
(1 rep)
May 24, 2023, 05:20 AM
• Last activity: Jun 20, 2024, 05:05 AM
1
votes
1
answers
413
views
Where does `perf script` timestamps come from?
The third column from `perf script` seems to be close to but not quite the uptime; where is that timestamp coming from? Is there a way to access that timestamp other than accessing sampled events?
$ perf record cat /proc/uptime
1392597.79 16669901.66
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]
$ perf script
cat 902536 1392640.417831: 1 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417849: 1 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417863: 3 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417876: 10 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417889: 33 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417902: 108 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417915: 351 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417929: 1136 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417942: 3657 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417958: 11701 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.417998: 34018 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.418060: 55936 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.418117: 77496 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.418195: 110190 cycles:u: ffffffffac000163 [unknown] ([unknown])
cat 902536 1392640.418296: 140857 cycles:u: ffffffffac000163 [unknown] ([unknown])
cat 902536 1392640.418422: 166643 cycles:u: ffffffffac000b47 [unknown] ([unknown])
cat 902536 1392640.418589: 187277 cycles:u: ffffffffac000163 [unknown] ([unknown])
cat 902536 1392640.418763: 198929 cycles:u: 7fdb1734b9a8 _dl_addr+0x108 (/usr/lib/libc-2.31.so)
cat 902536 1392640.418959: 209655 cycles:u: ffffffffac000163 [unknown] ([unknown])
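Possibly related (a sketch; it assumes a kernel and perf build where `--clockid`/`-k` accepts a named clock): recording with an explicit clock makes the third column directly comparable to `clock_gettime()` readings taken elsewhere.
```
# use CLOCK_MONOTONIC for the time fields in the perf events
perf record -k monotonic cat /proc/uptime
# the timestamps printed here should now line up with clock_gettime(CLOCK_MONOTONIC)
perf script | head -3
```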
Frederik Deweerdt
(3784 rep)
Apr 28, 2020, 09:58 PM
• Last activity: Jun 15, 2024, 03:37 PM
0
votes
2
answers
129
views
INST_RETIRED.ANY no longer works as a performance counter with perf on linux 6.7
With the older kernels INST_RETIRED.ANY (as well as many of the other counters documented in https://perfmon-events.intel.com/ahybrid.htm ) worked as counters for perf.
I am now using perf with a 6.7 kernel running on a Sapphire Rapids {Golden Cove} processor.
When I do the following
perf stat -e INST_RETIRED.ANY,cycles sleep 2
I get
event syntax error: 'INST_RETIRED.ANY,cycles'
\___ parser error
Is this the expected behavior?
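One low-risk check (a sketch; which symbolic event names exist depends on the kernel's event tables for this CPU model): see how the running perf spells the event, and fall back to the generic alias while investigating.
```
# is the symbolic event known to this perf build, and under which PMU/name?
perf list | grep -i inst_retired
# the architectural alias keeps working regardless of the JSON event tables
perf stat -e instructions,cycles sleep 2
```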
rkg125
(1 rep)
Apr 8, 2024, 05:31 AM
• Last activity: Apr 11, 2024, 02:05 AM
3
votes
2
answers
2309
views
how to set capabilities (setcap) on perf
I'd like to use the perf utility. I was following instructions to set up a privileged group of users who are permitted to execute performance monitoring and observability without limits (as instructed here: https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html). I added the group and limited access to users not in the group. I started having problems when assigning capabilities to the perf tool:
setcap cap_sys_admin,cap_sys_ptrace,cap_syslog=ep perf
I get an invalid arguments error saying
fatal error: Invalid argument
usage: setcap [-q] [-v] [-n <rootid>] (-r|-|<caps>) <filename> [ ... (-r|-|<capsN>) <filenameN> ]
Note <filename> must be a regular (non-symlink) file.
But running `stat` on perf gives me this:
File: ./perf
Size: 1622 Blocks: 8 IO Block: 4096 regular file
Device: 10307h/66311d Inode: 35260925 Links: 1
Access: (0750/-rwxr-x---) Uid: ( 0/ root) Gid: ( 1001/perf_users)
Access: 2021-12-03 13:08:48.923220351 +0100
Modify: 2021-11-05 17:02:56.000000000 +0100
Change: 2021-12-03 12:31:49.451991980 +0100
Birth: -
which says the file is a regular file. What could be the problem? How can I set the capabilities for the Perf tool?
Linux distribution: Ubuntu 20.04
EDIT:
Last 20 output lines of `strace setcap cap_sys_admin,cap_sys_ptrace,cap_syslog=ep perf`:
munmap(0x7f825054c000, 90581) = 0
prctl(PR_CAPBSET_READ, CAP_MAC_OVERRIDE) = 1
prctl(PR_CAPBSET_READ, 0x30 /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x28 /* CAP_??? */) = 1
prctl(PR_CAPBSET_READ, 0x2c /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x2a /* CAP_??? */) = -1 EINVAL (Invalid argument)
prctl(PR_CAPBSET_READ, 0x29 /* CAP_??? */) = -1 EINVAL (Invalid argument)
brk(NULL) = 0x55de3e858000
brk(0x55de3e879000) = 0x55de3e879000
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, NULL) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=1<
levente.nas
(133 rep)
Dec 3, 2021, 01:07 PM
• Last activity: Mar 29, 2024, 07:25 PM
2
votes
0
answers
85
views
A simple memcpy loop with stride of 8-16-32 Bytes, no L1 cache misses, what stalls backend cycles?
I am trying to understand the CPU cache performance in a single producer single consumer queue algorithm, but cannot pinpoint the cause of performance degradation in some cases. The following simplified test program runs with practically no L1 cache misses, but nevertheless it spends many cycles stalled in the CPU backend when its memory access pattern is somewhat sparser. **What causes the CPU backend to stall in such cases, when there are practically no L1 misses?** What should be measured to pinpoint the cause?
I understand that this question is not so much about Linux or perf_event, as it is more about the architecture of the CPU and caches. What would be a more appropriate stackexchange? Stackoverflow is sort of more about software? Serverfault or Superuser don't aim at these topics either. Electronics stackexchange does not quite target CPU architecture. Although, according to [this meta from 2013](https://meta.stackexchange.com/a/216253/194455) , Electronics may be the best fit. I post it here simply because all the tests were done on Linux, and from experience I think there are experts who may know what happens here, e.g. Gilles. Probably, the best is to post it on AMD forums. But I cannot get my draft posted there because of the error: "Post flooding detected (user tried to post more than 2 messages within 600 seconds)" when I have no posts actually posted. No wonder why their forums are so quiet.
My CPU is AMD Ryzen 5 PRO 4650G, which is Zen 2 "Renoir" with 192KiB L1d cache. The `memcpy` test program:
// demo_memcpy_test_speed-gap.c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#define PACKET_SIZE 8 // 16 32 // <--- it is really the stride of the memcpy over the mem array
#define SIZE_TO_MEMCPY 8 // memcpy only the first 8 bytes of the "packet"
const static long long unsigned n_packets = 512; // use few packets, to fit in L2 etc
static long long unsigned repeat = 1000*1000 * 2 * 2; // repeat many times to get enough stats in perf
const static long long unsigned n_max_data_bytes = n_packets * PACKET_SIZE;
#define CACHE_LINE_SIZE 64 // align explicitly just in case
alignas(CACHE_LINE_SIZE) uint8_t data_in [n_max_data_bytes];
alignas(CACHE_LINE_SIZE) uint8_t data_out [n_packets][PACKET_SIZE];
int main(int argc, char* argv[])
{
printf("memcpy_test.c standard\n");
printf("PACKET_SIZE %d SIZE_TO_MEMCPY %d\n", PACKET_SIZE, SIZE_TO_MEMCPY);
//
// warmup the memory
// i.e. access the memory to make sure Linux has set up the virtual mem tables
...
{
    printf("\nrun memcpy\n");
    long long unsigned n_bytes_copied = 0;
    long long unsigned memcpy_ops = 0;
    start_setup = clock();
    for (unsigned rep=0; rep<repeat; rep++) {
        for (unsigned long long i_packet=0; i_packet<n_packets; i_packet++) {
            // copy only the first SIZE_TO_MEMCPY bytes of each "packet";
            // consecutive packets are PACKET_SIZE bytes apart in both arrays
            memcpy(&data_out[i_packet][0], &data_in[i_packet*PACKET_SIZE], SIZE_TO_MEMCPY);
            n_bytes_copied += SIZE_TO_MEMCPY;
            memcpy_ops++;
        }
    }
    ...
}
}
It's built with `-O1` to get a really efficient loop with `memcpy`:
g++ -g -O1 ./demo_memcpy_test_speed-gap.c
The instructions of the `memcpy` loop, as seen with the `perf record` annotate option:
sudo perf record -F 999 -e stalled-cycles-backend -- ./a.out
sudo perf report
...select main
With `PACKET_SIZE` set to 8, the code is really efficient:
│ for (unsigned long long i_packet=0; i_packet
With `PACKET_SIZE` set to 1024 (the code is the same for 256, except `add $0x100,..` instead of `0x400`):
│ lea _end,%rsi
│140:┌─→mov %rbp,%rdx
│ │
│ │ lea data_in,%rax
│ │
│ │__fortify_function void *
│ │__NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
│ │size_t __len))
│ │{
│ │return __builtin___memcpy_chk (__dest, __src, __len,
│14a:│ mov (%rax),%rcx
│ │memcpy():
96.31 │ │ mov %rcx,(%rdx)
│ │
1.81 │ │ add $0x400,%rax
0.20 │ │ add $0x400,%rdx
1.12 │ │ cmp %rsi,%rax
0.57 │ │↑ jne 14a
│ │ sub $0x1,%edi
│ └──jne 140
I run it with `PACKET_SIZE` set to 8, 16, 32, and other values. The perf counts for 8 and 32:
sudo perf stat -e task-clock,instructions,cycles,stalled-cycles-frontend,stalled-cycles-backend \
-e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches \
-e l2_cache_accesses_from_dc_misses,l2_cache_hits_from_dc_misses,l2_cache_misses_from_dc_misses \
-- ./a.out
PACKET_SIZE 8 SIZE_TO_MEMCPY 8
...
Performance counter stats for './a.out':
503.43 msec task-clock # 0.998 CPUs utilized
10,323,618,071 instructions # 4.79 insn per cycle
# 0.01 stalled cycles per insn (29.11%)
2,154,694,815 cycles # 4.280 GHz (29.91%)
5,148,993 stalled-cycles-frontend # 0.24% frontend cycles idle (30.70%)
55,922,538 stalled-cycles-backend # 2.60% backend cycles idle (30.99%)
4,091,862,625 L1-dcache-loads # 8.128 G/sec (30.99%)
24,211 L1-dcache-load-misses # 0.00% of all L1-dcache accesses (30.99%)
18,745 L1-dcache-prefetches # 37.234 K/sec (30.37%)
30,749 l2_cache_accesses_from_dc_misses # 61.079 K/sec (29.57%)
21,046 l2_cache_hits_from_dc_misses # 41.805 K/sec (28.78%)
9,095 l2_cache_misses_from_dc_misses # 18.066 K/sec (28.60%)
PACKET_SIZE 32 SIZE_TO_MEMCPY 8
...
Performance counter stats for './a.out':
832.83 msec task-clock # 0.999 CPUs utilized
12,289,501,297 instructions # 3.46 insn per cycle
# 0.11 stalled cycles per insn (29.42%)
3,549,297,932 cycles # 4.262 GHz (29.64%)
5,552,837 stalled-cycles-frontend # 0.16% frontend cycles idle (30.12%)
1,349,663,970 stalled-cycles-backend # 38.03% backend cycles idle (30.25%)
4,144,875,512 L1-dcache-loads # 4.977 G/sec (30.25%)
772,968 L1-dcache-load-misses # 0.02% of all L1-dcache accesses (30.24%)
539,481 L1-dcache-prefetches # 647.767 K/sec (30.25%)
532,879 l2_cache_accesses_from_dc_misses # 639.839 K/sec (30.24%)
461,131 l2_cache_hits_from_dc_misses # 553.690 K/sec (30.04%)
14,485 l2_cache_misses_from_dc_misses # 17.392 K/sec (29.55%)
**There is a slight increase in L1 cache misses: from 0% at 8 bytes `PACKET_SIZE` up to 0.02% at 32 bytes. But can it justify why the backend stalls jumped from 2.6% to 38%? If not, then what else stalls the CPU backend?**
I get that a larger stride means that the `memcpy` loop moves from one L1 cache line to another one sooner. But if the lines are already in the cache, and there are practically no L1 miss events, as reported by `perf`, then **why would an access to a different cache line stall the backend?** Is it something about how the CPU issues the instructions in parallel? Maybe it cannot issue instructions that access different cache lines simultaneously?
Runs with `PACKET_SIZE` up to 1024 bytes are shown on the following somewhat busy plot:

The number on the data points shows the `PACKET_SIZE` parameter of the run, i.e. the stride of the `memcpy` access pattern. The X axis is millions of operations per second (Mops), an "operation" = 1 memcpy. The Y axis contains the metrics from perf: the percent of L1 accesses that missed, and the percent of cycles that are stalled in the backend and frontend.
In all these runs, the L2 accesses practically do not miss, i.e. the `l2_cache_misses_from_dc_misses` metric is always very low. Just for completeness, according to [anandtech](https://www.anandtech.com/show/14525/amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome/11), in the Zen 2 architecture the L1 latency is 4 cycles and the L2 latency is 12 cycles.
I am not sure why the frontend gets stalled, but that's what `perf` reports, and I believe it is real, because the effect of a stalled frontend is different from the backend. If you compare the runs with `PACKET_SIZE` 256 and 1024 on the plot: they have about the same L1 misses; 256 has about 77% of cycles stalled in the backend and 0% in the frontend; 1024 is the opposite, 77% stalled in the frontend and 0% in the backend. Yet, 1024 is much slower, because it has far fewer instructions issued per cycle. It's about 0.42 in the 1024 run, and 1.28 in the 256 run.
So, the CPU issues fewer instructions per cycle when it is stalled in the frontend rather than in the backend. I guess that's just how the frontend and backend work, i.e. the backend can run more in parallel. It would be well appreciated if someone could confirm or correct this guess. However, a more important question: why does the frontend get stalled? The frontend is supposed to just decode the instructions. The assembly does not really change with `PACKET_SIZE` set to 256 or 1024. **Then, what makes the frontend stall more at 1024 stride than at 256?**
The plot of IPC per Mops for all the `PACKET_SIZE` runs:

The run with `PACKET_SIZE` 8 is slightly off the line towards more Mops, i.e. faster than the trend of other values. That must be because of the more efficient instructions.
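For reference, one low-effort next measurement (a sketch; the grep pattern is only a guess at relevant event names, and which events exist depends on the CPU and perf build): enumerate the PMU's own stall/dispatch events and count a chosen subset next to the generic aliases used above.
```
# list candidate hardware events related to dispatch/retire stalls on this machine
perf list | grep -i -E 'stall|dispatch|retire'
# then, for example (event names to be filled in from the list above):
# perf stat -e cycles,stalled-cycles-backend,<chosen events> -- ./a.out
```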
xealits
(2267 rep)
Feb 4, 2024, 11:08 PM
• Last activity: Feb 5, 2024, 09:33 AM
0
votes
1
answers
215
views
Which tools for micro-benchmarking?
I'm not sure which tool to use for micro-benchmarking a C program.
I would like to measure both:
- Memory usage, RSS ( Resident Set Size )
- CPU cycles
I did use `perf record -g` and `perf script` piped into an awk script. This worked for finding the memory usage, but the CPU cycles weren't accurate because `perf record` gets the CPU cycles by sampling. `perf stat` is accurate but obviously doesn't give per-function stats. The perf_event library seems to be terribly documented and a meal of a task for simple benchmarking.
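For reference, the two whole-program numbers above can be had without sampling (a sketch; `./my_bench` is a placeholder), though it gives no per-function split:
```
# counting mode gives exact totals for cycles/instructions,
# and GNU time -v reports the maximum resident set size
perf stat -e cycles,instructions -- /usr/bin/time -v ./my_bench
```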
Having briefly looked at:
- SystemTap
- DTrace
- LTTng
- gperftools
- likwid
- PAPI
Which seem like decent, well-documented tools.
What would you recommend looking the most into? Or any other suggestions?
Thank you for your time.
jinTgreater
(1 rep)
Jan 19, 2024, 06:23 PM
• Last activity: Jan 19, 2024, 11:20 PM
1
votes
0
answers
29
views
What is futex-default-S?
I can't find more info about `futex-default-S`. Is this a system call, a module, or something else?
https://github.com/torvalds/linux/blob/06dc10eae55b5ceabfef287a7e5f16ceea204aa0/tools/perf/Documentation/perf-lock.txt#L102
$ perf lock report -t -F acquired,contended,avg_wait
Name acquired contended avg wait (ns)
perf 240569 9 5784
swapper 106610 19 543
:15789 17370 2 14538
ContainerMgr 8981 6 874
sleep 5275 1 11281
ContainerThread 4416 4 944
RootPressureThr 3215 5 1215
rcu_preempt 2954 0 0
ContainerMgr 2560 0 0
unnamed 1873 0 0
EventManager_De 1845 1 636
futex-default-S 1609 0 0
Mark K
(955 rep)
Oct 18, 2023, 08:24 AM
1
votes
0
answers
247
views
perf instruction count
So I've been playing with perf and assembly. I have the following program:
.intel_syntax noprefix
.global _start
_start:
mov cl, 2
mov ebx, 0b101
shr ebx, cl
and bl, 1
je do_stuff
do_stuff:
mov eax, 1
mov ebx, 0
int 0x80
And when I use it with `perf -e instructions:u ./shift` it shows 9 instructions instead of 8, and I could not find out why that is.
Is there any way to find out which is the +1 instruction?
Is it just one of the program instructions running in parallel that the CPU then retires?
If that's the case, how can I observe how that works at a lower level?
Compiling with: `as -msyntax=intel -mnaked-reg shift.s -o shift.o && ld shift.o -o shift`
`/proc/sys/kernel/perf_event_paranoid` is set to -1.
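One way to try to localize the extra count (a sketch; sampling every instruction is only reasonable here because the program is a handful of instructions, and skid may still smear the attribution):
```
# take a sample on every retired user-space instruction and see where they land
perf record -e instructions:u -c 1 ./shift
perf report --stdio
```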
Joao Luca
(11 rep)
Sep 11, 2023, 07:02 PM
7
votes
3
answers
11928
views
How do I generate the /sys/kernel/debug/tracing folder in kernel with yocto project?
I was trying to use `perf` on a Renesas target and I configured the yocto "local.conf" as shown in [this link](https://wiki.yoctoproject.org/wiki/Tracing_and_Profiling#General_Setup).
#avoid stripping binaries
INHIBIT_PACKAGE_STRIP = "1"
#add the debug information
EXTRA_IMAGE_FEATURES= "debug-tweaks tools-debug dbg-pkgs tools-profile"
#format the debug info into a readable format for PERF
PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'
`perf` is working, but I need to monitor the context switches, which requires using `perf timechart` and other commands that depend on perf events; however, those commands can't find the path "/sys/kernel/debug/tracing/events".
What should I do in order to get this folder and its files compiled with my kernel?
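For reference, even with the right kernel options that directory only appears once debugfs is mounted, so that is worth checking first (a sketch; whether `tracing/events` is populated also depends on CONFIG_FTRACE and the event options being enabled in the kernel config):
```
# mount debugfs if the image does not do it automatically
mount -t debugfs none /sys/kernel/debug
ls /sys/kernel/debug/tracing/events | head
```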
gemad
(83 rep)
Jul 10, 2017, 08:58 AM
• Last activity: Jun 2, 2023, 06:00 AM
0
votes
0
answers
117
views
Security of perf under root access
When I use perf it asked for more permissions:
You may not have permission to collect stats.
I was lazy, so I did `sudo perf` (on some short program I created myself), but I later realized you can allow perf a little more access without it being root. My question is: is it secure to use the command:
sudo perf
NewbMaster66
(1 rep)
Apr 7, 2023, 06:11 AM