Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
1
votes
0
answers
87
views
Is the TASK-PID in trace-cmd output the TID of the thread handling TAP interface I/O?
I'm working on an networking lab tool leveraging `QEMU`-based VM virtualization and `docker` technology to run VMs and containers respectively on a Linux `host`. The underlying lab connectivity is implemented by using linux `bridges`. I have a linux Ubuntu `guest` running inside a `QEMU VM` that fea...
I'm working on an networking lab tool leveraging
QEMU
-based VM virtualization and docker
technology to run VMs and containers respectively on a Linux host
. The underlying lab connectivity is implemented by using linux bridges
.
I have a linux Ubuntu guest
running inside a QEMU VM
that features a virtio-net
paravirualized interface with TAP
backend. Such TAP
interface is connected to a linux bridge's port
on the host.
root@eve-ng62-28:~# brctl show vnet0_3
bridge name bridge id STP enabled interfaces
vnet0_3 8000.d63b1f37e4ba no vnet0_9_2
vunl0_3_3
vunl0_7_0
vunl0_9_2
root@eve-ng62-28:~#
root@eve-ng62-28:~# ethtool -i vunl0_7_0
driver: tun
version: 1.6
firmware-version:
expansion-rom-version:
bus-info: tap
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
root@eve-ng62-28:~#
I'm using Linux ftrace
via trace-cmd
frontend to dig into some details, see also https://unix.stackexchange.com/questions/797717/tcp-checksum-offloading-on-virtio-net-paravirtualized-interfaces
root@eve-ng62-28:~# trace-cmd start -e net:netif_receive_skb_entry -f "name == 'vunl0_7_0'"
root@eve-ng62-28:~#
root@eve-ng62-28:~# trace-cmd show
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1 #P:48
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
qemu-system-x86-600348 b.... 66505.777999: netif_receive_skb_entry: dev=vunl0_7_0 napi_id=0x0 queue_mapping=1 skbaddr=0000000006a1cc35 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=3 hash=0x00000000 l4_hash=0 len=60 data_len=0 truesize=768 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0
As you can see, linux guest sends outgoing TCP
packets to the virtio-net
network interface setting CHECKSUM_PARTIAL(3)
tag in the ip_summed
field within sk_buff
struct.
My question is related to the TASK-PID
field shown by trace-cmd show
. 600348
is the PID
of the qemu-system-x86_64
process's instance associated to the VM.
As required I edit this to provide the question: is the TASK-PID
shown the PID
or TID
of the process/thread that is the context the TAP
driver runs into ?
CarloC
(385 rep)
Jul 9, 2025, 01:03 PM
• Last activity: Jul 12, 2025, 07:42 PM
0
votes
1
answers
31
views
Failed to register hook to tracepoint of signal_generate in Linux?
I am testing `FTRACE` in Linux VM (ubuntu 24.04) the kernel is `Linux VirtualBox 6.11.0-26-generic`. I wrote a kernel module to register a hook (probe) to kernel tracepoint of `signal_generate`, here is the code. ``` include #include #include static void trace_signal_generate_handler(void *ignore, i...
I am testing
FTRACE
in Linux VM (ubuntu 24.04) the kernel is Linux VirtualBox 6.11.0-26-generic
.
I wrote a kernel module to register a hook (probe) to kernel tracepoint of signal_generate
, here is the code.
include
#include
#include
static void trace_signal_generate_handler(void *ignore, int sig, struct kernel_siginfo *info,
struct task_struct *task, int type, int result)
{
printk("Signal %d generated for task %s (PID=%d), type=%d\n",
sig, task->comm, task->pid, type);
}
static int __init my_module_init(void) {
int ret;
ret = register_trace_signal_generate(trace_signal_generate_handler, NULL);
if (ret) {
pr_err("Failed to register tracepoint: %d\n", ret);
return ret;
}
return 0;
}
static void __exit my_module_exit(void) {
unregister_trace_signal_generate(trace_signal_generate_handler, NULL);
tracepoint_synchronize_unregister();
}
module_init(my_module_init);
module_exit(my_module_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("wangt13");
MODULE_DESCRIPTION("Signal Generate Tracepoint Hook Example");
When I compiled this module, I hit following error.
make: Entering directory '/usr/src/linux-headers-6.11.0-26-generic'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
You are using: gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0
MODPOST /home/wangt/develop/kermod/Module.symvers
ERROR: modpost: "__tracepoint_signal_generate" [/home/wangt/develop/kermod/tracep_reg.ko] undefined!
make: *** [scripts/Makefile.modpost:145: /home/wangt/develop/kermod/Module.symvers] Error 1
make: *** [/usr/src/linux-headers-6.11.0-26-generic/Makefile:1879: modpost] Error 2
make: *** [Makefile:224: __sub-make] Error 2
I checked the /usr/src/linux-headers-6.11.0-26-generic/include/trace/events/signal.h
, and found it is defined there.
36 /**
37 * signal_generate - called when a signal is generated
38 * @sig: signal number
39 * @info: pointer to struct siginfo
40 * @task: pointer to struct task_struct
41 * @group: shared or private
42 * @result: TRACE_SIGNAL_*
43 *
44 * Current process sends a 'sig' signal to 'task' process with
45 * 'info' siginfo. If 'info' is SEND_SIG_NOINFO or SEND_SIG_PRIV,
46 * 'info' is not a pointer and you can't access its field. Instead,
47 * SEND_SIG_NOINFO means that si_code is SI_USER, and SEND_SIG_PRIV
48 * means that si_code is SI_KERNEL.
49 */
50 TRACE_EVENT(signal_generate,
51
52 TP_PROTO(int sig, struct kernel_siginfo *info, struct task_struct *task,
53 int group, int result),
54
55 TP_ARGS(sig, info, task, group, result),
56
57 TP_STRUCT__entry(
58 __field( int, sig )
59 __field( int, errno )
60 __field( int, code )
61 __array( char, comm, TASK_COMM_LEN )
62 __field( pid_t, pid )
63 __field( int, group )
64 __field( int, result )
65 ),
.......
I don't know why there was the error in module compiling.
I tried to hook another kernel tracepoint of kmalloc
as follows,
ret = register_trace_kmalloc(probe_kmalloc, NULL);
and it was compiled and loaded without error.
So how to fix the error of hooking probe to tracepoint of signal_generate
, and what is the right way to hook my probe to the tracepoint of signal_generate
from within kernel module?
wangt13
(631 rep)
May 24, 2025, 02:59 AM
• Last activity: May 24, 2025, 12:38 PM
2
votes
1
answers
2379
views
How to get epoch time with trace-cmd (frontend for ftrace)?
-32578 [001] **133767.868869**: kfree: call_site=ffffffff810e6f06 ptr=(nil) How to make `trace-cmd` report time in epoch time? The time by default is I think time since application uptime. This is a requirement to pinpoint exact time of the event. `trace-cmd list -O` options doesn't help either.
-32578 **133767.868869**: kfree: call_site=ffffffff810e6f06 ptr=(nil)
How to make
trace-cmd
report time in epoch time? The time by default is I think time since application uptime. This is a requirement to pinpoint exact time of the event.
trace-cmd list -O
options doesn't help either.
themagicalyang
(191 rep)
Dec 12, 2016, 05:10 AM
• Last activity: May 2, 2025, 10:07 AM
0
votes
0
answers
12
views
Why does comm stay as the old name even after multiple task_rename and sched_process_exec events?
I am analyzing some trace-cmd (event) output filtered for some specific PID. The workflow is as follows (parse.py is just a custom python to parse for specific PIDs, $EVENTS_ARGS is just a list of the events to be traced): ``` trace-cmd record -b 10000 $EVENT_ARGS -o output_file -F -c executable_to_...
I am analyzing some trace-cmd (event) output filtered for some specific PID. The workflow is as follows (parse.py is just a custom python to parse for specific PIDs, $EVENTS_ARGS is just a list of the events to be traced):
trace-cmd record -b 10000 $EVENT_ARGS -o output_file -F -c executable_to_trace
trace-cmd report outputfile > trace.txt
parse.py trace.txt
And parsing produces a list of events for the chosen PID(s). Here's a relevant excerpt:
-1342 3231.385036: sched_switch: taskset:1342 D ==> swapper/4:0
sh-1342 3231.385357: task_rename: pid=1342 oldcomm=taskset newcomm=launch_function oom_score_adj=0
sh-1342 3231.385473: sched_switch: launch_function:1342 D ==> swapper/4:0
sh-1342 3231.385660: sched_process_exec: filename=/shared/loadgen/payload/launch_function.out pid=1342 old_pid=1342
sh-1342 3231.385830: sched_switch: launch_function:1342 D ==> swapper/4:0
However, even after these **task_rename** and **sched_process_exec** events, following events (*like sched_switch*) still log the comm field as sh, even though the actual process name should have changed.
To clarify further, trace-cmd seems to use *ftrace*'s *record-cmd* option which matches to the *comm* I am seeing in the trace-cmd output, but shouldn't that handle *task_rename*s better?
I know I can rewrite this during post-processing by replacing the comm with the latest task_rename, but I’m curious why the original trace still shows the outdated comm
*6.5.9 kernel*
Panagiotis Stefanis
(13 rep)
Apr 4, 2025, 01:03 PM
2
votes
0
answers
25
views
Task name shown as <...> in the output of EBPF program
I wrote a simple EBPF program which prints a message when the `execve` system call is invoked. I print the message using the `bpf_trace_printk` function. In the output, the task name for some processes is shown as ` `. For example, I ran the `sh` command and the output corresponding to that is the f...
I wrote a simple EBPF program which prints a message when the
execve
system call is invoked. I print the message using the bpf_trace_printk
function. In the output, the task name for some processes is shown as `. For example, I ran the
sh` command and the output corresponding to that is the following:
b' -588130 ...21 512047.173196: bpf_trace_printk: Hello World!'
As you can see above, the task name is shown as ` instead of
sh`. Why is that happening? How can I get the actual task name to be displayed?
russell.price
(53 rep)
Jan 14, 2025, 10:32 PM
1
votes
0
answers
68
views
How to interpert hwlat trace results on an ARM Cortex-A7?
I’m working with an STM32MP157D-DK1, trying to use the hwlat tracer for the board's arm Cortex-A7 CPU, to check for typical hardware latency. The following attempt was made to use the hwlat tracer on the board: ``` # mount -t debugfs debugfs /sys/kernel/debug # mount -t tracefs nodev /sys/kernel/tra...
I’m working with an STM32MP157D-DK1, trying to use the hwlat tracer for the board's arm Cortex-A7 CPU, to check for typical hardware latency.
The following attempt was made to use the hwlat tracer on the board:
# mount -t debugfs debugfs /sys/kernel/debug
# mount -t tracefs nodev /sys/kernel/tracing
# cd /sys/kernel/tracing
# cat hwlat_detector/window
1000000
# cat hwlat_detector/width
900000
# cat tracing_thresh
0
# cat tracing_max_latency
0
# cat tracing_cpumask
3
# cat hwlat_detector/mode
none [round-robin] per-cpu
# cat tracing_on
0
# cat current_tracer
nop
# echo hwlat > current_tracer
Assigning # echo hwlat > current_tracer
would set tracing_thresh
to 10, so:
# echo 0 > tracing_thresh
# cat tracing_thresh
0
# echo 1 > tracing_on
# sleep 1800
# cat trace
# tracer: hwlat
#
# entries-in-buffer/entries-written: 46/46 #P:2
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
-159 d.... 240.855853: #1 inner/outer(us): 0/2 ts:946699046.769930071 count:2
-159 d.... 241.871956: #2 inner/outer(us): 0/1 ts:946699047.786034280 count:2
-159 d.... 242.891945: #3 inner/outer(us): 0/1 ts:946699048.806022489 count:2
-159 d.... 243.911906: #4 inner/outer(us): 0/1 ts:946699049.825984865 count:2
-159 d.... 246.971967: #5 inner/outer(us): 0/1 ts:946699052.886044616 count:2
-159 d.... 252.071924: #6 inner/outer(us): 0/1 ts:946699057.986000660 count:2
-159 d.... 257.171915: #7 inner/outer(us): 0/1 ts:946699063.085992329 count:1
-159 d.... 272.471929: #8 inner/outer(us): 0/1 ts:946699078.386005920 count:2
-159 d.... 273.491923: #9 inner/outer(us): 0/1 ts:946699079.406001504 count:2
-159 d.... 277.571906: #10 inner/outer(us): 0/1 ts:946699083.485984589 count:2
-159 d.... 278.591918: #11 inner/outer(us): 0/1 ts:946699084.505996048 count:2
-159 d.... 318.371905: #12 inner/outer(us): 0/1 ts:946699124.285982817 count:2
-159 d.... 346.931903: #13 inner/outer(us): 0/1 ts:946699152.845980372 count:2
-159 d.... 393.851943: #14 inner/outer(us): 0/1 ts:946699199.766020728 count:2
-159 d.... 420.371906: #15 inner/outer(us): 0/1 ts:946699226.285984865 count:1
-159 d.... 487.691904: #16 inner/outer(us): 0/1 ts:946699293.605983314 count:2
-159 d.... 614.171939: #17 inner/outer(us): 0/1 ts:946699420.086017333 count:2
-159 d.... 646.811933: #18 inner/outer(us): 0/1 ts:946699452.726012015 count:1
-159 d.... 650.891904: #19 inner/outer(us): 0/1 ts:946699456.805984600 count:2
-159 d.... 705.971922: #20 inner/outer(us): 0/1 ts:946699511.886000877 count:2
-159 d.... 708.011907: #21 inner/outer(us): 0/1 ts:946699513.925987669 count:1
-159 d.... 763.091907: #22 inner/outer(us): 0/1 ts:946699569.005986862 count:2
-159 d.... 824.291904: #23 inner/outer(us): 0/1 ts:946699630.205985225 count:2
-159 d.... 893.651899: #24 inner/outer(us): 0/1 ts:946699699.565979716 count:2
-159 d.... 895.691904: #25 inner/outer(us): 0/1 ts:946699701.605985717 count:2
-159 d.... 912.011931: #26 inner/outer(us): 0/1 ts:946699717.926011266 count:2
-159 d.... 913.031903: #27 inner/outer(us): 0/1 ts:946699718.945983725 count:2
-159 d.... 930.371903: #28 inner/outer(us): 0/1 ts:946699736.285984275 count:2
-159 d.... 934.451904: #29 inner/outer(us): 0/1 ts:946699740.365986111 count:2
-159 d.... 1034.411903: #30 inner/outer(us): 0/1 ts:946699840.325984200 count:2
-159 d.... 1049.711894: #31 inner/outer(us): 0/1 ts:946699855.625976207 count:1
-159 d.... 1211.891897: #32 inner/outer(us): 0/1 ts:946700017.805979076 count:2
-159 d.... 1222.091906: #33 inner/outer(us): 0/1 ts:946700028.005987831 count:2
-159 d.... 1336.331903: #34 inner/outer(us): 0/1 ts:946700142.245984969 count:2
-159 d.... 1338.371900: #35 inner/outer(us): 0/1 ts:946700144.285983803 count:2
-159 d.... 1356.731931: #36 inner/outer(us): 0/1 ts:946700162.646014437 count:2
-159 d.... 1477.091900: #37 inner/outer(us): 0/1 ts:946700283.005983161 count:2
-159 d.... 1514.831892: #38 inner/outer(us): 0/1 ts:946700320.745975929 count:2
-159 d.... 1530.131903: #39 inner/outer(us): 0/1 ts:946700336.045985853 count:2
-159 d.... 1533.191895: #40 inner/outer(us): 0/1 ts:946700339.105978979 count:1
-159 d.... 1534.211901: #41 inner/outer(us): 0/1 ts:946700340.125985563 count:2
-159 d.... 1571.951896: #42 inner/outer(us): 0/1 ts:946700377.865979289 count:2
-159 d.... 1639.271892: #43 inner/outer(us): 0/1 ts:946700445.185976197 count:2
-159 d.... 1664.771895: #44 inner/outer(us): 0/1 ts:946700470.685979917 count:2
-159 d.... 1679.051899: #45 inner/outer(us): 0/1 ts:946700484.965983799 count:1
-159 d.... 1768.811900: #46 inner/outer(us): 0/1 ts:946700574.725985300 count:2
After a half an hour sleep
, this is what is logged to file trace
. I'm trying to understand this result. Trying to take assistance in Steven Rosetedt's walkthrough of the hwlat tracer output .
In another 20 minutes test I ran with similar conditions, there also appeared a line where numerous encounters of a 3 millisecond latency occurred, rather than the frequent 0, 1 or 2 microsec. It looks like this (see #16's count):
# tracer: hwlat
#
# entries-in-buffer/entries-written: 20/20 #P:2
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
-169 d.... 328.696323: #1 inner/outer(us): 0/2 ts:946685131.612004155 count:2
-169 d.... 329.711962: #2 inner/outer(us): 0/1 ts:946685132.627644489 count:2
-169 d.... 332.771947: #3 inner/outer(us): 0/1 ts:946685135.687627240 count:2
-169 d.... 333.791924: #4 inner/outer(us): 0/1 ts:946685136.707605282 count:2
-169 d.... 334.811973: #5 inner/outer(us): 0/1 ts:946685137.727653991 count:2
-169 d.... 335.831926: #6 inner/outer(us): 0/1 ts:946685138.747608742 count:2
-169 d.... 336.851921: #7 inner/outer(us): 0/1 ts:946685139.767602242 count:2
-169 d.... 345.011927: #8 inner/outer(us): 0/1 ts:946685147.927608371 count:2
-169 d.... 365.411950: #9 inner/outer(us): 0/1 ts:946685168.327631214 count:2
-169 d.... 455.171927: #10 inner/outer(us): 0/1 ts:946685258.087609424 count:2
-169 d.... 471.491916: #11 inner/outer(us): 0/1 ts:946685274.407599140 count:2
-169 d.... 616.331919: #12 inner/outer(us): 0/1 ts:946685419.247601959 count:2
-169 d.... 805.031908: #13 inner/outer(us): 0/1 ts:946685607.947590632 count:2
-169 d.... 1285.451914: #14 inner/outer(us): 0/1 ts:946686088.367599861 count:2
-169 d.... 1324.211913: #15 inner/outer(us): 0/1 ts:946686127.127599213 count:2
-169 d.... 1540.451925: #16 inner/outer(us): 1/3 ts:946686344.024041733 count:457
-169 d.... 1541.471970: #17 inner/outer(us): 0/2 ts:946686344.387656150 count:2
-169 d.... 1546.571943: #18 inner/outer(us): 0/1 ts:946686349.487628111 count:2
-169 d.... 1547.591934: #19 inner/outer(us): 0/2 ts:946686350.507622153 count:2
-169 d.... 1548.611929: #20 inner/outer(us): 0/1 ts:946686351.527615528 count:2
Since then I have made more half an hour/whole hour tests, their results are similar to those of the first log file shown.
Please note that according to the hardware latency detector documentation :
> The tracer hwlat_detector is a special purpose tracer that is used to
> detect large system latencies induced by the behavior of certain
> underlying hardware or firmware, independent of Linux itself. The code
> was developed originally to detect SMIs (System Management Interrupts)
> on x86 systems, **however there is nothing x86 specific about this
> patchset.**
I'm assuming that the reasons that x86 processors need SMIs (temperature management, ...) are inevitably present on arm cores too.
How should these results be interpreted? \
Hardware latency from the Cortex-A7 doesn't exceed 3 microseconds?\
Is there something wrong with what I’m doing?
user656857
(11 rep)
Oct 12, 2024, 11:58 AM
• Last activity: Oct 17, 2024, 11:24 AM
4
votes
1
answers
400
views
When (and how) are file permission/access decisions made in kernel space?
(I think) I have a relatively good understanding of how Linux permissions work - traditional Unix DAC permissions, and how they're represented, security context in relation to processes (i.e. cred struct, capabilities, etc.), hooks for LSMs, etc., and the link they have to (E)UID/E(GID), etc. What I...
(I think) I have a relatively good understanding of how Linux permissions work - traditional Unix DAC permissions, and how they're represented, security context in relation to processes (i.e. cred struct, capabilities, etc.), hooks for LSMs, etc., and the link they have to (E)UID/E(GID), etc. What I don't understand, and am having difficulty finding in kernel code, and from using
(via -cmd
), is when these checks are actually done (presumably) in kernel space, and at *which point* the access decision is actually made.
To elaborate:
I have a root-owned directory /dir
chmod-ed to -x------
(*500*). As an unprivileged user, I run /bin/ls /dir
and receive a permission denied error, as expected.
If I run -s 10000 /bin/ls /dir
I can see the
syscall, which returns a -1
back to user space, and sets
to
(i.e. "permission denied"). There's then a subsequent
syscall to print the error message to STDERR, all as expected. At no point here do I ever see
or
syscalls being called.
If I run the same test using trace-cmd, as -cmd record -p function_graph -F /bin/ls
with the same unprivileged user doing /bin/ls /dir
in a separate terminal tab (where -cmd
will trigger on the execution of the
binary), I can basically map most of the syscalls I'm seeing, one-to-one, between strace and trace-cmd outputs.
However, the discrepancy is that, in the trace-cmd output after the
syscall, rather than seeing the write outputs and it exiting, I can see
calls being made (and the full call tree of iterating over items in the directory, etc.). Because of limitations with ftrace, I guess (and more likely limitations in my understanding), I'm not seeing syscall arguments or returns in the output, so I can't see when, if at all, I'm getting permission errors, but from my limited understanding, the kernel appears to be actually doing the full directory listing, but just not returning the output back to user space.
So can anyone explain what's actually happening - why -cmd
shows the kernel actually doing the
syscalls, while
exits after the open call fails? Most of the forum posts (on here and elsewhere) with questions pertaining to permissions suggest that checks are done on opening the file, but provide little more detail than that, though this aligns with what I can see with
(but not
/-cmd
).
Looking at the kernel source, and following that through manually from the
/
syscalls, I can see functions call sequences such as -> do_inode_permission -> generic_permission -> acl_permission_check
with the latter actually doing comparisons on the classic Unix permissions, and returning from this. I can also see the hooks elsewhere for the LSMs, like SELinux, Apparmor, SMACK, Tomoyo, etc., which I guess will do their bits.
My assumption would be that the directory is attempted to be opened, permission checks would happen, kernel says no, and returns the error back to user space, and it would never get to the point of actually doing the directory listing. However, based on the -cmd
output appearing to actually show the directory listing happening, I'm not sure any more.
Any information - in as much technical depth as possible - would be greatly appreciated. Also, I understand that using Systemtap may be able to offer more information to me, but my knowledge of that is *very* limited!
For awareness, my testing has been on Kernel versions 2.6.35 and 6.5.0 on Debian-based distros, with comparatively similar results.
genericuser99
(119 rep)
Mar 26, 2024, 12:39 PM
• Last activity: Mar 26, 2024, 06:23 PM
2
votes
0
answers
180
views
What's the difference between ftrace and blktrace tracepoints?
For instance, we have `/sys/kernel/debug/tracing/events/block/block_bio_complete` and `block_rq_complete` which is basically IO completion trace points for `ftrace`, I believe. Using `sudo blktrace -d /dev/sdb1 -o - | blkparse -i -` we might see a line like this: 8,33 1 3 0.001086286 0 C RA 11581378...
For instance, we have
/sys/kernel/debug/tracing/events/block/block_bio_complete
and block_rq_complete
which is basically IO completion trace points for ftrace
, I believe. Using sudo blktrace -d /dev/sdb1 -o - | blkparse -i -
we might see a line like this:
8,33 1 3 0.001086286 0 C RA 1158137880 + 8
^ IO completion action
The C
here is the IO completion event according to the documentation but is this the same as block_bio_complete
or is it somehow different. Checking the source code, I can see at https://elixir.bootlin.com/linux/v5.6/source/block/blk-core.c#L1441 :
trace_block_rq_complete(req, blk_status_to_errno(error), nr_bytes);
And at https://elixir.bootlin.com/linux/v5.6/source/block/bio.c#L1863 :
trace_block_bio_complete(bio->bi_disk->queue, bio, blk_status_to_errno(bio->bi_status));
Which of the two is the C
action? Or does blktrace
use a separate set of actions - searches tell me ftrace
and blktrace
use different reporting mechanisms, but I cannot work out how each tracepoint for both mechanisms relate to kernel source code. Is there an easy way to map tracepoints to source code?
Ken Y-N
(251 rep)
Mar 31, 2023, 01:20 AM
5
votes
2
answers
852
views
Change format of syscall event trace output to ftrace
I enabled ftrace event tracing for sys_enter_openat syscall. The respective output format given at events/syscalls/sys_enter_openat/format is print fmt: "dfd: 0x%08lx, filename: 0x%08lx, flags: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->filename)), ((unsigned long)(R...
I enabled ftrace event tracing for sys_enter_openat syscall. The respective output format given at events/syscalls/sys_enter_openat/format is
print fmt: "dfd: 0x%08lx, filename: 0x%08lx, flags: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->filename)), ((unsigned long)(REC->flags)), ((unsigned long)(REC->mode))
As expected a sample output line to ftrace is something like
msm_irqbalance-1338 ...1 211710.033931: sys_openat(dfd: ffffff9c, filename: 5af693f224, flags: 2, mode: 0)
Is there a way to change output format such that
filename: 5af693f224
can be shown as filename:
instead of hex(5af693f224)
?
So basically is there a way to change output format while tracing a particular event(eg. sys_enter_openat above) to ftrace.
I guess this would have been possible using systemtap or krpobe but my setup does not allow its use as of now.
Madhur Rawat
(151 rep)
Apr 15, 2019, 06:41 AM
• Last activity: Dec 16, 2021, 08:29 PM
1
votes
0
answers
41
views
How to get the cpu number of a trace event in a custom probe function?
I am registering a probe function for trace events using a Kernel module. For eg the probe function for the trace event sched_kthread_stop would be: void handle_kthread_stop(void *data, struct task_struct *t) And register this using tracepoint_probe_register() Now an actual trace in the trace buffer...
I am registering a probe function for trace events using a Kernel module.
For eg the probe function for the trace event sched_kthread_stop would be:
void handle_kthread_stop(void *data, struct task_struct *t)
And register this using
tracepoint_probe_register()
Now an actual trace in the trace buffer looks like this:
bash-9002 d..5 2145.303974: sched_wakeup: comm=kworker/u8:1 pid=46 prio=120 target_cpu=003
Here is the CPU ID.
I want to get this CPU ID info in the probe function.
Can someone tell me how to get the CPU ID in the probe function?
Madhuparna Bhowmik
(163 rep)
Apr 19, 2020, 07:02 PM
• Last activity: Apr 21, 2020, 06:16 AM
2
votes
1
answers
193
views
Multiple Tracing Events Readers
**Question** Does the tracing pipes in Linux, for tracing events, accepts simultaneous readers? **Context** I am trying to get traces from the ```RAS``` subsystem while running some benchmarks. This subsystem reports hardware error events using trace events. The tool I am using to get the events in...
**Question**
Does the tracing pipes in Linux, for tracing events, accepts simultaneous readers?
**Context**
I am trying to get traces from the
subsystem while running some benchmarks.
This subsystem reports hardware error events using trace events.
The tool I am using to get the events in the **RASdaemon** tool.
I know the tool subscribe to the
FIFO on each cpu.
Furthermore, the benchmark subscribe to performance events which, **I believe that**, are also advertised through the same pipes.
The
works like the
which has the following description:
> The output is the same as the “trace” file but this file is meant to be streamed with live tracing. Reads from this file will block until new data is retrieved. Unlike the “trace” file, this file is a consumer. This means reading from this file causes sequential reads to display more current data. Once data is read from this file, it is consumed, and will not be read again with a sequential read. The “trace” file is static, and if the tracer is not adding more data, it will display the same information every time it is read. Unlike the “trace” file, opening this file for reading will not temporarily disable tracing.
I understand the concept of FIFO being a consumer file and why **each read will produce different data**.
However, do the Kernel make any arrangements for multiple subscribers (reader) to these events?
Alexandre Santana
(21 rep)
Jan 27, 2020, 02:52 PM
• Last activity: Jan 31, 2020, 11:34 AM
2
votes
1
answers
240
views
How can I send dmesg printout to ftrace subsystem?
How can I send `dmesg` printout to `ftrace` subsystem? I like correlate the `dmesg` msg with the functions call `graph` in `ftrace`. Thanks
How can I send
dmesg
printout to ftrace
subsystem?
I like correlate the dmesg
msg with the functions call graph
in ftrace
.
Thanks
tony-p-lee
(141 rep)
Aug 7, 2019, 01:00 AM
• Last activity: Aug 21, 2019, 11:45 PM
3
votes
0
answers
332
views
Linux task state different in /proc/pid/stat compared to Ftrace output of sched events
I'm trying to trace the flow of a piece of code just to understand the points where the tasks go into D/S/R states (uninterruptible/interruptible/running). I do understand the meaning of each state, however doing an exercise shows confusing results and I wasn't able till now to figure out why. The p...
I'm trying to trace the flow of a piece of code just to understand the points where the tasks go into D/S/R states (uninterruptible/interruptible/running). I do understand the meaning of each state, however doing an exercise shows confusing results and I wasn't able till now to figure out why.
The piece of code I'm running:
void foo() {
pthread_setname_np(pthread_self(), "lock_func/1");
while(quit == false)
{
for(int64_t idx=0; idx lock(mutex1);
while(!flag)
cv1.wait(lock);
flag = false;
}
int main(int argc, char *argv[]) {
...
std::thread t1 = std::thread(foo);
std::thread t2 = std::thread(bar);
...
if (mlockall( MCL_CURRENT | MCL_FUTURE ) -0 d... 335998: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
lock_func/1-30296 d... 335999: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
-0 d... 336000: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
lock_func/1-30296 d... 336001: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
-0 d... 336002: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
lock_func/1-30296 d... 336003: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
-0 d... 336004: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
lock_func/1-30296 d... 336005: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
-0 d... 336006: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
lock_func/1-30296 d... 336007: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
-0 d... 336008: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
lock_func/1-30296 d... 336009: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
-0 d... 336010: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
lock_func/1-30296 d... 336011: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120
-0 d... 336012: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
Observe how prev_state for my lock_func1 (bar) is always **D** (uninterruptible). Looking at the code a context switch can happen only on the condition variable sleep, or when time slice is finished (running in CFS scheduler)
cat /proc/30295/task/30296/wchan
futex_wait_queue_me
which according to the docs is a interruptible function.
Checking multiple times the task state in proc shows always **S**:
cat /proc/30295/task/30296/stat
30295 (lock_func/1) S ...
So the stat shows S-interruptible state, the **futex_wait_queue_me** is interruptible, however, the ftrace show always at the context switch time **prev_state=D**. Running this code and recording several seconds with ftrace shows only **prev_state=D** for my lock_func/1 (bar). I never get the S-interruptible.
Can someone explain why? Is the scheduler doing some state change in between?
Thanks.
mdaniel
(151 rep)
Aug 7, 2019, 08:27 AM
1
votes
1
answers
387
views
How writing in debugfs controls ftrace?
Ftrace is controlled by manipulating the debug fs files. But how is it possible ? How kernel be aware of writing in this files and start its required actions?for exmple if 1 is written to tracing_on , kernel turns on tracing . Is it something related to a special file structure or kernel subscribes...
Ftrace is controlled by manipulating the debug fs files. But how is it possible ? How kernel be aware of writing in this files and start its required actions?for exmple if 1 is written to tracing_on , kernel turns on tracing .
Is it something related to a special file structure or kernel subscribes to changes of debugfs or I'm missing something obvious ?
fjahan
(11 rep)
May 26, 2019, 01:51 PM
• Last activity: May 26, 2019, 06:15 PM
0
votes
1
answers
242
views
functions folder in trace_stat folder of ftrace
I am learning ftrace, I read from the following link that I can enable function profiling using the 'function_profile_enabled' file. mount -t tracefs nodev /sys/kernel/tracing cd /sys/kernel/tracing echo 1 > function_profile_enabled echo "function" > current_tracer https://www.kernel.org/doc/html/v4...
I am learning ftrace, I read from the following link that I can enable function profiling using the 'function_profile_enabled' file.
mount -t tracefs nodev /sys/kernel/tracing
cd /sys/kernel/tracing
echo 1 > function_profile_enabled
echo "function" > current_tracer
https://www.kernel.org/doc/html/v4.18/trace/ftrace.html
The above link tells that the information is present in trace_stats/function (function0, function1..)
$ ls trace_stat/function*
trace_stat/function0 trace_stat/function15 trace_stat/function21 trace_stat/function28 trace_stat/function6
trace_stat/function1 trace_stat/function16 trace_stat/function22 trace_stat/function29 trace_stat/function7
trace_stat/function10 trace_stat/function17 trace_stat/function23 trace_stat/function3 trace_stat/function8
trace_stat/function11 trace_stat/function18 trace_stat/function24 trace_stat/function30 trace_stat/function9
trace_stat/function12 trace_stat/function19 trace_stat/function25 trace_stat/function31
trace_stat/function13 trace_stat/function2 trace_stat/function26 trace_stat/function4
trace_stat/function14 trace_stat/function20 trace_stat/function27 trace_stat/function5
root tracing
I only have 4 cores on my Ubuntu VM
$ grep -c ^processor /proc/cpuinfo
4
Can anyone explain me the actual significance of this file function in trace_stat folder
functions folder in trace_stat folder
md.jamal
(301 rep)
Jan 26, 2019, 09:08 AM
• Last activity: May 23, 2019, 05:43 PM
0
votes
1
answers
172
views
Performance reduction activating trace-functionality in kernel
Is there any performance reduction to be expected from just activating trace (ftrace) functionality in the Linux kernel, but not using it at run time?
Is there any performance reduction to be expected from just activating trace (ftrace) functionality in the Linux kernel, but not using it at run time?
arash javan
(381 rep)
Mar 1, 2019, 08:35 AM
• Last activity: May 23, 2019, 05:40 PM
1
votes
0
answers
930
views
how to use ftrace to function_trace a kernel modules interrupt?
I have a Banana Pi Arm device with 2 cores (1.2 GHz) running a Debian kernel which I enabled ftrace kernel debug options. I want to determine the amount of time spent in a kernel module function. The function is of the form static irq_handler_t vtgpio_irq_handler(unsigned int irq, void *dev_id, stru...
I have a Banana Pi Arm device with 2 cores (1.2 GHz) running a Debian kernel which I enabled ftrace kernel debug options. I want to determine the amount of time spent in a kernel module function. The function is of the form
static irq_handler_t vtgpio_irq_handler(unsigned int irq, void *dev_id, struct pt_regs *regs);
In my debug file system after I load the module I see the name exposed as a filter. I am tracing all functions. I tried to enable the
trace_printk
statements but I can not see it printed in my trace file. Here is the implementation of the function:
/** @brief handler for rising signal */
static irq_handler_t vtgpio_irq_handler(unsigned int irq, void *dev_id, struct pt_regs *regs) {
trace_printk(KERN_INFO "rise\n");
pause();
return (irq_handler_t) IRQ_HANDLED; // return that we all good
}
I know that the function is being called because I see the printk
statement in dmesg | tail
if I replace trace_printk
with printk
. Also I know that some of the modules functions are traceable like writing to sysfs store functions. For example I can see
static struct kobj_attribute mode_attr = __ATTR(mode, 0660, mode_show, mode_store);
The ftrace from writing to mode_store
.
1) | sys_write() {
1) 0.708 us | fget_light();
1) | vfs_write() {
1) 0.875 us | rw_verify_area();
1) | sysfs_write_file() {
1) 0.459 us | mutex_lock();
1) | get_zeroed_page() {
1) | __get_free_pages() {
1) | __alloc_pages_nodemask() {
1) 0.708 us | next_zones_zonelist();
1) | get_page_from_freelist() {
1) 0.458 us | next_zones_zonelist();
1) 1.000 us | __zone_watermark_ok();
1) | kmap_atomic() {
1) 0.625 us | add_preempt_count();
1) 0.625 us | page_address();
1) 9.250 us | }
1) | __kunmap_atomic() {
1) 0.750 us | sub_preempt_count();
1) 5.459 us | }
1) + 41.000 us | }
1) + 53.209 us | }
1) 0.542 us | page_address();
1) + 62.541 us | }
1) + 66.958 us | }
1) 0.917 us | sysfs_get_active();
1) | mode_store() {
1) | /* VT-GPIO_TEST: resume */
1) | gpio_direction_output() {
1) | _raw_spin_lock_irqsave() {
1) | __raw_spin_lock_irqsave() {
1) 0.834 us | add_preempt_count();
1) 5.375 us | }
1) 9.458 us | }
1) 0.834 us | gpio_ensure_requested();
1) | _raw_spin_unlock_irqrestore() {
1) 0.708 us | sub_preempt_count();
1) 5.125 us | }
1) | sunxi_gpio_direction_out() {
1) 1.125 us | gpio_set_one_pin_io_status();
1) 1.167 us | gpio_write_one_pin_value();
1) + 12.333 us | }
1) + 47.250 us | }
1) | gpio_direction_input() {
1) | _raw_spin_lock_irqsave() {
1) | __raw_spin_lock_irqsave() {
1) 0.625 us | add_preempt_count();
1) 5.167 us | }
1) 9.667 us | }
1) 0.583 us | gpio_ensure_requested();
1) | _raw_spin_unlock_irqrestore() {
1) 0.708 us | sub_preempt_count();
1) 5.166 us | }
1) | sunxi_gpio_direction_in() {
1) 0.667 us | gpio_set_one_pin_io_status();
1) 5.208 us | }
1) + 38.833 us | }
1) ! 105.833 us | }
1) 0.875 us | sysfs_put_active();
1) 0.875 us | mutex_unlock();
1) ! 204.500 us | }
1) 0.416 us | __fsnotify_parent();
1) | fsnotify() {
1) | __srcu_read_lock() {
1) 0.583 us | add_preempt_count();
1) 0.667 us | sub_preempt_count();
1) 9.542 us | }
1) | __srcu_read_unlock() {
1) 0.584 us | add_preempt_count();
1) 0.667 us | sub_preempt_count();
1) 9.041 us | }
1) + 30.125 us | }
1) ! 252.708 us | }
1) ! 262.292 us | }
The mode_store
in this trace actually writes HIGH or LOW to a pin that the IRQ handler is registered to. Any tips will be very helpful, I am trying to measure the overhead of the IRQ handler.
channon
(193 rep)
Jul 24, 2018, 08:28 PM
• Last activity: Jul 27, 2018, 05:28 PM
2
votes
0
answers
643
views
Tool for displaying timeline view of kernel/userspace execution trace
I've used the ftrace tool **trace-cmd** for recording scheduling tracepoints and syscall tracepoints. I can inspect them visually using the [kernelshark][1] GUI program. It dislays per-core timelines and per-task timelines as seen in the screenshot below. [![enter image description here][2]][2] I ca...
I've used the ftrace tool **trace-cmd** for recording scheduling tracepoints and syscall tracepoints. I can inspect them visually using the kernelshark GUI program. It dislays per-core timelines and per-task timelines as seen in the screenshot below.
I can see when interrupts are triggered, when the scheduler is invoked and when system calls are being made and when they return. But when a system call is being made, kernelshark makes no distinction between the task running in userspace and it running in kernelspace.
I want a tool that allows me to see for a specific core when it is executing an interrupt handler; the currently traced process in userspace; or some task in userspace. Something similar to this picture:
Does anyone know of a tool for transforming ftrace data into that sort of format, be it as an svg file, a chrome trace-file or using any other format or tool?


Daniel Näslund
(1932 rep)
Feb 13, 2018, 04:22 PM
1
votes
0
answers
128
views
PMU and tracepoints
I am searching for information about the mechanism of creation of tracepoints. From this topic [https://unix.stackexchange.com/questions/326621/what-are-kernel-pmu-event-s-in-perf-events-list][1] and this blog [brendangregg][2] I've got that tracepoints are some macros inside the kernel code which a...
I am searching for information about the mechanism of creation of tracepoints. From this topic https://unix.stackexchange.com/questions/326621/what-are-kernel-pmu-event-s-in-perf-events-list and this blog brendangregg I've got that tracepoints are some macros inside the kernel code which are called when they are enabled, but I'm a little bit confused how tracepoint connected with PMU (because it mentioned in both links)?
Igor Masternoy
(111 rep)
Jan 20, 2018, 03:42 PM
1
votes
0
answers
386
views
Function profiling on multi-core platform for specific PID using ftrace
I'm trying to use ftrace for function profiling based on function_profile_enabled as described [here][1]. I would like to capture function calls only for a certain PID using set_ftrace_pid as described in the same link. However, when I set PID and do function_profile_enabled it still captures functi...
I'm trying to use ftrace for function profiling based on function_profile_enabled as described here . I would like to capture function calls only for a certain PID using set_ftrace_pid as described in the same link. However, when I set PID and do function_profile_enabled it still captures function calls on all CPUs. So, considering context-switching it is really hard to filter out unwanted CPUs from /sys/kernel/debug/tracing/trace_stat/function*. The function_graph feature works with set_ftrace_pid just fine. Brandan Gregg had the same problem as found here , but I'm using 4.4 kernel, which is generations after.
Dimon
(131 rep)
Oct 23, 2017, 11:32 PM
Showing page 1 of 20 total questions