Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

1 votes

0 answers

87 views

Is the TASK-PID in trace-cmd output the TID of the thread handling TAP interface I/O?

virtual-machine qemu tracing virtio ftrace

I'm working on an networking lab tool leveraging `QEMU`-based VM virtualization and `docker` technology to run VMs and containers respectively on a Linux `host`. The underlying lab connectivity is implemented by using linux `bridges`. I have a linux Ubuntu `guest` running inside a `QEMU VM` that fea...

                                  I'm working on an networking lab tool leveraging QEMU-based VM virtualization and docker technology to run VMs and containers respectively on a Linux host. The underlying lab connectivity is implemented by using linux bridges.

I have a linux Ubuntu guest running inside a QEMU VM that features a virtio-net paravirualized interface with TAP backend. Such TAP interface is connected to a linux bridge's port on the host.

    root@eve-ng62-28:~# brctl show vnet0_3
    bridge name     bridge id               STP enabled     interfaces
    vnet0_3         8000.d63b1f37e4ba       no              vnet0_9_2
                                                            vunl0_3_3
                                                            vunl0_7_0
                                                            vunl0_9_2
    root@eve-ng62-28:~#
    root@eve-ng62-28:~# ethtool -i vunl0_7_0
    driver: tun
    version: 1.6
    firmware-version:
    expansion-rom-version:
    bus-info: tap
    supports-statistics: no
    supports-test: no
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: no
    root@eve-ng62-28:~#

I'm using Linux ftrace via trace-cmd frontend to dig into some details, see also https://unix.stackexchange.com/questions/797717/tcp-checksum-offloading-on-virtio-net-paravirtualized-interfaces 

    root@eve-ng62-28:~# trace-cmd start -e net:netif_receive_skb_entry -f "name == 'vunl0_7_0'"
    root@eve-ng62-28:~#
    root@eve-ng62-28:~# trace-cmd show
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 1/1   #P:48
    #
    #                                _-----=> irqs-off/BH-disabled
    #                               / _----=> need-resched
    #                              | / _---=> hardirq/softirq
    #                              || / _--=> preempt-depth
    #                              ||| / _-=> migrate-disable
    #                              |||| /     delay
    #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
    #              | |         |   |||||     |         |
    qemu-system-x86-600348   b.... 66505.777999: netif_receive_skb_entry: dev=vunl0_7_0 napi_id=0x0 queue_mapping=1 skbaddr=0000000006a1cc35 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=3 hash=0x00000000 l4_hash=0 len=60 data_len=0 truesize=768 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0

As you can see, linux guest sends outgoing TCP packets to the virtio-net network interface setting CHECKSUM_PARTIAL(3) tag in the ip_summed field within sk_buff struct. 

My question is related to the TASK-PID field shown by trace-cmd show. 600348 is the PID of the qemu-system-x86_64 process's instance associated to the VM.

As required I edit this to provide the question: is the TASK-PID shown the PID or TID of the process/thread that is the context the TAP driver runs into ?
                                

CarloC (385 rep)

Jul 9, 2025, 01:03 PM • Last activity: Jul 12, 2025, 07:42 PM

0 votes

1 answers

31 views

Failed to register hook to tracepoint of signal_generate in Linux?

linux kernel-modules c ftrace

I am testing `FTRACE` in Linux VM (ubuntu 24.04) the kernel is `Linux VirtualBox 6.11.0-26-generic`. I wrote a kernel module to register a hook (probe) to kernel tracepoint of `signal_generate`, here is the code. ``` include #include #include static void trace_signal_generate_handler(void *ignore, i...

I am testing FTRACE in Linux VM (ubuntu 24.04) the kernel is Linux VirtualBox 6.11.0-26-generic. I wrote a kernel module to register a hook (probe) to kernel tracepoint of signal_generate, here is the code.

include 
#include 
#include 

static void trace_signal_generate_handler(void *ignore, int sig, struct kernel_siginfo *info,
                 struct task_struct *task, int type, int result)
{
    printk("Signal %d generated for task %s (PID=%d), type=%d\n",
           sig, task->comm, task->pid, type);
}

static int __init my_module_init(void) {
    int ret;

    ret = register_trace_signal_generate(trace_signal_generate_handler, NULL);
    if (ret) {
        pr_err("Failed to register tracepoint: %d\n", ret);
        return ret;
    }

    return 0;
}

static void __exit my_module_exit(void) {
    unregister_trace_signal_generate(trace_signal_generate_handler, NULL);
    tracepoint_synchronize_unregister();
}

module_init(my_module_init);
module_exit(my_module_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("wangt13");
MODULE_DESCRIPTION("Signal Generate Tracepoint Hook Example");

When I compiled this module, I hit following error.

make: Entering directory '/usr/src/linux-headers-6.11.0-26-generic'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
  You are using:           gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0
  MODPOST /home/wangt/develop/kermod/Module.symvers
ERROR: modpost: "__tracepoint_signal_generate" [/home/wangt/develop/kermod/tracep_reg.ko] undefined!
make: *** [scripts/Makefile.modpost:145: /home/wangt/develop/kermod/Module.symvers] Error 1
make: *** [/usr/src/linux-headers-6.11.0-26-generic/Makefile:1879: modpost] Error 2
make: *** [Makefile:224: __sub-make] Error 2

I checked the /usr/src/linux-headers-6.11.0-26-generic/include/trace/events/signal.h, and found it is defined there.

36 /**
 37  * signal_generate - called when a signal is generated
 38  * @sig: signal number
 39  * @info: pointer to struct siginfo
 40  * @task: pointer to struct task_struct
 41  * @group: shared or private
 42  * @result: TRACE_SIGNAL_*
 43  *
 44  * Current process sends a 'sig' signal to 'task' process with
 45  * 'info' siginfo. If 'info' is SEND_SIG_NOINFO or SEND_SIG_PRIV,
 46  * 'info' is not a pointer and you can't access its field. Instead,
 47  * SEND_SIG_NOINFO means that si_code is SI_USER, and SEND_SIG_PRIV
 48  * means that si_code is SI_KERNEL.
 49  */
 50 TRACE_EVENT(signal_generate,
 51 
 52     TP_PROTO(int sig, struct kernel_siginfo *info, struct task_struct *task,
 53             int group, int result),
 54 
 55     TP_ARGS(sig, info, task, group, result),
 56 
 57     TP_STRUCT__entry(
 58         __field(    int,    sig         )
 59         __field(    int,    errno           )
 60         __field(    int,    code            )
 61         __array(    char,   comm,   TASK_COMM_LEN   )
 62         __field(    pid_t,  pid         )
 63         __field(    int,    group           )
 64         __field(    int,    result          )
 65     ),
.......

I don't know why there was the error in module compiling. I tried to hook another kernel tracepoint of kmalloc as follows,

ret = register_trace_kmalloc(probe_kmalloc, NULL);

and it was compiled and loaded without error. So how to fix the error of hooking probe to tracepoint of signal_generate, and what is the right way to hook my probe to the tracepoint of signal_generate from within kernel module?

wangt13 (631 rep)

May 24, 2025, 02:59 AM • Last activity: May 24, 2025, 12:38 PM

2 votes

1 answers

2379 views

How to get epoch time with trace-cmd (frontend for ftrace)?

linux-kernel timestamps ftrace

-32578 [001] **133767.868869**: kfree: call_site=ffffffff810e6f06 ptr=(nil) How to make `trace-cmd` report time in epoch time? The time by default is I think time since application uptime. This is a requirement to pinpoint exact time of the event. `trace-cmd list -O` options doesn't help either.

                                        -32578  **133767.868869**: kfree:            call_site=ffffffff810e6f06 ptr=(nil)

How to make trace-cmd report time in epoch time? The time by default is I think time since application uptime. This is a requirement to pinpoint exact time of the event.

trace-cmd list -O options doesn't help either.

themagicalyang (191 rep)

Dec 12, 2016, 05:10 AM • Last activity: May 2, 2025, 10:07 AM

0 votes

0 answers

12 views

Why does comm stay as the old name even after multiple task_rename and sched_process_exec events?

tracing ftrace

I am analyzing some trace-cmd (event) output filtered for some specific PID. The workflow is as follows (parse.py is just a custom python to parse for specific PIDs, $EVENTS_ARGS is just a list of the events to be traced): ``` trace-cmd record -b 10000 $EVENT_ARGS -o output_file -F -c executable_to_...

trace-cmd record -b 10000 $EVENT_ARGS -o output_file -F -c executable_to_trace
trace-cmd report outputfile > trace.txt
parse.py trace.txt

And parsing produces a list of events for the chosen PID(s). Here's a relevant excerpt:

-1342    3231.385036: sched_switch:         taskset:1342  D ==> swapper/4:0 
	sh-1342    3231.385357: task_rename:          pid=1342 oldcomm=taskset newcomm=launch_function oom_score_adj=0
	sh-1342    3231.385473: sched_switch:         launch_function:1342  D ==> swapper/4:0 
	sh-1342    3231.385660: sched_process_exec:   filename=/shared/loadgen/payload/launch_function.out pid=1342 old_pid=1342
	sh-1342    3231.385830: sched_switch:         launch_function:1342  D ==> swapper/4:0

However, even after these **task_rename** and **sched_process_exec** events, following events (*like sched_switch*) still log the comm field as sh, even though the actual process name should have changed. To clarify further, trace-cmd seems to use *ftrace*'s *record-cmd* option which matches to the *comm* I am seeing in the trace-cmd output, but shouldn't that handle *task_rename*s better? I know I can rewrite this during post-processing by replacing the comm with the latest task_rename, but I’m curious why the original trace still shows the outdated comm *6.5.9 kernel*

Panagiotis Stefanis (13 rep)

Apr 4, 2025, 01:03 PM

2 votes

0 answers

25 views

Task name shown as <...> in the output of EBPF program

linux-kernel kernel ebpf ftrace

I wrote a simple EBPF program which prints a message when the `execve` system call is invoked. I print the message using the `bpf_trace_printk` function. In the output, the task name for some processes is shown as ` `. For example, I ran the `sh` command and the output corresponding to that is the f...

I wrote a simple EBPF program which prints a message when the execve system call is invoked. I print the message using the bpf_trace_printk function. In the output, the task name for some processes is shown as `. For example, I ran the sh` command and the output corresponding to that is the following:

b'           -588130   ...21 512047.173196: bpf_trace_printk: Hello World!'

As you can see above, the task name is shown as ` instead of sh`. Why is that happening? How can I get the actual task name to be displayed?

russell.price (53 rep)

Jan 14, 2025, 10:32 PM

1 votes

0 answers

68 views

How to interpert hwlat trace results on an ARM Cortex-A7?

embedded arm real-time latency ftrace

I’m working with an STM32MP157D-DK1, trying to use the hwlat tracer for the board's arm Cortex-A7 CPU, to check for typical hardware latency. The following attempt was made to use the hwlat tracer on the board: ``` # mount -t debugfs debugfs /sys/kernel/debug # mount -t tracefs nodev /sys/kernel/tra...

# mount -t debugfs debugfs /sys/kernel/debug
# mount -t tracefs nodev /sys/kernel/tracing
# cd /sys/kernel/tracing
# cat hwlat_detector/window
1000000
# cat hwlat_detector/width
900000
# cat tracing_thresh
0
# cat tracing_max_latency
0
# cat tracing_cpumask
3
# cat hwlat_detector/mode
none [round-robin] per-cpu
# cat tracing_on
0
# cat current_tracer
nop
# echo hwlat > current_tracer

Assigning # echo hwlat > current_tracer would set tracing_thresh to 10, so:

# echo 0 > tracing_thresh
# cat tracing_thresh
0
# echo 1 > tracing_on
# sleep 1800
# cat trace


# tracer: hwlat
#
# entries-in-buffer/entries-written: 46/46   #P:2
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
           -159      d....   240.855853: #1     inner/outer(us):    0/2     ts:946699046.769930071 count:2
           -159      d....   241.871956: #2     inner/outer(us):    0/1     ts:946699047.786034280 count:2
           -159      d....   242.891945: #3     inner/outer(us):    0/1     ts:946699048.806022489 count:2
           -159      d....   243.911906: #4     inner/outer(us):    0/1     ts:946699049.825984865 count:2
           -159      d....   246.971967: #5     inner/outer(us):    0/1     ts:946699052.886044616 count:2
           -159      d....   252.071924: #6     inner/outer(us):    0/1     ts:946699057.986000660 count:2
           -159      d....   257.171915: #7     inner/outer(us):    0/1     ts:946699063.085992329 count:1
           -159      d....   272.471929: #8     inner/outer(us):    0/1     ts:946699078.386005920 count:2
           -159      d....   273.491923: #9     inner/outer(us):    0/1     ts:946699079.406001504 count:2
           -159      d....   277.571906: #10    inner/outer(us):    0/1     ts:946699083.485984589 count:2
           -159      d....   278.591918: #11    inner/outer(us):    0/1     ts:946699084.505996048 count:2
           -159      d....   318.371905: #12    inner/outer(us):    0/1     ts:946699124.285982817 count:2
           -159      d....   346.931903: #13    inner/outer(us):    0/1     ts:946699152.845980372 count:2
           -159      d....   393.851943: #14    inner/outer(us):    0/1     ts:946699199.766020728 count:2
           -159      d....   420.371906: #15    inner/outer(us):    0/1     ts:946699226.285984865 count:1
           -159      d....   487.691904: #16    inner/outer(us):    0/1     ts:946699293.605983314 count:2
           -159      d....   614.171939: #17    inner/outer(us):    0/1     ts:946699420.086017333 count:2
           -159      d....   646.811933: #18    inner/outer(us):    0/1     ts:946699452.726012015 count:1
           -159      d....   650.891904: #19    inner/outer(us):    0/1     ts:946699456.805984600 count:2
           -159      d....   705.971922: #20    inner/outer(us):    0/1     ts:946699511.886000877 count:2
           -159      d....   708.011907: #21    inner/outer(us):    0/1     ts:946699513.925987669 count:1
           -159      d....   763.091907: #22    inner/outer(us):    0/1     ts:946699569.005986862 count:2
           -159      d....   824.291904: #23    inner/outer(us):    0/1     ts:946699630.205985225 count:2
           -159      d....   893.651899: #24    inner/outer(us):    0/1     ts:946699699.565979716 count:2
           -159      d....   895.691904: #25    inner/outer(us):    0/1     ts:946699701.605985717 count:2
           -159      d....   912.011931: #26    inner/outer(us):    0/1     ts:946699717.926011266 count:2
           -159      d....   913.031903: #27    inner/outer(us):    0/1     ts:946699718.945983725 count:2
           -159      d....   930.371903: #28    inner/outer(us):    0/1     ts:946699736.285984275 count:2
           -159      d....   934.451904: #29    inner/outer(us):    0/1     ts:946699740.365986111 count:2
           -159      d....  1034.411903: #30    inner/outer(us):    0/1     ts:946699840.325984200 count:2
           -159      d....  1049.711894: #31    inner/outer(us):    0/1     ts:946699855.625976207 count:1
           -159      d....  1211.891897: #32    inner/outer(us):    0/1     ts:946700017.805979076 count:2
           -159      d....  1222.091906: #33    inner/outer(us):    0/1     ts:946700028.005987831 count:2
           -159      d....  1336.331903: #34    inner/outer(us):    0/1     ts:946700142.245984969 count:2
           -159      d....  1338.371900: #35    inner/outer(us):    0/1     ts:946700144.285983803 count:2
           -159      d....  1356.731931: #36    inner/outer(us):    0/1     ts:946700162.646014437 count:2
           -159      d....  1477.091900: #37    inner/outer(us):    0/1     ts:946700283.005983161 count:2
           -159      d....  1514.831892: #38    inner/outer(us):    0/1     ts:946700320.745975929 count:2
           -159      d....  1530.131903: #39    inner/outer(us):    0/1     ts:946700336.045985853 count:2
           -159      d....  1533.191895: #40    inner/outer(us):    0/1     ts:946700339.105978979 count:1
           -159      d....  1534.211901: #41    inner/outer(us):    0/1     ts:946700340.125985563 count:2
           -159      d....  1571.951896: #42    inner/outer(us):    0/1     ts:946700377.865979289 count:2
           -159      d....  1639.271892: #43    inner/outer(us):    0/1     ts:946700445.185976197 count:2
           -159      d....  1664.771895: #44    inner/outer(us):    0/1     ts:946700470.685979917 count:2
           -159      d....  1679.051899: #45    inner/outer(us):    0/1     ts:946700484.965983799 count:1
           -159      d....  1768.811900: #46    inner/outer(us):    0/1     ts:946700574.725985300 count:2

After a half an hour sleep, this is what is logged to file trace. I'm trying to understand this result. Trying to take assistance in Steven Rosetedt's walkthrough of the hwlat tracer output . In another 20 minutes test I ran with similar conditions, there also appeared a line where numerous encounters of a 3 millisecond latency occurred, rather than the frequent 0, 1 or 2 microsec. It looks like this (see #16's count):

# tracer: hwlat
#
# entries-in-buffer/entries-written: 20/20   #P:2
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
           -169      d....   328.696323: #1     inner/outer(us):    0/2     ts:946685131.612004155 count:2
           -169      d....   329.711962: #2     inner/outer(us):    0/1     ts:946685132.627644489 count:2
           -169      d....   332.771947: #3     inner/outer(us):    0/1     ts:946685135.687627240 count:2
           -169      d....   333.791924: #4     inner/outer(us):    0/1     ts:946685136.707605282 count:2
           -169      d....   334.811973: #5     inner/outer(us):    0/1     ts:946685137.727653991 count:2
           -169      d....   335.831926: #6     inner/outer(us):    0/1     ts:946685138.747608742 count:2
           -169      d....   336.851921: #7     inner/outer(us):    0/1     ts:946685139.767602242 count:2
           -169      d....   345.011927: #8     inner/outer(us):    0/1     ts:946685147.927608371 count:2
           -169      d....   365.411950: #9     inner/outer(us):    0/1     ts:946685168.327631214 count:2
           -169      d....   455.171927: #10    inner/outer(us):    0/1     ts:946685258.087609424 count:2
           -169      d....   471.491916: #11    inner/outer(us):    0/1     ts:946685274.407599140 count:2
           -169      d....   616.331919: #12    inner/outer(us):    0/1     ts:946685419.247601959 count:2
           -169      d....   805.031908: #13    inner/outer(us):    0/1     ts:946685607.947590632 count:2
           -169      d....  1285.451914: #14    inner/outer(us):    0/1     ts:946686088.367599861 count:2
           -169      d....  1324.211913: #15    inner/outer(us):    0/1     ts:946686127.127599213 count:2
           -169      d....  1540.451925: #16    inner/outer(us):    1/3     ts:946686344.024041733 count:457
           -169      d....  1541.471970: #17    inner/outer(us):    0/2     ts:946686344.387656150 count:2
           -169      d....  1546.571943: #18    inner/outer(us):    0/1     ts:946686349.487628111 count:2
           -169      d....  1547.591934: #19    inner/outer(us):    0/2     ts:946686350.507622153 count:2
           -169      d....  1548.611929: #20    inner/outer(us):    0/1     ts:946686351.527615528 count:2

Since then I have made more half an hour/whole hour tests, their results are similar to those of the first log file shown. Please note that according to the hardware latency detector documentation : > The tracer hwlat_detector is a special purpose tracer that is used to > detect large system latencies induced by the behavior of certain > underlying hardware or firmware, independent of Linux itself. The code > was developed originally to detect SMIs (System Management Interrupts) > on x86 systems, **however there is nothing x86 specific about this > patchset.** I'm assuming that the reasons that x86 processors need SMIs (temperature management, ...) are inevitably present on arm cores too. How should these results be interpreted? \ Hardware latency from the Cortex-A7 doesn't exceed 3 microseconds?\ Is there something wrong with what I’m doing?

user656857 (11 rep)

Oct 12, 2024, 11:58 AM • Last activity: Oct 17, 2024, 11:24 AM

4 votes

1 answers

400 views

When (and how) are file permission/access decisions made in kernel space?

linux permissions kernel strace ftrace

(I think) I have a relatively good understanding of how Linux permissions work - traditional Unix DAC permissions, and how they're represented, security context in relation to processes (i.e. cred struct, capabilities, etc.), hooks for LSMs, etc., and the link they have to (E)UID/E(GID), etc. What I...

(via

-cmd

), is when these checks are actually done (presumably) in kernel space, and at *which point* the access decision is actually made. To elaborate: I have a root-owned directory

/dir

chmod-ed to

-x------

(*500*). As an unprivileged user, I run

/bin/ls /dir

and receive a permission denied error, as expected. If I run

-s 10000 /bin/ls /dir

I can see the

syscall, which returns a

-1

back to user space, and sets

(i.e. "permission denied"). There's then a subsequent

syscall to print the error message to STDERR, all as expected. At no point here do I ever see

syscalls being called. If I run the same test using trace-cmd, as

-cmd record -p function_graph -F /bin/ls

with the same unprivileged user doing

/bin/ls /dir

in a separate terminal tab (where

-cmd

will trigger on the execution of the

binary), I can basically map most of the syscalls I'm seeing, one-to-one, between strace and trace-cmd outputs. However, the discrepancy is that, in the trace-cmd output after the

syscall, rather than seeing the write outputs and it exiting, I can see

calls being made (and the full call tree of iterating over items in the directory, etc.). Because of limitations with ftrace, I guess (and more likely limitations in my understanding), I'm not seeing syscall arguments or returns in the output, so I can't see when, if at all, I'm getting permission errors, but from my limited understanding, the kernel appears to be actually doing the full directory listing, but just not returning the output back to user space. So can anyone explain what's actually happening - why

-cmd

shows the kernel actually doing the

syscalls, while

exits after the open call fails? Most of the forum posts (on here and elsewhere) with questions pertaining to permissions suggest that checks are done on opening the file, but provide little more detail than that, though this aligns with what I can see with

(but not

-cmd

). Looking at the kernel source, and following that through manually from the

syscalls, I can see functions call sequences such as

-> do_inode_permission -> generic_permission -> acl_permission_check

with the latter actually doing comparisons on the classic Unix permissions, and returning from this. I can also see the hooks elsewhere for the LSMs, like SELinux, Apparmor, SMACK, Tomoyo, etc., which I guess will do their bits. My assumption would be that the directory is attempted to be opened, permission checks would happen, kernel says no, and returns the error back to user space, and it would never get to the point of actually doing the directory listing. However, based on the

-cmd

output appearing to actually show the directory listing happening, I'm not sure any more. Any information - in as much technical depth as possible - would be greatly appreciated. Also, I understand that using Systemtap may be able to offer more information to me, but my knowledge of that is *very* limited! For awareness, my testing has been on Kernel versions 2.6.35 and 6.5.0 on Debian-based distros, with comparatively similar results.

genericuser99 (119 rep)

Mar 26, 2024, 12:39 PM • Last activity: Mar 26, 2024, 06:23 PM

2 votes

0 answers

180 views

What's the difference between ftrace and blktrace tracepoints?

linux-kernel c ftrace

For instance, we have `/sys/kernel/debug/tracing/events/block/block_bio_complete` and `block_rq_complete` which is basically IO completion trace points for `ftrace`, I believe. Using `sudo blktrace -d /dev/sdb1 -o - | blkparse -i -` we might see a line like this: 8,33 1 3 0.001086286 0 C RA 11581378...

                                  For instance, we have /sys/kernel/debug/tracing/events/block/block_bio_complete and block_rq_complete which is basically IO completion trace points for ftrace, I believe. Using sudo blktrace -d /dev/sdb1 -o - | blkparse -i - we might see a line like this:

	  8,33   1        3     0.001086286     0  C  RA 1158137880 + 8 
                                               ^ IO completion action

The C here is the IO completion event according to the documentation but is this the same as block_bio_complete or is it somehow different. Checking the source code, I can see at https://elixir.bootlin.com/linux/v5.6/source/block/blk-core.c#L1441 :

    trace_block_rq_complete(req, blk_status_to_errno(error), nr_bytes);

And at https://elixir.bootlin.com/linux/v5.6/source/block/bio.c#L1863 :

    trace_block_bio_complete(bio->bi_disk->queue, bio, blk_status_to_errno(bio->bi_status));

Which of the two is the C action? Or does blktrace use a separate set of actions - searches tell me ftrace and blktrace use different reporting mechanisms, but I cannot work out how each tracepoint for both mechanisms relate to kernel source code. Is there an easy way to map tracepoints to source code?

Ken Y-N (251 rep)

Mar 31, 2023, 01:20 AM

5 votes

2 answers

852 views

Change format of syscall event trace output to ftrace

linux debugging system-calls strace ftrace

I enabled ftrace event tracing for sys_enter_openat syscall. The respective output format given at events/syscalls/sys_enter_openat/format is print fmt: "dfd: 0x%08lx, filename: 0x%08lx, flags: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->filename)), ((unsigned long)(R...

                                  I enabled ftrace event tracing for sys_enter_openat syscall. The respective output format given at events/syscalls/sys_enter_openat/format is

    print fmt: "dfd: 0x%08lx, filename: 0x%08lx, flags: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->filename)), ((unsigned long)(REC->flags)), ((unsigned long)(REC->mode))

As expected a sample output line to ftrace is something like

    msm_irqbalance-1338   ...1 211710.033931: sys_openat(dfd: ffffff9c, filename: 5af693f224, flags: 2, mode: 0)

Is there a way to change output format such that filename: 5af693f224 can be shown as filename:  instead of hex(5af693f224)?

So basically is there a way to change output format while tracing a particular event(eg. sys_enter_openat above) to ftrace.

I guess this would have been possible using systemtap or krpobe but my setup does not allow its use as of now.

Madhur Rawat (151 rep)

Apr 15, 2019, 06:41 AM • Last activity: Dec 16, 2021, 08:29 PM

1 votes

0 answers

41 views

How to get the cpu number of a trace event in a custom probe function?

linux kernel linux-kernel ftrace

I am registering a probe function for trace events using a Kernel module. For eg the probe function for the trace event sched_kthread_stop would be: void handle_kthread_stop(void *data, struct task_struct *t) And register this using tracepoint_probe_register() Now an actual trace in the trace buffer...

                                  I am registering a probe function for trace events using a Kernel module.
For eg the probe function for the trace event sched_kthread_stop would be:

    void handle_kthread_stop(void *data, struct task_struct *t)

And register this using 

    tracepoint_probe_register()

Now an actual trace in the trace buffer looks like this:

    bash-9002   d..5  2145.303974: sched_wakeup: comm=kworker/u8:1 pid=46 prio=120 target_cpu=003

Here  is the CPU ID.
I want to get this CPU ID info in the probe function.

Can someone tell me how to get the CPU ID in the probe function?

Madhuparna Bhowmik (163 rep)

Apr 19, 2020, 07:02 PM • Last activity: Apr 21, 2020, 06:16 AM

2 votes

1 answers

193 views

Multiple Tracing Events Readers

linux fifo tracing ftrace rasdaemon

**Question** Does the tracing pipes in Linux, for tracing events, accepts simultaneous readers? **Context** I am trying to get traces from the ```RAS``` subsystem while running some benchmarks. This subsystem reports hardware error events using trace events. The tool I am using to get the events in...

**Question** Does the tracing pipes in Linux, for tracing events, accepts simultaneous readers? **Context** I am trying to get traces from the

subsystem while running some benchmarks. This subsystem reports hardware error events using trace events. The tool I am using to get the events in the **RASdaemon** tool. I know the tool subscribe to the

FIFO on each cpu. Furthermore, the benchmark subscribe to performance events which, **I believe that**, are also advertised through the same pipes. The

works like the

which has the following description: > The output is the same as the “trace” file but this file is meant to be streamed with live tracing. Reads from this file will block until new data is retrieved. Unlike the “trace” file, this file is a consumer. This means reading from this file causes sequential reads to display more current data. Once data is read from this file, it is consumed, and will not be read again with a sequential read. The “trace” file is static, and if the tracer is not adding more data, it will display the same information every time it is read. Unlike the “trace” file, opening this file for reading will not temporarily disable tracing. I understand the concept of FIFO being a consumer file and why **each read will produce different data**. However, do the Kernel make any arrangements for multiple subscribers (reader) to these events?

Alexandre Santana (21 rep)

Jan 27, 2020, 02:52 PM • Last activity: Jan 31, 2020, 11:34 AM

2 votes

1 answers

240 views

How can I send dmesg printout to ftrace subsystem?

linux kernel dmesg ftrace

How can I send `dmesg` printout to `ftrace` subsystem? I like correlate the `dmesg` msg with the functions call `graph` in `ftrace`. Thanks

                                  How can I send dmesg printout to ftrace subsystem?

I like correlate the dmesg msg with the functions call graph in ftrace. 

Thanks

tony-p-lee (141 rep)

Aug 7, 2019, 01:00 AM • Last activity: Aug 21, 2019, 11:45 PM

3 votes

0 answers

332 views

Linux task state different in /proc/pid/stat compared to Ftrace output of sched events

linux proc scheduling multithreading ftrace

I'm trying to trace the flow of a piece of code just to understand the points where the tasks go into D/S/R states (uninterruptible/interruptible/running). I do understand the meaning of each state, however doing an exercise shows confusing results and I wasn't able till now to figure out why. The p...

                                  I'm trying to trace the flow of a piece of code just to understand the points where the tasks go into D/S/R states (uninterruptible/interruptible/running). I do understand the meaning of each state, however doing an exercise shows confusing results and I wasn't able till now to figure out why.

The piece of code I'm running:


    void foo() {

        pthread_setname_np(pthread_self(), "lock_func/1");

        while(quit == false)
        {
            for(int64_t idx=0; idx lock(mutex1);
        while(!flag)
            cv1.wait(lock);
        flag = false;
    }
  
    int main(int argc, char *argv[]) {

        ...
        std::thread t1 = std::thread(foo);
        std::thread t2 = std::thread(bar);
        ...
        if (mlockall( MCL_CURRENT | MCL_FUTURE ) -0      d...       335998: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
     lock_func/1-30296  d...       335999: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
          -0      d...       336000: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
     lock_func/1-30296  d...       336001: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
          -0      d...       336002: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
     lock_func/1-30296  d...       336003: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
          -0      d...       336004: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
     lock_func/1-30296  d...       336005: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
          -0      d...       336006: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
     lock_func/1-30296  d...       336007: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
          -0      d...       336008: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
     lock_func/1-30296  d...       336009: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
          -0      d...       336010: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120
     lock_func/1-30296  d...       336011: sched_switch: prev_comm=lock_func/1 prev_pid=30296 prev_prio=120 prev_state=D ==> next_comm=swapper/0 next_pid=0 next_prio=120
          -0      d...       336012: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=lock_func/1 next_pid=30296 next_prio=120


Observe how prev_state for my lock_func1 (bar) is always **D** (uninterruptible). Looking at the code a context switch can happen only on the condition variable sleep, or when time slice is finished (running in CFS scheduler)

    cat /proc/30295/task/30296/wchan
    futex_wait_queue_me

which according to the docs is a interruptible function.
Checking multiple times the task state in proc shows always **S**:

    cat /proc/30295/task/30296/stat
    30295 (lock_func/1) S ...

So the stat shows S-interruptible state, the **futex_wait_queue_me** is interruptible, however, the ftrace show always at the context switch time **prev_state=D**. Running this code and recording several seconds with ftrace shows only **prev_state=D** for my lock_func/1 (bar). I never get the S-interruptible.

Can someone explain why? Is the scheduler doing some state change in between?

Thanks. 
                                

mdaniel (151 rep)

Aug 7, 2019, 08:27 AM

1 votes

1 answers

387 views

How writing in debugfs controls ftrace?

linux-kernel ftrace

Ftrace is controlled by manipulating the debug fs files. But how is it possible ? How kernel be aware of writing in this files and start its required actions?for exmple if 1 is written to tracing_on , kernel turns on tracing . Is it something related to a special file structure or kernel subscribes...

                                  Ftrace is controlled by manipulating the debug fs files. But how is it possible ? How kernel be aware of writing in this files and start its required actions?for exmple if 1 is written to tracing_on , kernel turns on tracing .
Is it something related to a special file structure or kernel subscribes to changes of debugfs or I'm missing something obvious ? 
                                

fjahan (11 rep)

May 26, 2019, 01:51 PM • Last activity: May 26, 2019, 06:15 PM

0 votes

1 answers

242 views

functions folder in trace_stat folder of ftrace

linux linux-kernel debugging profiling ftrace

I am learning ftrace, I read from the following link that I can enable function profiling using the 'function_profile_enabled' file. mount -t tracefs nodev /sys/kernel/tracing cd /sys/kernel/tracing echo 1 > function_profile_enabled echo "function" > current_tracer https://www.kernel.org/doc/html/v4...

                                  I  am learning ftrace, I read from the following link that I can enable function profiling using the 'function_profile_enabled' file.

    mount -t tracefs nodev /sys/kernel/tracing
    cd /sys/kernel/tracing
    echo 1 > function_profile_enabled
    echo "function" > current_tracer

https://www.kernel.org/doc/html/v4.18/trace/ftrace.html 

The above link tells that the information is present in trace_stats/function (function0, function1..)

    $ ls trace_stat/function*
    trace_stat/function0   trace_stat/function15  trace_stat/function21  trace_stat/function28  trace_stat/function6
    trace_stat/function1   trace_stat/function16  trace_stat/function22  trace_stat/function29  trace_stat/function7
    trace_stat/function10  trace_stat/function17  trace_stat/function23  trace_stat/function3   trace_stat/function8
    trace_stat/function11  trace_stat/function18  trace_stat/function24  trace_stat/function30  trace_stat/function9
    trace_stat/function12  trace_stat/function19  trace_stat/function25  trace_stat/function31
    trace_stat/function13  trace_stat/function2   trace_stat/function26  trace_stat/function4
    trace_stat/function14  trace_stat/function20  trace_stat/function27  trace_stat/function5
    root tracing

I only have 4 cores on my Ubuntu VM

    $ grep -c ^processor /proc/cpuinfo     
    4

Can anyone explain me the actual significance of this file function in trace_stat folder
functions folder in trace_stat folder
                                

md.jamal (301 rep)

Jan 26, 2019, 09:08 AM • Last activity: May 23, 2019, 05:43 PM

0 votes

1 answers

172 views

Performance reduction activating trace-functionality in kernel

linux-kernel performance trace ftrace

Is there any performance reduction to be expected from just activating trace (ftrace) functionality in the Linux kernel, but not using it at run time?

                                  Is there any performance reduction to be expected from just activating trace (ftrace) functionality in the Linux kernel, but not using it at run time?
                                

arash javan (381 rep)

Mar 1, 2019, 08:35 AM • Last activity: May 23, 2019, 05:40 PM

1 votes

0 answers

930 views

how to use ftrace to function_trace a kernel modules interrupt?

kernel kernel-modules interrupt debugfs ftrace

I have a Banana Pi Arm device with 2 cores (1.2 GHz) running a Debian kernel which I enabled ftrace kernel debug options. I want to determine the amount of time spent in a kernel module function. The function is of the form static irq_handler_t vtgpio_irq_handler(unsigned int irq, void *dev_id, stru...

                                  I have a Banana Pi Arm device with 2 cores (1.2 GHz) running a Debian kernel which I enabled ftrace kernel debug options. I want to determine the amount of time spent in a kernel module function. The function is of the form 

    static irq_handler_t vtgpio_irq_handler(unsigned int irq, void *dev_id, struct pt_regs *regs);

In my debug file system after I load the module I see the name exposed as a filter. I am tracing all functions. I tried to enable the trace_printk statements but I can not see it printed in my trace file. Here is the implementation of the function:

    /** @brief handler for rising signal */
    static irq_handler_t vtgpio_irq_handler(unsigned int irq, void *dev_id, struct pt_regs *regs) {
      trace_printk(KERN_INFO "rise\n");
      pause();
      return (irq_handler_t) IRQ_HANDLED; // return that we all good                                                                                                                               
    }

I know that the function is being called because I see the printk statement in dmesg | tail if I replace trace_printk with printk. Also I know that some of the modules functions are traceable like writing to sysfs store functions. For example I can see 

    static struct kobj_attribute mode_attr = __ATTR(mode, 0660, mode_show, mode_store);

The ftrace from writing to mode_store.

    1)               |  sys_write() {
    1)   0.708 us    |    fget_light();
    1)               |    vfs_write() {
    1)   0.875 us    |      rw_verify_area();
    1)               |      sysfs_write_file() {
    1)   0.459 us    |        mutex_lock();
    1)               |        get_zeroed_page() {
    1)               |          __get_free_pages() {
    1)               |            __alloc_pages_nodemask() {
    1)   0.708 us    |              next_zones_zonelist();
    1)               |              get_page_from_freelist() {
    1)   0.458 us    |                next_zones_zonelist();
    1)   1.000 us    |                __zone_watermark_ok();
    1)               |                kmap_atomic() {
    1)   0.625 us    |                  add_preempt_count();
    1)   0.625 us    |                  page_address();
    1)   9.250 us    |                }
    1)               |                __kunmap_atomic() {
    1)   0.750 us    |                  sub_preempt_count();
    1)   5.459 us    |                }
    1) + 41.000 us   |              }
    1) + 53.209 us   |            }
    1)   0.542 us    |            page_address();
    1) + 62.541 us   |          }
    1) + 66.958 us   |        }
    1)   0.917 us    |        sysfs_get_active();
    1)               |        mode_store() {
    1)               |          /* VT-GPIO_TEST: resume */
    1)               |          gpio_direction_output() {
    1)               |            _raw_spin_lock_irqsave() {
    1)               |              __raw_spin_lock_irqsave() {
    1)   0.834 us    |                add_preempt_count();
    1)   5.375 us    |              }
    1)   9.458 us    |            }
    1)   0.834 us    |            gpio_ensure_requested();
    1)               |            _raw_spin_unlock_irqrestore() {
    1)   0.708 us    |              sub_preempt_count();
    1)   5.125 us    |            }
    1)               |            sunxi_gpio_direction_out() {
    1)   1.125 us    |              gpio_set_one_pin_io_status();
    1)   1.167 us    |              gpio_write_one_pin_value();
    1) + 12.333 us   |            }
    1) + 47.250 us   |          }
    1)               |          gpio_direction_input() {
    1)               |            _raw_spin_lock_irqsave() {
    1)               |              __raw_spin_lock_irqsave() {
    1)   0.625 us    |                add_preempt_count();
    1)   5.167 us    |              }
    1)   9.667 us    |            }
    1)   0.583 us    |            gpio_ensure_requested();
    1)               |            _raw_spin_unlock_irqrestore() {
    1)   0.708 us    |              sub_preempt_count();
    1)   5.166 us    |            }
    1)               |            sunxi_gpio_direction_in() {
    1)   0.667 us    |              gpio_set_one_pin_io_status();
    1)   5.208 us    |            }
    1) + 38.833 us   |          }
    1) ! 105.833 us  |        }
    1)   0.875 us    |        sysfs_put_active();
    1)   0.875 us    |        mutex_unlock();
    1) ! 204.500 us  |      }
    1)   0.416 us    |      __fsnotify_parent();
    1)               |      fsnotify() {
    1)               |        __srcu_read_lock() {
    1)   0.583 us    |          add_preempt_count();
    1)   0.667 us    |          sub_preempt_count();
    1)   9.542 us    |        }
    1)               |        __srcu_read_unlock() {
    1)   0.584 us    |          add_preempt_count();
    1)   0.667 us    |          sub_preempt_count();
    1)   9.041 us    |        }
    1) + 30.125 us   |      }
    1) ! 252.708 us  |    }
    1) ! 262.292 us  |  }
   
The mode_store in this trace actually writes HIGH or LOW to a pin that the IRQ handler is registered to.  Any tips will be very helpful, I am trying to measure the overhead of the IRQ handler.
                                

channon (193 rep)

Jul 24, 2018, 08:28 PM • Last activity: Jul 27, 2018, 05:28 PM

2 votes

0 answers

643 views

Tool for displaying timeline view of kernel/userspace execution trace

linux kernel tracing ftrace

I've used the ftrace tool **trace-cmd** for recording scheduling tracepoints and syscall tracepoints. I can inspect them visually using the [kernelshark][1] GUI program. It dislays per-core timelines and per-task timelines as seen in the screenshot below. [![enter image description here][2]][2] I ca...

                                  I've used the ftrace tool **trace-cmd** for recording scheduling tracepoints and syscall tracepoints. I can inspect them visually using the kernelshark  GUI program. It dislays per-core timelines and per-task timelines as seen in the screenshot below.

I can see when interrupts are triggered, when the scheduler is invoked and when system calls are being made and when they return. But when a system call is being made, kernelshark makes no distinction between the task running in userspace and it running in kernelspace.

I want a tool that allows me to see for a specific core when it is executing an interrupt handler; the currently traced process in userspace; or some task in userspace. Something similar to this picture:

Does anyone know of a tool for transforming ftrace data into that sort of format, be it as an svg file, a chrome trace-file or using any other format or tool?

Daniel Näslund (1932 rep)

Feb 13, 2018, 04:22 PM

1 votes

0 answers

128 views

PMU and tracepoints

performance monitoring cpu ftrace

I am searching for information about the mechanism of creation of tracepoints. From this topic [https://unix.stackexchange.com/questions/326621/what-are-kernel-pmu-event-s-in-perf-events-list][1] and this blog [brendangregg][2] I've got that tracepoints are some macros inside the kernel code which a...

                                  I am searching for information about the mechanism of creation of tracepoints. From this topic https://unix.stackexchange.com/questions/326621/what-are-kernel-pmu-event-s-in-perf-events-list  and this blog brendangregg  I've got that tracepoints are some macros inside the kernel code which are called when they are enabled, but I'm a little bit confused how tracepoint connected with PMU (because it mentioned in both links)? 

                                

Igor Masternoy (111 rep)

Jan 20, 2018, 03:42 PM

1 votes

0 answers

386 views

Function profiling on multi-core platform for specific PID using ftrace

linux-kernel strace trace ftrace

I'm trying to use ftrace for function profiling based on function_profile_enabled as described [here][1]. I would like to capture function calls only for a certain PID using set_ftrace_pid as described in the same link. However, when I set PID and do function_profile_enabled it still captures functi...

                                  I'm trying to use ftrace for function profiling based on function_profile_enabled as described here . I would like to capture function calls only for a certain PID using set_ftrace_pid as described in the same link. However, when I set PID and do function_profile_enabled it still captures function calls on all CPUs. So, considering context-switching it is really hard to filter out unwanted CPUs from /sys/kernel/debug/tracing/trace_stat/function*. The function_graph feature works with set_ftrace_pid just fine. Brandan Gregg had the same problem as found here , but I'm using 4.4 kernel, which is generations after.

                                

Dimon (131 rep)

Oct 23, 2017, 11:32 PM

Showing page 1 of 20 total questions