Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

2 votes

1 answers

3670 views

linux bpftool not found in linux-tools-common for kernel 4.19.232

I am trying to check if bpf is properly installed in my linux kernel. It's enabled in the kernel as shown:

jakew@desktop:~$ cat config | grep BPF
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT_ALWAYS_ON=y
# CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
CONFIG_IPV6_SEG6_BPF=y
CONFIG_NETFILTER_XT_MATCH_BPF=y
CONFIG_BPFILTER=y
CONFIG_BPFILTER_UMH=m
CONFIG_NET_CLS_BPF=y
CONFIG_NET_ACT_BPF=y
CONFIG_BPF_JIT=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
# CONFIG_TEST_BPF is not set
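The same check can be done programmatically. The following is a minimal sketch in C, assuming the kernel config text has already been read into memory; `config_enabled` is a hypothetical helper name, not part of any existing tool.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if `option` (for example
 * "CONFIG_BPF_SYSCALL") is set to y or m in the given kernel config text.
 * Lines such as "# CONFIG_TEST_BPF is not set" do not match, because they
 * do not start with the option name followed by '='. */
bool config_enabled(const char *config_text, const char *option)
{
    size_t opt_len = strlen(option);
    const char *line = config_text;

    while (line && *line) {
        /* Require an exact name match: "CONFIG_BPF" must not match
         * "CONFIG_BPFILTER=y". */
        if (strncmp(line, option, opt_len) == 0 && line[opt_len] == '=') {
            char value = line[opt_len + 1];
            return value == 'y' || value == 'm';
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return false;
}
```

For BPF programs to be loadable at all, `CONFIG_BPF_SYSCALL` is the option to check.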

I tried to install bpftool to inspect bpf, and it seems I need to install it via linux-tools-common with: sudo apt install linux-tools-common. When I tried to run bpftool, it showed this error:

WARNING: bpftool not found for kernel 4.19.232

  You may need to install the following packages for this specific kernel:
    linux-tools-4.19.232-4.19.232
    linux-cloud-tools-4.19.232-4.19.232

  You may also want to install one of the following packages to keep up to date:
    linux-tools-4.19.232
    linux-cloud-tools-4.19.232

However, sudo apt install linux-tools-4.19.232 failed with an error saying the package was not found. How can I get bpftool to work so that I can explore bpf further? Thanks

jake wong (141 rep)

Dec 4, 2022, 07:37 PM • Last activity: May 31, 2025, 08:02 PM

2 votes

1 answers

50 views

How are network stack modifications tested?

networking linux-kernel virtual-machine ebpf

I have the task of developing a modification (using eBPF) of the TCP stack of the Linux kernel, and I need to test its interoperability with non-modified kernels. Specifically, the eBPF program should be able to inspect some TCP header information, and possibly retransmit the packet to a different IP. How are such modifications usually tested?

So far I've considered:

1. Using one VM for the modified version, and one VM for the unmodified, and then using mininet to connect them and create various testing scenarios.
2. Using containernet (instead of mininet) to load containers with the eBPF program, but attempt to create a runtime flag in the eBPF program so that it will only act in one of the containers.
3. Use separate hardware, and create a physical testbed (which I find undesirable).

How do those of you that work on such projects usually test your changes?
                                

Joppiedoppie (23 rep)

May 30, 2025, 01:27 PM • Last activity: May 30, 2025, 02:00 PM

1 votes

0 answers

26 views

What does the phrase "consider native interface" refer to when the nftables wiki says that xt_bpf match is unsupported

iptables nftables ebpf

In [this](https://wiki.nftables.org/wiki-nftables/index.php/Supported_features_compared_to_xtables) list of unsupported xtables features, xt_bpf is listed as one of the unsupported features. The comment says to "consider native interface". But what interface is being referred to here? This is the only mention of bpf in the entire nftables wiki.
                                

Philippe (569 rep)

Mar 21, 2025, 01:19 PM

2 votes

0 answers

25 views

Task name shown as <...> in the output of EBPF program

linux-kernel kernel ebpf ftrace

I wrote a simple eBPF program which prints a message when the `execve` system call is invoked. I print the message using the `bpf_trace_printk` function. In the output, the task name for some processes is shown as `<...>`. For example, I ran the `sh` command and the output corresponding to that is the following:

b'           <...>-588130   ...21 512047.173196: bpf_trace_printk: Hello World!'

As you can see above, the task name is shown as `<...>` instead of `sh`. Why is that happening? How can I get the actual task name to be displayed?
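For context, the trace output resolves PIDs to task names through a small cache of recently seen commands (exposed as `saved_cmdlines` in tracefs), and prints `<...>` when a PID is not in it. The sketch below is a user-space imitation of that behaviour only, assuming a simple round-robin cache; the struct, function names, and `CACHE_SLOTS` value are hypothetical and do not mirror the kernel's actual data structure.

```c
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 4   /* hypothetical size; the real cache is tunable */
#define COMM_LEN    16  /* task names are limited to 16 bytes, incl. NUL */

struct comm_cache {
    int  pid[CACHE_SLOTS];
    char comm[CACHE_SLOTS][COMM_LEN];
    int  next;          /* next slot to overwrite */
};

void cache_init(struct comm_cache *c)
{
    memset(c, 0, sizeof(*c));
    for (int i = 0; i < CACHE_SLOTS; i++)
        c->pid[i] = -1; /* mark every slot as empty */
}

/* Record a pid -> name mapping, overwriting the oldest slot. */
void cache_save(struct comm_cache *c, int pid, const char *comm)
{
    int slot = c->next;

    c->next = (c->next + 1) % CACHE_SLOTS;
    c->pid[slot] = pid;
    snprintf(c->comm[slot], COMM_LEN, "%s", comm);
}

/* Look up a name; a miss yields the "<...>" placeholder. */
const char *cache_lookup(const struct comm_cache *c, int pid)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->pid[i] == pid)
            return c->comm[i];
    return "<...>";
}
```

A program that needs the name reliably can fetch it itself with the `bpf_get_current_comm` helper and include it in the message, rather than relying on the trace prefix.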

russell.price (53 rep)

Jan 14, 2025, 10:32 PM

0 votes

1 answers

51 views

eBPF `bpf_core_read` returns incorrect value

linux-kernel ebpf

As @andy-dalton suggested, I changed the type of err and initialized it, but it still outputs the same results. The modified code:

SEC("sockops")
int bpf_sockops_cb(struct bpf_sock_ops *skops) {
    u32 op = 0;

    long err;

    err = bpf_core_read(&op, sizeof(op), &skops->op);
    if (err) {
        bpf_printk("err code %ld\n", err);
    }

    bpf_printk("op1 = %u, op2 = %u \n", op, skops->op);

    return 0;
}

The outputs:

curl-286392   ....1 44631.729219: bpf_trace_printk: op1 = 3245625024, op2 = 3
curl-286392   ....1 44631.729223: bpf_trace_printk: op1 = 3245625024, op2 = 2
curl-286392   ....1 44631.729223: bpf_trace_printk: op1 = 3245625024, op2 = 1
curl-286392   ....1 44631.729224: bpf_trace_printk: op1 = 3245625024, op2 = 6

----------

My eBPF program is as follows.

#include "vmlinux.h"
#include 
#include 
#include 
#include 
#include "common.h"

char __license[] SEC("license") = "GPL";

SEC("sockops")
int bpf_sockops_cb(struct bpf_sock_ops *skops) {
    u32 op;

    int err;

    err = bpf_core_read(&op, sizeof(op), &skops->op);
    if (err) {
        bpf_printk("err\n");
    }

    bpf_printk("op1 = %u, op2 = %u \n", op, skops->op);

    return 0;
}

When I check the tracing output by cat /sys/kernel/debug/tracing/trace_pipe, it shows that op1 != op2.

curl-1466326  ....1 241519.609044: bpf_trace_printk: op1 = 3913290112, op2 = 3
curl-1466326  ....1 241519.609048: bpf_trace_printk: op1 = 3913290112, op2 = 2
curl-1466326  ....1 241519.609048: bpf_trace_printk: op1 = 3913290112, op2 = 1
curl-1466326  ....1 241519.609049: bpf_trace_printk: op1 = 3913290112, op2 = 6

Why do op1 and op2 differ?
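One possible explanation, offered as an assumption rather than something the post confirms: for context pointers such as `skops`, the verifier rewrites direct field loads like `skops->op` so that they read from a kernel-internal structure, whereas a helper-based read such as `bpf_core_read` copies raw bytes at the program-visible offset. The user-space sketch below only imitates that effect with two simplified, hypothetical layouts (`ctx_view` and `ctx_kern`); neither is the real kernel definition.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout the program is written against (simplified, hypothetical). */
struct ctx_view {
    uint32_t op;
    uint32_t args[4];
};

/* Layout that actually sits behind the pointer (simplified, hypothetical):
 * a pointer comes first, and `op` lives at a different offset. */
struct ctx_kern {
    void    *sk;
    uint32_t args[4];
    uint8_t  op;
};

/* Imitates a rewritten direct access: reads the real `op` field. */
uint32_t load_rewritten(const struct ctx_kern *k)
{
    return k->op;
}

/* Imitates a raw memory read at the program-visible offset of `op`:
 * it picks up whatever bytes are stored there, here part of `sk`. */
uint32_t load_raw(const struct ctx_kern *k)
{
    uint32_t value;

    memcpy(&value,
           (const unsigned char *)k + offsetof(struct ctx_view, op),
           sizeof(value));
    return value;
}
```

Under this assumption, a constant large value for op1 would be consistent with reading part of a pointer, while op2 changes with each callback as expected.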

maplgebra (121 rep)

Dec 9, 2024, 06:13 AM • Last activity: Dec 10, 2024, 03:22 AM

0 votes

0 answers

44 views

How does bpftrace implement its printf function?

ebpf

The bpftrace language supports the function printf, which can write something to the terminal, but as far as I know eBPF running in kernel mode cannot call arbitrary kernel functions, so how is that implemented?

My rough guess is that bpftrace compiles its printf function to bpf_trace_printk (a BPF helper that can write formatted text into tracefs), and then the tracer process reads from /sys/kernel/tracing/trace_pipe to copy that text to its own stdout. But that route would just be too slow.
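As far as I know, bpftrace does not route `printf` through `bpf_trace_printk`. Instead, the BPF side sends a small record (an identifier for the format string plus the raw argument values) to user space through a perf or ring buffer, and the formatting happens in the bpftrace process. The sketch below imitates that split in plain C; `printf_event`, `emit_event`, and `format_event` are hypothetical names, and the byte array stands in for the kernel-to-user buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Record passed from the "kernel side" to the "user side". */
struct printf_event {
    uint64_t printf_id;  /* which format string to use */
    uint64_t args[2];    /* raw argument values, no text */
};

/* "Kernel side": only copies numbers into the buffer, no formatting. */
size_t emit_event(uint8_t *ring, uint64_t id, uint64_t a0, uint64_t a1)
{
    struct printf_event ev = { id, { a0, a1 } };

    memcpy(ring, &ev, sizeof(ev));
    return sizeof(ev);
}

/* "User side": picks the format string by id and renders the text.
 * Returns the number of characters written, or -1 for an unknown id. */
int format_event(const uint8_t *ring, char *out, size_t out_len)
{
    struct printf_event ev;

    memcpy(&ev, ring, sizeof(ev));
    switch (ev.printf_id) {
    case 0:
        return snprintf(out, out_len, "pid %llu opened fd %llu\n",
                        (unsigned long long)ev.args[0],
                        (unsigned long long)ev.args[1]);
    default:
        return -1;
    }
}
```

The design keeps the in-kernel work to a fixed-size copy, which is why it avoids the cost of formatting text inside the probe.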

炸鱼薯条德里克 (1435 rep)

Dec 7, 2024, 01:33 AM • Last activity: Dec 7, 2024, 03:42 AM

0 votes

0 answers

29 views

BPF program attached to `getname` won't get called when calling the `renameat2` syscall

linux-kernel system-calls ebpf

I'm fiddling with a BPF program that needs to attach to the two "getname" functions that are being called from the renameat2 syscall, defined in linux/fs/namei.c as:

SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
		int, newdfd, const char __user *, newname, unsigned int, flags)
{
	return do_renameat2(olddfd, getname(oldname), newdfd, getname(newname),
				flags);
}

getname calls getname_flags, which in turn calls strncpy_from_user. I need to access the char __user * name parameter, so I tried creating kprobes, fentries and fexits (with a simple "print" program) to intercept all three of those functions. With getname*, I get a lot of output, meaning that my BPF programs are actually being run. However, when calling renameat2 (e.g. when using the Linux mv command), I get no output at all. This is, in essence, the program I'm currently using, which doesn't get called when using the mv command:

SEC("fentry/getname_flags")
int BPF_PROG(hijack_getname, char *filename) {
  uid_t uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
  if (uid == 1002) { //hardcoded uid
    bpf_printk(" [%s]", filename);
  }
  return 0; // the program is declared int, so it must return a value
}

If I create a BPF tracepoint program that attaches to the entry and exit of renameat2, I can clearly see that there's no "getname" call between entry and exit. As I said, I also tried with kprobe and fexit. I can't manage to attach to strncpy_from_user without getting some weird errors about "Os: 22 - invalid argument". I really can't figure out what's happening, so any help would be appreciated :,) (P.S. I also posted this on Stack Overflow.)

Dennis Orlando (81 rep)

Dec 4, 2024, 05:18 PM

0 votes

0 answers

39 views

When `BPF(eBPF)` traces the call stack, all user-mode functions are `[unknown]`. Why is this?

linux ebpf

Experimental environment

┌──[root@vms99.liruilongs.github.io]-[/usr/share/bcc/tools]
└─$hostnamectl
   Static hostname: vms99.liruilongs.github.io
         Icon name: computer-vm
           Chassis: vm
        Machine ID: ea70bf6266cb413c84266d4153276342
           Boot ID: 0d01838b0095494c82d1befb174a317d
    Virtualization: vmware
  Operating System: Rocky Linux 8.9 (Green Obsidian)
       CPE OS Name: cpe:/o:rocky:rocky:8:GA
            Kernel: Linux 4.18.0-513.9.1.el8_9.x86_64
      Architecture: x86-64
┌──[root@vms99.liruilongs.github.io]-[/usr/share/bcc/tools]
└─$

When using BPF/eBPF to trace the call stack, I found that all user-mode functions are [unknown]

┌──[root@vms99.liruilongs.github.io]-[/usr/share/bcc/tools]
└─$profile
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
^C
    _raw_spin_unlock_irqrestore
    _raw_spin_unlock_irqrestore
    prepare_to_swait_event
    rcu_gp_kthread
    kthread
    ret_from_fork
    -                rcu_sched (14)
        1

    kmem_cache_alloc_node
    kmem_cache_alloc_node
    __alloc_skb
    __ip_append_data.isra.50
    ip_append_data.part.51
    ip_send_unicast_reply
    tcp_v4_send_reset
    tcp_v4_rcv
    ip_protocol_deliver_rcu
    ip_local_deliver_finish
    ip_local_deliver
    ip_rcv
    __netif_receive_skb_core
    process_backlog
    __napi_poll
    net_rx_action
    __softirqentry_text_start
    do_softirq_own_stack
    do_softirq.part.16
    __local_bh_enable_ip
    ip_finish_output2
    ip_output
    __ip_queue_xmit
    __tcp_transmit_skb
    tcp_connect
    tcp_v4_connect
    __inet_stream_connect
    inet_stream_connect
    __sys_connect
    __x64_sys_connect
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    [unknown]
    -                haproxy (1203)
        1

    show_vma_header_prefix
    show_vma_header_prefix
    show_map_vma
    show_map
    seq_read
    vfs_read
    ksys_read
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    [unknown]
    [unknown]
    -                awk (39726)
        1
.............        
┌──[root@vms99.liruilongs.github.io]-[/usr/share/bcc/tools]
└─$

What is the reason for this? Is it because the program lacks debugging information, or is there some other reason? I used Python to write a lock demo, and the same thing happened:

┌──[root@vms99.liruilongs.github.io]-[~]
└─$cat lock_demo.py
import threading
import time

lock = threading.Lock()

def worker(id):
    print(f"Worker {id} started")
    with lock:
        print(f"Worker {id} acquired lock")
        time.sleep(2)  # simulate a long computation or I/O
    print(f"Worker {id} released lock")

threads = []
for i in range(5):
    t = threading.Thread(target=worker, args=(i,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print("All workers finished")

threadsnoop

┌──[root@vms99.liruilongs.github.io]-[~]
└─$threadsnoop
TIME(ms)   PID     COMM             FUNC
0          51671   b'python3'       b'[unknown]'
0          51671   b'python3'       b'[unknown]'
0          51671   b'python3'       b'[unknown]'
0          51671   b'python3'       b'[unknown]'
0          51671   b'python3'       b'[unknown]'

offcputime

┌──[root@vms99.liruilongs.github.io]-[~/FlameGraph]
└─$offcputime -p $(pgrep -f lock_demo.py)
Tracing off-CPU time (us) of PID 51397 by user + kernel stack... Hit Ctrl-C to end.
^C
    .......................
    finish_task_switch
    __sched_text_start
    schedule
    futex_wait_queue_me
    futex_wait
    do_futex
    __x64_sys_futex
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    [unknown]
    -                python3 (51402)
        157

    finish_task_switch
    __sched_text_start
    schedule
    futex_wait_queue_me
    futex_wait
    do_futex
    __x64_sys_futex
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    [unknown]
    -                python3 (51400)
        213

    finish_task_switch
    __sched_text_start
    schedule
    futex_wait_queue_me
    futex_wait
    do_futex
    __x64_sys_futex
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    [unknown]
    -                python3 (51397)
        267

    finish_task_switch
    __sched_text_start
    schedule
    do_nanosleep
    hrtimer_nanosleep
    common_nsleep_timens
    __x64_sys_clock_nanosleep
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    [unknown]
    [unknown]
    -                python3 (51400)
        2002609

    finish_task_switch
    __sched_text_start
    schedule
    do_nanosleep
    hrtimer_nanosleep
    common_nsleep_timens
    __x64_sys_clock_nanosleep
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    [unknown]
    [unknown]
    -                python3 (51402)
        2003178
..................
┌──[root@vms99.liruilongs.github.io]-[~/FlameGraph]
└─$

Any help will be greatly appreciated, best wishes
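For background, a frame is printed as `[unknown]` when the tool cannot map a sampled user-space address to a symbol. Typical causes, stated here as assumptions about this setup rather than a diagnosis, are binaries built without frame pointers (so the stack walk yields bad addresses) or missing symbol information. The sketch below shows only the lookup step in plain C, assuming a symbol table sorted by start address; `struct symbol` and `resolve` are hypothetical names and the table is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

struct symbol {
    uint64_t    start;  /* first address covered by the symbol */
    uint64_t    size;   /* number of bytes covered */
    const char *name;
};

/* Resolve `addr` against a table sorted by ascending `start`.
 * Returns the symbol name, or "[unknown]" when no symbol covers it. */
const char *resolve(const struct symbol *syms, size_t count, uint64_t addr)
{
    size_t lo = 0, hi = count;

    /* Binary search for the number of symbols with start <= addr. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (syms[mid].start <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return "[unknown]";  /* addr lies below every known symbol */

    const struct symbol *candidate = &syms[lo - 1];

    /* The closest preceding symbol must actually contain the address. */
    if (addr - candidate->start < candidate->size)
        return candidate->name;
    return "[unknown]";
}
```

If the address handed to this step is garbage, or the table is empty because the binary is stripped, every lookup falls through to the placeholder.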

山河以无恙 (185 rep)

Oct 21, 2024, 01:58 AM • Last activity: Oct 22, 2024, 01:40 AM

4 votes

1 answers

839 views

How to get the current cgroup ID from C/C++?

system-calls cgroups ebpf

The [eBPF helper functions](https://man7.org/linux/man-pages/man7/bpf-helpers.7.html) define bpf_get_current_cgroup_id for eBPF programs, which does the obvious thing

u64 bpf_get_current_cgroup_id(void)

        Return A 64-bit integer containing the current cgroup id
               based on the cgroup within which the current task
               is running.

However, I can't find an equivalent system call (something similar to [getpid](https://man7.org/linux/man-pages/man2/getppid.2.html)) that I can use in a regular old C program. Am I just completely missing the relevant function? Or does userspace need to do something different to get the cgroup ID for the current task?
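As far as I know there is no dedicated system call for this. On cgroup v2, the ID returned by the helper is generally reported to match the inode number of the task's cgroup directory. The sketch below relies on that, and on two further assumptions: a 64-bit system, and cgroup2 mounted at `/sys/fs/cgroup`. `cgroup_v2_path` and `current_cgroup_id` are hypothetical helper names.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Extract the cgroup v2 path from the contents of /proc/self/cgroup.
 * The v2 entry has the form "0::/some/path". Returns 0 on success,
 * -1 if no v2 entry is found or the output buffer is too small. */
int cgroup_v2_path(const char *proc_cgroup, char *out, size_t out_len)
{
    const char *line = proc_cgroup;

    while (line && *line) {
        if (strncmp(line, "0::", 3) == 0) {
            const char *path = line + 3;
            size_t len = strcspn(path, "\n");

            if (len + 1 > out_len)
                return -1;
            memcpy(out, path, len);
            out[len] = '\0';
            return 0;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Returns the inode number of the current task's cgroup directory,
 * or 0 on failure. Assumes cgroup2 is mounted at /sys/fs/cgroup. */
uint64_t current_cgroup_id(void)
{
    char buf[4096], rel[4096], full[4096 + 32];
    struct stat st;
    size_t n;
    FILE *f = fopen("/proc/self/cgroup", "r");

    if (!f)
        return 0;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    if (cgroup_v2_path(buf, rel, sizeof(rel)) != 0)
        return 0;
    snprintf(full, sizeof(full), "/sys/fs/cgroup%s", rel);
    if (stat(full, &st) != 0)
        return 0;
    return (uint64_t)st.st_ino;
}
```

An alternative with the same intent is `name_to_handle_at` on the cgroup directory, which returns the ID as part of the file handle.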

user547386 (41 rep)

Oct 31, 2022, 08:15 PM • Last activity: Aug 16, 2024, 09:24 AM

0 votes

1 answers

110 views

New added android kernel bpf helpers are not detected

linux-kernel kernel android ebpf

I'm trying to patch an Android 4.9 kernel to support the probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers. For the backport I followed the example of another patch that adds the bpf_probe_read_str helper. After patching the kernel to add the helpers and running bpftrace --info, the str helper shows up but the newly added ones don't.

bpftrace output

System
  OS: Linux 4.9.337-g4fcceb75c5cd #1 SMP PREEMPT Sat May 18 17:26:12 EEST 2024
  Arch: aarch64

Build
  version: v0.19.1
  LLVM: 14.0.6
  unsafe probe: yes
  bfd: no
  libdw (DWARF support): no

libbpf: failed to find valid kernel BTF
Kernel helpers
  probe_read: yes
  probe_read_str: yes
  probe_read_user: no
  probe_read_user_str: no
  probe_read_kernel: no
  probe_read_kernel_str: no
  get_current_cgroup_id: no
  send_signal: no
  override_return: no
  get_boot_ns: no
  dpath: no
  skboutput: no
  get_tai_ns: no
  get_func_ip: no

Kernel features
  Instruction limit: -1
  Loop support: no
  btf: no
  module btf: no
  map batch: no
  uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): no

Map types
  hash: yes
  percpu hash: yes
  array: yes
  percpu array: yes
  stack_trace: yes
  perf_event_array: yes
  ringbuf: no

Probe types
  kprobe: yes
  tracepoint: yes
  perf_event: yes
  kfunc: no
  kprobe_multi: no
  raw_tp_special: no
  iter: no

probe_read_{user, kernel} and probe_read_{user, kernel}_str

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index cc5ba47062e8..48762ecbfd66 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -88,6 +88,7 @@ static inline unsigned long __copy_from_user_nocache(void *to,
  * happens, handle that and return -EFAULT.
  */
 extern long probe_kernel_read(void *dst, const void *src, size_t size);
+extern long probe_kernel_read_strict(void *dst, const void *src, size_t size);
 extern long __probe_kernel_read(void *dst, const void *src, size_t size);
 
 /*
@@ -126,6 +127,8 @@ extern long notrace probe_user_write(void __user *dst, const void *src, size_t s
 extern long notrace __probe_user_write(void __user *dst, const void *src, size_t size);
 
 extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
+long strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
+		long count);
 extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
 				     long count);
 extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 67d7d771a944..d1036b0ba1fa 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -552,6 +552,11 @@ enum bpf_func_id {
 	 */
 	BPF_FUNC_get_socket_uid,
 
+	BPF_FUNC_probe_read_user,
+	BPF_FUNC_probe_read_kernel,
+	BPF_FUNC_probe_read_user_str,
+	BPF_FUNC_probe_read_kernel_str,
+
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 83b20092b84c..e872ab1fb235 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -85,44 +85,6 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
 	return ret;
 }
 
-static const struct bpf_func_proto bpf_probe_read_proto = {
-	.func		= bpf_probe_read,
-	.gpl_only	= true,
-	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_RAW_STACK,
-	.arg2_type	= ARG_CONST_STACK_SIZE,
-	.arg3_type	= ARG_ANYTHING,
-};
-
-BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size, const void *, unsafe_ptr)
-{
-	int ret;
-
-	/*
-	 * The strncpy_from_unsafe() call will likely not fill the entire
-	 * buffer, but that's okay in this circumstance as we're probing
-	 * arbitrary memory anyway similar to bpf_probe_read() and might
-	 * as well probe the stack. Thus, memory is explicitly cleared
-	 * only in error case, so that improper users ignoring return
-	 * code altogether don't copy garbage; otherwise length of string
-	 * is returned that can be used for bpf_perf_event_output() et al.
-	 */
-	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
-	if (unlikely(ret < 0))
-		memset(dst, 0, size);
-
-	return ret;
-}
-
-static const struct bpf_func_proto bpf_probe_read_str_proto = {
-	.func           = bpf_probe_read_str,
-	.gpl_only       = true,
-	.ret_type       = RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_RAW_STACK,
-	.arg2_type	= ARG_CONST_STACK_SIZE,
-	.arg3_type	= ARG_ANYTHING,
-};
-
 BPF_CALL_3(bpf_probe_write_user, void *, unsafe_ptr, const void *, src,
 	   u32, size)
 {
@@ -465,6 +427,139 @@ static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = {
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_probe_read_user, void *, dst, u32, size,
+	   const void __user *, unsafe_ptr)
+{
+	int ret = probe_user_read(dst, unsafe_ptr, size);
+
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_user_proto = {
+	.func		    = bpf_probe_read_user,
+	.gpl_only	  = true,
+	.ret_type	  = RET_INTEGER,
+	.arg1_type  = ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_probe_read_user_str, void *, dst, u32, size,
+	   const void __user *, unsafe_ptr)
+{
+	int ret = strncpy_from_unsafe_user(dst, unsafe_ptr, size);
+
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_user_str_proto = {
+	.func		= bpf_probe_read_user_str,
+	.gpl_only   = true,
+	.ret_type   = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static __always_inline int
+bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr,
+			     const bool compat)
+{
+	int ret;
+	ret = compat ? probe_kernel_read(dst, unsafe_ptr, size) :
+	      probe_kernel_read_strict(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+	return ret;
+}
+
+BPF_CALL_3(bpf_probe_read_kernel, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, false);
+}
+
+static const struct bpf_func_proto bpf_probe_read_kernel_proto = {
+	.func		= bpf_probe_read_kernel,
+	.gpl_only   = true,
+	.ret_type   = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_probe_read_compat, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_common(dst, size, unsafe_ptr, true);
+}
+
+static const struct bpf_func_proto bpf_probe_read_compat_proto = {
+	.func		= bpf_probe_read_compat,
+	.gpl_only   = true,
+	.ret_type   = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static __always_inline int
+bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr,
+				 const bool compat)
+{
+	int ret;
+	/*
+	 * The strncpy_from_unsafe_*() call will likely not fill the entire
+	 * buffer, but that's okay in this circumstance as we're probing
+	 * arbitrary memory anyway similar to bpf_probe_read_*() and might
+	 * as well probe the stack. Thus, memory is explicitly cleared
+	 * only in error case, so that improper users ignoring return
+	 * code altogether don't copy garbage; otherwise length of string
+	 * is returned that can be used for bpf_perf_event_output() et al.
+	 */
+	ret = compat ? strncpy_from_unsafe(dst, unsafe_ptr, size) :
+	      strncpy_from_unsafe_strict(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+	return ret;
+}
+
+BPF_CALL_3(bpf_probe_read_kernel_str, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, false);
+}
+
+static const struct bpf_func_proto bpf_probe_read_kernel_str_proto = {
+	.func		= bpf_probe_read_kernel_str,
+	.gpl_only   = true,
+	.ret_type   = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_probe_read_compat_str, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	return bpf_probe_read_kernel_str_common(dst, size, unsafe_ptr, true);
+}
+
+static const struct bpf_func_proto bpf_probe_read_compat_str_proto = {
+	.func		= bpf_probe_read_compat_str,
+	.gpl_only   = true,
+	.ret_type   = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -474,10 +569,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_map_update_elem_proto;
 	case BPF_FUNC_map_delete_elem:
 		return &bpf_map_delete_elem_proto;
-	case BPF_FUNC_probe_read:
-		return &bpf_probe_read_proto;
 	case BPF_FUNC_probe_read_str:
-		return &bpf_probe_read_str_proto;
+		return &bpf_probe_read_compat_str_proto;
 	case BPF_FUNC_ktime_get_ns:
 		return &bpf_ktime_get_ns_proto;
 	case BPF_FUNC_tail_call:
@@ -504,6 +597,17 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_current_task_under_cgroup_proto;
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
+	case BPF_FUNC_probe_read_user:
+		return &bpf_probe_read_user_proto;
+	case BPF_FUNC_probe_read_kernel:
+		return &bpf_probe_read_kernel_proto;
+	case BPF_FUNC_probe_read:
+		return &bpf_probe_read_compat_proto;
+	case BPF_FUNC_probe_read_user_str:
+		return &bpf_probe_read_user_str_proto;
+	case BPF_FUNC_probe_read_kernel_str:
+		return &bpf_probe_read_kernel_str_proto;
+
 	default:
 		return NULL;
 	}
diff --git a/mm/maccess.c b/mm/maccess.c
index 03ea550f5a74..583935a288ad 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -47,6 +47,9 @@ probe_write_common(void __user *dst, const void *src, size_t size)
 long __weak probe_kernel_read(void *dst, const void *src, size_t size)
     __attribute__((alias("__probe_kernel_read")));
 
+long __weak probe_kernel_read_strict(void *dst, const void *src, size_t size)
+    __attribute__((alias("__probe_kernel_read")));
+
 long __probe_kernel_read(void *dst, const void *src, size_t size)
 {
 	long ret;
@@ -157,6 +160,10 @@ EXPORT_SYMBOL_GPL(probe_user_write);
  * If @count is smaller than the length of the string, copies @count-1 bytes,
  * sets the last byte of @dst buffer to NUL and returns @count.
  */
+long __weak strncpy_from_unsafe_strict(char *dst, const void *unsafe_addr,
+				       long count)
+    __attribute__((alias("strncpy_from_unsafe")));
+
 long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
 {
 	mm_segment_t old_fs = get_fs();
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a339bea1f4c8..e6caf916d217 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -516,6 +516,11 @@ enum bpf_func_id {
 	 */
 	BPF_FUNC_get_socket_uid,
 
+	BPF_FUNC_probe_read_user,
+	BPF_FUNC_probe_read_kernel,
+	BPF_FUNC_probe_read_user_str,
+	BPF_FUNC_probe_read_kernel_str,
+
 	__BPF_FUNC_MAX_ID,
 };

bpf: add bpf_probe_read_str helper

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 83b20092b84c..59182e6d6f51 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -465,6 +465,36 @@ static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = {
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
+	   const void *, unsafe_ptr)
+{
+	int ret;
+
+	/*
+	 * The strncpy_from_unsafe() call will likely not fill the entire
+	 * buffer, but that's okay in this circumstance as we're probing
+	 * arbitrary memory anyway similar to bpf_probe_read() and might
+	 * as well probe the stack. Thus, memory is explicitly cleared
+	 * only in error case, so that improper users ignoring return
+	 * code altogether don't copy garbage; otherwise length of string
+	 * is returned that can be used for bpf_perf_event_output() et al.
+	 */
+	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_probe_read_str_proto = {
+	.func		= bpf_probe_read_str,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
+	.arg2_type	= ARG_CONST_STACK_SIZE,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -504,6 +534,8 @@ static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
 		return &bpf_current_task_under_cgroup_proto;
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
+	case BPF_FUNC_probe_read_str:
+		return &bpf_probe_read_str_proto;
 	default:
 		return NULL;
 	}

Marcel (1 rep)

May 18, 2024, 02:54 PM • Last activity: May 24, 2024, 04:52 PM

7 votes

2 answers

6151 views

How do packets flow through the kernel

linux netfilter ipv4 ebpf ipvs

When it comes to packet filtering/management I never actually know what is going on inside the kernel. There are so many different tools that act on the packets, either from userspace (modifying kernel-space subsystems) or directly in kernel space.

Is there any place where each tool documents its interaction with other tools, or where it acts? I feel like there should be a diagram somewhere specifying what is going on for people who aren't technical enough to go and read the kernel code.

So here's my example:

A packet is received on one of my network interfaces and I have:
- UFW
- iptables
- IPv4 subsystem (routing)
- IPVS
- eBPF

Ok, so I know that UFW is a frontend for iptables, and iptables is a frontend for Netfilter. So now we're in kernel space and our tools are Netfilter, IPVS, IPv4 and eBPF.

Again, the interactions between Netfilter and the IPv4 subsystems are easy to find since these are very old (not in a bad way) subsystems, so lack of docs would be very strange. This diagram is an overview of the interaction:

But what about IPVS and eBPF? What's the actual order in which kernel subsystems act upon the packets when these two are in the kernel?

I'm always impressed by people who dig into the internals and help others understand them, for example [this description of the interaction between LVS and Netfilter](http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.filter_rules.html).

But shouldn't this be documented in a more official fashion? I'm not looking for an explanation here of how these submodules interact; I know I could find that myself by searching. My question is more general: why is there no official documentation that actually tries to explain what is going on inside these kernel subsystems? Is it documented somewhere that I just don't know of? Is there any reason not to try to explain these tools?

I apologize if I'm not making any sense. I just started learning about these things.

AFP_555 (311 rep)

Jan 26, 2022, 05:55 PM • Last activity: May 23, 2024, 03:53 PM

2 votes

1 answers

250 views

eBPF in real-time systems

ebpf rtos

I've a question about real-time systems, in particular in LynxOS (LynxOS-178).

I would need information on the compatibility and presence of eBPF in these systems.

Can anyone help me?

I haven't found any documentation online, but I would like to have this information to be able to evaluate which product is best to use to achieve my purpose.

Serena Schenone (21 rep)

Apr 30, 2024, 11:16 AM • Last activity: May 1, 2024, 08:13 AM

2 votes

1 answers

565 views

How to get argv[0] in bpftrace?

ebpf

I have this rather simple script:

#!/usr/bin/bpftrace
tracepoint:syscalls:sys_enter_exec*
{
    @start[pid] = nsecs;
    printf("START;%-6d;", pid);
    join(args->argv);
}
tracepoint:syscalls:sys_enter_exit*
{
    $from = @start[pid];
    $until = nsecs;
    printf("STOP;%-5d;%-16d\n", pid, $until-$from);
}

I'd much rather have it print args->argv instead of printing the often multi-line join(args->argv). Problem is that printf("START;%-6d;%s", pid, args->argv); doesn't work:

/tmp/foo.bt:5:5-48: ERROR: printf: %s specifier expects a value of type string (integer supplied)
    printf("START;%-6d;%s", pid, args->argv);
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'm pretty sure args->argv is a string array, so this surprises me. How do I solve this?
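As a point of reference for the type error above: in the execve tracepoint format, `argv` is declared as `const char *const *`, a pointer to an array of string pointers rather than a string itself, which is presumably why `%s` reports an integer. The user-space C sketch below only illustrates the two dereferences needed to reach the characters of the first argument; `argv_t`, `first_arg` and `first_arg_len` are hypothetical names chosen for illustration, and this is not bpftrace code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* argv as execve receives it: a pointer to a NULL-terminated
 * array of pointers to C strings. (Hypothetical typedef.) */
typedef const char *const *argv_t;

/* First dereference: fetch the pointer stored in slot 0.
 * Returns "" when the vector is empty. */
const char *first_arg(argv_t argv)
{
    assert(argv != NULL);
    return argv[0] != NULL ? argv[0] : "";
}

/* Second dereference: read the characters behind that pointer. */
size_t first_arg_len(argv_t argv)
{
    return strlen(first_arg(argv));
}
```

Passing `argv` itself where a string is expected hands over only the outer pointer value; a tracing script likewise has to fetch the element first and then read the string it points to.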

Marcus Müller (47107 rep)

Sep 19, 2023, 09:49 AM • Last activity: Dec 14, 2023, 09:13 AM

0 votes

0 answers

79 views

Can I use systemd resource management to deny port only outside containers

systemd fedora container podman ebpf

On an up-to-date Fedora 39, I have set up podman for rootless containers, and I limit the ports a user may bind to by creating /etc/systemd/system/user-1000.slice.d/user-resources.conf with

[Slice]
SocketBindAllow   = 12345
SocketBindDeny    = any

Now as expected, the user cannot bind to port 20202 for example:

$ nc -4 -lp 20202
Ncat: bind to 0.0.0.0:20202: Operation not permitted. QUITTING.

However what bothers me is that it's not even possible to bind to a denied port *within* a container without exposing the port:

$ podman run docker.io/library/alpine nc -lp 20202
nc: bind: Operation not permitted

Is this a bug? Is there anything I can do about it?

Gamification (231 rep)

Dec 7, 2023, 10:02 AM

1 votes

0 answers

273 views

eBPF vs verified Linux Kernel Modules

linux-kernel kernel-modules ebpf

In what way is eBPF superior to a kernel module verified on the user-side?

I'm not disputing the value of verified code; both approaches would be fully statically verified.
Both approaches require capabilities usually only given to privileged users.

However, running the verifier in user-space gives the user more choices between verifiers, safety-levels, and permissible assumptions. The verifier can also be more rapidly developed separately from the kernel.

---

**Things I read before asking this question**:

- I found [this Hacker News thread](https://news.ycombinator.com/item?id=14726311), which only says that _some_ limited eBPF filters do not need privilege, but I understand most eBPF applications will still require privilege?

- I found [this page](https://github.com/nyrahul/ebpf-guide/blob/master/docs/ebpf_vs_kernmod.rst), which claims
  > Kernel modules have a specific entry point (init_module()) and exit (cleanup_module()) point. eBPF can be hooked to any kprobe/kretprobe/tracepoint and thus can be used for tracing

  Contrary to the previous quote, it seems kprobe/kretprobe/tracepoints can be hooked (aka registered) from loadable kernel modules according to Linux documentation on [kprobes/kretprobes](https://docs.kernel.org/trace/kprobes.html) and [tracepoints](https://docs.kernel.org/trace/tracepoints.html).

- [This page](https://github.com/nyrahul/ebpf-guide/blob/master/docs/ebpf_vs_kernmod.rst) also claims that eBPF cannot be pre-empted and kernel modules can. Whenever I Google anything about the Linux kernel preempting module code, the results always talk about preempting user space, never kernel space. I don't know what it means that the "Kernel module follows regular kernel code preemption logic" but on the other hand "eBPF instruction-set execution cannot be preempted by kernel".

charmoniumQ (255 rep)

Sep 28, 2023, 04:21 PM • Last activity: Sep 28, 2023, 04:32 PM

1 votes

1 answers

918 views

Redirect port using TC BPF

linux networking tc ebpf

I want to use TC BPF to redirect incoming traffic from port 80 to port 8080. Below is my own code, but I've also tried the example from [man 8 tc-bpf](https://man7.org/linux/man-pages/man8/tc-bpf.8.html) (search for 8080) and I get the same result.

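/* The header names are missing from the post; this set is an assumption
 * based on the symbols the code uses. */
#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>

#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>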

static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
                                            __u16 old_port, __u16 new_port)
{
	bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
						old_port, new_port, sizeof(new_port));
	bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
						&new_port, sizeof(new_port), 0);
}

SEC("tc_my")
int tc_bpf_my(struct __sk_buff *skb)
{
	struct iphdr ip;
	struct tcphdr tcp;
	if (0 != bpf_skb_load_bytes(skb, sizeof(struct ethhdr), &ip, sizeof(struct iphdr))) {
		bpf_printk("bpf_skb_load_bytes iph failed");
		return TC_ACT_OK;
	}

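	/* The text between "ip.ihl" and "%pI4:%u" is missing from the post; the lines
	 * marked "assumed" are a reconstruction, not the author's exact code. */
	if (0 != bpf_skb_load_bytes(skb, sizeof(struct ethhdr) + (ip.ihl << 2), &tcp, sizeof(struct tcphdr))) { /* assumed */
		bpf_printk("bpf_skb_load_bytes tcph failed"); /* assumed */
		return TC_ACT_OK;
	}

	unsigned int src_port = bpf_ntohs(tcp.source); /* assumed */
	unsigned int dst_port = bpf_ntohs(tcp.dest);   /* assumed */
	bpf_printk("%pI4:%u -> %pI4:%u", &ip.saddr, src_port, &ip.daddr, dst_port);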

	if (dst_port != 80)
		return TC_ACT_OK;

	set_tcp_dport(skb, ETH_HLEN + sizeof(struct iphdr), __constant_htons(80), __constant_htons(8080));

	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";

On machine A, I am running:

clang -g -O2 -Wall -target bpf -c tc_my.c -o tc_my.o
tc qdisc add dev ens160 clsact
tc filter add dev ens160 ingress bpf da obj tc_my.o sec tc_my
nc -l 8080

On machine B:

nc $IP_A 80

On machine B, nc seems connected, but ss shows:

SYN-SENT 0 1 $IP_B:53442 $IP_A:80 users:(("nc",pid=30180,fd=3))

On machine A, connection remains in SYN-RECV before being dropped.

I was expecting my program to behave as if I added this iptables rule:

iptables -t nat -A PREROUTING -p tcp -m tcp --dport 80 -j REDIRECT --to-port 8080

Maybe my expectations are wrong, but I would like to understand why. How can I get my TC BPF redirect to work?

SOLUTION
-----------------

Following the explanation in my accepted answer, here is an example code which works for TCP, does ingress NAT 90->8080, and egress de-NAT 8080->90.

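/* The header names are missing from the post; this set is an assumption
 * based on the symbols the code uses. */
#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>

#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>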

static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
								 __u16 old_port, __u16 new_port)
{
	bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
						old_port, new_port, sizeof(new_port));
	bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
						&new_port, sizeof(new_port), 0);
}

static inline void set_tcp_sport(struct __sk_buff *skb, int nh_off,
								 __u16 old_port, __u16 new_port)
{
	bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
						old_port, new_port, sizeof(new_port));
	bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, source),
						&new_port, sizeof(new_port), 0);
}

SEC("tc_ingress")
int tc_ingress_(struct __sk_buff *skb)
{
	struct iphdr ip;
	struct tcphdr tcp;
	if (0 != bpf_skb_load_bytes(skb, sizeof(struct ethhdr), &ip, sizeof(struct iphdr)))
	{
		bpf_printk("bpf_skb_load_bytes iph failed");
		return TC_ACT_OK;
	}

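	/* The text between "ip.ihl" and "%pI4:%u" is missing from the post; the lines
	 * marked "assumed" are a reconstruction, not the author's exact code. */
	if (0 != bpf_skb_load_bytes(skb, sizeof(struct ethhdr) + (ip.ihl << 2), &tcp, sizeof(struct tcphdr))) /* assumed */
	{
		bpf_printk("bpf_skb_load_bytes tcph failed"); /* assumed */
		return TC_ACT_OK;
	}

	unsigned int src_port = bpf_ntohs(tcp.source); /* assumed */
	unsigned int dst_port = bpf_ntohs(tcp.dest);   /* assumed */
	bpf_printk("%pI4:%u -> %pI4:%u", &ip.saddr, src_port, &ip.daddr, dst_port);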

	if (dst_port != 90)
		return TC_ACT_OK;

	set_tcp_dport(skb, ETH_HLEN + sizeof(struct iphdr), __constant_htons(90), __constant_htons(8080));

	return TC_ACT_OK;
}

SEC("tc_egress")
int tc_egress_(struct __sk_buff *skb)
{
	struct iphdr ip;
	struct tcphdr tcp;
	if (0 != bpf_skb_load_bytes(skb, sizeof(struct ethhdr), &ip, sizeof(struct iphdr)))
	{
		bpf_printk("bpf_skb_load_bytes iph failed");
		return TC_ACT_OK;
	}

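	/* The text between "ip.ihl" and "%pI4:%u" is missing from the post; the lines
	 * marked "assumed" are a reconstruction, not the author's exact code. */
	if (0 != bpf_skb_load_bytes(skb, sizeof(struct ethhdr) + (ip.ihl << 2), &tcp, sizeof(struct tcphdr))) /* assumed */
	{
		bpf_printk("bpf_skb_load_bytes tcph failed"); /* assumed */
		return TC_ACT_OK;
	}

	unsigned int src_port = bpf_ntohs(tcp.source); /* assumed */
	unsigned int dst_port = bpf_ntohs(tcp.dest);   /* assumed */
	bpf_printk("%pI4:%u -> %pI4:%u", &ip.saddr, src_port, &ip.daddr, dst_port);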

	if (src_port != 8080)
		return TC_ACT_OK;

	set_tcp_sport(skb, ETH_HLEN + sizeof(struct iphdr), __constant_htons(8080), __constant_htons(90));

	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";

Here is how I build and loaded the different sections in my program:

clang -g -O2 -Wall -target bpf -c tc_my.c -o tc_my.o
tc filter add dev ens32 ingress bpf da obj /tc_my.o sec tc_ingress
tc filter add dev ens32 egress bpf da obj /tc_my.o sec tc_egress
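The `set_tcp_dport` and `set_tcp_sport` helpers above depend on `bpf_l4_csum_replace` to patch the TCP checksum when a 16-bit port changes. As a rough user-space illustration of what such an incremental update computes (following RFC 1624, eqn. 3), here is a minimal sketch; the function names `csum_fold`, `csum_full` and `csum_replace16` are hypothetical, and the code makes no claim to mirror the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over len bytes, read as big-endian
 * 16-bit words. For simplicity len must be even. */
uint16_t csum_full(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
    return (uint16_t)~csum_fold(sum);
}

/* Incremental update when one 16-bit field changes from old_val
 * to new_val: HC' = ~(~HC + ~m + m'). */
uint16_t csum_replace16(uint16_t check, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_val;
    sum += new_val;
    return (uint16_t)~csum_fold(sum);
}
```

Updating a checksum with `csum_replace16(check, 80, 8080)` should give the same value as recomputing `csum_full` over the modified bytes, which is why the helpers only need the old and new port rather than the whole segment.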

greenro (13 rep)

Sep 14, 2023, 01:04 PM • Last activity: Sep 15, 2023, 09:25 AM

10 votes

2 answers

4705 views

Understanding of BPF

performance tcpdump ebpf

When I need to capture some packets using tcpdump, I use a command like:

tcpdump -i eth0 "dst host 192.168.1.0"

I always thought the *dst host 192.168.1.0* part was something called BPF, Berkeley Packet Filter. To me, it's a simple language for filtering network packets. But today my roommate told me that BPF can be used to capture performance info. According to his description, it's like the tool perfmon on Windows. Is that true? Is it the same BPF I mentioned at the beginning of the question?

Fajela Tajkiya (1065 rep)

Apr 18, 2022, 10:07 PM • Last activity: Sep 1, 2023, 10:17 AM

1 votes

1 answers

797 views

Log all commands executed regardless of shell?

security logs exec tracing ebpf

Suppose a user runs the following command:

    zcat file.gz | grep something | gzip > grepped.gz

I'm looking for a kernel feature (a BPF filter perhaps?) that would note all of the execves, chain together their stdins/stdouts and reconstruct that in a similar form, putting it into system logs. Is there a way to do that without interfacing with the shells?

d33tah (1381 rep)

May 10, 2023, 10:45 AM • Last activity: May 10, 2023, 06:21 PM

1 votes

1 answers

161 views

DPROBES (DTRACE_PROBE) for measuring high latency stuff under 1µsec

linux performance scheduling sleep ebpf

Currently, I'm analyzing the performance of a high latency application but I'm not confident in my measurements at all. So far, I have used DPROBES for instrumentation and BCC/funclatency for measuring.
Would someone be able to verify those numbers? Also, if someone knows of a better method, please let me know.

measuring usleep(1) : avg = 53962 nsecs

    #include 
    #include 
    
    int main() {
       int i;
       for(i=0; i
    
    int main() {
       int i;
       for(i=0; i
    #include 
    #include 
    #include 
    int main() {
        struct timespec tim, tim2;
        tim.tv_sec = 0;
        tim.tv_nsec = 200L;
        int i;
        for (i = 0; i < 100000; i++) {
            DTRACE_PROBE("hello-usdt", probe-main-start);
            if (nanosleep(&tim, &tim2) < 0) {
                printf("Nano sleep system call failed \n");
                return -1;
            }
            DTRACE_PROBE("hello-usdt", probe-main-end);
        }
        printf("Nano sleep successfull \n");

        return 0;
    }


A little modification was made to the funclatency code:

    # attach probes
    
    usdt = USDT(path = "path to application")
    usdt.enable_probe(probe = "probe-main-start", fn_name = "trace_func_entry")
    usdt.enable_probe(probe = "probe-main-end", fn_name = "trace_func_return")
    b = BPF(text = bpf_text, usdt_contexts = [usdt])

Am I unable to measure anything below 0.8 usec with this method?
Furthermore, I cannot believe that nanosleep(200) "oversleeps" by 50 usec.
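One way to cross-check the probe-based numbers without any tracing overhead is to time the loop from inside the process with `clock_gettime(CLOCK_MONOTONIC)`. The sketch below is only an illustration under that assumption; `timespec_diff_ns` and `avg_nanosleep_ns` are hypothetical helper names. For context, prctl(2) documents a default timer slack of 50 µs for ordinary (non-real-time) threads on Linux, which may account for an average close to the one reported above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed between two timespec readings. */
int64_t timespec_diff_ns(const struct timespec *start,
                         const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (end->tv_nsec - start->tv_nsec);
}

/* Average wall-clock cost, in ns, of one nanosleep(request_ns)
 * call over `iterations` calls. Returns -1 if a call fails. */
int64_t avg_nanosleep_ns(long request_ns, int iterations)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = request_ns };
    struct timespec t0, t1;

    assert(iterations > 0);
    assert(request_ns >= 0 && request_ns < 1000000000L);

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    for (int i = 0; i < iterations; i++) {
        if (nanosleep(&req, NULL) != 0)
            return -1;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;

    return timespec_diff_ns(&t0, &t1) / iterations;
}
```

Calling `avg_nanosleep_ns(200, 100000)` from a small `main` and comparing the result with the funclatency average would show how much of the reported latency comes from the sleep itself rather than from the USDT probes.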



                                

Bahamas (113 rep)

Jan 28, 2023, 02:31 PM • Last activity: Jan 31, 2023, 09:30 PM

2 votes

2 answers

814 views

What are the limitations of eBPF feature-wise?

linux-kernel ebpf

I understood it is mainly used for observability (i.e. read-only).
I saw you can route packets, but can you do more than that?
Can you also manipulate the file system, send signals and write from an eBPF program?
                                

funerr (123 rep)

Sep 27, 2022, 10:48 PM • Last activity: Sep 28, 2022, 09:24 AM
