Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

3 votes

1 answers

49 views

File access permissions missing after setuid() system call

I have a file access problem in a self-developed daemon process after a setuid() system call. I already posted this question on SO, but the impression there is that the problem is related to Linux rather than C++, so maybe someone here can help me solve it.

My daemon program cannot access a configuration file after a setuid(iUid) system call even though iUid is the owner of the configuration file. Why?

I am writing a controller daemon in C++ for home automation which will eventually run on a Raspberry Pi with Raspberry Pi OS. It is started with root permissions because, after start, it must read an SSL certificate to which only root is granted read access. After the SSL certificate is read, the daemon should switch to user 'pvmonitor', as root permissions are no longer needed. This is done by

    setuid( iUid );

and I have checked with ps that the process runs as user 'pvmonitor'.
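ps shows only part of the picture. A minimal sketch, assuming a POSIX system, of how the process could report its own user and group IDs after the switch; `describe_ids` and `supplementary_group_count` are hypothetical helper names, not part of the daemon:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: formats the real and effective user and group IDs of
 * the calling process into buf. Returns the number of characters written,
 * or -1 on error or truncation. */
int describe_ids(char *buf, size_t len)
{
    int n = snprintf(buf, len, "uid r=%ld e=%ld gid r=%ld e=%ld",
                     (long)getuid(), (long)geteuid(),
                     (long)getgid(), (long)getegid());
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: number of supplementary groups the process still
 * carries, or -1 on error. */
int supplementary_group_count(void)
{
    return getgroups(0, NULL);   /* size 0 asks only for the count */
}
```

Calling both helpers right before the failing fopen() would show which group IDs and supplementary groups the process still has, since setuid() changes only user IDs.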

The configuration file for this daemon is located at /etc/SmartHome/coverd.conf and is owned by user pvmonitor.

    ls -la /etc/SmartHome/
    total 24
    drwxrwx---+   2 pvmonitor www-data 4096 Jul 17 20:07 .
    drwxr-xr-x+ 107 root      root     4096 Jul 17 20:07 ..
    -rw-r-----+   1 pvmonitor www-data  705 Jul 17 20:07 coverd.conf

The Raspberry Pi boots from the network, and the file system is mounted from a NAS which provides ACLs. The ACLs also grant access permission to user pvmonitor:

    getfacl /etc/
    getfacl: Removing leading '/' from absolute path names
    # file: etc/
    # owner: root
    # group: root
    user::rwx
    [...]
    group::---
    group:users:rwx			#effective:r-x
    group:www-data:r-x
    mask::r-x
    other::r-x
    [...]

    getfacl /etc/SmartHome/
    getfacl: Removing leading '/' from absolute path names
    # file: etc/SmartHome/
    # owner: pvmonitor
    # group: www-data
    user::rwx
    [...]
    user:pvmonitor:rwx
    [...]
    group::---
    [...]
    group:www-data:r-x
    mask::rwx
    other::---
    [...]

    getfacl /etc/SmartHome/coverd.conf 
    getfacl: Removing leading '/' from absolute path names
    # file: etc/SmartHome/coverd.conf
    # owner: pvmonitor
    # group: www-data
    user::rw-
    [...]
    user:pvmonitor:rwx		#effective:r--
    [...]
    group::---
    [...]
    group:www-data:r-x		#effective:r--
    mask::r--
    other::---

In addition the output of stat:

    stat /etc
      File: /etc
      Size: 4096      	Blocks: 16         IO Block: 4096   directory
    Device: 0,22	Inode: 74579976    Links: 107
    Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2024-12-03 22:14:03.809660810 +0100
    Modify: 2025-07-17 20:07:13.645754180 +0200
    Change: 2025-07-17 20:07:13.645754180 +0200
     Birth: -

    stat /etc/SmartHome/
      File: /etc/SmartHome/
      Size: 4096      	Blocks: 16         IO Block: 4096   directory
    Device: 0,22	Inode: 74581572    Links: 2
    Access: (0770/drwxrwx---)  Uid: ( 1004/pvmonitor)   Gid: (  133/www-data)
    Access: 2025-07-17 20:06:03.525754180 +0200
    Modify: 2025-07-17 20:07:08.395754180 +0200
    Change: 2025-07-17 20:35:52.235754180 +0200
     Birth: -

    stat /etc/SmartHome/coverd.conf 
      File: /etc/SmartHome/coverd.conf
      Size: 705       	Blocks: 16         IO Block: 131072 regular file
    Device: 0,22	Inode: 74581810    Links: 1
    Access: (0640/-rw-r-----)  Uid: ( 1004/pvmonitor)   Gid: (  133/www-data)
    Access: 2025-07-17 20:07:08.395754180 +0200
    Modify: 2025-07-17 20:07:08.395754180 +0200
    Change: 2025-07-18 09:33:38.783696180 +0200
     Birth: -

With

    sudo -u pvmonitor less /etc/SmartHome/coverd.conf

I can read the configuration file without any problem.

But when I try to open the configuration file in my daemon process after the setuid() call, I get a "permission denied" error. Here is a minimal reproducible example based on excerpts of my daemon's code:

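    #include <cstdio>       // fopen, fgets, fclose, perror
    #include <iostream>     // std::cout, std::cerr
    #include <sys/types.h>  // uid_t
    #include <unistd.h>     // getuid, setuid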
    
    const char *ptConfigFile = "/etc/SmartHome/coverd.conf";
    
    void printConfig( void )
    {
        std::cout << "Try to open file " << ptConfigFile << std::endl;
        FILE *ptfTest;
        ptfTest = fopen( ptConfigFile, "r" );
        if (ptfTest != nullptr)
        {
          char sLine[1024];
          // Loop on fgets() itself; testing feof() first would print the last line twice.
          while (fgets(sLine, sizeof(sLine), ptfTest) != nullptr)
          {
            std::cout << sLine;
          }
          fclose( ptfTest );
        }
        else
          perror( "Failed to open file" );
    }
    int main(int argc, char **argv )
    {
      int iUid = 1004;
      std::cout << "User id is now " << getuid() << std::endl;
      printConfig();
      std::cout << "Switch to user id " << iUid << std::endl;
      if (iUid == 0 || setuid(iUid)== 0)
      {
        std::cout << "User id is now " << getuid() << std::endl;
        printConfig();
        return 0;
      }
      std::cerr << "Could not switch user id." << std::endl;
      return -1;
    }

1004 is the user id of user pvmonitor. The output of this example is:

    sudo ./test 
    User id is now 0
    Try to open file /etc/SmartHome/coverd.conf
    CERTFILE=[...]
    [...]
    Switch to user id 1004
    User id is now 1004
    Try to open file /etc/SmartHome/coverd.conf
    Failed to open file: Permission denied

In addition here is the output when I run the test program with strace:

    sudo strace ./test 
    execve("./test", ["./test"], 0x7fc90538b0 /* 13 vars */) = 0
    [...]
    setuid(1004)                            = 0
    getuid()                                = 1004
    write(1, "User id is now 1004\n", 20User id is now 1004
    )   = 20
    write(1, "Try to open file /etc/SmartHome/"..., 44Try to open file /etc/SmartHome/coverd.conf
    ) = 44
    openat(AT_FDCWD, "/etc/SmartHome/coverd.conf", O_RDONLY) = -1 EACCES (Permission denied)
    dup(2)                                  = 3
    fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
    newfstatat(3, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}, AT_EMPTY_PATH) = 0
    write(3, "Failed to open file: Permission "..., 39Failed to open file: Permission denied
    ) = 39
    close(3)                                = 0
    exit_group(0)                           = ?

What am I doing wrong?









                                

Holger (33 rep)

Jul 17, 2025, 06:37 PM • Last activity: Jul 18, 2025, 12:24 PM

-1 votes

1 answers

2182 views

How does an open(at) syscall result in a file being written to disk?

linux kernel devices system-calls vfs

I'm trying to learn as much as I can about the interplay between syscalls, the VFS, device driver handling and, ultimately, having the end device do some operation. I thought I would look at a fairly trivial example - creating a file - and try to understand the underlying process in as much detail as possible.

I created a bit of C, to do little more than open a (non-existing) file for writing, compiled this (without optimization), and took a peek at it with strace when I ran it. In particular, I wanted to focus on the openat syscall, and why and how this call was ultimately able to not only create the file object / file description, but also actually do the writing to disk (for reference, EXT4 file system, SATA HDD).

Broadly speaking, excluding some of the checks and ancillary bits and pieces, my understanding of the process is as follows (and please correct me if I'm way off!):

 - ELF is mapped into memory
 - libc is mapped into memory
 - fopen is called
 - libc does its open
 - openat syscall is called, with the O_CREAT flag among others
 - Syscall number is put into RAX register
 - Syscall args (e.g. file path, etc.) are put into RDI register (and RSI, RDX, etc. as appropriate)
 - Syscall instruction is issued, and the CPU transitions to ring 0
 - System_call code pointed to by MSR_LSTAR register invoked
 - registers pushed to kernel stack
 - Function pointer from RAX called at offset into sys_call_table
 - asmlinkage wrapper for the actual openat syscall code is invoked
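The user-space half of the steps above can be sketched by bypassing the libc fopen/open wrappers and issuing the system call through syscall(2). This is an illustration assuming Linux and glibc; `raw_openat_create` is a hypothetical name, and the flags and mode are example values:

```c
#define _GNU_SOURCE          /* makes sure syscall() is declared */
#include <fcntl.h>           /* AT_FDCWD, O_* flags */
#include <sys/syscall.h>     /* SYS_openat */
#include <unistd.h>          /* syscall() */

/* Hypothetical helper: creates (or truncates) path for writing by invoking
 * openat directly. libc places SYS_openat and the arguments into the
 * registers and executes the syscall instruction. Returns a file descriptor,
 * or -1 with errno set. */
int raw_openat_create(const char *path)
{
    return (int)syscall(SYS_openat, AT_FDCWD, path,
                        O_WRONLY | O_CREAT | O_TRUNC, 0644);
}
```

Running a program that calls this helper under strace should show a single openat line with the same arguments, which makes it easy to line up the user-space call with the kernel entry point.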

And at that point my understanding is lacking, but ultimately, I know that:

 1. The open call returns a file descriptor, which is unique to the process, and maintained globally within the kernel's file descriptor table
 2. The FD maps to a file description file object
 3. The file object is populated with, among other structures, inode structure, inode_operations, file_operations, etc.
 4. The file operations table should map generic syscalls to the respective device drivers to handle the respective calls (such that, for example, when a write syscall is called, the respective driver write call is called instead for the device on which the file resides, e.g. a SCSI driver)
 5. This mapping is based on the major/minor numbers for that file/device
 6. Somewhere along the line, code is invoked which causes instructions to be sent to the device driver for the hard drive, which get sent to the disk controller, which causes a file to be written to the hard disk, though whether this is via interrupts or DMA, or some other method of I/O, I'm not sure
 7. Ultimately, the disk controller sends a message back to the kernel to say it's done, and the kernel returns control back to user space.

I'm not too good at following the kernel source, though I've tried a little, but feel there's a lot I'm missing. My questions are as follows:

I've found some functions which return, and destroy FDs in the kernel source, but can't find where the code is which actually populates the file object / file description for the file. 

A) On an open or openat syscall, when a new file is created, how is the file structure populated? Where does the data come from? Specifically, how are the file_operations and inode_operations, etc. for that file populated? how does the kernel know, when populating that structure, that the file operations for this particular file need to be that of the SCSI driver, for instance?

B) Where in the process - and particularly with reference to the source - does the switch to the device driver happen? For example, if an ioctl or similar was called, I would expect to see some reference to the instruction to be called for the respective device, and some memory address for the data to be passed on, but I can't find where that happens.

From looking at the kernel source, all I can really find is code that assigns a new FD, but nothing that populates the file structure, nor anything that calls the respective file operations to transfer control to a device driver.

Apologies that this is a really long-winded description, but I'm really trying to learn as much as possible, and although I have a basic grasp of C, I really struggle to understand others' code.

Hoping someone with greater knowledge than I can help clarify some of these things for me, as I seem to have hit a figurative brick wall. Any advice would be greatly appreciated.



**Edit:**
Hopefully the following points will clarify what technical detail I'm after.

 - The open or openat syscalls take a file path, and flags (with the latter also being passed an FD pointing to a directory)
 - When the O_CREAT flag is also passed, the file is 'created' if it doesn't exist
 - Based on the file path, the kernel is able to identify the device type this file should be
 - The device type is identified from the major/minor numbers ordinarily - for a file that already exists, these are stored in the inode structure for the file (as member i_rdev) and the stat structure for the file (as members st_dev for the device type of the file system on which the file resides, and st_rdev for the device type of the file itself)
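The difference between st_dev and st_rdev mentioned in the last point can be observed from user space. A small sketch assuming Linux with glibc; `struct dev_info` and `get_dev_info` are hypothetical names introduced only for this illustration:

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */
#include <sys/types.h>

/* Hypothetical result type: device numbers reported by stat(2). */
struct dev_info {
    unsigned fs_major, fs_minor;     /* st_dev: device holding the file system */
    unsigned own_major, own_minor;   /* st_rdev: meaningful only for device nodes */
    int is_device;                   /* 1 for character or block special files */
};

/* Fills *out for path; returns 0 on success, -1 if stat() fails. */
int get_dev_info(const char *path, struct dev_info *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    out->fs_major  = major(st.st_dev);
    out->fs_minor  = minor(st.st_dev);
    out->own_major = major(st.st_rdev);
    out->own_minor = minor(st.st_rdev);
    out->is_device = S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode);
    return 0;
}
```

On Linux, /dev/null reports st_rdev 1:3 because it is a character device node, whereas for a regular file or directory only st_dev says anything about the underlying device.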

So really, my questions are:

 1. When a file is created with either of the open syscalls, the respective inode and stat structure must also be created and populated - how do the open syscalls do this (when all they have to go on at this point is the file path and flags)? Do they look at the inode or stat structure of the parent directory, and copy the relevant structure members from this?

 2. At which point (i.e. where in the source) does this happen?

 3. It's my understanding that when these open syscalls are invoked, it needs to know the device type, in order for the VFS to know what device driver code to invoke. On creating a new file, where the device type has yet to be set in the file object structures, what happens? What code is called?

 4. Is the sequence more like:

user process tries to open new file -> open('/tmp/foo', O_CREAT) open -> look up structure for '/tmp', get its device type -> get unused FD -> populate inode/stat structure, including setting device type to that of parent -> based on device type, map file operations / inode operations to device driver code -> call device driver code for open syscall -> send appropriate instruction to disk controller to write new file to disk -> tidy up things, do checks, etc. -> return FD to user calling process?
                                

genericuser99 (119 rep)

Aug 25, 2020, 10:19 PM • Last activity: Jul 6, 2025, 03:04 PM

7 votes

1 answers

473 views

Syscalls required by glibc calls

linux c system-calls glibc

Are there any compiled lists of the Linux system calls used per function in a standard glibc build?

For example, free() requires mmap, munmap, mprotect, prlimit64, and brk.

If necessary I can figure this out by grepping the source code or some strace wizardry, but I don't want to reinvent the wheel. I've been searching the web for about a week to no avail, mostly just turning up info on system call wrappers.

I am aware that officially there is no such certainty, but I know from practical experience that this info changes for most functions very rarely.

I asked this on Stack Overflow, but this forum seems more appropriate, since I am looking for documentation (which may or may not exist). 

Thanks

user30972097 (73 rep)

Jul 5, 2025, 11:20 PM • Last activity: Jul 6, 2025, 01:56 PM

178 votes

6 answers

663363 views

How to find application's path from command line?

terminal system-calls

For example, I have git installed on my system.
But I don't remember where I installed it, so which command can I use to find out?
                                

Anders Lind (2525 rep)

Jan 7, 2012, 07:33 PM • Last activity: Jun 17, 2025, 10:08 PM

0 votes

1 answers

2501 views

Get executable name in syscall?

linux linux-kernel system-calls

So I am writing a system call in Linux. I want to print a message that looks like

    printk(KERN_DEBUG "PROC_DEBUG [%s, %s]: %s", executable, current->pid, message);

Where executable is the name of the executable that is created when I link a source file against the library used to call the syscall. So if I run the command "cc -o sourcefile.c -L ./a -lfilename", is what I want to print for the executable. (The pid is the process ID of the process that is running the executable, and message is a parameter I pass to the system call.) I was trying to use this code to get the executable name

    struct task_struct *task = get_current();
    char task_com[TASK_COM_LEN];
    get_task_com(task_com, task);

But I am getting an error that 'TASK_COM_LEN' is undeclared, so what am I missing? Is there an easier way to get the executable name? Something like "current->executable"? I'm not finding any great matches on the web.
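For comparison, the short name the kernel keeps per task (the comm field, at most 16 bytes including the terminating NUL) can be read from user space with prctl(2). A small sketch assuming Linux; `current_task_name` and `TASK_NAME_LEN` are hypothetical names, and this is user-space code rather than something to paste into the syscall itself:

```c
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

#define TASK_NAME_LEN 16   /* size of the buffer PR_GET_NAME fills, NUL included */

/* Hypothetical helper: copies the calling thread's name into buf.
 * Returns 0 on success, -1 on failure. */
int current_task_name(char *buf, size_t len)
{
    char name[TASK_NAME_LEN] = {0};

    if (len == 0 || prctl(PR_GET_NAME, (unsigned long)name, 0UL, 0UL, 0UL) != 0)
        return -1;
    strncpy(buf, name, len - 1);
    buf[len - 1] = '\0';     /* strncpy does not always terminate */
    return 0;
}
```

The value is normally derived from the executable's file name at exec time, truncated to fit, so it may differ from the full path of the binary.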

TheIncrediblyStupidOne (1 rep)

Oct 1, 2022, 02:39 PM • Last activity: May 27, 2025, 01:08 AM

0 votes

2 answers

142 views

What is the difference between user-space and kernel-space program/application?

systemd linux-kernel system-calls

I am currently learning about kernels in operating systems, and I often come across the terms "user-space applications" and "programs", especially in the context of the kernel's System Call Interface (SCI), which provides services to these user-space entities.

Additionally, while studying the Linux boot process, I read that the kernel starts the init process (or systemd in modern distributions), which in turn starts user-space applications.

I'm a bit confused about what exactly qualifies as a user-space application or program, and likewise what qualifies as a kernel-space application or program.

lost_decimal (9 rep)

May 21, 2025, 12:56 PM • Last activity: May 21, 2025, 01:43 PM

40 votes

4 answers

8053 views

Why are Linux system call numbers in x86 and x86_64 different?

linux system-calls

I know that the system call interface is implemented on a low level and hence architecture/platform dependent, not "generic" code.

Yet, I cannot clearly see the reason why system calls in Linux 32-bit x86 kernels have numbers that are not kept the same in the similar architecture Linux 64-bit x86_64. What is the motivation/reason behind this decision?

My first guess was that an underlying reason was to keep 32-bit applications runnable on an x86_64 system, so that via a reasonable offset to the system call number the system would know whether user space is 32-bit or 64-bit. This is however not the case. At least it seems to me that read() being system call number 0 in x86_64 cannot be aligned with this thought.

Another guess was that changing the system call numbers might have a security/hardening background, something I was not able to confirm myself.

Being ignorant of the challenges of implementing the architecture-dependent code parts, I still wonder **how changing the system call numbers**, when there seems to be no need (as even a 16-bit register would store far more than the currently ~346 numbers needed to represent all calls), **would help to achieve anything, other than breaking compatibility** (though using the system calls through a library, libc, mitigates it).
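The mitigation through libc mentioned in the last sentence can be illustrated with the SYS_read constant from <sys/syscall.h>, which resolves to the number of the architecture being compiled for. A minimal sketch assuming Linux and glibc; `raw_read` and `read_syscall_number` are hypothetical helper names:

```c
#define _GNU_SOURCE          /* makes sure syscall() is declared */
#include <sys/syscall.h>     /* SYS_read: chosen per architecture at compile time */
#include <unistd.h>          /* syscall() */

/* Hypothetical helper: read(2) issued by number instead of via the wrapper.
 * Returns the byte count, or -1 with errno set. */
long raw_read(int fd, void *buf, size_t count)
{
    return syscall(SYS_read, fd, buf, count);
}

/* Hypothetical helper: the number this build uses for read. */
long read_syscall_number(void)
{
    return SYS_read;
}
```

The same source yields 0 when built for x86_64 and 3 when built for 32-bit x86, so code that sticks to the symbolic names never sees the difference.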

humanityANDpeace (15072 rep)

Jan 19, 2017, 04:12 PM • Last activity: May 19, 2025, 02:53 AM

16 votes

4 answers

4052 views

Call a Linux syscall from a scripting language

linux scripting syscalls

I want to call a Linux syscall (or at least the libc wrapper) directly from a scripting language. I don't care what scripting language - it's just important that it not be compiled (the reason basically has to do with not wanting a compiler in the dependency path, but that's neither here nor there). Are there any scripting languages (shell, Python, Ruby, etc.) that allow this?

In particular, it's the [getrandom](http://man7.org/linux/man-pages/man2/getrandom.2.html)  syscall.

joshlf (395 rep)

Mar 24, 2017, 08:58 PM • Last activity: Apr 24, 2025, 01:39 PM

2 votes

2 answers

1943 views

select(2) on FIFO on macOS

c system-calls portability

On Linux the included program returns from select and exits:

    $ gcc -Wall -Wextra select_test.c -o select_test
    $ ./select_test
    reading from read end
    closing write end
    first read returned 0
    second read returned 0
    selecting with read fd in fdset
    select returned

On OS X, the select blocks forever and the program does not exit. The Linux behavior matches my expectation and appears to conform to the following bit of the POSIX manual page for select:

> A descriptor shall be considered ready for reading when a call to an input function with O_NONBLOCK clear would not block, whether or not the function would transfer data successfully. (The function might return data, an end-of-file indication, or an error other than one indicating that it is blocked, and in each of these cases the descriptor shall be considered ready for reading.)

Since read(2) on the read end of the fifo will always return EOF, my reading says that it should always be considered ready by select.
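The reading of POSIX above can be probed in isolation. The following is a reduced sketch that uses an anonymous pipe rather than a FIFO and adds a one-second timeout so it cannot hang; `select_after_eof` is a hypothetical helper name, and results on a FIFO or on macOS may differ from what an anonymous pipe shows:

```c
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: returns select()'s result for the read end of a pipe
 * whose write end is closed and whose EOF has already been read once.
 * Returns -1 if the setup fails. */
int select_after_eof(void)
{
    int fds[2];
    char c;
    fd_set readfds;
    struct timeval timeout = { .tv_sec = 1, .tv_usec = 0 };  /* never block forever */

    if (pipe(fds) != 0)
        return -1;
    close(fds[1]);                       /* no writers remain */
    if (read(fds[0], &c, 1) != 0) {      /* expect end-of-file, i.e. 0 */
        close(fds[0]);
        return -1;
    }

    FD_ZERO(&readfds);
    FD_SET(fds[0], &readfds);
    int ready = select(fds[0] + 1, &readfds, NULL, NULL, &timeout);
    close(fds[0]);
    return ready;                        /* 1: reported ready, 0: timed out */
}
```

A return value of 1 means the descriptor was reported ready even after EOF had been consumed; 0 means select gave up after the timeout, which is the case that corresponds to the hang described above.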

Is macOS's behavior here well-known or expected?  Is there something else in this example that leads to the behavior difference?

A further note is that if I remove the read calls then macOS's select returns. This and some other experiments seem to indicate that once an EOF has been read from the file, it will no longer be marked as ready if select is called on it later.

## Example Program

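    #include <fcntl.h>       /* open, O_RDONLY, O_WRONLY */
    #include <stdio.h>       /* printf, perror */
    #include <stdlib.h>      /* exit */
    #include <sys/select.h>  /* select, fd_set */
    #include <sys/stat.h>    /* mkfifo, S_IRWXU */
    #include <sys/types.h>   /* pid_t */
    #include <unistd.h>      /* fork, read, close, unlink */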
    
    #define FILENAME "select_test_tmp.fifo"
    
    int main() 
	{
		pid_t pid;
		int r_fd, w_fd;
		unsigned char buffer[10];
		fd_set readfds;

		mkfifo(FILENAME, S_IRWXU);

		pid = fork();
		if (pid == -1) 
		{
			perror("fork");
			exit(1);
		}

		if (pid == 0) 
		{
			w_fd = open(FILENAME, O_WRONLY);
			
			if (w_fd == -1) 
			{
				perror("open");
				exit(1);
			}
			
			printf("closing write end\n");
			close(w_fd);
			exit(0);
		}

		r_fd = open(FILENAME, O_RDONLY);
		if (r_fd == -1) 
		{
			perror("open");
			exit(1);
		}

		printf("reading from read end\n");
		
		if (read(r_fd, &buffer, 10) == 0) 
		{
			printf("first read returned 0\n");
		} 
		else 
		{
			printf("first read returned non-zero\n");
		}

		if (read(r_fd, &buffer, 10) == 0) 
		{
			printf("second read returned 0\n");
		} 
		else 
		{
			printf("second read returned non-zero\n");
		}

		FD_ZERO(&readfds);
		FD_SET(r_fd, &readfds);

		printf("selecting with read fd in fdset\n");
		if (select(r_fd + 1, &readfds, NULL, NULL, NULL) == -1) 
		{
			perror("select");
			exit(1);
		}
		
		printf("select returned\n");
		unlink(FILENAME);
		exit(0);
    }
                                

Steven D (47418 rep)

Aug 2, 2017, 12:48 AM • Last activity: Apr 19, 2025, 02:01 AM

35 votes

7 answers

84230 views

Where do you find the syscall table for Linux?

syscalls

I see a lot of people online referencing

    arch/x86/entry/syscalls/syscall_64.tbl

for the syscall table, that works fine. But a lot of others reference 

    /include/uapi/asm-generic/unistd.h

which is commonly found in the headers package. How come syscall_64.tbl shows,

    0 common  read      sys_read

The right answer, and unistd.h shows,

    #define __NR_io_setup 0
    __SC_COMP(__NR_io_setup, sys_io_setup, compat_sys_io_setup)

And then it shows __NR_read as 

    #define __NR_read 63
    __SYSCALL(__NR_read, sys_read)

Why is that 63, and not 1? How do I make sense of /include/uapi/asm-generic/unistd.h? Still, in /usr/include/asm/ there is

    /usr/include/asm/unistd_x32.h
    #define __NR_read (__X32_SYSCALL_BIT + 0)
    #define __NR_write (__X32_SYSCALL_BIT + 1)
    #define __NR_open (__X32_SYSCALL_BIT + 2)
    #define __NR_close (__X32_SYSCALL_BIT + 3)
    #define __NR_stat (__X32_SYSCALL_BIT + 4)
    
    /usr/include/asm/unistd_64.h
    #define __NR_read 0
    #define __NR_write 1
    #define __NR_open 2
    #define __NR_close 3
    #define __NR_stat 4
    
    /usr/include/asm/unistd_32.h
    #define __NR_restart_syscall 0
    #define __NR_exit 1           
    #define __NR_fork 2           
    #define __NR_read 3           
    #define __NR_write 4          

Could someone tell me the difference between these unistd files, explain how unistd.h works, and tell me the best method for finding the syscall table?
                                

Evan Carroll (34663 rep)

Feb 4, 2018, 04:40 AM • Last activity: Mar 27, 2025, 04:03 AM

2 votes

0 answers

41 views

Wrong attributes bitmask in READDIR requests on NFSv4.1

kernel nfs c system-calls nfsv4

I'm struggling with the following problem.

I have an NFS v4.1 mount, where I have a directory with a couple of thousands files. I'm trying to list their names and types. Even with a minimal example program taken from the getdents man page, I see a strange behaviour of the NFS client:

- first few READDIR RPCs have an attribute bitmask set to what you would expect, i.e. to all the attributes that the server supports
- after the second call to getdents (or readdir, doesn't matter), the NFS READDIR RPCs change - attributes bitmask is set to just RDAttr_Error and FileId

Can someone explain what could cause the change and why ls seems not to cause it?
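For reference, a condensed sketch in the spirit of the getdents man page example, assuming Linux. The record layout is declared locally, as the man page suggests, and `count_entries` is a hypothetical helper that walks a directory with repeated getdents64 calls:

```c
#define _GNU_SOURCE          /* makes sure syscall() is declared */
#include <dirent.h>          /* DT_REG */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout returned by getdents64, declared here by hand. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: counts entries in dir, skipping "." and "..".
 * Stores the number of entries typed DT_REG in *n_regular.
 * Returns the total, or -1 on error. */
int count_entries(const char *dir, int *n_regular)
{
    _Alignas(uint64_t) char buf[4096];
    int total = 0;
    int fd = open(dir, O_RDONLY | O_DIRECTORY);

    if (fd < 0)
        return -1;
    *n_regular = 0;
    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof buf);
        if (nread < 0) {
            close(fd);
            return -1;
        }
        if (nread == 0)                  /* end of directory */
            break;
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            if (strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0) {
                total++;
                if (d->d_type == DT_REG)
                    (*n_regular)++;
            }
            pos += d->d_reclen;          /* advance to the next record */
        }
    }
    close(fd);
    return total;
}
```

With a couple of thousand files the outer loop runs many times, so running it against the NFS mount while capturing traffic would show which of the getdents64 calls coincides with the change in the READDIR requests.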

dmk (21 rep)

Feb 17, 2025, 05:53 PM • Last activity: Feb 17, 2025, 05:55 PM

0 votes

0 answers

35 views

How to trace recvfrom and sendto syscall each time apache2/httpd handle incoming http request?

linux-kernel apache-httpd system-calls strace

So, I decided to start learning about system calls with strace and want to observe network-related system calls on apache2 processes. Here's how I attach it:

    pidof -s apache2
    pstree -sTp 
    strace -f -e trace=%network -p

While observing, I notice that strace prints some syscalls. However, I can't find the recvfrom or sendto syscall whose file descriptor corresponds to the accept syscall that contains the IP address of the client (my browser) when I make an HTTP request. My assumption is that when a request is handled by Apache, it spawns a new child process. Since I attached strace to the parent apache2 process, why does strace not follow its children even though I specified the -f option?

ReYuki (33 rep)

Feb 6, 2025, 07:24 AM

0 votes

1 answers

176 views

How to better understand and reverse-engineer system calls within processes given a specific example

system-calls strace drm

I am very new to Linux and would appreciate any pointers on understanding system calls and on the knowledge and tools needed to reverse-engineer their origin or their process flow.

As the title suggests, I present an example: my analysis of an Xorg process that I traced in my Linux desktop environment. I am attempting to understand the process flow of DRM_IOCTL calls, in this case a specific DRM_IOCTL_CURSOR2 system call that takes place within the process. My goal is to understand what triggers this call within this desktop environment, or rather what steps I can take in general to investigate questions like this.

From my limited understanding, I am aware that Xorg is spawned as a subprocess of SDDM, but beyond the start of the Xorg server I am at a loss as to how to walk through the process and identify triggers for certain calls, or which tools to use to do so. This is therefore a conceptual question on how to approach analyses like this in general. Would this require specific knowledge of the process at hand and its architecture? Are there any general approaches I can take on my system to trace system calls, much like deducing the PPIDs of processes, for my own interest?

So far I have vague familiarity with tools like strace, bpftrace and general command-line tools like ps & lsof. Apologies if this is a broad question; if so, I will be happy to narrow it further.

N S (1 rep)

Dec 28, 2024, 02:32 PM • Last activity: Dec 28, 2024, 05:28 PM

0 votes

0 answers

29 views

BPF program attached to `getname` won't get called when calling the `renameat2` syscall

linux-kernel system-calls ebpf

I'm fiddling with a BPF program that needs to attach to the two "getname" functions that are being called from the renameat2 syscall, defined in linux/fs/namei.c as:

    SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
                    int, newdfd, const char __user *, newname, unsigned int, flags)
    {
            return do_renameat2(olddfd, getname(oldname), newdfd, getname(newname),
                                flags);
    }

getname calls getname_flags, which in turn calls strncpy_from_user. I need to access the char __user * name parameter, so I tried creating kprobes, fentries and fexits (with a simple "print" program) to intercept all three of those functions. With getname*, I get a lot of output, meaning that my BPF programs are actually being run. However, when calling "renameat2" (e.g. when using the Linux mv command), I get no output at all. This is, in essence, the program I'm currently using, which doesn't get called when using the mv command:

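    SEC("fentry/getname_flags")
    int BPF_PROG(hijack_getname, char *filename) {
      // The low 32 bits of the helper's return value hold the UID.
      uid_t uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
      if (uid == 1002) { // hardcoded uid
        bpf_printk(" [%s]", filename);
      }
      return 0; // the program is declared int, so return explicitly
    }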

If I create a BPF tracepoint program that attaches to the entry and exit of renameat2, I can clearly see that there's no "getname" call between entry and exit. As I said, I also tried with kprobe and fexit. I can't manage to attach to strncpy_from_user without getting some weird errors about "Os: 22 - invalid argument". I really can't figure out what's happening, so any help would be appreciated :,) (P.S. I also posted this on Stack Overflow.)

Dennis Orlando (81 rep)

Dec 4, 2024, 05:18 PM

3 votes

1 answers

466 views

Why doesn't Linux support mmap by path?

linux-kernel system-calls mmap


The mmap syscall needs an fd as a parameter, but when you close that fd, the mapping stays alive in the process's address space.

So keeping a mapping doesn't need an open fd. Why, then, does Linux only support mapping a file through an fd, and not through a file path? Wouldn't it be nice if we could have an mmapat syscall, just like openat and execveat?

If mmap creates an extra reference to the file, why can't we have an mmapat that atomically creates such a reference in the first place, without taking an fd in the process and releasing it later?

Is there any historical or security reason for not having such a syscall in the Linux kernel?
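
For illustration, the open/map/close sequence that a hypothetical mmapat would collapse into one call can be sketched from userspace. This is a minimal Python sketch, not an existing API: `map_by_path` and `unmap` are made-up names, and it assumes a 64-bit Linux libc where off_t fits in a C long. It calls libc's mmap through ctypes because Python's own mmap module keeps a duplicate of the descriptor, which would hide the point being made.

```python
import ctypes
import mmap
import os
import tempfile

# Assumption: 64-bit Linux libc, so off_t can be passed as a C long.
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value


def map_by_path(path, length):
    """Hypothetical mmapat-style helper: open, map read-only, close the fd."""
    fd = os.open(path, os.O_RDONLY)
    try:
        addr = libc.mmap(None, length, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr is None or addr == MAP_FAILED:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file
    return addr


def unmap(addr, length):
    """Release a mapping created by map_by_path."""
    if libc.munmap(addr, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello mmap")
    address = map_by_path(f.name, 10)
    print(ctypes.string_at(address, 10))  # read after the fd is closed
    unmap(address, 10)
    os.unlink(f.name)
```

The descriptor only exists between `os.open` and `os.close`, so the wrapper is not atomic in the way the question asks for; it only shows that the fd is needed to create the mapping, not to keep it.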

炸鱼薯条德里克 (1435 rep)

Feb 2, 2019, 01:32 AM • Last activity: Oct 17, 2024, 08:51 AM

7 votes

3 answers

7174 views

Filter out failed syscalls from strace log

system-calls strace


I can run strace on a command like sleep 1
and see what files it's accessing like this:

    strace -e trace=file -o strace.log sleep 1

However, on my machine, many of the calls have a return value of -1
indicating that the file does not exist. For example:

    $ grep '= -1 ENOENT' strace.log | head
    access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
    access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
    access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/en_US.UTF-8/LC_IDENTIFICATION", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/en_US.UTF-8/LC_MEASUREMENT", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/en_US.UTF-8/LC_TELEPHONE", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/en_US.UTF-8/LC_ADDRESS", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/en_US.UTF-8/LC_NAME", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    open("/usr/lib/locale/en_US.UTF-8/LC_PAPER", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

I'm not really interested in the files that don't exist,
I want to know what files the process actually found and read from.
Aside from grep -v '= -1 ENOENT',
how can I reliably filter out failed calls?
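
Where the strace at hand predates the working -z flag described in the addendum, a post-processing filter is one option. The following is a minimal Python sketch, assuming the single-line log format shown above (`call(args) = -1 ERRNO (message)`); `successful_calls` is a made-up helper name, and the sketch does not handle calls split across `<unfinished ...>` and `resumed` lines.

```python
import re
import sys

# Matches the tail of a failed call: ") = -1 ERRNO (message)".
FAILED = re.compile(r"\)\s+= -1 E[A-Z0-9]+ \(.*\)\s*$")


def successful_calls(lines):
    """Yield only the log lines that do not end in '= -1 ERRNO (...)'."""
    for line in lines:
        if not FAILED.search(line):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Usage: python3 filter_strace.py < strace.log
    for kept in successful_calls(sys.stdin):
        print(kept)
```

Unlike the grep above, this drops every errno, not just ENOENT.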

# Addendum #

I was surprised to learn
that strace has had this feature in the works since 2002
in the form of the -z flag, which is an alias for -e status=successful,
fully functional [since version 5.2](https://github.com/strace/strace/commit/e45a594cb08394c96f71105db9bacf08aa4c734d)
([2019-07-12](https://github.com/strace/strace/releases/tag/v5.2)),
also available as --successful-only [since version 5.6](https://github.com/strace/strace/commit/092724f8041cdfb64dcaf68a2d8ba877b509ea83) ([2020-04-07](https://github.com/strace/strace/releases/tag/v5.6)).

Also available since version 5.2 is the complement of -z, the -Z flag,
which is an alias for -e status=failed,
available as --failed-only since version 5.6.

The -z flag was [first added in a commit from 2002](https://github.com/strace/strace/commit/17f8fb3484e94976882f65b7a3aaffc6f24cd75d) and released in version 4.5.18 ([2008-08-28](https://github.com/strace/strace/releases/tag/v4.5.18)),
but it had never been [documented](https://github.com/strace/strace/commit/de6e53308ca58da7d357f8114afc74fff7a18043) because it was not working properly.

Relevant links:

- only seeing successful system calls

  Sat Nov 2 23:07:23 UTC 2002 

  > When using strace I  sometimes like to see the system calls
which work (instead of all the system calls).
  >
  > I've been porting this patch for years, it seems very useful.
  >
  > With the -z option, you don't see opens on files which aren't there
(very useful tracking down what a program actually does, instead of
trying to do).

  https://lists.strace.io/pipermail/strace-devel/2002-November/000232.html 

- strace: -z option doesn't work properly

  Date: Sun, 12 Jan 2003 09:33:01 UTC

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=176376 

- tracing only failing syscalls

  Created: 2004-03-19 
 
  https://sourceforge.net/p/strace/feature-requests/3/ 

- [strace-4.15] Proposal: Output Staging for -z Option (print successful syscalls only) / Patch included

  Tue Jan 17 09:35:54 UTC 2017 

  https://lists.strace.io/pipermail/strace-devel/2017-January/005941.html 

- [PATCH v1] Implemented output staging for failed/successful syscalls

  Wed Jan 18 16:01:20 UTC 2017 

  https://lists.strace.io/pipermail/strace-devel/2017-January/005950.html 

- Fix -z option

  Feb 28, 2018

  https://github.com/strace/strace/issues/49 

- [PATCH 0/3] Stage output for -z and new -Z options

  Mon Apr 1 21:13:02 UTC 2019 

  https://lists.strace.io/pipermail/strace-devel/2019-April/008706.html 

- strace -z flag

  Mon Jun 10 05:29:19 UTC 2019 

  https://lists.strace.io/pipermail/strace-devel/2019-June/008808.html 
                                

Nathaniel M. Beaver (1398 rep)

Apr 6, 2018, 08:26 PM • Last activity: Sep 13, 2024, 04:18 PM

1 votes

0 answers

24 views

Retrieving the process descriptor during syscall

linux kernel process system-calls stack


In Linux, there is a per-process kernel stack that stores at the bottom of it (or top if the stack grows upwards) a small struct named thread_info, which in turn points to the task_struct of the related process. This way it is easy to retrieve the pointer to the process's descriptor when handling a syscall in kernel-mode.

But how does the kernel even switch to this per-process stack? At which step during the context switch does the kernel check or store data about the underlying process?

Can someone please provide a good explanation of the steps involved in such a user-space -> syscall -> kernel-space context switch?

A lot of sources online try to explain how context switching works, but most of them describe general concepts rather than giving a detailed explanation of the procedure.

Idan Rosenzweig (11 rep)

Sep 4, 2024, 06:07 PM

1 votes

2 answers

420 views

Is systemd the first process that runs in user mode in linux?

systemd linux-kernel system-calls


I know that switching from user mode to kernel mode occurs continuously via system calls. My question is whether systemd is the exact point during the startup of a Linux system where the first fundamental transition from kernel mode to user mode occurs. Is this the point where the operating system says: "From now on (starting from systemd and all its direct or indirect children), every process except me that wants to run in kernel mode must do so via system calls"?

Kode1000 (11 rep)

Aug 19, 2024, 07:22 AM • Last activity: Aug 19, 2024, 09:56 AM

378 votes

8 answers

54084 views

How can I find the implementations of Linux kernel system calls?

linux-kernel source system-calls


I am trying to understand how a function, say mkdir, works by looking at the kernel source. This is an attempt to understand the kernel internals and navigate between various functions. I know mkdir is defined in sys/stat.h. I found the prototype:

    /* Create a new directory named PATH, with permission bits MODE.  */
    extern int mkdir (__const char *__path, __mode_t __mode)
         __THROW __nonnull ((1));

Now I need to see in which C file this function is implemented. From the source directory, I tried

    ack "int mkdir"

which displayed

    security/inode.c
    103:static int mkdir(struct inode *dir, struct dentry *dentry, int mode)
    
    tools/perf/util/util.c
    4:int mkdir_p(char *path, mode_t mode)
    
    tools/perf/util/util.h
    259:int mkdir_p(char *path, mode_t mode);

But none of them matches the definition in sys/stat.h.

*Questions*

 1. Which file has the mkdir implementation?
 2. With a function definition like the above, how can I find out which file has the implementation? Is there any pattern which the kernel follows in defining and implementing methods?

NOTE: I am using kernel 2.6.36-rc1.
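
One pattern worth knowing here: the prototype in sys/stat.h belongs to the C library wrapper, while the kernel-side entry points are written with the SYSCALL_DEFINEn macros, so searching for "int mkdir" cannot find them. The following is a minimal Python sketch of a search for that macro; `find_syscall_definitions` is a made-up helper name, and the regular expression assumes the macro name and the syscall name sit on the same line.

```python
import os
import re


def find_syscall_definitions(source_root, name):
    """Yield (path, line_number, line) for SYSCALL_DEFINEn(name, ...) in .c files."""
    pattern = re.compile(r"\bSYSCALL_DEFINE\d\(\s*" + re.escape(name) + r"\b")
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for filename in filenames:
            if not filename.endswith(".c"):
                continue
            path = os.path.join(dirpath, filename)
            # Kernel sources are not guaranteed to be valid UTF-8.
            with open(path, errors="replace") as f:
                for number, line in enumerate(f, start=1):
                    if pattern.search(line):
                        yield path, number, line.rstrip()


if __name__ == "__main__":
    # Run from the top of a kernel source tree.
    for path, number, line in find_syscall_definitions(".", "mkdir"):
        print(f"{path}:{number}: {line}")
```

Run from the top of a kernel tree, this would be expected to point at the mkdir definition under fs/; the same search with ack or grep on `SYSCALL_DEFINE.\(mkdir` should do equally well.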
                                

Navaneeth K N (3998 rep)

Aug 19, 2010, 02:26 PM • Last activity: Aug 16, 2024, 01:13 PM

4 votes

1 answers

839 views

How to get the current cgroup ID from C/C++?

system-calls cgroups ebpf


The [eBPF helper functions](https://man7.org/linux/man-pages/man7/bpf-helpers.7.html) define bpf_get_current_cgroup_id for eBPF programs, which does the obvious thing:

    u64 bpf_get_current_cgroup_id(void)

            Return A 64-bit integer containing the current cgroup id
                   based on the cgroup within which the current task
                   is running.

However, I can't find an equivalent system call (something similar to [getpid](https://man7.org/linux/man-pages/man2/getppid.2.html)) that I can use in a regular old C program. Am I just completely missing the relevant function? Or does userspace need to do something different to get the cgroup ID for the current task?
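
One userspace route, sketched here in Python rather than C for brevity, is to look up the task's cgroup path and stat the matching cgroupfs directory. This rests on two assumptions: that cgroup v2 is mounted at /sys/fs/cgroup, and that the directory's inode number matches the ID the BPF helper reports. `cgroup_v2_path` and `current_cgroup_id` are made-up names for this illustration.

```python
import os


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup contents, or None."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        # The unified (v2) hierarchy is the entry of the form "0::<path>".
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def current_cgroup_id(mount_point="/sys/fs/cgroup"):
    """Best-effort cgroup ID of the calling process (cgroup v2 only)."""
    with open("/proc/self/cgroup") as f:
        path = cgroup_v2_path(f.read())
    if path is None:
        raise RuntimeError("no cgroup v2 entry in /proc/self/cgroup")
    # Assumption: the inode number of the cgroup directory is the cgroup ID.
    return os.stat(os.path.join(mount_point, path.lstrip("/"))).st_ino


if __name__ == "__main__":
    print(current_cgroup_id())
```

The same idea in C would be a read of /proc/self/cgroup followed by a stat() call on that directory.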

user547386 (41 rep)

Oct 31, 2022, 08:15 PM • Last activity: Aug 16, 2024, 09:24 AM
