
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

2 votes
2 answers
101 views
Update mmap mapping to be readonly without overwriting existing data
I'm making a custom ELF loader to learn how the dynamic loader works behind the scenes, and one of the program headers often found in them is `PT_GNU_RELRO`, which tells the loader to make that segment read-only after performing relocations. However, it doesn't look like there's a good way to update...
I'm making a custom ELF loader to learn how the dynamic loader works behind the scenes, and one of the program headers often found in them is PT_GNU_RELRO, which tells the loader to make that segment read-only after performing relocations. However, it doesn't look like there's a good way to update existing memory mappings' protections without replacing the entire thing. MAP_UNINITIALIZED seems to be what I'm looking for, but mmap(2) states that it doesn't work on most systems for security reasons: "MAP_UNINITIALIZED (since Linux 2.6.33): Don't clear anonymous pages. This flag is intended to improve performance on embedded devices. This flag is honored only if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory)." Which is reasonable for loosening permissions, but I'm looking to restrict them. Is there a way, as a user process, to update a mmap mapping to be read-only without replacing existing data at that address?
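A minimal sketch of the in-place route, assuming the RELRO segment's page-aligned start address and length are already known from the program headers: mprotect(2) changes the protection bits of an existing mapping without discarding its contents, which is also how system loaders typically handle PT_GNU_RELRO.
#include <sys/mman.h>
#include <stdio.h>

/* sketch: flip an already-populated, page-aligned segment to read-only in place */
static int protect_relro(void *seg_start, size_t seg_len)
{
    /* mprotect only changes the permissions of the existing mapping; the data stays put */
    if (mprotect(seg_start, seg_len, PROT_READ) != 0) {
        perror("mprotect");
        return -1;
    }
    return 0;
}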
Electro_593 (23 rep)
Jun 21, 2025, 02:49 AM • Last activity: Jun 21, 2025, 08:43 AM
4 votes
1 answers
1130 views
Process memory layout - difference between heap, data and mmap areas
I see in the web many conflicting or unclear descriptions of the memory layout of a Linux process. Usually the [common diagram]( https://stackoverflow.com/q/64038876/8529284) looks like: [![enter image description here][1]][1] And a common [description](https://www.quora.com/Is-the-data-segment-is-p...
I see on the web many conflicting or unclear descriptions of the memory layout of a Linux process. Usually the [common diagram]( https://stackoverflow.com/q/64038876/8529284) looks like the one shown in that question. And a common [description](https://www.quora.com/Is-the-data-segment-is-part-of-the-heap-or-the-heap-is-part-of-it/answer/Sudarshan-43?ch=15&oid=30002660&share=af08bbcb&srid=2KkSm&target_type=answer) would say that: "The data segment contains only global or static variables which have a predefined value and can be modified. Heap contains the dynamically allocated data that is stored in a memory section we refer to as the heap section and this section typically starts where the data segment ends." And [also](https://stackoverflow.com/a/14954147/8529284) : "The heap is, generally speaking, one specific memory region created by the C runtime, and managed by malloc (which in turn uses the brk and sbrk system calls to grow and shrink). mmap is a way of creating new memory regions, independently of malloc (and so independently of the heap). munmap is simply its inverse, it releases these regions." Many of those explanations seem outdated, and I find many discrepancies. For instance, many articles - like the answer above - claim that the heap is used by malloc, but malloc is actually a library call that uses either sbrk or mmap, as the malloc [man page](https://man7.org/linux/man-pages/man3/malloc.3.html) says: "Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2)." So if malloc is in many cases implemented with mmap, what's the difference between the heap and the mmap area? Another thing that seems like a contradiction is that many articles (as the malloc man page itself) claim that brk/sbrk adjust the size of the heap, but their [man page](https://man7.org/linux/man-pages/man2/brk.2.html) says they actually adjust the size of the **data segment**: "brk() and sbrk() change the location of the **program break**, which defines the end of the process's data segment (i.e., the program break is the first location after the end of the uninitialized data segment)." So I'm trying to get a clear, up-to-date overall explanation of the memory layout of processes nowadays with the different segments, one that also addresses these questions: 1. What is the difference between the heap and the mmap areas? (From some tests I was attempting, by looking at the addresses I got from mmap and comparing them to the range of the heap in /proc/self/maps, it seems that some mmap-allocated pages are actually allocated inside the heap segment.) 2. Does the **break** signify the end of the **data segment**, or the end of the **heap**? Other related questions: * [how brk pointer grow after calling malloc](https://unix.stackexchange.com/q/610939/273579) * [When is the heap used for dynamic memory allocation?](https://unix.stackexchange.com/q/411408/273579)
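One way to probe the distinction empirically, as a rough sketch (glibc-specific behaviour; the threshold is assumed to be the default MMAP_THRESHOLD of 128 KiB): compare where small and large malloc results land relative to the program break, and to the [heap] line in /proc/self/maps.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *brk_before = sbrk(0);           /* current program break */
    void *small = malloc(100);            /* typically served from the brk-grown heap */
    void *large = malloc(1024 * 1024);    /* above MMAP_THRESHOLD: typically a separate mmap region */
    void *brk_after = sbrk(0);

    printf("break: %p -> %p\n", brk_before, brk_after);
    printf("small: %p (usually inside the [heap] range)\n", small);
    printf("large: %p (usually far away, in the mmap area)\n", large);
    /* compare the printed addresses with the [heap] line in /proc/self/maps */
    free(small);
    free(large);
    return 0;
}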
aviro (6925 rep)
Feb 13, 2024, 01:07 PM • Last activity: Jun 9, 2025, 06:03 AM
7 votes
2 answers
4146 views
mmap: effect of other processes writing to a file previously mapped read-only
I am trying to understand what happens when a file, which has been mapped into memory by the `mmap` system call, is subsequently written to by other processes. I have `mmap`ed memory with `PROT_READ` protection in "process A". If I close the underlying file descriptor in process A, and another proce...
I am trying to understand what happens when a file, which has been mapped into memory by the mmap system call, is subsequently written to by other processes. I have mmaped memory with PROT_READ protection in "process A". If I close the underlying file descriptor in process A, and another process later writes to that file (not using mmap; just a simple redirection of stdout to the file using > in the shell), is the mmaped memory in the address space of process A affected? Given that the pages are read-only, I would expect them not to change. However, process A is being terminated by SIGBUS signals as a result of invalid memory accesses (Non-existent physical address at address 0x[...]) when trying to parse the mapped memory. I am suspecting that this is stemming from writes to the backing file by other processes. Would setting MAP_PRIVATE be sufficient to completely protect this memory from other processes?
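A small sketch that reproduces the suspected mechanism (file path is made up, error handling omitted): shrinking the backing file removes the backing store for pages that were already mapped, and a later access to such a page raises SIGBUS regardless of PROT_READ, because the fault is about the missing backing page, not the protection bits.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/mmap-sigbus-demo";   /* hypothetical file */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, 8192);                          /* two pages of backing store */

    char *p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                                    /* the mapping keeps the file alive */

    truncate(path, 0);                            /* roughly what '>' in the shell does first */

    printf("%d\n", p[4096]);                      /* page has no backing store any more -> SIGBUS */
    munmap(p, 8192);
    return 0;
}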
user001 (3808 rep)
May 19, 2019, 10:12 PM • Last activity: Apr 22, 2025, 01:04 PM
0 votes
1 answers
100 views
Writable and executable memory regions
I wrote a simple Python script to scan `/proc/{pid}/maps` for regions that are writable and executable on my computer. It came up with a few hits surprisingly, all private anonymous. Wondering why a program would ever need writable executable region these days? What are these being used for? ``` /pr...
I wrote a simple Python script to scan /proc/{pid}/maps for regions that are writable and executable on my computer. Surprisingly, it came up with a few hits, all private anonymous. Why would a program ever need a writable and executable region these days? What are these being used for?
/proc/1286/maps
	 ['/usr/lib/xorg/Xorg\x00:0\x00-seat\x00seat0\x00-auth\x00/var/run/lightdm/root/:0\x00-nolisten\x00tcp\x00vt7\x00-novtswitch\x00']
	 7f5860c03000-7f5860c04000 rwxp 00000000 00:00 0
/proc/2659/maps
	 ['xfwm4\x00--display\x00:0.0\x00--sm-client-id\x002c1781f72-47a5-494a-a3e7-32424563\x00']
	 7ffb7d804000-7ffb7d805000 rwxp 00000000 00:00 0
/proc/404436/maps
	 ['xfce4-terminal\x00--geometry=180x56-0-0\x00']
	 7f44aa15a000-7f44aa18a000 rwxp 00000000 00:00 0
/proc/404436/maps
	 ['xfce4-terminal\x00--geometry=180x56-0-0\x00']
	 7f44aa19b000-7f44aa1fb000 rwxp 00000000 00:00 0
/proc/404436/maps
	 ['xfce4-terminal\x00--geometry=180x56-0-0\x00']
	 7f44aaa5c000-7f44aaa7c000 rwxp 00000000 00:00 0
/proc/404436/maps
	 ['xfce4-terminal\x00--geometry=180x56-0-0\x00']
	 7f44aabba000-7f44aabca000 rwxp 00000000 00:00 0
/proc/404436/maps
	 ['xfce4-terminal\x00--geometry=180x56-0-0\x00']
	 7f44ac736000-7f44ac766000 rwxp 00000000 00:00 0
/proc/407109/maps
	 ['/usr/lib/firefox-esr/firefox-esr\x00-contentproc\x00-childID\x001\x00-isForBrowser\x00-prefsLen\x0037585\x00-prefMapSize\x00265304...']
	 10737c04c000-10737c05c000 rwxp 00000000 00:00 0
Script:
#!/usr/bin/env python3
import sys
import os
import re
import glob
from os.path import dirname, join

def main():
    map_files = list(filter(lambda f: re.match(r'^\d+$', f.split('/')[2]), glob.glob('/proc/*/maps')))  # keep only numeric PIDs
    for map_file in map_files:
        with open(map_file, 'r') as map_f:
            for line in map_f.readlines():  # for each mapped region
                [start, end, perms, offset, dev, inode, pathname] = parse_maps_line(line)
                if 'x' in perms and 'w' in perms:
                    print(map_file)
                    with open(join(dirname(map_file), 'cmdline'), 'r') as cmd_f:
                        print('\t', cmd_f.readlines())
                    print('\t', line.strip())



def parse_maps_line(line):
    ''' The format of the file is:
    address           perms offset  dev   inode       pathname
    00400000-00452000 r-xp 00000000 08:02 173521      /usr/bin/dbus-daemon
    '''
    [address, perms, offset, dev, inode, pathname] = re.split(r'\s+', line, 5)
    [start, end] = address.split('-')
    return [int(start, 16), int(end, 16), perms, int(offset, 16), dev, inode, pathname]


if __name__ == "__main__":
    main()
**UPDATE:** ChatGPT gave a pretty good answer: while generally avoided and discouraged, a region may be writable and executable to support: 1. JIT. 2. Self-modifying code. 3. Dynamically loaded code. I'm still interested in understanding specifically why all these processes - Xorg, xfwm4, xfce4-terminal and firefox-esr - would need writable and executable regions.
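For context, the JIT case typically looks like the sketch below (x86-64 only; hardened systems may refuse the PROT_WRITE|PROT_EXEC combination): code bytes are emitted into a private anonymous mapping and then executed. Engines that never drop the write bit are exactly what produces rwxp entries like the ones above; a stricter JIT flips the region to r-x before running it.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64 machine code for: mov eax, 42; ret */
    static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    /* JIT-style buffer: writable and executable at the same time, like the rwxp regions above */
    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, sizeof(code));

    /* a stricter JIT would drop the write bit before executing: */
    /* mprotect(buf, 4096, PROT_READ | PROT_EXEC); */

    int (*fn)(void) = (int (*)(void))buf;
    printf("jitted function returned %d\n", fn());

    munmap(buf, 4096);
    return 0;
}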
spinkus (500 rep)
Jan 16, 2025, 01:37 PM • Last activity: Mar 29, 2025, 10:05 PM
5 votes
1 answers
1291 views
Linux HugeTLB: What is the advantage of the filesystem approach?
Moved Post Notice -------------------- I just moved this question (with slight modifications) from a StackOverflow question (which I have deleted, since cross-posting is strongly discouraged), which has not been answered over there and might be better suited here. There were two comments (but no ans...
Moved Post Notice -------------------- I just moved this question (with slight modifications) from a StackOverflow question (which I have deleted, since cross-posting is strongly discouraged), which has not been answered over there and might be better suited here. There were two comments (but no answers) made at the StackOverflow question. This is a short summary of those (note that you might need to read the actual question to understand this): * The filesystem approach enables you to use libhugetlbfs which can do all sorts of things. * That does not really convince me - if I as an application programmer can allocate huge pages without going via the filesystem, so could libhugetlbfs, right? * Going via the filesystem allows you to set permissions on who can allocate huge pages. * Sure, but it's not required to go via the filesystem. If anyone can do mmap(…, MAP_HUGETLB, …), anyone who is denied access on a filesystem level can still exhaust all huge pages by going the mmap way. Actual Question =============== I am currently exploring the various ways of allocating memory in huge pages under Linux. I somehow can not wrap my head around the concept of the HugeTLB 'filesystem'. Note that I'm not talking about transparent huge pages - those are a whole different beast. ## The Conventional Way The conventional wisdom (as e.g. presented in [the Debian Wiki](https://wiki.debian.org/Hugepages#Enabling_HugeTlbPage) or [the Kernel docs](https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetlbpage.html#using-huge-pages)) seems to be: - Make sure set your kernel configuration correctly - set various kernel parameters right - mount a special filesystem (hugetlbfs) to some arbitrary directory, say /dev/hugepages/ (that seems to be the default on Fedora…) - mmap() a file within that directory into your address space, i.e., something like:
int fd = open("/dev/hugepages/myfile", O_CREAT | O_RDWR, 0755);
void * addr = mmap(0, 10*1024*1024, (PROT_READ | PROT_WRITE), MAP_SHARED, fd, 0);
… and if these two calls succeed, I should have addr pointing to 10 MB of memory allocated in five 2 MB huge pages. Nice. ## The Easy Way However, this seems awfully overcomplicated? At least on Linux 5.15 the whole filesystem thing seems to be completely unnecessary. I just tried this: * kernel configured with HugeTLBfs * kernel parameters set correctly (i.e., vm.nr_hugepages > 0) * no hugetlbfs mounted anywhere And then just do an mmap of anonymous memory:
void *addr = mmap(0, 10*1024*1024, (PROT_READ | PROT_WRITE),
                  (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB), 0, 0);
This gives me 10 MB of memory allocated in huge pages (at least if I don't fail at interpreting the flags in the page table). ## Why the Filesystem? So my question is: Why the filesystem? Is it actually "necessary" to go via the filesystem, as the various guides suggest, and my attempt above was just lucky? Does the filesystem approach have other advantages (aside from having a file which represents parts of your RAM, which seems like a huge footgun…)? Or is this maybe just a remnant from some previous time, when MAP_ANONYMOUS | MAP_HUGETLB was not allowed?
Lukas Barth (231 rep)
Aug 2, 2023, 10:23 AM • Last activity: Dec 4, 2024, 02:13 PM
3 votes
1 answers
466 views
Why doesn't Linux support mmap by path?
The `mmap` syscall needs an fd as a parameter, but when you close that fd, the mmap is still alive in the process's memory address space. Therefore keeping an mmap doesn't need an open fd, so why does Linux only support creating an mmap of a file using an fd, and not a file path? Wouldn't it be n...
The mmap syscall needs an fd as a parameter, but when you close that fd, the mapping is still alive in the process's address space. Therefore keeping an mmap doesn't require an open fd, so why does Linux only support creating an mmap of a file using an fd, and not a file path? Wouldn't it be nice if we could have an mmapat syscall just like openat and execveat? If mmap creates an extra reference to that file, why can't we have an mmapat which atomically creates such a reference directly, without the process taking an fd and releasing it later? Is there any historical or security reason for not having such a syscall in the Linux kernel?
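For reference, the usual workaround is a small wrapper that composes the two calls and closes the descriptor immediately, so a hypothetical mmapat would mostly save the short-lived fd; a sketch of such a wrapper (the name and error convention are invented here):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* hypothetical "mmap by path" helper: open, map, close */
void *mmap_path(int dirfd, const char *path, size_t *len_out)
{
    int fd = openat(dirfd, path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return MAP_FAILED; }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping holds its own reference to the file */

    if (p != MAP_FAILED)
        *len_out = st.st_size;
    return p;
}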
炸鱼薯条德里克 (1435 rep)
Feb 2, 2019, 01:32 AM • Last activity: Oct 17, 2024, 08:51 AM
1 votes
0 answers
252 views
How can I pre-fault and lock memory pages that are mmap'd with MAP_PRIVATE?
I am writing a real-time linux application where I need to prevent any page faults from occuring after the initial startup of my application. My initial thought was just to call `mlockall(MCL_CURRENT | MCL_FUTURE);`. Calling this function returns no error code, but if I inspect my process's pages, i...
I am writing a real-time Linux application where I need to prevent any page faults from occurring after the initial startup of my application. My initial thought was just to call mlockall(MCL_CURRENT | MCL_FUTURE);. Calling this function returns no error code, but if I inspect my process's pages, it looks like there are still many pages that have the Locked: size at 0 (which I assume means those pages can still cause a page fault).
$ cat /proc/<pid>/smaps |  grep -B 21 -A 2 "Locked:                0 kB"  | grep -B 1 "^Size:" | grep -v "Size" | grep -v "^\-\-"
7effd0021000-7effd4000000 ---p 00000000 00:00 0 
7effd4021000-7effd8000000 ---p 00000000 00:00 0 
7effd80c6000-7effdc000000 ---p 00000000 00:00 0 
7effddf02000-7effddfa0000 rw-s 00000000 00:05 368                        /dev/some_char_device
7effddfa0000-7effde1a0000 rw-s f0000000 00:05 368                        /dev/some_char_device
7effde1c1000-7effde1c2000 ---p 00000000 00:00 0 
7effde1c6000-7effde1ca000 rw-s f7c00000 00:05 368                        /dev/some_char_device
7effde1ca000-7effde1cb000 ---p 00000000 00:00 0 
7effe221b000-7effe221c000 ---p 00000000 00:00 0 
7effe2220000-7effe2223000 rw-s 00000000 00:05 90                         /dev/another_char_device
7effe22df000-7effe22e0000 ---p 00013000 08:02 2234654                    //shared_library1.so
7effe22fd000-7effe22fe000 ---p 0000c000 08:02 2231701                    //shared_library2.so
7effe23fc000-7effe23fd000 ---p 0001c000 08:02 2234652                    //shared_library3.so
7effe2e15000-7effe2e16000 ---p 00215000 08:02 1957                       /usr/lib/x86_64-linux-gnu/libc.so.6
7effe2e40000-7effe2e41000 ---p 00011000 08:02 2234649                    //shared_library4.so
7effe2f14000-7effe2f15000 ---p 00046000 08:02 2232115                    //shared_library5.so
7effe321a000-7effe321b000 ---p 0021a000 08:02 855                        /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
7effe3258000-7effe3259000 ---p 0001f000 08:02 2234643                    //shared_library6.so
7effe327d000-7effe327e000 ---p 00021000 08:02 2234641                    //shared_library7.so
7effe328a000-7effe328b000 ---p 00009000 08:02 2232116                    //shared_library8.so
7effe348e000-7effe348f000 ---p 00102000 08:02 91759                      //shared_library9.so
7effe34c6000-7effe34c8000 r--p 00000000 08:02 175                        /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7effe34f2000-7effe34fd000 r--p 0002c000 08:02 175                        /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffc1d1b0000-7ffc1d1b4000 r--p 00000000 00:00 0                          [vvar]
7ffc1d1b4000-7ffc1d1b6000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
# Some attempts and some questions... ## Attempt: Move the location of mlockall in the initialization function My main function has a bunch of dlopen calls. Previously the mlockall call was *after* the calls to dlopen. Moving the call to mlockall *before* the dlopen calls seems to lock in the memory of the shared libraries that are loaded after it. However, it does not lock in the memory of shared libraries loaded *before* the call to mlockall (those shared libraries are linked at compile-time and specified in the executable). **Why doesn't MCL_CURRENT lock in already-loaded libraries?**
$ cat /proc/<pid>/smaps |  grep -B 21 -A 2 "Locked:                0 kB"  | grep -B 1 "^Size:" | grep -v "Size" | grep -v "^\-\-"
7fef0c021000-7fef10000000 ---p 00000000 00:00 0 
7fef10021000-7fef14000000 ---p 00000000 00:00 0 
7fef140c6000-7fef18000000 ---p 00000000 00:00 0 
7fef1875d000-7fef187fb000 rw-s 00000000 00:05 368                        /dev/some_char_device
7fef187fb000-7fef189fb000 rw-s f0000000 00:05 368                        /dev/some_char_device
7fef18a0a000-7fef18a0b000 ---p 00000000 00:00 0 
7fef1ca2e000-7fef1ca2f000 ---p 00000000 00:00 0 
7fef1ca33000-7fef1ca37000 rw-s f7c00000 00:05 368                        /dev/some_char_device
7fef1ca37000-7fef1ca38000 ---p 00000000 00:00 0 
7fef1ca3c000-7fef1ca3f000 rw-s 00000000 00:05 90                         /dev/another_char_device
7fef1d615000-7fef1d616000 ---p 00215000 08:02 1957                       /usr/lib/x86_64-linux-gnu/libc.so.6
7fef1da1a000-7fef1da1b000 ---p 0021a000 08:02 855                        /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
7fef1dcea000-7fef1dceb000 ---p 00102000 08:02 91760                      //shared_library9.so
7fef1dd22000-7fef1dd24000 r--p 00000000 08:02 175                        /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7fef1dd4e000-7fef1dd59000 r--p 0002c000 08:02 175                        /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
7ffece1c4000-7ffece1c8000 r--p 00000000 00:00 0                          [vvar]
7ffece1c8000-7ffece1ca000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
## Attempt: Use madvise to prefault pages I tried calling this prefault function (inspired from here ), but madvise seems to return a -1 or EPERM code on the pages with permissions ---p. If I understand correctly, those pages are mapped with MAP_PRIVATE and are supposed to be allocated in physical memory only when written to (as per the Copy-On-Write pattern). However, madvise does not seem to trigger the allocation, and just returns an error instead. **How can I pre-fault pages mapped with MAP_PRIVATE?**
/* needs: <stdio.h>, <stdint.h>, <inttypes.h>, <limits.h>, <unistd.h>, <sys/mman.h> */
void prefault()
{
    const pid_t pid = getpid();
    
    FILE *fp;
    char path[PATH_MAX];
    char buf[4096];    /* line buffer for fgets() */

    (void)snprintf(path, sizeof(path), "/proc/%" PRIdMAX "/maps", (intmax_t)pid);

    fp = fopen(path, "r");

    volatile uint8_t val;

    while (fgets(buf, sizeof(buf), fp)) {
        void *start, *end, *offset;
        int major, minor, n, ret;
        uint64_t inode;
        char prot[5];    /* e.g. "rw-p" plus terminating NUL */

        n = sscanf(buf, "%p-%p %4s %p %x:%x %" PRIu64 " %s\n",
            &start, &end, prot, &offset, &major, &minor,
            &inode, path);

   
        if (n < 7) { continue; /* bad sscanf match */ }
        if (start >= end) { continue; /* invalid address range */ }

        ret = madvise(start, (size_t)((uint8_t *)end - (uint8_t *)start), MADV_POPULATE_WRITE);
    }

    (void)fclose(fp);
}
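One possible refinement, sketched below as an assumption rather than a verified fix: regions mapped ---p are guard/reserve areas with no access rights, so they can simply be skipped, and read-only or executable regions can be populated with MADV_POPULATE_READ instead of MADV_POPULATE_WRITE (both need Linux 5.14+ and a libc that defines the constants).
#include <stdint.h>
#include <sys/mman.h>

/* sketch of an alternative per-region decision; perms is the 4-char field from
   /proc/<pid>/maps (e.g. "rw-p"), start/end the parsed address range */
static int populate_region(void *start, void *end, const char perms[4])
{
    size_t len = (size_t)((uint8_t *)end - (uint8_t *)start);

    if (perms[0] == '-' && perms[1] == '-' && perms[2] == '-')
        return 0;   /* ---p guard/reserve region: nothing can be prefaulted here, skip it */

    if (perms[1] == 'w')
        return madvise(start, len, MADV_POPULATE_WRITE);   /* writable: fault pages in for write */

    return madvise(start, len, MADV_POPULATE_READ);        /* r-- or r-x: read population is enough */
}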
Jay S. (61 rep)
Oct 2, 2024, 06:19 PM • Last activity: Oct 4, 2024, 08:40 PM
0 votes
1 answers
38 views
Is mmap holding a reference to the OFD specified by POSIX, or a Linux extension, and where is it documented?
I am using Open File Description (OFD) owned locks on Linux (`fcntl` with command `F_OFD_SETLK`). After locking a file, I memory mapped it, and closed the file descriptor. Another process tried to lock the same file, and was unable to do so until the first process unmapped the memory. It seems Linux...
I am using Open File Description (OFD) owned locks on Linux (fcntl with command F_OFD_SETLK). After locking a file, I memory mapped it, and closed the file descriptor. Another process tried to lock the same file, and was unable to do so until the first process unmapped the memory. It seems Linux, at least, keeps a reference to the open file description when a mapping is still active. POSIX.1-2024 documents that [mmap](https://pubs.opengroup.org/onlinepubs/9799919799/functions/mmap.html) adds a reference to the "file associated with the file descriptor". > The mmap() function shall add an extra reference to the file associated with the file descriptor fildes which is not removed by a subsequent close() on that file descriptor. This reference shall be removed when there are no more mappings to the file. A literal interpretation here would mean that the reference is to the file itself, but I don't know if that was the intent when the documentation was written. I would like to be able to rely on this behavior. Is there somewhere in POSIX where it's specified that I am missing? Could this be a defect report? If it's Linux exclusive, is there a reference anywhere that this was their intended behavior (and, possibly, their interpretation of the POSIX standard)? Test program (might require different feature test macros on other platforms):
#define _GNU_SOURCE

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
	char filename[] = "/tmp/ofd-test.XXXXXX";
	int fd = mkstemp(filename);
	if (fd < 0) {
		perror("mkstemp");
		return 1;
	}

	fprintf(stderr, "created file '%s'\n", filename);

	struct flock lock = {
		.l_len = 0,
		.l_pid = 0,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_type = F_WRLCK,
	};
	if (fcntl(fd, F_OFD_SETLK, &lock) < 0) {
		perror("first lock");
		return 1;
	}

	void *ptr = mmap(0, 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	close(fd);

	int newfd = open(filename, O_RDWR);
	if (newfd < 0) {
		perror("re-open");
		return 1;
	}

	lock.l_pid = 0;
	if (fcntl(newfd, F_OFD_SETLK, &lock) == 0) {
		fputs("locking after mmap worked\n", stderr);
		return 1;
	}
	perror("locking after mmap");

	munmap(ptr, 1024);

	lock.l_pid = 0;
	if (fcntl(newfd, F_OFD_SETLK, &lock) < 0) {
		perror("locking after munmap");
		return 1;
	}
	fputs("locking after munmap worked\n", stderr);

	if (unlink(filename) < 0) {
		perror("unlink");
		return 1;
	}

	return 0;
}
For me, this outputs:
created file '/tmp/ofd-test.Pyf3oj'
locking after mmap: Resource temporarily unavailable
locking after munmap worked
&#201;rico Rolim (35 rep)
Sep 24, 2024, 03:10 PM • Last activity: Sep 24, 2024, 04:08 PM
6 votes
1 answers
1424 views
Does mmap() update the page table after every page fault?
Based on my research on mmap(), I understand that mmap uses demand paging to copy in data to the kernel page cache only when the virtual memory address is touched, through page fault. If we are reading files that are bigger than the page cache, then some stale page in the page cache will have to be...
Based on my research on mmap(), I understand that mmap uses demand paging to copy data into the kernel page cache only when the virtual memory address is touched, through a page fault. If we are reading files that are bigger than the page cache, then some stale page in the page cache will have to be reclaimed. So my question is, will the page table be updated to map the corresponding virtual memory address to the address of the old stale page in the cache (now containing new data)? How does this happen? Is this part of the mmap() system call?
prajasek (63 rep)
Apr 3, 2024, 01:38 AM • Last activity: Apr 5, 2024, 10:50 AM
28 votes
6 answers
15978 views
what is the purpose of memory overcommitment on Linux?
I know about [memory overcommitment][1] and I profoundly dislike it and usually disable it. I am *not* thinking of [setuid][2]-based system processes (like those running [`sudo`][3] or [postfix][4]) but of an ordinary Linux process started on some command line by some user not having admin privilege...
I know about memory overcommitment and I profoundly dislike it and usually disable it. I am *not* thinking of setuid-based system processes (like those running sudo or postfix) but of an ordinary Linux process started on some command line by some user not having admin privileges. A well-written program could malloc (or mmap, which is often used by malloc) more memory than available and crash when using it. Without memory overcommitment, that malloc or mmap would fail and the well-written program would catch that failure. The poorly written program (using malloc without checks against failure) would crash when using the result of a failed malloc. Of course virtual address space (which gets extended by mmap, and so by malloc) is not the same as RAM (RAM is a resource managed by the kernel, see this; processes have their virtual address space initialized by execve(2) and extended by mmap & sbrk, so they don't directly consume RAM, only virtual memory). Notice that optimizing RAM usage could be done with madvise(2) (which could give a hint, using MADV_DONTNEED, to the kernel to swap some pages onto the disk), when really needed. Programs wanting some overcommitment could use mmap(2) with MAP_NORESERVE. My understanding of memory overcommitment is that it behaves as if every memory mapping (by execve or mmap) were implicitly using MAP_NORESERVE. My perception of it is that it is simply useful for very buggy programs. But IMHO a real developer should *always* check failure of malloc, mmap and related virtual-address-space-changing functions (e.g. like here). And most free software programs whose source code I have studied have such a check, perhaps as some xmalloc function.... Are there real-life programs, e.g. packaged in a typical Linux distribution, which actually need and use memory overcommitment in a sane and useful way? I know of none! What are the disadvantages of disabling memory overcommitment? Many older Unixes (e.g. SunOS4, SunOS5 from the previous century) did not have it, and IMHO their malloc (and perhaps even the general full-system performance, malloc-wise) was not much worse (and improvements since then are unrelated to memory overcommitment). I believe that memory overcommitment is a misfeature for lazy programmers. The user of that program could set up a resource limit with setrlimit(2) called with RLIMIT_AS by the parent process (e.g. the ulimit builtin of /bin/bash, or the limit builtin of zsh, or any modern equivalent for e.g. at, crontab, batch, ...), or a grand-parent process (up to eventually /sbin/init of pid 1 or its modern systemd variant).
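As an illustration of that last point, a small sketch that caps the address space with setrlimit(2) before attempting a large allocation; under such a limit a well-behaved program sees malloc return NULL up front instead of being killed later (the 512 MiB figure is arbitrary):
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    /* cap the virtual address space at 512 MiB for this process and its children */
    struct rlimit rl = { .rlim_cur = 512UL << 20, .rlim_max = 512UL << 20 };
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    void *p = malloc(1UL << 30);   /* 1 GiB request: must fail under the limit */
    if (p == NULL) {
        fprintf(stderr, "malloc failed cleanly: out of (virtual) memory\n");
        return 0;
    }
    free(p);
    return 0;
}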
Basile Starynkevitch (10709 rep)
May 2, 2018, 04:56 PM • Last activity: Mar 25, 2024, 02:54 PM
2 votes
0 answers
231 views
Which consistency guarantees do POSIX shared memory objects have?
On Linux, POSIX shared memory objects [1] use a `tmpfs` via `/dev/shm`. A `tmpfs` in turn is said to "live completely in the page cache" [2] (I'm assuming swap has not been enabled). I am wondering what the consistency / no-tearing guarantees are when using a POSIX SHM object, `mmap`ed into a progra...
On Linux, POSIX shared memory objects [1] use a tmpfs via /dev/shm. A tmpfs in turn is said to "live completely in the page cache" [2] (I'm assuming swap has not been enabled). I am wondering what the consistency / no-tearing guarantees are when using a POSIX SHM object, mmaped into a program's address space. Example: Assume a POSIX SHM object shared between two processes A and B, both mmaped into their respective address spaces. The size of that object is 8kB or two pages, assuming 4kB pages and the object being page-aligned. 1. A issues two sequential writes, the first into the first page (first 4k block), the second into the second page. 2. B polls the shared object / both pages. Is it possible that the reads of B are torn, meaning that B reads a fresh and updated second page but a stale first page? [1] https://www.man7.org/linux/man-pages/man7/shm_overview.7.html [2] https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html This would be the associated pseudo-code in C:
int fd = shm_open(...);
void *share = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);  /* MAP_SHARED so both processes see the writes */
memcpy(share       , data1, 4096);
memcpy(share + 4096, data2, 4096);
Philipp Friese (21 rep)
Mar 19, 2024, 08:54 AM
3 votes
1 answers
112 views
How is 4kB RSS possible in Linux 4.x?
I have been the dev/maintainer of an open source IRC bot since the late 90s. The goal was always to make it as versatile & useful as possible in a small memory footprint. During the 2000s I also wrote some proof of concept code to squeeze useful programs down to just 4kB RSS, which wasn't too hard t...
I have been the dev/maintainer of an open source IRC bot since the late 90s. The goal was always to make it as versatile & useful as possible in a small memory footprint. During the 2000s I also wrote some proof of concept code to squeeze useful programs down to just 4kB RSS, which wasn't too hard to do on the 2.4 kernel. I made it happen with both init & agetty; that is, I made them run resident performing their duties inside a single 4kB page of memory. Now, color me surprised when one day I ask my bot to report on its memory usage and it responds with this: [Mar 27 2018] VM 1000 kB (Max 2988 kB), RSS 4 kB [ Code 212 kB, Data 68 kB, Libs 556 kB, Stack 132 kB ] To get 4kB RSS on kernel 2.4 I had to map all code, rodata and stack segments to the same page. Since I'm not doing that with the bot, even the theoretical limit should be 12kB. But with later kernels, there seem to be some extra accelerator mappings, so that even unmapping stack and rodata still leaves 12kB mapped. The bot has been linked with libmusl, so the "sane" standard RSS while it's running was 54kB. I did create an ld script to reorder functions into blocks from rarely used to core essential, but still, 4kB isn't reasonable even in theory. The system is a Xeon with plenty of physical memory, no swap and no system load, so there was no pressure to swap pages out. Any idea what happened here? I'm still interested in the possibility of remapping everything to a single 4kB page, although to date I have only gotten it down to 12kB reproducible and 8kB unreproducible. The bot reads the RSS from /proc and just reports what it reads, unaltered. ps aux displayed the same VSZ & RSS as the bot reported.
Alonda (131 rep)
Feb 24, 2024, 07:11 PM • Last activity: Feb 26, 2024, 01:08 AM
0 votes
0 answers
264 views
How to read watchdog registers on x86 Linux?
I want to read the Intel iTCO watchdog registers on my Intel Lynx Point system. I found the watchdog here: [ 5598.341020] iTCO_wdt iTCO_wdt.1.auto: Found a Lynx Point TCO device (Version=2, TCOBASE=0x1860) It is connected to the ISA bridge LPC controller: 00:1f.0 ISA bridge: Intel Corporation H87 Ex...
I want to read the Intel iTCO watchdog registers on my Intel Lynx Point system. I found the watchdog here: [ 5598.341020] iTCO_wdt iTCO_wdt.1.auto: Found a Lynx Point TCO device (Version=2, TCOBASE=0x1860) It is connected to the ISA bridge LPC controller: 00:1f.0 ISA bridge: Intel Corporation H87 Express LPC Controller (rev 05) Subsystem: ASUSTeK Computer Inc. H87 Express LPC Controller Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- Kernel driver in use: lpc_ich Kernel modules: lpc_ich I found the ioports: cat /proc/ioports | grep -i tco 1830-1833 : iTCO_wdt.1.auto 1830-1833 : iTCO_wdt 1860-187f : iTCO_wdt.1.auto 1860-187f : iTCO_wdt And I found the iomem: cat /proc/iomem | grep -i wdt fed1f410-fed1f414 : iTCO_wdt.0.auto I tried to dump the memory via: memtool md 0xfed1f410 Then I set the timeout to a different value and compare the registers again: wdctl -s 10 memtool md 0xfed1f410 Nothing changes, why is that? What is wrong with my approach?
defoe (153 rep)
Nov 30, 2023, 08:52 AM
0 votes
1 answers
118 views
Can mmap be used to create a file which references memory subset of another file?
I'm interested in writing a program that can create two files, second file would be a "view" of first file and if modified, the first file would also be modified. Is this possible to do with mmap or at all? I know that using mmap i can have shared memory in RAM, but I need shared memory in non-volat...
I'm interested in writing a program that can create two files, where the second file would be a "view" of the first file: if it is modified, the first file would also be modified. Is this possible to do with mmap, or at all? I know that using mmap I can have shared memory in RAM, but I need shared memory in non-volatile storage, i.e. the hard drive. I cannot copy the first file or load it fully into RAM since I assume the file can be very large (GBs). After I find out how to have the second file expose a memory subset of the first file, I plan to make 3 files, the first acting as a container and the second and third showing different subsets of the first file. The second and third files are to be formatted with filesystems so that the first (container) file holds in its memory two filesystems accessible via the second and third files. This I plan to accomplish by attaching the second and third files as loopback devices and mounting them. Is this doable, or am I not seeing something?
trickingLethargy (3 rep)
Nov 21, 2023, 01:06 PM • Last activity: Nov 21, 2023, 01:23 PM
0 votes
0 answers
78 views
Searching a 1T block device for a specific byte sequence at a specified offset
I'm performing data recovery after an accident with dd. In the longer term, I'll need to use some recovery tools to try and repair the file system In the meantime, there's an image on the system that I need, which if I can find, I'll be able to use to image a device. The byte sequence is `"\x21\x35\...
I'm performing data recovery after an accident with dd. In the longer term, I'll need to use some recovery tools to try and repair the file system. In the meantime, there's an image on the system that I need, which, if I can find it, I'll be able to use to image a device. The byte sequence is "\x21\x35\x2c\x66\xe4\xe8\x48\xe0\xf9\x4a\x92\x\x7f\x3f\xb7\x6e". I've tried using mmap in Python, but as far as I am aware, mmap.find() doesn't permit block devices to be opened, as I seem to get an error every time I try. I've tried using other tools, such as dd in combination with grep, but it searches from the start of the disk when the match will likely be towards the end. The scanning is taking an incredibly long time. So tl;dr what is the best method to search 1TB of data with the following requirements: * bytestring * at a specified offset * can search the file without trying to open it in one go (like Python's with open) * can read an unmounted block device
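One common approach is to read the device in large chunks and scan each chunk with memmem(3), carrying over the last pattern-length-minus-one bytes so a match straddling two chunks isn't missed; a rough C sketch (the device path and the shortened pattern are placeholders):
#define _GNU_SOURCE             /* for memmem(3) */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* shortened, hypothetical pattern -- substitute the full 16-byte sequence */
    static const unsigned char pat[] = { 0x21, 0x35, 0x2c, 0x66, 0xe4, 0xe8, 0x48, 0xe0 };
    const size_t chunk = 64UL << 20;                  /* read 64 MiB at a time */

    int fd = open("/dev/sdX", O_RDONLY);              /* hypothetical block device */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char *buf = malloc(chunk + sizeof(pat));
    size_t carry = 0;                                 /* bytes kept from the previous chunk */
    off_t pos = 0;                                    /* device offset of buf[0] */
    ssize_t n;

    while ((n = read(fd, buf + carry, chunk)) > 0) {
        size_t avail = carry + (size_t)n;
        unsigned char *hit = memmem(buf, avail, pat, sizeof(pat));
        if (hit) {
            printf("match at device offset %lld\n", (long long)(pos + (hit - buf)));
            break;
        }
        if (avail >= sizeof(pat)) {
            /* keep the tail so a match across the chunk boundary is still found */
            carry = sizeof(pat) - 1;
            memmove(buf, buf + avail - carry, carry);
            pos += avail - carry;
        } else {
            carry = avail;                            /* short read: keep everything */
        }
    }

    free(buf);
    close(fd);
    return 0;
}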
user587941 (1 rep)
Oct 6, 2023, 08:24 PM
0 votes
1 answers
200 views
How to measure mmap I/O latency?
I have an application which appears to be slowing/blocking at the same time there's a lot of disk I/O going on, so I suspect it's I/O operations within the application which are blocking. I can't imagine what else the problem might be, but I would like to confirm it. The problem is that the applicat...
I have an application which appears to be slowing/blocking at the same time there's a lot of disk I/O going on, so I suspect it's I/O operations within the application which are blocking. I can't imagine what else the problem might be, but I would like to confirm it. The problem is that the application largely uses mmap'd files for I/O, and thus they don't show up with strace. I know blocking I/O operations from mmap'd memory is going to be a page fault. But is there a way to measure the amount of time thread execution was suspended due to page faults?
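A crude starting point, sketched below: bracket the parsing pass with getrusage(2) and compare the major-fault counters; this yields fault counts rather than the time the thread was actually blocked, so it is only an approximation.
#define _GNU_SOURCE             /* for RUSAGE_THREAD */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* sketch: run the code that walks the mmap'd file and report how many
   major (I/O-causing) page faults it triggered -- counts, not blocked time */
void report_faults(void (*work)(void))
{
    struct rusage before, after;
    struct timeval t0, t1;

    getrusage(RUSAGE_THREAD, &before);
    gettimeofday(&t0, NULL);

    work();                                 /* e.g. the parsing pass over the mapping */

    gettimeofday(&t1, NULL);
    getrusage(RUSAGE_THREAD, &after);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3;
    printf("major faults: %ld, minor faults: %ld, elapsed: %.1f ms\n",
           after.ru_majflt - before.ru_majflt,
           after.ru_minflt - before.ru_minflt, ms);
}
For per-fault timing rather than counts, recording the major-faults software event with perf is one possible complement.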
phemmer (73711 rep)
Oct 5, 2023, 05:00 PM • Last activity: Oct 5, 2023, 06:00 PM
0 votes
1 answers
316 views
Mapping Segment of Guest RAM to host file, in PPC QEMU
My desire is conceptually simple, I have a file (really a PCIe resource file from /sys/bus/pci/device/.... but that isn't too relevant) on the host that I want to make available somewhere in guest memory, so that changes from either side get reflected to each other. Since my goal was to actually map...
My desire is conceptually simple, I have a file (really a PCIe resource file from /sys/bus/pci/device/.... but that isn't too relevant) on the host that I want to make available somewhere in guest memory, so that changes from either side get reflected to each other. Since my goal was to actually map a limited segment of PCIe address space in the host, I couldn't productively map the entire guest RAM. The base command that I am trying to add is listed below. The goal is to get memory id "bar0.ram" mapped *somewhere* in guest memory. qemu-system-ppc -M ppce500 -cpu e500 -m 64M -d guest_errors,unimp -bios $PWD/test.elf -s -object memory-backend-file,size=1m,id=bar0.ram,mem-path=/sys/bus/pci/devices/0000\:04\:00.0/resource0,share=on -monitor telnet:127.0.0.1:4999,server,nowait -nographic Perhaps this would be easier on ARM or x86, but PPC doesn't offer persistent memory, nvram, multiple memory slots backed by different files, or similar tricks (that I could figure out how to get working). It does offer ivshmem, but I was unable to figure out how to get that to be transparently mapped into the guest address space. Vaguely useful/related resources: - https://unix.stackexchange.com/questions/616596/mapping-guest-ram-to-file-in-qemu - https://superuser.com/questions/1795238/qemu-system-sparc-shared-memory-between-host-and-guest - https://blog.reds.ch/?p=1379
Seth Robertson (411 rep)
Aug 2, 2023, 02:41 AM
2 votes
1 answers
3448 views
Mapping guest RAM to file in qemu
We're emulating a Cortex M3 cpu and would like to pass some parameters to the guest during run-time. The simplest idea seems to be to write directly to some memory area. I tried simply adding `-mem-path /tmp/qemu.ram` which did nothing. Adding ``` -object memory-backend-file,id=mem,size=128K,mem-pat...
We're emulating a Cortex M3 cpu and would like to pass some parameters to the guest during run-time. The simplest idea seems to be to write directly to some memory area. I tried simply adding -mem-path /tmp/qemu.ram which did nothing. Adding
-object memory-backend-file,id=mem,size=128K,mem-path /tmp/qemu.ram \
worked in that qemu opened it at least. But nothing is written to it during run-time and there seems to be no connection between the guest memory map and the file at all. To clarify, what I expected to happen is that QEMU, instead of mallocing guest RAM, mmaps the file and uses that instead. This would enable me to seek, read and write from this file during run-time. What am I missing? Is there any other convenient way to get write access to RAM/MMIO of the guest during run-time?
Benjamin Lindqvist (361 rep)
Oct 27, 2020, 09:09 AM • Last activity: Aug 2, 2023, 01:38 AM
1 votes
1 answers
275 views
Partial fsyncs when writing to block device
I'm writing my own data store directly on top of a block device. To ensure durability I want to sync to disk. But here's the thing: I want to sync only part of it. I'm keeping a journal for crash recovery, and write my future changes to the journal before applying them to the actual place on disk. T...
I'm writing my own data store directly on top of a block device. To ensure durability I want to sync to disk. But here's the thing: I want to sync only part of it. I'm keeping a journal for crash recovery, and write my future changes to the journal before applying them to the actual place on disk. Then I want to ensure the journal changes are written to disk, and only then make the actual changes to the rest of the disk (which I don't care about fsyncing, until I checkpoint my journal). I could simply fsync the entire block device, but that forces a lot of things that aren't urgent to be written out. I have thought of two options, but I'm surprised there is no partial fsync(2) call and nobody asking for it from what I've found. 1. mmap(2) the full block device and use msync(2) to sync part of it. 2. open(2) the block device twice, once with O_SYNC and use one for lazy writes and one for my journal writes.
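A sketch of option 1 (the device path and journal size are placeholders): map just the journal region of the device and push that range out with msync(2); MS_SYNC blocks until those pages have been written back, while the rest of the device is left to ordinary lazy writeback.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t journal_len = 4 << 20;                   /* 4 MiB journal at the start of the device */
    int fd = open("/dev/sdX", O_RDWR);                    /* placeholder block device */
    if (fd < 0) { perror("open"); return 1; }

    char *journal = mmap(NULL, journal_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (journal == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(journal, "journal record...", 17);             /* stage the change in the journal */

    /* flush only the journal range; other dirty pages of the device are untouched */
    if (msync(journal, journal_len, MS_SYNC) != 0) {
        perror("msync");
        return 1;
    }

    /* ...now it is safe to apply the change to its real location via ordinary writes... */
    munmap(journal, journal_len);
    close(fd);
    return 0;
}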
Jille Timmermans (53 rep)
Jun 20, 2023, 06:42 PM • Last activity: Jun 20, 2023, 07:03 PM
2 votes
1 answers
4381 views
Is it possible for two processes to use the same shared-memory without resorting to a file to obtain it, be it a memory-mapped file or /dev/shm file?
I'm curious because today the only way I know how to give two different processes the same shared-memory is through a memory-mapped file, in other words, both processes open the same memory-mapped file and write/read to/from it. That has penalties / drawbacks as the operating system needs to swap be...
I'm curious because today the only way I know how to give two different processes the same shared memory is through a memory-mapped file; in other words, both processes open the same memory-mapped file and write/read to/from it. That has penalties / drawbacks, as the operating system needs to swap between disk and memory. Apologies in advance if this is a silly question, but is there such a thing as pure shared memory between processes, not backed by a file? If yes, how would the processes get a hold of it if not through a memory-mapped file or a /dev/shm file?
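One file-less possibility, sketched below: a MAP_SHARED | MAP_ANONYMOUS mapping created before fork(2) is shared between parent and child without any path in the filesystem or in /dev/shm; for unrelated processes, memfd_create(2) plus fd passing over a Unix socket is the usual file-less variant.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* anonymous shared page: no backing file anywhere in the filesystem */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {
        strcpy(shared, "hello from the child");   /* child writes into the shared page */
        return 0;
    }

    wait(NULL);                                   /* parent sees the child's write */
    printf("parent read: %s\n", shared);
    munmap(shared, 4096);
    return 0;
}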
ThreadFrank (25 rep)
Jun 2, 2023, 03:45 PM • Last activity: Jun 2, 2023, 03:58 PM