Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
0
votes
1
answers
210
views
Is sbatch-inside-sbatch a bad idea?
On a slurm cluster, is there ever a time when it’s **appropriate** to use `sbatch` *inside* an `sbatch` script? Or is it always a bad pattern? I’ve seen this in use, and it looks iffy:

```
#SBATCH -J RecursiveSbatchInvocation
things=(apples bananas carrots)
for thing in "${things[@]}"; do
    sbatch --wrap ./myscript.sh "$thing"
done
```
Compared to the two alternatives below:
- Is there a danger to running `sbatch` inside `sbatch`? Will it potentially hog a lot of resources uselessly, run very slowly for me, be impolite to other users on the cluster, or have unpredictable/bad side effects?
- Is there a realistic scenario where something *can only* be done with `sbatch` inside `sbatch`?
**Alternative 1:** job array:

```
#SBATCH -J JobArrayInvocation
#SBATCH -a 0-2
things=(apples bananas carrots)
./myscript.sh "${things["$SLURM_ARRAY_TASK_ID"]}"
```
**Alternative 2:** `srun` inside `sbatch` with multiple tasks:

```
#SBATCH -J SrunInvocations
#SBATCH --ntasks=3
things=(apples bananas carrots)
for thing in "${things[@]}"; do
    srun --ntasks=1 ./myscript.sh "$thing" &
done
wait  # without this, the batch script exits before the backgrounded steps finish
```
Somebody asked a similar question over on StackOverflow: “[**Can** I call sbatch recursively?](https://stackoverflow.com/questions/72747837/can-i-call-sbatch-recursively)” The two points in the discussion were (1) the cluster probably won’t allow it and (2) you can do the same thing with `srun` inside `sbatch`. But in my case the cluster _is_ allowing it. So I know it’s possible and I know “more canonical” ways to do this kind of thing, but I wonder if there’s a specific thing I can show people to say “please never use `sbatch` recursively.”
wobtax
(1135 rep)
Mar 19, 2025, 04:18 PM
• Last activity: Mar 20, 2025, 03:00 PM
0
votes
1
answers
62
views
How to configure or tune InfiniBand on RHEL 8?
I asked a similar question here: https://unix.stackexchange.com/questions/788297/nfs-v4-2-tuning
With fewer than 50 servers on a closed InfiniBand (Mellanox) HDR network switch, all running RHEL-8.10:
1. Is there anything other than `systemctl start opensm` that is needed on any one server to make this network run **optimally** in terms of speed?
2. Can someone either reply with specific, concise instructions here, or link to a website that describes tests one can do to validate a properly configured InfiniBand network?

*I read stuff (nvidia forums, reddit, etc.) about people complaining that their InfiniBand is no better than their 10 GbE network. How can that be?*

*Personally, if I do an `scp` of a single tar file between two servers on InfiniBand, I see no better performance than over 10 GbE; scp is nice because it displays the transfer speed. Referring back to my original NFS tuning question, is there any other protocol or network mechanism besides NFS with regard to an InfiniBand network? How much could NFS (v4.2) be the factor in bad performance?*
The use case is CFD and other commercial software written to run on clusters over MPI, for which Intel oneAPI is installed with no errors or warnings; the same ~12 hour job took 2 hours longer over InfiniBand. I use scp as a simple way to get numbers across. Things get worse over NFS. When I need to manage data in the 20+ TB range, I specifically tar things up and scp the archive, because that is faster than doing cp over NFS.
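For validating the fabric itself, independent of NFS or scp, a minimal sketch using the InfiniBand diagnostics and the perftest suite (on RHEL these typically come from the `infiniband-diags` and `perftest` packages; the hostname is a placeholder):

```
# check link state and rate on each host (expect "Active" and the HDR rate)
ibstat

# raw RDMA bandwidth between two hosts:
ib_write_bw                 # run on the server side
ib_write_bw server-host     # run on the client side, pointing at the server
```

If `ib_write_bw` shows near line rate but scp and NFS do not, the bottleneck is in the upper layers (single-stream TCP/IPoIB, ssh cipher overhead, NFS configuration) rather than the fabric.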
ron
(8647 rep)
Mar 10, 2025, 01:44 PM
• Last activity: Mar 10, 2025, 04:44 PM
0
votes
0
answers
22
views
How to set new features to N during kernel compilation from an old .config file?
I am compiling a custom Linux kernel for a compute cluster. The cluster has been running kernel version 4.4.47 for the last 5 years. I need to upgrade the kernel to a more recent version. I've chosen version 6.6.76 since it has long-term support.
Now here is what I've tried: I have the old configuration file, so I just copy it into the root of the kernel source tree as `.config` and then run `make olddefconfig`. This takes all the existing configuration and sets the default values for the newer options. However, this enables several unwanted features that I find hard to disable manually.
Is there a better way to do this, such that the configuration keeps all the old settings and sets the newer settings to `n`?
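One commonly suggested approach (a sketch; verify against `make help` for your tree) is to answer every new prompt with "n" instead of accepting the defaults:

```
# copy the old config into the 6.6.76 source tree first
cp /path/to/old.config .config

# see which config symbols are new relative to the old .config
make listnewconfig

# answer "n" to every new prompt; watch for multiple-choice prompts,
# which may reject "n" and need manual input, then review the result
yes n | make oldconfig
```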
Sâu
(101 rep)
Feb 10, 2025, 12:37 PM
0
votes
1
answers
399
views
NFS v4.2 tuning
https://www.youtube.com/watch?v=JXASmxGrHvY
At 5:30 the statement is made:
> if you get NFS tuned just right it is incredibly fast for **ultra small file transfers**...
At 6:05:
> I've heard of 4.0GB/sec using a sequential read... but you have to have all the infrastructure tuned just right.
What, where, and how do I tune NFS v4.2 in **RHEL-8.10 or later** to achieve what is claimed above? Is this claim still true? The video seems to have been made 3 years ago.
Is there any good *NFS tuning* documentation as it pertains to NFS v4.2 in RHEL 8/9 or equivalent today?
*v4.2 is the latest version of NFS, correct? Is there any newer version of NFS proposed on the horizon?*
**Are there any better settings than the defaults in `/etc/nfs.conf` and `/etc/nfsmount.conf`?**
**If I can place a bounty on this I will --> what is the max transfer speed in GB/sec that should be had, in RHEL-8.10 or later, over 100 Gbps InfiniBand, on NFS v4.2 (assuming RDMA?) with all the "tuned" options?** The only *tuning* I am aware of is putting `rdma` into effect over InfiniBand; if someone knows better/more, let me know.
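For reference, putting RDMA into effect for an NFS v4.2 mount looks roughly like this (a sketch; `server:/export` is a placeholder, and the server side needs RDMA enabled for nfsd):

```
# server side, in /etc/nfs.conf:
#   [nfsd]
#   rdma=y
#   rdma-port=20049

# client side: mount the export over RDMA
mount -t nfs -o vers=4.2,proto=rdma,port=20049 server:/export /mnt
```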
ron
(8647 rep)
Dec 17, 2024, 05:25 PM
• Last activity: Dec 17, 2024, 08:38 PM
4
votes
2
answers
1868
views
SoftIRQs and Fast Packet Processing on Linux network
I have been reading about performance tuning of Linux to get the fastest packet processing times when receiving financial market data. I see that when the NIC receives a packet, it puts it in memory via DMA, then raises a HardIRQ - which in turn sets some NAPI settings and raises a SoftIRQ. The SoftIRQ then uses NAPI/device drivers to read data from the RX buffers via polling, but this only runs for a limited budget (`net.core.netdev_budget`, default 300 packets).
These are in reference to a real server running Ubuntu, with a Solarflare NIC.
My questions are below:
1. If each HardIRQ raises a SoftIRQ, and the device driver reads multiple packets in one go (netdev_budget), what happens to the SoftIRQs raised by each of the packets that were drained from the RX buffer in one go (each packet received will raise a hard and then a soft IRQ)? Are these queued?
2. Why does the NAPI use polling to drain the RX_buffer? The system has just generated a SoftIRQ and is reading the RX buffer, then why the polling?
3. Presumably, draining of the RX_Buffer via the softirq will only happen from 1 specific RX_Buffer and not across multiple RX_Buffers? If so, can increasing the netdev_budget delay the processing/draining of other RX_Buffers? Or can this be mitigated by assigning different RX_Buffers to different cores?
4. There are settings to ensure that HardIRQs are immediately raised and handled. However, SoftIRQs may be processed at a later time. Are there settings/configs to ensure that SoftIRQs related to network RX are also handled at top priority and without delays?
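For reference, the budget knobs mentioned above can be inspected and adjusted with `sysctl` (the values below are illustrative, not recommendations; `netdev_budget_usecs` only exists on kernels 4.12 and later):

```
# per-poll packet budget and (on newer kernels) time budget in microseconds
sysctl net.core.netdev_budget
sysctl net.core.netdev_budget_usecs

# example: raise both budgets
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=4000
```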
Nidhi
(41 rep)
Jun 22, 2016, 12:27 PM
• Last activity: Sep 22, 2024, 05:29 AM
2
votes
0
answers
107
views
Data Recovery from RAID 1 - Impossible to mount disk
I have an Nvidia DGXA100 station that I use for my research. It has started shutting down abruptly (just a couple of minutes after startup), probably due to a water-cooling pump breaking down (that would be the second time in as many years). In any case, I'm on a deadline and I have important experimental data on it, and not enough time to extract all of it a few tens or hundreds of MBs at a time (before it shuts down and I wait ~1h for it to cool down naturally) -- we're talking maybe 100GB of data that I would like to transfer, total.
So I have pulled the drives to try to extract the data from my laptop. The disks are supposed to be in RAID 1; I think it's hardware RAID, but I'm not 100% sure. The station's OS is a fork of Ubuntu called DGX OS. Fiddling around with the drives, I am stumped trying to extract the data, and so I reach out to you. No partitions are detected in `/dev`, only `/dev/sda`. If I try `sudo mount /dev/sda /mnt`, I get:

```
mount: /mnt: can't read superblock on /dev/sda.
       dmesg(1) may have more information after failed mount system call.
```

Using `fdisk -l` or `df`, the drive is not detected. `testdisk` does not detect the drive either. `lsblk` only outputs:

```
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda    8:0    0   0B  0 disk
```

Looking around this site, some people advised trying `sudo mdadm --examine /dev/sda`, which yields `mdadm: No md superblock detected on /dev/sda.` I tried looking through `dmesg | grep sda`, which yielded:
```
sudo dmesg | grep sda
[160532.911871] sd 0:0:0:0: [sda] 30515200 512-byte logical blocks: (15.6 GB/14.6 GiB)
[160532.912828] sd 0:0:0:0: [sda] Write Protect is off
[160532.912838] sd 0:0:0:0: [sda] Mode Sense: 43 00 00 00
[160532.913729] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[160532.923731] sda: sda1
[160532.923883] sd 0:0:0:0: [sda] Attached SCSI removable disk
[160533.329382] FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[162600.133163] sda: detected capacity change from 30515200 to 0
[331437.482830] sd 0:0:0:0: [sda] Unit Not Ready
[331437.482842] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331437.482849] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331437.483450] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[331437.483454] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331437.483457] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331437.484134] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[331437.484138] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331437.484142] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331437.484302] sd 0:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
[331437.484305] sd 0:0:0:0: [sda] 0-byte physical blocks
[331437.484940] sd 0:0:0:0: [sda] Test WP failed, assume Write Enabled
[331437.485154] sd 0:0:0:0: [sda] Asking for cache data failed
[331437.485161] sd 0:0:0:0: [sda] Assuming drive cache: write through
[331437.485659] sd 0:0:0:0: [sda] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[331437.485662] sd 0:0:0:0: [sda] Optimal transfer size 33553920 bytes not a multiple of physical block size (0 bytes)
[331437.486262] sd 0:0:0:0: [sda] Attached SCSI disk
[331519.863944] sd 0:0:0:0: [sda] Unit Not Ready
[331519.863953] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331519.863962] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331519.864568] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[331519.864574] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331519.864579] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331519.865310] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[331519.865314] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331519.865318] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331519.865495] sd 0:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
[331519.865499] sd 0:0:0:0: [sda] 0-byte physical blocks
[331519.866028] sd 0:0:0:0: [sda] Test WP failed, assume Write Enabled
[331519.866204] sd 0:0:0:0: [sda] Asking for cache data failed
[331519.866206] sd 0:0:0:0: [sda] Assuming drive cache: write through
[331519.866728] sd 0:0:0:0: [sda] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[331519.866731] sd 0:0:0:0: [sda] Optimal transfer size 33553920 bytes not a multiple of physical block size (0 bytes)
[331519.867278] sd 0:0:0:0: [sda] Attached SCSI disk
[331991.085995] sd 0:0:0:0: [sda] Unit Not Ready
[331991.085999] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331991.086003] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331991.086291] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[331991.086294] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331991.086296] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331991.086661] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[331991.086664] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[331991.086666] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[331991.086755] sd 0:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
[331991.086757] sd 0:0:0:0: [sda] 0-byte physical blocks
[331991.087041] sd 0:0:0:0: [sda] Test WP failed, assume Write Enabled
[331991.087128] sd 0:0:0:0: [sda] Asking for cache data failed
[331991.087130] sd 0:0:0:0: [sda] Assuming drive cache: write through
[331991.087414] sd 0:0:0:0: [sda] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[331991.087417] sd 0:0:0:0: [sda] Optimal transfer size 33553920 bytes not a multiple of physical block size (0 bytes)
[331991.088014] sd 0:0:0:0: [sda] Attached SCSI disk
sda: rw=4096, sector=2, nr_sectors = 2 limit=0
[332162.048953] EXT4-fs (sda): unable to read superblock
[332816.960674] sd 0:0:0:0: [sda] Unit Not Ready
[332816.960690] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332816.960700] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332816.961241] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[332816.961255] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332816.961264] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332816.961945] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[332816.961955] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332816.961960] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332823.835766] sd 0:0:0:0: [sda] Unit Not Ready
[332823.835783] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332823.835792] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332823.836349] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[332823.836359] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332823.836365] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332823.836892] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[332823.836896] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332823.836899] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332930.787061] sd 0:0:0:0: [sda] Unit Not Ready
[332930.787077] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332930.787086] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332930.788060] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[332930.788071] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332930.788077] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332930.788932] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[332930.788937] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[332930.788942] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[332930.789163] sd 0:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
[332930.789166] sd 0:0:0:0: [sda] 0-byte physical blocks
[332930.789758] sd 0:0:0:0: [sda] Test WP failed, assume Write Enabled
[332930.789955] sd 0:0:0:0: [sda] Asking for cache data failed
[332930.789959] sd 0:0:0:0: [sda] Assuming drive cache: write through
[332930.790552] sd 0:0:0:0: [sda] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[332930.790556] sd 0:0:0:0: [sda] Optimal transfer size 33553920 bytes not a multiple of physical block size (0 bytes)
[332930.791424] sd 0:0:0:0: [sda] Attached SCSI disk
sda: rw=4096, sector=2, nr_sectors = 2 limit=0
[333074.300411] EXT4-fs (sda): unable to read superblock
```
Most of it didn't seem too helpful, besides maybe `Volume was not properly unmounted. Some data may be corrupt. Please run fsck.` So I ran `sudo fsck /dev/sda`, and the answer I got was (roughly translated):

```
fsck.ext2: Invalid argument while trying to open /dev/sda
The superblock could not be read, or does not describe a valid ext2/ext3/ext4 filesystem.
If the device is valid and truly contains an ext2/ext3/ext4 filesystem (and not swap, ufs, or something else),
then the superblock is corrupt, and you could try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
or
    e2fsck -b 32768 <device>
```
At this point I am reluctant to try my luck with `e2fsck` commands. I hope Nvidia support has a solution, but they haven't answered me yet and probably won't over the weekend. I should clarify -- when I put the drives back in the server, it boots normally, aside from the fact that it shuts down extremely quickly. So I would like to avoid potentially FS-breaking solutions, as the data is still "technically" intact. I haven't tried plugging in a bootable key to do the data transfers, since I assume the problem would be identical, but maybe I will.
I apologize for the lack of some details; the station was bought and used "as is", and Nvidia puts a lot of proprietary things in it that are extremely prone to breaking if you tweak them, so I didn't dig very deep into the station's exact configuration.
Thanks a bunch for your time!
Frost
Edit: user @frostschutz suggested that I look a little more closely at the `dmesg` output; this is what I got when plugging the drive in, if it is any help:
```
[397566.232752] usb 2-2: new SuperSpeed USB device number 5 using xhci_hcd
[397566.246802] usb 2-2: New USB device found, idVendor=152d, idProduct=0578, bcdDevice= 5.08
[397566.246818] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[397566.246823] usb 2-2: Product: USB
[397566.246827] usb 2-2: Manufacturer: jmicron
[397566.246831] usb 2-2: SerialNumber: 0000000000080
[397566.249988] scsi host0: uas
[397566.250953] scsi 0:0:0:0: Direct-Access USB 3.0 0508 PQ: 0 ANSI: 6
[397566.254890] sd 0:0:0:0: Attached scsi generic sg0 type 0
[397572.213020] sd 0:0:0:0: [sda] Unit Not Ready
[397572.213036] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[397572.213047] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[397572.214115] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[397572.214127] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[397572.214133] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[397572.215017] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[397572.215027] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[397572.215033] sd 0:0:0:0: [sda] ASC=0x44 >ASCQ=0x81
[397572.215227] sd 0:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
[397572.215231] sd 0:0:0:0: [sda] 0-byte physical blocks
[397572.215768] sd 0:0:0:0: [sda] Test WP failed, assume Write Enabled
[397572.215931] sd 0:0:0:0: [sda] Asking for cache data failed
[397572.215933] sd 0:0:0:0: [sda] Assuming drive cache: write through
[397572.216484] sd 0:0:0:0: [sda] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[397572.216488] sd 0:0:0:0: [sda] Optimal transfer size 33553920 bytes not a multiple of physical block size (0 bytes)
[397572.217257] sd 0:0:0:0: [sda] Attached SCSI disk
```
Frost
(21 rep)
Sep 14, 2024, 04:44 PM
• Last activity: Sep 15, 2024, 08:34 AM
1
votes
0
answers
103
views
slub_min_objects: how come 0 stands as a valid / default value, and what is the rationale?
In the comments of the source code (`mm/slub.c`) of my linux_5.4 kernel, I can read:
> In order to reach satisfactory performance we must ensure that a minimum number of objects is in one slab. Otherwise we may generate too much activity on the partial lists which requires taking the list_lock. This is less a concern for large slabs though which are rarely used.
My understanding of *"a minimum number of objects"* is: anything **> 0**.
From some benchmarks done under some linux-4 kernel, I had even understood that the optimal value had been found to be 2 × the number of CPUs.
In the documentation (Documentation/vm/slub.rst) I can read :
> .. slub_min_objects=x (default 4)
Having 2 cores, I feel happy with that default value. However, my boot log says:

```
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
```
Of course I can set `slub_min_objects=4` on my boot command line and successfully get MinObjects=4. However, I would like to understand how come the default value went down to 0.
Have things changed elsewhere in the Linux kernel justifying that 0 is actually the best possible value? Or should we in fact no longer care about this as a condition for better performance?
**UPDATE 1**: following the first comment from @Paul_Pedant.
The trace in the bootlog appears to be the result of the following instruction:

```
pr_info("SLUB: HWalign=%d, Order=%u-%u, MinObjects=%u, CPUs=%u, Nodes=%u\n",
        cache_line_size(), slub_min_order, slub_max_order, slub_min_objects,
        nr_cpu_ids, nr_node_ids);
```

with `slub_min_objects` being set by:

```
get_option(&str, (int *)&slub_min_objects);
```

This obviously confirms Paul_Pedant's guess: ***"the bootlog is reporting the user-supplied value"***.
The following code sheds light on the value that will actually be used as the minimum number of objects:
```
min_objects = slub_min_objects;
if (!min_objects)
        min_objects = 4 * (fls(nr_cpu_ids) + 1);
```

That is to say, either the boot command line parameter or a default value based on the number of CPUs. In my case (2 CPUs, position of the most significant set bit = 2):
min_objects = 4 × (2 + 1) = **12**
The default for 4 <= CPUs < 8 would be 16, and for 8 <= CPUs < 16 it would be 20.
And BTW, the documented default value of 4 possibly dates from single-processor systems.
This question could be considered closed if I could find some rationale behind this scaling.
MC68020
(8557 rep)
Apr 24, 2021, 08:50 AM
• Last activity: Sep 29, 2023, 08:35 AM
18
votes
10
answers
4408
views
Filter file by line number
Given a file L with one non-negative integer per line and text file F, what would be a fast way to keep only those lines in F, whose line number appears in file L?
Example:

```
$ cat L.txt
1
3
$ cat F.txt
Hello World
Hallo Welt
Hola mundo
$ command-in-question -x L.txt F.txt
Hello World
Hola mundo
```
I'm looking for a command that can handle a file L with 500 million or more entries; file L is sorted numerically.
Note: I'm halfway through an implementation for a `command-in-question`, but I just wondered whether one might be able to use some Unix tools here as well.
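For illustration, a classic standard-tools approach (a sketch; it keeps all the wanted line numbers from L in an in-memory hash, so it may struggle at 500 million entries):

```
awk 'NR == FNR { keep[$1]; next }   # first file: remember the wanted line numbers from L
     FNR in keep                    # second file: print F lines whose number was remembered
' L.txt F.txt
```

Since L is sorted, a streaming solution that reads both files in lockstep could avoid that memory cost entirely.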
----
Update: Thanks for all the answers, I learned a lot today! I would like to accept more than one answer, but that's not possible.
I took the fastest solution from the current answers and put it into a standalone tool: [filterline](https://github.com/miku/filterline).
miku
(683 rep)
Jun 13, 2015, 10:27 AM
• Last activity: Apr 17, 2023, 09:04 AM
11
votes
2
answers
6273
views
Set CPU to high performance
I spent hours searching for an answer on the Internet. Nothing I could find helps. I have an Intel i9-9980HK, running under Ubuntu 20.04, kernel 5.4.0-33.
The problem is that under full load the CPU lowers the frequency to 2.7 GHz, I guess in order to stay under a low power budget. Whatever I try, I can't make it run faster. It stays under 65°C, quietly and slowly crunching numbers. For comparison, the same machine under Windows runs from 3 to 4+ GHz under full load.
What I tried:
- Change the governor to `performance`. No effect.
- Set `/sys/devices/system/cpu/cpufreq/policyX/energy_performance_preference` to `performance`. No effect.
- `sudo service thermald stop`. No effect.
- Increase `/sys/devices/system/cpu/intel_pstate/turbo_pct`. Access denied even for root.
- Increase `/sys/devices/system/cpu/cpufreq/policyX/scaling_min_freq`. No effect.

I am lost. What does it want? Btw, `/sys/devices/system/cpu/intel_pstate/status` is `active`.
**Update**. I think I know the reason. When `intel_pstate` is active, it ignores all the settings (like the governor, and everything under `/sys/devices/system/cpu/cpufreq`). Tools like `cpupower` cannot control `intel_pstate`. So the question pretty much boils down to how to control the `intel_pstate` driver.
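For what it's worth, `intel_pstate` exposes a few knobs of its own under sysfs (a sketch; whether these help depends on the platform's power limits, which on mobile chips often cap sustained frequency regardless):

```
# 0 = turbo allowed, 1 = turbo disabled
echo 0   > /sys/devices/system/cpu/intel_pstate/no_turbo

# widen the allowed performance range to 100%
echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
```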
facetus
(308 rep)
Jun 19, 2020, 03:38 AM
• Last activity: Nov 24, 2022, 04:17 PM
-2
votes
1
answers
501
views
How to display only the number of jobs running on the HPC by a specific user?
I want to display only the number of jobs running on the HPC that are related to my username!
I don't want to see all the jobs the way the **`squeue`** command shows them!
Thanks!
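A minimal sketch using standard Slurm options (`-u` filters by user, `-t RUNNING` by state, `-h` suppresses the header line):

```
# count only my running jobs
squeue -u "$USER" -t RUNNING -h | wc -l
```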
Mahedi
(1 rep)
Nov 14, 2022, 02:08 PM
• Last activity: Nov 14, 2022, 02:10 PM
18
votes
4
answers
62101
views
Linux: how to know which processes are pinned to which core?
Is there a way to know which cores currently have a process pinned to them? Even processes run by other users should be listed in the output. Or, is it possible to try pinning a process to a core, but fail in case the required core already has a process pinned to it?
PS: processes of interest must have been pinned to the given cores, not just be currently running on the given core.
PS: this is not a duplicate; the other question is about how to ensure exclusive use of one CPU by one process. Here we are asking how to detect that a process was pinned to a given core (i.e. that cpuset was used), not how to use it.
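One way to approximate this with standard interfaces (a sketch; an affinity mask narrower than the machine's full CPU set suggests explicit pinning, though it cannot tell taskset apart from cpusets or systemd slices):

```
# print each PID with its allowed-CPU list from /proc;
# entries narrower than the full core range indicate some form of pinning
for pid in /proc/[0-9]*; do
    printf '%s\t%s\n' "${pid#/proc/}" \
        "$(awk '/^Cpus_allowed_list/ {print $2}' "$pid/status" 2>/dev/null)"
done
```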
daruma
(488 rep)
Feb 19, 2018, 02:59 AM
• Last activity: May 28, 2022, 05:29 AM
0
votes
0
answers
130
views
Is it possible to install Virtual Box or other free virtual machines on a cluster?
We have a CentOS HPC cluster, and despite our asking, the admins do not like to give us root access. I have more experience finding the packages I need in Ubuntu's apt, conda-forge, etc. I am wondering if it is possible to ask them to install a virtual machine for us, so they need not worry about us breaking the OS. Does anyone know if that is possible, and if so, can multiple cluster nodes be utilized from within an Ubuntu virtual machine installed on CentOS? I assume it is reasonable to expect a loss of performance for parallel simulations, but I can live with even a moderate loss.
MathX
(101 rep)
Apr 29, 2022, 08:46 PM
0
votes
1
answers
571
views
Why is RAMFS much slower than Ram?
I have 64GB of DDR4 3200MHz memory installed in my PC. When I run `sysbench`, I get the following results:
```
# sysbench memory --memory-block-size=1M --memory-total-size=10G run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1024KiB
total size: 10240MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 10240 (26329.53 per second)
10240.00 MiB transferred (26329.53 MiB/sec)
General statistics:
total time: 0.3876s
total number of events: 10240
Latency (ms):
min: 0.04
avg: 0.04
max: 0.08
95th percentile: 0.04
sum: 386.04
Threads fairness:
events (avg/stddev): 10240.0000/0.00
execution time (avg/stddev): 0.3860/0.00
```
It indicates it works at up to 26 GB/s. So far so good. But when I mount `ramfs` and try a similar test, this number drops considerably:

```
# mount -t ramfs -o size=11G ramfs /mnt/ramfs/
# dd if=/dev/zero of=/mnt/ramfs/zero.img bs=1G count=10 conv=fdatasync
10+0 records in
10+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 3.51899 s, 3.1 GB/s
```
This indicates I'm only getting around 3GB/s write speed. I understand the file system has some overhead, but a drop from 26GB/s to 3GB/s would be a really big overhead.
UPDATE 1 - Another test:

```
# time head -c 11G /dev/zero > /mnt/ramfs/zero.img

real 0m3.046s
user 0m0.225s
sys 0m2.808s
```
Am I doing something wrong? Is there a way to increase performance on RAMFS? Why is RAMFS so much slower than RAM itself?
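One detail worth checking (an observation, not a definitive diagnosis): the sysbench run writes 1 MiB blocks that stay cache-hot, while the `dd` run pushes each byte through a 1 GiB user buffer and then into the page cache, so the data is touched at least twice. A sketch of a closer like-for-like comparison using the same block size as sysbench:

```
# same 10 GiB total, but 1 MiB blocks as in the sysbench test
dd if=/dev/zero of=/mnt/ramfs/zero.img bs=1M count=10240
```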
Diogo Melo
(153 rep)
Jul 23, 2021, 09:29 PM
• Last activity: Jul 23, 2021, 11:45 PM
0
votes
2
answers
91
views
Is It Possible To Track GPU Performance Increase?
CentOS7
I'm about to upgrade my GPU. Before I take action, I am curious if there are any tests I can run on the CLI that will measure the performance of my current GPU, so I can compare it to the new one.
For example, for hard drives I use `hdparm` to measure performance increases. I'm curious whether there is something like this for graphics cards. My new GPU is going to be a massive upgrade, and I'd like to document the performance difference if possible.
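A couple of CLI-driven possibilities (a sketch; on CentOS 7 these packages may need EPEL or third-party repos): `glmark2` runs a series of OpenGL scenes and prints a single comparable score, while `glxgears` only gives a rough FPS figure and is widely considered a poor benchmark:

```
# rough FPS figure; with Mesa drivers, vblank_mode=0 disables vsync
vblank_mode=0 glxgears

# OpenGL benchmark that ends with a summary score
glmark2
```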
mister mcdoogle
(505 rep)
Feb 16, 2021, 11:24 PM
• Last activity: Feb 28, 2021, 06:40 AM
0
votes
1
answers
460
views
How to tune the linux Scheduler for parallel computation?
I have a linux machine dedicated to running some parallel computation, and I'm trying to understand how to choose / tune the scheduler, and perhaps other parameters, to extract the most performance (this is deployed using AWS, so there's also some choice of what linux distribution to use, if that matters).
I've implemented the computation in Java because there are some subtle dependencies between different parts of the computation (there are about 5K "tasks" in all, but a task typically requires information from other tasks at multiple points in its execution). I want to consider two implementations.
Current Implementation
----------------------
In the current implementation, the number of threads is equal to the number of cores, and each thread picks up a task which is not waiting for any info, works on it until it is halted by some missing info, at which point it drops the task and picks up another. This continues until the computation is finished.
Here, it seems desirable that each CPU should be bound to a single thread throughout. **Should I "tell" the scheduler not to perform any timeslicing, or will this occur naturally? How do I make sure?**
Another Possible Implementation
-------------------------------
I may change this so that each "task" has its own thread, using Java's `wait()` and `notify()` paradigm instead of picking up and dropping compute tasks. Feel free to comment on the desirability of this change (having 5K tasks = threads on a 96-core machine, possibly fewer if I can speed things up). But more importantly, assuming I implement this:
**how can I tell the scheduler to work with the largest possible time slices, except as mandated by the `wait()` and `notify()` calls? Will using Java's `yield()` help?**
Some Related References
-----------------------
[This answer](https://unix.stackexchange.com/questions/466722/how-to-change-the-length-of-time-slices-used-by-the-linux-cpu-scheduler) has some useful background on scheduling, and references [this](https://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com) and [this](https://bugzilla.redhat.com/show_bug.cgi?id=969491) for some more tunable parameters. The latter specifically mention a queue contention which I've also noticed when trying to scale up the number of processors in the "current implementation" above.
**ADDENDUM**: [this](https://www.cs.ryerson.ca/mes/courses/cps530/programs/threads/Controlling/stateTransitions.html) seems to say that Unix (and Linux?) does not time-slice at all, and that the only way a thread is interrupted is if it is "preempted" by a higher-priority thread or initiates some blocking operation. Is this really the case?
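For the current implementation, one-thread-per-core placement can be made explicit rather than hoped for (a sketch; the core range and class name are illustrative):

```
# run the JVM confined to cores 0-95 so the kernel has no reason
# to migrate its worker threads between sockets
taskset -c 0-95 java MyComputation

# or re-pin an individual native thread once its TID is known
taskset -cp 0 <native-thread-id>
```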
Many thanks!
Just Me
(101 rep)
Jan 3, 2021, 10:21 AM
• Last activity: Jan 3, 2021, 05:40 PM
3
votes
0
answers
1168
views
How can I display graphics on the screen from the kernel over the top of my X11/Wayland session?
I want to draw simple (2D bitmapped) graphics onto my screen (in response to (simple) external inputs) with the lowest latency possible (the order of tens of milliseconds) so I can empirically test the results of drawing to the screen a) **in realtime**, b) with the least overhead possible, c) with page flipping completely disabled (tearing is fine).
Then I can compare this with drawing to the screen in various stereotypical (and arguably pathological :) ) scenarios (eg X11, Wayland, Wayland+XWayland, Wayland+XWayland+xcompmgr, etc).
To this end, how might I modify Linux so that I can draw **over the top of** my existing X11/Wayland session? Worded differently, yes, I want to fiddle with DRM (Direct Rendering Manager),
a. *from inside the kernel*,
b. *while X11 owns the DRM master*. :)
I suspect that a hardware overlay would be the easiest (and most hardware-accelerated!) way to keep X11/Wayland from scribbling over what I'm drawing. (I'm imagining alternatives involving implementing a shadow/write-through framebuffer cache, to avoid read-back... no thanks!)
I really, really want to confine my madness to a kernel module, and I don't want to have to fiddle with my graphics driver. :S (I'm incidentally using Intel graphics, albeit on an i915)
So, reading through https://dri.freedesktop.org/docs/drm/ , I get the idea that *maybe* I want to try and hack together something that lets me create a dumb framebuffer, create a plane looking at that dumb framebuffer, and then *double-maybe* set up some kind of DMA-BUF... no, wait, if I've got a plane pointing at the framebuffer, it's already on the screen... I think?
My main question is, how do I play fast and loose with DRM such that it basically tells X it's the only thing talking to the screen, while behind the scenes I'm managing an extra plane?
Answers very greatly appreciated!
---
My current understanding of Wayland is that it is fundamentally based around compositing, and by design cannot function without buffering one or more entire video frames before releasing it/them to the video card.
While X11 does not have this restriction - `COMPOSITE` is an optional extension - it uses a stream-based drawing protocol, and I am not aware of any method to draw directly into a window. The closest thing I know of is the use of the `MIT-SHM` extension, and while this does allow the use of shared memory, that involves at least two memory copies (me->kernel, kernel->X11), *and then* the poking of an `XShmPutImage` down the X11 pipe to tell X11 to please flip. This means that, even if I were to make my process run in realtime... well, I'm too chicken to run X in realtime as well, so I'd still have to wait for X to be scheduled, decode and reach my request in its command queue, and finally flip.
Hence my throwing my hands up in the air and trying to see if I can just shove my graphics drawing code straight into the kernel, and hopefully make it all coexist somehow.
I can imagine all of this additional overhead really adds up, and I want to quantify exactly *how* - or, alternatively, concretely establish that my mental models are completely incorrect and that the speed of contemporary hardware makes these concerns moot.
I am ***extremely*** curious to see what the impact would be if I eliminated all the bottlenecks, and also what the difference would be like comparing older systems versus more modern hardware.
Incidentally, this question is related to another question I asked [over here](https://unix.stackexchange.com/questions/500167/how-do-i-find-the-video-memory-regions-representing-whats-on-my-screen-from) and [over here](https://stackoverflow.com/questions/42748390/directly-accessing-video-memory-within-the-linux-kernel-in-a-driver-agnostic-man) , both of which sadly still have no answers, and which I am no further forward on.
i336_
(1077 rep)
Aug 15, 2019, 03:10 PM
• Last activity: Jun 26, 2020, 05:45 PM
42
votes
4
answers
6754
views
Why is Linux commonly used as operating system for supercomputers?
As of November 2010, Linux is used on 459 out of the 500 supercomputers of the TOP500. Refer to the table via the Internet Archive.
What are the reasons behind this massive use of Linux in the supercomputer space?
orftz
(687 rep)
Jun 4, 2011, 09:50 PM
• Last activity: Nov 4, 2019, 01:57 PM
1
votes
2
answers
575
views
apache mod_security and performance
I have a list of bots to block, so I was thinking that fail2ban could be a solution, until I realized that mod_security would be more efficient for this kind of task.
The number of bots is huge, so the configuration file will contain a long list.
My question is about performance (memory, processor, disk, etc.):
Will having a huge list of bots to block affect the performance of Apache on a site with heavy traffic?
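For scale, the usual way to match a long list in ModSecurity is `@pmFromFile`, which loads all the patterns into one set-based matcher instead of evaluating one rule per bot (a sketch; the rule id and file path are placeholders):

```
# block any request whose User-Agent matches a line of bad-bots.txt
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile /etc/httpd/bad-bots.txt" \
    "id:1000001,phase:1,deny,status:403,msg:'Blocked bot'"
```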
4m1nh4j1
(1923 rep)
Jun 25, 2014, 12:27 PM
• Last activity: Oct 24, 2019, 02:01 AM
2
votes
2
answers
1065
views
Code for submitting job on cluster
I use the following code to submit a job on a cluster, but I don't know what this code means. Can someone explain to me what the following code means, if possible line by line?
```
#!/bin/bash
#PBS -N NAME_OF_JOB        # job name shown in the queue
#PBS -l nodes=1:ppn=20     # request 1 node with 20 processors per node
#PBS -l matlab_user=1      # site-defined resource: 1 MATLAB user license
#PBS -l matlab_lic=20      # site-defined resource: 20 MATLAB licenses
#PBS -l min_walltime=1:00  # site-defined resource: minimum walltime
#PBS -q small              # submit to the "small" queue
#PBS -S /bin/bash          # run the job script under /bin/bash
##PBS -V                   # disabled (##): would export the submission environment
##PBS -m abe               # disabled (##): would send mail on abort/begin/end
#PBS -j oe                 # merge stderr into the stdout file
#
cd $PBS_O_WORKDIR          # change to the directory the job was submitted from
cat $PBS_NODEFILE          # print the list of nodes allocated to this job
export PATH=/opt/software/matlabr2014a/mdcs/bin:$PATH
matlab -nodisplay -r "code1" -logfile code1.log  # run code1.m headless, logging to code1.log
```
Thanks
pkj
(123 rep)
Apr 11, 2015, 07:26 AM
• Last activity: Mar 9, 2019, 01:28 PM
1
votes
1
answers
478
views
Live changing bjobs output
When using the LSF command `bjobs`, I would like the output to change instantly when I submit another job, because it is stressful to run the same command again and again. I would like something like `top`, which keeps refreshing its list of processes.

[screenshot: I have to re-run the command to see updates]

In `top` that is not needed; it auto-refreshes again and again. I would like to auto-refresh the output of the `bjobs` command automatically.
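A minimal sketch using the standard `watch` utility, which re-runs a command at a fixed interval and repaints the terminal, `top`-style:

```
# refresh the bjobs output every 5 seconds
watch -n 5 bjobs
```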
Joshua Salazar
(385 rep)
Feb 11, 2019, 03:41 PM
• Last activity: Feb 11, 2019, 04:18 PM