Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
0 votes, 0 answers, 45 views
Can "perf mem" command detect remote memory access on CXL NUMA nodes?
I wonder whether `perf mem` can detect remote memory accesses on CXL NUMA nodes. I have an AMD EPYC 9654 server, and the CXL memory is on NUMA node 2. I ran a task on node 0 that accessed the remote node 2 memory continuously. But unfortunately I could not test this on my machine, because `perf mem` doesn't work on AMD CPUs (https://community.amd.com/t5/server-processors/issues-with-perf-mem-record/m-p/95270).
Who can help me?
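For reference, on hardware where `perf mem` works, the kind of test I have in mind looks like this (a sketch; `./reader` is a placeholder for the memory-access task):
```
# generate guaranteed cross-node traffic: CPUs on node 0, memory forced onto node 2
numactl --cpunodebind=0 --membind=2 ./reader &

# sample loads/stores system-wide for a few seconds, then inspect the
# data-source columns, where remote accesses should show up
sudo perf mem record -a -- sleep 5
sudo perf mem report --sort=mem
```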
SeenThrough
(1 rep)
Feb 11, 2025, 02:01 AM
0 votes, 0 answers, 38 views
How can I enable support for the HMAT table?
I have access to a server and want to check its HMAT table.
However, the HMAT table is not present (the SRAT and SLIT are though).
I checked the Linux kernel config, and HMAT support is enabled (`CONFIG_ACPI_HMAT=y` and `CONFIG_ACPI=y`), so the issue is probably with the hardware or firmware.
Can I enable the HMAT, and if so, how?
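For reference, table presence can be checked like this (a sketch; `acpidump` comes from the acpica-tools package):
```
# exported ACPI tables; HMAT is absent here, SRAT/SLIT are present
ls /sys/firmware/acpi/tables/ | grep -iE 'hmat|srat|slit'

# dump a specific table by signature (prints nothing if the
# firmware does not provide it)
sudo acpidump -n HMAT
```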
Here's the server spec (let me know if more information is needed):
$ uname -a
Linux node0.acpi-tinkering-0.prismgt-pg0.clemson.cloudlab.us 5.15.0-122-generic #132-Ubuntu SMP Thu Aug 29 13:45:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ sudo dmidecode -t 2
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3 present.
Handle 0x0200, DMI type 2, 8 bytes
Base Board Information
Manufacturer: Dell Inc.
Product Name: 024PW1
Version: A00
Serial Number: .13D52G3.CNIVC001610605.
$ sudo dmidecode -t 0
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3 present.
Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: Dell Inc.
Version: 2.8.4
Release Date: 06/23/2022
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 32 MB
Characteristics:
ISA is supported
PCI is supported
PNP is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
EDD is supported
Japanese floppy for Toshiba 1.2 MB is supported (int 13h)
5.25"/360 kB floppy services are supported (int 13h)
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
CGA/mono video services are supported (int 10h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Function key-initiated network boot is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 2.8
$ sudo dmidecode -t 4
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3 present.
Handle 0x0400, DMI type 4, 48 bytes
Processor Information
Socket Designation: CPU1
Type: Central Processor
Family: Zen
Manufacturer: AMD
ID: 11 0F A0 00 FF FB 8B 17
Signature: Family 25, Model 1, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
HTT (Multi-threading)
Version: AMD EPYC 7543 32-Core Processor
Voltage: 1.8 V
External Clock: 16000 MHz
Max Speed: 3900 MHz
Current Speed: 2800 MHz
Status: Populated, Enabled
Upgrade: Socket SP3
L1 Cache Handle: 0x0700
L2 Cache Handle: 0x0701
L3 Cache Handle: 0x0702
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 32
Core Enabled: 32
Thread Count: 64
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Handle 0x0401, DMI type 4, 48 bytes
Processor Information
Socket Designation: CPU2
Type: Central Processor
Family: Zen
Manufacturer: AMD
ID: 11 0F A0 00 FF FB 8B 17
Signature: Family 25, Model 1, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
HTT (Multi-threading)
Version: AMD EPYC 7543 32-Core Processor
Voltage: 1.8 V
External Clock: 16000 MHz
Max Speed: 3900 MHz
Current Speed: 2800 MHz
Status: Populated, Enabled
Upgrade: Socket SP3
L1 Cache Handle: 0x0703
L2 Cache Handle: 0x0704
L3 Cache Handle: 0x0705
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 32
Core Enabled: 32
Thread Count: 64
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Matteo
(73 rep)
Dec 6, 2024, 07:34 PM
0 votes, 0 answers, 17 views
What does the ACPI docs about NUMA nodes mean by "dynamic migration"?
With reference to the following section of the ACPI docs about NUMA nodes: https://drops.dagstuhl.de/storage/01oasics/oasics-vol116-parma-ditam2024/OASIcs.PARMA-DITAM.2024.3/OASIcs.PARMA-DITAM.2024.3.pdf
what does "dynamic migration of the devices" mean?
Does it refer to physically migrating hardware (e.g. physically moving a DIMM from one slot to another), or to a logical operation (i.e. nothing changes physically)?
Matteo
(73 rep)
Dec 5, 2024, 10:15 PM
0 votes, 0 answers, 105 views
Understanding how CPU threads are used by a multithreaded application
Recently, I did some research on how CPU cores are used by a multi-threaded application. I can see which core each thread is running on with the following command (ceph-osd is a multi-threaded application):
for i in $(pgrep ceph-osd); do ps -mo pid,tid,fname,user,psr -p $i; done
My CPU has 32 cores and 72 threads, so ceph-osd will have 72 TIDs (thread IDs):
PID TID COMMAND USER PSR
336157 - ceph-osd ceph -
- 336157 - ceph 6
- 336160 - ceph 51
- 336162 - ceph 57
- 336163 - ceph 23
- 336164 - ceph 22
- 336168 - ceph 7
- 336169 - ceph 17
- 336203 - ceph 1
...
...
But what I don't understand is:
- Does this ceph-osd process use all the CPU cores, or are its threads just allocated and perhaps not always running?
- If I use numactl to define affinity and bind the process to specific cores, will it make any difference (see the sketch below)?
- There is more than one ceph-osd process on my server, so will binding them manually help improve performance?
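This is the kind of binding I have in mind (a sketch; the node numbers, core list, PID, and `<osd-command>` are placeholders):
```
# pin one OSD (and all of its threads) to the cores and memory of NUMA node 0
numactl --cpunodebind=0 --membind=0 <osd-command> &

# or restrict an already-running process, including all its threads (-a),
# to an explicit core list
taskset -acp 0-15 336157
```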
Thanks in advance.
huynp
(3 rep)
Nov 20, 2023, 12:55 PM
0 votes, 1 answer, 164 views
In a computer with two CPUs, how can I get the physical address of local memory of each CPU?
My computer has the following specifications:
- 2 Intel x86_64 CPUs
- 8 GB total memory (4 GB per CPU)
- OS: Rocky Linux 9
I want to reserve 1 GB of memory per CPU using the `memmap` parameter in GRUB. I checked `dmesg` and `/proc/meminfo` and even used `numactl -H`, but I couldn't find the physical address of the memory local to each CPU.
How can I get the physical addresses of each CPU's local memory?
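One thing that might help (a sketch; it assumes the kernel logged its SRAT parsing at boot): the per-node physical address ranges usually show up in the boot log, and sysfs memory blocks are tagged with their node.
```
# physical address ranges the firmware assigned to each NUMA node
dmesg | grep -i 'SRAT: Node'

# alternatively, map sysfs memory blocks to nodes
cat /sys/devices/system/memory/block_size_bytes
ls -d /sys/devices/system/node/node*/memory* | head
```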
raon0ms
(11 rep)
May 16, 2023, 08:05 AM
• Last activity: Jun 6, 2023, 03:00 PM
1 vote, 1 answer, 745 views
NUMA aware caching on linux
This is a follow-up question to https://unix.stackexchange.com/questions/733250/dentry-inode-caching-on-multi-cpu-machines-memory-allocator-configuration, but here I try to put the question differently.
My problem is that I have a dual-socket machine, and memory for the kernel caches (dentry/inode/buffer) is allocated from bank 0 (cpu0's memory bank), which eventually gets consumed. However, bank 1 is never used for caches, so there is plenty of free memory in the overall system. In this state the memory allocator hands out memory from bank 1, regardless of where my process is running (even if I set memory affinity). Because of the different latency when accessing memory on different banks, my process (which is somewhat memory-access bound with a low cache-hit ratio) runs much slower when scheduled on the cores of cpu0 than on the cores of cpu1. (I'd like to schedule two processes, one per CPU, each using all the cores of its CPU; I don't want to waste half the cores.)
What can I do to ensure that my process gets memory from the local bank, no matter which CPU it is scheduled on?
I tried playing with the kernel VM parameters, but they don't really change anything; after all, half the memory is free! These kernel caches simply do not seem to take NUMA into account. I also looked into cgroups, but as far as I can tell, I can't really control the kernel that way. I did not find anything that addresses my issue :-(.
I can, of course, drop all caches before starting my processes, but that is a bit heavy-handed. A cleaner solution would be, for example, to limit the total cache size (say, to 8 GB). True, cpu0 would still have a bit less memory than cpu1 (I have 64 GB in each bank), but I can live with that.
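For diagnosis, the per-node imbalance and a slab-only (rather than full) cache drop can be checked like this (a sketch):
```
# per-NUMA-node memory breakdown, including file cache and slab
numastat -m | head -n 25

# reclaim only slab objects (dentries/inodes), leaving the page cache alone
echo 2 | sudo tee /proc/sys/vm/drop_caches
```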
I'd be grateful for any suggestions... Thanks!
LaszloLadanyi
(153 rep)
Jan 29, 2023, 05:03 PM
• Last activity: Feb 19, 2023, 10:48 PM
0 votes, 0 answers, 156 views
Are there any "gotchas" when using NUMA on Linux?
I have a new system with two socketed CPUs. I've heard that there can be bottlenecks when working with NUMA systems if an application on processor 0 tries to access memory attached to processor 1. How does the Linux kernel handle running software on a NUMA system?
- Would the kernel (automatically) prioritize putting all of an application's threads and memory on one CPU, if possible?
- What if the application is CPU-heavy and creates more threads than one CPU has cores?
- What about VMs?
- If you created a KVM VM with fewer resources than a single CPU has access to (both cores and memory), would it work optimally out of the box, or would you need to set the VM's affinity manually?
- What if you wanted a VM that used more than a single CPU's resources, but you emulated NUMA? Would that work as expected out of the box too?
- If I installed bog-standard Ubuntu or CentOS, would I have to do anything to make it NUMA-aware?
I suppose this question is very general, but I truly don't know much about how NUMA works, and I have found little documentation about it.
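One concrete knob worth checking (a sketch; it assumes a kernel built with CONFIG_NUMA_BALANCING): automatic NUMA balancing, which migrates tasks and pages toward each other.
```
# 1 = kernel automatically migrates pages/tasks toward local nodes
cat /proc/sys/kernel/numa_balancing

# show the node topology the kernel detected
numactl --hardware
```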
Kaiden Prince
(101 rep)
Jan 22, 2023, 03:05 PM
2 votes, 1 answer, 237 views
Processes ignore global CPUAffinity settings
I am setting a global CPUAffinity via `/etc/systemd/system.conf`; see the snippet below:
root@PC1-03:~# cat /etc/systemd/system.conf
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See systemd-system.conf(5) for details.
[Manager]
#LogLevel=info
#LogTarget=journal-or-kmsg
#LogColor=yes
#LogLocation=no
#LogTime=no
#DumpCore=yes
#ShowStatus=yes
#CrashChangeVT=no
#CrashShell=no
#CrashReboot=no
#CtrlAltDelBurstAction=reboot-force
CPUAffinity=2 3 4 5 6 7 10 11 12 13 14 15 18 19 20 21 22 23 26 27 28 29 30 31 34 35 36 37 38 39 42 43 44 45 46 47 50 51 52 53 54 55 58 59 60 61 62 63 66 67 68 69 70 71 74 75 76 77 78 79 82 83 84 85 86 87 90 91 92 93 94 95 98 99 100 101 102 103 106 107 108 109 110 111 114 115 116 117 118 119 122 123 124 125 126 127 130 131 132 133 134 135 138 139 140 141 142 143 146 147 148 149 150 151 154 155 156 157 158 159 162 163 164 165 166 167 170 171 172 173 174 175 178 179 180 181 182 183 186 187 188 189 190 191 194 195 196 197 198 199 202 203 204 205 206 207 210 211 212 213 214 215 218 219 220 221 222 223 226 227 228 229 230 231 234 235 236 237 238 239 242 243 244 245 246 247 250 251 252 253 254 255
#NUMAPolicy=default
#NUMAMask=
#RuntimeWatchdogSec=0
#RebootWatchdogSec=10min
#ShutdownWatchdogSec=10min
#KExecWatchdogSec=0
#WatchdogDevice=
#CapabilityBoundingSet=
#NoNewPrivileges=no
#SystemCallArchitectures=
#TimerSlackNSec=
#StatusUnitFormat=description
#DefaultTimerAccuracySec=1min
#DefaultStandardOutput=journal
#DefaultStandardError=inherit
#DefaultTimeoutStartSec=90s
#DefaultTimeoutStopSec=90s
#DefaultTimeoutAbortSec=
#DefaultRestartSec=100ms
#DefaultStartLimitIntervalSec=10s
#DefaultStartLimitBurst=5
#DefaultEnvironment=
#DefaultCPUAccounting=no
#DefaultIOAccounting=no
#DefaultIPAccounting=no
#DefaultBlockIOAccounting=no
#DefaultMemoryAccounting=yes
#DefaultTasksAccounting=yes
#DefaultTasksMax=15%
#DefaultLimitCPU=
#DefaultLimitFSIZE=
#DefaultLimitDATA=
#DefaultLimitSTACK=
#DefaultLimitCORE=
#DefaultLimitRSS=
#DefaultLimitNOFILE=1024:524288
#DefaultLimitAS=
#DefaultLimitNPROC=
#DefaultLimitMEMLOCK=
#DefaultLimitLOCKS=
#DefaultLimitSIGPENDING=
#DefaultLimitMSGQUEUE=
#DefaultLimitNICE=
#DefaultLimitRTPRIO=
#DefaultLimitRTTIME=
However, when running a dummy load, I observe a 14.6% load on CPU 0.
0[|| 14.6%] 16[ 0.0%] 32[ 0.0%] 48[ 0.0%] 64[ 0.0%] 80[ 0.0%] 96[ 0.0%] 112[ 0.0%] 128[ 0.0%] 144[ 0.0%] 160[ 0.0%] 176[ 0.0%] 192[ 0.0%] 208[ 0.0%] 224[ 0.0%] 240[ 0.0%]
1[ 0.0%] 17[ 0.0%] 33[ 0.0%] 49[ 0.0%] 65[ 0.0%] 81[ 0.0%] 97[ 0.0%] 113[ 0.0%] 129[ 0.0%] 145[ 0.0%] 161[ 0.0%] 177[ 0.0%] 193[ 0.0%] 209[ 0.0%] 225[ 0.0%] 241[ 0.0%]
2[|||||||100.0%] 18[|||||||100.0%] 34[|||||||100.0%] 50[|||||||100.0%] 66[|||||||100.0%] 82[|||||||100.0%] 98[|||||||100.0%] 114[|||||||100.0%] 130[|||||||100.0%] 146[|||||||100.0%] 162[|||||||100.0%] 178[|||||||100.0%] 194[|||||||100.0%] 210[|||||||100.0%] 226[|||||||100.0%] 242[|||||||100.0%]
3[|||||||100.0%] 19[|||||||100.0%] 35[|||||||100.0%] 51[|||||||100.0%] 67[|||||||100.0%] 83[|||||||100.0%] 99[|||||||100.0%] 115[|||||||100.0%] 131[|||||||100.0%] 147[|||||||100.0%] 163[|||||||100.0%] 179[|||||||100.0%] 195[|||||||100.0%] 211[|||||||100.0%] 227[|||||||100.0%] 243[|||||||100.0%]
4[|||||||100.0%] 20[|||||||100.0%] 36[|||||||100.0%] 52[|||||||100.0%] 68[|||||||100.0%] 84[|||||||100.0%] 100[|||||||100.0%] 116[|||||||100.0%] 132[|||||||100.0%] 148[|||||||100.0%] 164[|||||||100.0%] 180[|||||||100.0%] 196[|||||||100.0%] 212[|||||||100.0%] 228[|||||||100.0%] 244[|||||||100.0%]
5[|||||||100.0%] 21[|||||||100.0%] 37[|||||||100.0%] 53[|||||||100.0%] 69[|||||||100.0%] 85[|||||||100.0%] 101[|||||||100.0%] 117[|||||||100.0%] 133[|||||||100.0%] 149[|||||||100.0%] 165[|||||||100.0%] 181[|||||||100.0%] 197[|||||||100.0%] 213[|||||||100.0%] 229[|||||||100.0%] 245[|||||||100.0%]
6[|||||||100.0%] 22[|||||||100.0%] 38[|||||||100.0%] 54[|||||||100.0%] 70[|||||||100.0%] 86[|||||||100.0%] 102[|||||||100.0%] 118[|||||||100.0%] 134[|||||||100.0%] 150[|||||||100.0%] 166[|||||||100.0%] 182[|||||||100.0%] 198[|||||||100.0%] 214[|||||||100.0%] 230[|||||||100.0%] 246[|||||||100.0%]
7[|||||||100.0%] 23[|||||||100.0%] 39[|||||||100.0%] 55[|||||||100.0%] 71[|||||||100.0%] 87[|||||||100.0%] 103[|||||||100.0%] 119[|||||||100.0%] 135[|||||||100.0%] 151[|||||||100.0%] 167[|||||||100.0%] 183[|||||||100.0%] 199[|||||||100.0%] 215[|||||||100.0%] 231[|||||||100.0%] 247[|||||||100.0%]
8[ 0.0%] 24[ 0.0%] 40[ 0.0%] 56[ 0.0%] 72[ 0.0%] 88[ 0.0%] 104[ 0.0%] 120[ 0.0%] 136[ 0.0%] 152[ 0.0%] 168[ 0.0%] 184[ 0.0%] 200[ 0.0%] 216[ 0.0%] 232[ 0.0%] 248[ 0.0%]
9[ 0.0%] 25[ 0.0%] 41[ 0.0%] 57[ 0.0%] 73[ 0.0%] 89[ 0.0%] 105[ 0.0%] 121[ 0.0%] 137[ 0.0%] 153[ 0.0%] 169[ 0.0%] 185[ 0.0%] 201[ 0.0%] 217[ 0.0%] 233[ 0.0%] 249[ 0.0%]
10[|||||||100.0%] 26[|||||||100.0%] 42[|||||||100.0%] 58[|||||||100.0%] 74[|||||||100.0%] 90[|||||||100.0%] 106[|||||||100.0%] 122[|||||||100.0%] 138[|||||||100.0%] 154[|||||||100.0%] 170[|||||||100.0%] 186[|||||||100.0%] 202[|||||||100.0%] 218[|||||||100.0%] 234[|||||||100.0%] 250[|||||||100.0%]
11[|||||||100.0%] 27[|||||||100.0%] 43[|||||||100.0%] 59[|||||||100.0%] 75[|||||||100.0%] 91[|||||||100.0%] 107[|||||||100.0%] 123[|||||||100.0%] 139[|||||||100.0%] 155[|||||||100.0%] 171[|||||||100.0%] 187[|||||||100.0%] 203[|||||||100.0%] 219[|||||||100.0%] 235[|||||||100.0%] 251[|||||||100.0%]
12[|||||||100.0%] 28[|||||||100.0%] 44[|||||||100.0%] 60[|||||||100.0%] 76[|||||||100.0%] 92[|||||||100.0%] 108[|||||||100.0%] 124[|||||||100.0%] 140[|||||||100.0%] 156[|||||||100.0%] 172[|||||||100.0%] 188[|||||||100.0%] 204[|||||||100.0%] 220[|||||||100.0%] 236[|||||||100.0%] 252[|||||||100.0%]
13[|||||||100.0%] 29[|||||||100.0%] 45[|||||||100.0%] 61[|||||||100.0%] 77[|||||||100.0%] 93[|||||||100.0%] 109[|||||||100.0%] 125[|||||||100.0%] 141[|||||||100.0%] 157[|||||||100.0%] 173[|||||||100.0%] 189[|||||||100.0%] 205[|||||||100.0%] 221[|||||||100.0%] 237[|||||||100.0%] 253[|||||||100.0%]
14[|||||||100.0%] 30[|||||||100.0%] 46[|||||||100.0%] 62[|||||||100.0%] 78[|||||||100.0%] 94[|||||||100.0%] 110[|||||||100.0%] 126[|||||||100.0%] 142[|||||||100.0%] 158[|||||||100.0%] 174[|||||||100.0%] 190[|||||||100.0%] 206[|||||||100.0%] 222[|||||||100.0%] 238[|||||||100.0%] 254[|||||||100.0%]
15[|||||||100.0%] 31[|||||||100.0%] 47[|||||||100.0%] 63[|||||||100.0%] 79[|||||||100.0%] 95[|||||||100.0%] 111[|||||||100.0%] 127[|||||||100.0%] 143[|||||||100.0%] 159[|||||||100.0%] 175[|||||||100.0%] 191[|||||||100.0%] 207[|||||||100.0%] 223[|||||||100.0%] 239[|||||||100.0%] 255[|||||||100.0%]
I checked the processes running on that core, and some still exist. A small snippet is below.
root@PC1-03:~# ps -A -o psr,pid,args | grep '^ *0' | head -n 25
0 3 [rcu_gp]
0 4 [rcu_par_gp]
0 5 [netns]
0 7 [kworker/0:0H-events_highpri]
0 9 [kworker/0:1H-events_highpri]
0 11 [mm_percpu_wq]
0 12 [rcu_tasks_kthread]
0 13 [rcu_tasks_rude_kthread]
0 14 [rcu_tasks_trace_kthread]
0 15 [ksoftirqd/0]
0 17 [migration/0]
0 18 [kworker/0:1-events]
0 19 [cpuhp/0]
0 94 [kworker/15:0H]
0 110 [kworker/18:0H]
0 120 [kworker/20:0H]
0 125 [kworker/21:0H]
0 130 [kworker/22:0H]
0 140 [kworker/24:0H]
0 145 [kworker/25:0H]
0 165 [kworker/29:0H]
0 170 [kworker/30:0H]
0 186 [kworker/33:0H]
0 216 [kworker/39:0H]
0 221 [kworker/40:0H]
Is there additional configuration I need to set to make sure things do not run on the cores I want to keep free?
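For reference, this is how per-task affinity can be inspected (a sketch; the PIDs are placeholders). Note that the bracketed entries above are kernel threads, and per-CPU kernel threads (e.g. [ksoftirqd/0], [migration/0]) are pinned by the kernel itself, outside the reach of systemd's CPUAffinity=; isolcpus= or cpusets are the usual tools for keeping a core quiet:
```
# effective affinity of one of the kernel threads sitting on CPU 0
grep Cpus_allowed_list /proc/15/status

# affinity of a userspace process (PID 1 shown here)
taskset -cp 1
```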
Trevor K Smith
(71 rep)
Oct 4, 2022, 06:18 AM
• Last activity: Oct 15, 2022, 09:30 AM
2 votes, 1 answer, 272 views
Why does a 2-socket server show PCIe locations but the 4-socket server does not (how can I find the PCIe locations on the 4-socket server)?
I have two servers:
- 2 socket Supermicro X9DBL-3F
- 4 socket Supermicro X10QBI
When I run `hwloc-ls` for the 2-socket server I see the PCIe topology with the HostBridges on each NUMANode, but the 4-socket server shows Packages instead of NUMANodes, and all of the HostBridges are listed at the bottom. In addition, `lscpu` shows 2 NUMA nodes on the 2-socket server but only 1 NUMA node on the 4-socket server.
How can I discern which PCIe device is attached to which socket on the 4-socket server?
When I run `hwloc-ls` on the 2-socket server I get the following:
Machine (63GB total)
NUMANode L#0 (P#0 31GB)
Package L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
...
HostBridge L#0
PCIBridge
PCI 17d3:1880
Block(Disk) L#0 "sda"
NUMANode L#1 (P#1 31GB)
Package L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#24)
...
HostBridge L#6
PCIBridge
PCI 8086:10fb
Net L#8 "eth0"
... and when I run it on the 4-socket server I get the following:
Machine (126GB)
Package L#0 + L3 L#0 (38MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#60)
...
Package L#1 + L3 L#1 (38MB)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#15)
PU L#5 (P#75)
...
Package L#2 + L3 L#2 (38MB)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#7 (P#30)
PU L#8 (P#90)
...
Package L#3 + L3 L#3 (38MB)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#10 (P#45)
PU L#11 (P#105)
...
Misc(MemoryModule)
...
HostBridge L#5
PCIBridge
PCI 8086:10c9
Net L#6 "ens8f0"
2-socket `lscpu`:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz
Stepping: 4
CPU MHz: 2804.841
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 5000.25
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
4-socket `lscpu`:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 120
On-line CPU(s) list: 0-119
Thread(s) per core: 2
Core(s) per socket: 15
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
Stepping: 7
CPU MHz: 1199.953
CPU max MHz: 3400.0000
CPU min MHz: 1200.0000
BogoMIPS: 5600.25
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 38400K
NUMA node0 CPU(s): 0-119
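One sysfs-based angle that might help (a sketch; the PCI address `0000:05:00.0` is a placeholder): each PCI device exposes the node and CPUs it is local to, although since the firmware on the 4-socket box apparently reports a single node, `numa_node` may just read -1 or 0 there.
```
# NUMA node and local CPU list for a given PCI device
cat /sys/bus/pci/devices/0000:05:00.0/numa_node
cat /sys/bus/pci/devices/0000:05:00.0/local_cpulist

# same attributes reached via a network interface name
cat /sys/class/net/ens8f0/device/numa_node
```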
KJ7LNW
(525 rep)
Oct 2, 2020, 12:44 AM
• Last activity: Jul 7, 2022, 10:27 PM
8 votes, 3 answers, 5216 views
How to disable memory for a NUMA node on a Linux system
Is there a way to disable access to memory associated with a given NUMA node/socket on a NUMA machine?
We have a bit of controversy with the database vendor about our HP DL560 machines. The DB sales type's technical support person was adamant that we could not use our DL560s but had to buy new DL360s, since they have fewer sockets. I believe their concern is the speed of accessing inter-socket memory. They recommended that if I insisted on keeping the DL560s, I should leave two of the sockets empty. I think they are mistaken (AKA crazy), but I need tests to demonstrate that I am on solid ground.
My configuration:
The machines have four sockets, each with 22 hyperthreaded physical cores, for a total of 176 apparent cores, and 1.5 TB of memory.
The operating system is Red Hat Enterprise Linux Server release 7.4.
The lscpu display reads (in part):
$ lscpu | egrep 'NUMA|ore'
Thread(s) per core: 2
Core(s) per socket: 22
NUMA node(s): 4
NUMA node0 CPU(s): 0-21,88-109
NUMA node1 CPU(s): 22-43,110-131
NUMA node2 CPU(s): 44-65,132-153
NUMA node3 CPU(s): 66-87,154-175
If I had access to the physical hardware, I would consider pulling the processors from two of the sockets to prove my point but I don’t have access and I don’t have permission to go monkeying around with the hardware anyway.
The next best thing would be to virtually disable the sockets using the operating system. I read on this link that I can take a processor out of service with
echo 0 > /sys/devices/system/cpu/cpu3/online
and, indeed, the processors go out of service, but that says nothing about the memory.
I just turned off all the processors of socket #3 (using lscpu to find which ones belong to it):
for num in {66..87} {154..175}
do
echo 0 > /sys/devices/system/cpu/cpu${num}/online
cat /sys/devices/system/cpu/cpu${num}/online
done
and got:
$ grep N3 /proc/$$/numa_maps
7fe5daa79000 default file=/usr/lib64/libm-2.17.so mapped=16 mapmax=19 N3=16 kernelpagesize_kB=4
Which, if I am reading this correctly, shows my current process is using memory in socket #3. Except the shell was already running when I turned off the processors.
I started a new process that does its best to gobble up memory, and
cat /proc/18824/numa_maps | grep N3
returns no records initially, but after gobbling up memory for a long time, it starts using memory on node 3.
I tried running my program with `numactl`, binding to nodes 0,1,2, and it works as expected ... except I don't have control over the vendor's software, and there is no provision in Linux for setting another process's policy the way the `set_mempolicy` call is used by `numactl`.
Short of physically removing the processors, is there a way to force the issue?
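One avenue I have not fully tested (a sketch; run as root, and it assumes a kernel with memory hot-remove support; blocks holding unmovable kernel pages will refuse to go offline):
```
# take every memory block belonging to node 3 offline
for blk in /sys/devices/system/node/node3/memory*/state; do
    echo offline > "$blk" 2>/dev/null || echo "busy: $blk"
done

# confirm: node 3 should now report (close to) zero MemTotal
numastat -m | grep -i memtotal
```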
user1683793
(423 rep)
Jun 18, 2019, 04:29 PM
• Last activity: Apr 30, 2022, 05:45 AM
3 votes, 1 answer, 1798 views
Does Linux's NUMA architecture share main memory as well?
I am reading about the NUMA (Non-Uniform Memory Access) architecture. It looks like this is a hardware architecture in which, on a multiprocessor system, each core accesses its local memory faster than remote memory.
What I don't know is this: it looks like the main memory (RAM) is also divided between nodes. That confuses me, because I would think all the nodes (which sit inside the same CPU) have the same access speed to the main memory. So why does Linux divide the main memory between the nodes?
hqt
(607 rep)
Apr 30, 2020, 10:14 PM
• Last activity: Jun 9, 2020, 11:50 PM
0 votes, 1 answer, 1721 views
numactl: this system does not support NUMA policy
When using numactl, I was seeing:
numactl: this system does not support NUMA policy
Is it because some kernel config is not enabled? The BIOS has NUMA enabled (confirmed), and lscpu shows there are NUMA nodes.
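Checks that may narrow it down (a sketch): the message typically means the memory-policy syscalls are unavailable, e.g. on a kernel built without `CONFIG_NUMA`.
```
# was the running kernel built with NUMA support?
grep -E 'CONFIG_NUMA=' /boot/config-$(uname -r)

# does the kernel actually see more than one node?
ls /sys/devices/system/node/ | grep node
dmesg | grep -i numa | head
```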
Mark K
(955 rep)
Mar 28, 2020, 02:41 AM
• Last activity: Mar 29, 2020, 09:29 PM
0 votes, 1 answer, 468 views
Is a page fault across NUMA nodes "major" or "minor"?
I understand that on a single-socket Linux system, a command such as `sudo ps -eo min_flt,maj_flt,cmd` will generally count a page fault as "minor" if it blocks on a memory-to-memory copy, or on the zeroing of a deallocated page, or for some other reason doesn't touch persistent storage. But is this true on NUMA systems as well, even when the fault requires a data transfer from one NUMA node to another? Or does that cross the line into "major"?
Pr0methean
(101 rep)
Nov 22, 2019, 05:56 AM
• Last activity: Nov 22, 2019, 08:39 AM
25 votes, 5 answers, 29144 views
Enabling NUMA for Intel Core i7
In the Linux kernel, the documentation for `CONFIG_NUMA` says:
Enable NUMA (Non Uniform Memory Access) support.
The kernel will try to allocate memory used by a CPU on the
local memory controller of the CPU and add some more
NUMA awareness to the kernel.
For 64-bit this is recommended if the system is Intel Core i7
(or later), AMD Opteron, or EM64T NUMA.
I have an Intel Core i7 processor, but AFAICT it only has one NUMA node:
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16063 MB
node 0 free: 15031 MB
node distances:
node 0
0: 10
So what is the purpose of having `CONFIG_NUMA=y` when the i7 has only one NUMA node?
user1968963
(4163 rep)
Sep 25, 2013, 12:41 PM
• Last activity: Aug 11, 2019, 09:52 PM
1 vote, 0 answers, 32 views
Why do we use dynamic memory in ccNUMA systems when we talk about data distribution into locality domains by first touch policy?
In many books, when they talk about the first-touch policy in ccNUMA systems, they use dynamic memory allocation when distributing data across locality domains. What if, for example, we have an array on the stack? Does the first-touch policy work the same way?
Arvanitis
(11 rep)
Jun 30, 2019, 11:26 AM
• Last activity: Jun 30, 2019, 01:02 PM
2 votes, 0 answers, 679 views
Allocate pools of hugepages separately on each NUMA domain
On my dual-socket machine, I'm trying to allocate two pools of hugepages (one for each socket), so that application A, which is pinned to the first socket, uses the first pool, and application B on the second socket uses its own local pool.
However, when I put the number of huge pages into `/sys/devices/system/node/node{0,1}/hugepages/hugepages-1048576kB/nr_hugepages`, `hugeadm --explain` still shows me a single pool and one mount point for it, instead of two.
The goal is to have two processes, one on each socket, each working only on its local pool of hugepages.
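For what it's worth, this is the shape of setup I'm attempting (a sketch; page counts, the mount point, and binary names are placeholders). My understanding is that a single hugetlbfs mount can serve both per-node pools, with `--membind` deciding which pool each process faults from, but I have not confirmed this:
```
# reserve 4 x 1GiB pages on each node
echo 4 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 4 | sudo tee /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

# one mount is enough; membind steers each app to its local pool
sudo mkdir -p /mnt/huge
sudo mount -t hugetlbfs -o pagesize=1G none /mnt/huge
numactl --cpunodebind=0 --membind=0 ./app_A &
numactl --cpunodebind=1 --membind=1 ./app_B &
```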
Vahid Noormofidi
(162 rep)
Mar 11, 2019, 10:28 PM
• Last activity: May 1, 2019, 09:59 PM
2 votes, 0 answers, 197 views
What prevents page migration?
OpenSUSE 42.3, Kernel 4.4.175-89-default
Running memory-bandwidth-intensive applications, I noticed the following behaviour:
The application uses ~55% of the physical memory of a NUMA system with 2 nodes. The application is parallelized using OpenMP, but without accounting for NUMA, so it relies on page migration to achieve somewhat decent execution speed.
Here is what that looks like:
At around 180 iterations, I cleared caches manually using
# echo 3 >| /proc/sys/vm/drop_caches
The result is an immediate performance improvement.
What prevents the system from doing proper page migration before I cleared the caches manually?
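Counters that might show what's happening (a sketch; `<pid>` is a placeholder, and `numa_pages_migrated` is only maintained when automatic NUMA balancing is active):
```
# is automatic NUMA balancing on at all?
cat /proc/sys/kernel/numa_balancing

# migration and NUMA-fault activity since boot
grep -E 'numa_pages_migrated|numa_hint_faults|pgmigrate' /proc/vmstat

# per-node placement of the application's pages
numastat -p <pid>
```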

MechEng
(233 rep)
Apr 12, 2019, 08:52 AM
• Last activity: Apr 12, 2019, 10:21 AM
2 votes, 1 answer, 92 views
Where does the default numa setting come from?
When we run:
numactl --hardware
we can see the current NUMA settings.
However, they don't seem to be set by Linux (at least, I didn't add a parameter to set them).
Are they set by the BIOS?
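If it helps, the node layout normally comes from the ACPI tables (SRAT/SLIT) published by the firmware; their traces can be seen like this (a sketch):
```
# the SRAT/SLIT tables from the firmware define the node layout
dmesg | grep -iE 'srat|slit|numa' | head
ls /sys/firmware/acpi/tables/ | grep -iE 'SRAT|SLIT'
```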
Mark
(747 rep)
Jan 23, 2019, 12:28 PM
• Last activity: Jan 28, 2019, 12:36 PM
2 votes, 1 answer, 6448 views
Sub-process returned an error code when apt-get install package
sudo apt-get install numactl
E: Problem executing scripts APT::Update::Post-Invoke-Success '/usr/bin/test -e /usr/share/dbus-1/system-services/org.freedesktop.PackageKit.service && /usr/bin/test -S /var/run/dbus/system_bus_socket && /usr/bin/gdbus call --system --dest org.freedesktop.PackageKit --object-path /org/freedesktop/PackageKit --timeout 4 --method org.freedesktop.PackageKit.StateHasChanged cache-update > /dev/null; /bin/echo > /dev/null'
E: Sub-process returned an error code
How do I fix it?

showkey
(499 rep)
May 2, 2018, 12:50 AM
• Last activity: Jan 25, 2019, 05:01 PM
2 votes, 1 answer, 329 views
Find out where the allocated memory for a process resides
I would like to investigate where the memory for a specific process is allocated.
To be more specific: I am running an OpenMP-parallel Fortran binary on a ccNUMA machine with two physical CPUs. My concern is that this program violates the first-touch rule when initializing its variables. That would lead to memory being allocated in an unbalanced fashion, i.e. most of it would sit in the address space of only one physical CPU instead of being balanced between both. In turn, that would lead to poor scaling for this memory-bandwidth-limited application.
Unfortunately, I don't have access to the source code, so looking at the memory allocation seems like a good way to find out. Other ideas are welcome.
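Two views that should reveal such an imbalance (a sketch; `<pid>` is a placeholder):
```
# per-node resident memory of the process; a heavy skew toward one
# node suggests first-touch went wrong
numastat -p <pid>

# raw per-mapping node counts (N0=..., N1=... are pages per node)
grep -E 'N[0-9]+=' /proc/<pid>/numa_maps | head
```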
Edit due to comments: OpenSUSE Leap 42.3, kernel version 4.4.103-36-default
MechEng
(233 rep)
Apr 30, 2018, 09:35 AM
• Last activity: May 2, 2018, 09:44 AM