Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
4
votes
3
answers
3079
views
Check what process is spiking the average load with atop
While trying to find the culprit of a high load average on a system during the night (which does not seem to be related to logrotate), I installed `atop` to write a raw file at a specific interval. While reading the file, it seems the process list stands still. Can I somehow go back and forth between the samples to see what sticks out, and further sort by any column (like CPU usage)?
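A minimal sketch of how this is typically done with `atop`'s interactive keys when replaying a raw file; the key bindings below are from recent `atop` versions, so confirm them in `man atop` for the version that wrote the file:

```
# replay the raw file written overnight
atop -r /var/log/atop/atop_20130903

# inside the pager:
#   t          step forward to the next sample
#   T          step back to the previous sample
#   b          branch to a specific time (e.g. 03:00)
#   C / M / D  sort the process list by CPU, memory or disk activity
```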
user135361
(193 rep)
Sep 3, 2013, 01:31 PM
• Last activity: Jun 4, 2025, 01:06 AM
2
votes
2
answers
5249
views
atop peak memory usage from log
I have `atop` logs stored on a daily basis with a 10 min interval, and I can read them with `atop -r`, but how can I find the peak memory usage in this log?
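A hedged sketch of two ways to scan a day's raw file for memory peaks; the parseable-output field layout differs between `atop` versions, so check the "PARSEABLE OUTPUT" section of `man atop` before relying on column positions:

```
# per-interval memory and swap report from a raw file
atopsar -r /var/log/atop/atop_20170511 -m

# or emit machine-parseable MEM lines and post-process them yourself
atop -r /var/log/atop/atop_20170511 -P MEM | grep '^MEM'
```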
mrgloom
(123 rep)
May 11, 2017, 08:41 AM
• Last activity: Sep 5, 2024, 01:03 PM
0
votes
1
answers
185
views
atop is not logging daily
I installed `atop` on `Rocky 9` to monitor CPU spike issues. On 2/8/2024 (2 days ago) I enabled and started `atop`:

```
systemctl enable atop
systemctl start atop
```

Thus, a file was created in `/var/log/atop` named `atop_20240802`.

`journalctl -u atop` output:

```
Aug 02 23:52:01 systemd[1]: Starting Atop advanced performance monitor...
Aug 02 23:52:01 systemd[1]: Started Atop advanced performance monitor.
```

I updated the config file `/usr/share/atop/atop.daily`:

```
#!/usr/bin/sh

LOGOPTS=""                         # default options
LOGINTERVAL=3600                   # default interval in seconds
LOGGENERATIONS=7                   # default number of days
LOGPATH=/var/log/atop              # default log location

# allow administrator to overrule the variables
# defined above
#
DEFAULTSFILE=/etc/sysconfig/atop   # possibility to overrule vars
```

`ls -ld /var/log/atop/` output:

```
drwxr-xr-x 2 root root 53 Aug 2 23:53 /var/log/atop/
```

Daily logs were not created for yesterday and today. How can I fix this?
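A hedged checklist, assuming the Rocky 9 packaging drives the daily restart through a systemd timer (as the Fedora-family packaging does); the unit names may differ on your build:

```
# is a rotation timer installed and active?
systemctl list-timers 'atop*'

# if atop-rotate.timer exists but is not enabled, the daily file never rolls over
systemctl enable --now atop-rotate.timer

# the timer restarts atop, which makes atop.daily start a new atop_YYYYMMDD file
systemctl status atop-rotate.service atop.service
```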
user25934652
(1 rep)
Aug 4, 2024, 03:49 AM
• Last activity: Aug 20, 2024, 11:54 AM
2
votes
1
answers
480
views
What are "shrss" / "shswp" in atop?
From the `atop` manual (official):

> **MEM** - Memory occupation.
>
> This line contains the total amount of physical memory (`tot`), [...] the resident size of shared memory including tmpfs (`shmem`), the resident size of shared memory (`shrss`) the amount of shared memory that is currently swapped (`shswp`) [...]

The description of `shrss` / `shswp` is too vague for me. What does it actually cover?
```
MEM | tot 7.7G | free 3.6G | cache 2.1G | dirty 0.3M | buff 258.1M | slab 306.4M | slrec 182.7M | shmem 257.5M | shrss 12.7M | shswp 0.0M | vmbal 0.0M | | hptot 0.0M | hpuse 0.0M |
```

Or:

```
tot    7.7G
free   3.6G
cache  2.1G
dirty  0.3M
buff   258.1M
slab   306.4M
slrec  182.7M
shmem  257.5M
shrss  12.7M
shswp  0.0M
vmbal  0.0M
hptot  0.0M
hpuse  0.0M
```
---
```
$ rpm -q atop
atop-2.4.0-1.fc30.x86_64
```
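Not an authoritative mapping, but one way to poke at counters that look related: the System V shared-memory statistics the kernel exposes (and which `ipcs` can print). Whether these are exactly what `atop` reports as `shrss`/`shswp` would need to be confirmed against atop's source (photosyst.c):

```
# summary of System V shared memory usage ("pages resident" / "pages swapped")
ipcs -mu

# tmpfs-backed shared memory, which atop reports separately as shmem
grep -i '^Shmem' /proc/meminfo
```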
sourcejedi
(53232 rep)
Jun 29, 2019, 03:36 PM
• Last activity: Aug 15, 2024, 09:59 PM
6
votes
5
answers
25305
views
How to configure atop sampling interval & log path?
I installed `atop` version 1.26 on an Ubuntu 12.04 server. By default, its sampling interval is 10 minutes. How can I change it to 1 minute?

Also, how can I set the log path? Currently the logs are at `/var/log/`, but I notice that `atop` added a folder `/var/log/atop/` (empty). How can I set the logs to be saved in this folder?
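A hedged sketch for the Debian/Ubuntu packaging of that era, which reads its settings from `/etc/default/atop`; the exact variable names may differ, so compare with what `/etc/init.d/atop` actually sources on your server:

```
# /etc/default/atop (settings read by the init script)
INTERVAL=60                 # sample every 60 seconds instead of 600
LOGPATH="/var/log/atop"     # write the daily atop_YYYYMMDD files here

# then restart the daemon so it re-reads the defaults
sudo service atop restart
```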
Raptor
(315 rep)
Apr 13, 2016, 03:10 AM
• Last activity: Aug 15, 2024, 07:26 PM
9
votes
2
answers
5675
views
Can I extract the full command line from an atop 1.23 data file?
My production servers have `atop` version 1.23, and I have a data file written by it from which I need to extract the full command line of a process.

Unfortunately:

* this version of `atop` won't display data beyond the 80th column of the display
* horizontal scrolling was only introduced in 1.27
* newer versions refuse to read data files written by older versions
* the file itself is compressed, so a simple `strings` search won't work

Is there any way I can recover the full command line from my data file?
Flup
(8335 rep)
May 14, 2013, 10:54 AM
• Last activity: May 13, 2024, 12:12 PM
0
votes
1
answers
96
views
Persistent atop across reboots
[`atop` shows total CPU time since the last reboot](https://superuser.com/q/1816687). How can I have it persist across reboots?
wdkmaaeo
(1 rep)
Nov 16, 2023, 07:04 PM
• Last activity: Nov 16, 2023, 08:26 PM
2
votes
2
answers
12607
views
how auto-rotate atop logs
We can see that `atop` logs are created each day, and they take a lot of space:
ls -l /var/log/atop/
total 1634632
-rw-r--r-- 1 root root 127992086 Aug 30 01:49 atop_20180829
-rw-r--r-- 1 root root 262277153 Aug 31 00:00 atop_20180830
-rw-r--r-- 1 root root 321592670 Sep 1 00:00 atop_20180831
-rw-r--r-- 1 root root 330041977 Sep 2 00:00 atop_20180901
-rw-r--r-- 1 root root 269040388 Sep 3 00:00 atop_20180902
-rw-r--r-- 1 root root 274807097 Sep 4 00:00 atop_20180903
-rw-r--r-- 1 root root 85426960 Sep 4 06:03 atop_20180904
-rw-r--r-- 1 root root 0 Sep 4 06:03 daily.log
How can I limit the atop logs, for example, to only 5 logs (the last 5 days)?
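A hedged sketch: either lower the retention variable the packaging already provides (often `LOGGENERATIONS`, as seen in the `atop.daily` script quoted in another question above), or prune old files yourself from cron:

```
# delete daily raw files older than 5 days; run once a day from root's crontab
find /var/log/atop -name 'atop_*' -mtime +5 -delete
```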
yael
(13936 rep)
Sep 4, 2018, 06:06 AM
• Last activity: Feb 14, 2022, 09:51 AM
8
votes
9
answers
17568
views
Change atop log interval from 10 minutes to 1 minute
I have a post-mortem `atop` log on an Ubuntu machine, which I'm able to view with `atop -r`.

The problem is that the log files are shown in 10 minute intervals with `t` and `Shift+t`, while I would prefer 1 minute intervals. When the current time is changed with `b`, it is quantized to the nearest 10 minute point.

I have a line in `/etc/init.d/atop`, but I'm not sure if it affects anything here:

```
INTERVAL=60
```

How can I make `atop` logs browsable with 1 minute accuracy with `t` and `Shift+t`?

Is the information between these 10 minute intervals already lost?

What are the possible workarounds if this is not possible?
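The samples between the recorded 10-minute points are not in the file, so finer granularity has to be arranged at recording time. A hedged sketch, assuming you can run a second recorder alongside the packaged one (same `-w` syntax atop itself uses):

```
# record a separate raw file with 60-second samples; browse it later with atop -r
atop -w /var/log/atop/atop_custom_$(date +%Y%m%d) 60
```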
Estus Flask
(214 rep)
Feb 23, 2017, 07:41 PM
• Last activity: Oct 6, 2021, 05:07 PM
1
votes
0
answers
193
views
When using atop as a systemd service it only runs and logs to file once
I am on **CentOS 7** and using **atop v2.6.0**.

If I run atop manually with the following command, everything runs as expected and the log is written to every 10 seconds:

```
/usr/bin/atop -w /var/log/atop/atop_20210320 10
```

However, when running atop as a service with systemd, the log is only written to once when the service starts and never updated.

I can confirm via `ps -ef` that the service has initiated the command and it's running:

```
root 2320 1 0 13:34 ? 00:00:00 /usr/bin/atop -w /var/log/atop/atop_20210320 10
```

Any help would be much appreciated. Thank you.
Pel
(11 rep)
Mar 20, 2021, 06:56 PM
3
votes
1
answers
3396
views
atop shows `swout` (swapping) when I have gigabytes of free memory. Why?
Why did `atop` show that I was swapping out over 20,000 pages - over 80 megabytes - when I had gigabytes of free memory?

I have not noticed a performance problem with this. I simply wish to take the opportunity to increase my knowledge :-).

`atop` refreshes every ten seconds. Each refresh shows the activity since the last refresh.
MEM | tot 7.7G | free 3.7G | cache 608.3M | buff 19.1M | slab 264.6M |
SWP | tot 2.0G | free 1.4G | | vmcom 13.4G | vmlim 5.8G |
PAG | scan 167264 | steal 109112 | stall 0 | swin 0 | swout 23834 |
"swout" is non-zero and coloured in red ^
Kernel meminfo:
$ head -n 5 /proc/meminfo
MemTotal: 8042664 kB
MemFree: 3563088 kB
MemAvailable: 3922092 kB
Buffers: 20484 kB
Cached: 775308 kB
Kernel version:
$ uname -r
5.0.16-200.fc29.x86_64
---
1. It is not clear that this would be affected by `vm.swappiness`. That setting balances cache reclaim vs. swapping. However, there is plenty of *free* memory, so why would we need to reclaim any memory in the first place?
2. As you can see, this is a small system. It does not use NUMA. I checked in `/proc/zoneinfo` and I only have one node, "Node 0". So this is not [caused by NUMA](https://serverfault.com/a/510220/133475).
3. Related questions and answers mention the idea of "opportunistic swapping", "when the system has nothing better to do", "which may provide a benefit if there's a memory shortage later", etc. I do not find these ideas credible, because they contradict the kernel documentation. See https://unix.stackexchange.com/questions/533739/does-linux-perform-opportunistic-swapping-or-is-it-a-myth/533741#533741
4. There are no limits set on RAM usage using `systemd.resources` features. I.e. I think all `systemd` units have their RAM usage limit set to "infinity".
$ systemctl show '*' | \
grep -E '(Memory|Swap).*(Max|Limit|High)' | \
grep -v infinity
$
5. Edit: I suspect this is related to transparent huge pages. I note that virtual machines use transparent huge pages to allocate the guest memory efficiently. They are the only user program that uses huge pages [on my system](https://unix.stackexchange.com/questions/495816/which-distributions-enable-transparent-huge-pages-for-all-applications).

There is a similar-looking question: https://unix.stackexchange.com/questions/530314/can-kswapd-be-active-if-free-memory-well-exceeds-pages-high-watermark . It is asking about RHEL 6, which enables huge pages for all applications.
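A hedged way to inspect the transparent-huge-page angle while trying to reproduce, not a diagnosis; these are standard kernel interfaces, and the `pgrep` PID selection is only illustrative:

```
# system-wide THP policy and current THP usage
cat /sys/kernel/mm/transparent_hugepage/enabled
grep -E 'AnonHugePages|ShmemHugePages' /proc/meminfo

# THP usage of the qemu process backing the VM (illustrative PID selection)
grep AnonHugePages /proc/$(pgrep -f qemu | head -n1)/smaps_rollup
```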
I am not sure exactly how to reproduce this result.

It happened when starting a VM. I use libvirt to run VMs. As per the default, VM disk reads are cached using the host page cache. (Cache mode: "Hypervisor default" means "Writeback".)

I tried to stop the VM, `FADVISE_DONTNEED` the image file, and try again. But the same thing did not happen.

Then I tried again with a different VM, and it happened briefly. I captured `vmstat`. I think `atop` showed a different, higher figure for "swout", but I did not capture it.
$ vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 770168 5034300 28024 936256 0 2 21 86 60 101 22 9 68 1 0
0 0 770168 5033852 28048 935904 0 0 0 8 372 633 1 1 97 0 0
1 0 770168 4974048 28252 948152 3 0 1137 194 1859 2559 11 7 79 3 0
0 1 770168 4557968 28308 1037512 0 0 8974 45 3595 6218 16 6 57 21 0
6 3 770168 4317800 28324 1111408 0 0 7200 609 6820 6793 12 5 38 44 0
0 4 770168 4125100 28348 1182624 0 0 6900 269 5054 3674 74 3 8 15 0
0 5 770168 3898200 27840 1259768 2 0 9421 630 4694 5465 11 6 11 71 0
1 3 770168 3760316 27672 1300540 0 0 9294 897 3308 4135 5 4 28 63 0
0 1 770168 3531332 27032 1356236 0 0 10532 155 3140 4949 8 5 63 25 0
0 0 783772 3381556 27032 1320296 0 1390 7320 4210 4450 5112 17 5 43 35 0
0 0 783772 3446284 27056 1335116 0 0 239 441 995 1782 4 2 92 2 0
0 0 783772 3459688 27076 1335372 0 0 3 410 728 1037 2 2 95 1 0
I also checked for a cgroup memory limit on the VM, on the off-chance that libvirt had bypassed `systemd`, and inflicted swapping on itself by mistake:
$ cd /sys/fs/cgroup/memory/machine.slice/machine-qemu\x2d5\x2ddebian9.scope
$ find -type d # there were no sub-directories here
$ grep -H . *limit_in_bytes
memory.kmem.limit_in_bytes:9223372036854771712
memory.kmem.tcp.limit_in_bytes:9223372036854771712
memory.limit_in_bytes:9223372036854771712
memory.memsw.limit_in_bytes:9223372036854771712
memory.soft_limit_in_bytes:9223372036854771712
$ cd ../..
$ find -name "*limit_in_bytes" -exec grep -H -v 9223372036854771712 \{\} \;
$
sourcejedi
(53232 rep)
May 26, 2019, 02:36 PM
• Last activity: Feb 15, 2021, 09:53 PM
3
votes
2
answers
6094
views
How to completely stop atop from creating logs for good?
I've installed `atop` on an OrangePi/RaspberryPi with only 40MB for `/var/log`.

It quickly filled my partition to 100%, so I then purged it (`apt purge atop`) to remove it and its configuration files.

* the `atop` binary doesn't exist
* its file in `/etc/init.d` doesn't exist
* there is no systemd service called `atop`

Nevertheless, after a reboot, the `/var/log/atop` folder comes back full with the old atop files.

**How can I completely remove it for good?**
Edit:
$ ls /var/log/atop/
atop_20200822 atop_20200826 atop_20200830 atop_20200903 atop_20200907 atop_20200911 atop_20200915 atop_20200919
atop_20200823 atop_20200827 atop_20200831 atop_20200904 atop_20200908 atop_20200912 atop_20200916 atop_20200920
atop_20200824 atop_20200828 atop_20200901 atop_20200905 atop_20200909 atop_20200913 atop_20200917 atop_20200921
atop_20200825 atop_20200829 atop_20200902 atop_20200906 atop_20200910 atop_20200914 atop_20200918 atop_20200922
I removed atop on 2020/09/22, the date of the last file. And it regenerates the same folder with the same files again and again after each reboot.
### atop folders and files:
# find . -type d -name *atop*
./usr/share/doc/libatopology2
./var/log.hdd/atop
./var/log/atop
./var/cache/atop.d
# find . -type f -name *atop*
./usr/lib/arm-linux-gnueabihf/libatopology.so.2.0.0
./var/log.hdd/atop/atop_20200913
...
./var/log.hdd/atop/atop_20200906
./var/log/atop/atop_20200830
...
./var/log/atop/atop_20200826
./var/cache/atop.d/atop.acct
./var/lib/dpkg/info/libatopology2:armhf.symbols
./var/lib/dpkg/info/libatopology2:armhf.md5sums
./var/lib/dpkg/info/libatopology2:armhf.list
./var/lib/dpkg/info/libatopology2:armhf.triggers
./var/lib/dpkg/info/libatopology2:armhf.shlibs
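The `/var/log.hdd/atop` directory in the listing above is the likely culprit on Armbian-style OrangePi images: `/var/log` is a RAM-backed copy that gets re-populated from `/var/log.hdd` at boot (log2ram/folder2ram). A hedged sketch, assuming that layout:

```
# remove both the RAM copy and the on-disk master copy
sudo rm -rf /var/log/atop /var/log.hdd/atop

# also clear the leftover process-accounting cache atop created
sudo rm -rf /var/cache/atop.d
```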
無名前
(729 rep)
Oct 3, 2020, 01:59 AM
• Last activity: Oct 3, 2020, 03:31 AM
9
votes
1
answers
2898
views
`dd` is running at full speed, but I only see 20% disk utilization. Why?
1. `sudo dd if=/dev/sda of=/dev/null bs=1M iflag=direct`
2. `atopsar -d 5   # in a second terminal`
3. `top            # in a third terminal`

Results from `atopsar`:
19:18:32 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
...
19:16:50 sda 18% 156.5 1024.0 0.0 0.0 5.0 1.15 ms
19:16:55 sda 18% 156.3 1024.0 0.0 0.0 4.9 1.15 ms
...
Why is disk utilization ("busy") reported as much less than 100%?

According to `top`, the `dd` process only uses 3% of a CPU or less. `top` also provides an overall report of hardware and software interrupt (`hi` and `si`) usage of the system's CPUs, which shows as less than 1%. I have four CPUs (2 cores with 2 threads each).

`/dev/sda` is a SATA HDD. It is not an SSD; it is not even a hybrid SSHD drive. It cannot read faster than about 150 megabytes per second :-). So that part of the results makes sense: 156 read/s * 1024 KB/read = 156 MB/s.

The kernel version is `5.0.9-200.fc29.x86_64` (Fedora Workstation 29). The IO scheduler is `mq-deadline`. Since kernel version 5.0, Fedora uses the multi-queue block layer. Because the single queue block layer has been removed :-).

I believe the disk utilization figure in `atopsar -d` and `atop` is calculated from one of the [kernel iostat fields](https://github.com/torvalds/linux/blob/v5.0/Documentation/iostats.txt). The linked doc mentions "field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure that the functions it mentions still exist in the multi-queue block layer. As far as I can tell, both `atopsar -d` and `atop` use [common code](https://github.com/Atoptool/atop/blob/v2.4.0/photosyst.c#L992) to read this field 10. (I believe this field is also used by `sar -d` / `iostat -x` / `mxiostat.py`.)
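A hedged sketch of approximating the "busy" figure directly from field 10, to confirm what the tools are reporting; the device name and interval are only examples:

```
# field 10 of the kernel iostats ("milliseconds spent doing I/Os") is the
# 13th whitespace-separated column of /proc/diskstats; sample it twice and
# divide by the wall-clock interval to get utilization
t1=$(awk '$3 == "sda" {print $13}' /proc/diskstats)
sleep 5
t2=$(awk '$3 == "sda" {print $13}' /proc/diskstats)
echo "busy: $(( (t2 - t1) / 50 ))%"   # 5000 ms interval -> divide by 50 for percent
```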
## Additional tests
Variant 2: Changing to `bs=512k`, but keeping `iflag=direct`.

```
dd if=/dev/sda of=/dev/null bs=512k iflag=direct
```
19:18:32 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
...
19:18:00 sda 35% 314.0 512.0 0.0 0.0 2.1 1.12 ms
19:18:05 sda 35% 313.6 512.0 0.2 4.0 2.1 1.11 ms
Variant 3: Using `bs=1M`, but removing `iflag=direct`. `dd` uses about 10% CPU, and 35% disk.

```
dd if=/dev/sda of=/dev/null bs=1M
```
19:18:32 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
...
19:21:47 sda 35% 242.3 660.2 0.0 0.0 5.4 1.44 ms
19:21:52 sda 31% 232.3 667.8 0.0 0.0 9.5 1.33 ms
# How to reproduce these results - essential details
## Beware of the last test, i.e. running `dd` *without* `iflag=direct`

It is a bit of a hog. I saw it freeze the system (mouse cursor) for ten seconds or longer. Even when I had swap disabled. (The test fills your RAM with [buff/cache](https://unix.stackexchange.com/questions/440558/30-of-ram-is-buffers-what-is-it). It is filling the inactive LRU list. I think the turnover evicts inactive cache pages relatively quickly. At the same time, the disk is busy with sequential reads, so it takes longer when you need to page something in. How bad this gets probably depends on whether the kernel ends up also turning over the active LRU list, or shrinking it too much. I.e. how well the current ["mash of a number of different algorithms with a number of modifications for catching corner cases and various optimisations"](https://unix.stackexchange.com/questions/281974/what-page-replacement-algorithms-are-used-in-linux-kernel-for-os-file-cache/282008) is working in your case).
## The exact results of the first test are difficult to reproduce.
Sometimes, `KB/read` shows as `512` instead of `1024`. In this case, the other results look more like the results from `bs=512k`. Including that it shows a disk utilization around 35%, instead of around 20%. My question stands in either case.

If you would like to understand this behaviour, it is described here: https://unix.stackexchange.com/questions/529529/why-is-the-size-of-my-io-requests-being-limited-to-about-512k
sourcejedi
(53232 rep)
May 4, 2019, 07:10 PM
• Last activity: Jul 11, 2020, 04:00 PM
0
votes
1
answers
1629
views
`systemctl status atop` shows it has failed, but there is no error message
I installed `atop`, and then immediately upgraded from Fedora 29 to Fedora 30.

Now `atop` failed to start during boot. But there is no error message. It only shows that `atop` exited with code 7. Why do I not see an error message?
$ rpm -q atop
atop-2.4.0-1.fc30.x86_64
$ journalctl -u atop
-- Logs begin at Wed 2018-02-28 17:23:24 GMT, end at Tue 2019-04-30 22:41:24 BST. --
Apr 30 17:27:05 fedora29-vm systemd: Started Atop advanced performance monitor.
Apr 30 18:34:31 fedora29-vm systemd: Stopping Atop advanced performance monitor...
Apr 30 18:34:34 fedora29-vm systemd: Stopped Atop advanced performance monitor.
-- Reboot --
Apr 30 19:48:56 fedora29-vm systemd: Started Atop advanced performance monitor.
Apr 30 19:48:58 fedora29-vm systemd: atop.service: Main process exited, code=exited, status=7/NOTRUNNING
Apr 30 19:48:58 fedora29-vm systemd: atop.service: Failed with result 'exit-code'.
sourcejedi
(53232 rep)
Apr 30, 2019, 09:48 PM
• Last activity: Aug 13, 2019, 08:20 AM
0
votes
1
answers
615
views
atop died with exit status 53
I was running `atop`. It was working, but later it died with exit status 53.

```
$ atop
$ echo $?
53
```

Is this a bug in `atop`? Or is my system failing?
## Pre-emptive extra details :-)
I checked in the kernel log (`dmesg`). I found nothing *specifically* alarming, but there were some errors. See below.

This laptop has been used reliably for months now. The kernel is not "tainted", therefore there have been no kernel WARN or OOPS messages.
$ cat /proc/sys/kernel/tainted
0
$ rpm -q atop
atop-2.3.0-10.fc28.x86_64
$ uname -r # My kernel version
5.1.6-200.fc29.x86_64
`dmesg` reminded me that I recently woke the computer from sleep, and that I started `atop` before it went to sleep.

`dmesg` also happens to show USB and FAT errors, since I lost connection with a poorly-connected USB stick at some point during this.

There were some hardware messages during resume from suspend. But I have not had any user-visible problem with sleep/wake in the past several months.

There are also some earlier `dmesg` errors.
---
[40654.082396] PM: suspend exit
[40654.205336] Bluetooth: hci0: Intel Bluetooth firmware file: intel/ibt-hw-37.8.10-fw-1.10.3.11.e.bseq
[40654.517627] Bluetooth: hci0: unexpected event for opcode 0xfc2f
[40654.534651] Bluetooth: hci0: Intel firmware patch completed and activated
[40655.519592] e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
TDH
TDT
next_to_use
next_to_clean
buffer_info[next_to_clean]:
time_stamp
next_to_watch
jiffies
next_to_watch.status
MAC Status
PHY Status
PHY 1000BASE-T Status
PHY Extended Status
PCI Status
[40656.239991] Process accounting resumed
[40656.318355] e1000e: eno1 NIC Link is Down
...
[41013.654127] usb 2-2: USB disconnect, device number 43
[41013.901953] usb 2-2: new full-speed USB device number 46 using xhci_hcd
[41020.270316] FAT-fs (sdb4): Directory bread(block 22704) failed
[41020.270322] FAT-fs (sdb4): Directory bread(block 22705) failed
[41020.270325] FAT-fs (sdb4): Directory bread(block 22706) failed
[41020.270328] FAT-fs (sdb4): Directory bread(block 22707) failed
[41020.270331] FAT-fs (sdb4): Directory bread(block 22708) failed
[41020.270333] FAT-fs (sdb4): Directory bread(block 22709) failed
[41020.270336] FAT-fs (sdb4): Directory bread(block 22710) failed
[41020.270339] FAT-fs (sdb4): Directory bread(block 22711) failed
[41020.270354] FAT-fs (sdb4): Directory bread(block 22704) failed
[41020.270357] FAT-fs (sdb4): Directory bread(block 22705) failed
[41029.558085] usb 2-2: device descriptor read/64, error -110
[41029.780057] usb 2-2: device descriptor read/64, error -71
[41029.996102] usb 2-2: new full-speed USB device number 47 using xhci_hcd
[41030.110057] usb 2-2: device descriptor read/64, error -71
[41030.332052] usb 2-2: device descriptor read/64, error -71
[41030.434046] usb usb2-port2: attempt power cycle
[41031.061973] usb 2-2: new full-speed USB device number 48 using xhci_hcd
[41031.062086] usb 2-2: Device not responding to setup address.
[41031.266212] usb 2-2: Device not responding to setup address.
[41031.474101] usb 2-2: device not accepting address 48, error -71
[41031.588100] usb 2-2: new full-speed USB device number 49 using xhci_hcd
[41031.588294] usb 2-2: Device not responding to setup address.
[41031.794271] usb 2-2: Device not responding to setup address.
[41032.002095] usb 2-2: device not accepting address 49, error -71
[41032.002202] usb usb2-port2: unable to enumerate USB device
[41070.004040] usb 2-1: new high-speed USB device number 50 using xhci_hcd
[41075.638042] usb 2-1: device descriptor read/64, error -110
[41075.877302] usb 2-1: New USB device found, idVendor=0718, idProduct=063d, bcdDevice= 1.00
[41075.877309] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[41075.877313] usb 2-1: Product: Nano Pro
[41075.877316] usb 2-1: Manufacturer: Imation
[41075.877319] usb 2-1: SerialNumber: 0703335B0AF92D86
[41075.881530] usb-storage 2-1:1.0: USB Mass Storage device detected
[41075.882899] scsi host3: usb-storage 2-1:1.0
[41077.004052] scsi 3:0:0:0: Direct-Access Imation Nano Pro PMAP PQ: 0 ANSI: 4
[41077.005015] sd 3:0:0:0: Attached scsi generic sg1 type 0
[41078.366583] sd 3:0:0:0: [sdc] 15116736 512-byte logical blocks: (7.74 GB/7.21 GiB)
[41078.366732] sd 3:0:0:0: [sdc] Write Protect is off
[41078.366735] sd 3:0:0:0: [sdc] Mode Sense: 23 00 00 00
[41078.366935] sd 3:0:0:0: [sdc] No Caching mode page found
[41078.366939] sd 3:0:0:0: [sdc] Assuming drive cache: write through
[41078.389637] sdc: sdc1 sdc2 sdc3 sdc4
[41078.392305] sd 3:0:0:0: [sdc] Attached SCSI removable disk
Errors from earlier:
$ journalctl -o short-monotonic -k -p err
[ 14.953318] alan-laptop kernel: i801_smbus 0000:00:1f.3: Failed to enable SMBus PCI device (-16)
[ 15.919042] alan-laptop kernel: uvcvideo: Failed to query (GET_INFO) UVC control 3 on unit 1: -32 (exp. 1).
[ 15.974276] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
[ 19.812805] alan-laptop kernel: db_root: cannot open: /etc/target
[ 35.219443] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
...
[19625.693967] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
[20606.292276] alan-laptop kernel: ACPI Error: Thread 3649118016 cannot release Mutex [PATM] acquired by thread 2284756800 (20190215/exmutex-382)
[20606.292282] alan-laptop kernel: ACPI Error: Aborting method \_SB.PCI0.LPCB.ECDV._Q66 due to previous error (AE_AML_NOT_OWNER) (20190215/psparse-53>
[20609.491788] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
[22006.616549] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
[28469.301081] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
[28472.604260] alan-laptop kernel: Bluetooth: hci0: command 0x0c56 tx timeout
[30297.940165] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
[30355.700904] alan-laptop kernel: mmc0: error -110 doing runtime resume
[30355.939525] alan-laptop kernel: Bluetooth: hci0: unexpected event for opcode 0xfc2f
sourcejedi
(53232 rep)
Jun 20, 2019, 03:17 PM
• Last activity: Jun 20, 2019, 03:25 PM
3
votes
1
answers
3720
views
IO wait time is higher than disk utilization. Isn't this impossible?
I am trying to improve my understanding, following this (so far) unanswered question: https://unix.stackexchange.com/questions/516433/possible-limiting-factor-during-upgrade-of-fedora-vm-not-disk-or-cpu-or-networ

I ran the following test load, which took 200 seconds to complete.

```
sudo perf trace -s time perf stat dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal
```
I am running this on a fairly default, straightforward install of Fedora Workstation 29. It is not a VM. The kernel version is 5.0.9-200.fc29.x86_64. The IO scheduler is `mq-deadline`.

I use LVM and the `ext4` filesystem. I am not using any encryption on my disk or filesystem. I do not have any network filesystem mounted at all, so I am not reading or writing a network filesystem.

I have 4 "CPUs": 2 cores with 2 threads each.

I have only one disk, `/dev/sda`, which is a SATA HDD. The HDD supports NCQ: `cat /sys/class/block/sda/device/queue_depth` shows `32`.
`vmstat 5` showed that non-idle CPU time *sometimes* rose to about one CPU, i.e. idle was as low as 75%.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
...
1 1 3720 1600980 392948 3669196 0 0 14 30634 8769 1876 4 3 78 15 0
1 1 3720 1600144 393128 3669884 0 0 0 3460 1406 1468 0 1 80 18 0
0 1 3720 1599424 393356 3670532 0 0 0 6830 1416 1480 0 1 73 25 0
0 1 3720 1598448 393700 3671108 0 0 0 7766 1420 1362 0 1 78 20 0
0 1 3720 1597296 393940 3671616 0 0 0 5401 1374 1367 1 2 87 11 0
0 1 3720 1596332 394312 3672416 0 0 0 1162 1647 1558 1 2 85 13 0
3 0 3720 1652064 394564 3673540 0 0 0 17897 15406 1784 1 3 74 23 0
0 0 3720 1972876 394600 3541064 0 0 0 522 8028 18327 3 3 84 10 0
1 0 3720 1974152 394600 3541108 0 0 0 9 422 879 2 1 97 0 0
0 0 3720 1974136 394600 3541120 0 0 0 0 204 455 0 0 99 0 0
(end of test)
And the "IO wait" time (the `wa` field under `cpu` in `vmstat`) rose as high as 25%. I think this means 100% of one CPU. But `atopsar -d 5` showed the utilization of my disk did not directly match this. It was much less than 100%:
22:46:44 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
...
22:49:34 sda 5% 0.4 4.0 69.5 413.0 36.9 0.68 ms
22:49:39 sda 8% 0.2 60.0 120.6 30.6 18.7 0.66 ms
22:49:44 sda 8% 0.0 0.0 136.2 16.7 20.4 0.61 ms
22:49:49 sda 10% 0.0 0.0 157.1 44.2 21.4 0.65 ms
22:49:54 sda 9% 0.0 0.0 196.4 39.3 48.0 0.47 ms
22:49:59 sda 9% 0.0 0.0 148.9 36.6 32.6 0.62 ms
22:50:04 sda 10% 0.0 0.0 137.3 130.6 37.2 0.70 ms
22:50:09 sda 11% 0.0 0.0 199.6 5.4 13.5 0.55 ms
22:50:14 sda 2% 0.0 0.0 50.2 4.5 11.8 0.32 ms
22:50:19 sda 0% 0.0 0.0 0.8 11.0 13.3 0.75 ms
(end of test)
How can "IO wait" time be higher than disk utilization?
> Following is the definition taken from the sar manpage:
>
> %iowait:
>
> Percentage of time that the CPU or CPUs were idle during which the
> system had an outstanding disk I/O request.
>
> Therefore, %iowait means that from the CPU point of view, no tasks
> were runnable, but at least one I/O was in progress. iowait is simply
> a form of idle time when nothing could be scheduled. The value may or
> may not be useful in indicating a performance problem, but it does
> tell the user that the system is idle and could have taken more work.
>
> https://support.hpe.com/hpsc/doc/public/display?docId=c02783994
"IO wait" is tricksy to define on multi-CPU systems. See https://unix.stackexchange.com/questions/410628/how-does-a-cpu-know-there-is-io-pending . But even if you think I was wrong to multiply the above "IO wait" figure by 4, it would still be higher than the disk utilization figure!
I expect the disk utilization figure in `atopsar -d` (and equally in `atop` / `sar -d` / `iostat -x` / `mxiostat.py`) is calculated from one of the [kernel iostat fields](https://github.com/torvalds/linux/blob/v5.0/Documentation/iostats.txt). The linked doc mentions "Field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure that the functions it mentions still exist in the current multi-queue block layer.
---
Thanks to the `perf` in the test command, I can also report that dnf's `fdatasync()` calls were responsible for 81 out of 200 seconds of elapsed time. This evidence suggests to me that the "IO wait" figure is giving a more accurate impression than the disk utilization figure.
199.440461084 seconds time elapsed
60.043226000 seconds user
11.858057000 seconds sys
60.07user 12.17system 3:19.84elapsed 36%CPU (0avgtext+0avgdata 437844maxresident)k
496inputs+2562448outputs (25major+390722minor)pagefaults 0swaps
Summary of events:
...
dnf (6177), 2438150 events, 76.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
fdatasync 3157 81436.062 0.160 25.795 465.329 1.92%
wait4 51 43911.444 0.017 861.009 32770.388 74.93%
poll 45188 6052.070 0.001 0.134 96.764 5.58%
read 341567 2718.531 0.001 0.008 1372.100 50.63%
write 120402 1616.062 0.002 0.013 155.142 9.61%
getpid 50012 755.423 0.001 0.015 207.506 32.86%
...
sourcejedi
(53232 rep)
May 2, 2019, 10:25 PM
• Last activity: May 3, 2019, 03:14 PM
1
votes
0
answers
32
views
Possible limiting factor during upgrade of Fedora VM - not disk or cpu or network?
I'm upgrading a VM from Fedora 29 to Fedora 30. I triggered the update from inside GNOME Software. It downloaded the upgrade, and asked to reboot. It started installing...
`atop` shows the `qemu` process using about 15% of one CPU core, and my one hard drive at about 15% utilization.

What other resource(s) could the VM possibly be waiting for the other ~70% of the time? Is it just [sleepy](https://en.wikipedia.org/wiki/Sleep_(system_call))?? Is there some other way to explain or find this out?
The upgrade showed the normal progress indicator, and completed without errors.
The VM is in libvirt (virt-manager), using the default KVM backend on x86-64. I configured the VM disk using cache mode "none". There's a 2x2 matrix of caching options; "none" means O_DIRECT, but not O_SYNC. (My intent was to avoid [filling the host kernel's writeback cache](https://unix.stackexchange.com/questions/474119/system-lags-when-doing-large-r-w-operations-on-external-disks/474171) ; at least it seems to have left the host system so much more responsive than the last time I tried this). "IO Mode" is set to "Hypervisor default". The virtual disk controller is using VirtIO. libvirt is from Fedora 29; it is version 4.7.0 and qemu is version 3.0.0.
sourcejedi
(53232 rep)
Apr 30, 2019, 06:41 PM
• Last activity: Apr 30, 2019, 10:33 PM
1
votes
1
answers
3726
views
How to read several days log files when use `atopsar`?
My system is CentOS 7.4 with `atop-2.3.0-8.el7.x86_64`.

I already have 3 days of logs in `/var/log/atop`, such as `atop_20190312 atop_20190313 atop_20190314`. I want to find out the peak period of CPU consumption in those 3 days.

Based on this post, I tried:

```
atopsar -r /var/log/atop/ -A -R 3 -c | head
```

but it failed. How can I read several days of atop logs as one?
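A hedged sketch: `atopsar -r` expects a single raw file rather than a directory, so one simple approach is to run it once per daily file and compare the outputs (same `-c` flag as in the question):

```
for f in /var/log/atop/atop_2019031{2,3,4}; do
    echo "== $f =="
    atopsar -r "$f" -c
done
```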
kittygirl
(111 rep)
Mar 14, 2019, 09:51 AM
• Last activity: Mar 14, 2019, 05:56 PM
7
votes
1
answers
3915
views
Writeback cache (`dirty`) seems to be limited to even less than dirty_background_ratio. What is it being limited by? How is this limit calculated?
I have been testing Linux `4.18.16-200.fc28.x86_64`. My system has 7.7G total RAM, according to `free -h`.

I have default values for the `vm.dirty*` sysctls. `dirty_background_ratio` is 10, and `dirty_ratio` is 20. Based on everything I've read, I expect Linux to begin writeout of dirty cache when it reaches 10% of RAM: 0.77G. And buffered write() calls should block when dirty cache reaches 20% of RAM: 1.54G.

I ran `dd if=/dev/zero of=~/test bs=1M count=2000` and watched the `dirty` field in `atop`. While the `dd` command was running, the `dirty` value settled at around 0.5G. This is significantly less than the dirty background threshold (0.77G)! How can this be? What am I missing?
`dirty_expire_centisecs` is 3000, so I don't think that can be the cause. I even tried lowering `dirty_expire_centisecs` to 100, and `dirty_writeback_centisecs` to 10, to see if that was limiting `dirty`. This did not change the result.

I initially wrote these observations as part of this investigation: https://unix.stackexchange.com/questions/480399/why-were-usb-stick-stall-problems-reported-in-2013-why-wasnt-this-problem-al/
---
I understand that half-way between the two thresholds - 15% = 1.155G - write() calls start being throttled (delayed) on a curve. But no delay is added when underneath this ceiling; the processes generating dirty pages are allowed "free run".
As I understand it, the throttling aims to keep the dirty cache somewhere at or above 15%, and prevent hitting the 20% hard limit. It does not provide a guarantee for every situation. But I'm testing a simple case with one `dd` command; I think it should simply ratelimit the write() calls to match the writeout speed achieved by the device.
(There is not a simple guarantee because there are some complex exceptions. For example, the throttle code limits the delay it will impose to a maximum of 200ms. But not if the target ratelimit for the process is less than one page per second; in that case it will apply a strict ratelimit.)
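One hedged way to see the thresholds the kernel is actually applying: the counters below are maintained in `/proc/vmstat` and are expressed in pages. Note the ratios are applied to *dirtyable* memory rather than `MemTotal`, which may account for part of the gap:

```
# current dirty pages plus the kernel's computed thresholds (all in pages)
grep -E '^nr_dirty(_background_threshold|_threshold)? ' /proc/vmstat

# convert the background threshold to bytes (assuming 4 KiB pages)
echo $(( $(awk '/^nr_dirty_background_threshold /{print $2}' /proc/vmstat) * 4096 )) bytes
```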
* [Documentation/sysctl/vm.txt](https://github.com/torvalds/linux/blob/v4.18/Documentation/sysctl/vm.txt) -- Linux v4.18
* [No-I/O dirty throttling](https://lwn.net/Articles/456904/) -- 2011 LWN.net.
* > (dirty_background_ratio + dirty_ratio)/2 dirty data in total ... is an amount of dirty data when we start to throttle processes -- [Jan Kara, 2013](https://lore.kernel.org/lkml/20131029203050.GE9568@quack.suse.cz/)
* > Users will notice that the applications will get throttled once crossing the global (background + dirty)/2=15% threshold, and then balanced around 17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable memory -- commit 143dfe8611a6, "[writeback: IO-less balance_dirty_pages()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/page-writeback.c?id=143dfe8611a63030ce0c79419dc362f7838be557)"
* > The memory-management subsystem will, by default, try to limit dirty pages to a maximum of 15% of the memory on the system. There is a "magical function" called balance_dirty_pages() that will, if need be, throttle processes dirtying a lot of pages in order to match the rate at which pages are being dirtied and the rate at which they can be cleaned." -- [Writeback and control groups](https://lwn.net/Articles/648292/) , 2015 LWN.net.
* [balance_dirty_pages()](https://elixir.bootlin.com/linux/v4.18.16/source/mm/page-writeback.c#L1554) in Linux 4.18.16.
sourcejedi
(53232 rep)
Nov 7, 2018, 11:27 PM
• Last activity: Feb 21, 2019, 12:36 PM
2
votes
1
answers
722
views
Why does atop open a raw socket?
I installed `atop` on Debian 9. It runs as a monitoring daemon.

Why is it listening on a raw socket? Raw sockets are used to generate arbitrary IPv4 packets, or to read all packets for a given IP sub-protocol! But I don't think my `atop` and its logs show any information from reading packets. I don't even use `netatop` - and that would require a kernel module, which is not included in Debian. And I would be extremely surprised if any of the `atop` features involve *sending* raw IP packets.
$ sudo netstat -l --raw -ep
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
raw 0 0 0.0.0.0:255 0.0.0.0:* 7 root 2427667 7353/atop
$ sudo ss -l --raw -p
State Recv-Q Send-Q Local Address:Port Peer Address:Port
UNCONN 0 0 *:ipproto-255 *:* users:(("atop",pid=7353,fd=4))
sourcejedi
(53232 rep)
Nov 13, 2018, 11:54 AM
• Last activity: Nov 13, 2018, 08:31 PM