
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

1 vote
1 answer
2433 views
Understanding iostat block measurements
I am trying to understand how data is written to the disk. I'm writing data with dd using various block sizes, but it looks like the disk is always getting hit with the same size blocks, according to iostat. For example, this command should write 128K blocks:

dd if=/dev/zero of=/dev/sdb bs=128K count=300000

Trimmed output of iostat -dxm 1:

Device:  rrqm/s     wrqm/s    r/s      w/s    rMB/s   wMB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb        0.00  129897.00   0.00  1024.00    0.00   512.00   1024.00   142.09  138.81    0.00  138.81   0.98 100.00

My reading of this is that it's writing 512MBps in 1024 operations. This means each write = 512/1024 = 512K. Another way of calculating the same thing: the avgrq-sz column shows 1024 sectors. According to gdisk the sector size of this Samsung 850 Pro SSD is 512B, therefore each write is 1024 sectors * 512B = 512K.

So my question is, why is it writing 512K blocks instead of 128K as specified with dd? If I change dd to write 4M blocks, the iostat result is exactly the same. The merges number doesn't make sense to me either.

That was writing directly to the block device; but if I format it XFS and write to the filesystem, the numbers are the same except zero merges:

dd if=/dev/zero of=/mnt/ddtest bs=4M count=3000

Now iostat shows:

Device:  rrqm/s     wrqm/s    r/s      w/s    rMB/s   wMB/s  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb        0.00       0.00   0.00  1024.00    0.00   512.00   1024.00   142.31  138.92    0.00  138.92   0.98 100.00

I'm using RHEL 7.7, by the way.
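One way to cross-check where the merged request size comes from is to compare what iostat reports against the block layer's limits for the device; a minimal sketch, assuming the disk is /dev/sdb:

# avgrq-sz is reported in 512-byte sectors, so 1024 sectors = 512 KiB per request
$ iostat -dxm sdb 1
# largest request the block layer will currently build for this device, in KiB
$ cat /sys/block/sdb/queue/max_sectors_kb
# hardware/driver ceiling for the same limit
$ cat /sys/block/sdb/queue/max_hw_sectors_kb

If max_sectors_kb reads 512, the writeback path is simply merging the buffered 128K (or 4M) writes up to that ceiling before they reach the disk, which would also explain the large wrqm/s figure.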
Elliott B (575 rep)
Sep 10, 2019, 08:38 AM • Last activity: May 14, 2025, 01:11 AM
5 votes
4 answers
2882 views
iostat: avoid displaying loop devices information
Given the annoying feature of snap's loop devices, my iostat output in an Ubuntu 18.04.02 is kind of like this. Is there a way to filter out loop devices other than | grep -v loop?

$ iostat -xm
Linux 4.15.0-47-generic (pkara-pc01)  04/22/2019  _x86_64_  (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          19.85    0.03    5.64    2.18    0.00   72.30

Device    r/s     w/s   rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0    0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00    7.00    0.00   0.00     2.88     0.00   0.50   0.00
loop1    0.06    0.00    0.00    0.00    0.00    0.00   0.00   0.00    1.25    0.00   0.00     1.80     0.00   0.09   0.00
loop2    0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00    7.47    0.00   0.00     6.57     0.00   1.06   0.00
loop3    0.00    0.00    0.00    0.00    0.00    0.00   0.00   0.00    2.40    0.00   0.00     2.50     0.00   0.00   0.00
loop4    0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00    8.54    0.00   0.00     2.44     0.00   1.54   0.00
loop5    0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   10.29    0.00   0.00     2.86     0.00   0.76   0.00
loop6    0.05    0.00    0.00    0.00    0.00    0.00   0.00   0.00    1.55    0.00   0.00     1.89     0.00   0.17   0.00
loop7    0.06    0.00    0.00    0.00    0.00    0.00   0.00   0.00    1.39    0.00   0.00     1.75     0.00   0.24   0.00
sda      0.07    0.00    0.00    0.00    0.00    0.00   5.32   9.09    9.29   41.60   0.00    12.47     3.20   9.51   0.07
sdb     23.40  103.81    0.40   38.12    8.47   10.16  26.59   8.91    8.48    6.17   0.84    17.49   376.07   0.94  11.93
dm-0    31.96  113.83    0.40   38.08    0.00    0.00   0.00   0.00   12.17   10.89   1.63    12.74   342.57   0.82  12.00
dm-1    31.91  113.30    0.40   38.08    0.00    0.00   0.00   0.00   12.19   10.95   1.63    12.74   344.17   0.83  12.04
dm-2     0.02    0.00    0.00    0.00    0.00    0.00   0.00   0.00    0.51    0.00   0.00    19.90     0.00   0.41   0.00
dm-3     0.05    0.00    0.00    0.00    0.00    0.00   0.00   0.00    9.82   58.91   0.00    10.21     2.91  10.02   0.05
dm-4     0.03    0.00    0.00    0.00    0.00    0.00   0.00   0.00   14.23   64.80   0.00    14.69     3.20  14.35   0.05
loop8    0.02    0.00    0.00    0.00    0.00    0.00   0.00   0.00   10.54    0.00   0.00     9.13     0.00   2.67   0.01
loop9    0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   11.83    0.00   0.00     2.66     0.00   1.79   0.00
loop10   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   28.96    0.00   0.00    20.96     0.00   3.20   0.00
loop11   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   31.20    0.00   0.00    20.80     0.00   4.80   0.00
loop12   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   18.56    0.00   0.00     9.28     0.00   1.56   0.00
loop13   0.02    0.00    0.00    0.00    0.00    0.00   0.00   0.00   11.77    0.00   0.00     9.36     0.00   2.12   0.00
loop14   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   13.65    0.00   0.00     9.76     0.00   0.71   0.00
loop15   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   30.40    0.00   0.00    20.96     0.00   4.08   0.00
loop16   0.04    0.00    0.00    0.00    0.00    0.00   0.00   0.00    3.85    0.00   0.00     5.22     0.00   0.49   0.00
loop17   0.03    0.00    0.00    0.00    0.00    0.00   0.00   0.00    5.23    0.00   0.00     2.48     0.00   1.00   0.00
loop18   0.03    0.00    0.00    0.00    0.00    0.00   0.00   0.00    4.66    0.00   0.00     2.50     0.00   0.70   0.00
loop19   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   14.86    0.00   0.00     6.27     0.00   3.29   0.00
loop20   0.02    0.00    0.00    0.00    0.00    0.00   0.00   0.00   12.48    0.00   0.00     9.15     0.00   1.65   0.00
loop21   0.02    0.00    0.00    0.00    0.00    0.00   0.00   0.00    8.66    0.00   0.00     9.83     0.00   1.29   0.00
loop22   0.04    0.00    0.00    0.00    0.00    0.00   0.00   0.00    5.10    0.00   0.00     5.09     0.00   0.70   0.00
loop23   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   16.32    0.00   0.00     3.05     0.00   1.16   0.00
loop24   0.00    0.00    0.00    0.00    0.00    0.00   0.00   0.00   20.60    0.00   0.00     2.50     0.00   4.20   0.00
loop25   0.01    0.00    0.00    0.00    0.00    0.00   0.00   0.00   16.30    0.00   0.00     2.44     0.00   2.37   0.00
loop26   0.00    0.00    0.00    0.00    0.00    0.00   0.00   0.00    0.00    0.00   0.00     1.60     0.00   0.00   0.00
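For reference, two ways to narrow this down without post-filtering the whole report; a sketch, assuming the devices of interest are known by name (iostat accepts an explicit device list):

# only report the named devices
$ iostat -xm sda sdb dm-0 dm-1
# or drop the loop rows from the full report (anchored so it only matches device names)
$ iostat -xm | grep -Ev '^loop'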
pkaramol (3109 rep)
Apr 22, 2019, 07:34 AM • Last activity: Apr 13, 2025, 06:34 PM
1 vote
1 answer
61 views
Linux kernel phantom reads
Why, if I write to a raw hard disk (without a FS), does the kernel also make reads?

$ sudo dd if=/dev/zero of=/dev/sda bs=32k count=1 oflag=direct status=none
$ iostat -xc 1 /dev/sda | grep -E "Device|sda"
Device    r/s   w/s    rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda     45,54  0,99  1053,47   31,68    0,00    0,00   0,00   0,00    1,17 3071,00   3,04    23,13    32,00  66,38 308,91

Is it readahead? Instead of dd I wrote a C program that does the same; I even used posix_fadvise to hint the kernel that I do not want readahead.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#define _GNU_SOURCE
#define BLOCKSIZE 512
#define COUNT 32768

int main(void)
{
    // read COUNT bytes from /dev/zero
    int fd;
    mode_t mode = O_RDONLY;
    char *filename = "/dev/zero";
    fd = openat(AT_FDCWD, filename, mode);
    if (fd < 0) {
        perror("Creation error");
        exit(1);
    }
    void *pbuf;
    posix_memalign(&pbuf, BLOCKSIZE, COUNT);
    size_t a;
    a = COUNT;
    ssize_t ret;
    ret = read(fd, pbuf, a);
    if (ret < 0) {
        perror("read error");
        exit(1);
    }
    close(fd);

    // write COUNT bytes to /dev/sda
    int f = open("/dev/sda", O_WRONLY|__O_DIRECT);
    ret = posix_fadvise(f, 0, COUNT, POSIX_FADV_NOREUSE);
    if (ret < 0)
        perror("posix_fadvise");
    ret = write(f, pbuf, COUNT);
    if (ret < 0) {
        perror("write error");
        exit(1);
    }
    close(f);
    free(pbuf);
    return 0;
}

But the result is the same:

$ iostat -xc 1 /dev/sda | grep -E "Device|sda"
Device    r/s   w/s    rkB/s   wkB/s r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda     46,00  1,00  1064,00   32,00   10,78    1,00   0,43    23,13    32,00  10,55  49,60

It does not matter if it is a spindle disk or SSD, the result is the same. Also tried different kernels.
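If readahead is the suspicion, it can be inspected and temporarily disabled on the block device before repeating the test; a quick sketch, assuming the target is still /dev/sda:

# current readahead window, in 512-byte sectors
$ sudo blockdev --getra /dev/sda
# the same setting expressed in KiB
$ cat /sys/block/sda/queue/read_ahead_kb
# disable readahead, rerun the write test, then restore the old value
$ sudo blockdev --setra 0 /dev/sda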
Alex (923 rep)
Jun 18, 2024, 10:53 AM • Last activity: Jun 20, 2024, 03:22 PM
0 votes
1 answer
236 views
Is there iostat-similar tool that tracks swap area activity and page cache miss?
The iostat tool is able to tell us cpu usage, disk r/w throughput second-by-second. Is there a similar tool to track swap area activity and page cache miss? For example, the tool should tell us the magnitude of swap area activity and number of page cache miss second-by-second.
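A few standard tools cover this second by second; a sketch (the cachestat tool is part of bcc and its name/path varies by distro):

# si/so columns: pages swapped in/out per second
$ vmstat 1
# pswpin/s and pswpout/s: swap activity per second
$ sar -W 1
# paging statistics, including major faults (majflt/s), per second
$ sar -B 1
# page cache hit/miss counts per interval (bcc tool)
$ sudo /usr/share/bcc/tools/cachestat 1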
Dachuan Huang (21 rep)
Jun 2, 2023, 04:32 AM • Last activity: Jun 2, 2023, 08:16 AM
0 votes
2 answers
2067 views
In iostat, why are kB_wrtn/s and kB_wrtn the same?
/dev/sdc is a SATA hard drive. Do the kB_read and kB_wrtn fields sometimes, in some situations, show total counts? Here it seems to be just the same as the per second value.

- Linux kernel 5.4.0-26-generic
- sysstat version 12.2.0

iostat -dz 1
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sdc              40.00         0.00        21.00         0.00          0         21          0


Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0              6.00         0.00        24.00         0.00          0         24          0
sdc              42.00         0.00        42.50         0.00          0         42          0


Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
dm-0              5.00         0.00        20.00         0.00          0         20          0
sdc              43.00         0.00        36.00         0.00          0         36          0


Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sdc              48.00         0.00        25.00         0.00          0         25          0


Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sdc              36.00         0.00        18.50         0.00          0         18          0


Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sdc              40.00         0.00        21.00         0.00          0         21          0
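In the output above the interval is one second, and for every report after the first the kB_read/kB_wrtn columns are totals for that interval, so a per-interval total and a per-second rate are numerically the same thing (up to rounding, e.g. 42.50 kB_wrtn/s vs 42 kB_wrtn). A longer interval makes the distinction visible; a quick check, assuming the same device:

# with a 5-second interval, kB_wrtn should be roughly 5x kB_wrtn/s
$ iostat -dz sdc 5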
brendan (195 rep)
Mar 28, 2021, 12:08 PM • Last activity: Mar 27, 2023, 01:12 PM
4 votes
2 answers
2003 views
Why doesn't my IOstat change its output at all?
My IOstat doesn't change...at all. It'll show a change in blocks being read and written, but it doesn't change at all when it comes to blocks/kB/MB read and written. When the server sits idle...it shows 363kB_read/s, 537kB_wrtn/s. If I put it under heavy load...it says the same thing. Is it bugged out? How do I fix it? Using CentOS 6, used as a primary MySQL server.
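One thing worth ruling out: the first report iostat prints covers the whole time since boot, so it barely moves regardless of current load. A sketch of how to watch per-interval figures instead (the -y option needs a reasonably recent sysstat; on older versions, just ignore the first report):

# print a new report every second; each report after the first covers only the previous interval
$ iostat -dmx 1
# on newer sysstat, skip the since-boot report entirely
$ iostat -y -dmx 1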
Sqenix (43 rep)
Mar 18, 2015, 06:03 PM • Last activity: Feb 9, 2023, 11:36 PM
0 votes
0 answers
974 views
iostat returns disk utilization greater than 100% while profiling a Beaglebone Black board
I need to profile the performance of software running on a BeagleBone Black (BBB). The BBB has an ARM Cortex-A8 up to 1GHz frequency, 512MB RAM, and 4GB eMMC onboard flash storage. You can find more information here: https://beagleboard.org/black

The BBB runs Debian bullseye booted from a 14GB MicroSD:

debian@BeagleBone:~$ uname -a
Linux BeagleBone 5.10.109-ti-r45 #1bullseye SMP PREEMPT Fri May 6 16:59:02 UTC 2022 armv7l GNU/Linux

## Problem

As a first trial, I'm running iostat in parallel with dd:

dd if=/dev/urandom of=~debian/ddtest/200MBfile bs=1M count=200 &
iostat -xdz 1 20

I don't understand why iostat returns utilization values greater than 100%. I'm getting the metrics each second by specifying 1 as argument on the command line. This is an excerpt of what I see in the terminal:

Device    r/s   rkB/s  rrqm/s  %rrqm r_await rareq-sz    w/s     wkB/s  wrqm/s  %wrqm  w_await wareq-sz   d/s  dkB/s  drqm/s  %drqm d_await dareq-sz   f/s f_await  aqu-sz  %util
mmcblk0  0.00    0.00    0.00   0.00    0.00     0.00  18.00   9216.00    0.00   0.00  1062.00   512.00  0.00   0.00    0.00   0.00    0.00     0.00  0.00    0.00   19.12  92.80

Device    r/s   rkB/s  rrqm/s  %rrqm r_await rareq-sz    w/s     wkB/s  wrqm/s  %wrqm  w_await wareq-sz   d/s  dkB/s  drqm/s  %drqm d_await dareq-sz   f/s f_await  aqu-sz  %util
mmcblk0  0.00    0.00    0.00   0.00    0.00     0.00  16.00   8192.00    7.00  30.43  2058.25   512.00  0.00   0.00    0.00   0.00    0.00     0.00  0.00    0.00   32.93 101.20

Device    r/s   rkB/s  rrqm/s  %rrqm r_await rareq-sz    w/s     wkB/s  wrqm/s  %wrqm  w_await wareq-sz   d/s  dkB/s  drqm/s  %drqm d_await dareq-sz   f/s f_await  aqu-sz  %util
mmcblk0  0.00    0.00    0.00   0.00    0.00     0.00  25.00  12800.00    0.00   0.00  2295.64   512.00  0.00   0.00    0.00   0.00    0.00     0.00  0.00    0.00   57.39 103.60

Device    r/s   rkB/s  rrqm/s  %rrqm r_await rareq-sz    w/s     wkB/s  wrqm/s  %wrqm  w_await wareq-sz   d/s  dkB/s  drqm/s  %drqm d_await dareq-sz   f/s f_await  aqu-sz  %util
mmcblk0  0.00    0.00    0.00   0.00    0.00     0.00  33.00  15908.00    0.00   0.00  1136.58   482.06  0.00   0.00    0.00   0.00    0.00     0.00  0.00    0.00   37.51  89.60

mmcblk0 is the name of the 14GB memory (as stated by lsblk). I downloaded sysstat, namely the package including iostat, directly from https://packages.debian.org/stable/sysstat and installed version 12.5.2-2:

sudo apt list | grep sysstat
sysstat/stable,now 12.5.2-2 armhf [residual-config]

I checked the source code of sysstat and saw that the utilization is calculated at line 381 of rd_stats.c:

xds->util = S_VALUE(sdp->tot_ticks, sdc->tot_ticks, itc);

S_VALUE is a macro defined at line 154 of common.h:

#define S_VALUE(m,n,p) (((double) ((n) - (m))) / (p) * 100)

Each second iostat reads the ms spent doing I/O from /proc/diskstats. The variable sdc->tot_ticks represents the last value read, sdp->tot_ticks the previous sampled value, while itc is the sampling interval we set from the command line (i.e., one second). I can't understand why iostat returns values greater than 100%. I noticed that the time spent doing I/O (sdc->tot_ticks - sdp->tot_ticks) is often greater than itc. My guess is that iostat performs disk operations, that dd is preempted by the scheduler between two /proc/diskstats samplings, or that there are some processes running in parallel.

## Experiments

I did some experiments, but I still don't get the source of the problem.

With iotop, I checked what processes were running concurrently with dd. I found a few journaling processes, i.e., jbd2 and systemd-journald. They do not affect disk utilization, since the 10th field of /proc/diskstats records the time queues and disks were busy, taking concurrency into account (https://serverfault.com/questions/862334/interpreting-read-write-and-total-io-time-in-proc-diskstats).

I made a trivial bash script (attached to this paragraph) that mimics the behavior of iostat. It retrieves the 10th field from /proc/diskstats and calculates the utilization given an observation time. I set an observation period of 1 second, the same as my first attempt with iostat, and obtained a utilization of more than 100%. I believe that iostat is not the problem, as confirmed by this issue: https://github.com/sysstat/sysstat/issues/73#issuecomment-860158402

Using the bash script, I got even higher values than those received from iostat. I believe this is due to /proc/diskstats readings or BBB performance extending the execution time of the script (or iostat).
#!/bin/bash
for (( i=0; i<$2; i++ ));
do
  value=$(cat "/proc/diskstats" | awk '/ mmcblk0 / {print $13}')
  if [ ! -z "$prev" ]; then
    bc -l <<< "scale=4;(($value - $prev)/($1*1000))*100"
  fi
  sleep $1
  prev=$value
done
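If the suspicion is that the sampling loop itself takes longer than the nominal interval (script overhead plus scheduling delays on the board), a variant that divides by the measured wall-clock gap instead of the fixed interval avoids that bias. A sketch under that assumption, still hard-coding mmcblk0 and relying on GNU date for nanosecond timestamps:

#!/bin/bash
# usage: ./util.sh <interval-seconds> <iterations>
dev=mmcblk0
prev_ticks=""
prev_ns=""
for (( i=0; i<$2; i++ )); do
  # field 13 of /proc/diskstats = ms spent doing I/O (the 10th stats field)
  ticks=$(awk -v d="$dev" '$3 == d {print $13}' /proc/diskstats)
  now_ns=$(date +%s%N)
  if [ -n "$prev_ticks" ]; then
    # busy milliseconds divided by actually elapsed milliseconds, as a percentage
    bc -l <<< "scale=4; ($ticks - $prev_ticks) / (($now_ns - $prev_ns) / 1000000) * 100"
  fi
  prev_ticks=$ticks
  prev_ns=$now_ns
  sleep "$1"
done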
I observed that running dd with the oflag=sync option decreases disk utilization. Also, journaling processes are not executed at the same time as dd but after it. This flag blocks the writing process until the data is actually written to the device. This is the output of perf recording who makes block I/O requests:

sudo perf record -e block:block_rq_insert -a dd if=/dev/urandom of=~debian/ddtest/500MBfile bs=1M count=200

[screenshot: perf output without oflag=sync]

[screenshot: perf output with oflag=sync]

I'm missing how the kernel updates /proc/diskstats. I hope someone more experienced than me on the platform can help me understand the problem. Thank you.
vnzstc (1 rep)
Jul 8, 2022, 06:31 PM • Last activity: Jul 18, 2022, 12:25 PM
0 votes
0 answers
101 views
Our MySQL is read heavy, but iostat reports that almost no reads are taking place. How come?
According to MySQL's STATUS command, we have about 500 reads and ~20 writes to our DB per second. But iostat is reporting that ~70 writes (w/s) and ~0.5 reads (r/s) are taking place on the corresponding device. Why isn't iostat showing all the activity that should be caused by the SELECTs? Does this mean that they are hitting some cache and that's why we are not seeing them? If this is the case, how can I tell? (The filesystem the DB is on is a BBU RAID 10 with SSD disks.)
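If the working set fits in the InnoDB buffer pool, most SELECTs are served from memory and never generate block-device reads. One way to check the ratio of logical reads to reads that actually hit disk, assuming InnoDB tables:

# Innodb_buffer_pool_read_requests = logical reads,
# Innodb_buffer_pool_reads = reads that had to go to disk
$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"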
edmz (103 rep)
Apr 13, 2022, 10:59 PM
3 votes
1 answer
3834 views
Disk io stat “averaged” over a period of time
I am using the iostat utility on my RedHat Linux server to monitor the performance of a disk. When I use "iostat -xd sdh 1", I get the perf result printed every second. When I use "iostat -xd sdh 5", I get the perf result printed every five seconds. My feeling is that the latter command is printing a snapshot of the perf every five seconds, rather than averaging over the past 5 seconds. Am I correct in my understanding? If so, is there a way I can make iostat print the perf number averaged over n seconds, or is there some other utility that will do that? Currently, the perf number is fluctuating within a range, and I want to get a somewhat "stable" number. I am hoping that averaging over a period of time will give me such a number. Thank you, Ahmed.
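For what it's worth, iostat derives each report from the cumulative counters in /proc/diskstats, so every report after the first is already an average over the preceding interval, not an instantaneous snapshot. A longer interval therefore gives a smoother number:

# one since-boot report, then one report averaged over a full 60 seconds
$ iostat -xd sdh 60 2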
Ahmed (31 rep)
Aug 20, 2018, 08:43 PM • Last activity: Dec 8, 2021, 09:01 AM
0 votes
0 answers
221 views
How can I monitor whether disk activity is sychronous or asynchronous?
My Google-fu simply cannot find an answer to this. If I have a process with heavy I/O activity, how can I check whether it's using asynchronous or synchronous writes? (I want this information to decide whether to add a SLOG to my ZFS pool.)
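A ZFS SLOG only helps synchronous writes, so the useful question is how often the workload calls the sync-family syscalls (or opens files with O_SYNC/O_DSYNC). One way to count them while the workload runs, assuming the PID of the writing process is known (<PID> is a placeholder):

# tally sync-related syscalls over a representative window, then Ctrl-C for the summary
$ sudo strace -f -c -e trace=fsync,fdatasync,sync,sync_file_range,msync -p <PID>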
sssheridan (193 rep)
Jul 27, 2021, 09:35 AM
1 vote
0 answers
285 views
degraded iops and throughput on a linux machine in a cluster
We have a Linux-based cluster on AWS with 8 workers. The OS version (taken from /proc/version) is:

Linux version 5.4.0-1029-aws (buildd@lcy01-amd64-021) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #30~18.04.1-Ubuntu SMP Tue Oct 20 11:09:25 UTC 2020

Worker id 5 was added recently, and the problem that we see is that during times of high disk util% due to a burst of writes into the workers, the disk mounted to that worker's data dir (/dev/nvme1n1p1) shows degraded performance in terms of w/sec and wMB/sec, which are much lower on that worker compared to the other 7 workers (~40% less iops and throughput on that broker). The data in this table was taken from running iostat -x on all the brokers, starting at the same time and ending after 3 hours during peak time. The cluster handles ~2M messages/sec.

Another strange behavior is that broker id 7 has ~40% more iops and throughput during bursts of writes compared to the other brokers. The worker type is i3en.3xlarge with one 7.5TB NVMe SSD.

**Any idea as to what can cause such degraded performance in worker id 5 (or such a good performance on broker id 7)?**

This issue is causing the consumers from this cluster to lag during high writes, because worker id 5 gets into high iowait; and if some lagging consumer performs reads from the disk, the iowait on worker id 5 climbs to ~70%, all consumers start to lag, and the producers get OOM due to buffered messages that the broker doesn't accept.

[table: iops & throughput of all workers in the cluster (taken from iostat -x)]
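Since the workers are meant to be identical, one low-level comparison worth running on a slow worker (id 5), a normal one, and the fast one (id 7) is the drive's own health counters and the block-layer settings; a sketch, assuming nvme-cli is installed and the data disk is /dev/nvme1n1:

# drive-reported wear, media errors and temperature (thermal throttling shows up here)
$ sudo nvme smart-log /dev/nvme1n1
# block-layer settings that should match across workers
$ cat /sys/block/nvme1n1/queue/scheduler
$ cat /sys/block/nvme1n1/queue/nr_requests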
Elad Eldor (11 rep)
May 13, 2021, 04:11 PM • Last activity: May 15, 2021, 05:02 PM
2 votes
0 answers
769 views
Process shows as 100% I/O bound while producing minimal disk activity, disk util is at 100%
We are having a quite strange problem. There is a program (a cryptocurrency node, to be precise) which has a local database of all the transactions ever made. The database is huge - around 15 TB. The problem is that the program won't synchronize with the network, though it has enough peers, and knowledge about new and old blocks is not a problem.

Now the strange part - I have started the same program from scratch, without that history of 15TB, and it started syncing immediately, loading the disk by about 50% per iostat. CPU and memory utilization are negligible. Absolute figures are:

- Read speed: 5MB/s
- Write speed: 20MB/s
- iotop - 20% on average for this process

When I switch to the historical DB (15TB), iostat shows 100% disk utilization, iotop shows multiple forked processes with the majority of them sitting at 99% of I/O, but actual I/O is not happening judging by the volume reported by iotop or iostat. Both read and write are within 1MB/s.

This is running on an MS Azure VM. Through the Azure portal we see that disk utilization is around 1% in "full" mode and writing is around 20% in "fresh" mode, so throttling by the cloud operator is not an issue either.

Now the question - how do I diagnose what exactly the program is doing with the disk? I was thinking about random I/O, tried to strace the lseek function, got some for both fresh and full modes, with a much lower ratio in full mode, while I expected the opposite. What does it do in full mode then? The program has a quite bearable number of file descriptors (/proc/<pid>/fd), below 50 together with peer TCP connections.

How can it be, in general, that both iostat and iotop show 100% utilization with no actual consuming of I/O bandwidth? We even had a call with an engineer from Microsoft; he said that iostat may be not accurate, especially with SSDs. Might be, but when it says util is 100%, iotop confirms it, and the program is not doing what it is supposed to do, so what is an alternative explanation?
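To see where the time goes rather than how much data moves, per-process I/O accounting and the kernel stack the process is blocked in are usually more telling than throughput counters; a sketch, with <PID> as a placeholder for the node process:

# per-process read/write rates (newer sysstat versions also show an iodelay column)
$ sudo pidstat -d 1 -p <PID>
# where in the kernel the task is currently sleeping
$ sudo cat /proc/<PID>/stack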
DimaA6_ABC (121 rep)
Jan 23, 2021, 02:14 PM
0 votes
2 answers
698 views
How do I get the linux kernel to track io stats to a block device I create in a loadable module?
I've been looking and looking and everybody explains the /proc/diskstats file, but nobody seems to explain where that data comes from. I found this comment:
Just remember that /proc/diskstats is tracking the kernel’s read requests–not yours.
on this page:
https://kevinclosson.net/2018/10/09/no-proc-diskstats-does-not-track-your-physical-i-o-requests/
But basically my problem is that I've got a kernel module that creates a block device and handles requests via a request handler set via blk_queue_make_request, not blk_init_queue; just like dm, I don't want the kernel to queue requests for me. Everything works fine, but nothing shows up in /proc/diskstats. What bit of magic am I missing to get my stats in there so they will show up in iostat? I assumed the kernel would be tallying this information since it's handling the requests to the kernel module, but apparently not. Or I'm missing a flag somewhere or something. Any ideas?
stu (143 rep)
Sep 29, 2020, 11:35 AM • Last activity: Nov 10, 2020, 04:59 PM
4 votes
1 answer
3316 views
Understanding iostat with Linux software RAID
I'm trying to understand what I see in iostat, specifically the differences between the output for md and sd devices. I have a couple of quite large CentOS Linux servers, each with an E3-1230 CPU, 16 GB RAM and 4 2TB SATA disk drives. Most are JBOD, but one is configured with software RAID 1+0. The servers have a very similar type and amount of load, but the %util figures I get with iostat on the software RAID one are much higher than on the others, and I'm trying to understand why. All servers are usually 80-90% idle with regard to CPU. *Example of iostat on a server without RAID:*
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.26    0.19    1.15    2.55    0.00   86.84

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               2.48     9.45   10.45   13.08  1977.55  1494.06   147.50     2.37  100.61   3.86   9.08
sdc               4.38    24.11   13.25   20.69  1526.18  1289.87    82.97     1.40   41.14   3.94  13.36
sdd               0.06     1.28    1.43    2.50   324.67   587.49   232.32     0.45  113.73   2.77   1.09
sda               0.28     1.06    1.33    0.97   100.89    61.63    70.45     0.06   27.14   2.46   0.57
dm-0              0.00     0.00    0.17    0.24     4.49     1.96    15.96     0.01   18.09   3.38   0.14
dm-1              0.00     0.00    0.09    0.12     0.74     0.99     8.00     0.00    4.65   0.36   0.01
dm-2              0.00     0.00    1.49    3.34   324.67   587.49   188.75     0.45   93.64   2.25   1.09
dm-3              0.00     0.00   17.73   42.82  1526.17  1289.87    46.50     0.35    5.72   2.21  13.36
dm-4              0.00     0.00    0.11    0.03     0.88     0.79    12.17     0.00   19.48   0.87   0.01
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00    1.17   1.17   0.00
dm-6              0.00     0.00   12.87   20.44  1976.66  1493.27   104.17     2.77   83.01   2.73   9.08
dm-7              0.00     0.00    1.36    1.58    95.65    58.68    52.52     0.09   29.20   1.55   0.46
*Example of iostat on a server with RAID 1+0:*
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.55    0.25    1.01    3.35    0.00   87.84

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb              42.21    31.78   18.47   59.18  8202.18  2040.94   131.91     2.07   26.65   4.02  31.20
sdc              44.93    27.92   18.96   55.88  8570.70  1978.15   140.94     2.21   29.48   4.60  34.45
sdd              45.75    28.69   14.52   55.10  8093.17  1978.16   144.66     0.21    2.95   3.94  27.42
sda              45.05    32.59   18.22   58.37  8471.04  2040.93   137.24     1.57   20.56   5.04  38.59
md1               0.00     0.00   18.17  162.73  3898.45  4013.90    43.74     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     4.89     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.07    0.26     3.30     2.13    16.85     0.04  135.54  73.73   2.38
dm-1              0.00     0.00    0.25    0.22     2.04     1.79     8.00     0.24  500.99  11.64   0.56
dm-2              0.00     0.00   15.55  150.63  2136.73  1712.31    23.16     1.77   10.66   2.93  48.76
dm-3              0.00     0.00    2.31    2.37  1756.39  2297.67   867.42     2.30  492.30  13.08   6.11
So my questions are:

1) Why is there such a relatively high %util on the server with RAID vs the one without?

2) On the non-RAID server the %util of the combined physical devices (sd*) is more or less the same as that of the combined LVM devices (dm-*). Why is that not the case for the RAID server?

3) Why does it seem like the software RAID devices (md*) are virtually idle, while the underlying physical devices (sd*) are busy? My first thought was that it might be caused by RAID checking, but /proc/mdstat shows all good.

Edit: Apologies, I thought the question was clear, but it seems there is some confusion about it. Obviously the question is not about the difference in %util between drives on one server, but why the total/avg %util value on one server is so different from the other. Hope that clarifies any misunderstanding.
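Regarding point 3 and the RAID-check theory, the array's current sync activity can be confirmed directly; a quick check, assuming the array is md1 as in the output above:

# any running resync/check/recover shows up here with a progress bar
$ cat /proc/mdstat
# "idle" means no check or resync is in progress
$ cat /sys/block/md1/md/sync_action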
Dokbua (209 rep)
Jul 10, 2014, 02:00 PM • Last activity: May 22, 2020, 09:02 AM
6 votes
1 answer
9275 views
Why is the size of my IO requests being limited, to about 512K?
I read /dev/sda using a 1MiB block size. Linux seems to limit the IO requests to an average size of 512KiB. What is happening here? Is there a configuration option for this behaviour?
$ sudo dd iflag=direct if=/dev/sda bs=1M of=/dev/null status=progress
1545601024 bytes (1.5 GB, 1.4 GiB) copied, 10 s, 155 MB/s
1521+0 records in
1520+0 records out
...
While my dd command is running, rareq-sz is 512.

> rareq-sz
> The average size (in kilobytes) of the read requests that were issued to the device.
>
> -- [man iostat](http://man7.org/linux/man-pages/man1/iostat.1.html)
$ iostat -d -x 3
...
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda            309.00    0.00 158149.33      0.00     0.00     0.00   0.00   0.00    5.24    0.00   1.42   511.81     0.00   1.11  34.27
dm-0             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-2             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
dm-3             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
...
The kernel version is 5.1.15-300.fc30.x86_64. max_sectors_kb is 1280.
$ cd /sys/class/block/sda/queue
$ grep -H . max_sectors_kb max_hw_sectors_kb max_segments max_segment_size optimal_io_size logical_block_size chunk_sectors
max_sectors_kb:1280
max_hw_sectors_kb:32767
max_segments:168
max_segment_size:65536
optimal_io_size:0
logical_block_size:512
chunk_sectors:0
By default I use the BFQ I/O scheduler. I also tried repeating the test after echo 0 | sudo tee wbt_lat_usec. I also then tried repeating the test after echo mq-deadline|sudo tee scheduler. The results remained the same. Apart from WBT, I used the default settings for both I/O schedulers. E.g. for mq-deadline, iosched/read_expire is 500, which is equivalent to half a second. During the last test (mq-deadline, WBT disabled), I ran btrace /dev/sda. It shows all the requests were split into two unequal halves:
8,0    0     3090     5.516361551 15201  Q   R 6496256 + 2048 [dd]
  8,0    0     3091     5.516370559 15201  X   R 6496256 / 6497600 [dd]
  8,0    0     3092     5.516374414 15201  G   R 6496256 + 1344 [dd]
  8,0    0     3093     5.516376502 15201  I   R 6496256 + 1344 [dd]
  8,0    0     3094     5.516388293 15201  G   R 6497600 + 704 [dd]
  8,0    0     3095     5.516388891 15201  I   R 6497600 + 704 [dd]
  8,0    0     3096     5.516400193   733  D   R 6496256 + 1344 [kworker/0:1H]
  8,0    0     3097     5.516427886   733  D   R 6497600 + 704 [kworker/0:1H]
  8,0    0     3098     5.521033332     0  C   R 6496256 + 1344 
  8,0    0     3099     5.523001591     0  C   R 6497600 + 704
> X -- split
> On [software] raid or device mapper setups, an incoming i/o may straddle a device or internal zone and needs to be chopped up into smaller pieces for service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm is notably bad at this and will clone lots of i/o.
>
> -- [man blkparse](http://man7.org/linux/man-pages/man1/blkparse.1.html)

## Things to ignore in iostat

Ignore the %util number. It is broken in this version. (https://unix.stackexchange.com/questions/517132/dd-is-running-at-full-speed-but-i-only-see-20-disk-utilization-why/517219#517219) I *thought* aqu-sz is also affected [due to being based on %util](https://utcc.utoronto.ca/~cks/space/blog/linux/DiskIOStats), although I thought that meant it would be about three times too large here (100/34.27).

Ignore the svctm number. "Warning! Do not trust this field any more. This field will be removed in a future sysstat version."
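One arithmetic observation that may be relevant, given the sysfs values quoted above: if the 1MiB buffer ends up mapped as physically non-contiguous 4KiB pages, the max_segments limit alone caps a single request at 168 * 4KiB = 672KiB = 1344 sectors, which is exactly where every request in the btrace output is split (+ 1344, with the remaining 704 sectors issued as a second request). This is only a hypothesis to check, not a confirmed explanation; the numbers can be reproduced like this:

# 168 segments * 4096-byte pages = 688128 bytes = 672 KiB = 1344 sectors
$ cat /sys/class/block/sda/queue/max_segments
$ getconf PAGESIZE
$ echo $((168 * 4096 / 512))    # -> 1344, matching the btrace split point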
sourcejedi (53222 rep)
Jul 11, 2019, 10:51 AM • Last activity: Dec 18, 2019, 06:47 AM
2 votes
1 answer
1704 views
Does IOSTATS show output since boot or since last execution?
I see conflicting information online about the use of iostat. In particular, I would like to be able to show an average since boot. Based on information I have read, if I have never issued the iostat command, it will show the average since boot. But if at some point I have issued an iostat command, the next execution will not be since boot, but rather since the last execution. How do I execute iostat since boot, assuming I have already run it once before?
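For reference, iostat keeps no state between invocations: it reads the cumulative counters in /proc/diskstats each time, so the first report of every run is always the since-boot average, regardless of earlier runs. That can be checked directly:

# both runs report averages since boot, not since the previous run
$ iostat -d
$ iostat -d
# with an interval, only reports after the first cover the interval itself
$ iostat -d 5 2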
barrypicker (157 rep)
Dec 9, 2019, 05:42 PM • Last activity: Dec 9, 2019, 06:02 PM
0 votes
1 answer
477 views
How to find which disk is being written to/read from in an LSI HW RAID logical volume?
On this system there is a lot of "await" which is causing slow response. I need to find out which disk behind the LSI logical volume is slowing it down... IBM blade, 2 HDD with LSI RAID (LV simple mirror) on top. On top of that is RHEL LVM.

..devices...

# lsscsi
[0:0:0:0]  disk  IBM-ESXS  ST9146852SS     B62C  -
[0:0:1:0]  disk  IBM-ESXS  ST9146852SS     B62C  -
[0:1:3:0]  disk  LSILOGIC  Logical Volume  3000  /dev/sda

...RHEL LVM...

# pvdisplay -v
Scanning for physical volume names
--- Physical volume ---
PV Name            /dev/sda2
VG Name            VolGroup00
PV Size            135.48 GB / not usable 13.20 MB
Allocatable        yes
PE Size (KByte)    32768

disk latency... so for this iostat output, how do I know which device in the LSI LV is causing delays?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.19    0.00    0.16    8.41    0.00   91.25

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz    await   svctm  %util
sda        0.00    4.00  0.00  12.50    0.00  272.00    21.76     7.01  1317.20   80.00 100.00
sda1       0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.00     0.00    0.00   0.00
sda2       0.00    4.00  0.00  12.50    0.00  272.00    21.76     7.01  1317.20   80.00 100.00
dm-0       0.00    0.00  0.00   0.50    0.00    4.00     8.00    10.48 62549.00 1358.00  67.90
dm-1       0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.45     0.00    0.00  44.50
dm-2       0.00    0.00  0.00   5.50    0.00   44.00     8.00     2.85   898.00  167.64  92.20
dm-3       0.00    0.00  0.00   1.50    0.00   12.00     8.00     0.86   573.67  336.00  50.40
dm-4       0.00    0.00  0.00   0.50    0.00    4.00     8.00     2.41  5162.00 1610.00  80.50
dm-5       0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.00     0.00    0.00   0.00
Rajeev (256 rep)
Jul 18, 2019, 08:08 PM • Last activity: Jul 18, 2019, 08:54 PM
2 votes
2 answers
3729 views
NVMe disk shows 80% io utilization, partitions show 0% io utilization
I have a CentOS 7 server (kernel 3.10.0-957.12.1.el7.x86_64) with 2 NVMe disks with the following setup:

# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1         259:0    0   477G  0 disk
├─nvme0n1p1     259:2    0   511M  0 part  /boot/efi
├─nvme0n1p2     259:4    0  19.5G  0 part
│ └─md2           9:2    0  19.5G  0 raid1 /
├─nvme0n1p3     259:7    0   511M  0 part  [SWAP]
└─nvme0n1p4     259:9    0 456.4G  0 part
  └─data-data   253:0    0 912.8G  0 lvm   /data
nvme1n1         259:1    0   477G  0 disk
├─nvme1n1p1     259:3    0   511M  0 part
├─nvme1n1p2     259:5    0  19.5G  0 part
│ └─md2           9:2    0  19.5G  0 raid1 /
├─nvme1n1p3     259:6    0   511M  0 part  [SWAP]
└─nvme1n1p4     259:8    0 456.4G  0 part
  └─data-data   253:0    0 912.8G  0 lvm   /data

Our monitoring and iostat continually show nvme0n1 and nvme1n1 with 80%+ io utilization while the individual partitions have 0% io utilization and are fully available (250k iops, 1GB read/write per sec).

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.14    0.00    3.51    0.00    0.00   89.36

Device:    rrqm/s  wrqm/s   r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme1n1      0.00    0.00  0.00  50.50    0.00  222.00     8.79     0.73   0.02    0.00    0.02  14.48  73.10
nvme1n1p1    0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme1n1p2    0.00    0.00  0.00  49.50    0.00  218.00     8.81     0.00   0.02    0.00    0.02   0.01   0.05
nvme1n1p3    0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme1n1p4    0.00    0.00  0.00   1.00    0.00    4.00     8.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme0n1      0.00    0.00  0.00  49.50    0.00  218.00     8.81     0.73   0.02    0.00    0.02  14.77  73.10
nvme0n1p1    0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme0n1p2    0.00    0.00  0.00  49.50    0.00  218.00     8.81     0.00   0.02    0.00    0.02   0.01   0.05
nvme0n1p3    0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme0n1p4    0.00    0.00  0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
md2          0.00    0.00  0.00  48.50    0.00  214.00     8.82     0.00   0.00    0.00    0.00   0.00   0.00
dm-0         0.00    0.00  0.00   1.00    0.00    4.00     8.00     0.00   0.00    0.00    0.00   0.00   0.00

Any ideas what can be the root cause for such behavior? All seems to be working fine except for monitoring triggering high io alerts.
mike (105 rep)
May 7, 2019, 10:11 PM • Last activity: May 30, 2019, 10:38 AM
0 votes
1 answer
68 views
How to get Disk Stats based on different parameters
Is there any way to get disk stats grouped on certain params, like writes by size/latency and reads by size/latency? Something like:

total writes - 100
writes by size:
- < 4096 - 20
- 4096 - 16384 - 30
...

where 4096/16384 are the chunk sizes.
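The bcc/eBPF tools can produce exactly this kind of histogram from live block I/O; a sketch, assuming the bcc tools are installed (the path and names vary by distro, e.g. biolatency-bpfcc on Ubuntu):

# latency histogram of block I/O completions (Ctrl-C to print)
$ sudo /usr/share/bcc/tools/biolatency
# per-process histogram of I/O sizes (Ctrl-C to print)
$ sudo /usr/share/bcc/tools/bitesize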
Manish Gupta (101 rep)
Mar 4, 2019, 11:31 AM • Last activity: May 21, 2019, 08:48 AM
3 votes
1 answer
3718 views
IO wait time is higher than disk utilization. Isn't this impossible?
I am trying to improve my understanding, following this (so far) unanswered question: https://unix.stackexchange.com/questions/516433/possible-limiting-factor-during-upgrade-of-fedora-vm-not-disk-or-cpu-or-networ

I ran the following test load, which took 200 seconds to complete.

sudo perf trace -s time perf stat dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal

I am running this on a fairly default, straightforward install of Fedora Workstation 29. It is not a VM. The kernel version is 5.0.9-200.fc29.x86_64. The IO scheduler is mq-deadline. I use LVM, and the ext4 filesystem. I am not using any encryption on my disk or filesystem. I do not have any network filesystem mounted at all, so I am not reading or writing a network filesystem.

I have 4 "CPUs": 2 cores with 2 threads each. I have only one disk, /dev/sda, which is a SATA HDD. The HDD supports NCQ: cat /sys/class/block/sda/device/queue_depth shows 32.

vmstat 5 showed that non-idle CPU time *sometimes* rose to about one CPU, i.e. idle was as low as 75%.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

...

 1  1   3720 1600980 392948 3669196    0    0    14 30634 8769 1876  4  3 78 15  0
 1  1   3720 1600144 393128 3669884    0    0     0  3460 1406 1468  0  1 80 18  0
 0  1   3720 1599424 393356 3670532    0    0     0  6830 1416 1480  0  1 73 25  0
 0  1   3720 1598448 393700 3671108    0    0     0  7766 1420 1362  0  1 78 20  0
 0  1   3720 1597296 393940 3671616    0    0     0  5401 1374 1367  1  2 87 11  0
 0  1   3720 1596332 394312 3672416    0    0     0  1162 1647 1558  1  2 85 13  0
 3  0   3720 1652064 394564 3673540    0    0     0 17897 15406 1784  1  3 74 23  0
 0  0   3720 1972876 394600 3541064    0    0     0   522 8028 18327  3  3 84 10  0
 1  0   3720 1974152 394600 3541108    0    0     0     9  422  879  2  1 97  0  0
 0  0   3720 1974136 394600 3541120    0    0     0     0  204  455  0  0 99  0  0
(end of test)
And the "IO wait" time (the wa field under cpu in vmstat) rose as high as 25%. I think this means 100% of one CPU. But atopsar -d 5 showed the utilization of my disk did not directly match this. It was much less than 100%:
22:46:44  disk           busy read/s KB/read  writ/s KB/writ avque avserv _dsk_

...

22:49:34  sda              5%    0.4     4.0    69.5   413.0  36.9   0.68 ms
22:49:39  sda              8%    0.2    60.0   120.6    30.6  18.7   0.66 ms
22:49:44  sda              8%    0.0     0.0   136.2    16.7  20.4   0.61 ms
22:49:49  sda             10%    0.0     0.0   157.1    44.2  21.4   0.65 ms
22:49:54  sda              9%    0.0     0.0   196.4    39.3  48.0   0.47 ms
22:49:59  sda              9%    0.0     0.0   148.9    36.6  32.6   0.62 ms
22:50:04  sda             10%    0.0     0.0   137.3   130.6  37.2   0.70 ms
22:50:09  sda             11%    0.0     0.0   199.6     5.4  13.5   0.55 ms
22:50:14  sda              2%    0.0     0.0    50.2     4.5  11.8   0.32 ms
22:50:19  sda              0%    0.0     0.0     0.8    11.0  13.3   0.75 ms
(end of test)
How can "IO wait" time be higher than disk utilization? > Following is the definition taken from the sar manpage: > > %iowait: > > Percentage of time that the CPU or CPUs were idle during which the > system had an outstanding disk I/O request. > > Therefore, %iowait means that from the CPU point of view, no tasks > were runnable, but at least one I/O was in progress. iowait is simply > a form of idle time when nothing could be scheduled. The value may or > may not be useful in indicating a performance problem, but it does > tell the user that the system is idle and could have taken more work. > > https://support.hpe.com/hpsc/doc/public/display?docId=c02783994 "IO wait" is tricksy to define on multi-CPU systems. See https://unix.stackexchange.com/questions/410628/how-does-a-cpu-know-there-is-io-pending . But even if you think I was wrong to multiply the above "IO wait" figure by 4, it would still be higher than the disk utilization figure! I expect the disk utilization figure in atopsar -d (and equally in atop / sar -d / iostat -x / mxiostat.py) is calculated from one of the [kernel iostat fields](https://github.com/torvalds/linux/blob/v5.0/Documentation/iostats.txt) . The linked doc mentions "Field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure that the functions it mentions still exist in the current multi-queue block layer. --- Thanks to the perf in the test command, I can also report that dnf's fdatasync() calls were responsible for 81 out of 200 seconds of elapsed time. This evidence suggests to me that the "IO wait" figure is giving a more accurate impression than the disk utilization figure.
199.440461084 seconds time elapsed

      60.043226000 seconds user
      11.858057000 seconds sys


60.07user 12.17system 3:19.84elapsed 36%CPU (0avgtext+0avgdata 437844maxresident)k
496inputs+2562448outputs (25major+390722minor)pagefaults 0swaps

 Summary of events:

...

 dnf (6177), 2438150 events, 76.0%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   fdatasync           3157 81436.062     0.160    25.795   465.329      1.92%
   wait4                 51 43911.444     0.017   861.009 32770.388     74.93%
   poll               45188  6052.070     0.001     0.134    96.764      5.58%
   read              341567  2718.531     0.001     0.008  1372.100     50.63%
   write             120402  1616.062     0.002     0.013   155.142      9.61%
   getpid             50012   755.423     0.001     0.015   207.506     32.86%
...
sourcejedi (53222 rep)
May 2, 2019, 10:25 PM • Last activity: May 3, 2019, 03:14 PM