Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
1
votes
1
answers
2433
views
Understanding iostat block measurements
I am trying to understand how data is written to the disk. I'm writing data with dd using various block sizes, but it looks like the disk is always getting hit with the same size blocks, according to iostat. For example, this command should write 128K blocks:
dd if=/dev/zero of=/dev/sdb bs=128K count=300000
Trimmed output of iostat -dxm 1:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 129897.00 0.00 1024.00 0.00 512.00 1024.00 142.09 138.81 0.00 138.81 0.98 100.00
My reading of this is that it's writing 512 MB/s in 1024 operations per second, which means each write = 512/1024 = 512K. Another way of calculating the same thing: the avgrq-sz column shows 1024 sectors, and according to gdisk the sector size of this Samsung 850 Pro SSD is 512B, therefore each write is 1024 sectors * 512B = 512K.
So my question is: why is it writing 512K blocks instead of the 128K specified with dd? If I change dd to write 4M blocks, the iostat result is exactly the same. The merges number doesn't make sense to me either.
That was writing directly to the block device; but if I format it with XFS and write to the filesystem, the numbers are the same except for zero merges:
dd if=/dev/zero of=/mnt/ddtest bs=4M count=3000
Now iostat shows
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 1024.00 0.00 512.00 1024.00 142.31 138.92 0.00 138.92 0.98 100.00
I'm using RHEL 7.7 by the way.
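A detail worth checking here (a sketch, not from the original question, assuming the target is still /dev/sdb): the block layer merges and caps requests according to the queue limits in sysfs, so the request size iostat reports need not match the dd block size.
grep -H . /sys/block/sdb/queue/max_sectors_kb /sys/block/sdb/queue/max_hw_sectors_kb /sys/block/sdb/queue/nr_requests
# a max_sectors_kb of 512 would line up with the 512K writes seen above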
Elliott B
(575 rep)
Sep 10, 2019, 08:38 AM
• Last activity: May 14, 2025, 01:11 AM
5
votes
4
answers
2882
views
iostat: avoid displaying loop devices information
Given the annoying feature of snap's loop devices, my iostat output on an Ubuntu 18.04.02 box looks like the listing below. Is there a way to filter out loop devices other than | grep -v loop?
$ iostat -xm
Linux 4.15.0-47-generic (pkara-pc01) 04/22/2019 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
19.85 0.03 5.64 2.18 0.00 72.30
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
loop0 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.00 0.00 0.00 2.88 0.00 0.50 0.00
loop1 0.06 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.25 0.00 0.00 1.80 0.00 0.09 0.00
loop2 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.47 0.00 0.00 6.57 0.00 1.06 0.00
loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.40 0.00 0.00 2.50 0.00 0.00 0.00
loop4 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 8.54 0.00 0.00 2.44 0.00 1.54 0.00
loop5 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10.29 0.00 0.00 2.86 0.00 0.76 0.00
loop6 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.55 0.00 0.00 1.89 0.00 0.17 0.00
loop7 0.06 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.39 0.00 0.00 1.75 0.00 0.24 0.00
sda 0.07 0.00 0.00 0.00 0.00 0.00 5.32 9.09 9.29 41.60 0.00 12.47 3.20 9.51 0.07
sdb 23.40 103.81 0.40 38.12 8.47 10.16 26.59 8.91 8.48 6.17 0.84 17.49 376.07 0.94 11.93
dm-0 31.96 113.83 0.40 38.08 0.00 0.00 0.00 0.00 12.17 10.89 1.63 12.74 342.57 0.82 12.00
dm-1 31.91 113.30 0.40 38.08 0.00 0.00 0.00 0.00 12.19 10.95 1.63 12.74 344.17 0.83 12.04
dm-2 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.00 0.00 19.90 0.00 0.41 0.00
dm-3 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9.82 58.91 0.00 10.21 2.91 10.02 0.05
dm-4 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.23 64.80 0.00 14.69 3.20 14.35 0.05
loop8 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10.54 0.00 0.00 9.13 0.00 2.67 0.01
loop9 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11.83 0.00 0.00 2.66 0.00 1.79 0.00
loop10 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 28.96 0.00 0.00 20.96 0.00 3.20 0.00
loop11 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 31.20 0.00 0.00 20.80 0.00 4.80 0.00
loop12 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 18.56 0.00 0.00 9.28 0.00 1.56 0.00
loop13 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11.77 0.00 0.00 9.36 0.00 2.12 0.00
loop14 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13.65 0.00 0.00 9.76 0.00 0.71 0.00
loop15 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 30.40 0.00 0.00 20.96 0.00 4.08 0.00
loop16 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.85 0.00 0.00 5.22 0.00 0.49 0.00
loop17 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.23 0.00 0.00 2.48 0.00 1.00 0.00
loop18 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.66 0.00 0.00 2.50 0.00 0.70 0.00
loop19 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.86 0.00 0.00 6.27 0.00 3.29 0.00
loop20 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 12.48 0.00 0.00 9.15 0.00 1.65 0.00
loop21 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 8.66 0.00 0.00 9.83 0.00 1.29 0.00
loop22 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.10 0.00 0.00 5.09 0.00 0.70 0.00
loop23 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16.32 0.00 0.00 3.05 0.00 1.16 0.00
loop24 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 20.60 0.00 0.00 2.50 0.00 4.20 0.00
loop25 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16.30 0.00 0.00 2.44 0.00 2.37 0.00
loop26 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.60 0.00 0.00 0.00
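One alternative to grep (a sketch; the device names are taken from the output above): iostat only reports the devices named on its command line, so listing the real disks and dm devices explicitly leaves the loop devices out.
iostat -xm sda sdb dm-0 dm-1 dm-2 dm-3 dm-4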
pkaramol
(3109 rep)
Apr 22, 2019, 07:34 AM
• Last activity: Apr 13, 2025, 06:34 PM
1
votes
1
answers
61
views
Linux kernel phantom reads
Why, if I write to a raw hard disk (without a filesystem), does the kernel also issue reads?
$ sudo dd if=/dev/zero of=/dev/sda bs=32k count=1 oflag=direct status=none
$ iostat -xc 1 /dev/sda | grep -E "Device|sda"
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 45,54 0,99 1053,47 31,68 0,00 0,00 0,00 0,00 1,17 3071,00 3,04 23,13 32,00 66,38 308,91
Is it readahead? Instead of dd, I wrote a C program that does the same thing; I even used posix_fadvise to hint to the kernel that I do not want readahead.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#define BLOCKSIZE 512
#define COUNT 32768
int main(void)
{
    // read COUNT bytes from /dev/zero
    int fd;
    mode_t mode = O_RDONLY;
    char *filename = "/dev/zero";
    fd = openat(AT_FDCWD, filename, mode);
    if (fd < 0) {
        perror("Creation error");
        exit(1);
    }
    void *pbuf;
    posix_memalign(&pbuf, BLOCKSIZE, COUNT);   // O_DIRECT needs an aligned buffer
    size_t a = COUNT;
    ssize_t ret;
    ret = read(fd, pbuf, a);
    if (ret < 0) {
        perror("read error");
        exit(1);
    }
    close(fd);
    // write COUNT bytes to /dev/sda
    int f = open("/dev/sda", O_WRONLY | __O_DIRECT);
    ret = posix_fadvise(f, 0, COUNT, POSIX_FADV_NOREUSE);
    if (ret < 0)
        perror("posix_fadvise");
    ret = write(f, pbuf, COUNT);
    if (ret < 0) {
        perror("write error");
        exit(1);
    }
    close(f);
    free(pbuf);
    return 0;
}
But the result is the same
$ iostat -xc 1 /dev/sda | grep -E "Device|sda"
Device r/s w/s rkB/s wkB/s r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 46,00 1,00 1064,00 32,00 10,78 1,00 0,43 23,13 32,00 10,55 49,60
It does not matter whether it is a spinning disk or an SSD; the result is the same. I also tried different kernels.
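Two quick checks that might help narrow this down (a sketch, not part of the original question): inspect and temporarily disable the device's readahead, then trace which task queues the reads.
blockdev --getra /dev/sda                              # readahead size in 512-byte sectors
echo 0 | sudo tee /sys/block/sda/queue/read_ahead_kb   # temporarily disable readahead
sudo btrace /dev/sda | grep ' R '                      # show read requests and the task that issued them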
Alex
(923 rep)
Jun 18, 2024, 10:53 AM
• Last activity: Jun 20, 2024, 03:22 PM
0
votes
1
answers
236
views
Is there iostat-similar tool that tracks swap area activity and page cache miss?
The iostat tool is able to tell us CPU usage and disk r/w throughput second by second.
Is there a similar tool to track swap area activity and page cache misses?
For example, the tool should report the magnitude of swap area activity and the number of page cache misses second by second.
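A sketch of what the same sysstat package and procps already provide, sampled second by second with an interval of 1:
sar -B 1     # paging: pgpgin/s, pgpgout/s, fault/s, majflt/s (major faults are reads that had to hit disk)
sar -W 1     # swapping: pswpin/s, pswpout/s
vmstat 1     # the si/so columns show swap-in/swap-out activity per second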
Dachuan Huang
(21 rep)
Jun 2, 2023, 04:32 AM
• Last activity: Jun 2, 2023, 08:16 AM
0
votes
2
answers
2067
views
In iostat, why are kB_wrtn/s and kB_wrtn the same?
/dev/sdc is a SATA hard drive. Do the kB_read and kB_wrtn fields sometimes, in some situations, show total counts? Here it seems to be just the same as the per second value.
- Linux kernel 5.4.0-26-generic.
- sysstat version 12.2.0
iostat -dz 1
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sdc 40.00 0.00 21.00 0.00 0 21 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 6.00 0.00 24.00 0.00 0 24 0
sdc 42.00 0.00 42.50 0.00 0 42 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
dm-0 5.00 0.00 20.00 0.00 0 20 0
sdc 43.00 0.00 36.00 0.00 0 36 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sdc 48.00 0.00 25.00 0.00 0 25 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sdc 36.00 0.00 18.50 0.00 0 18 0
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sdc 40.00 0.00 21.00 0.00 0 21 0
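One way to see how the two columns relate (a sketch): with a longer interval the per-second rate and the per-interval total diverge, e.g. over 5 seconds kB_wrtn should be roughly five times kB_wrtn/s.
iostat -dz 5 sdc    # kB_wrtn/s is the average rate; kB_wrtn is the amount written during the 5-second interval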
brendan
(195 rep)
Mar 28, 2021, 12:08 PM
• Last activity: Mar 27, 2023, 01:12 PM
4
votes
2
answers
2003
views
Why doesn't my IOstat change its output at all?
My IOstat doesn't change...at all. It'll show a change in blocks being read and written, but it doesn't change at all when it comes to blocks/kB/MB read and written. When the server sits idle...it shows 363kB_read/s, 537kB_wrtn/s.
If I put it under heavy load...it says the same thing. Is it bugged out? How do I fix it?
Using CentOS 6; the machine is being used as a primary MySQL server.
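A detail that may explain this (a sketch, not from the original post): the first report iostat prints is always the average since boot; only the following reports cover the interval just elapsed, so an interval and a count are needed to see current activity.
iostat -dxk 5 3   # 1st report = since boot; the 2nd and 3rd reports each cover the preceding 5 seconds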
Sqenix
(43 rep)
Mar 18, 2015, 06:03 PM
• Last activity: Feb 9, 2023, 11:36 PM
0
votes
0
answers
974
views
iostat returns disk utilization greater than 100% while profiling a Beaglebone Black board
I need to profile the performance of software running on a BeagleBone Black (BBB). The BBB has an ARM Cortex-A8 up to 1GHz frequency, 512MB RAM, and 4GB eMMC onboard flash storage. You can find more information here:
https://beagleboard.org/black
The BBB runs Debian bullseye booted from a 14GB MicroSD:
debian@BeagleBone:~$ uname -a
Linux BeagleBone 5.10.109-ti-r45 #1bullseye SMP PREEMPT Fri May 6 16:59:02 UTC 2022 armv7l GNU/Linux
## Problem
As a first trial, I'm running dd in parallel with iostat to see how the kernel updates the disk statistics:
dd if=/dev/urandom of=~debian/ddtest/200MBfile bs=1M count=200 & iostat -xdz 1 20
I don't understand why iostat returns utilization values greater than 100%. I'm sampling the metrics each second by passing 1 as the interval on the command line. This is an excerpt of what I see in the terminal:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
mmcblk0 0.00 0.00 0.00 0.00 0.00 0.00 18.00 9216.00 0.00 0.00 1062.00 512.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 19.12 92.80
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
mmcblk0 0.00 0.00 0.00 0.00 0.00 0.00 16.00 8192.00 7.00 30.43 2058.25 512.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 32.93 101.20
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
mmcblk0 0.00 0.00 0.00 0.00 0.00 0.00 25.00 12800.00 0.00 0.00 2295.64 512.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 57.39 103.60
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
mmcblk0 0.00 0.00 0.00 0.00 0.00 0.00 33.00 15908.00 0.00 0.00 1136.58 482.06 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 37.51 89.60
mmcblk0 is the name of the 14GB MicroSD card (as reported by lsblk).
I downloaded sysstat, the package that includes iostat, directly from https://packages.debian.org/stable/sysstat and installed version 12.5.2-2:
sudo apt list | grep sysstat
sysstat/stable,now 12.5.2-2 armhf [residual-config]
I checked the source code of sysstat and saw that the utilization is calculated at line 381 of rd_stats.c:
xds->util = S_VALUE(sdp->tot_ticks, sdc->tot_ticks, itc);
S_VALUE is a macro defined at line 154 of common.h:
#define S_VALUE(m,n,p) (((double) ((n) - (m))) / (p) * 100)
Each second, iostat reads the number of milliseconds spent doing I/O from /proc/diskstats. The variable sdc->tot_ticks holds the most recent value read, sdp->tot_ticks the previously sampled value, while itc is the sampling interval we set on the command line (i.e., one second).
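As a worked example of that macro (the tick values below are hypothetical, chosen to match the kind of output shown above): a tot_ticks delta of 1036 ms over a 1000 ms interval already yields more than 100%.
prev=251000; curr=252036; itv=1000                    # hypothetical tot_ticks samples taken 1 s apart
echo "scale=2; ($curr - $prev) * 100 / $itv" | bc     # -> 103.60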
I can't understand why iostat returns values greater than 100%. I noticed that the time spent doing I/O (sdc->tot_ticks - sdp->tot_ticks) is often greater than itc. My guesses are that iostat itself performs disk operations, that dd is preempted by the scheduler between two /proc/diskstats samples, or that other processes are running in parallel.
## Experiments
I did some experiments, but I still don't see the source of the problem.
With iotop, I checked which processes were running concurrently with dd. I found a few journaling processes, i.e., jbd2 and systemd-journald. They do not affect disk utilization, since the time reported in the 10th field of /proc/diskstats records the time queues and disks were busy, taking concurrency into account (https://serverfault.com/questions/862334/interpreting-read-write-and-total-io-time-in-proc-diskstats) .
I made a trivial bash script (shown below) that mimics the behavior of iostat. It retrieves the 10th field from /proc/diskstats and calculates the utilization over a given observation time. I set an observation period of 1 second, the same as my first attempt with iostat, and obtained a utilization of more than 100%. I believe that iostat is not the problem, as confirmed by this issue:
https://github.com/sysstat/sysstat/issues/73#issuecomment-860158402
Using the bash script, I got even higher values than those reported by iostat. I believe this is because reading /proc/diskstats, or the BBB's limited performance, stretches the script's (or iostat's) execution time beyond the nominal interval.
#!/bin/bash
# Usage: <script> <interval-seconds> <iterations>
# Mimics iostat's %util: samples the "ms spent doing I/O" counter (10th stats
# field, i.e. column 13 of /proc/diskstats) for mmcblk0 and divides the delta
# by the interval.
for (( i=0; i<$2; i++ ));
do
    value=$(awk '/ mmcblk0 / {print $13}' /proc/diskstats)
    if [ ! -z "$prev" ]; then
        bc -l <<< "scale=4;(($value - $prev)/($1*1000))*100"
    fi
    sleep $1
    prev=$value
done
I observed that running dd with the oflag=sync option decreases disk utilization, and the journaling processes then run not at the same time as dd but after it. This flag blocks the writing process until the data is actually written to the device. This is the output of perf recording which tasks insert block I/O requests:
sudo perf record -e block:block_rq_insert -a dd if=/dev/urandom of=~debian/ddtest/500MBfile bs=1M count=200
without oflag=sync: (perf screenshot omitted)
with oflag=sync: (perf screenshot omitted)
I hope someone more experienced with this platform can help me understand the problem. Thank you.
vnzstc
(1 rep)
Jul 8, 2022, 06:31 PM
• Last activity: Jul 18, 2022, 12:25 PM
0
votes
0
answers
101
views
Our MySQL is read heavy, but iostat reports that almost no reads are taking place. How come?
According to MySQL's STATUS command, we have about 500 reads and ~20 writes to our DB per second. But iostat is reporting that ~70 writes (w/s) and ~0.5 reads (r/s) are taking place on the corresponding device.
Why isn't iostat showing all the activity that should be caused by the SELECTs? Does this mean that they are hitting some cache, and that's why we are not seeing them? If this is the case, how can I tell?
(The filesystem the DB is on is a BBU RAID 10 with SSD discs.)
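One way to check the cache theory (a sketch; these are standard InnoDB status counters, assuming the tables are InnoDB): compare logical read requests against the reads that actually went to disk.
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"
# Innodb_buffer_pool_read_requests = logical reads served from the buffer pool
# Innodb_buffer_pool_reads         = reads that missed the pool and hit the disk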
edmz
(103 rep)
Apr 13, 2022, 10:59 PM
3
votes
1
answers
3834
views
Disk io stat “averaged” over a period of time
I am using the iostat utility on my Red Hat Linux server to monitor the performance of a disk. When I use "iostat -xd sdh 1", I get the perf result printed every second. When I use "iostat -xd sdh 5", I get the perf result printed every five seconds. My feeling is that the latter command is printing a snapshot of the performance every five seconds, rather than averaging over the past 5 seconds. Am I correct in my understanding?
If so, is there a way I can make iostat print the performance numbers averaged over n seconds, or is there some other utility that will do that?
Currently, the numbers fluctuate within a range, and I want to get a somewhat "stable" number. I am hoping that averaging over a period of time will give me such a number.
Thank you, Ahmed.
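For what it's worth, a sketch of the behaviour in question: every iostat report after the first is already an average over the preceding interval, so one long interval gives a single smoothed number.
iostat -xd sdh 60 2   # the 2nd report shows activity averaged over the last 60 seconds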
Ahmed
(31 rep)
Aug 20, 2018, 08:43 PM
• Last activity: Dec 8, 2021, 09:01 AM
0
votes
0
answers
221
views
How can I monitor whether disk activity is synchronous or asynchronous?
My Google-fu simply cannot find an answer to this.
If I have a process with heavy I/O activity, how can I check whether it's using asynchronous or synchronous writes?
(I want this information to decide whether to add a SLOG to my ZFS pool.)
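A sketch of one way to observe this (the process ID is a placeholder): trace the write-related syscalls of the process and look for fsync/fdatasync calls or files opened with O_SYNC/O_DSYNC, which indicate synchronous writes.
strace -f -p <pid> -e trace=fsync,fdatasync,sync_file_range,openat
# frequent fsync/fdatasync calls, or O_SYNC/O_DSYNC in the openat flags, point to synchronous writes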
sssheridan
(193 rep)
Jul 27, 2021, 09:35 AM
1
votes
0
answers
285
views
degraded iops and throughput on a linux machine in a cluster
We have a Linux-based cluster on AWS with 8 workers. The OS version (taken from /proc/version) is:
Linux version 5.4.0-1029-aws (buildd@lcy01-amd64-021) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #30~18.04.1-Ubuntu SMP Tue Oct 20 11:09:25 UTC 2020
Worker id 5 was added recently. The problem we see is that during times of high disk util% caused by a burst of writes into the workers, the disk mounted at that worker's data dir (/dev/nvme1n1p1) shows degraded performance in terms of w/sec and wMB/sec, which are much lower on that worker than on the other 7 workers (~40% less iops and throughput on that broker).
The data in this table was taken from running iostat -x on all the brokers, starting at the same time and ending after 3 hours during peak time. The cluster handles ~2M messages/sec.
Another strange behavior is that broker id 7 has ~40% more iops and throughput during bursts of writes compared to the other brokers.
The worker type is i3en.3xlarge with one 7.5TB NVMe SSD.
**Any idea as to what can cause such degraded performance on worker id 5 (or such good performance on broker id 7)?**
This issue causes the consumers of this cluster to lag during heavy writes: worker id 5 gets into high iowait, and if some consumer falls behind and starts reading from disk, the iowait on worker id 5 climbs to ~70%, all consumers start to lag, and the producers run out of memory due to buffered messages that the broker doesn't accept.
(table image: iostat -x results collected from all brokers over the 3-hour window)
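One way to take the application out of the picture (a sketch; the file path, size and job parameters below are made up): run the same fio write workload on every worker's data disk and compare the raw numbers.
fio --name=seqwrite --filename=/data/fio.test --rw=write --bs=1M --size=10g \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=120 --time_based --group_reporting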

Elad Eldor
(11 rep)
May 13, 2021, 04:11 PM
• Last activity: May 15, 2021, 05:02 PM
2
votes
0
answers
769
views
Process shows as 100% I/O bound while producing minimal disk activity, disk util is at 100%
We are having quite a strange problem. There is a program (a cryptocurrency node, to be precise) which has a local database of all the transactions ever made. The database is huge - around 15 TB. The problem is that the program won't synchronize with the network, though it has enough peers, and knowledge about new and old blocks is not a problem.
Now the strange part - I have started the same program from scratch, without that 15TB of history, and it started syncing immediately, loading the disk at about 50% per iostat. CPU and memory utilization are negligible. Absolute figures are:
- Read speed: 5MB/s
- Write speed: 20MB/s
- iotop - 20% on average for this process
When I switch to the historical DB (15TB), iostat shows 100% disk utilization and iotop shows multiple forked processes with the majority of them sitting at 99% I/O, but actual I/O is not happening, judging by the volume reported by iotop or iostat: both read and write are within 1MB/s. This is running on an MS Azure VM; through the Azure portal we see that disk utilization is around 1% in "full" mode and writing is around 20% in "fresh" mode, so throttling by the cloud operator is not an issue either.
Now the question - how do I diagnose what exactly the program is doing with the disk? I was thinking about random I/O, so I traced the lseek function with strace and got some calls in both fresh and full modes, with a much lower ratio in full mode, while I expected the opposite. What does it do in full mode then? The program has quite a bearable number of file descriptors (/proc/<pid>/fd), below 50 including peer TCP connections. How can it be, in general, that both iostat and iotop show 100% utilization with no actual consumption of I/O bandwidth? We even had a call with an engineer from Microsoft; he said that iostat may not be accurate, especially with SSDs. Maybe, but when it says util is 100%, iotop confirms it, and the program is not doing what it is supposed to do - what is an alternative explanation?
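A couple of things that can reveal what kind of I/O is actually being issued (a sketch; sdX stands for the data disk): the extended iostat fields show average request size and queue depth, and blktrace logs every request, so seek-heavy or flush-heavy patterns stand out even when throughput is tiny.
iostat -x 1 sdX                                  # small areq-sz with high %util suggests many tiny or random requests
sudo blktrace -d /dev/sdX -o - | blkparse -i -   # per-request log: offsets, sizes, and sync/flush flags in the RWBS field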
DimaA6_ABC
(121 rep)
Jan 23, 2021, 02:14 PM
0
votes
2
answers
698
views
How do I get the linux kernel to track io stats to a block device I create in a loadable module?
I've been looking and looking, and everybody explains the /proc/diskstats file, but nobody seems to explain where that data comes from.
I found this comment:
"Just remember that /proc/diskstats is tracking the kernel’s read requests–not yours."
on this page:
https://kevinclosson.net/2018/10/09/no-proc-diskstats-does-not-track-your-physical-i-o-requests/
But basically my problem is that I've got a kernel module that creates a block device and handles requests via a request handler set with blk_queue_make_request, not blk_init_queue - just like dm, I don't want the kernel to queue requests for me.
Everything works fine, but nothing shows up in /proc/diskstats. What bit of magic am I missing to get my stats in there so the device will show up in iostat? I assumed the kernel would be tallying this information since it's handling the requests to the kernel module, but apparently not - or I'm missing a flag somewhere or something.
Any ideas?
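A quick sanity check (a sketch; "mydev" stands for whatever name the module registers with add_disk): the per-device stat file in sysfs is the same data /proc/diskstats exposes, so it shows whether any accounting is happening at all.
cat /sys/block/mydev/stat     # all-zero counters mean nothing is updating the accounting for this disk
grep mydev /proc/diskstats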
stu
(143 rep)
Sep 29, 2020, 11:35 AM
• Last activity: Nov 10, 2020, 04:59 PM
4
votes
1
answers
3316
views
Understanding iostat with Linux software RAID
I'm trying to understand what I see in iostat, specifically the differences between the output for md and sd devices.
I have a couple of quite large CentOS Linux servers, each with an E3-1230 CPU, 16 GB RAM and 4 2TB SATA disk drives. Most are JBOD, but one is configured with software RAID 1+0. The servers have a very similar type and amount of load, but the %util figures I get with iostat on the software RAID one are much higher than on the others, and I'm trying to understand why. All servers are usually 80-90% idle with regard to CPU.
*Example of iostat on a server without RAID:*
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
           9.26   0.19    1.15    2.55    0.00  86.84
Device: rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm %util
sdb       2.48   9.45 10.45 13.08 1977.55 1494.06   147.50     2.37 100.61  3.86  9.08
sdc       4.38  24.11 13.25 20.69 1526.18 1289.87    82.97     1.40  41.14  3.94 13.36
sdd       0.06   1.28  1.43  2.50  324.67  587.49   232.32     0.45 113.73  2.77  1.09
sda       0.28   1.06  1.33  0.97  100.89   61.63    70.45     0.06  27.14  2.46  0.57
dm-0      0.00   0.00  0.17  0.24    4.49    1.96    15.96     0.01  18.09  3.38  0.14
dm-1      0.00   0.00  0.09  0.12    0.74    0.99     8.00     0.00   4.65  0.36  0.01
dm-2      0.00   0.00  1.49  3.34  324.67  587.49   188.75     0.45  93.64  2.25  1.09
dm-3      0.00   0.00 17.73 42.82 1526.17 1289.87    46.50     0.35   5.72  2.21 13.36
dm-4      0.00   0.00  0.11  0.03    0.88    0.79    12.17     0.00  19.48  0.87  0.01
dm-5      0.00   0.00  0.00  0.00    0.00    0.00     8.00     0.00   1.17  1.17  0.00
dm-6      0.00   0.00 12.87 20.44 1976.66 1493.27   104.17     2.77  83.01  2.73  9.08
dm-7      0.00   0.00  1.36  1.58   95.65   58.68    52.52     0.09  29.20  1.55  0.46
*Example of iostat on a server with RAID 1+0:*
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
           7.55   0.25    1.01    3.35    0.00  87.84
Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await svctm %util
sdb      42.21  31.78 18.47  59.18 8202.18 2040.94   131.91     2.07  26.65  4.02 31.20
sdc      44.93  27.92 18.96  55.88 8570.70 1978.15   140.94     2.21  29.48  4.60 34.45
sdd      45.75  28.69 14.52  55.10 8093.17 1978.16   144.66     0.21   2.95  3.94 27.42
sda      45.05  32.59 18.22  58.37 8471.04 2040.93   137.24     1.57  20.56  5.04 38.59
md1       0.00   0.00 18.17 162.73 3898.45 4013.90    43.74     0.00   0.00  0.00  0.00
md0       0.00   0.00  0.00   0.00    0.00    0.00     4.89     0.00   0.00  0.00  0.00
dm-0      0.00   0.00  0.07   0.26    3.30    2.13    16.85     0.04 135.54 73.73  2.38
dm-1      0.00   0.00  0.25   0.22    2.04    1.79     8.00     0.24 500.99 11.64  0.56
dm-2      0.00   0.00 15.55 150.63 2136.73 1712.31    23.16     1.77  10.66  2.93 48.76
dm-3      0.00   0.00  2.31   2.37 1756.39 2297.67   867.42     2.30 492.30 13.08  6.11
So my questions are:
1) Why is there such a relatively high %util on the server with RAID vs the one without?
2) On the non-RAID server the %util of the combined physical devices (sd*) is more or less the same as that of the combined LVM devices (dm-*). Why is that not the case for the RAID server?
3) Why does it seem like the software RAID devices (md*) are virtually idle, while the underlying physical devices (sd*) are busy? My first thought was that it might be caused by RAID checking, but /proc/mdstat shows all good.
Edit: Apologies, I thought the question was clear, but it seems there is some confusion about it. Obviously the question is not about the difference in %util between drives on one server, but about why the total/average %util on one server is so different from the other. I hope that clarifies any misunderstanding.
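For comparing the two servers, a small sketch that sums the %util column (the last field in this iostat format) per device class from the since-boot report:
iostat -dx | awk '/^sd/ {sd += $NF} /^dm/ {dm += $NF} /^md/ {md += $NF} END {printf "sd: %.1f  dm: %.1f  md: %.1f\n", sd, dm, md}'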
Dokbua
(209 rep)
Jul 10, 2014, 02:00 PM
• Last activity: May 22, 2020, 09:02 AM
6
votes
1
answers
9275
views
Why is the size of my IO requests being limited, to about 512K?
I read /dev/sda using a 1MiB block size. Linux seems to limit the IO requests to an average size of 512KiB. What is happening here? Is there a configuration option for this behaviour?
$ sudo dd iflag=direct if=/dev/sda bs=1M of=/dev/null status=progress
1545601024 bytes (1.5 GB, 1.4 GiB) copied, 10 s, 155 MB/s
1521+0 records in
1520+0 records out
...
While my dd command is running, rareq-sz is 512.
> rareq-sz
> The average size (in kilobytes) of the read requests that were issued to the device.
>
> -- [man iostat](http://man7.org/linux/man-pages/man1/iostat.1.html)
$ iostat -d -x 3
...
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 309.00 0.00 158149.33 0.00 0.00 0.00 0.00 0.00 5.24 0.00 1.42 511.81 0.00 1.11 34.27
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
...
The kernel version is 5.1.15-300.fc30.x86_64. max_sectors_kb is 1280.
$ cd /sys/class/block/sda/queue
$ grep -H . max_sectors_kb max_hw_sectors_kb max_segments max_segment_size optimal_io_size logical_block_size chunk_sectors
max_sectors_kb:1280
max_hw_sectors_kb:32767
max_segments:168
max_segment_size:65536
optimal_io_size:0
logical_block_size:512
chunk_sectors:0
By default I use the BFQ I/O scheduler. I also tried repeating the test after echo 0 | sudo tee wbt_lat_usec. I also then tried repeating the test after echo mq-deadline|sudo tee scheduler. The results remained the same.
Apart from WBT, I used the default settings for both I/O schedulers. E.g. for mq-deadline, iosched/read_expire is 500, which is equivalent to half a second.
During the last test (mq-deadline, WBT disabled), I ran btrace /dev/sda. It shows all the requests were split into two unequal halves:
8,0 0 3090 5.516361551 15201 Q R 6496256 + 2048 [dd]
8,0 0 3091 5.516370559 15201 X R 6496256 / 6497600 [dd]
8,0 0 3092 5.516374414 15201 G R 6496256 + 1344 [dd]
8,0 0 3093 5.516376502 15201 I R 6496256 + 1344 [dd]
8,0 0 3094 5.516388293 15201 G R 6497600 + 704 [dd]
8,0 0 3095 5.516388891 15201 I R 6497600 + 704 [dd]
8,0 0 3096 5.516400193 733 D R 6496256 + 1344 [kworker/0:1H]
8,0 0 3097 5.516427886 733 D R 6497600 + 704 [kworker/0:1H]
8,0 0 3098 5.521033332 0 C R 6496256 + 1344
8,0 0 3099 5.523001591 0 C R 6497600 + 704
> X -- split On [software] raid or device mapper setups, an incoming i/o may straddle a device or internal zone and needs to be chopped up into smaller pieces for service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm is notably bad at this and will clone lots of i/o.
>
> -- [man blkparse](http://man7.org/linux/man-pages/man1/blkparse.1.html)
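A small sanity check on those numbers (a sketch): the two halves add back up to the original 1 MiB request, with the split point at 1344 sectors.
echo $(( (1344 + 704) * 512 ))   # 1048576 bytes = the original 1 MiB read
echo $(( 1344 * 512 / 1024 ))    # 672 KiB = the size of the first half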
## Things to ignore in iostat
Ignore the %util number. It is broken in this version. (https://unix.stackexchange.com/questions/517132/dd-is-running-at-full-speed-but-i-only-see-20-disk-utilization-why/517219#517219)
I *thought* aqu-sz is also affected [due to being based on %util](https://utcc.utoronto.ca/~cks/space/blog/linux/DiskIOStats) . Although I thought that meant it would be about three times too large here (100/34.27).
Ignore the svctm number. "Warning! Do not trust this field any more. This field will be removed in a future sysstat version."
sourcejedi
(53222 rep)
Jul 11, 2019, 10:51 AM
• Last activity: Dec 18, 2019, 06:47 AM
2
votes
1
answers
1704
views
Does IOSTATS show output since boot or since last execution?
I see conflicting information online about the use of iostat. In particular, I would like to be able to show an average since boot. Based on what I have read, if I have never issued the iostat command, it will show the average since boot; but if at some point I have issued an iostat command, the next execution will not be since boot, but rather since the last execution.
How do I run iostat so it reports the average since boot, assuming I have already run it once before?
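For reference, a sketch of the behaviour: iostat reads cumulative counters from /proc/diskstats each time it starts, so its first report is always the average since boot, regardless of earlier runs.
iostat -d        # single report, averaged over the time since boot
iostat -d 5 3    # first report since boot, then two reports covering 5 seconds each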
barrypicker
(157 rep)
Dec 9, 2019, 05:42 PM
• Last activity: Dec 9, 2019, 06:02 PM
0
votes
1
answers
477
views
How to find which disk is being written to/read from in an LSI HW RAID logical volume?
On this system there is a lot of "await", which is causing slow response. I need to find out which disk behind the LSI logical volume is slowing it down.
It's an IBM blade with 2 HDDs and LSI RAID (a simple mirrored LV) on top. On top of that is RHEL LVM.
..devices...
# lsscsi
[0:0:0:0] disk IBM-ESXS ST9146852SS B62C -
[0:0:1:0] disk IBM-ESXS ST9146852SS B62C -
[0:1:3:0] disk LSILOGIC Logical Volume 3000 /dev/sda
...RHEL LVM...
# pvdisplay -v
Scanning for physical volume names
--- Physical volume ---
PV Name /dev/sda2
VG Name VolGroup00
PV Size 135.48 GB / not usable 13.20 MB
Allocatable yes
PE Size (KByte) 32768
...disk latency...
So for this iostat output, how do I know which device in the LSI LV is causing the delays?
avg-cpu: %user %nice %system %iowait %steal %idle
0.19 0.00 0.16 8.41 0.00 91.25
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 4.00 0.00 12.50 0.00 272.00 21.76 7.01 1317.20 80.00 100.00
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 4.00 0.00 12.50 0.00 272.00 21.76 7.01 1317.20 80.00 100.00
dm-0 0.00 0.00 0.00 0.50 0.00 4.00 8.00 10.48 62549.00 1358.00 67.90
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.45 0.00 0.00 44.50
dm-2 0.00 0.00 0.00 5.50 0.00 44.00 8.00 2.85 898.00 167.64 92.20
dm-3 0.00 0.00 0.00 1.50 0.00 12.00 8.00 0.86 573.67 336.00 50.40
dm-4 0.00 0.00 0.00 0.50 0.00 4.00 8.00 2.41 5162.00 1610.00 80.50
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
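Since only the logical volume is exposed as a block device, the member disks have no entries in /proc/diskstats; one possible avenue (a sketch, assuming sg3_utils and smartmontools are installed) is to reach the individual SAS drives through their SCSI generic nodes and compare their SMART/error counters.
lsscsi -g              # shows the /dev/sg* node for each of the [0:0:0:0] and [0:0:1:0] disks
smartctl -a /dev/sg0   # health, grown defect list and error counters of one member
smartctl -a /dev/sg1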
Rajeev
(256 rep)
Jul 18, 2019, 08:08 PM
• Last activity: Jul 18, 2019, 08:54 PM
2
votes
2
answers
3729
views
NVMe disk shows 80% io utilization, partitions show 0% io utilization
I have a CentOS 7 server (kernel 3.10.0-957.12.1.el7.x86_64) with 2 NVMe disks and the following setup:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 477G 0 disk
├─nvme0n1p1 259:2 0 511M 0 part /boot/efi
├─nvme0n1p2 259:4 0 19.5G 0 part
│ └─md2 9:2 0 19.5G 0 raid1 /
├─nvme0n1p3 259:7 0 511M 0 part [SWAP]
└─nvme0n1p4 259:9 0 456.4G 0 part
└─data-data 253:0 0 912.8G 0 lvm /data
nvme1n1 259:1 0 477G 0 disk
├─nvme1n1p1 259:3 0 511M 0 part
├─nvme1n1p2 259:5 0 19.5G 0 part
│ └─md2 9:2 0 19.5G 0 raid1 /
├─nvme1n1p3 259:6 0 511M 0 part [SWAP]
└─nvme1n1p4 259:8 0 456.4G 0 part
└─data-data 253:0 0 912.8G 0 lvm /data
Our monitoring and iostat continually show nvme0n1 and nvme1n1 at 80%+ IO utilization, while the individual partitions show 0% IO utilization and are fully available (250k iops, 1GB read/write per sec).
avg-cpu: %user %nice %system %iowait %steal %idle
7.14 0.00 3.51 0.00 0.00 89.36
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme1n1 0.00 0.00 0.00 50.50 0.00 222.00 8.79 0.73 0.02 0.00 0.02 14.48 73.10
nvme1n1p1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1p2 0.00 0.00 0.00 49.50 0.00 218.00 8.81 0.00 0.02 0.00 0.02 0.01 0.05
nvme1n1p3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme1n1p4 0.00 0.00 0.00 1.00 0.00 4.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1 0.00 0.00 0.00 49.50 0.00 218.00 8.81 0.73 0.02 0.00 0.02 14.77 73.10
nvme0n1p1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1p2 0.00 0.00 0.00 49.50 0.00 218.00 8.81 0.00 0.02 0.00 0.02 0.01 0.05
nvme0n1p3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
nvme0n1p4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 48.50 0.00 214.00 8.82 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 1.00 0.00 4.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
Any ideas what the root cause of such behavior could be? Everything seems to be working fine, except that monitoring keeps triggering high IO alerts.
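One way to see where the figure comes from (a sketch): %util is derived from the io_ticks counter (the 10th stats field, column 13 of /proc/diskstats), which can be compared directly between the whole disk and its partitions by sampling it twice.
awk '$3 ~ /^nvme0n1/ {print $3, "io_ticks(ms):", $13}' /proc/diskstats
sleep 5
awk '$3 ~ /^nvme0n1/ {print $3, "io_ticks(ms):", $13}' /proc/diskstats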
mike
(105 rep)
May 7, 2019, 10:11 PM
• Last activity: May 30, 2019, 10:38 AM
0
votes
1
answers
68
views
How to get Disk Stats based on different parameters
Is there any way to get disk stats grouped by certain parameters, like:
writes by size/latency?
reads by size/latency?
something like:
total writes - 100
writes by size:
- < 4096 - 20
- 4096 - 16384 - 30
...
where 4096/16384 are the chunk sizes.
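If BPF tooling is available for the kernel in use, a sketch of two bcc-tools programs that produce this kind of histogram (tool names carry a -bpfcc suffix on Debian/Ubuntu, plain biolatency/bitesize elsewhere):
biolatency-bpfcc -D 10 1   # block I/O latency histogram per disk, over a 10-second window
bitesize-bpfcc             # per-process histogram of I/O sizes; Ctrl-C to print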
Manish Gupta
(101 rep)
Mar 4, 2019, 11:31 AM
• Last activity: May 21, 2019, 08:48 AM
3
votes
1
answers
3718
views
IO wait time is higher than disk utilization. Isn't this impossible?
I am trying to improve my understanding, following this (so far) unanswered question: https://unix.stackexchange.com/questions/516433/possible-limiting-factor-during-upgrade-of-fedora-vm-not-disk-or-cpu-or-networ
I ran the following test load, which took 200 seconds to complete.
sudo perf trace -s time perf stat dnf -y --releasever=30 --installroot=$HOME/fedora-30 --disablerepo='*' --enablerepo=fedora --enablerepo=updates install systemd passwd dnf fedora-release vim-minimal
I am running this on a fairly default, straightforward install of Fedora Workstation 29. It is not a VM. The kernel version is 5.0.9-200.fc29.x86_64. The IO scheduler is mq-deadline.
I use LVM and the ext4 filesystem. I am not using any encryption on my disk or filesystem. I do not have any network filesystem mounted at all, so I am not reading or writing a network filesystem.
I have 4 "CPUs": 2 cores with 2 threads each.
I have only one disk, /dev/sda, which is a SATA HDD. The HDD supports NCQ: cat /sys/class/block/sda/device/queue_depth shows 32.
vmstat 5 showed that non-idle CPU time *sometimes* rose to about one CPU, i.e. idle was as low as 75%.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
...
1 1 3720 1600980 392948 3669196 0 0 14 30634 8769 1876 4 3 78 15 0
1 1 3720 1600144 393128 3669884 0 0 0 3460 1406 1468 0 1 80 18 0
0 1 3720 1599424 393356 3670532 0 0 0 6830 1416 1480 0 1 73 25 0
0 1 3720 1598448 393700 3671108 0 0 0 7766 1420 1362 0 1 78 20 0
0 1 3720 1597296 393940 3671616 0 0 0 5401 1374 1367 1 2 87 11 0
0 1 3720 1596332 394312 3672416 0 0 0 1162 1647 1558 1 2 85 13 0
3 0 3720 1652064 394564 3673540 0 0 0 17897 15406 1784 1 3 74 23 0
0 0 3720 1972876 394600 3541064 0 0 0 522 8028 18327 3 3 84 10 0
1 0 3720 1974152 394600 3541108 0 0 0 9 422 879 2 1 97 0 0
0 0 3720 1974136 394600 3541120 0 0 0 0 204 455 0 0 99 0 0
(end of test)
And the "IO wait" time (the wa field under cpu in vmstat) rose as high as 25%. I think this means 100% of one CPU. But atopsar -d 5 showed the utilization of my disk did not directly match this. It was much less than 100%:
22:46:44 disk busy read/s KB/read writ/s KB/writ avque avserv _dsk_
...
22:49:34 sda 5% 0.4 4.0 69.5 413.0 36.9 0.68 ms
22:49:39 sda 8% 0.2 60.0 120.6 30.6 18.7 0.66 ms
22:49:44 sda 8% 0.0 0.0 136.2 16.7 20.4 0.61 ms
22:49:49 sda 10% 0.0 0.0 157.1 44.2 21.4 0.65 ms
22:49:54 sda 9% 0.0 0.0 196.4 39.3 48.0 0.47 ms
22:49:59 sda 9% 0.0 0.0 148.9 36.6 32.6 0.62 ms
22:50:04 sda 10% 0.0 0.0 137.3 130.6 37.2 0.70 ms
22:50:09 sda 11% 0.0 0.0 199.6 5.4 13.5 0.55 ms
22:50:14 sda 2% 0.0 0.0 50.2 4.5 11.8 0.32 ms
22:50:19 sda 0% 0.0 0.0 0.8 11.0 13.3 0.75 ms
(end of test)
How can "IO wait" time be higher than disk utilization?
> Following is the definition taken from the sar manpage:
>
> %iowait:
>
> Percentage of time that the CPU or CPUs were idle during which the
> system had an outstanding disk I/O request.
>
> Therefore, %iowait means that from the CPU point of view, no tasks
> were runnable, but at least one I/O was in progress. iowait is simply
> a form of idle time when nothing could be scheduled. The value may or
> may not be useful in indicating a performance problem, but it does
> tell the user that the system is idle and could have taken more work.
>
> https://support.hpe.com/hpsc/doc/public/display?docId=c02783994
"IO wait" is tricksy to define on multi-CPU systems. See https://unix.stackexchange.com/questions/410628/how-does-a-cpu-know-there-is-io-pending . But even if you think I was wrong to multiply the above "IO wait" figure by 4, it would still be higher than the disk utilization figure!
I expect the disk utilization figure in atopsar -d (and equally in atop / sar -d / iostat -x / mxiostat.py) is calculated from one of the [kernel iostat fields](https://github.com/torvalds/linux/blob/v5.0/Documentation/iostats.txt) . The linked doc mentions "Field 10 -- # of milliseconds spent doing I/Os". There is a more detailed definition as well, although I am not sure that the functions it mentions still exist in the current multi-queue block layer.
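For completeness, a sketch that computes the same figure directly from that field (column 13 of the full /proc/diskstats line, after major/minor/name):
t0=$(awk '$3=="sda" {print $13}' /proc/diskstats); sleep 5
t1=$(awk '$3=="sda" {print $13}' /proc/diskstats)
echo "disk busy: $(( (t1 - t0) * 100 / 5000 ))%"   # ms spent doing I/O out of 5000 ms elapsed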
---
Thanks to the perf in the test command, I can also report that dnf's fdatasync() calls were responsible for 81 out of 200 seconds of elapsed time. This evidence suggests to me that the "IO wait" figure is giving a more accurate impression than the disk utilization figure.
199.440461084 seconds time elapsed
60.043226000 seconds user
11.858057000 seconds sys
60.07user 12.17system 3:19.84elapsed 36%CPU (0avgtext+0avgdata 437844maxresident)k
496inputs+2562448outputs (25major+390722minor)pagefaults 0swaps
Summary of events:
...
dnf (6177), 2438150 events, 76.0%
syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------
fdatasync 3157 81436.062 0.160 25.795 465.329 1.92%
wait4 51 43911.444 0.017 861.009 32770.388 74.93%
poll 45188 6052.070 0.001 0.134 96.764 5.58%
read 341567 2718.531 0.001 0.008 1372.100 50.63%
write 120402 1616.062 0.002 0.013 155.142 9.61%
getpid 50012 755.423 0.001 0.015 207.506 32.86%
...
sourcejedi
(53222 rep)
May 2, 2019, 10:25 PM
• Last activity: May 3, 2019, 03:14 PM