
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

1 vote
1 answer
183 views
Understanding `pgpgin` in `/proc/vmstat` as I/O Counters: Relationship with I/O Bandwidth Measurements
Hi Kernel I/O Experts,

I have a question regarding the `pgpgin` and `pgpgout` counters in `/proc/vmstat`, specifically focusing on `pgpgin`. I’ve been exploring performance monitoring tools like `vmstat` and `iotop`, which are very practical command-line tools for observing I/O performance. Upon examining their code, I noticed that these tools report "Current I/O" using the `pgpgin` and `pgpgout` counters instead of directly reading current I/O statistics from block devices, as tools like `fio` do.

First Question: I am trying to understand **the exact relationship between the `pgpgin` and `pgpgout` counters and actual I/O bandwidth. Why do these tools rely on paging-related counters to represent I/O activity?** How exactly are `pgpgin` and `pgpgout` updated, and which system components are responsible for these updates? In short, could you explain when and why these counters reflect disk I/O operations?

Second Question (Edge Case): **Are there specific scenarios or edge cases where `pgpgin` bandwidth does not accurately correspond to actual disk bandwidth?** While benchmarking SSD read performance using `fio` with io_uring polling, I observed that the I/O bandwidth reported by `fio` (and the SSD's own stats) is significantly lower than the bandwidth indicated by `pgpgin`. This discrepancy led me to investigate how `pgpgin` reflects I/O activity (hence the first question above). I have confirmed that this mismatch is consistent and not due to transient system noise.

Any insights into these counters, their update mechanisms, and their relationship with real I/O performance would be greatly appreciated. Thank you!
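For reference, a minimal sketch of how a monitoring tool can turn these counters into a bandwidth figure, assuming (as commonly documented) that pgpgin/pgpgout are cumulative counts in KiB:

# Sample /proc/vmstat twice and report the paged-in rate over the interval.
# Assumes pgpgin is a cumulative counter in KiB, as commonly documented.
interval=1
before=$(awk '/^pgpgin /{print $2}' /proc/vmstat)
sleep "$interval"
after=$(awk '/^pgpgin /{print $2}' /proc/vmstat)
echo "paged in: $(( (after - before) / interval )) KiB/s"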
JGL (161 rep)
Sep 12, 2024, 09:56 AM • Last activity: Sep 12, 2024, 11:45 AM
0 votes
1 answer
877 views
io_uring with `fio` fails on Rocky 9.3 w/kernel 5.14.0-362.18.1.el9_3.x86_64
I've tried various variations of the command:
fio --name=test --ioengine=io_uring --iodepth=64 --rw=rw --bs=4k --direct=1 --size=2G --numjobs=24 --filename=/dev/sdc
- lower queue depth
- direct set to 1/0
- lower numjobs
- `setenforce 0` just in case SELinux was a problem

but all yield:
test: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
...
fio-3.35
Starting 24 processes
fio: pid=71823, err=1/file:engines/io_uring.c:1047, func=io_queue_init, error=Operation not permitted
I have confirmed my host supports io_uring:
[root@r7525-raid tmp]# grep io_uring_setup /proc/kallsyms
ffffffffaa7d4300 t __pfx_io_uring_setup
ffffffffaa7d4310 t io_uring_setup
ffffffffaa7d43a0 T __pfx___ia32_sys_io_uring_setup
ffffffffaa7d43b0 T __ia32_sys_io_uring_setup
ffffffffaa7d4430 T __pfx___x64_sys_io_uring_setup
ffffffffaa7d4440 T __x64_sys_io_uring_setup
ffffffffaae1b3ef t io_uring_setup.cold
ffffffffac2b0180 d event_exit__io_uring_setup
ffffffffac2b0220 d event_enter__io_uring_setup
ffffffffac2b02c0 d __syscall_meta__io_uring_setup
ffffffffac2b0300 d args__io_uring_setup
ffffffffac2b0310 d types__io_uring_setup
ffffffffacabbc68 d __event_exit__io_uring_setup
ffffffffacabbc70 d __event_enter__io_uring_setup
ffffffffacabdd38 d __p_syscall_meta__io_uring_setup
ffffffffacac1cd0 d _eil_addr___ia32_sys_io_uring_setup
ffffffffacac1ce0 d _eil_addr___x64_sys_io_uring_setup
Running with libaio against the same target works without issue. Is there a trick to getting io_uring up and running with fio? I haven't yet read through the code for io_queue_init to see exactly what is failing.
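One way to narrow down where the EPERM comes from is sketched below; the io_uring_disabled sysctl only exists on kernels that ship it, and the probe job name and file are made up for illustration:

# Is io_uring administratively disabled? (sysctl present only on kernels that support it;
# 0 generally means enabled, higher values restrict or disable it)
sysctl kernel.io_uring_disabled 2>/dev/null || echo "sysctl not present on this kernel"
# Some kernels charge io_uring rings against the locked-memory limit.
ulimit -l
# Trace the failing syscall to see the errno the kernel actually returns.
strace -f -e trace=io_uring_setup \
    fio --name=probe --ioengine=io_uring --rw=read --size=4k --filename=/tmp/probe.bin \
    2>&1 | grep io_uring_setup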
Grant Curell (769 rep)
Apr 18, 2024, 02:24 PM • Last activity: May 13, 2024, 12:18 PM
1 vote
1 answer
84 views
Diagnose bash autocomplete issues
I have one binary on my system, `fio` (installed via the package manager), which doesn't autocomplete files with Tab once you've typed `fio `. I guess this means something must be overriding the autocomplete behavior, but I don't even know how to begin diagnosing it.
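A few starting points for diagnosing this, sketched below; the completion-file path is an assumption and varies by distribution:

# Show which completion spec (if any) bash has bound to fio.
complete -p fio
# Remove the binding for this shell only and retest; default filename completion should return.
complete -r fio
# A packaged completion file, if one exists, would typically live here (path is an assumption):
ls /usr/share/bash-completion/completions/ 2>/dev/null | grep -i fio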
BeeOnRope (559 rep)
May 10, 2024, 03:22 PM • Last activity: May 10, 2024, 04:45 PM
11 votes
2 answers
1737 views
Differences Between `/dev/null` and Devices Under `null_blk` Driver
I recently encountered the [Linux Null Block device driver](https://docs.kernel.org/block/null_blk.html), `null_blk`, while benchmarking the I/O stack itself rather than a specific block device. I found the devices created by this driver (let's use the device name `/dev/nullb0` as an example) quite intriguing, especially considering their similarity in name to the `/dev/null` device. Since I couldn't find any existing questions on this topic on Stack Overflow, I decided to reach out for clarification.

My main question is: **what are the differences between `/dev/null` and a block device created by the `null_blk` device driver?**

---

To this point, I've already noticed some distinctions:

- First, as far as I understand, the null device `/dev/null` doesn't go through any block driver, whereas devices created under `null_blk` are true block devices whose data must pass through the block layer. I also confirmed this by running `fio` on both devices; `/dev/null` performs much better in terms of random read IOPS and submission latency.
- Second, we know that reading from `/dev/null` results in an EOF (for example, `cat /dev/null`), but when I attempt `cat /dev/nullb0`, it doesn't return an EOF and instead hangs.
- Additionally, as a side note, the kernel documentation for `null_blk` describes configuration parameters, but I don't see any similar options for configuring `/dev/null`.

It seems a large number of differences exist behind the similar names. Can someone provide further, more formal insight or clarification on these differences? Thanks!
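For anyone comparing the two, a small sketch of creating and poking at a null_blk device (module parameters follow the null_blk kernel documentation; defaults vary by kernel version):

# Create one 4 GiB null_blk device with immediate completions.
sudo modprobe null_blk nr_devices=1 gb=4 bs=512 completion_nsec=0
lsblk /dev/nullb0
sudo blockdev --getsize64 /dev/nullb0   # a real block device with a size and a request queue
# Reads complete normally (the returned data is not meaningful) instead of hitting EOF as /dev/null does.
sudo head -c 16 /dev/nullb0 | xxd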
JGL (161 rep)
Apr 10, 2024, 07:15 AM • Last activity: Apr 14, 2024, 02:01 PM
0 votes
1 answer
675 views
FIO reports slower sequential read than the advertised NVMe SSD read bandwidth
# TL;DR
For a very simple sequential read, the bandwidth FIO reports is much lower than the NVMe SSD's sequential read capability.

---

# Main Text
Hello everyone, I have been facing an issue while trying to achieve the maximum read bandwidth reported by the vendor for my Samsung 980 Pro 1 TB NVMe SSD. According to the Samsung product description, the SSD is capable of reaching read bandwidths of around 7 GB/s. However, despite my efforts, I have been unable to achieve this maximum read bandwidth.

Current setup:
- SSD: Samsung 980 Pro 1 TB NVMe SSD
- Connection: PCIe 4.0 port
- Operating system: Ubuntu Linux

Current FIO script and results: to test the read performance of the SSD, I have been using the FIO benchmarking tool with the following script:
$ sudo fio --loops=5 --size=1024m --filename=/dev/nvme0n2 --stonewall --ioengine=libaio --direct=1 --zero_buffers=1 --name=Seqread --bs=1024m --iodepth=1 --numjobs=1 --rw=read
Here are the results obtained from running the FIO script:
Seqread: (g=0): rw=read, bs=(R) 1024MiB-1024MiB, (W) 1024MiB-1024MiB, (T) 1024MiB-1024MiB, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1)
Seqread: (groupid=0, jobs=1): err= 0: pid=1504682: Mon Oct 16 09:28:48 2023
  read: IOPS=3, BW=3368MiB/s (3532MB/s)(5120MiB/1520msec)
    slat (msec): min=151, max=314, avg=184.19, stdev=72.71
    clat (msec): min=2, max=149, avg=119.59, stdev=65.39
     lat (msec): min=300, max=316, avg=303.77, stdev= 7.33
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[  148], 40.00th=[  148], 50.00th=[  148], 60.00th=[  148],
     | 70.00th=[  150], 80.00th=[  150], 90.00th=[  150], 95.00th=[  150],
     | 99.00th=[  150], 99.50th=[  150], 99.90th=[  150], 99.95th=[  150],
     | 99.99th=[  150]
   bw (  MiB/s): min= 2048, max= 4096, per=81.07%, avg=2730.67, stdev=1182.41, samples=3
   iops        : min=    2, max=    4, avg= 2.67, stdev= 1.15, samples=3
  lat (msec)   : 4=20.00%, 250=80.00%
  cpu          : usr=0.00%, sys=31.47%, ctx=405, majf=0, minf=262156
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=5,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=3368MiB/s (3532MB/s), 3368MiB/s-3368MiB/s (3532MB/s-3532MB/s), io=5120MiB (5369MB), run=1520-1520msec

Disk stats (read/write):
  nvme0n2: ios=9391/0, merge=0/0, ticks=757218/0, in_queue=757218, util=93.39%
I would greatly appreciate any guidance or suggestions on **how to optimize my FIO script to achieve the expected read bandwidth of around 7 GB/s**. If there are any improvements or modifications that can be made to the script, please let me know. Thank you in advance for your assistance! Please feel free to provide any additional information or insights that may be relevant to the issue at hand.

---

Note: the link should be PCIe 4.0 x4:
$ lspci -vv -s 5e:00.0
5e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
	Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- 
	Kernel driver in use: nvme
	Kernel modules: nvme

$ cat /sys/class/pci_bus/0000\:5e/device/0000\:5e\:00.0/max_link_width 
4
$ cat /sys/class/pci_bus/0000\:5e/device/0000\:5e\:00.0/max_link_speed 
16.0 GT/s PCIe
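For comparison, a sketch of a higher-parallelism sequential read; the parameter choices are illustrative and not a guaranteed way to reach the advertised 7 GB/s, the point being that a single 1024 MiB request at iodepth=1 leaves the drive with almost no outstanding work between submissions:

sudo fio --name=seqread --filename=/dev/nvme0n2 --rw=read --direct=1 \
    --ioengine=libaio --bs=128k --iodepth=32 --numjobs=4 --size=8G \
    --runtime=60 --time_based --group_reporting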
JGL (161 rep)
Oct 16, 2023, 07:39 AM • Last activity: Mar 4, 2024, 11:20 AM
0 votes
0 answers
809 views
understanding fio test results - 90th percentiles benchmark - Latency --> avg/stdev?
What is the relationship, in storage benchmarking with fio, between the average (avg) and the standard deviation (stdev) of latency? Is the avg/stdev of latency important for sequential or for random tests? Most I/O subsystems are so well tuned that the standard deviation has little significance for sequential accesses; it is rather for the random test that the standard deviation is of interest. Is that right?

I was advised that where the avg/stdev ratio for latency is around 5% the system is optimal, where it is around 10% it is still problem-free, but at higher rates there is a problem in the system.

fio --rw=randrw --name=test_zufaellig_oci --bs=4k --direct=1 --size=10G --ioengine=libaio --runtime=5400 --time_based

OCI

test_zufaellig_oci: (groupid=0, jobs=1): err= 0: pid=16244: Thu Dec 29 17:16:45 2022
  read: IOPS=1313, BW=5255KiB/s (5381kB/s)(27.1GiB/5400001msec)
    slat (usec): min=7, max=1665, avg=16.39, stdev= 6.30
    clat (usec): min=3, max=202186, avg=354.88, stdev=167.22
     lat (usec): min=290, max=202206, **avg=371.52, stdev=167.53**
    clat percentiles (usec):
     |  1.00th=[ 310],  5.00th=[ 318], 10.00th=[ 322], 20.00th=[ 330],
     | 30.00th=[ 334], 40.00th=[ 338], 50.00th=[ 343], 60.00th=[ 351],
     | 70.00th=[ 355], 80.00th=[ 363], 90.00th=[ 383], 95.00th=[ 408],
     | 99.00th=[ 498], 99.50th=[ 611], 99.90th=[ 1762], 99.95th=[ 2376],
     | 99.99th=[ 6128]
   bw (  KiB/s): min= 816, max= 6096, per=100.00%, avg=5263.74, stdev=325.49, samples=10781
   iops        : min= 204, max= 1524, avg=1315.94, stdev=81.37, samples=10781

-----

AWS

test_zufaellig_aws: (groupid=0, jobs=1): err= 0: pid=2960: Thu Dec 29 17:16:54 2022
  read: IOPS=315, BW=1261KiB/s (1291kB/s)(6648MiB/5400007msec)
    slat (usec): min=6, max=107, avg=19.35, stdev= 4.90
    clat (usec): min=190, max=113470, avg=1472.25, stdev=2373.50
     lat (usec): min=207, max=113491, **avg=1492.51, stdev=2374.87**
    clat percentiles (usec):
     |  1.00th=[ 318],  5.00th=[ 347], 10.00th=[ 363], 20.00th=[ 392],
     | 30.00th=[ 412], 40.00th=[ 429], 50.00th=[ 453], 60.00th=[ 482],
     | 70.00th=[ 529], 80.00th=[ 644], 90.00th=[ 6915], 95.00th=[ 7046],
     | 99.00th=[ 7177], 99.50th=[ 7242], 99.90th=[ 7504], 99.95th=[ 7635],
     | 99.99th=
   bw (  KiB/s): min= 144, max= 4184, per=100.00%, avg=1262.50, stdev=1493.09, samples=10783
   iops        : min= 36, max= 1046, avg=315.63, stdev=373.27, samples=10783
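For concreteness, the percentage the rule of thumb seems to refer to (stdev as a share of avg) can be computed directly from the bolded lat lines above — a quick sketch, with the interpretation of the ratio being an assumption:

# stdev as a percentage of avg, taken from the quoted "lat" (total latency) lines.
awk 'BEGIN {
    printf "OCI: %.1f%%\n", 167.53  / 371.52  * 100
    printf "AWS: %.1f%%\n", 2374.87 / 1492.51 * 100
}'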
Hamza Karabulut (1 rep)
Dec 29, 2022, 09:42 PM
0 votes
1 answer
438 views
How does the --bsize option in fio work?
Since fio is a benchmarking tool that, for each run, should simulate a real I/O workload, how does the --bsize option fit with that? My understanding is that the filesystem has a set block size which an application issuing a read/write operation has to use. Say the app wants to read 256 KiB of data: if the filesystem uses a block size of 4 KiB, then that would be broken down into 64 blocks. If I were to use fio to simulate this, but set the bsize to 256 KiB, would that have any effect on the read operation? The filesystem wouldn't write 1 block but still 64 blocks, correct?
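A small sketch of the comparison being described, assuming a GNU userland; fio's bs sets the size of each I/O the application submits, while the filesystem block size reported by stat is a separate property:

# Filesystem block size of the target directory.
stat -f -c 'fundamental block size: %S bytes' /tmp
# Same amount of data, different application I/O sizes.
fio --name=bs4k   --directory=/tmp --size=64M --rw=read --bs=4k   --ioengine=psync
fio --name=bs256k --directory=/tmp --size=64M --rw=read --bs=256k --ioengine=psync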
Macondoman (1 rep)
Aug 19, 2022, 05:28 PM • Last activity: Aug 19, 2022, 06:37 PM
0 votes
1 answer
209 views
Why Disk stats show many read operations when I measure NVME squance write with fio and mmap as ioengine
Here is my fio configuration and the resulting report:
# cat fio-write.fio 
[global]
name=fio-seq-writes
filename=test
rw=write
bs=1M
direct=0
numjobs=1
[file1]
size=1G
ioengine=mmap
iodepth=1

# fio --version
fio-3.30
# fio fio-write.fio 
file1: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=mmap, iodepth=1
fio-3.30
Starting 1 process
Jobs: 1 (f=1): [W(1)][-.-%][w=373MiB/s][w=373 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=421: Sun Nov 14 21:12:09 2021
  write: IOPS=330, BW=330MiB/s (346MB/s)(1024MiB/3102msec); 0 zone resets
    clat (usec): min=2118, max=11668, avg=2598.40, stdev=1333.02
     lat (usec): min=2171, max=11754, avg=2673.15, stdev=1339.15
    clat percentiles (usec):
     |  1.00th=[ 2114],  5.00th=[ 2147], 10.00th=[ 2147], 20.00th=[ 2147],
     | 30.00th=[ 2147], 40.00th=[ 2180], 50.00th=[ 2212], 60.00th=[ 2343],
     | 70.00th=[ 2409], 80.00th=[ 2474], 90.00th=[ 2606], 95.00th=[ 4621],
     | 99.00th=[ 9241], 99.50th=, 99.90th=, 99.95th=,
     | 99.99th=
   bw (  KiB/s): min=122880, max=385024, per=99.76%, avg=337237.33, stdev=105105.84, samples=6
   iops        : min=  120, max=  376, avg=329.33, stdev=102.64, samples=6
  lat (msec)   : 4=94.14%, 10=4.98%, 20=0.88%
  cpu          : usr=28.25%, sys=61.08%, ctx=253, majf=262144, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1024,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=330MiB/s (346MB/s), 330MiB/s-330MiB/s (346MB/s-346MB/s), io=1024MiB (1074MB), run=3102-3102msec

Disk stats (read/write):
  nvme0n1: ios=1908/757, merge=0/0, ticks=1255/3876, in_queue=5130, util=84.41%
As you can see, the Disk stats say there were 1908 reads and 757 writes (ios is the _number of I/Os performed by all groups_). This test case is **sequential writing only** (configured via *rw=write*), so why does it show my NVMe issuing 1908 reads? I also tried:

* sequential read (read only)
Disk stats (read/write):
  nvme0n1: ios=2026/0, merge=0/0, ticks=630/0, in_queue=631, util=69.47%
* random read(255234 read, 991 write)
Disk stats (read/write):
  nvme0n1: ios=255234/991, merge=0/6, ticks=3936/1739, in_queue=5674, util=95.55%
* random read(259349 read, 2 write)
Disk stats (read/write):
  nvme0n1: ios=259349/2, merge=0/0, ticks=3453/0, in_queue=3454, util=93.49%
I also tried other ioengines like libaio, io_uring and psync (fio's default ioengine): their sequential and non-sequential read jobs issue only read operations, and their sequential and non-sequential write jobs issue only write operations, as expected. Only mmap behaves weirdly.
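One grounded observation from the quoted output (a hedged sanity check, not a confirmed diagnosis): the write job took majf=262144 major faults, which is exactly one fault per 4 KiB page of the 1 GiB mapping, consistent with each mapped page being read in from the device before it is first written:

# One major fault per 4 KiB page of a 1 GiB mapping:
echo $(( 1024 * 1024 * 1024 / 4096 ))   # prints 262144, matching majf=262144 above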
Li Chen (397 rep)
Jun 5, 2022, 03:28 PM • Last activity: Aug 11, 2022, 09:21 AM
0 votes
1 answer
86 views
Doing fio testing and hotplug remove the SSD
When a disk is in use, e.g. while running an fio test (random write), and the PCIe SSD is removed at the same time: should I expect no I/O errors at all, given that the system supports hotplug?
Mark K (955 rep)
Oct 30, 2021, 05:12 AM • Last activity: Oct 30, 2021, 06:07 AM
1 vote
1 answer
1151 views
fio: how to reduce verbosity?
When I run an fio command, I get a huge file full of lines like the following, which fills up the entire space. I am interested only in the final fio output summary. How can I reduce this fio verbosity?

::::
Jobs: 4 (f=4): [W(3),X(1),R(1)][0.1%][r=1244MiB/s,w=2232MiB/s][r=319k,w=17.9k IOPS][eta 03h:04m:45s]
Jobs: 4 (f=4): [W(3),X(1),R(1)][0.1%][r=1243MiB/s,w=2252MiB/s][r=318k,w=18.0k IOPS][eta 03h:02m:18s]
::::

fio is run using the following command:

[root@system user]# fio iops_wipc.fio --eta=always --eta-newline=1 | tee /tmp/iops_wipc_op

[root@system user]# cat iops_wipc.fio
[wipc-iops]
group_reporting
direct=1
ioengine=libaio
allow_mounted_write=1
refill_buffers
scramble_buffers=1
thread=1
#eta-newline=10
bs=128k
numjobs=4
iodepth=32
rw=write
size=768G

[device0]
filename=/dev/nvme2n1
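A sketch of the fio flags (per the fio documentation) that keep only the summary; the output paths are the ones from the question:

# Suppress the per-interval ETA/progress lines and keep the final summary only.
fio iops_wipc.fio --eta=never --output=/tmp/iops_wipc_op
# Or request a compact machine-readable summary instead of the human-readable one.
fio iops_wipc.fio --eta=never --output-format=json --output=/tmp/iops_wipc.json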
user3488903 (183 rep)
Jul 9, 2020, 04:19 PM • Last activity: Jun 17, 2021, 05:06 PM
0 votes
1 answer
113 views
RAMDisk disappears after random read tests
I have created a 60 GB RAM disk using the command `modprobe brd rd_size=62914560`. It creates 16 RAM disks and I use /dev/ram1. The OS is CentOS 7.5 with kernel version 3.10. I don't make any filesystem on the RAM disk because I want to use it as a raw block device. My test scenario includes two phases and I use the FIO tool: (1) I write sequentially to /dev/ram1 so that it is initialized and the memory is allocated to it. (2) I test the RAM disk's performance with 4 KB random reads. However, the RAM disk disappears during the random read test (second phase). I have checked this using the command `free -m`. Why does the RAM disk disappear when we read from it?
Arghavan Mohammadhassani (121 rep)
Apr 27, 2021, 06:37 AM • Last activity: May 16, 2021, 09:02 AM
1 vote
0 answers
189 views
FIO processes go from aiospn to 100% CPU
I'm using FreeBSD 12.2 and FIO 3.24. The ioengine parameter is posixaio. I am testing NVMe drives. During the initial part of our testing, we hit the unit under test with a QD of 32 and numjobs of 4 for 3 hours (random write with a mix of block sizes). Usually about two thirds of the way through, I notice the 4 processes (one by one) go from state aiospn, usually using 5-10% CPU, to state CPUnnn at 100% CPU. The vfs.aio values are below. The question is: who is the guilty party, FreeBSD or FIO? I'm taking a guess that someone isn't handling a dropped I/O request well.

vfs.aio.max_buf_aio: 8192
vfs.aio.max_aio_queue_per_proc: 65536
vfs.aio.max_aio_per_proc: 8192
vfs.aio.aiod_lifetime: 30000
vfs.aio.num_unmapped_aio: 0
vfs.aio.num_buf_aio: 0
vfs.aio.num_queue_count: 0
vfs.aio.max_aio_queue: 65536
vfs.aio.target_aio_procs: 4
vfs.aio.num_aio_procs: 4
vfs.aio.max_aio_procs: 32
vfs.aio.unsafe_warningcnt: 1
vfs.aio.enable_unsafe: 0
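A couple of FreeBSD-side checks that can show what the spinning workers are doing, sketched with a placeholder pid:

PID=12345                # placeholder: pid of one spinning fio worker
procstat -kk "$PID"      # kernel stack: shows whether it is stuck in the aio path
truss -c -p "$PID"       # attach and summarize syscalls for a while; stop with Ctrl-C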
jim feldman (31 rep)
Feb 3, 2021, 09:29 PM
3 votes
0 answers
1192 views
Extremely poor performance for ZFS 4k randwrite on NVMe compared to XFS?
I've been a fan of ZFS for a long time and I use it on my home NAS, but in testing its viability for production workloads I've found that its performance is inconceivably bad compared with XFS on the same disks. Testing on an Intel P4510 8TB disk with fio 3.21 and these settings:

fio \
  --name=xfs-fio \
  --size=10G \
  -group_reporting \
  --time_based \
  --runtime=300 \
  --bs=4k \
  --numjobs=64 \
  --rw=randwrite \
  --ioengine=sync \
  --directory=/mnt/fio/

Results look like this:

xfs-fio: (groupid=0, jobs=64): err= 0: pid=63: Mon Feb 1 21:46:44 2021
  write: IOPS=189k, BW=738MiB/s (774MB/s)(216GiB/300056msec); 0 zone resets
    clat (usec): min=2, max=2430.4k, avg=336.28, stdev=4745.39
     lat (usec): min=2, max=2430.4k, avg=336.38, stdev=4745.40
    clat percentiles (usec):
     |  1.00th=[ 7],  5.00th=[ 10], 10.00th=[ 10], 20.00th=[ 11],
     | 30.00th=[ 12], 40.00th=[ 14], 50.00th=[ 23], 60.00th=[ 35],
     | 70.00th=[ 36], 80.00th=[ 37], 90.00th=[ 39], 95.00th=[ 40],
     | 99.00th=[ 44], 99.50th=[ 8455], 99.90th=[ 66323], 99.95th=[ 70779],
     | 99.99th=
   bw (  KiB/s): min=95565, max=7139939, per=100.00%, avg=757400.32, stdev=21559.21, samples=38262
   iops        : min=23890, max=1784976, avg=189327.65, stdev=5389.87, samples=38262
  lat (usec)   : 4=0.03%, 10=13.41%, 20=36.22%, 50=49.56%, 100=0.12%
  lat (usec)   : 250=0.13%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.46%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%, >=2000=0.01%
  cpu          : usr=0.27%, sys=7.34%, ctx=793590, majf=0, minf=116620
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,56715776,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=738MiB/s (774MB/s), 738MiB/s-738MiB/s (774MB/s-774MB/s), io=216GiB (232GB), run=300056-300056msec

Disk stats (read/write):
  nvme7n1: ios=25/21951553, merge=0/173138, ticks=4/660308, in_queue=265520, util=21.39%

On ZFS, with this zpool create:

# zpool create -o ashift=13 -o autoreplace=on nvme6 /dev/nvme6n1

And this volume create:

zfs create \
  -o mountpoint=/mnt/nvme6 \
  -o atime=off \
  -o compression=lz4 \
  -o dnodesize=auto \
  -o primarycache=metadata \
  -o recordsize=128k \
  -o xattr=sa \
  -o acltype=posixacl \
  nvme6/test0

The results look like this:

zfs-fio: (groupid=0, jobs=64): err= 0: pid=64: Mon Feb 1 23:00:41 2021
  write: IOPS=28.3k, BW=110MiB/s (116MB/s)(32.3GiB/300004msec); 0 zone resets
    clat (usec): min=7, max=314789, avg=2258.78, stdev=2509.17
     lat (usec): min=7, max=314790, avg=2259.28, stdev=2509.22
    clat percentiles (usec):
     |  1.00th=[ 52],  5.00th=[ 70], 10.00th=[ 81], 20.00th=[ 106],
     | 30.00th=[ 225], 40.00th=[ 1057], 50.00th=[ 1713], 60.00th=[ 2606],
     | 70.00th=[ 3458], 80.00th=[ 4146], 90.00th=[ 4948], 95.00th=[ 5669],
     | 99.00th=[ 8455], 99.50th=, 99.90th=, 99.95th=,
     | 99.99th=
   bw (  KiB/s): min=51047, max=455592, per=100.00%, avg=113196.01, stdev=702.99, samples=38272
   iops        : min=12761, max=113897, avg=28297.59, stdev=175.73, samples=38272
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.80%, 100=16.73%, 250=12.93%
  lat (usec)   : 500=2.45%, 750=2.97%, 1000=3.37%
  lat (msec)   : 2=14.91%, 4=23.92%, 10=21.20%, 20=0.50%, 50=0.19%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=0.31%, sys=7.39%, ctx=11163058, majf=0, minf=32449
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8476060,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=110MiB/s (116MB/s), 110MiB/s-110MiB/s (116MB/s-116MB/s), io=32.3GiB (34.7GB), run=300004-300004msec

XFS did 189k IOPS, ZFS did 28.3k IOPS - an 85% decrease - with an equivalent decrease in throughput. CPUs are dual Xeon 6132, and this machine's kernel is 4.15.0-62-generic, though I've seen the same effects on 5.x kernels as well.
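For reference, a sketch of dataset properties that are commonly tried when comparing a 4 KiB random-write workload; this is illustrative, not a guaranteed fix, and the dataset name is made up:

# Match recordsize to the 4 KiB I/O size and cache data as well as metadata.
zfs create \
    -o mountpoint=/mnt/nvme6-4k \
    -o recordsize=4k \
    -o atime=off \
    -o compression=lz4 \
    -o primarycache=all \
    -o xattr=sa \
    nvme6/test4k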
Evan (171 rep)
Feb 3, 2021, 08:54 PM
0 votes
2 answers
274 views
Multiple Threads Cannot Access the Same RAMdisk Created by modprobe
I have created RAM disks of 60 GB using `modprobe brd rd_size=62914560` on CentOS 7.5. Checking the results, `fdisk -l /dev/ram*` shows 16 RAM block devices of 60 GB size (/dev/ram0, /dev/ram1, ..., /dev/ram15). I want to run 16 jobs (threads) doing random accesses on one RAM block device to check the performance. I run such a workload using the FIO tool. However, I get the following error:

> cache invalidation of /dev/ram1 failed: Device or resource busy

Why does this happen? Is there a limitation on the number of jobs (threads) that can access a single RAM block device? Also, when I check the block devices with lsblk, the RAM block devices are not shown. What is the reason? Thanks
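Two hedged sketches related to the symptoms above: lsblk hides RAM disks (major number 1) by default, and fio invalidates the page cache for the target before a run (invalidate=1 by default), which is where the cache-invalidation error comes from; disabling it is one possible workaround:

# Include RAM disks (major 1) in lsblk output.
lsblk -I 1
# Run many jobs against the same brd device without the pre-run cache invalidation.
fio --name=ramtest --filename=/dev/ram1 --rw=randread --bs=4k --direct=1 \
    --numjobs=16 --thread --invalidate=0 --runtime=30 --time_based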
Arghavan Mohammadhassani (121 rep)
Nov 1, 2020, 10:15 AM • Last activity: Nov 6, 2020, 11:32 AM
3 votes
1 answer
1039 views
Does it make sense to use queue-depth when doing synchronous IO benchmark?
Does it make sense to have a queue depth > 1 when doing a synchronous I/O benchmark? I was expecting the same result as at QD1, but QD32 gives a better result; I thought it would just be ignored. The fio [manual](https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth) says for the option --iodepth:

> Note that increasing iodepth beyond 1 will not affect synchronous ioengines...

fio commands:

fio --name=x --ioengine=posixaio --rw=write --bs=4k --iodepth=1 --size=512MB --fsync=1 --filename=test.img

Result: 5,210 IOPS / 20 MB/s

fio --name=x --ioengine=posixaio --rw=write --bs=4k --iodepth=32 --size=512MB --fsync=1 --filename=test.img

Result: 20,100 IOPS / 79 MB/s
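Note that posixaio is an asynchronous engine (POSIX AIO), so the quoted iodepth caveat does not apply to it. A sketch of the same comparison with a truly synchronous engine, where the two runs should come out roughly equal:

fio --name=qd1  --ioengine=psync --rw=write --bs=4k --iodepth=1  --size=512MB --fsync=1 --filename=test.img
fio --name=qd32 --ioengine=psync --rw=write --bs=4k --iodepth=32 --size=512MB --fsync=1 --filename=test.img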
MrCalvin (766 rep)
Oct 31, 2020, 02:06 PM • Last activity: Nov 5, 2020, 06:26 AM
0 votes
0 answers
1366 views
Understanding "Laying out IO file" in fio
I am trying to understand what really happens when in the "Laying out IO file". I have btrfs installed on a raw block device and whenever I run fio with the following configuration, I see the laying out step takes about 40 minutes to complete before the actual fio job starts to perform IO ``` runtim...
I am trying to understand what really happens in the "Laying out IO file" step. I have btrfs on a raw block device, and whenever I run fio with the following configuration, I see the laying-out step take about 40 minutes to complete before the actual fio job starts to perform I/O:
runtime=600
rw=readwrite
rwmixwrite=90
random_distribution=random
percentage_random=100
size=50%
iodepth=16
ioengine=libaio
direct=1
bs=4096
time_based=1
fallocate=none
directory=/tmp/fs_d765f32a-1a34-11eb-8644-61649a50b743
write_lat_log=/tmp/ll
log_avg_msec=500
log_unix_epoch=1
log_max_value=1
filesize=8GB
If you look at the following lines, the command is started at 15:19:51 and "Laying out IO file" finishes at 15:58:31 (about 40 minutes spent laying out the file). I tried looking through the source code, and it seems that laying out occurs whenever the program decides to extend a file. I am assuming layout then happens only if there are reads in the fio configuration, but it is a bit unclear to me at this point why it should take 40 minutes. I would appreciate some insight into what really goes on here.
2020-10-29 15:19:51,583 [MainThread] - root - DEBUG - Fio job output: job-0: (g=0): rw=rw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
fio-3.16
Starting 1 process
job-0: Laying out IO file (1 file / 4096MiB)

2020-10-29 15:58:31,896 [MainThread] - root - DEBUG - Fio job output:
job-0: (groupid=0, jobs=1): err= 0: pid=70: Thu Oct 29 22:58:31 2020
....
....
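A hedged suggestion for shortening the layout phase; with fallocate=none fio can only extend the file by writing it out, which is what the 40-minute phase appears to be doing:

# Let fio allocate the file with fallocate() where the filesystem supports it:
# change the job-file line "fallocate=none" to "fallocate=native" (or "posix").
#
# Alternatively, pre-create the data file once so later runs skip layout entirely.
# The file name below follows fio's default jobname.jobnum.filenum pattern and is an assumption;
# note that a sparse file changes what reads actually hit on disk.
truncate -s 4096M /tmp/fs_d765f32a-1a34-11eb-8644-61649a50b743/job-0.0.0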
linux_engine (1 rep)
Oct 30, 2020, 07:07 AM
0 votes
1 answer
423 views
What makes completion latency various in Fio benchmark with NVMe SSD?
I'm trying to figure out the completion latency of an fio benchmark with an NVMe SSD. I made an fio script to test this, using the following options: `rw=read, ioengine=sync, direct=1`. So I thought there was not much that could make completion times differ. However, the result wasn't what I expected.

(screenshot of the fio latency percentile output omitted)

The result ranges from 11 us at the 1st percentile to 111 us at the 99.99th percentile. Synchronous reads create no outstanding I/Os, so all I/Os are processed sequentially, and the direct option bypasses the OS buffer cache. I thought most of the latencies would be the same. Any ideas about this result?
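To see where the spread comes from, fio can dump per-I/O completion latencies and report a denser percentile list; a sketch with a placeholder device name:

fio --name=clat_probe --filename=/dev/nvme0n1 --rw=read --bs=4k --direct=1 --ioengine=sync \
    --runtime=30 --time_based --write_lat_log=clat_probe \
    --percentile_list=50:90:99:99.9:99.99
# Produces clat_probe_clat.*.log with one line per I/O, suitable for plotting or histogramming.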
ray5273 (3 rep)
Sep 6, 2020, 12:42 PM • Last activity: Sep 13, 2020, 03:49 PM
0 votes
1 answer
406 views
fio: how to write 2X user capacity?
My system has SSDs and I wanted to run some benchmark tests on them based on the SSS PTS (SNIA). For example, for the IOPS test the spec suggests QD=32, TC=4, and for doing the I/O the spec says the following:

> Run SEQ Workload Independent Pre-conditioning - Write 2X User Capacity with 128KiB SEQ writes, writing the entire ActiveRange without LBA restrictions.

My system has an SSD of size 12 TB, so I planned to invoke fio two times in sequence as below:

fio --iodepth=32 --bs=128 --numjobs=4 --rw=write --size=3T ... # write 4*3T=12T
fio --iodepth=32 --bs=128 --numjobs=4 --rw=write --size=3T ... # write 4*3T=12T

But from the first command alone, I got fio: native_fallocate call failed: No space left on device. I think this is expected, since there are some other small files/dirs such as "lost+found". I think I am doing something wrong here and there must be a better way of doing this. Can somebody suggest how I should parameterize my fio command so that it writes 2X the user capacity as suggested in the PTS spec? Thanks in advance.
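For reference, a hedged sketch of expressing both preconditioning passes in one invocation against the raw device (assumptions: the whole 12 TB device is the target, overwriting it is acceptable, and the device name is a placeholder):

# Four jobs write disjoint 3 TiB regions (offset_increment spaces them out); loops=2 covers
# the range twice, i.e. 2X user capacity of 128 KiB sequential writes. Destroys all data.
sudo fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=128k \
    --iodepth=32 --numjobs=4 --ioengine=libaio --direct=1 \
    --size=3T --offset_increment=3T --loops=2 --group_reporting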
user3488903 (183 rep)
Jul 11, 2020, 03:51 PM • Last activity: Sep 4, 2020, 07:07 AM
1 vote
1 answer
370 views
How does a utility 'fio' perform VFS like operations on raw unformatted devices with no filesystem on them?
I understand that one cannot do VFS operations on a medium with no filesystem. Given that, how does a utility like fio perform VFS-like read/write/seek operations on raw devices?
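A small illustration of the underlying point, with a placeholder device name: a block device node supports the ordinary open/lseek/read file operations directly, no filesystem required, and that byte-offset interface is what tools like dd (and fio, when given a raw device) use:

# Read 4 KiB at byte offset 40960 from a raw device; no filesystem involved (read-only).
sudo dd if=/dev/nvme0n1 bs=4096 skip=10 count=1 status=none | xxd | head -n 3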
nishad kamdar (73 rep)
Jul 12, 2020, 04:28 PM • Last activity: Sep 4, 2020, 06:46 AM
1 vote
1 answer
77 views
Quickly assemble raid5 for perf test
I'd like to run a series of fio-based performance tests on a few drives in various RAID and non-RAID configurations. When assembling drives in RAID5, the rebuild process takes an incredibly long time (6TB HDD). Since I'm going to completely overwrite the disks as part of the performance tests (or at least all the sectors I plan on reading), is there any way I can configure mdadm to not bother rebuilding the parity and just calculate parity the next time the sector is written?
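For the record, mdadm can skip the initial resync when creating the array; a short sketch with placeholder device names:

# --assume-clean skips the initial parity build. Parity on a stripe is only valid once that
# stripe has been written, so use it only when, as here, the contents will be overwritten anyway.
sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 --assume-clean /dev/sd[bcd]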
Huckle (1087 rep)
Aug 9, 2020, 04:49 PM • Last activity: Aug 9, 2020, 05:29 PM
Showing page 1 (20 questions total)