After a recent hardware migration I started experiencing weird I/O stalls on my desktop Debian Stretch system. Typical symptoms, all of which occur during each stall:
- I stop being able to interact with Chromium, my web browser. Nothing works: webpage scrolling (usually this is how I notice the stall), switching tabs, etc. No mouse-over actions either, whether on a web page or in the Chromium UI.
- In a virtual terminal, I can't start new processes anymore. For example, I open a new tab in `mate-terminal` and my shell doesn't show up, just a blinking cursor. In a terminal with a shell opened before the stall, I can type a command, but usually it doesn't start; `sudo something` doesn't even ask for a password.
- Other programs, like RStudio, can't save anything to disk and often hang when they attempt to.
- I see in the output of `journalctl -f` that if the stall is long enough, `journald` itself restarts, for example:
  ```
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Unit entered failed state.
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Failed with result 'watchdog'.
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Service has no hold-off time, scheduling restart.
  sty 30 14:03:54 liori-pc systemd: Stopped Flush Journal to Persistent Storage.
  sty 30 14:03:54 liori-pc systemd: Stopping Flush Journal to Persistent Storage...
  sty 30 14:03:54 liori-pc systemd: Stopped Journal Service.
  sty 30 14:03:54 liori-pc systemd: Starting Journal Service...
  sty 30 14:03:54 liori-pc systemd-journald: Journal started
  sty 30 14:03:54 liori-pc systemd-journald: System journal (/var/log/journal/2318080f60e357aaf765e98d0000035c) is 2.1G, max 4.0G, 1.8G free.
  ```
- When using dm_crypt, a `dmcrypt_write` process starts taking 100% of a single CPU core (I later got rid of dm_crypt on this system, but the stalls still happen).
- I observe `/proc/meminfo` and see that the `Dirty` value is never more than a few megabytes. Notably, during a stall this number doesn't change. (A minimal way to watch it is shown in the sketch after this list.)
- In rare cases, I even get a kernel message of the form "INFO: task «some process» blocked for more than 120 seconds.", with «some process» usually being mdX_raid5, chromium or one of its threads, etc. [Example log](https://gist.github.com/liori/1201305ceb5787308c46b051aa16fbe3).
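For reference, this is roughly how I watch the writeback counters during a stall (a minimal sketch; the fields grepped here are just the ones I care about):

```sh
# Refresh the dirty/writeback counters every second; during a stall
# the Dirty value stays frozen at a few megabytes.
watch -n 1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"
```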
Initially my setup was just a single 600 GB ext4 file system on a partition of a single 1 TB drive (the current `/dev/sdd`). Then I migrated to 3×6 TB drives (`/dev/sd{b,c,e}`) with LVM-based raid5, bcache with its cache on an SSD drive, and then dm_crypt; that's when the stalls started. While debugging I simplified the stack to just LVM raid5, with no bcache or dm_crypt; the stalls still happen, though they seem less frequent now.
This kind of stall happens several times a day and usually lasts a few minutes. I noticed that I can break it by explicitly requesting some disk operation: sometimes by logging in to the system over ssh from a remote machine, or (almost always) by just running `cat /dev/sdb >/dev/null` or `cat /dev/sdc >/dev/null` (sometimes one works, sometimes the other; notably `cat /dev/sde >/dev/null` never helped). Then everything that stalled suddenly starts working again.
So I suspect the problem is caused by one of the following, or an interaction between them:
- The drives: all three are Seagate Skyhawk ST6000VX0023. Two of them were unused before this setup; the third one (`/dev/sdc`) had been in use for half a year.
- The disk controllers: the motherboard, a [Gigabyte Z68X-UD3H-B3](https://www.gigabyte.com/Motherboard/GA-Z68X-UD3H-B3-rev-10#sp), has two of them: a Marvell 88SE9172, to which one of the drives is connected, and the controller built into the chipset (Intel® Z68) with the other two (can I check which drive is on which controller in software? see the sketch after this list).
- Some bug in the controller kernel drivers.
- Some bug in LVM or raid5.
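Regarding the parenthetical question above: I suppose the sysfs path of each block device reveals which SATA host it hangs off. A sketch, assuming the usual sysfs layout (the PCI address in the resolved path should match one of the controllers listed by `lspci`):

```sh
# Resolve each disk's sysfs path; the PCI address in the path
# identifies the controller the disk is attached to.
for d in sdb sdc sdd sde; do
    echo "$d -> $(readlink -f /sys/block/$d)"
done

# List the SATA controllers with their PCI addresses for comparison.
lspci | grep -iE 'sata|ahci'
```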
This is a Debian Stretch system with some backported packages installed, most notably kernel `4.19.0-0.bpo.1-amd64`. The CPU is an Intel Core i7-2600K with 16 GB of RAM.
At this point I ran out of ideas. How do I debug this problem further?
Edit: I started a script that reads a single random sector from one of these drives every 4 seconds, and I have had no stalls for 2 days now. So it does indeed look like some system component (LVM? raid?) doesn't properly wake the devices from some kind of low-power mode when necessary. The script is along the lines of the sketch below.
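This is a sketch of the idea rather than the exact script (the device name and commands are illustrative):

```sh
#!/bin/sh
# Keep-alive workaround: read one random 512-byte sector from a raid
# member every 4 seconds so the drive never idles long enough to stall.
DEV=/dev/sdb
SECTORS=$(blockdev --getsz "$DEV")   # device size in 512-byte sectors

while true; do
    SKIP=$(shuf -i 0-$((SECTORS - 1)) -n 1)
    # O_DIRECT so the read actually reaches the disk instead of the page cache.
    dd if="$DEV" of=/dev/null bs=512 count=1 skip="$SKIP" iflag=direct 2>/dev/null
    sleep 4
done
```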
Edit: I no longer have access to this system, so I can no longer test any hypothesis. I can only say that after running that script I was no longer getting the stalls. I wish I knew how to debug it, though.
Asked by liori (630 rep) on Feb 3, 2019, 12:24 AM
Last activity: Aug 5, 2020, 08:55 PM