After a recent hardware migration I started experiencing weird I/O stalls on my desktop Debian Stretch system. Typical symptoms, all of which occur during each stall:
- I stop being able to interact with Chromium, my web browser. Nothing works: webpage scrolling (usually this is how I notice the stall), switching tabs, etc. No mouse-over actions either, whether on a web page or in the Chromium UI.
- In a virtual terminal, I can't start new processes anymore. For example, I open a new tab in `mate-terminal` and my shell doesn't show up, just a blinking cursor. In a terminal with a shell opened before the stall, I can type a command, but usually it doesn't start; `sudo something` doesn't even ask for a password.
- Other programs, like RStudio, can't save anything to disk and often hang when they attempt to.
- I see in the output of `journalctl -f` that if the stall is long enough, `journald` itself restarts, for example:
  ```
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Unit entered failed state.
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Failed with result 'watchdog'.
  sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Service has no hold-off time, scheduling restart.
  sty 30 14:03:54 liori-pc systemd: Stopped Flush Journal to Persistent Storage.
  sty 30 14:03:54 liori-pc systemd: Stopping Flush Journal to Persistent Storage...
  sty 30 14:03:54 liori-pc systemd: Stopped Journal Service.
  sty 30 14:03:54 liori-pc systemd: Starting Journal Service...
  sty 30 14:03:54 liori-pc systemd-journald: Journal started
  sty 30 14:03:54 liori-pc systemd-journald: System journal (/var/log/journal/2318080f60e357aaf765e98d0000035c) is 2.1G, max 4.0G, 1.8G free.
  ```
- When using dm_crypt, a `dmcrypt_write` process starts taking 100% of a single CPU core (I later got rid of dm_crypt on this system, but the stalls still happen).
- I observe `/proc/meminfo` and see that the `Dirty` value is never more than a few megabytes. Notably, during a stall this number doesn't change. (A minimal way to watch it is shown in the sketch after this list.)
- In rare cases, I even get a kernel message of the form "INFO: task «some process» blocked for more than 120 seconds.", with «some process» usually being mdX_raid5, chromium or one of its threads, etc. [Example log](https://gist.github.com/liori/1201305ceb5787308c46b051aa16fbe3).
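For reference, this is roughly how I watch the writeback counters during a stall (a minimal sketch; the fields grepped here are just the ones I care about):

```sh
# Refresh the dirty/writeback counters every second; during a stall
# the Dirty value stays frozen at a few megabytes.
watch -n 1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"
```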
Initially my setup was just a single 600 GB ext4 file system on a partition of a single 1 TB drive (the current `/dev/sdd`). Then I migrated to 3×6 TB drives (`/dev/sd{b,c,e}`) with LVM-based raid5, bcache with its cache on an SSD drive, and then dm_crypt; that's when the stalls started. While debugging I simplified the stack to just LVM raid5, with no bcache or dm_crypt; the stalls still happen, though they seem less frequent now.
This kind of stall happens several times a day and usually lasts a few minutes. I noticed that I can break it by explicitly requesting some disk operation: sometimes by logging in to the system over ssh from a remote machine, or (almost always) by just running `cat /dev/sdb >/dev/null` or `cat /dev/sdc >/dev/null` (sometimes one works, sometimes the other; notably `cat /dev/sde >/dev/null` never helped). Then everything that stalled suddenly starts working again.
So I suspect the problem is caused by one of the following, or an interaction between them:
- The drives: all three are Seagate Skyhawk ST6000VX0023. Two of them were unused before this setup; the third one (`/dev/sdc`) had been in use for half a year.
- The disk controllers: the motherboard, a [Gigabyte Z68X-UD3H-B3](https://www.gigabyte.com/Motherboard/GA-Z68X-UD3H-B3-rev-10#sp), has two of them: a Marvell 88SE9172, to which one of the drives is connected, and the controller built into the chipset (Intel® Z68) with the other two (can I check which drive is on which controller in software? see the sketch after this list).
- Some bug in the controller kernel drivers.
- Some bug in LVM or raid5.
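Regarding the parenthetical question above: I suppose the sysfs path of each block device reveals which SATA host it hangs off. A sketch, assuming the usual sysfs layout (the PCI address in the resolved path should match one of the controllers listed by `lspci`):

```sh
# Resolve each disk's sysfs path; the PCI address in the path
# identifies the controller the disk is attached to.
for d in sdb sdc sdd sde; do
    echo "$d -> $(readlink -f /sys/block/$d)"
done

# List the SATA controllers with their PCI addresses for comparison.
lspci | grep -iE 'sata|ahci'
```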
This is a Debian Stretch system with some backported packages installed, most notably kernel `4.19.0-0.bpo.1-amd64`. The CPU is an Intel Core i7-2600K with 16 GB of RAM.
At this point I ran out of ideas. How do I debug this problem further?
Edit: I started a script that reads a single random sector from one of these drives every 4 seconds, and I have had no stalls for 2 days now. So it does indeed look like some system component (LVM? raid?) doesn't properly wake the devices from some kind of low-power mode when necessary. The script is along the lines of the sketch below.
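This is a sketch of the idea rather than the exact script (the device name and commands are illustrative):

```sh
#!/bin/sh
# Keep-alive workaround: read one random 512-byte sector from a raid
# member every 4 seconds so the drive never idles long enough to stall.
DEV=/dev/sdb
SECTORS=$(blockdev --getsz "$DEV")   # device size in 512-byte sectors

while true; do
    SKIP=$(shuf -i 0-$((SECTORS - 1)) -n 1)
    # O_DIRECT so the read actually reaches the disk instead of the page cache.
    dd if="$DEV" of=/dev/null bs=512 count=1 skip="$SKIP" iflag=direct 2>/dev/null
    sleep 4
done
```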
Edit: I no longer have access to this system, so I can no longer test any hypothesis. I can only say that after running that script I was no longer getting the stalls. I wish I knew how to debug it, though.
Asked by liori (630 rep) on Feb 3, 2019, 12:24 AM
Last activity: Aug 5, 2020, 08:55 PM