
Weird I/O stalls affecting a whole desktop

3 votes
1 answer
694 views
After a recent hardware migration I started experiencing weird I/O stalls affecting my desktop Debian Stretch system. Typical symptoms, all happening during each stall:

- I can no longer interact with Chromium, my web browser. Nothing works: webpage scrolling (usually this is how I notice the stall), switching tabs, etc. No mouse-over actions either, whether on a web page or in the Chromium UI.
- In a virtual terminal, I can't run new processes anymore. For example, I open a new tab in `mate-terminal` and my shell doesn't show up, just the cursor blinking. In a terminal with a shell opened before a stall, I can type a command, but usually it doesn't start; `sudo something` doesn't even ask for a password.
- Other programs, like RStudio, can't save anything to disk and often hang when they attempt to.
- I see in the output of `journalctl -f` that if the stall is long enough, journald itself restarts. Example:

        sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
        sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Unit entered failed state.
        sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Failed with result 'watchdog'.
        sty 30 14:03:54 liori-pc systemd: systemd-journald.service: Service has no hold-off time, scheduling restart.
        sty 30 14:03:54 liori-pc systemd: Stopped Flush Journal to Persistent Storage.
        sty 30 14:03:54 liori-pc systemd: Stopping Flush Journal to Persistent Storage...
        sty 30 14:03:54 liori-pc systemd: Stopped Journal Service.
        sty 30 14:03:54 liori-pc systemd: Starting Journal Service...
        sty 30 14:03:54 liori-pc systemd-journald: Journal started
        sty 30 14:03:54 liori-pc systemd-journald: System journal (/var/log/journal/2318080f60e357aaf765e98d0000035c) is 2.1G, max 4.0G, 1.8G free.

- When using dm_crypt, a `dmcrypt_write` process starts taking 100% of a single CPU core (I later got rid of dm_crypt on this system, but the stalls still happen).
- I watch `/proc/meminfo` and see that the `Dirty` number is never more than a few megabytes. Notably, during a stall, this number doesn't change.
- In rare cases, I even get a kernel message of the form "INFO: task «some_process» blocked for more than 120 seconds.", with «some_process» usually being `mdX_raid5`, `chromium` or one of its threads, etc. [Example log](https://gist.github.com/liori/1201305ceb5787308c46b051aa16fbe3).

Initially my setup was just a single 600GB ext4 file system on a partition on a single 1TB drive (currently `/dev/sdd`). Then I migrated to 3×6TB drives (`/dev/sd{b,c,e}`), with LVM-based raid5, bcache with its cache on an SSD drive, then dm_crypt, and that's when the stalls started. In the process of debugging, I simplified it to just LVM-raid5, with no bcache or dm_crypt; stalls still happen, though I feel they are less frequent now.

This kind of stall happens several times a day and usually lasts a few minutes. I noticed that I can break it by explicitly requesting some disk operation: I was sometimes able to break it by logging in to this system over ssh from a remote machine, or (almost always) by just running `cat /dev/sdb >/dev/null` or `cat /dev/sdc >/dev/null` (sometimes one works, sometimes the other; notably, `cat /dev/sde >/dev/null` never helped). Then everything that stalled suddenly starts working again.

So I suspect the problem is caused by one of, or an interaction of:

- The drives: all three are Seagate Skyhawk ST6000VX0023. Two of them were unused before this setup; the third one (`/dev/sdc`) had been in use for half a year.
- Disk controllers: the motherboard, a [Gigabyte Z68X-UD3H-B3](https://www.gigabyte.com/Motherboard/GA-Z68X-UD3H-B3-rev-10#sp), has two controllers: a Marvell 88SE9172, to which one of the drives is connected, and the chipset's built-in controller (Intel® Z68), with the two others (can I check which one is where in software?).
- Some bug in the controller kernel drivers.
- Some bug in LVM or raid5.
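On the question of which drive sits on which controller: on Linux, each `/sys/block/sdX` entry is a symlink whose resolved path passes through the PCI device of the controller. A minimal sketch that extracts that PCI address follows. The function names `controller_of` and `disk_controllers` are made up for illustration, and it assumes the usual sysfs layout rather than anything verified on this particular machine.

```python
import os
import re

# PCI addresses look like 0000:00:1f.2 (domain:bus:device.function).
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_of(sysfs_path):
    """Return the last PCI address found in a resolved sysfs path, or None.

    The last address is the device closest to the disk, i.e. the
    controller itself rather than a bridge above it.
    """
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def disk_controllers(sys_block="/sys/block"):
    """Map each sdX disk to the PCI address of its controller."""
    result = {}
    if not os.path.isdir(sys_block):
        return result
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue
        # /sys/block/sdX resolves to something like
        # /sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/.../block/sdX
        resolved = os.path.realpath(os.path.join(sys_block, name))
        result[name] = controller_of(resolved)
    return result


if __name__ == "__main__":
    for disk, pci in disk_controllers().items():
        print(disk, pci)
```

Each printed address can then be passed to `lspci -s <address>` to see whether it names the Marvell or the Intel controller.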
This is a Debian Stretch system with some backported packages installed, most notably kernel 4.19.0-0.bpo.1-amd64. Intel Core i7-2600k, 16GB of RAM.

At this point I ran out of ideas. How do I debug this problem further?

Edit: I started a script that reads a single random sector from one of these drives every 4 seconds, and have had no stalls for 2 days now. So it does look like some system component (LVM? raid?) doesn't properly wake up devices from some kind of low-power mode when necessary.

Edit: I no longer have access to this system, so I can no longer test any hypothesis. I can only say that after running that script I was no longer getting the stalls. I wish I knew how to debug it, though.
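The script from the first edit was not posted. A minimal sketch of such a periodic reader might look like the following. The names `device_size`, `read_random_sector` and `keep_awake`, and the 512-byte sector size, are assumptions for illustration, not the original script. It reads through the page cache, on the assumption that a random offset on a 6TB drive will almost never be cached.

```python
import os
import random
import time

SECTOR_SIZE = 512  # assumption: logical sector size; adjust if the drive differs


def device_size(fd):
    """Size in bytes of an open file or block device, found by seeking to its end."""
    return os.lseek(fd, 0, os.SEEK_END)


def read_random_sector(fd, size, sector_size=SECTOR_SIZE):
    """Read one sector at a random sector-aligned offset; return (offset, data)."""
    sectors = size // sector_size
    offset = random.randrange(sectors) * sector_size
    data = os.pread(fd, sector_size, offset)
    return offset, data


def keep_awake(path, interval=4.0, rounds=None):
    """Read a random sector from `path` every `interval` seconds.

    Runs forever when `rounds` is None; otherwise stops after `rounds`
    reads and returns the number of reads performed.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = device_size(fd)
        done = 0
        while rounds is None or done < rounds:
            read_random_sector(fd, size)
            done += 1
            if rounds is None or done < rounds:
                time.sleep(interval)
    finally:
        os.close(fd)
    return done
```

Run as root, `keep_awake("/dev/sdb")` would loop indefinitely. This only reproduces the workaround; it says nothing about which component fails to wake the drive.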
Asked by liori (630 rep)
Feb 3, 2019, 12:24 AM
Last activity: Aug 5, 2020, 08:55 PM