dev.davidsoergel.com
various computing projects

High iowait, high load on a 3ware 8506-4 solved

March 14th, 2006 . by David Soergel

The Symptoms

I have two 3ware 8506-4 controllers with three disks each connected to them. I’ve been getting high iowait on my box whenever there’s an even slightly disk-intensive task going on. The weird thing is that even processes that don’t use the high-usage disk get stalled in iowait. As a result, load skyrockets and the system becomes unresponsive.

Inducing disk load with bonnie++ and watching things with “iostat -x” revealed that even disks with extremely low bandwidth usage and very few I/O requests end up at 100% utilization. However, only disks on the same controller as the high-usage disk are affected. The observed total throughput is around 50 MB/sec, i.e., the limit of the high-usage disk. It’s nowhere near the bandwidth limit of the controller, the SATA channel, or the PCI bus. So, there’s something wrong with the controller that effectively limits it to driving one disk at a time with reasonable performance.

The kernel is the current CentOS 3.6 one, 2.4.21-37.0.1.ELsmp.

The write cache is enabled on the controllers.

This seems to be related to at least some of the posts on the infamous RHEL vs. 3ware performance bug.

The Diagnosis

Apparently, what is happening is that the four SATA channels on the controller share a single request queue, with a depth of 256 requests. When requests to one disk arrive at a high rate, they fill the queue, thereby starving the other disks.

Controllers from other manufacturers typically limit the number of requests to a single disk to some number smaller than the total queue depth for the controller. Thus, the queue for a single disk can get full without blocking access to the other disks. Why 3ware made the default per-disk queue depth the same as the depth for the whole controller is beyond me. Fortunately, this is correctable.
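To make the starvation effect concrete, here is a toy model of it. It is only a sketch under made-up assumptions, not a description of the 3ware firmware: one busy disk offers 4 requests per tick, each disk completes 1 request per tick, and a quiet disk offers 1 request every 10th tick. The function name and all the rates are hypothetical.

```shell
# Toy model of a shared controller queue. Prints how many requests from
# the quiet disk found the shared queue full.
#   $1 = total queue slots, $2 = per-disk cap, $3 = ticks (default 1000)
simulate() (
    total=$1 per_lun=$2 ticks=${3:-1000}
    busy=0 quiet=0 stalls=0 t=1
    while [ "$t" -le "$ticks" ]; do
        # Each disk completes at most one outstanding request per tick.
        if [ "$busy" -gt 0 ]; then busy=$((busy - 1)); fi
        if [ "$quiet" -gt 0 ]; then quiet=$((quiet - 1)); fi

        # The busy disk offers 4 new requests, limited by the per-disk
        # cap and by the free slots left in the shared queue.
        room=$((per_lun - busy))
        free=$((total - busy - quiet))
        if [ "$room" -gt "$free" ]; then room=$free; fi
        if [ "$room" -gt 4 ]; then room=4; fi
        busy=$((busy + room))

        # The quiet disk offers one request every 10th tick.
        if [ $((t % 10)) -eq 0 ]; then
            if [ $((busy + quiet)) -lt "$total" ]; then
                quiet=$((quiet + 1))
            else
                stalls=$((stalls + 1))
            fi
        fi
        t=$((t + 1))
    done
    echo "$stalls"
)

simulate 256 256   # per-disk cap equal to the whole queue
simulate 256 32    # per-disk cap well below the queue size
```

With the per-disk cap equal to the total, the busy disk eventually holds every slot and the quiet disk's requests start stalling. With a cap of 32 the quiet disk always finds a free slot in this model.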

The Solution

I googled around for some hints.

In 2.6 kernels, the queue depth can be adjusted using the sysfs mechanism (though that may require a patch to the 3w-xxxx driver). However, I need to continue running a 2.4 kernel for a while. So, the solution is to rebuild the driver with a different queue depth (and the kernel image, since I need to boot from these disks).
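For reference, the 2.6 route might look roughly like the sketch below. I have not run this on the 8506-4. It assumes the standard SCSI sysfs attribute /sys/block/<dev>/device/queue_depth and, as noted, the 3w-xxxx driver may refuse the write without a patch. The function name and the optional sysfs-root argument are hypothetical helpers.

```shell
# Show the current queue depth of a SCSI disk, then try to change it.
#   $1 = device name (e.g. sde), $2 = new depth,
#   $3 = sysfs root (defaults to /sys; only there so the function can be
#        tried against a scratch directory)
set_queue_depth() (
    dev=$1 depth=$2 sysroot=${3:-/sys}
    attr="$sysroot/block/$dev/device/queue_depth"
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr (driver may not support changing it)" >&2
        exit 1
    fi
    echo "old depth: $(cat "$attr")"
    echo "$depth" > "$attr"
    echo "new depth: $(cat "$attr")"
)
```

Run as root, something like `set_queue_depth sde 32` would be the equivalent of the rebuild described here, but the setting does not survive a reboot.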

First I ran a bunch of tests to get a performance baseline using various readahead and elvtune settings:

3wareIssueTests.sh

Then I downloaded the kernel source, configured it, and added

CONFIG_3W_XXXX_CMD_PER_LUN=32

to /usr/src/linux-2.4/.config

Then, bypassing much of the standard stuff for a custom kernel since this is a simple change:

rm -f drivers/scsi/3w-xxxx.o
make modules
mv /lib/modules/2.4.21-37.0.1.ELsmp/kernel/drivers/scsi/3w-xxxx.o /lib/modules/2.4.21-37.0.1.ELsmp/kernel/drivers/scsi/3w-xxxx.o.orig
cp /usr/src/linux-2.4/drivers/scsi/3w-xxxx.o /lib/modules/2.4.21-37.0.1.ELsmp/kernel/drivers/scsi
mkinitrd /boot/initrd-2.4.21-37.0.1.ELsmp-3ware32b.img 2.4.21-37.0.1.ELsmp 

Then add an entry for the new initrd image to /etc/grub.conf, keeping the old image as a fallback:

default 0
fallback 1

And reboot.

Then run 3wareIssueTests.sh again and watch the load average. Hmm, that didn’t work!? The original symptoms persist, and /proc/scsi/3w-xxxx/1 shows Max commands posted: 254, which shouldn’t happen if the limit of 32 had taken effect.
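A quick way to pull that counter out of the proc file after each test run is sketched below. It assumes the line looks like the "Max commands posted: 254" quoted above; the function name is a hypothetical helper, and the file argument is optional.

```shell
# Print the number that follows "Max commands posted:" in a 3w-xxxx
# proc file. $1 = file to read (defaults to the controller used here).
max_commands_posted() (
    file=${1:-/proc/scsi/3w-xxxx/1}
    awk -F': *' '/Max commands posted/ {
        gsub(/[^0-9]/, "", $2)   # drop any padding around the number
        print $2
        exit
    }' "$file"
)
```

If `max_commands_posted` still prints 254 after a heavy test, the driver is presumably still running with the old default.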

To see which value the driver was actually using, I added

printk(KERN_WARNING "3w-xxxx: set cmd_per_lun to %d.\n", tw_host->cmd_per_lun);

at line 1177 of 3w-xxxx.c

Aha, it’s still 254. The .config change didn’t get picked up (maybe I needed make dep or something), so I just compiled the driver by hand and passed the setting as a compile flag:

gcc -D__KERNEL__ -I/usr/src/linux-2.4.21-37.0.1.EL/include \
    -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing \
    -fno-common -Wno-unused -fomit-frame-pointer -pipe -freorder-blocks \
    -mpreferred-stack-boundary=2 -march=i686 -DMODULE -DMODVERSIONS \
    -include /usr/src/linux-2.4.21-37.0.1.EL/include/linux/modversions.h \
    -nostdinc -iwithprefix include -DKBUILD_BASENAME=3w_xxxx \
    -DCONFIG_3W_XXXX_CMD_PER_LUN=32 \
    -c -o 3w-xxxx.o 3w-xxxx.c

Then build a new image, reboot, and run the tests again as above.

Awesome, that worked. iowait is still kinda high while bonnie++ runs, but that’s to be expected. Load now climbs to about 3 instead of 12. Disks on the same controller as the test disk appear to be unaffected.
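For comparing runs, the peak load can be recorded rather than eyeballed. The following is a minimal sketch, assuming the usual /proc/loadavg format with the 1-minute average as the first field. Both function names and the sampling defaults are hypothetical.

```shell
# Print the 1-minute load average (first field of a loadavg-style file).
load1() (
    file=${1:-/proc/loadavg}
    awk '{ print $1; exit }' "$file"
)

# Sample the load and print the highest value seen.
#   $1 = number of samples (default 12), $2 = seconds between samples
#   (default 5), $3 = file to read (default /proc/loadavg)
peak_load() (
    samples=${1:-12} interval=${2:-5} file=${3:-/proc/loadavg}
    i=0
    while [ "$i" -lt "$samples" ]; do
        load1 "$file"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
    done | sort -n | tail -n 1
)
```

Starting `peak_load 60 5` in a second terminal just before a bonnie++ run would cover about five minutes of the test.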

The various readahead and elvtune settings seem to have minimal effect; I’ll just leave them as:

blockdev --setra 120 /dev/sde
elvtune -r 2048 -w 8192 /dev/sde

Finally, other i/o performance tidbits:

  • The disks are mounted with noatime
  • Apache uses a single access log file
  • Added echo 100 5000 640 2560 150 30000 5000 1884 2 > /proc/sys/vm/bdflush to /etc/rc.d/rc.local, as proposed on this handy tuning page
