Applies to openSUSE Leap 15.1

12 Tuning I/O Performance

I/O scheduling controls how input/output operations will be submitted to storage. openSUSE Leap offers various I/O algorithms—called elevators—suiting different workloads. Elevators can help to reduce seek operations and can prioritize I/O requests.

Choosing the best suited I/O elevator not only depends on the workload, but on the hardware, too. Single ATA disk systems, SSDs, RAID arrays, or network storage systems, for example, each require different tuning strategies.

12.1 Switching I/O Scheduling

openSUSE Leap picks a default I/O scheduler at boot time. It can be changed on the fly for each block device, which makes it possible to set different algorithms, for example, for the device hosting the system partition and the device hosting a database.

The default I/O scheduler is chosen for each device based on whether the device reports itself as a rotational disk. For non-rotational disks, the DEADLINE I/O scheduler is picked. Other devices default to CFQ (Completely Fair Queuing). To change this default, use the following boot parameter:

elevator=SCHEDULER

Replace SCHEDULER with one of the values cfq, noop, or deadline. See Section 12.2, “Available I/O Elevators” for details.
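After a reboot, it can be useful to confirm that the parameter actually reached the kernel. The following is a minimal sketch that looks up a key in the kernel command line. The function name cmdline_value and its optional FILE argument are assumptions made for illustration and are not part of openSUSE Leap.

```shell
# Print the value of a kernel command-line parameter, or nothing if absent.
# Usage: cmdline_value KEY [FILE]
# FILE is a hypothetical override for testing; it defaults to /proc/cmdline.
cmdline_value() {
    awk -v key="$1" '{
        # Walk all space-separated words and remember the last "KEY=value".
        for (i = 1; i <= NF; i++)
            if (index($i, key "=") == 1)
                val = substr($i, length(key) + 2)
    } END { if (val != "") print val }' "${2:-/proc/cmdline}"
}
```

On a system booted with elevator=deadline, running cmdline_value elevator prints deadline. It prints nothing if the parameter was not given.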

To change the elevator for a specific device in the running system, run the following command:

tux > echo SCHEDULER | sudo tee /sys/block/DEVICE/queue/scheduler

Here, SCHEDULER is one of cfq, noop, or deadline. DEVICE is the block device (sda for example). Note that this change does not persist across reboots. To change the I/O scheduler permanently for a particular device, either place the command that switches the I/O scheduler into init scripts or add an appropriate udev rule to /lib/udev/rules.d/. See /lib/udev/rules.d/60-ssd-scheduler.rules for an example of such tuning.
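When the switch is scripted (for example, from an init script), it helps to check first that the kernel offers the requested scheduler for the device. The following is a minimal sketch of such a check. The function name set_scheduler and its third argument are assumptions made for illustration. The third argument only exists so the function can be tried against a copy of the directory layout instead of the real /sys/block.

```shell
# Switch the I/O scheduler of a block device, but only if it is offered.
# Usage: set_scheduler DEVICE SCHEDULER [SYSFS_BLOCK_DIR]
# SYSFS_BLOCK_DIR is a hypothetical override (default: /sys/block).
set_scheduler() {
    sched_file="${3:-/sys/block}/$1/queue/scheduler"
    # The file lists all available schedulers; the active one is in brackets.
    if ! sed 's/[][]//g' "$sched_file" | tr ' ' '\n' | grep -qxF -- "$2"; then
        echo "scheduler '$2' is not available for $1" >&2
        return 1
    fi
    # On a real system, this write requires root privileges.
    echo "$2" > "$sched_file"
}
```

Run as root, set_scheduler sda deadline switches sda to DEADLINE and returns a non-zero exit status if the scheduler name is not listed for that device.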

12.2 Available I/O Elevators

Below is a list of elevators available on openSUSE Leap for devices that use the legacy block I/O path. If an elevator has tunable parameters, they can be set with the command:

tux > echo VALUE | sudo tee /sys/block/DEVICE/queue/iosched/TUNABLE

where VALUE is the desired value for the TUNABLE and DEVICE the block device.

To find out what elevators are available for a device (sda for example), run the following command (the currently selected scheduler is listed in brackets):

jupiter:~ # cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

This file can also contain the string none meaning that I/O scheduling does not happen for this device. This is usually because the device uses a multi-queue queuing mechanism (refer to Section 12.5, “Enable blk-mq I/O Path for SCSI by Default”).
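In scripts, usually only the active scheduler is of interest. The following is a minimal sketch that extracts the bracketed entry from the scheduler file and passes a bare none through unchanged. The function name current_scheduler and its optional second argument are assumptions made for illustration.

```shell
# Print the active I/O scheduler of a block device.
# Usage: current_scheduler DEVICE [SYSFS_BLOCK_DIR]
# SYSFS_BLOCK_DIR is a hypothetical override (default: /sys/block).
current_scheduler() {
    sched_line=$(cat "${2:-/sys/block}/$1/queue/scheduler") || return 1
    case $sched_line in
        # The active scheduler is the entry enclosed in brackets.
        *\[*\]*) printf '%s\n' "$sched_line" | sed 's/.*\[\(.*\)\].*/\1/' ;;
        # No brackets, for example a bare "none": print the line as it is.
        *)       printf '%s\n' "$sched_line" ;;
    esac
}
```

For the output shown above, current_scheduler sda prints cfq.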

12.2.1 CFQ (Completely Fair Queuing)

CFQ is a fairness-oriented scheduler and is used by default on openSUSE Leap. The algorithm assigns each thread a time slice in which it is allowed to submit I/O to disk. This way each thread gets a fair share of I/O throughput. It also allows assigning tasks I/O priorities which are taken into account during scheduling decisions (see Section 8.3.3, “Prioritizing Disk Access with ionice”). The CFQ scheduler has the following tunable parameters:

Table 12.1: CFQ tunable parameters

File

Description

slice_idle

When a task has no more I/O to submit in its time slice, the I/O scheduler waits before scheduling the next thread. slice_idle specifies the I/O scheduler's waiting time in milliseconds. Waiting for more I/O from a thread can improve locality of I/O. Additionally, it avoids starving processes doing dependent I/O. A process does dependent I/O if it needs a result of one I/O to submit another I/O. For example, if you first need to read an index block to find out a data block to read, these two reads form a dependent I/O.

For media where locality is less important (SSDs, SANs with lots of disks), setting slice_idle to 0 can improve the throughput considerably.

Default is 8.

slice_idle_us

Same as slice_idle but in microseconds.

Default is 8000.

quantum

This option limits the maximum number of requests that are being processed by the device. For storage with several disks, this setting can unnecessarily limit parallel processing of requests. Therefore, increasing the value can improve performance. However, it can also increase the latency of certain I/O operations, because more requests are buffered inside the storage. When changing this value, also consider tuning slice_async_rq.

Default is 8.

low_latency

When enabled (which is the default on openSUSE Leap), the scheduler may dynamically adjust the length of the time slice by aiming to meet a tuning parameter called the target_latency. Time slices are recomputed to meet this target_latency and ensure that processes get fair access within a bounded length of time.

Default is 1.

target_latency

Contains an estimated latency time in milliseconds for CFQ. CFQ uses it to calculate the time slice used for every task.

Default is 300.

target_latency_us

Same as target_latency but in microseconds.

Default is 300000.

group_idle

To avoid starving blkio cgroups doing dependent I/O, CFQ pauses after completion of I/O for one blkio cgroup before scheduling I/O for a different blkio cgroup. group_idle specifies the time in milliseconds the I/O scheduler waits. When slice_idle is set, this parameter does not have a significant effect. However, for fast media, the overhead of slice_idle is generally undesirable. Disabling slice_idle and setting group_idle is a method to avoid starvation of blkio cgroups doing dependent I/O with lower overhead.

Default is 8.

group_idle_us

Same as group_idle but in microseconds.

Default is 8000.

slice_sync

This parameter is used to calculate the time slice for the synchronous queue. It is specified in milliseconds. Increasing this value increases the time slice of the synchronous queue.

Default is 100.

slice_sync_us

Same as slice_sync but in microseconds.

Default is 100000.

slice_async

This parameter is used to calculate the time slice for the asynchronous queue. It is specified in milliseconds. Increasing this value increases the time slice of the asynchronous queue.

Default is 40.

slice_async_us

Same as slice_async but in microseconds.

Default is 40000.

slice_async_rq

This limits the maximum number of asynchronous requests—usually write requests—that are submitted in one time slice.

Default is 2.

back_seek_max

Maximum "distance" (in Kbytes) for backward seeking.

Default is 16384.

back_seek_penalty

Used to compute the cost of backward seeking.

Default is 2.

fifo_expire_async

Timeout (in milliseconds) of asynchronous requests.

Default is 250.

fifo_expire_sync

Timeout (in milliseconds) of synchronous requests.

Default is 125.
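Before changing any of the parameters from Table 12.1, it can help to record the current values so they can be restored later. The following is a minimal sketch that prints every tunable of a device as name=value. The function name show_tunables and its optional second argument are assumptions made for illustration.

```shell
# Print every scheduler tunable of a device as "name=value", one per line.
# Usage: show_tunables DEVICE [SYSFS_BLOCK_DIR]
# SYSFS_BLOCK_DIR is a hypothetical override (default: /sys/block).
show_tunables() {
    iosched_dir="${2:-/sys/block}/$1/queue/iosched"
    [ -d "$iosched_dir" ] || { echo "no tunables found for $1" >&2; return 1; }
    for tunable in "$iosched_dir"/*; do
        [ -f "$tunable" ] || continue
        # Strip the directory part to get the tunable name.
        printf '%s=%s\n' "${tunable##*/}" "$(cat "$tunable")"
    done
}
```

For example, show_tunables sda > sda-tunables.txt saves the current settings of sda to a file before a tuning session.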

Example 12.1: Increasing individual thread throughput using CFQ

In openSUSE Leap 15.1, the low_latency tuning parameter is enabled by default to ensure that processes get fair access within a bounded length of time. (Note that this parameter was not enabled in versions prior to openSUSE Leap.)

This is usually preferred in a server scenario where processes are executing I/O as part of transactions, as it makes the time needed for each transaction predictable. However, there are scenarios where that is not the desired behavior:

  • If the performance metric of interest is the peak performance of a single process when there is I/O contention.

  • If a workload must complete as quickly as possible and there are multiple sources of I/O. In this case, unfair treatment from the I/O scheduler may allow the transactions to complete faster: Processes take their full slice and exit quickly, resulting in reduced overall contention.

To address this, there are two options: increase target_latency or disable low_latency. As with all tuning parameters, it is important to verify that your workload behaves as expected before and after the tuning modification. Take careful note of whether your workload depends on individual process peak performance or scales better with fairness. Note also that performance depends on the underlying storage, so the tuning option that is correct for one installation may not be correct for another.

The following example does not control when I/O starts, but is simple enough to demonstrate the point. 32 processes write a small amount of data to disk in parallel. With the openSUSE Leap default (low_latency enabled), the result looks as follows:

root # echo 1 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh
10485760 bytes (10 MB) copied, 2.62464 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 3.29624 s, 3.2 MB/s
10485760 bytes (10 MB) copied, 3.56341 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.56908 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53043 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.57511 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53672 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.5433 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.65474 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.63694 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.90122 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88507 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.86135 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.84553 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88871 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.94943 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.12731 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.15106 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.21601 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.35004 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.33387 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.55434 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52283 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52682 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.56176 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.62727 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.78958 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.79772 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.78004 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.77994 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.86114 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.88062 s, 2.1 MB/s

real    0m4.978s
user    0m0.112s
sys     0m1.544s

Note that all processes complete in similar times. This is the CFQ scheduler meeting its target_latency: each process has fair access to storage.

The earlier processes complete somewhat faster because the processes do not all start at the same time. A more elaborate example could control for this.

This is what happens when low_latency is disabled:

root # echo 0 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh
10485760 bytes (10 MB) copied, 0.813519 s, 12.9 MB/s
10485760 bytes (10 MB) copied, 0.788106 s, 13.3 MB/s
10485760 bytes (10 MB) copied, 0.800404 s, 13.1 MB/s
10485760 bytes (10 MB) copied, 0.816398 s, 12.8 MB/s
10485760 bytes (10 MB) copied, 0.959087 s, 10.9 MB/s
10485760 bytes (10 MB) copied, 1.09563 s, 9.6 MB/s
10485760 bytes (10 MB) copied, 1.18716 s, 8.8 MB/s
10485760 bytes (10 MB) copied, 1.27661 s, 8.2 MB/s
10485760 bytes (10 MB) copied, 1.46312 s, 7.2 MB/s
10485760 bytes (10 MB) copied, 1.55489 s, 6.7 MB/s
10485760 bytes (10 MB) copied, 1.64277 s, 6.4 MB/s
10485760 bytes (10 MB) copied, 1.78196 s, 5.9 MB/s
10485760 bytes (10 MB) copied, 1.87496 s, 5.6 MB/s
10485760 bytes (10 MB) copied, 1.9461 s, 5.4 MB/s
10485760 bytes (10 MB) copied, 2.08351 s, 5.0 MB/s
10485760 bytes (10 MB) copied, 2.28003 s, 4.6 MB/s
10485760 bytes (10 MB) copied, 2.42979 s, 4.3 MB/s
10485760 bytes (10 MB) copied, 2.54564 s, 4.1 MB/s
10485760 bytes (10 MB) copied, 2.6411 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 2.75171 s, 3.8 MB/s
10485760 bytes (10 MB) copied, 2.86162 s, 3.7 MB/s
10485760 bytes (10 MB) copied, 2.98453 s, 3.5 MB/s
10485760 bytes (10 MB) copied, 3.13723 s, 3.3 MB/s
10485760 bytes (10 MB) copied, 3.36399 s, 3.1 MB/s
10485760 bytes (10 MB) copied, 3.60018 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.58151 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.67385 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.69471 s, 2.8 MB/s
10485760 bytes (10 MB) copied, 3.66658 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.81495 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.10172 s, 2.6 MB/s
10485760 bytes (10 MB) copied, 4.0966 s, 2.6 MB/s

real    0m3.505s
user    0m0.160s
sys     0m1.516s

Note that the completion times are spread much wider because processes are not getting fair access. Some processes complete faster and exit, allowing the total workload to complete faster, and some processes measure higher apparent I/O performance. This example may not behave similarly on all systems because the results depend on the resources of the machine and the underlying storage.

It is important to emphasize that neither tuning option is inherently better than the other. Both are best in different circumstances and it is important to understand the requirements of your workload and tune accordingly.
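The dd-test.sh script used in Example 12.1 is not reproduced in this document. The following is a minimal sketch of what such a test might look like, assuming it starts the writers in the background with dd and waits for all of them. The function name dd_test, the file names, the defaults (32 writers, 10 blocks of 1 MB each, matching the 10485760 bytes in the output above), and the use of conv=fsync are all assumptions, not the original script.

```shell
# Hypothetical stand-in for dd-test.sh: start several dd writers in parallel
# and wait for all of them. Every default below is an assumption.
# Usage: dd_test [TARGET_DIR] [WRITERS] [BLOCK_SIZE] [BLOCK_COUNT]
dd_test() {
    target_dir="${1:-.}"
    writers="${2:-32}"
    i=1
    while [ "$i" -le "$writers" ]; do
        # conv=fsync makes each dd flush its data before it reports a rate.
        dd if=/dev/zero of="$target_dir/dd-test.$i" \
           bs="${3:-1M}" count="${4:-10}" conv=fsync &
        i=$((i + 1))
    done
    # Return only after every background writer has finished.
    wait
}
```

To reproduce a comparison like the one above, call the function with time on a directory that resides on the device under test, once for each low_latency setting, and remove the dd-test.* files afterward.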

12.2.2 NOOP

NOOP is a trivial scheduler that only passes down the I/O that comes to it. It is useful for checking whether complex I/O scheduling decisions of other schedulers are causing I/O performance regressions.

This scheduler is recommended for setups with devices that do I/O scheduling themselves, such as intelligent storage or in multipathing environments. If you choose a more complicated scheduler on the host, the scheduler of the host and the scheduler of the storage device compete with each other. This can decrease performance. The storage device can usually determine best how to schedule I/O.

For similar reasons, this scheduler is also recommended for use within virtual machines.

The NOOP scheduler can be useful for devices that do not depend on mechanical movement, like SSDs. Usually, the DEADLINE I/O scheduler is a better choice for these devices. However, NOOP creates less overhead and can therefore increase performance on certain workloads.

12.2.3 DEADLINE

DEADLINE is a latency-oriented I/O scheduler. Each I/O request is assigned a deadline. Usually, requests are stored in queues (read and write) sorted by sector numbers. The DEADLINE algorithm maintains two additional queues (read and write) in which requests are sorted by deadline. As long as no request has timed out, the sector queue is used. When timeouts occur, requests from the deadline queue are served until there are no more expired requests. Generally, the algorithm prefers reads over writes.

This scheduler can provide higher throughput than the CFQ I/O scheduler in cases where several threads read and write and fairness is not an issue, for example, for several parallel readers from a SAN and for databases (especially when using TCQ disks). The DEADLINE scheduler has the following tunable parameters:

Table 12.2: DEADLINE tunable parameters

File

Description

writes_starved

Controls how many times reads are preferred over writes. A value of 3 means that three read operations can be done before writes and reads are dispatched on the same selection criteria.

Default is 3.

read_expire

Sets the deadline (current time plus the read_expire value) for read operations in milliseconds.

Default is 500.

write_expire

Sets the deadline (current time plus the write_expire value) for write operations in milliseconds.

Default is 5000.

front_merges

Enables (1) or disables (0) attempts to front merge requests.

Default is 1.

fifo_batch

Sets the maximum number of requests per batch (deadline expiration is only checked for batches). This parameter allows balancing latency against throughput. When set to 1 (that is, one request per batch), it results in "first come, first served" behavior and usually the lowest latency. Higher values usually increase throughput.

Default is 16.
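Several parameters from Table 12.2 are often changed together. The following is a minimal sketch that reads name=value lines from standard input and writes each value to the matching tunable, skipping names that the active scheduler does not provide. The function name apply_tunables and its optional second argument are assumptions made for illustration, and the values in the usage example are illustrative rather than recommended settings.

```shell
# Apply "name=value" settings read from standard input to a device.
# Usage: apply_tunables DEVICE [SYSFS_BLOCK_DIR] < SETTINGS
# SYSFS_BLOCK_DIR is a hypothetical override (default: /sys/block).
apply_tunables() {
    iosched_dir="${2:-/sys/block}/$1/queue/iosched"
    while IFS='=' read -r name value; do
        [ -n "$name" ] || continue
        if [ -w "$iosched_dir/$name" ]; then
            # On a real system, this write requires root privileges.
            echo "$value" > "$iosched_dir/$name"
        else
            echo "skipping $name: not a writable tunable of $1" >&2
        fi
    done
}
```

For example, running printf 'fifo_batch=1\nread_expire=250\n' | apply_tunables sda as root sets one request per batch and a shorter read deadline on sda.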

12.3 Available I/O Elevators with blk-mq I/O path

Below is a list of elevators available on openSUSE Leap for devices that use the blk-mq I/O path. If an elevator has tunable parameters, they can be set with the command:

tux > echo VALUE | sudo tee /sys/block/DEVICE/queue/iosched/TUNABLE

In the command above, VALUE is the desired value for the TUNABLE and DEVICE is the block device.

To find out what elevators are available for a device (sda for example), run the following command (the currently selected scheduler is listed in brackets):

tux > cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
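To decide which of the two elevator lists applies to a device, it helps to know which I/O path the device uses. The following is a minimal sketch that relies on the mq subdirectory that blk-mq devices expose below /sys/block/DEVICE. The function name io_path and its optional second argument are assumptions made for illustration.

```shell
# Report which block I/O path a device uses, judged by the presence of the
# "mq" directory that blk-mq devices expose in sysfs.
# Usage: io_path DEVICE [SYSFS_BLOCK_DIR]
# SYSFS_BLOCK_DIR is a hypothetical override (default: /sys/block).
io_path() {
    dev_dir="${2:-/sys/block}/$1"
    [ -d "$dev_dir" ] || { echo "no such device: $1" >&2; return 1; }
    if [ -d "$dev_dir/mq" ]; then
        echo blk-mq
    else
        echo legacy
    fi
}
```

For a device that lists mq-deadline, kyber, or bfq as in the output above, io_path sda is expected to print blk-mq.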

12.3.1 MQ-DEADLINE

MQ-DEADLINE is a latency-oriented I/O scheduler. It is a modification of the DEADLINE scheduler for the blk-mq I/O path (refer to Section 12.2.3, “DEADLINE”). MQ-DEADLINE has the same set of tunable parameters. Refer to Table 12.2, “DEADLINE tunable parameters” for a description.

12.3.2 NONE

When NONE is selected as the I/O elevator option for blk-mq, no I/O scheduler is used, and I/O requests are passed down to the device without further I/O scheduling interaction. In this respect, it is comparable to the NOOP scheduler for the legacy block I/O path (see Section 12.2.2, “NOOP”).

NONE is the default for NVM Express devices. With no overhead compared to other I/O elevator options, it is considered the fastest way of passing down I/O requests on multiple queues to such devices.

There are no tunable parameters for NONE.

12.3.3 BFQ (Budget Fair Queueing)

BFQ is a fairness-oriented scheduler. It is described as "a proportional-share storage-I/O scheduling algorithm based on the slice-by-slice service scheme of CFQ. But BFQ assigns budgets, measured in number of sectors, to processes instead of time slices." (Source: linux-4.12/block/bfq-iosched.c)

BFQ allows assigning I/O priorities to tasks, which are taken into account during scheduling decisions (see Section 8.3.3, “Prioritizing Disk Access with ionice”).

The BFQ scheduler has the following tunable parameters:

Table 12.3: BFQ tunable parameters

File

Description

slice_idle

Time in milliseconds to idle while waiting for the next request on an empty queue.

Default is 8.

slice_idle_us

Same as slice_idle but in microseconds.

Default is 8000.

low_latency

Enables (1) or disables (0) BFQ's low latency mode. This mode prioritizes certain applications (for example, if interactive) such that they observe lower latency.

Default is 1.

back_seek_max

Maximum value (in Kbytes) for backward seeking.

Default is 16384.

back_seek_penalty

Used to compute the cost of backward seeking.

Default is 2.

fifo_expire_async

Timeout (in milliseconds) of asynchronous requests.

Default is 250.

fifo_expire_sync

Timeout (in milliseconds) of synchronous requests.

Default is 125.

timeout_sync

Maximum time in milliseconds that a task (queue) is serviced after it has been selected.

Default is 124.

max_budget

Maximum number of sectors served within timeout_sync. If set to 0, BFQ internally calculates a value based on timeout_sync and an estimated peak rate.

Default is 0 (set to auto-tuning).

strict_guarantees

Enables (1) or disables (0) BFQ-specific queue handling required to give stricter bandwidth sharing guarantees under certain conditions.

Default is 0.

12.3.4 KYBER

KYBER is a latency-oriented I/O scheduler. It makes it possible to set target latencies for reads and synchronous writes and throttles I/O requests in order to try to meet these target latencies.

Table 12.4: KYBER tunable parameters

File

Description

read_lat_nsec

Sets the target latency for read operations in nanoseconds.

Default is 2000000.

write_lat_nsec

Sets the target latency for write operations in nanoseconds.

Default is 10000000.
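Because the KYBER targets are given in nanoseconds while latency goals are usually stated in milliseconds, a small conversion helper avoids counting zeros by hand. The following is a minimal sketch. The function name ms_to_nsec is an assumption made for illustration.

```shell
# Convert whole milliseconds to nanoseconds (1 ms = 1000000 ns).
# Usage: ms_to_nsec MILLISECONDS
ms_to_nsec() {
    echo $(( $1 * 1000000 ))
}
```

The defaults from Table 12.4 correspond to ms_to_nsec 2 for reads and ms_to_nsec 10 for writes. The result can be written to read_lat_nsec or write_lat_nsec in the iosched directory of the device.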

12.4 I/O Barrier Tuning

Most file systems (such as XFS, Ext3, or Ext4) send write barriers to disk after fsync or during transaction commits. Write barriers enforce proper ordering of writes, making volatile disk write caches safe to use (at some performance penalty). If your disks are battery-backed in one way or another, disabling barriers can safely improve performance.

Sending write barriers can be disabled using the nobarrier mount option.

Warning: Disabling Barriers Can Lead to Data Loss

Disabling barriers when disks cannot guarantee caches are properly written in case of power failure can lead to severe file system corruption and data loss.
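Given this risk, it is worth knowing which file systems are currently mounted without barriers. The following is a minimal sketch that scans a mount table in the format of /proc/mounts. The function name barrierless_mounts is an assumption made for illustration, and so is the set of option spellings it looks for (nobarrier and barrier=0). Check the documentation of your file system for the spellings it reports.

```shell
# List mount points whose mount options disable write barriers.
# Usage: barrierless_mounts [MOUNTS_FILE]
# MOUNTS_FILE defaults to /proc/mounts. The option spellings checked here
# ("nobarrier" and "barrier=0") are an assumption.
barrierless_mounts() {
    awk '{
        # Field 4 holds the comma-separated mount options, field 2 the path.
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "nobarrier" || opts[i] == "barrier=0") {
                print $2
                break
            }
    }' "${1:-/proc/mounts}"
}
```

An empty result means that none of the listed file systems was mounted with one of these options.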

12.5 Enable blk-mq I/O Path for SCSI by Default

Block multiqueue (blk-mq) is a multi-queue block I/O queueing mechanism. Blk-mq uses per-CPU software queues to queue I/O requests. The software queues are mapped to one or more hardware submission queues. Blk-mq significantly reduces lock contention. In particular, blk-mq improves performance for devices that support a high number of input/output operations per second (IOPS). Blk-mq is already the default for some devices, for example, NVM Express devices.

Blk-mq has a different set of I/O scheduler options: MQ-DEADLINE (comparable to DEADLINE) and NONE (comparable to NOOP), plus two new I/O schedulers, BFQ and KYBER. The CFQ I/O scheduler is not available with blk-mq. These changes in I/O scheduling can cause performance differences with blk-mq compared to the legacy block I/O path. Therefore, blk-mq is not enabled by default for SCSI devices.

If you have fast SCSI devices (for example, SSDs) instead of SCSI hard disks attached to your system, consider switching to blk-mq for SCSI. This is done using the kernel command line option scsi_mod.use_blk_mq=1. If you have also attached SCSI hard disks (spinning devices) to your system, make sure to switch to the BFQ I/O scheduler for the spinning devices to avoid a significant drop in their performance.
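To find the spinning devices that should be switched to BFQ, the rotational attribute of each block device can be checked. The following is a minimal sketch. The function name rotational_devices and its optional argument are assumptions made for illustration. Note that the list can also contain devices other than SCSI hard disks, so review it before acting on it.

```shell
# List block devices that report themselves as rotational.
# Usage: rotational_devices [SYSFS_BLOCK_DIR]
# SYSFS_BLOCK_DIR is a hypothetical override (default: /sys/block).
rotational_devices() {
    for rot_file in "${1:-/sys/block}"/*/queue/rotational; do
        [ -f "$rot_file" ] || continue
        if [ "$(cat "$rot_file")" = "1" ]; then
            # Strip the trailing path, then the leading directories.
            dev_dir="${rot_file%/queue/rotational}"
            echo "${dev_dir##*/}"
        fi
    done
}
```

After booting with scsi_mod.use_blk_mq=1, each device printed by rotational_devices is a candidate for having bfq written to its queue/scheduler file.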
