I/O scheduling controls how input/output operations are submitted to storage. openSUSE Leap offers various I/O algorithms—called elevators—suiting different workloads. Elevators can help to reduce seek operations, can prioritize I/O requests, or can make sure an I/O request is carried out before a given deadline.
Choosing the best suited I/O elevator not only depends on the workload, but on the hardware, too. Single ATA disk systems, SSDs, RAID arrays, or network storage systems, for example, each require different tuning strategies.
openSUSE Leap lets you set a default I/O scheduler at boot-time, which can be changed on the fly per block device. This makes it possible to set different algorithms, for example, for the device hosting the system partition and the device hosting a database.
By default, the CFQ (Completely Fair Queuing) scheduler is used for “traditional” hard disks, and DEADLINE is used for solid-state drives (SSDs). To change this default, use the following boot parameter:
elevator=SCHEDULER
Replace SCHEDULER with one of the values cfq, noop, or deadline. See Section 12.2, “Available I/O Elevators” for details.
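For example, to make DEADLINE the system-wide default, the following could be appended to the kernel command line (for example in the GRUB 2 configuration); the choice of deadline here is only an illustration:
elevator=deadline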
To change the elevator for a specific device in the running system, run the following command:
echo SCHEDULER > /sys/block/DEVICE/queue/scheduler
Here, SCHEDULER is one of cfq, noop, or deadline. DEVICE is the block device (sda, for example).
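For example, to switch the device sda to the NOOP scheduler at runtime:
echo noop > /sys/block/sda/queue/scheduler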
In the following, the elevators available on openSUSE Leap are listed. Each elevator has a set of tunable parameters, which can be set with the following command:
echo VALUE > /sys/block/DEVICE/queue/iosched/TUNABLE
where VALUE is the desired value for the TUNABLE and DEVICE is the block device.
To find out which elevator is the current default, run the following command. The currently selected scheduler is listed in brackets:
jupiter:~ # cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
CFQ (Completely Fair Queuing)
CFQ is a fairness-oriented scheduler and is used by default on openSUSE Leap. The algorithm assigns each thread a time slice in which it is allowed to submit I/O to disk. This way each thread gets a fair share of I/O throughput. It also allows assigning tasks I/O priorities which are taken into account during scheduling decisions (see man 1 ionice). The CFQ scheduler has the following tunable parameters:
/sys/block/DEVICE/queue/iosched/slice_idle
When a task has no more I/O to submit in its time slice, the I/O scheduler waits for a while before scheduling the next thread to improve locality of I/O. Additionally, the I/O scheduler avoids starving processes doing dependent I/O. A process does dependent I/O if it needs a result of one I/O in order to submit another I/O. For example, if you first need to read an index block in order to find out a data block to read, these two reads form a dependent I/O.
For media where locality does not play a big role (SSDs, SANs with lots of disks), setting /sys/block/DEVICE/queue/iosched/slice_idle to 0 can improve the throughput considerably.
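For example, on an SSD exposed as sda (the device name is only an example), idling could be switched off as follows:
echo 0 > /sys/block/sda/queue/iosched/slice_idle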
/sys/block/DEVICE/queue/iosched/quantum
This option limits the maximum number of requests that are being processed at once by the device. The default value is 4. For storage with several disks, this setting can unnecessarily limit parallel processing of requests. Therefore, increasing the value can improve performance. However, it can also cause latency of certain I/O operations to increase, because more requests are buffered inside the storage. When changing this value, you can also consider tuning /sys/block/DEVICE/queue/iosched/slice_async_rq (the default value is 2). This limits the maximum number of asynchronous requests—usually write requests—that are submitted in one time slice.
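On a storage array with many disks, one might, for example, raise both limits; the device name sdb and the values shown are illustrative only:
echo 32 > /sys/block/sdb/queue/iosched/quantum
echo 4 > /sys/block/sdb/queue/iosched/slice_async_rq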
/sys/block/DEVICE/queue/iosched/low_latency
When enabled (which is the default on openSUSE Leap), the scheduler may dynamically adjust the length of the time slice by aiming to meet a tuning parameter called the target_latency. Time slices are recomputed to meet this target_latency and ensure that processes get fair access within a bounded length of time.
/sys/block/DEVICE/queue/iosched/target_latency
Contains an estimated latency time for CFQ. CFQ will use it to calculate the time slice used for every task.
/sys/block/DEVICE/queue/iosched/group_idle
To avoid starvation of blkio cgroups doing dependent I/O, CFQ waits a bit after completion of I/O for one blkio cgroup before scheduling I/O for a different blkio cgroup. When slice_idle is set, this parameter does not have a big impact. However, for fast media, the overhead of slice_idle is generally undesirable. Disabling slice_idle and setting group_idle is a method to avoid starvation of blkio cgroups doing dependent I/O with lower overhead.
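A possible combination for fast media could look as follows; the device name sda and the group_idle value (in milliseconds) are assumptions for illustration only:
echo 0 > /sys/block/sda/queue/iosched/slice_idle
echo 8 > /sys/block/sda/queue/iosched/group_idle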
Tuning CFQ
In openSUSE Leap 42.2, the low_latency tuning parameter is enabled by default to ensure that processes get fair access within a bounded length of time. (Note that this parameter was not enabled in versions prior to SUSE Linux Enterprise 12.)
This is usually preferred in a server scenario where processes are executing I/O as part of transactions, as it makes the time needed for each transaction predictable. However, there are scenarios where that is not the desired behavior:
If the performance metric of interest is the peak performance of a single process when there is I/O contention.
If a workload must complete as quickly as possible and there are multiple sources of I/O. In this case, unfair treatment from the I/O scheduler may allow the transactions to complete faster: Processes take their full slice and exit quickly, resulting in reduced overall contention.
To address this, there are two options—increase target_latency or disable low_latency. As with all tuning parameters, it is important to verify that your workload behaves as expected before and after the tuning modification. Take careful note of whether your workload depends on individual process peak performance or scales better with fairness. Note also that the performance will depend on the underlying storage, and the tuning option that is correct for one installation may not be universally applicable.
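For example, either option could be applied to sda as follows (pick the one that matches your workload; the target_latency value is only an illustration):
echo 600 > /sys/block/sda/queue/iosched/target_latency
echo 0 > /sys/block/sda/queue/iosched/low_latency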
Find below an example that does not control when I/O starts, but is simple enough to demonstrate the point: 32 processes are writing a small amount of data to disk in parallel (a sketch of such a test script is shown at the end of this example). Using the openSUSE Leap default (enabling low_latency), the result looks as follows:
root # echo 1 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh
10485760 bytes (10 MB) copied, 2.62464 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 3.29624 s, 3.2 MB/s
10485760 bytes (10 MB) copied, 3.56341 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.56908 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53043 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.57511 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.53672 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.5433 s, 3.0 MB/s
10485760 bytes (10 MB) copied, 3.65474 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.63694 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.90122 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88507 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.86135 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.84553 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.88871 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 3.94943 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.12731 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.15106 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.21601 s, 2.5 MB/s
10485760 bytes (10 MB) copied, 4.35004 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.33387 s, 2.4 MB/s
10485760 bytes (10 MB) copied, 4.55434 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52283 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.52682 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.56176 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.62727 s, 2.3 MB/s
10485760 bytes (10 MB) copied, 4.78958 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.79772 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.78004 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.77994 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.86114 s, 2.2 MB/s
10485760 bytes (10 MB) copied, 4.88062 s, 2.1 MB/s

real 0m4.978s
user 0m0.112s
sys 0m1.544s
Note that each process completes in similar times. This is the CFQ scheduler meeting its target_latency: Each process has fair access to storage.
Note that the earlier processes complete somewhat faster. This happens because the start time of the processes is not identical. In a more complicated example, it is possible to control for this.
This is what happens when low_latency is disabled:
root # echo 0 > /sys/block/sda/queue/iosched/low_latency
root # time ./dd-test.sh
10485760 bytes (10 MB) copied, 0.813519 s, 12.9 MB/s
10485760 bytes (10 MB) copied, 0.788106 s, 13.3 MB/s
10485760 bytes (10 MB) copied, 0.800404 s, 13.1 MB/s
10485760 bytes (10 MB) copied, 0.816398 s, 12.8 MB/s
10485760 bytes (10 MB) copied, 0.959087 s, 10.9 MB/s
10485760 bytes (10 MB) copied, 1.09563 s, 9.6 MB/s
10485760 bytes (10 MB) copied, 1.18716 s, 8.8 MB/s
10485760 bytes (10 MB) copied, 1.27661 s, 8.2 MB/s
10485760 bytes (10 MB) copied, 1.46312 s, 7.2 MB/s
10485760 bytes (10 MB) copied, 1.55489 s, 6.7 MB/s
10485760 bytes (10 MB) copied, 1.64277 s, 6.4 MB/s
10485760 bytes (10 MB) copied, 1.78196 s, 5.9 MB/s
10485760 bytes (10 MB) copied, 1.87496 s, 5.6 MB/s
10485760 bytes (10 MB) copied, 1.9461 s, 5.4 MB/s
10485760 bytes (10 MB) copied, 2.08351 s, 5.0 MB/s
10485760 bytes (10 MB) copied, 2.28003 s, 4.6 MB/s
10485760 bytes (10 MB) copied, 2.42979 s, 4.3 MB/s
10485760 bytes (10 MB) copied, 2.54564 s, 4.1 MB/s
10485760 bytes (10 MB) copied, 2.6411 s, 4.0 MB/s
10485760 bytes (10 MB) copied, 2.75171 s, 3.8 MB/s
10485760 bytes (10 MB) copied, 2.86162 s, 3.7 MB/s
10485760 bytes (10 MB) copied, 2.98453 s, 3.5 MB/s
10485760 bytes (10 MB) copied, 3.13723 s, 3.3 MB/s
10485760 bytes (10 MB) copied, 3.36399 s, 3.1 MB/s
10485760 bytes (10 MB) copied, 3.60018 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.58151 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.67385 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.69471 s, 2.8 MB/s
10485760 bytes (10 MB) copied, 3.66658 s, 2.9 MB/s
10485760 bytes (10 MB) copied, 3.81495 s, 2.7 MB/s
10485760 bytes (10 MB) copied, 4.10172 s, 2.6 MB/s
10485760 bytes (10 MB) copied, 4.0966 s, 2.6 MB/s

real 0m3.505s
user 0m0.160s
sys 0m1.516s
Note that the time processes take to complete is spread much wider as processes are not getting fair access. Some processes complete faster and exit, allowing the total workload to complete faster, and some processes measure higher apparent I/O performance. It is also important to note that this example may not behave similarly on all systems as the results depend on the resources of the machine and the underlying storage.
It is important to emphasize that neither tuning option is inherently better than the other. Each is best in different circumstances, and it is important to understand the requirements of your workload and tune accordingly.
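The dd-test.sh script itself is not shown in this guide. A minimal sketch that reproduces the pattern described above (32 processes writing roughly 10 MB each in parallel) could look like the following; the target directory, file names, and dd flags are assumptions for illustration only:
#!/bin/bash
# Hypothetical dd-test.sh: start 32 dd writers in parallel and wait for all of them.
# conv=fdatasync makes dd flush its data so the reported throughput reflects the disk.
for i in $(seq 1 32); do
    dd if=/dev/zero of=/tmp/dd-test-$i bs=1M count=10 conv=fdatasync 2>&1 | grep copied &
done
wait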
NOOP
A trivial scheduler that only passes down the I/O that comes to it. It is useful for checking whether complex I/O scheduling decisions of other schedulers are causing I/O performance regressions.
This scheduler is recommended for setups with devices that do I/O scheduling themselves, such as intelligent storage or in multipathing environments. If you choose a more complicated scheduler on the host, the scheduler of the host and the scheduler of the storage device compete with each other. This can decrease performance. The storage device can usually determine best how to schedule I/O.
For similar reasons, this scheduler is also recommended for use within virtual machines.
The NOOP scheduler can be useful for devices that do not depend on mechanical movement, like SSDs. Usually, the DEADLINE I/O scheduler is a better choice for these devices. However, NOOP creates less overhead and thus can increase performance for certain workloads.
DEADLINE
DEADLINE is a latency-oriented I/O scheduler. Each I/O request is assigned a deadline. Usually, requests are stored in queues (read and write) sorted by sector numbers. The DEADLINE algorithm maintains two additional queues (read and write) in which requests are sorted by deadline. As long as no request has timed out, the “sector” queue is used. When timeouts occur, requests from the “deadline” queue are served until there are no more expired requests. Generally, the algorithm prefers reads over writes.
This scheduler can provide superior throughput compared to the CFQ I/O scheduler in cases where several threads read and write and fairness is not an issue, for example, for several parallel readers from a SAN and for databases (especially when using “TCQ” disks). The DEADLINE scheduler has the following tunable parameters:
/sys/block/<device>/queue/iosched/writes_starved
Controls how many reads can be sent to disk before it is possible to send writes. A value of 3 means that three read operations are carried out for one write operation.
/sys/block/<device>/queue/iosched/read_expire
Sets the deadline (current time plus the read_expire value) for read operations in milliseconds. The default is 500.
/sys/block/<device>/queue/iosched/write_expire
Sets the deadline (current time plus the write_expire value) for write operations in milliseconds. The default is 5000.
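For example, to favor reads even more strongly and expire them sooner, the tunables could be set as follows; the device name sdc and the values are only an illustration:
echo 6 > /sys/block/sdc/queue/iosched/writes_starved
echo 250 > /sys/block/sdc/queue/iosched/read_expire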
Most file systems (such as XFS, Ext3, Ext4, or reiserfs) send write barriers to disk after fsync or during transaction commits. Write barriers enforce proper ordering of writes, making volatile disk write caches safe to use (at some performance penalty). If your disks are battery-backed in one way or another, disabling barriers can safely improve performance.
Sending write barriers can be disabled using the barrier=0 mount option (for Ext3, Ext4, and reiserfs), or using the nobarrier mount option (for XFS).
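For example, an XFS file system on a battery-backed RAID controller could be mounted without barriers; the device and mount point below are placeholders:
mount -o nobarrier /dev/sdb1 /data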
Disabling barriers when disks cannot guarantee caches are properly written in case of power failure can lead to severe file system corruption and data loss.