To understand and tune the memory management behavior of the kernel, it is important to first have an overview of how it works and cooperates with other subsystems.
The memory management subsystem, also called the virtual memory manager, will subsequently be called “VM”. The role of the VM is to manage the allocation of physical memory (RAM) for the entire kernel and user programs. It is also responsible for providing a virtual memory environment for user processes (managed via POSIX APIs with Linux extensions). Finally, the VM is responsible for freeing up RAM when there is a shortage, either by trimming caches or swapping out “anonymous” memory.
The most important thing to understand when examining and tuning VM is how its caches are managed. The basic goal of the VM's caches is to minimize the cost of I/O as generated by swapping and file system operations (including network file systems). This is achieved by avoiding I/O completely, or by submitting I/O in better patterns.
Free memory will be used and filled up by these caches as required. The more memory is available for caches and anonymous memory, the more effectively caches and swapping will operate. However, if a memory shortage is encountered, caches will be trimmed or memory will be swapped out.
For a particular workload, the first thing that can be done to improve performance is to increase memory and thereby reduce the frequency with which memory must be trimmed or swapped. The second thing is to change the way caches are managed by changing kernel parameters.
Finally, the workload itself should be examined and tuned as well. If an application is allowed to run more processes or threads, the effectiveness of the VM caches can be reduced if each process operates in its own area of the file system. Memory overheads are also increased. If applications allocate their own buffers or caches, larger application caches mean that less memory is available for VM caches. However, more processes and threads can mean more opportunity to overlap and pipeline I/O, and may take better advantage of multiple cores. Experimentation will be required for the best results.
Memory allocations in general can be characterized as “pinned” (also known as “unreclaimable”), “reclaimable” or “swappable”.
Anonymous memory tends to be program heap and stack memory (for example, memory allocated with malloc()). It is reclaimable, except in special cases such as mlock() or if there is no available swap space. Anonymous memory must be written to swap before it can be reclaimed. Swap I/O (both swapping in and swapping out pages) tends to be less efficient than pagecache I/O because of its allocation and access patterns.
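As a quick check of how much anonymous memory and swap a system is currently using, the corresponding counters can be read from /proc/meminfo, for example:
root # grep -E '^(AnonPages|SwapTotal|SwapFree):' /proc/meminfo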
Pagecache is a cache of file data. When a file is read from disk or over the network, the contents are stored in pagecache. If the contents are up-to-date in pagecache, no disk or network access is required. tmpfs and shared memory segments count toward pagecache.
When a file is written to, the new data is stored in pagecache before being written back to a disk or the network (making it a write-back cache). When a page has new data not written back yet, it is called “dirty”. Pages not classified as dirty are “clean”. Clean pagecache pages can be reclaimed if there is a memory shortage by simply freeing them. Dirty pages must first be made clean before being reclaimed.
Buffercache is a type of pagecache for block devices (for example, /dev/sda). A file system typically uses the buffercache when accessing its on-disk metadata structures such as inode tables, allocation bitmaps, and so forth. Buffercache can be reclaimed similarly to pagecache.
Buffer heads are small auxiliary structures that tend to be allocated upon pagecache access. They can generally be reclaimed easily when the pagecache or buffercache pages are clean.
As applications write to files, the pagecache (and buffercache) becomes dirty. When pages have been dirty for a given amount of time, or when the amount of dirty memory reaches a specified threshold (a percentage of memory, vm.dirty_background_ratio, or an absolute size in bytes, vm.dirty_background_bytes), the kernel begins writeback. Flusher threads perform writeback in the background and allow applications to continue running. If the I/O cannot keep up with applications dirtying pagecache, and the amount of dirty data reaches a critical threshold (vm.dirty_ratio or vm.dirty_bytes), applications begin to be throttled to prevent dirty data exceeding this limit.
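To see the writeback thresholds currently in effect, and how much memory is dirty or under writeback at the moment, the sysctl values and the corresponding /proc/meminfo counters can be inspected, for example:
root # sysctl vm.dirty_background_ratio vm.dirty_background_bytes vm.dirty_ratio vm.dirty_bytes
root # grep -E '^(Dirty|Writeback):' /proc/meminfo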
The VM monitors file access patterns and may attempt to perform readahead. Readahead reads pages from the file system into the pagecache before they have been requested. This is done so that fewer, larger I/O requests can be submitted (which is more efficient) and so that I/O can be pipelined (performed at the same time as the application is running).
The inode cache is an in-memory cache of the inode structures for each file system. These contain attributes such as the file size, permissions and ownership, and pointers to the file data.
The dentry cache is an in-memory cache of the directory entries in the system. These contain a name (the name of a file), the inode which it refers to, and children entries. This cache is used when traversing the directory structure and accessing a file by name.
Applications running on openSUSE Leap 42.2 can allocate more memory compared to openSUSE Leap 10. This is because glibc changed its default behavior for allocating user space memory. See http://www.gnu.org/s/libc/manual/html_node/Malloc-Tunable-Parameters.html for an explanation of these parameters.
To restore an openSUSE Leap 10-like behavior, M_MMAP_THRESHOLD should be set to 128*1024. This can be done with the mallopt() call from within the application, or by setting the MALLOC_MMAP_THRESHOLD environment variable before running the application.
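A minimal example of the environment-variable approach, assuming a hypothetical application ./myapp (note that current glibc reads the legacy environment variable with a trailing underscore; 131072 is 128*1024):
root # MALLOC_MMAP_THRESHOLD_=131072 ./myapp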
Kernel memory that is reclaimable (caches, described above) will be trimmed automatically during memory shortages. Most other kernel memory cannot be easily reduced but is a property of the workload given to the kernel.
Reducing the requirements of the user space workload will reduce the kernel memory usage (fewer processes, fewer open files and sockets, etc.)
If the memory cgroups feature is not needed, it can be switched off by passing cgroup_disable=memory on the kernel command line, reducing memory consumption of the kernel a bit.
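Whether the option is active on the running kernel can be verified by checking the boot command line, for example:
root # grep -o cgroup_disable=memory /proc/cmdline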
When tuning the VM it should be understood that some changes will take time to affect the workload and take full effect. If the workload changes throughout the day, it may behave very differently at different times. A change that increases throughput under some conditions may decrease it under other conditions.
/proc/sys/vm/swappiness
This control is used to define how aggressively the kernel swaps out anonymous memory relative to pagecache and other caches. Increasing the value increases the amount of swapping. The default value is 60.
Swap I/O tends to be much less efficient than other I/O. However, some pagecache pages will be accessed much more frequently than less used anonymous memory. The right balance should be found here.
If swap activity is observed during slowdowns, it may be worth reducing this parameter. If there is a lot of I/O activity and the amount of pagecache in the system is rather small, or if there are large dormant applications running, increasing this value might improve performance.
Note that the more data is swapped out, the longer the system will take to swap data back in when it is needed.
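As an example, the current value can be inspected and changed at runtime with sysctl; to keep a change across reboots, it can be placed in a sysctl configuration file (the value 25 and the file name below are only illustrations):
root # sysctl vm.swappiness
root # sysctl -w vm.swappiness=25
root # echo "vm.swappiness = 25" > /etc/sysctl.d/90-vm.conf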
/proc/sys/vm/vfs_cache_pressure
This variable controls the tendency of the kernel to reclaim the memory used for caching directory and inode objects (the VFS caches), versus pagecache and swap. Increasing this value increases the rate at which VFS caches are reclaimed.
It is difficult to know when this should be changed, other than by experimentation. The slabtop command (part of the procps package) shows the top memory objects used by the kernel. The VFS caches are the "dentry" and "*_inode_cache" objects. If these are consuming a large amount of memory in relation to pagecache, it may be worth trying to increase the pressure; this could also help to reduce swapping. The default value is 100.
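For example, the size of the dentry and inode slab caches can be compared against pagecache before deciding to raise the pressure (the value 150 is only an illustration):
root # slabtop -o -s c | head -n 15
root # grep -E '^(Cached|SReclaimable):' /proc/meminfo
root # sysctl -w vm.vfs_cache_pressure=150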
/proc/sys/vm/min_free_kbytes
This controls the amount of memory that is kept free for use by special reserves, including “atomic” allocations (those which cannot wait for reclaim). This should not normally be lowered unless the system is being very carefully tuned for memory usage (normally useful for embedded rather than server applications). If “page allocation failure” messages and stack traces are frequently seen in logs, min_free_kbytes could be increased until the errors disappear. If these messages are very infrequent, there is no need for concern. The default value depends on the amount of RAM.
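For example, the current reserve can be checked and the kernel log searched for allocation failures before deciding to change anything (journalctl is used here, assuming kernel messages are collected by the systemd journal):
root # cat /proc/sys/vm/min_free_kbytes
root # journalctl -k | grep -i "page allocation failure"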
One important change in writeback behavior since openSUSE Leap 10 is that modifications to file-backed mmap() memory are accounted immediately as dirty memory (and are subject to writeback), whereas previously such memory would only be subject to writeback after it was unmapped, upon an msync() system call, or under heavy memory pressure.
Some applications do not expect mmap modifications to be subject to such writeback behavior, and performance can be reduced. Berkeley DB (and applications using it) is one known example that can cause problems. Increasing writeback ratios and times can improve this type of slowdown.
/proc/sys/vm/dirty_background_ratio
This is the percentage of the total amount of free and reclaimable memory. When the amount of dirty pagecache exceeds this percentage, writeback threads start writing back dirty memory. The default value is 10 (%).
/proc/sys/vm/dirty_background_bytes
This contains the amount of dirty memory (in bytes) at which the background kernel flusher threads will start writeback. dirty_background_bytes is the counterpart of dirty_background_ratio. If one of them is set, the other one will automatically be read as 0.
/proc/sys/vm/dirty_ratio
Similar percentage value as for dirty_background_ratio. When this percentage is exceeded, applications that want to write to the pagecache are blocked and start performing writeback as well. The default value is 20 (%).
/proc/sys/vm/dirty_bytes
Contains the amount of dirty memory (in bytes) at which a process generating disk writes will itself start writeback. The minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be retained. dirty_bytes is the counterpart of dirty_ratio. If one of them is set, the other one will automatically be read as 0.
/proc/sys/vm/dirty_expire_centisecs
Data which has been dirty in-memory for longer than this interval (expressed in hundredths of a second) will be written out the next time a flusher thread wakes up. Expiration is measured based on the modification time of a file's inode. Therefore, multiple dirtied pages from the same file will all be written when the interval is exceeded.
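All of the dirty writeback tunables described above can be listed in one step, for example:
root # grep -H '' /proc/sys/vm/dirty_*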
dirty_background_ratio and dirty_ratio together determine the pagecache writeback behavior. If these values are increased, more dirty memory is kept in the system for a longer time. With more dirty memory allowed in the system, the chance to improve throughput by avoiding writeback I/O and by submitting more optimal I/O patterns increases. However, more dirty memory can harm latency, either when memory needs to be reclaimed or at data integrity points (“synchronization points”) when it needs to be written back to disk.
The system is required to limit what percentage of the system's memory contains file-backed data that needs writing to disk. This guarantees that the system can always allocate the necessary data structures to complete I/O. The maximum amount of memory that may be dirty and requires writing at any given time is controlled by vm.dirty_ratio (/proc/sys/vm/dirty_ratio). The defaults are:
SLE-11-SP3: vm.dirty_ratio = 40
SLE-12: vm.dirty_ratio = 20
The primary advantage of using the lower ratio in SUSE Linux Enterprise 12 is that page reclamation and allocation in low memory situations completes faster, as there is a higher probability that old clean pages will be quickly found and discarded. The secondary advantage is that if all data on the system must be synchronized, then the time to complete the operation on SUSE Linux Enterprise 12 will by default be lower than on SUSE Linux Enterprise 11 SP3. Most workloads will not notice this change, as data is synchronized with fsync() by the application or data is not dirtied quickly enough to hit the limits.
There are exceptions, and if your application is affected by this, it will manifest as an unexpected stall during writes. To prove that it is affected by dirty data rate limiting, monitor /proc/PID_OF_APPLICATION/stack and check whether the application spends significant time in balance_dirty_pages_ratelimited. If this is observed and it is a problem, increase the value of vm.dirty_ratio to 40 to restore the SUSE Linux Enterprise 11 SP3 behavior.
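A simple way to watch for this, keeping the placeholder for the process ID, is to sample the kernel stack of the process periodically, for example:
root # watch -n 1 "cat /proc/<PID_OF_APPLICATION>/stack"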
It is important to note that the overall I/O throughput is the same regardless of the setting. The only difference is the timing of when the I/O is queued.
This is an example of using dd to asynchronously write 30% of memory to disk, an amount that happens to be affected by the change in vm.dirty_ratio:
root # MEMTOTAL_MBYTES=`free -m | grep Mem: | awk '{print $2}'`
root # sysctl vm.dirty_ratio=40
root # dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100))
2507145216 bytes (2.5 GB) copied, 8.00153 s, 313 MB/s
root # sysctl vm.dirty_ratio=20
root # dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100))
2507145216 bytes (2.5 GB) copied, 10.1593 s, 247 MB/s
Note that the parameter affects the time it takes for the command to complete and the apparent write speed of the device. With dirty_ratio=40, more of the data is cached and written to disk in the background by the kernel. It is very important to note that the speed of I/O is identical in both cases. To demonstrate, this is the result when dd synchronizes the data before exiting:
root # sysctl vm.dirty_ratio=40
root # dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100)) conv=fdatasync
2507145216 bytes (2.5 GB) copied, 21.0663 s, 119 MB/s
root # sysctl vm.dirty_ratio=20
root # dd if=/dev/zero of=zerofile ibs=1048576 count=$((MEMTOTAL_MBYTES*30/100)) conv=fdatasync
2507145216 bytes (2.5 GB) copied, 21.7286 s, 115 MB/s
Note that dirty_ratio had almost no impact here and is within the natural variability of a command. Hence, dirty_ratio does not directly impact I/O performance, but it may affect the apparent performance of a workload that writes data asynchronously without synchronizing.
/sys/block/<bdev>/queue/read_ahead_kb
If one or more processes are sequentially reading a file, the kernel reads some data in advance (ahead) to reduce the amount of time that processes need to wait for data to be available. The actual amount of data being read in advance is computed dynamically, based on how sequential the I/O seems to be. This parameter sets the maximum amount of data that the kernel reads ahead for a single file. If you observe that large sequential reads from a file are not fast enough, you can try increasing this value. Increasing it too far may result in readahead thrashing, where pagecache used for readahead is reclaimed before it can be used, or in slowdowns because of a large amount of useless I/O. The default value is 512 (KB).
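For example, the current readahead limit of a device can be read and changed through sysfs (sda is only an example device name; the setting does not persist across reboots unless reapplied, for example by a boot script or udev rule):
root # cat /sys/block/sda/queue/read_ahead_kb
root # echo 1024 > /sys/block/sda/queue/read_ahead_kb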
For the complete list of the VM tunable parameters, see /usr/src/linux/Documentation/sysctl/vm.txt (available after installing the kernel-source package).
Some simple tools that can help monitor VM behavior:
vmstat: This tool gives a good overview of what the VM is doing. See Section 2.1.1, “vmstat” for details.
/proc/meminfo: This file gives a detailed breakdown of where memory is being used. See Section 2.4.2, “Detailed Memory Usage: /proc/meminfo” for details.
slabtop: This tool provides detailed information about kernel slab memory usage. buffer_head, dentry, inode_cache, ext3_inode_cache, etc. are the major caches. This command is available with the package procps.
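A brief monitoring session with these tools might look like the following (the interval and count for vmstat are only examples):
root # vmstat 2 5
root # head -n 20 /proc/meminfo
root # slabtop -o | head -n 15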