yihong0618/gitblog

vm.overcommit_memory 学习笔记

yihong0618 opened this issue · 3 comments

关于 overcommit_memory 学习的一些记录,其中一些为网上前辈文章的整理记录

什么是 vm.overcommit

但在实际情况下,进程请求的内存可能比实际需要的内存少很多,而且很多进程请求的内存并不会全部使用。在这种情况下,如果系统采用严格的内存申请机制,将会导致系统的内存利用率低下,一些能够使用的内存被浪费。
因此,为了提高系统的内存利用率和灵活性,Linux 提供了 vm.overcommit_memory 机制。当这个参数的值为 0 或 1 时,内核允许进程请求超过系统实际可用内存的内存量。这种行为被称为 "overcommit"。

vm.overcommit 的三个值

#define OVERCOMMIT_GUESS        0 // default
#define OVERCOMMIT_ALWAYS       1  // redis
#define OVERCOMMIT_NEVER        2 // postgres
The Linux kernel supports the following overcommit handling modes

0	-	Heuristic overcommit handling. Obvious overcommits of
		address space are refused. Used for a typical system. It
		ensures a seriously wild allocation fails while allowing
		overcommit to reduce swap usage.  root is allowed to 
		allocate slightly more memory in this mode. This is the 
		default.

1	-	Always overcommit. Appropriate for some scientific
		applications. Classic example is code using sparse arrays
		and just relying on the virtual memory consisting almost
		entirely of zero pages.

2	-	Don't overcommit. The total address space commit
		for the system is not permitted to exceed swap + a
		configurable amount (default is 50%) of physical RAM.
		Depending on the amount you use, in most situations
		this means a process will not be killed while accessing
		pages but will receive errors on memory allocation as
		appropriate.

		Useful for applications that want to guarantee their
		memory allocations will be available in the future
		without having to initialize every page.

https://github.com/torvalds/linux/blob/v4.6/mm/util.c#L481

当 vm.overcommit_memory = 2 时

grep -i commit /proc/meminfo

image

/proc/meminfo 中的 Committed_AS 表示所有进程已经申请的内存总大小,(注意是已经申请的,不是已经分配的),如果 Committed_AS 超过 CommitLimit 就表示发生了 overcommit,超出越多表示 overcommit 越严重 -> 触发 OOM

那么这个 CommitLimit 是怎么算出来的呢?

int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
{
	long free, allowed, reserve;

	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
			-(s64)vm_committed_as_batch * num_online_cpus(),
			"memory commitment underflow");

	vm_acct_memory(pages);

	/*
	 * Sometimes we want to use more memory than we have
	 */
	if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
		return 0;

	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
		free = global_page_state(NR_FREE_PAGES);
		free += global_page_state(NR_FILE_PAGES);

		/*
		 * shmem pages shouldn't be counted as free in this
		 * case, they can't be purged, only swapped out, and
		 * that won't affect the overall amount of available
		 * memory in the system.
		 */
		free -= global_page_state(NR_SHMEM);

		free += get_nr_swap_pages();

		/*
		 * Any slabs which are created with the
		 * SLAB_RECLAIM_ACCOUNT flag claim to have contents
		 * which are reclaimable, under pressure.  The dentry
		 * cache and most inode caches should fall into this
		 */
		free += global_page_state(NR_SLAB_RECLAIMABLE);

		/*
		 * Leave reserved pages. The pages are not for anonymous pages.
		 */
		if (free <= totalreserve_pages)
			goto error;
		else
			free -= totalreserve_pages;

		/*
		 * Reserve some for root
		 */
		if (!cap_sys_admin)
			free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);

		if (free > pages)
			return 0;

		goto error;
	}

	allowed = vm_commit_limit();
	/*
	 * Reserve some for root
	 */
	if (!cap_sys_admin)
		allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);

	/*
	 * Don't let a single process grow so big a user can't recover
	 */
	if (mm) {
		reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
		allowed -= min_t(long, mm->total_vm / 32, reserve);
	}

	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
		return 0;
error:
	vm_unacct_memory(pages);

	return -ENOMEM;
}

红帽犯过这个错误 :https://access.redhat.com/solutions/665023

image

OOM-killer

OMM killer 机制:linux 会为每个进程算一个 score,当内存不足时候它会根据 score kill
score 越大表示越容易杀死

#!/bin/bash

while read -r pid comm
do
    printf '%d\t%d\t%s\n' "$pid" "$(cat /proc/$pid/oom_score)" "$comm"
done < <(ps -e -o pid= -o comm=) | sort -k2 -n

/proc/${pid}/oom_score_adj, 取值范围为-1000到1000, 如果将该值设置为-1000,则进程永远不会被杀死 --> pg
/proc/${pid}/oom_adj, 取值是-17到+15,取值越高,越容易被干掉。如果是-17,则表示不能被kill
/proc/${pid}/oom_score, 是系统综合进程的内存消耗量、CPU 时间(utime + stime)、存活时间(uptime - start time)和 oom_adj 计算出的,消耗内存越多分越高。

https://elixir.bootlin.com/linux/v5.4.58/source/mm/oom_kill.c

数据库如何设置,如何避免被 oom-killer

  • short answer for redis -> echo 1 > /proc/sys/vm/overcommit_memory :)
  • for PostgresSQL or Greenplum -> echo 2 > /proc/sys/vm/overcommit_memory :)

why

redis:

The Redis background saving schema relies on the copy-on-write semantic of the fork system call in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero the fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages. If you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
image

Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.

PostgresSQL or Greenplum:

Why Use Strategy 2?
The issue with the default overcommits strategy, or using strategy 1 is the unpredictable nature of the failure. In either case we are not guaranteed that the memory is available at time of allocation and that results in unpredictable termination of processes. In particular, the nature of the failure with strategy 1 can result in corruption of either datafiles or transaction logs as a memory failure can occur mid-transaction resulting in the immediate shutdown of the database process without any cleanup.

那如果 redis 和 pg 跑在一台机器上有什么好的建议么?

  • 采用策略 1 保证 redis 在内存不足时候可用,设置 pg master 的 oom_score 让它避免被 oom-kill
  • 采用策略 2 尽量避免造成 redis 会 oom 的情况
  • 分两台机器

当设置 overcommit = 2 时有什么需要注意的

  1. 如前文说的如果服务器有 redis 要保证内存充足
  2. JVM 启动时 Xms Xmx 设置成相同的值,避免向系统申请内存

一瞬间以为自己在看卡哇邦嘎

👍🏻!

红帽犯过这个错误 :https://access.redhat.com/solutions/665023

比起截图,是不是 quote 文字会好一些?,可以全放在 >

@wey-gu 好滴。