My Redis AOF configuration is as follows:
appendonly yes
appendfsync everysec
redis-mgr is configured to run aof_rewrite and rdb every morning between 6:00 and 8:00, so during that window we receive twemproxy forward_err alerts every day, losing roughly 5000 requests per minute.
The failure rate is about 10 per 10,000 requests.
Testing on a live machine, a single 10 GB file write is enough to trigger the problem:
dd if=/dev/zero of=xxxxx bs=1M count=10000 &
We changed the setting to appendfsync no; this alleviates the problem but does not eliminate it.
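A minimal sketch for reproducing and observing this at the same time; the host, port, and output path below are placeholders for a test instance, not the production values:

# Reproduction sketch: generate heavy sequential disk writes in the background...
dd if=/dev/zero of=/tmp/ddtest bs=1M count=10000 &

# ...while sampling Redis latency (runs until Ctrl-C). Spikes of hundreds of
# milliseconds up to a few seconds appear while dirty pages are being flushed.
redis-cli --latency-history -h 127.0.0.1 -p 2000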
The various sources of Redis latency are already explained very clearly in this article by antirez, the Redis author.
What we are hitting here is exactly that case: the AOF is affected when there is heavy disk I/O.
Using the following command:
strace -f -p $(pidof redis-server) -T -e trace=fdatasync,write 2>&1 | grep -v '0.0' | grep -v unfinished
while the big copy (the dd above) is running, we can observe:
[pid 24734] write(42, "*4\r\n$5\r\nhmset\r\n$37\r\np-lc-d687791"..., 272475) = 272475 <0.036430>
[pid 24738] <... fdatasync resumed> ) = 0 <2.030435>
[pid 24738] <... fdatasync resumed> ) = 0 <0.012418>
[pid 24734] write(42, "*4\r\n$5\r\nHMSET\r\n$37\r\np-lc-6787211"..., 73) = 73 <0.125906>
[pid 24738] <... fdatasync resumed> ) = 0 <4.476948>
[pid 24734] <... write resumed> ) = 294594 <2.477184>      (2.47 s)
At the same moment, the client-side latency output:
$ ./_binaries/redis-cli --latency-history -h 10.38.114.60 -p 2000
min: 0, max: 223, avg: 1.24 (1329 samples) -- 15.01 seconds range
min: 0, max: 2500, avg: 3.46 (1110 samples) -- 15.00 seconds range    (a 2.5 s spike is observed here)
min: 0, max: 5, avg: 1.01 (1355 samples) -- 15.01 seconds range
The watchdog output:
[24734] 07 Jul 10:54:41.006 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[24734 | signal handler] (1404701682) --- WATCHDOG TIMER EXPIRED ---
bin/redis-server *:2000(logStackTrace+0x4b)[0x443bdb]
/lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
/lib64/tls/libpthread.so.0[0x302b80c420]
/lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
bin/redis-server *:2000(flushAppendOnlyFile+0x76)[0x43f616]
bin/redis-server *:2000(serverCron+0x325)[0x41b5b5]
bin/redis-server *:2000(aeProcessEvents+0x2b2)[0x416a22]
bin/redis-server *:2000(aeMain+0x3f)[0x416bbf]
bin/redis-server *:2000(main+0x1c8)[0x41dcd8]
/lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x302af1c4bb]
bin/redis-server *:2000[0x415b1a]
[24734 | signal handler] (1404701682) --------
So it is confirmed that write() is the call that hangs.
When the kernel's dirty write buffer is full, write() blocks and is only allowed to continue once some of that buffer has been freed.
So even if a program never calls sync itself, the kernel will sync at an unpredictable moment, and that is exactly when write() hangs.
grep ^Cached:    /proc/meminfo   # page cache size
grep ^Dirty:     /proc/meminfo   # total size of all dirty pages
grep ^Writeback: /proc/meminfo   # total size of actively processed dirty pages
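To see how these counters move while the dd test runs, a simple sampling loop (not part of the original notes) is enough:

# Print Dirty and Writeback once per second; run this alongside the dd test.
while true; do
    date '+%H:%M:%S' | tr '\n' ' '
    grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '
    echo
    sleep 1
done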
ning@ning-laptop ~/test$ sysctl -a | grep dirty
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 1500
vm.dirty_expire_centisecs = 3000
For details see: https://www.kernel.org/doc/Documentation/sysctl/vm.txt
/proc/sys/vm/dirty_expire_centisecs    # 3000: 3000 * 0.01 s = 30 s; dirty pages queued for more than 30 s are flushed to disk.
/proc/sys/vm/dirty_writeback_centisecs # 1500: 1500 * 0.01 s = 15 s; the kernel pdflush threads wake up every 15 s.
/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_ratio

Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the first threshold (dirty_background_ratio), write-outs begin in the background via the "flush" kernel threads. When the second threshold is reached, processes will block, flushing in the foreground. The problem with these variables is their minimum value: even 1% can be too much. This is why another two controls were introduced in 2.6.29:

/proc/sys/vm/dirty_background_bytes
/proc/sys/vm/dirty_bytes
The *_bytes and *_ratio variants are mutually exclusive: writing dirty_bytes clears dirty_ratio to 0:
root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
0
root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
20
root@ning-laptop:~# echo '5000000' > /proc/sys/vm/dirty_bytes
root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
5000000
root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
0
Lower values generate more I/O requests (and more interrupts) and significantly decrease sequential I/O bandwidth, but they also decrease random I/O latency.
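Following that trade-off, one experiment would be to lower the byte-based thresholds so background flushing starts much earlier and writers are never blocked by one huge flush. The numbers below are illustrative only, not measured recommendations:

# Illustrative values: start background writeback at ~32 MB of dirty data and
# block writers only above ~256 MB. Note this clears the *_ratio knobs, as shown above.
echo $((32 * 1024 * 1024))  > /proc/sys/vm/dirty_background_bytes
echo $((256 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes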
http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done. This is called Stable Page Write.
This may cause write() stalls, especially when using slower disks. Without write cache, flushing to disk takes ~10ms usually, ~100ms in bad cases.
There is a patch that mitigates this problem on newer kernels; the idea is to reduce the chance that write() has to wait in wait_on_page_writeback:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283
Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634
Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this patch enabled.
xfs is also said to avoid this problem.
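If one wanted to try that, the Redis data directory would simply be placed on an xfs filesystem; the device and mount point below are hypothetical:

# Hypothetical device and mount point for a Redis data partition on xfs.
mkfs.xfs /dev/sdb1
mkdir -p /data/redis
mount -t xfs /dev/sdb1 /data/redis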
$ cat /proc/sys/vm/dirty_background_ratio
10
$ cat /proc/sys/vm/dirty_ratio
20
Dirty pages under normal load:

$ grep ^Dirty: /proc/meminfo
Dirty:    104616 kB

(The machine has 128 GB of RAM.)
During the morning rdb/aof_rewrite, Dirty rises to about:

500,000 kB (roughly 500 MB)

Neither figure comes anywhere near the configured dirty_background_ratio / dirty_ratio thresholds (10% and 20% of 128 GB are about 12.8 GB and 25.6 GB), so tuning those two parameters is unlikely to help.
Tests:
# 1. Let dirty pages live for at most 90 s: vm.dirty_expire_centisecs = 9000
echo '9000' > /proc/sys/vm/dirty_expire_centisecs

# 2. Raise dirty_ratio
echo '80' > /proc/sys/vm/dirty_ratio
On a 48 GB machine with poor I/O, after setting dirty_ratio = 80, Dirty climbs very high, but there is no obvious improvement in Redis latency:
$ grep ^Dirty: /proc/meminfo
Dirty:    8598180 kB

=> echo '80' > /proc/sys/vm/dirty_ratio

$ grep ^Dirty: /proc/meminfo
Dirty:    11887180 kB
$ grep ^Dirty: /proc/meminfo
Dirty:    21295624 kB
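Since neither change makes an obvious difference, it is worth putting the defaults back after experimenting (values taken from the sysctl output shown earlier):

# Restore the defaults recorded in the earlier sysctl output.
echo '3000' > /proc/sys/vm/dirty_expire_centisecs
echo '20'   > /proc/sys/vm/dirty_ratio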