My Redis AOF configuration is as follows:
appendonly yes
appendfsync everysec
redis-mgr is configured to run aof_rewrite and rdb every morning between 6:00 and 8:00, so during that window we receive twemproxy forward_err alerts every day, losing roughly 5000 requests per minute.
The failure rate is about 10 per 10,000 requests.
Testing on a live machine, a single 10 GB file write is enough to trigger the problem:
dd if=/dev/zero of=xxxxx bs=1M count=10000 &
We changed the setting to appendfsync no; this alleviates the problem but does not eliminate it.
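A minimal sketch for reproducing and observing this at the same time; the host, port, and output path below are placeholders for a test instance, not the production values:

# Reproduction sketch: generate heavy sequential disk writes in the background...
dd if=/dev/zero of=/tmp/ddtest bs=1M count=10000 &

# ...while sampling Redis latency (runs until Ctrl-C). Spikes of hundreds of
# milliseconds up to a few seconds appear while dirty pages are being flushed.
redis-cli --latency-history -h 127.0.0.1 -p 2000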
The various sources of Redis latency are already explained very clearly in this article by antirez, the Redis author.
What we are hitting here is exactly that case: the AOF is affected when there is heavy disk I/O.
Using the following command:
strace -f -p $(pidof redis-server) -T -e trace=fdatasync,write 2>&1 | grep -v '0.0' | grep -v unfinished
while the big copy (the dd above) is running, we can observe:
[pid 24734] write(42, "*4\r\n$5\r\nhmset\r\n$37\r\np-lc-d687791"..., 272475) = 272475 <0.036430>
[pid 24738] <... fdatasync resumed> ) = 0 <2.030435>
[pid 24738] <... fdatasync resumed> ) = 0 <0.012418>
[pid 24734] write(42, "*4\r\n$5\r\nHMSET\r\n$37\r\np-lc-6787211"..., 73) = 73 <0.125906>
[pid 24738] <... fdatasync resumed> ) = 0 <4.476948>
[pid 24734] <... write resumed> ) = 294594 <2.477184>      (2.47 s)
At the same moment, the client-side latency output:
$ ./_binaries/redis-cli --latency-history -h 10.38.114.60 -p 2000
min: 0, max: 223, avg: 1.24 (1329 samples) -- 15.01 seconds range
min: 0, max: 2500, avg: 3.46 (1110 samples) -- 15.00 seconds range    (a 2.5 s spike is observed here)
min: 0, max: 5, avg: 1.01 (1355 samples) -- 15.01 seconds range
The watchdog output:
[24734] 07 Jul 10:54:41.006 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[24734 | signal handler] (1404701682) --- WATCHDOG TIMER EXPIRED ---
bin/redis-server *:2000(logStackTrace+0x4b)[0x443bdb]
/lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
/lib64/tls/libpthread.so.0[0x302b80c420]
/lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
bin/redis-server *:2000(flushAppendOnlyFile+0x76)[0x43f616]
bin/redis-server *:2000(serverCron+0x325)[0x41b5b5]
bin/redis-server *:2000(aeProcessEvents+0x2b2)[0x416a22]
bin/redis-server *:2000(aeMain+0x3f)[0x416bbf]
bin/redis-server *:2000(main+0x1c8)[0x41dcd8]
/lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x302af1c4bb]
bin/redis-server *:2000[0x415b1a]
[24734 | signal handler] (1404701682) --------
So it is confirmed that write() is the call that hangs.
When the kernel's dirty write buffer is full, write() blocks and is only allowed to continue once some of that buffer has been freed.
So even if a program never calls sync itself, the kernel will sync at an unpredictable moment, and that is exactly when write() hangs.
grep ^Cached:    /proc/meminfo   # page cache size
grep ^Dirty:     /proc/meminfo   # total size of all dirty pages
grep ^Writeback: /proc/meminfo   # total size of actively processed dirty pages
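To see how these counters move while the dd test runs, a simple sampling loop (not part of the original notes) is enough:

# Print Dirty and Writeback once per second; run this alongside the dd test.
while true; do
    date '+%H:%M:%S' | tr '\n' ' '
    grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '
    echo
    sleep 1
done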
ning@ning-laptop ~/test$ sysctl -a | grep dirty
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 1500
vm.dirty_expire_centisecs = 3000
For details see: https://www.kernel.org/doc/Documentation/sysctl/vm.txt
/proc/sys/vm/dirty_expire_centisecs    # 3000: 3000 * 0.01 s = 30 s; dirty pages queued for more than 30 s are flushed to disk.
/proc/sys/vm/dirty_writeback_centisecs # 1500: 1500 * 0.01 s = 15 s; the kernel pdflush threads wake up every 15 s.
/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_ratio

Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the first threshold (dirty_background_ratio), write-outs begin in the background via the "flush" kernel threads. When the second threshold is reached, processes will block, flushing in the foreground. The problem with these variables is their minimum value: even 1% can be too much. This is why another two controls were introduced in 2.6.29:

/proc/sys/vm/dirty_background_bytes
/proc/sys/vm/dirty_bytes
The *_bytes and *_ratio variants are mutually exclusive: writing dirty_bytes clears dirty_ratio to 0:
root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
0
root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
20
root@ning-laptop:~# echo '5000000' > /proc/sys/vm/dirty_bytes
root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
5000000
root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
0
Lower values generate more I/O requests (and more interrupts) and significantly decrease sequential I/O bandwidth, but they also decrease random I/O latency.
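Following that trade-off, one experiment would be to lower the byte-based thresholds so background flushing starts much earlier and writers are never blocked by one huge flush. The numbers below are illustrative only, not measured recommendations:

# Illustrative values: start background writeback at ~32 MB of dirty data and
# block writers only above ~256 MB. Note this clears the *_ratio knobs, as shown above.
echo $((32 * 1024 * 1024))  > /proc/sys/vm/dirty_background_bytes
echo $((256 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes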
http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done. This is called Stable Page Write.
This may cause write() stalls, especially when using slower disks. Without write cache, flushing to disk takes ~10ms usually, ~100ms in bad cases.
There is a patch that mitigates this problem on newer kernels; the idea is to reduce the chance that write() has to wait in wait_on_page_writeback:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283
Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634
Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this patch enabled.
xfs is also said to avoid this problem.
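If one wanted to try that, the Redis data directory would simply be placed on an xfs filesystem; the device and mount point below are hypothetical:

# Hypothetical device and mount point for a Redis data partition on xfs.
mkfs.xfs /dev/sdb1
mkdir -p /data/redis
mount -t xfs /dev/sdb1 /data/redis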
$ cat /proc/sys/vm/dirty_background_ratio
10
$ cat /proc/sys/vm/dirty_ratio
20
Dirty pages under normal load:

$ grep ^Dirty: /proc/meminfo
Dirty:    104616 kB

(The machine has 128 GB of RAM.)
During the morning rdb/aof_rewrite, Dirty rises to about:

500,000 kB (roughly 500 MB)

Neither figure comes anywhere near the configured dirty_background_ratio / dirty_ratio thresholds (10% and 20% of 128 GB are about 12.8 GB and 25.6 GB), so tuning those two parameters is unlikely to help.
Tests:
# 1. Let dirty pages live for at most 90 s: vm.dirty_expire_centisecs = 9000
echo '9000' > /proc/sys/vm/dirty_expire_centisecs

# 2. Raise dirty_ratio
echo '80' > /proc/sys/vm/dirty_ratio
On a 48 GB machine with poor I/O, after setting dirty_ratio = 80, Dirty climbs very high, but there is no obvious improvement in Redis latency:
$ grep ^Dirty: /proc/meminfo
Dirty:    8598180 kB

=> echo '80' > /proc/sys/vm/dirty_ratio

$ grep ^Dirty: /proc/meminfo
Dirty:    11887180 kB
$ grep ^Dirty: /proc/meminfo
Dirty:    21295624 kB
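Since neither change makes an obvious difference, it is worth putting the defaults back after experimenting (values taken from the sysctl output shown earlier):

# Restore the defaults recorded in the earlier sysctl output.
echo '3000' > /proc/sys/vm/dirty_expire_centisecs
echo '20'   > /proc/sys/vm/dirty_ratio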