The AOF file has one big advantage:

its format is identical to the Redis protocol, so replaying it is trivial. You can pipe it straight to a server:
    cat data/appendonly.aof | nc localhost 22003
    +OK
    +OK
    +OK
    +OK
    +OK
    +OK
    +OK
    +OK
Or use the pipe feature provided by redis-cli:
    $ cat data/appendonly.aof | redis-cli --pipe -h 127.0.0.5 -p 22003
    All data transferred. Waiting for the last reply...
    Last reply received from server.
    errors: 0, replies: 115
Time cost: on my PC, 300MB takes 120s, about 2MB/s (6w commands/s), so a single process can replay 156GB (500 million commands) per day:
    ning@ning-laptop:~/idning-github/redis-mgr$ ll /home/ning/Desktop/t/appendonly.aof
    787171 -rw-r--r-- 1 ning ning 371M 2014-02-26 22:56 /home/ning/Desktop/t/appendonly.aof
    ning@ning-laptop:~/idning-github/redis-mgr$ time cat /home/ning/Desktop/t/appendonly.aof | redis-cli --pipe -h 127.0.0.5 -p 22003
    All data transferred. Waiting for the last reply...
    Last reply received from server.
    errors: 0, replies: 7339651

    real    1m58.729s
    user    0m8.700s
    sys     0m1.780s
Implementation: the pipe feature simply writes whatever is on STDIN to the socket:
    ssize_t nread = read(STDIN_FILENO,obuf,sizeof(obuf));
    ssize_t nwritten = write(fd,obuf+obuf_pos,obuf_len);
The method above cannot replay through twemproxy:
    ning@ning-laptop:/tmp/r/redis-22001$ cat data/appendonly.aof | redis-cli --pipe -h 127.0.0.5 -p 24000
    All data transferred. Waiting for the last reply...
    No replies for 30 seconds: exiting.
    errors: 1, replies: 0
The reason: the AOF contains records a proxy cannot handle (it begins with a SELECT command, which twemproxy rejects), so no replies ever come back.
To turn this into a migration tool: the pipe approach's performance is already sufficient, but functionally it needs to gain:
compatibility with the proxy (strip SELECT, split MSET)
--filter : a key filter (this requires parsing the commands)
--rewrite
Test commands:
    #key-value
    redis-cli -h 127.0.0.5 -p 22001 SET key0 v0
    redis-cli -h 127.0.0.5 -p 22001 GETSET key0 v0
    redis-cli -h 127.0.0.5 -p 22001 APPEND key0 v_a
    redis-cli -h 127.0.0.5 -p 22001 STRLEN key0

    #expire
    redis-cli -h 127.0.0.5 -p 22001 EXPIRE key0 5
    sleep 6
    redis-cli -h 127.0.0.5 -p 22001 SETEX key0 5 v_a
    sleep 6

    #counter
    redis-cli -h 127.0.0.5 -p 22001 INCR key1

    #hash
    redis-cli -h 127.0.0.5 -p 22001 HSET key3 h3 val3

    #list
    redis-cli -h 127.0.0.5 -p 22001 LPUSH key4 v4
    redis-cli -h 127.0.0.5 -p 22001 LPOP key4

    #set
    redis-cli -h 127.0.0.5 -p 22001 SADD key5 v5
The corresponding AOF records are:
    ----------------------------------------------------------------------------------------------------------------
    #key-value
    redis-cli -h 127.0.0.5 -p 22001 SET key0 v0
        *3 $3 SET $4 key0 $2 v0
    redis-cli -h 127.0.0.5 -p 22001 GETSET key0 v0
        *3 $6 GETSET $4 key0 $2 v0
    redis-cli -h 127.0.0.5 -p 22001 APPEND key0 v_a
        *3 $6 APPEND $4 key0 $3 v_a
    redis-cli -h 127.0.0.5 -p 22001 STRLEN key0
        <nothing>

    #expire
    redis-cli -h 127.0.0.5 -p 22001 EXPIRE key0 5       (rewritten as PEXPIREAT)
        *3 $9 PEXPIREAT $4 key0 $13 1393467438683
    sleep 6                                             (key deleted after 5s)
        *2 $3 DEL $4 key0
    redis-cli -h 127.0.0.5 -p 22001 SETEX key0 5 v_a    (SETEX becomes two commands)
        *3 $3 SET $4 key0 $3 v_a
        *3 $9 PEXPIREAT $4 key0 $13 1393467444711
    sleep 6
        *2 $3 DEL $4 key0

    #counter
    redis-cli -h 127.0.0.5 -p 22001 INCR key1           <INCR logs the operation, not the result>
        *2 $4 INCR $4 key1

    #hash
    redis-cli -h 127.0.0.5 -p 22001 HSET key3 h3 val3
        *4 $4 HSET $4 key3 $2 h3 $4 val3

    #list
    redis-cli -h 127.0.0.5 -p 22001 LPUSH key4 v4
        *3 $5 LPUSH $4 key4 $2 v4
    redis-cli -h 127.0.0.5 -p 22001 LPOP key4
        *2 $4 LPOP $4 key4

    #set
    redis-cli -h 127.0.0.5 -p 22001 SADD key5 v5
        *3 $4 SADD $4 key5 $2 v5
MSET logs all key-value pairs in a single record:

    $ redis-cli -h 127.0.0.5 -p 22000 mset k1 v1 k2 v2
        mset $2 k1 $2 v1 $2 k2 $2 v2
    $ redis-cli -h 127.0.0.5 -p 22001 set key1 3
    OK
        *3 $3 set $4 key1 $1 3
    $ redis-cli -h 127.0.0.5 -p 22001 set key2 3
    OK
        *3 $3 set $4 key2 $1 3

    #note: deleting 3 keys when only 2 of them exist still logs a single DEL of all 3 keys
    $ redis-cli -h 127.0.0.5 -p 22001 del key1 key2 key3
    (integer) 2
        *4 $3 del $4 key1 $4 key2 $4 key3
    $ redis-cli -h 127.0.0.5 -p 22001 del key1 key2 key3
    (integer) 0
    $ redis-cli -h 127.0.0.5 -p 22001 del key1 key2 key3
    (integer) 0

    (the last two DELs produce no AOF record)
Parsing the AOF is very simple, as you can see from redis-check-aof:
    off_t process(FILE *fp) {
        long argc;
        off_t pos = 0;
        int i, multi = 0;
        char *str;

        while(1) {
            if (!multi) pos = ftello(fp);
            if (!readArgc(fp, &argc)) break;
            for (i = 0; i < argc; i++) {
                readString(fp,&str);
            }
        }
    }
In Redis itself, loading the AOF is done by this function:
    /* Replay the append log file. On error REDIS_OK is returned. On non fatal
     * error (the append only file is zero-length) REDIS_ERR is returned. On
     * fatal error an error Message is logged and the program exists. */
    int loadAppendOnlyFile(char *filename) {
points:
Problem:
To implement tail -f behavior here, a readline helper is needed (fgets returns immediately at EOF, so fgets cannot be used as-is).
I ran a few tests on read and found:

tail -f would need logic like this, but when I tried it, select always reports a regular file as readable, even at the end of the file. Looking it up:
Disk files are always ready to read (but the read might return 0 bytes if you're already at the end of the file), so you can't use select() on a disk file to find out when new data is added to the file.
POSIX says:
File descriptors associated with regular files shall always select true for ready to read, ready to write, and error conditions.
There is a detailed test here: http://www.greenend.org.uk/rjk/tech/poll.html

It likewise shows that for a regular file, poll always returns POLLIN, even at EOF.

In other words, select/poll only make sense for things like pipes and sockets, where reads and writes can actually block.
Looking at how tail actually does it:

strace shows a sleep loop:
    ning@ning-laptop:~/idning/langtest/c$ tail --version
    tail (GNU coreutils) 7.4
    ning@ning-laptop:~/idning/langtest/c$ strace tail -f common.h
    execve("/usr/bin/tail", ["tail", "-f", "common.h"], [/* 69 vars */]) = 0
    brk(0)                                  = 0xd1a000
    ...
    nanosleep({1, 0}, NULL)                 = 0
    fstat(3, {st_mode=S_IFREG|0644, st_size=635, ...}) = 0
    nanosleep({1, 0}, NULL)                 = 0
    fstat(3, {st_mode=S_IFREG|0644, st_size=635, ...}) = 0
    nanosleep({1, 0}, NULL)                 = 0
    fstat(3, {st_mode=S_IFREG|0644, st_size=635, ...}) = 0
And the sleep-based code in coreutils:
    /* Tail NFILES files forever, or until killed.
       The pertinent information for each file is stored in an entry of F.
       Loop over each of them, doing an fstat to see if they have changed size,
       and an occasional open/fstat to see if any dev/ino pair has changed.
       If none of them have changed size in one iteration, sleep for a
       while and try again.  Continue until the user interrupts us.  */

    static void
    tail_forever (struct File_spec *f, int nfiles, double sleep_interval)
    {
      ...
      if (fstat (fd, &stats) != 0)
        {
          f[i].fd = -1;
          f[i].errnum = errno;
          error (0, errno, "%s", name);
          continue;
        }

      if (f[i].mode == stats.st_mode
          && (! S_ISREG (stats.st_mode) || f[i].size == stats.st_size)
          && timespec_cmp (f[i].mtime, get_stat_mtime (&stats)) == 0)
        {
          /* not changed */
        }
      /* changed */
So tail -f is implemented by sleeping and re-checking the file.
redis-cli --pipe runs in pipeline mode: as long as the server socket is writable, it keeps writing.

twemproxy, however, always reads as much as it can and holds the messages in memory, so everything piles up inside twemproxy and the transfer times out.
This problem is discussed at: https://github.com/twitter/twemproxy/issues/203
Workarounds:
Safe mode: write one command at a time, sending the next only after the previous one succeeds.
Increase twemproxy's timeout.
Bound the pipeline on the client side, e.g. keep (requests sent - responses received) < 1024.
Using redisCommandArgv:
    while(1){
        msg = readMsg(fp);
        reply = redisCommandArgv(context, msg->argc, (const char **)msg->argv, msg->argvlen);
        freeReplyObject(reply);
        freeMsg(msg);
    }
Performance turned out to be poor: about 7000/s with twemproxy as the backend, and about 10000/s against Redis directly.
How does calling redisCommand directly perform?
    $ cat bench1.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "hiredis.h"

    int main(void) {
        unsigned int i;
        redisContext *c;
        redisReply *reply;
        struct timeval timeout = { 1, 500000 }; // 1.5 seconds

        c = redisConnectWithTimeout((char*)"127.0.0.5", 22000, timeout);
        if (c->err) {
            printf("Connection error: %s\n", c->errstr);
            exit(1);
        }

        for(i=0; i<100*1000; i++){
            reply = redisCommand(c,"SET %s %s", "foo", "hello world");
            freeReplyObject(reply);
        }
        return 0;
    }

    ning@ning-laptop:~/idning-github/redis/deps/hiredis$ cc bench1.c -I ./ -L ./ -l hiredis    (or: cc bench1.c libhiredis.a)
    ning@ning-laptop:~/idning-github/redis/deps/hiredis$ time ./a.out

    real    0m6.945s
    user    0m0.710s
    sys     0m1.710s
100*1000 / 6.9 ≈ 1.4w/s.
    ning@ning-laptop:~/idning-github/redis/deps/hiredis$ cc bench1.c libhiredis.a -pg
    ning@ning-laptop:~/idning-github/redis/deps/hiredis$ ./a.out
    $ gprof ./a.out ./gmon.out | vim -
    Flat profile:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
     22.23      0.04     0.04                             redisReaderGetReply
     16.67      0.07     0.03                             redisvFormatCommand
     11.12      0.09     0.02                             redisGetReply
     11.12      0.11     0.02                             sdscatlen
      5.56      0.12     0.01                             main
      5.56      0.13     0.01                             redisBufferRead
      5.56      0.14     0.01                             redisBufferWrite
      5.56      0.15     0.01                             sdsIncrLen
      5.56      0.16     0.01                             sdsempty
      5.56      0.17     0.01                             sdsnewlen
      2.78      0.18     0.01                             sdsMakeRoomFor
      2.78      0.18     0.01                             sdsRemoveFreeSpace
The run took 7s in total, so why don't the self seconds add up to 7s? Because gprof only samples time spent executing user-space code on the CPU; the time spent blocked in read/write system calls waiting for the server is not sampled.
Raw write/read, bypassing hiredis, is still slow:
    ning@ning-laptop:~/idning-github/redis/deps/hiredis$ cat bench3.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>
    #include "hiredis.h"

    int main(void) {
        unsigned int i;
        redisContext *c;
        redisReply *reply;
        int ret;
        struct timeval timeout = { 1, 500000 }; // 1.5 seconds

        c = redisConnectWithTimeout((char*)"127.0.0.5", 22000, timeout);
        if (c->err) {
            printf("Connection error: %s\n", c->errstr);
            exit(1);
        }

        char *cmd = "*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$9\r\nbarbarbar\r\n";
        int len = strlen(cmd);
        char buf[1024];

        for(i=0; i<100*1000; i++){
            ret = write(c->fd, cmd, len);
            assert(len == ret);

            /*fprintf(stderr, "read\n");*/
            ret = read(c->fd, buf, 5);
            assert(5 == ret);
            buf[5] = 0;
            /*fprintf(stderr, "%d: %s\n", i, buf);*/
            /*assert(0 == strcmp(buf, "+OK\r\n"));*/
        }
        return 0;
    }
It still takes 5s (2w/s).
    for(i=0; i<100*1000; i++){
        ret = twrite(c->fd, cmd, len);
        assert(len == ret);
    }
    for(i=0; i<100*1000; i++){
        ret = tread(c->fd, buf, 5);
        assert(5 == ret);
    }
Split this way into a write loop followed by a read loop (effectively one huge pipeline), it takes only 0.4s.
    ning@ning-laptop:~/idning-github/redis/deps/hiredis$ cat bench2.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <signal.h>
    #include "hiredis.h"
    #include "async.h"
    #include "adapters/ae.h"

    /* Put event loop in the global scope, so it can be explicitly stopped */
    static aeEventLoop *loop;

    void setCallback(redisAsyncContext *c, void *r, void *privdata) {
        redisReply *reply = r;
        if (reply == NULL) return;
        int * pi = (int*) privdata;
        printf("argv[%d]: %s\n", *pi, reply->str);
        (*pi)++;
        if (*pi > 100*1000)
            exit(0);
        redisAsyncCommand(c, setCallback, (char*)pi, "SET thekey %s", "xxxxxxxxxxxxxx");
    }

    void connectCallback(const redisAsyncContext *c, int status) {
        if (status != REDIS_OK) {
            printf("Error: %s\n", c->errstr);
            return;
        }
        printf("Connected...\n");
    }

    void disconnectCallback(const redisAsyncContext *c, int status) {
        if (status != REDIS_OK) {
            printf("Error: %s\n", c->errstr);
            return;
        }
        printf("Disconnected...\n");
    }

    int main() {
        signal(SIGPIPE, SIG_IGN);

        redisAsyncContext *c = redisAsyncConnect("127.0.0.1", 6379);
        if (c->err) {
            /* Let *c leak for now... */
            printf("Error: %s\n", c->errstr);
            return 1;
        }

        loop = aeCreateEventLoop(1000);
        redisAeAttach(loop, c);
        redisAsyncSetConnectCallback(c,connectCallback);
        redisAsyncSetDisconnectCallback(c,disconnectCallback);

        int i = 0;
        redisAsyncCommand(c, setCallback, (char*)&i, "SET thekey %s", "xxxxxxxxxxxxxx");

        aeMain(loop);
        return 0;
    }

    ning@ning-laptop:~/idning-github/redis/deps/hiredis$ cc -I../../src ../../src/ae.o ../../src/zmalloc.o bench2.c libhiredis.a ../jemalloc/lib/libjemalloc.a -lpthread
Still 6s (roughly 2w/s again).
The asynchronous approach (redis-benchmark):
It creates a number of clients, each of which registers this event:
    aeCreateFileEvent(config.el,c->context->fd,AE_WRITABLE,writeHandler,c);
writeHandler:
    static void writeHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
        client c = privdata;
        REDIS_NOTUSED(el);
        REDIS_NOTUSED(fd);
        REDIS_NOTUSED(mask);

        /* Initialize request when nothing was written. */
        if (c->written == 0) {
            /* Enforce upper bound to number of requests. */
            if (config.requests_issued++ >= config.requests) {
                freeClient(c);
                return;
            }

            /* Really initialize: randomize keys and set start time. */
            if (config.randomkeys) randomizeClientKey(c);
            c->start = ustime();
            c->latency = -1;
        }

        if (sdslen(c->obuf) > c->written) {
            void *ptr = c->obuf+c->written;
            int nwritten = write(c->context->fd,ptr,sdslen(c->obuf)-c->written);
            if (nwritten == -1) {
                if (errno != EPIPE)
                    fprintf(stderr, "Writing to socket: %s\n", strerror(errno));
                freeClient(c);
                return;
            }
            c->written += nwritten;
            if (sdslen(c->obuf) == c->written) {
                aeDeleteFileEvent(config.el,c->context->fd,AE_WRITABLE);
                aeCreateFileEvent(config.el,c->context->fd,AE_READABLE,readHandler,c);
            }
        }
    }
Each time a message is fully written, it removes AE_WRITABLE and adds AE_READABLE:
    static void writeHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
        ...
        if (sdslen(c->obuf) == c->written) {
            aeDeleteFileEvent(config.el,c->context->fd,AE_WRITABLE);
            aeCreateFileEvent(config.el,c->context->fd,AE_READABLE,readHandler,c);
        }
        ...
    }
After the read completes, the writable event is activated again:
    if (c->pending == 0) {
        clientDone(c);    /* re-adds AE_WRITABLE */
        break;
    }
By default redis-benchmark does not use pipelining: it writes one request and reads one reply, though via the async API.
In pipeline mode, it writes several commands into obuf in one go while preparing the request:
    for (j = 0; j < config.pipeline; j++)
        c->obuf = sdscatlen(c->obuf,cmd,len);
    c->pending = config.pipeline;
redis-benchmark appeared to reach around 5w/s, but that turned out to be because it defaults to -c 50, i.e. 50 concurrent clients. With -c 1, performance is still poor:
    ning@ning-laptop:~/idning-github/redis/src$ time redis-benchmark -h 127.0.0.5 -p 22000 -c 1 -t set -n 100000
    ====== SET ======
      100000 requests completed in 8.64 seconds
      1 parallel clients
      3 bytes payload
      keep alive: 1

    99.99% <= 1 milliseconds
    100.00% <= 2 milliseconds
    100.00% <= 3 milliseconds
    100.00% <= 7 milliseconds
    11579.44 requests per second

    real    0m8.651s
    user    0m0.570s
    sys     0m2.390s
About 8s (1.2w/s).
So all three approaches land in the same ballpark:

TODO: the reason for this is still not clear..

Could it be that a read system call is fast when the data is already available, and slow when it has to wait?

They all do the same thing: send a request, wait for the response, parse the response.

The only explanation seems to be that blocking while waiting for the response is where the time goes.

To pin this down, one could abstract the server into a simple echo server and measure the maximum qps a strict request-response client can reach against it.
strace reveals a common pattern: approaches 2 and 3 both use epoll; one epoll_wait is followed by one write, then another epoll_wait, then one read:
    epoll_ctl(3, EPOLL_CTL_MOD, 4, {EPOLLIN, {u32=4, u64=4}}) = 0
    epoll_ctl(3, EPOLL_CTL_DEL, 4, {0, {u32=4, u64=4}}) = 0
    epoll_ctl(3, EPOLL_CTL_ADD, 4, {EPOLLOUT, {u32=4, u64=4}}) = 0
    epoll_wait(3, {{EPOLLOUT, {u32=4, u64=4}}}, 10240, 240) = 1
    write(4, "*3\r\n$3\r\nSET\r\n$16\r\nkey:__rand_int"..., 45) = 45
    epoll_ctl(3, EPOLL_CTL_DEL, 4, {0, {u32=4, u64=4}}) = 0
    epoll_ctl(3, EPOLL_CTL_ADD, 4, {EPOLLIN, {u32=4, u64=4}}) = 0
    epoll_wait(3, {{EPOLLIN, {u32=4, u64=4}}}, 10240, 239) = 1
    read(4, "+OK\r\n", 16384)               = 5
Approach 4 uses poll, and a single poll covers one read and one write:
    ning@ning-laptop:~/idning-github/redis/src$ strace ./redis-cli -h 127.0.0.5 -p 22000 --replay ~/Desktop/t/appendonly.aof
    poll([{fd=3, events=POLLIN|POLLOUT}], 1, 1000) = 1 ([{fd=3, revents=POLLIN|POLLOUT}])
    read(3, "+OK\r\n", 16384)               = 5
    read(3, 0x7fff7d548f10, 16384)          = -1 EAGAIN (Resource temporarily unavailable)
    write(3, "*3\r\n$3\r\nSET\r\n$13\r\nkkk-100000756\r"..., 53) = 53
    poll([{fd=3, events=POLLIN|POLLOUT}], 1, 1000) = 1 ([{fd=3, revents=POLLIN|POLLOUT}])
    read(3, "+OK\r\n", 16384)               = 5
    read(3, 0x7fff7d548f10, 16384)          = -1 EAGAIN (Resource temporarily unavailable)
    write(3, "*3\r\n$3\r\nSET\r\n$13\r\nkkk-100000757\r"..., 53) = 53
My current implementation is effectively a pipeline of length 1, essentially the same as redis-cli's approach, yet its performance looked good (5w/s).

It later turned out that my --replay code had a bug that made it behave like a pipeline of length ~100, which is why it looked so fast. The code is in this commit:
https://github.com/idning/redis/commit/b122ab0c749f2a93bb514ae07ba73739690ab46e
After fixing that bug:
https://github.com/idning/redis/commit/b956e2cf92feb510f7d1a2f158a8eafe907d9ae1
with a pipeline length of 1, performance is around 1w/s; with a length of 10, around 5w/s (laptop test).
On a production machine:
    pipesize     1      10     100    1000
    localhost    1w     4w     5w     5w
    online       0.3    1w     10w    12w
With pipeline length 1 online, throughput is only 0.3w/s. The cause is the larger network RTT online (by this measure, about 3ms per request), and the test used a heavily loaded production machine.
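A back-of-envelope model for these numbers, assuming each window of $W$ pipelined requests costs one round trip:

```latex
\text{qps} \approx \frac{W}{\mathrm{RTT}}
```

With a small window the RTT dominates; growing $W$ amortizes it until the server's own processing rate becomes the ceiling, which matches the plateaus in the table (5w on localhost, 12w online).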
Also needed: a cleanup script, so that if a replay fails halfway, all keys with the given prefix xxx can be removed.
Code:
https://github.com/idning/redis/blob/replay/src/redis-cli.c
https://github.com/cen-li/redis/blob/redis-2.8.3_replay-aof/src/redis-replay-aof.c