Table of Contents

- HyperLogLog is a counting algorithm
- HyperLogLog is an approximate technique for computing the number of distinct entries in a set, and it does this while using a small amount of memory. For instance, to achieve 99% accuracy it needs only 16 KB.
- It is derived from the LogLog counting technique

But wait: don't we just need a counter? Wouldn't `i++` do the job?

Requirements:

- Count the unique IPs that visit a given website each day
- Count the number of unique queries Google receives each day

How would we handle this?

- Script: `sort | uniq | wc -l`
- Map-reduce: `sort | uniq | wc -l`
- Program: deduplicate with a big in-memory dictionary.

Their characteristics:

- the sort-based approach takes O(n log n) time
- the dictionary-based deduplication needs a lot of memory: O(n) space
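For contrast, the dictionary-based approach above is trivial to sketch in Python; it is exact, but memory grows with the number of distinct items (the function name is mine):

```python
# Exact distinct counting with a hash set: O(n) space in the worst case.
def count_distinct_exact(items):
    seen = set()          # one entry per distinct item -> memory grows with n
    for item in items:
        seen.add(item)
    return len(seen)

print(count_distinct_exact(["a", "b", "a", "c", "b"]))  # 3
```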

Here I’ll cover only the basic idea using a very clever example found at [3]. Imagine you tell me you spent your day flipping a coin, counting how many times you encountered an uninterrupted run of heads. If you tell me that the maximum run was 3 heads, I can guess that you did not flip the coin very many times. If instead your longest run was 13, you probably spent a long time flipping the coin.

Long story short, this is what HyperLogLog does: it hashes every new element you observe. Part of the hash is used to index a register; the other part is used to count the longest run of leading zeroes in the hash (our run of heads), and each register keeps the longest run it has seen.
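The idea can be sketched in Python. This is an illustrative simplification, not Redis's implementation: the register count `P`, the hash choice, and the bias constant are assumptions for the sketch, and the small-cardinality correction is omitted:

```python
import hashlib

# Simplified HyperLogLog: hash each element; the first P bits pick a
# register, the remaining bits are the "coin flips" whose leading-zero
# run length is recorded; each register keeps the maximum run seen.
P = 10                     # 2^10 = 1024 registers
M = 1 << P
registers = [0] * M

def add(value):
    h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
    idx = h >> (64 - P)                      # register index
    rest = h & ((1 << (64 - P)) - 1)         # remaining bits = coin flips
    rank = (64 - P) - rest.bit_length() + 1  # leading zeroes + 1
    registers[idx] = max(registers[idx], rank)

def estimate():
    alpha = 0.7213 / (1 + 1.079 / M)         # bias correction for large M
    return alpha * M * M / sum(2.0 ** -r for r in registers)

for i in range(100_000):
    add(f"user-{i}")
print(round(estimate()))   # an estimate close to 100000
```

The harmonic mean across registers is what keeps a single unlucky register (one very long "run of heads") from blowing up the estimate.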

```
PFADD var element [element ...]
PFCOUNT var
PFMERGE dst src [src ...]
```
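For illustration, a hypothetical `redis-cli` session (the key name `visitors` is made up); PFADD returns 1 when the approximated cardinality changed:

```
redis> PFADD visitors 10.0.0.1 10.0.0.2 10.0.0.1
(integer) 1
redis> PFCOUNT visitors
(integer) 2
```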

The commands prefix is “PF” in honor of Philippe Flajolet [4].

Very interesting.

- The algorithm was invented around 2007, and Google later improved on it.
- http://antirez.com/news/75
- http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
- http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
- An interactive simulation: http://content.research.neustar.biz/blog/hll.html
- Lossy Counting slides (excellent): http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S10P03slides.pdf
- More on the same: http://www.cs.utah.edu/~lifeifei/cis5930/lecture8.pdf

How to maintain a top 10 dynamically

- Use simple Redis key-value counters (a hash), plus a zset.
- The zset holds the current largest N. For every incoming query, update its value in the hash; then, if the current value > zset.min, kick out the zset's lowest member and insert the current query into the zset.
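A minimal in-memory sketch of this scheme, with plain dicts standing in for the Redis hash and zset (N = 3 and all names are arbitrary):

```python
# counts plays the role of the Redis hash, top the role of the zset.
N = 3
counts = {}   # query -> total count
top = {}      # the current top-N queries and their counts

def observe(query):
    counts[query] = counts.get(query, 0) + 1
    c = counts[query]
    if query in top or len(top) < N:
        top[query] = c
        return
    # compare with the zset's minimum; evict it if the new count is larger
    loser = min(top, key=top.get)
    if c > top[loser]:
        del top[loser]
        top[query] = c

for q in ["a", "a", "a", "b", "b", "c", "d", "d", "d", "d"]:
    observe(q)
print(sorted(top.items(), key=lambda kv: -kv[1]))  # d, a, b
```

Note the same caveat as with Redis: a query whose count only later overtakes the minimum still gets in, because its full count is kept in the hash, not just in the zset.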

This class of algorithms is called streaming algorithms:

http://en.wikipedia.org/wiki/Streaming_algorithm

- Streams can be denoted as an ordered sequence of points (or "updates")
- but the stream is usually too large to be stored

Models:

- cash register
- turnstile model
- sliding window

Frequency moments

Heavy hitters (**top 10**): find the most frequent (popular) elements in a data stream. Some notable algorithms are:

- Karp-Papadimitriou-Shenker algorithm
- Count-Min sketch
- Sticky sampling
- Lossy counting
- Multi-stage Bloom filters
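As a taste of one of these, here is a minimal Count-Min sketch in Python (the depth, width, and hashing scheme are arbitrary choices for this sketch):

```python
import hashlib

# D hash rows of W counters each; an item's estimate is the minimum of
# its D counters, which can overestimate but never underestimates.
D, W = 4, 1000
table = [[0] * W for _ in range(D)]

def _slots(item):
    for row in range(D):
        h = hashlib.sha1(f"{row}:{item}".encode()).digest()
        yield row, int.from_bytes(h[:8], "big") % W

def cms_add(item):
    for row, col in _slots(item):
        table[row][col] += 1

def cms_query(item):
    return min(table[row][col] for row, col in _slots(item))

for _ in range(42):
    cms_add("hot")
cms_add("cold")
print(cms_query("hot"))   # >= 42; exactly 42 unless "cold" collides in every row
```

Memory is fixed at D × W counters no matter how many distinct items flow by, which is the whole point for streams.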

Trend detection: detecting trends in data streams is often done using a heavy-hitters algorithm as listed above

- Counting distinct elements
- HyperLogLog

Entropy

Online learning

Say you're looking at the traffic for Facebook profiles. You have billions of hits. You want to find which profiles are accessed the most often. You could keep a count for each profile, but then you'd have a very large number of counts to keep track of, the vast majority of which would be meaningless.

With lossy counting, you periodically remove very low count elements from the table. The most-frequently accessed profiles would almost never have low counts anyway, and if they did, they wouldn't be likely to stay there for long.

-- In effect this is like counting with Redis, with the expire set fairly low (e.g., 1 day)

There are a large number of refinements to the algorithm. But the basic idea is this -- to find the heavy hitters without having to track every element, periodically purge your counts of any elements that don't seem likely to be heavy hitters based on the data so far.
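That periodic purge is exactly lossy counting. A minimal sketch in Python, following Manku and Motwani's bucket scheme (the eps value and the demo stream are made up):

```python
# Process the stream in buckets of width 1/eps; at each bucket boundary,
# drop entries whose count plus insertion slack can no longer reach the
# current bucket number -- they cannot be heavy hitters so far.
def lossy_count(stream, eps=0.1):
    width = int(1 / eps)       # bucket width
    counts = {}                # item -> (count, delta = bucket at insertion - 1)
    bucket = 1
    for n, item in enumerate(stream, 1):
        c, d = counts.get(item, (0, bucket - 1))
        counts[item] = (c + 1, d)
        if n % width == 0:     # bucket boundary: purge light entries
            counts = {k: (c, d) for k, (c, d) in counts.items() if c + d > bucket}
            bucket += 1
    return counts

# one frequent item interleaved with 100 one-off items
stream = ["hot" if i % 2 == 0 else f"cold{i // 2}" for i in range(200)]
print(lossy_count(stream))   # only "hot" survives the purges
```

Counts can be undercounted by at most eps × n, so anything that is a true heavy hitter stays in the table; the one-off items are swept out at every boundary.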

In short: approximate counting via sampling and pruning.