Table of Contents
但是, 不就是需要一个计数么? i++ 不就行了么?
脚本: sort | uniq | wc -l map-reduce: sort | uniq | wc -l 程序: 通过一个大字典去重.
Here I’ll cover only the basic idea using a very clever example found at . Imagine you tell me you spent your day flipping a coin, counting how many times you encountered a non interrupted run of heads. If you tell me that the maximum run was of 3 heads, I can imagine that you did not really flipped the coin a lot of times. If instead your longest run was 13, you probably spent a lot of time flipping the coin.
Long story short this is what HyperLogLog does: it hashes every new element you observe. Part of the hash is used to index a register, The other part of the hash is used to count the longest run of leading zeroes in the hash (our run of heads).
PFADD var element element … element PFCOUNT var PFMERGE dst src src src … src
The commands prefix is “PF” in honor of Philippe Flajolet .
模型: - cash register - turnstile model - sliding window
Heavy hitters (top 10) Find the most frequent (popular) elements in a data stream. Some notable algorithms are:
- Karp-Papadimitriou-Shenker algorithm
- Count-Min sketch
- Sticky sampling
- Lossy counting
- Multi-stage Bloom filters
Trend detection Detecting trends in data streams is often done using an heavy hitters algorithm as listed above
Say you're looking at the traffic for facebook profiles. You have billions of hits. You want to find which profiles are accessed the most often. You could keep a count for each profile, but then you'd have a very large number of counts to keep track of, the vast majority of which would be meaningless.
With lossy counting, you periodically remove very low count elements from the table. The most-frequently accessed profiles would almost never have low counts anyway, and if they did, they wouldn't be likely to stay there for long.
-- 其实就相当于用redis做count, expire调的比较小(比如1 day)
There are a large number of refinements to the algorithm. But the basic idea is this -- to find the heavy hitters without having to track every element, periodically purge your counts of any elements that don't seem likely to be heavy hitters based on the data so far.