Head to https://brilliant.org/BreakingTaps/ to get a 30-day free trial. The first 200 people will get 20% off their annual subscription.

——————————————————-

Today we’re looking at HyperLogLog, an algorithm that leverages random chance to count the number of distinct items in a dataset. It does this by hashing each item, tracking the longest run of leading zeros seen in the resulting binary sequences, and using that as an estimate of cardinality.
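
If you want to poke at the core idea yourself, here is a rough Python sketch. It is not the code from the video: the helper names (`hash64`, `leading_zeros`, `rough_estimate`) and the choice of a truncated SHA-256 as the hash are assumptions made for illustration. It hashes every item, remembers the longest run of leading zeros it has seen, and guesses the cardinality as 2 to that power.

```python
import hashlib


def hash64(item: str) -> int:
    """Hash an item to a 64-bit integer (truncated SHA-256, an illustrative choice)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leading_zeros(value: int, width: int = 64) -> int:
    """Count the zero bits before the first 1 bit in a width-bit integer."""
    return width - value.bit_length()


def rough_estimate(items) -> int:
    """Guess the distinct count as 2 ** (longest run of leading zeros seen)."""
    longest = 0
    for item in items:
        longest = max(longest, leading_zeros(hash64(item)))
    return 2 ** longest
```

Duplicates hash to the same value, so they never change the answer. On its own this single estimator is very noisy and can only ever return a power of two, which is why the full algorithm combines many of them.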

HLL is a probabilistic algorithm, meaning it gives an estimate rather than an exact answer. But due to some clever tricks it is usually within 2% of the correct value, and it gets there both quickly and in a memory-efficient manner. A 512 KB data structure can accurately process trillions of items and terabytes of data, which is pretty impressive!
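
For the curious, here is a minimal sketch of what those clever tricks can look like in Python, loosely following the Flajolet et al. paper listed below. The class name `HyperLogLogSketch`, the default of 4096 registers, and the truncated SHA-256 hash are assumptions for illustration, and the large-range and bias corrections from the papers are left out. The first few bits of each hash pick a register, each register keeps the longest run of leading zeros (plus one) it has seen, and the registers are combined with a harmonic mean.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Illustrative HyperLogLog with m = 2**p registers (not production code)."""

    def __init__(self, p: int = 12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash
        index = h >> (64 - self.p)               # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = leading zeros in the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self) -> float:
        # Bias-correction constant from the paper, valid for m >= 128.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

The paper gives the typical relative error as roughly 1.04 / sqrt(m), so the 4096 registers assumed here should land at about 1.6%. Adding the same item twice never changes a register, which is what makes this a distinct count.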

When I made this video, I didn’t realize that another #SoME3 was in progress. But a bunch of viewers suggested I enter the video, so I guess this will be part of the event!

———————-

🔬Patreon if that’s your jam: https://www.patreon.com/breakingtaps

📢Twitter: https://twitter.com/BreakingTaps
📷Instagram: https://instagram.com/breakingtaps
💻Discord: https://discord.gg/R45uCXcEv4

———————-

Journal papers:

Flajolet, Philippe, et al. “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.” _Discrete Mathematics and Theoretical Computer Science_, 2007. https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

Heule, Stefan, Marc Nunkesser, and Alexander Hall. “HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm.” _Proceedings of the 16th International Conference on Extending Database Technology_. 2013. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf

Earlier work:

Durand, Marianne, and Philippe Flajolet. “Loglog counting of large cardinalities.” _Algorithms-ESA 2003: 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003. Proceedings 11_. Springer Berlin Heidelberg, 2003.

Flajolet, Philippe, and G. Nigel Martin. “Probabilistic counting algorithms for data base applications.” _Journal of computer and system sciences_ 31.2 (1985): 182-209.

Articles:

– https://towardsdatascience.com/hyperloglog-a-simple-but-powerful-algorithm-for-data-scientists-aed50fe47869
– https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/