Hashing is the process of taking input values and transforming them into output values with a length that is fixed in size. The same input generates the same output. The function that does the transformation must avoid, as much as possible, mapping of different keys to the same values. When that does happen, we call it a collision.

How often do these collisions happen? Much more often than we think.

Mathematically speaking, there is an important principle called pigeonhole principle1. It states that if n objects are distributed over m places and if n > m, then some place receives at least two objects. It’s a statement that is easy to understand but has counter-intuitive applications.

The birthday paradox

What is the probability that, for a set of n people, a pair will have the same birthday (a collision)?

The counterintuitive fact is that only 23 people are needed for that probability to exceed 50%.

When you’re one of the people in the group, you’re comparing yourself to 22 other people. With 22 chances the 50% probability seems strange. But there are 22 other people also comparing birthdays. 23 people at the start, then 22 people, then 21 people and so on. When we add up the number of comparisons we get a finite arithmetic series with a sum of 253. That’s 253 chances for a shared birthday.

The 50% here is easier to understand when we flip the problem around, that is, calculate the probability that no one shares a birthday.

Let’s assume 365 possible birthdays.

We want the probability that all 23 birthdays are different.

So:

  • Person 1 can have any birthday → 365/365
  • Person 2 must avoid Person 1’s birthday → 364/365
  • Person 3 must avoid birthdays of 1 & 2 → 363/365
  • Person 23 must avoid 22 other birthdays → 343/365

So the probability of no matches is:

For n = 23:

Therefore, the probability that at least two people share a birthday is:

The 50% collision point provides a useful rule of thumb for when collisions become likely.

Suppose you pick k random values from a space of N possible outcomes. Then, without going into the full derivation, the number of values needed for a 50% chance of at least one collision is approximately:

For rough estimates, it’s often simplified to:

So what’s the chance we’ll get some hash collisions?

Let’s say we’re looking at a hash output of 8 characters using only [a–z0–9] (36 choices per character), then:

This is where things get really interesting. Even with a huge output space, you don’t need to generate that many inputs before a collision is likely.

Using the back of the envelope estimate, for a hash function with N possible outputs, you can expect a 50% chance of a collision after about:

Thus,

So even with just 1.7 million inputs, you already have a 50% chance of a collision — not because the inputs exceed the number of outputs (as in the pigeonhole principle), but due to the birthday paradox, which describes how collisions become likely surprisingly early.

Footnotes

  1. It is also commonly called Dirichlet’s box principle or Dirichlet’s drawer principle after an 1834 treatment of the principle by Peter Gustav Lejeune Dirichlet under the name Schubfachprinzip (“drawer principle” or “shelf principle”). The word pigeonhole comes from describing a desk or cabinet with small holes for letters or papers, like a pigeonhole house.