# Cryptography Corollary: Introduction to Hash Functions

Hash functions are strong cryptographic tools, but sometimes they’re not as secure as we’d like.

We’ve previously explored the foundations of encryption. It’s worth observing, however, that encryption has an important cryptographic cousin: Hash functions. What’s the difference? Hash functions aren’t necessarily a form of encryption because hash functions don’t encrypt anything. That is, to qualify as encryption, a function or algorithm must be able to both encrypt and decrypt. Hash functions, as we’ll see, lack this latter property altogether—or, they’re generally expected to.

You may have seen hash functions mentioned in the news recently. Indeed, it’s common for data breaches to include stolen passwords that have been hashed. While hashes are generally safe, they’re not foolproof. For example, SHA-1, a government standard hash function, has recently been proven broken. But what are hash functions, and how do they work? How do they fit into our lives? What does the SHA-1 news mean for cybersecurity? All told, hash functions are unsung heroes that make cryptography practical, but they don’t represent a silver bullet for data security.

## Hash Functions

In the abstract, we can think of a hash function as any computational function that maps an arbitrary string of data to a fixed-length output. In other words, we want to compress any piece of data—names, social security numbers, MP3 files, whatever—into fixed-length values.

For this to work, we need our hash functions to operate in a deterministic, public, and pseudorandom way. We need to know that for any given input x, the hash of x will always be the same—that is, the hash is computationally determined by its implementation. A deterministic hash is useless unless anyone can use it, so we want the implementation to be publicly available to everyone. We would also like to obscure the original input by ensuring the output appears random. There are some fundamental properties we can leverage to make hash functions work.

Hash functions are defined by their properties, and it’s these properties that make hash functions so useful. We primarily need hash functions to be one-way functions. We want it to be easy to compute the hash for x, but we want it to be impractical—or impossible—to reverse the hash to find x. In other words, we shouldn’t be able to deduce x given its hash value. We would also like our hash functions to be collision resistant. We don’t want similar inputs like “Hello World!” and “Hello World?” to hash to the same output, for example. We also don’t want two completely different inputs to hash to the same output. For security’s sake, we need hash functions that behave seemingly randomly. The output from the hash function needs to look sufficiently random so that we can’t easily deduce the original input.

All these properties are important, but collision resistance is often the weakest link. Let’s consider a hash function whose output is 128 bits in length. We would expect there to be 2128 possible outputs—that’s over 340 undecillion possibilities.[1] It seems like our 128-bit hash should be sufficient for just about anything we throw at it. Unfortunately, the Birthday Paradox[2] tells us that the actual effort to find a collision (i.e., two inputs hashing to the same output) would be closer to 2(128+1)/2 ≈ 264—that’s only 18 quintillion tries. But why should we care? The Birthday Paradox tells us that the effort to find a collision is correlated almost directly to the square root of the number of bits in the hash output. In other words, as the graph below shows, the square root of x is significantly less than x, meaning much less effort is required to find a collision than we might have expected. That’s bad news for hash functions with small bit-lengths.

The conclusion here is simple: We need more bits. More specifically, the bit-length of the output of a hash function needs about twice the bits of the desired security level. For example, if we want to design a hash function that has a similar cryptographic strength (i.e., security level) as the AES block cipher—which has a 128-bit output—then we would need to use a 256-bit hash. This may seem like an odd bit of nuance, but a hash function’s bit-length is critical to establishing its ability to resist collisions. Let’s see how this consequence factors into some real-world scenarios.

## Some Practical Applications

Like most of us, Alice uses passwords to access her online accounts. Now, there’s a great deal of trust Alice is sharing with online services because she doesn’t know how a service secures her password. Instead of storing her password as plaintext, a service will save the hash value of her password. Then, when she tries to log into her account, the system will compare the hashed value of her input against the stored hash. This is the basic idea behind virtually every login system.

In this system, an attacker like Eve cannot easily determine Alice’s password. Even if Eve steals a list of hashed passwords, she will have a difficult time determining Alice’s password just from its hash—that’s the power of one-way functions. However, there’s an important caveat. If Alice reuses passwords on different websites and those websites use the same hash functions, then the stored hash values will be the same (or highly similar). That’s a huge boon for Eve because now she needs to find only one collision to unlock Alice’s accounts across several websites. That’s bad news for Alice, but it serves as a reminder that password reuse is always a bad idea. Let’s check out another example.

In a previous post, we explored Bitcoin and the blockchain. Within that system, a user like Alice must digitally sign her transactions using public-key cryptography. This is another place where Alice can use a hash function.

In a public-key system, Alice has a public key Kpub and a private key Kpri. She can digitally sign a message m by using a cryptographic function like sign(Kpri, m). She can then freely distribute her message m and its associated signature s. A user like Bob can then verify Alice’s message and her signature by using a function like verify(m, s, Kpub). If Bob gets a valid result, then the signature is good; otherwise, the signature is invalid. How do hash functions fit into this?

Typically, Alice’s message m is rather large—maybe she’s signing and sending a high-definition movie. To save time and effort, Alice can compress m by sending it through a hash function. Then, instead of signing m, she will sign the output from the hash function. Alice then sends her message and signature to Bob like normal. But now, Bob hashes the message (like Alice did) and verifies that the signed hash is correct. It sounds like extra work, but that extra hashing step can save a lot of time and space, especially if the message is large.

## The Fall of SHA-1

Up to now, we’ve been thinking about hash functions in rather abstract terms. Let’s switch to a concrete example derived from the NIST Secure Hash Standard (SHS). As it turns out, the SHS specifies a family of hash functions known as the Secure Hash Algorithms (SHAs). As their name implies, we should expect these hash functions to be secure against collisions. However, since 2011, NIST has discouraged the use of a SHS hash known as SHA-1. It’s widely accepted now that no one should be using SHA-1 in security-critical applications; indeed, better alternatives exist within the SHA hash family. Let’s briefly outline SHA-1 and see why it has been deprecated by NIST.

SHA-1 takes an arbitrary input and produces a 160-bit hashed output. At the highest level, the input is broken into blocks of bits. To ensure those blocks are all the same size (perhaps the last block isn’t long enough), we pad the blocks to 512 bits by adding some meaningless data to the block.

We then feed the padded blocks through a compression function. As its name suggests, the compression function compresses the padded 512-bit blocks into 160-bit hash values. Now, it’d be too easy if we did this once and quit. Remember that we’re compressing blocks of bits from the input, so we can’t just compress one and say we’re done. We need to hash all the blocks.

To achieve this—and to achieve a nice level of security—we send each compressed hash value through a feedback loop into the compression function. There, the previously compressed hash value is used to compute the next compressed hash value. When we get to the last block of input, the final output of the compression function serves as our final 160-bit hashed output.[3]

Of course, there’s a lot more nuance involved in the actual compression function—for instance, it’s made up of 80 rounds of compression—but we’ll skip the details for sanity’s sake. The important takeaway is that SHA-1 is a pretty good 160-bit hash function. If we use our Birthday Paradox result, we see that SHA-1 has a security level of 2(160+1)/2 ≈ 280, which isn’t too shabby. Unfortunately for SHA-1, NIST has deprecated and discouraged its use since 2011—and for good reason, really.

Although SHA-1 has existed since 1995, it has been suspected of weaknesses since at least 2004. That’s why we have better SHS hash functions like the SHA-2 variants and SHA-3. At any rate, it wasn’t clear to what extent SHA-1 was potentially broken until earlier this year. In February, a team of researchers from the Centrum Wiskunde & Informatica and Google published the first known SHA-1 collision, known as SHAttered. The researchers created two PDF files that produced the same SHA-1 hash, resulting in a practical collision that has effectively broken the hash.

The silver lining here is twofold. Most mainstream applications migrated away from SHA-1 long ago, and the SHAttered collision took over nine quintillion computations to find. So, while the news of the collision is devastating to SHA-1, the number of people impacted should be limited. Stronger hash functions exist—indeed, better functions exist within the SHA family—so there’s no reason to continue using SHA-1 except for limited, noncritical applications.

## Going Forward

While SHA-1 has been definitively shown to be broken, the SHAttered collision wasn’t exactly computationally trivial. It took the researchers the equivalent of 6,500 years of single-CPU computations to find the collision. Note that that’s single-CPU computations; it’s unclear how that metric would translate to multicore CPUs, supercomputers, or even quantum computers. Regardless, the result does serve as a potent reminder that computing is only increasing in speed and power.

SHA-1 is a hash function standardized by the federal government, and it has since been shown to be broken. A morbid curiosity remains: How long will it take to break SHA-2 and SHA-3, the definitive government hash standards? It’s only a matter of time, as no cryptographic function is perfectly secure. For the time being, though, hash functions are too critical to everyday cryptography for them to be abandoned outright. Going forward, it’d be best to migrate to securer hashes like SHA-2 and SHA-3 and to keep an eye out for any new reports of collisions.