In our review of cryptographic algorithms, we will start with the Message-Digest, MD5 hashing algorithm. This isn’t because MD5 is “the best” or even “the first” (it’s certainly neither of those). In fact, it’s not technically an encryption algorithm at all. It is important though, as a precursor to other encryption algorithms we will talk about in the data center.

## Message-Digest: data fingerprint

Message-Digest algorithms are mathematical functions that transform a data string of arbitrary length into a second string of data of fixed length (128 bits, in this case). The resultant, second version of the data (call it the output of the function), is often referred to as the “fingerprint” of the original data. With this type of function, it should be impossible to have two different versions of the input data that returns the same output data. Conversely, it should be impossible to produce the input value even if you know the output value. It’s a one-way function. It can’t be reversed, in other words.

This sort of operation is also called a “Hash” or “Hashing Function.” This is the basis for lots of common crypto applications, including digital signatures and identity cards.

It’s not hard to imagine how this “fingerprinting” can be useful. Let’s say for example that you wanted to by a software program from company “M.” You found a website that offers to sell that package for quite a discount, but how do you know it’s really the package from “M?” Well, if “M” publishes the hash of the software, you can apply the same hash to the software you buy and if it matches, you know for sure that it’s authentic. ‘just one example.

## Where did the MD5 hashing algorithm come from

The MD5 hashing algorithm was created in the early 1990’s, and is one of a family of Message-Digest algorithms. Several of these (the later versions) were developed by Ronald Rivest.

Who is Ron Rivest? Well, Ron Rivest is a cryptographer with significant contributions to the field. He is a professor at Massachusetts Institute of Technology. He’s also one of the inventors of the RSA Algorithm (the “R” in RSA), as well as the RC cypher algorithms (which we’ll cover in a separate section). In short, the man is a giant in the cryptography world.

## How does it work

The following description outlines the five steps in the MD5 hashing algorithm. This description comes via Ius Mentis and details can be found in IETF RFC 1321.

### Preparing the input

The MD5 algorithm first divides the input in blocks of 512 bits each. 64 Bits are inserted at the end of the last block. These 64 bits are used to record the length of the original input. If the last block is less than 512 bits, some extra bits are ‘padded’ to the end.

Next, each block is divided into 16 words of 32 bits each. These are denoted as M0 … M15.

### MD5 helper functions

#### The buffer

MD5 uses a buffer that is made up of four words that are each 32 bits long. These words are called A, B, C and D. They are initialized as

word A: 01 23 45 67

word B: 89 ab cd ef

word C: fe dc ba 98

word D: 76 54 32 10

#### The table

MD5 further uses a table K that has 64 elements. Element number i is indicated as Ki. The table is computed beforehand to speed up the computations. The elements are computed using the mathematical sin function:

#### Ki = abs(sin(i + 1)) * 232

#### Four auxiliary functions

In addition MD5 uses four auxiliary functions that each take as input three 32-bit words and produce as output one 32-bit word. They apply the logical operators and, or, not and xor to the input bits.

F(X,Y,Z) = (X and Y) or (not(X) and Z)

G(X,Y,Z) = (X and Z) or (Y and not(Z))

H(X,Y,Z) = X xor Y xor Z

I(X,Y,Z) = Y xor (X or not(Z))

#### Processing the blocks

The contents of the four buffers (A, B, C and D) are now mixed with the words of the input, using the four auxiliary functions (F, G, H and I). There are four rounds, each involves 16 basic operations. One operation is illustrated in the figure below.

The figure shows how the auxiliary function F is applied to the four buffers (A, B, C and D), using message word Mi and constant Ki. The item “<<<s” denotes a binary left shift by s bits.

The output

After all rounds have been performed, the buffers A, B, C and D contain the MD5 digest of the original input.

So you see, MD5 has five steps with four rounds of computations that compute the hash of the input value.

## MD5 Collisions

MD5 has been in common use in the data center for quite some time. As we mentioned before though, it’s not “the first” nor “the best” hashing function. The relatively short, 128-bit message digest makes it fast to compute, but not appropriate for use in cryptography.

In cryptography, a “collision” is when two distinct input values produce the same hash. This is bad, because if there are collisions then the algorithm can be compromised. In 1996, collisions were found in MD5. Further exploits were demonstrated through the beginning of the 21st century. Even so, MD5 is still in widespread use for certain applications, including HTTPS encryption.

MD5 hashing algorithm and its cousin, SHA1, are in widespread use in the Transport Layer Security (TLS) protocol on which HTTPS is based. In fact, even though collisions were found with MD5 as early as 1996, it was still included in TLS as late as 2008. That said, MD5 was banned at that time in TLS certificates but not for other aspects of TLS.

Researchers have devised attacks taking advantages of these weaknesses. Such techniques are called Security Losses from Obsolete and Truncated transcript Hashes, or SLOTH. With significant but easily obtainable (approximately 50 cores) computing power, impersonation attacks can be conducted on TLS-based web sites and applications.

In other posts, we’ll talk about functions that were developed after MD5 as well as some crypto algorithms in particular.