Even powerful computers require a great deal of time to compute a digital signature. In addition, large documents would need many signatures, since the document to be signed cannot be arbitrarily long. A trick is therefore used: the document is first compressed to a much shorter, fixed-length value, and the signature is then computed over this compressed value. It does not matter whether the compression can be reversed, since the value can always be reproduced from the original document. The functions used for this type of computation are called one-way hash functions.

Generally speaking, a one-way hash function derives a fixed-length value from a variable-length document in such a way that this value represents the original content of the document in compressed form but cannot be used to reconstruct the document. In the smart card domain, these functions are used exclusively to compute the input values for digital signatures. If the length of the document is not a multiple of the block length used by the hash function, it must be padded appropriately.

For a hash function to be effective, it must exhibit certain properties. The result must have a fixed length, so that it can readily be used by signature algorithms. Since large quantities of data normally have to be processed, the hash function must have a high throughput, and it must be easy to compute the hash value. By contrast, it should be difficult, or better yet practically impossible, to derive the original document from a known hash value. Finally, the hash function must be collision-resistant: for a given document, it should not be easy to find a second document that yields the same hash value. Nevertheless, other documents with the same hash value certainly exist. This is only natural, since all possible messages, of every length from zero upwards, are mapped onto a set of hash values having the same fixed length. 
An unavoidable consequence of this is that collisions will occur. That is why the term ‘collision-resistant’ is used, rather than ‘collision-free’.
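The fixed-length property described above can be illustrated with a short sketch. The example below uses SHA-1 from Python's standard hashlib module; the choice of hash function here is purely illustrative:

```python
import hashlib

# Regardless of the document's length, the value that is actually
# signed is the fixed-length hash, not the document itself.
short_doc = b"a short message"
long_doc = b"x" * 1_000_000  # a very large document

h_short = hashlib.sha1(short_doc).digest()
h_long = hashlib.sha1(long_doc).digest()

# Both hash values have the same fixed length (160 bits = 20 bytes).
assert len(h_short) == len(h_long) == 20

# The hash value can always be reproduced from the original document,
# so the signature remains verifiable even though the "compression"
# itself cannot be reversed.
assert hashlib.sha1(short_doc).digest() == h_short
```

The same one-way property means that knowing `h_long` gives an attacker no practical way to recover the million-byte document it was derived from.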

What is the effect of a collision? There will be two different documents with the same hash value, and thus the same digital signature. This has the fatal consequence of making the signature worthless, since it would be possible to alter the document without anyone being able to detect the fact.

This is precisely what is involved in one of the two typical attacks on hash functions, which consists of systematically searching for a second document that has the same hash value as the original document. If the content of this second document makes sense, the digital signature derived from the hash value is discredited. Since the two documents are interchangeable, the signature is worthless. After all, it makes an enormous difference whether a house purchase contract is for €10,000 or €750,000.

The second type of attack on a hash value is somewhat subtler. In this case, two documents with identical hash values but different contents are prepared in advance. This is not particularly difficult, since inconspicuous variations such as extra spaces, special symbols and alternative formulations make it possible to generate a vast number of equivalent versions of a document. The result is that a single digital signature is valid for both documents, and it is impossible to prove which document was originally signed.

Finding two documents with the same hash value is not as difficult as it might seem. It is possible to exploit the birthday paradox, which is well known in probability theory. This paradox involves two questions. The first question is: how many people must be in a room for the probability to be greater than 50% that one of them has the same birthday as the person asking the question? The answer follows directly, since only the questioner's birthday needs to be compared with that of everyone else in the room: the probability of at least one match among n other people is 1 − (364/365)^n, which first exceeds 50% at n = 253.

The second question reveals the paradox, or rather the surprising result of the comparison. This question is: how many people must be in a room for the probability to be greater than 50% that two of them have the same birthday? The answer is only 23 people. The reason is that although only 23 people are present, they form a total of 253 pairs for comparing birthdays, and the probability that two people have the same birthday is based on these pairs.

Precisely this paradox is exploited in attacking a hash function. It is much easier to create two documents that have the same hash value than it is to modify a document until it yields a given hash value. The consequence is that the results of hash functions must be large enough to successfully foil both types of attack. Most hash functions thus produce values that are at least 128 bits long, which is presently generally considered to be adequate with regard to the two types of attack just described.

Many different hash functions have been published up to now, and some of them are also defined in standards. However, these functions are frequently modified as a consequence of the discovery of a successful form of attack. Table 4.19 provides a short summary of the hash functions currently in common use. Unfortunately, a description of their internal operation is beyond the scope of this book.
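The birthday-paradox arithmetic, and its consequence for short hash values, can be sketched as follows. The truncation of SHA-1 to 24 bits is purely an illustrative assumption, chosen so that the collision search completes in a fraction of a second; against a full-length hash value the same search would be infeasible:

```python
import hashlib

# Probability that among n people at least one pair shares a birthday:
def p_any_pair(n: int) -> float:
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (365 - k) / 365
    return 1 - p_distinct

# 23 people form 23 * 22 / 2 = 253 pairs, and the probability of a
# shared birthday already exceeds 50 %:
assert 23 * 22 // 2 == 253
assert p_any_pair(22) < 0.5 < p_any_pair(23)

# The same effect makes collisions cheap to find when the hash value
# is short. Here SHA-1 is truncated to its first 3 bytes (24 bits):
def truncated_hash(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()[:3]

seen = {}
collision = None
for i in range(1_000_000):
    msg = str(i).encode()
    h = truncated_hash(msg)
    if h in seen:
        collision = (seen[h], msg)
        break
    seen[h] = msg

# A 24-bit hash has 2**24 possible values, yet by the birthday paradox
# a collision is expected after only about 2**12 = 4096 attempts.
assert collision is not None
assert collision[0] != collision[1]
assert truncated_hash(collision[0]) == truncated_hash(collision[1])
```

This is why the square root of the hash value space, not the space itself, determines the cost of the second type of attack: a 128-bit hash value offers only about 64 bits of collision resistance.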

The ISO/IEC 10118-2 standard specifies a hash function based on an n-bit block-encryption algorithm (e.g. DES). With this algorithm, the length of the hash value may be n or 2n bits. The MD4 (message digest 4) hash function, which is now rarely used, and its successor MD5 were published by Ronald L. Rivest in 1990–1991. They are based on a standalone algorithm, and both functions generate a 128-bit hash value. In 1992, the NIST published a hash function for the DSS algorithm that is known as SHA. After the discovery of certain weaknesses, it was modified, and the resulting function has been known since mid-1995 as SHA-1. It is also standardized under the name FIPS 180-1.

Since data transmission to smart cards is generally slow, the hash function is executed in the terminal or in a computer connected to the terminal, rather than in the card. The drawback of hashing outside the card is balanced by the fact that this makes the hash function easily interchangeable. Besides, in most cases memory limitations would prevent hash functions from being stored in the cards, since the program size is almost always around 4 kB of assembler code. The throughput of typical hash functions is very high relative to the demands placed on them: with an 80386 computer running at 33 MHz it is usually at least 300 kB/s, and it lies in the range of 4 to 8 MB/s with a 200-MHz Pentium PC.
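As a small illustration of the output lengths just mentioned, both MD5 and SHA-1 are available in Python's standard hashlib module:

```python
import hashlib

doc = b"an example document"

# MD5 produces a 128-bit (16-byte) hash value, while SHA-1
# produces a 160-bit (20-byte) value.
md5_value = hashlib.md5(doc).digest()
sha1_value = hashlib.sha1(doc).digest()

assert len(md5_value) * 8 == 128
assert len(sha1_value) * 8 == 160
```

The extra 32 bits of SHA-1's output are significant precisely because of the birthday paradox: collision resistance grows with the square root of the output space, so every additional bit of hash length counts double against an attacker searching for collisions.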