**Understanding cryptographic hash functions**

Hash functions are part of our daily lives even if we don’t realize it. For example, they are used when we create a password on the Internet, when we electronically sign data, or when we download certain files.

We have discussed the notion of cryptographic hash functions in our previous articles about the technologies used by XSL Labs, notably IPFS^{i} and public-key cryptography^{ii}. In this blog post, we will take a deeper look at cryptographic hash functions to understand their usefulness and properties, and review the main functions used so far.

A hash function is a function that calculates a digital fingerprint from data.

In the example above, the blue text goes through a hash function (namely SHA-1) and becomes a string of characters in the green frame: the hash.
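To make this concrete, here is a minimal sketch that computes a SHA-1 fingerprint with Python’s standard `hashlib` module. The sample text and the `fingerprint` helper name are our own assumptions for illustration; they do not come from the example pictured in this article.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Return the SHA-1 hash of a text as 40 hexadecimal characters."""
    # Hash functions work on bytes, so the text is encoded first.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Hypothetical sample text; any string works.
print(fingerprint("Hello, XSL Labs!"))

# A well-known reference value: the SHA-1 hash of "abc".
print(fingerprint("abc"))  # a9993e364706816aba3e25717850c26c9cd0d89d
```

Whatever the input, the result is always 160 bits long, that is, 40 hexadecimal characters.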

Cryptographic hash functions are one-way functions, meaning they are functions that are almost impossible to reverse. The original text cannot be found from the hash.

One may then wonder what use these functions and hash values have because, at first sight, these sequences of characters do not make any sense.

The first use of these hash values is to identify the original text easily.

To explain this, let’s take the example of password storage on the Internet.

We frequently talk about database hacking on our blog, but did you know that in 2021 some websites still store your passwords “in clear-text”? When you create an account or change your password, it is written as is into the database of the site or service in question.

A database with passwords in clear-text is an ideal target for cybercriminals, since it contains pairs of identifiers and passwords that are unencrypted and thus directly readable.

Several solutions exist to prevent this, such as the use of encryption with a specific key or, to be more secure, the use of a cryptographic hash function.

If a cryptographic hash function is used, the password is transformed by a hash function to become a hash. It is this hash that will be stored in the database.

In this case, the password is not stored in clear-text, so even if the database is breached by cybercriminals, the password will not be revealed or easily discovered. On the user side, it is just as easy to log in once the account has been created: when the user enters their username and password, the password goes through the same hash function and the hash obtained is compared with the hash recorded in the database. If the hash values match, the connection is established. If they don’t match, the password entered was not the correct one and the connection is refused.
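The registration and login steps described above can be sketched as follows. Please note that this sketch swaps in a technique the text does not describe: instead of a single bare hash, it uses PBKDF2 with a random salt (both available in Python’s standard library), which is a common precaution for stored passwords. The function names and the iteration count are our own assumptions, not a description of any particular website.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; real deployments tune this value.
ITERATIONS = 100_000

def register(password: str):
    """Return the (salt, hash) pair that is stored instead of the password."""
    salt = os.urandom(16)  # random salt, stored next to the hash
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Hash the submitted password the same way and compare with the stored hash."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    # compare_digest compares in constant time to avoid leaking information.
    return hmac.compare_digest(candidate, stored_digest)

salt, stored = register("correct horse battery staple")
print(verify("correct horse battery staple", salt, stored))  # True
print(verify("wrong guess", salt, stored))                    # False
```

The database only ever holds the salt and the hash, never the password itself.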

Cryptographic hash functions have several characteristics, namely:

- The first concerns the length of the hash: whatever the size of the initial file, the hash will have a fixed length, which is measured in bits.
- The second, called pre-image resistance, concerns irreversibility: it is almost impossible to find the initial data from the hash. This is why we describe it as a one-way function.
- The third characteristic is determinism: the same data always produces the same hash. This property is very useful, as the password example above shows, but also to check that a downloaded file has been fully transmitted without errors or modifications. For example, Ubuntu, which makes its operating systems available for download, offers a “SHA256 checksum” verification system that allows you to compare the hash of the file you’ve downloaded with the published hash:

If the message obtained corresponds to the last line, the hash values are indeed the same and we can conclude that there was no error when downloading the file. The installation of the operating system will suffer no corruption issues.
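For readers who want to run this kind of check themselves, here is a minimal sketch in Python. The helper names, the file path and the expected value are placeholders of our own (assumptions); the real checksum must be copied from the publisher’s website.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hash of a file, reading it in chunks of 1 MiB."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_checksum(path: str, expected_hex: str) -> bool:
    """Compare the computed hash with the one published by the distributor."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Usage (hypothetical file name and checksum):
# matches_published_checksum("downloaded.iso", "<checksum from the website>")
```

Reading the file in chunks keeps memory usage low even for images of several gigabytes.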

- Along the same lines, the fourth characteristic, known as the avalanche effect, implies that even a minor change in the initial data will greatly alter the hash.

Let’s take for example the creation of hash values via the SHA-256 algorithm for the following text:

If we add a character in this text (a comma, for example), we’ll get a radically different hash.

This property makes tampering evident: any modification to the original data produces a different hash, which no longer corresponds to the expected hash at all.
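The avalanche effect can be observed with a few lines of Python. In this sketch, the two sentences and the helper names are our own assumptions; they are not the text used in the example above. We count how many of the 256 output bits change when a single comma is added.

```python
import hashlib

def sha256_as_int(text: str) -> int:
    """Return the SHA-256 hash of a text as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")

def differing_bits(a: str, b: str) -> int:
    """Count the bits that differ between the SHA-256 hashes of two texts."""
    # XOR leaves a 1 wherever the two hashes disagree.
    return bin(sha256_as_int(a) ^ sha256_as_int(b)).count("1")

original = "The quick brown fox jumps over the lazy dog"
modified = "The quick brown fox, jumps over the lazy dog"  # one comma added

print(differing_bits(original, modified), "of 256 bits differ")
```

For a well-designed hash function we expect roughly half of the bits to flip, even though the inputs are almost identical.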

- Finally, the fifth characteristic is resistance to collisions. To explain this property, consider the “birthday paradox”. This probability problem, attributed to Richard von Mises, calculates the probability that two people in a group have the same birthday. For a year of 365 days (and therefore 365 possible birthdays), a group of only 23 people is enough for the probability of two people sharing a birthday to exceed 50%. One might think this probability should be much lower, which is why this problem is called a “paradox”. Furthermore, if the group has 57 people or more, there is more than a 99% chance that two people will share the same birthday.

*The table above shows for a number of N people, the different probabilities that two people have the same birthday.*
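These probabilities can be recomputed with a short Python sketch. It assumes, as the classic problem does, that all 365 birthdays are equally likely; the function name is our own. The idea is to compute the probability that everyone has a different birthday, then take the complement.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Probability that at least two people in the group share a birthday."""
    if people > days:
        return 1.0  # more people than days: a shared birthday is certain
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

for n in (10, 23, 57):
    print(n, "people:", round(shared_birthday_probability(n) * 100, 1), "%")
```

Running it confirms the figures quoted above: the probability passes 50% at 23 people and 99% at 57 people.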

There is therefore a risk of collision which must be taken into account when designing cryptographic hash functions, so that they resist these probabilities. Different data must not give the same hash. Otherwise, it would be possible to modify the initial data and still get the same hash: we could no longer verify the integrity of the data with its hash.

In practice, however, it is always theoretically possible to generate data and hash values until a collision is found. To re-use the birthday example: we are limited to 365 days a year, so if the group has 366 people, we are sure that there will be at least one collision. The same reasoning applies to hash values: whatever the hash function, there will always be more possible initial data than possible hash values.

As such, the impossibility of finding different data with the same hash is understood as an impossibility with regard to existing material resources: a reliable cryptographic hash function prevents even the most powerful computers from calculating collisions, meaning it would be so costly in time and resources that there would be no point in attempting it.
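To get a feel for why the length of the hash matters, the sketch below searches for a collision on a deliberately weakened function: SHA-256 truncated to its first 24 bits. The truncation, the helper names and the choice of inputs (successive integers) are all our own assumptions for demonstration purposes; this is not an attack on SHA-256 itself.

```python
import hashlib

def truncated_hash(data: bytes, bits: int) -> int:
    """Keep only the first `bits` bits of the SHA-256 hash of `data`."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - bits)

def find_collision(bits: int = 24):
    """Hash successive inputs until two of them share a truncated hash."""
    seen = {}  # truncated hash -> input that produced it
    counter = 0
    while True:
        data = str(counter).encode("ascii")
        h = truncated_hash(data, bits)
        if h in seen:
            return seen[h], data, counter + 1
        seen[h] = data
        counter += 1

first, second, attempts = find_collision(24)
print(first, second, "collide after", attempts, "attempts")
```

As with birthdays, a collision is expected after roughly the square root of the number of possible hash values: a few thousand attempts for 24 bits, found in a fraction of a second. For the full 256 bits, the same reasoning gives about 2^128 attempts, which is far beyond existing material resources.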

Now that we’ve explained the different characteristics of cryptographic hash functions, let’s go through the most common ones.

The first is called MD5, for “Message Digest 5”, and generates 128-bit hash values. It was invented by Ronald Rivest in 1991 and standardized in 1992^{iii}. This cryptographic hash function is no longer considered secure today. In 2004, an attack found a collision in less than an hour, and in 2006, it took just a few minutes to create one on a laptop^{iv}. MD5 is no longer used today to secure data (or at least it shouldn’t be) because it is too easy for cybercriminals to modify a file while keeping the same hash. However, it can still be used to detect accidental errors, for example when copying a file.

In 1995, the National Institute of Standards and Technology (NIST) presented SHA-1 (for Secure Hash Algorithm), designed by the NSA to fix the problems associated with MD5. This cryptographic hash function long resisted attacks, but since 2011 it has no longer been considered sufficiently secure by NIST. In 2017, a team of researchers from CWI Amsterdam and Google succeeded in creating two PDF files that generate the same hash value via the SHA-1 function^{v}. This deliberate collision demonstrated the vulnerability of the hash function once and for all.

In 2002, NIST standardized a new family of hash functions designed by the NSA, known as SHA-2, which includes SHA-256 and SHA-512 and allows the creation of 256- and 512-bit hash values. The SHA-2 standard is still considered secure today. The SHA-256 function is notably used by Bitcoin to verify network transactions.
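As a quick recap, the output lengths of the functions mentioned in this article can be checked with Python’s `hashlib` module. The helper name is our own assumption, and note that some hardened Python builds disable MD5, in which case that line would raise an error.

```python
import hashlib

def digest_bits(algorithm: str) -> int:
    """Return the length in bits of the hash produced by the given algorithm."""
    # digest_size is expressed in bytes, hence the multiplication by 8.
    return hashlib.new(algorithm, b"abc").digest_size * 8

for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, digest_bits(name), "bits")
```

This prints 128, 160, 256 and 512 bits respectively, regardless of the size of the data that was hashed.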

XSL Labs will use this hash function, for example, in the context of VC Traces, as explained in our whitepaper^{vi}. The hash will need to be created by the VC issuer when sending Verifiable Credentials to the SDI subject.

The VC Trace will be inserted in the SDI smart contract, which will enable the verification of its timestamp and the identity of its issuer, and subsequently the verification of the issuer’s public profile.

It will also allow authentication of the data transmitted and verification that it has not been tampered with, consequently ensuring the chain of trust within the SYL ecosystem.

i Visit our blog for more information about IPFS technology: https://www.xsl-labs.org/blog/ipfs-revolution/

ii Visit our blog for more information about public-key cryptography: https://www.xsl-labs.org/blog/the-role-of-public-key-cryptography-in-the-syl-ecosystem/

iii See https://www.rfc-editor.org/rfc/rfc1321.txt

iv An archive of it is available here: https://eprint.iacr.org/2006/105

v See https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

vi Visit this link to read XSL Labs’ whitepaper: https://www.xsl-labs.io/whitepaper/white_paper_en.pdf