Using SHA1 hashes to identify files

June 1, 2013 - by Patrick van Bergen

The question posed here is: is it safe to use the SHA1 hash of a file to identify it?

Let's start by explaining why you would want to use the SHA1 hash of a file. Suppose you want to know if a given file has already been stored on your system. Just checking the name is not enough, because different files may have the same name. Also, the file may have been stored under a different name before. What you really want is to compare the contents of the file with those of all other files on the system, but comparing byte by byte would take far too much time. The solution is to compute a hash code for each new file and check if that hash code has been used before.
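The check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `sha1_of_file` and `store_if_new` and the in-memory dictionary of known hashes are hypothetical, and a real system would probably keep the hashes in a database.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA1 digest of the file at `path`."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_if_new(path, known_hashes):
    """Register the file under its hash.

    Returns True if the contents were not seen before, False otherwise.
    `known_hashes` maps a SHA1 hex digest to the first path stored with it.
    """
    file_hash = sha1_of_file(path)
    if file_hash in known_hashes:
        return False
    known_hashes[file_hash] = path
    return True
```

Two files with the same contents but different names produce the same digest, so the second one is recognized as already stored.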

SHA1 is a 160-bit message digest, which allows for 2^160 (about 10^48) different hashes, an astronomical number. Following the birthday paradox, the chance that a new file entering your file system has the same SHA1 as one of the existing files (a so-called hash collision) depends on the number of files present. It can be approximated using the formula

p = 1 - e^(-n^2 / (2m))

where

n = the number of files in your system, and
m = the number of possible hash values (about 10^48)

Strictly speaking, this is the chance that any two of the n files share a hash, so it is a safe upper bound for the chance that one new file collides with an existing one.


Now say there are 1,000,000,000 files in your system. That's probably too many, but you should be prepared for the future. The odds that the SHA1 of a new file is the same as that of one of the existing files while its contents are different are 1 - e^(-(10^9) * (10^9 / (2 * 10^48))) = 5 * 10^-31, or

1 in 2,000,000,000,000,000,000,000,000,000,000

This number is so large that it does not need to concern us. For comparison, it exceeds the number of seconds the universe has existed (roughly 4 * 10^17) by a factor of several trillion.
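The calculation above is easy to reproduce. The sketch below evaluates the approximation 1 - e^(-n^2 / (2m)) in Python; the function name `collision_probability` is made up for this example. One practical detail: with ordinary floating point numbers, `1 - math.exp(-x)` rounds to exactly 0 for values this small, so the sketch uses `math.expm1` instead.

```python
import math


def collision_probability(n, m):
    """Birthday approximation 1 - e^(-n^2 / (2m)) for n files and m hash values."""
    # expm1(x) computes e^x - 1 without losing precision for tiny x,
    # so -expm1(-x) equals 1 - e^(-x).
    return -math.expm1(-(n * n) / (2 * m))


n = 10**9  # one billion files

p_rounded = collision_probability(n, 10**48)  # m rounded to 10^48, as in the text
p_exact = collision_probability(n, 2**160)    # m = 2^160, the exact number of SHA1 values

print(p_rounded)
print(p_exact)
print(1 / p_rounded)
```

With m rounded to 10^48 this gives about 5 * 10^-31, the figure used above. With the exact value m = 2^160 the result is somewhat smaller, about 3.4 * 10^-31, so the rounding errs on the safe side.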

The chance of an accidental SHA1 hash collision is often confused with the effort needed to create a hash collision deliberately, which is calculated to take 10^18 (or even a little less) attempts. But this number does not affect the statistical properties of the algorithm.

If you happen to find two files that are distinct but share a hash, ask the owner of the two files if it is ok to publish them, since they are unique!
