The question posed here is: is it safe to use the SHA1 hash of a file to identify it?
Let's start by explaining why you would want to use the SHA1 of a file. Suppose you want to know whether a given file has already been stored on your system. Checking the name alone is not enough, because different files may have the same name, and the same file may have been stored under a different name before. You could compare the contents of the file with every other file on the system, but that would take far too much time. The solution is to compute a hash code for each new file and check whether that hash code has been seen before.
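As a rough illustration of this idea, here is a minimal Python sketch. The names `sha1_of_file` and `FileIndex` are made up for this example, and the in-memory dictionary stands in for whatever storage a real system would use.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class FileIndex:
    """Hypothetical in-memory index mapping SHA1 digests to stored paths."""

    def __init__(self):
        self._by_hash = {}

    def add(self, path):
        """Register `path`.

        Returns the path of an earlier file with the same SHA1,
        or None if this hash has not been seen before.
        """
        key = sha1_of_file(path)
        existing = self._by_hash.get(key)
        if existing is None:
            self._by_hash[key] = path
        return existing
```

Calling `index.add(path)` for every incoming file then costs one pass over that file plus one dictionary lookup, regardless of how many files are already stored.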
SHA1 is a 160-bit message digest, which allows for 2^160 (about 1.46 * 10^48, or roughly 10^48) different hashes, an astronomical number. Following the birthday paradox, the chance of a hash collision, meaning two different files on your system ending up with the same SHA1, depends on the number of files present. It can be approximated using the formula
p = 1 - e^(-n^2 / (2 * m))
where
n = the number of files in your system, and
m = the number of possible hash values (about 10^48)
Now say there are 1,000,000,000 (10^9) files in your system. That's probably far more than you have, but you should be prepared for the future. The odds that a collision occurs among them, which also bound the chance that a new file has the same SHA1 as an existing file with different contents, are 1 - e^(-(10^9) * (10^9 / (2 * 10^48))) = 5 * 10^-31, or about
1 in 2,000,000,000,000,000,000,000,000,000,000
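The arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `collision_probability` is made up for this example, and it uses `math.expm1` because a naive `1 - math.exp(-x)` rounds to exactly 0.0 in double-precision floating point when x is this small.

```python
import math


def collision_probability(n, m):
    """Birthday approximation 1 - e^(-n^2 / (2m)) for n items and m possible hashes."""
    x = (n * n) / (2 * m)
    # expm1(-x) computes e^(-x) - 1 without losing precision for tiny x.
    return -math.expm1(-x)


n = 10**9

# Using the rounded value m = 10^48 from the text (about 5e-31).
p_rounded = collision_probability(n, 10**48)

# Using the exact number of SHA1 values, m = 2^160 (slightly smaller result).
p_exact = collision_probability(n, 2**160)

print("p with m = 10^48:", p_rounded)
print("p with m = 2^160:", p_exact)
print("1 in", 1 / p_rounded)
```

Because the exponent is so small, the result is practically equal to n^2 / (2 * m) itself.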
This number is so large that the risk need not concern us. For comparison, it exceeds the number of seconds the universe has existed (roughly 4 * 10^17) by a factor of more than a trillion.
The chance of an accidental SHA1 hash collision is often confused with the effort needed to create a collision deliberately, which has been calculated to take about 10^18 attempts (or even a little less). But that number concerns deliberately crafted files; it does not affect the statistical properties of the algorithm.
If you happen to find two files that are distinct but share a hash, ask the owner of the two files if it is ok to publish them, since they are unique!