find_similar - Look for similar files
find_similar [-nlines] [-cchars] [-Csize] [-xlistfile] [-ip] [file ...]
Reads each file and internally stores a signature for each sequence of
lines lines (10 by default).
Reports the locations of any signatures that occur multiple times, whether
within the same file or between distinct files.
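The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the tool's implementation: the function name `find_repeats` is hypothetical, Python's built-in `hash` stands in for the real signature algorithm, and it is assumed that a signature is taken at every line offset.

```python
from collections import defaultdict

def find_repeats(files, lines_per_signature=10):
    """Map each repeated signature to the places where it occurs.

    `files` maps a file name to its list of lines. Each place is a
    (file name, 1-based number of the first line in the sequence) pair.
    """
    seen = defaultdict(list)
    for name, lines in files.items():
        # Assumption: one signature per starting line, as long as a full
        # sequence of `lines_per_signature` lines is available.
        for start in range(len(lines) - lines_per_signature + 1):
            window = tuple(lines[start:start + lines_per_signature])
            # hash() is only a stand-in for the real signature function.
            seen[hash(window)].append((name, start + 1))
    # Keep only signatures that occur more than once, whether within
    # one file or across distinct files.
    return {sig: places for sig, places in seen.items() if len(places) > 1}
```

Calling `find_repeats({"a": a_lines, "b": b_lines}, 3)` would return one entry per three-line sequence that appears in more than one place.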
If listfile is specified, then that file is read, and each of its
lines is taken as the path of a file to process.
If listfile is -, then the list of files is read from standard input.
If neither listfile nor file is specified, then the standard input is
processed.
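The input selection rules above could be modelled like this. It is a sketch under assumptions: `resolve_inputs` and `STDIN_CONTENT` are hypothetical names, empty lines in the list are skipped, and paths from listfile are placed before the file arguments; the source does not specify either of the last two points.

```python
import sys

# Hypothetical marker meaning "process the contents of standard input".
STDIN_CONTENT = None

def resolve_inputs(listfile, files, stdin=None):
    """Return the list of inputs to process, in order."""
    stdin = sys.stdin if stdin is None else stdin
    if listfile is None and not files:
        # Neither listfile nor file given: standard input is the data.
        return [STDIN_CONTENT]
    paths = []
    if listfile is not None:
        if listfile == "-":
            # The list of paths itself comes from standard input.
            entries = stdin.read().splitlines()
        else:
            with open(listfile, encoding="utf-8") as handle:
                entries = handle.read().splitlines()
        # Assumption: empty lines in the list are skipped.
        paths.extend(entry for entry in entries if entry)
    paths.extend(files)
    return paths
```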
Each subsequent occurrence of the same inode (whether or not it was reached
by the same path) results in a warning rather than processing the file.
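One way to detect a repeated inode is to remember the identity of every file already visited. The sketch below is an assumption about how this might be done, not the tool's code: `split_by_first_visit` is a hypothetical name, and pairing the inode with the device number is an added precaution that the source does not mention.

```python
import os

def split_by_first_visit(paths):
    """Split paths into those seen for the first time and repeats."""
    seen = set()
    fresh, repeated = [], []
    for path in paths:
        info = os.stat(path)
        # Assumption: inode numbers are only unique per device, so the
        # device number is included in the key.
        key = (info.st_dev, info.st_ino)
        if key in seen:
            repeated.append(path)  # would trigger a warning
        else:
            seen.add(key)
            fresh.append(path)     # would be processed
    return fresh, repeated
```

A path listed twice, or a second path that leads to the same file, ends up in the second list.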
Blank lines are not considered in computing signatures or determining the
line length of a sequence.
Any sequence of whitespace in a line is considered
a single space for the purposes of computing signatures.
Leading whitespace on each line is ignored completely.
Each line is treated as ending in a space for the purposes of computing
signatures.
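The whitespace rules above amount to a small normalization step applied before signatures are computed. A minimal sketch, with the hypothetical names `normalize_line` and `normalize_lines`, and assuming that trailing whitespace merges with the space that ends each line:

```python
def normalize_line(line):
    """Return the line as it would be seen by the signature code.

    Returns None for a blank line, which takes no part in signatures
    or in the line count of a sequence.
    """
    words = line.split()  # drops leading whitespace, splits on any run
    if not words:
        return None
    # Runs of whitespace become one space; the line ends in a space.
    return " ".join(words) + " "

def normalize_lines(text):
    """Normalize every non-blank line of a block of text."""
    normalized = (normalize_line(line) for line in text.splitlines())
    return [line for line in normalized if line is not None]
```

With these rules, two files that differ only in indentation, in spacing between words, or in blank lines produce the same normalized lines.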
- -n lines, --lines lines
-
Set the minimum length in lines of each signature.
The default is 10.
This is ignored for the first line in each file.
- -c chars, --chars chars
-
Set the minimum length in characters of each signature.
The default is 256.
This is ignored for the first line in each file.
Signatures are always computed based on whole lines, so even if more than
lines lines have been accumulated when chars characters is reached, the
signature is not computed until the next end of line.
- -C size, --file-chars size
-
A signature for the first line is computed only if the file size
(in characters) exceeds size or chars.
The default is 64.
size has no effect if it is greater than chars.
- -x listfile, --xargs listfile
-
Read a list of files to process from listfile.
- -i, --ignore-case
-
Convert all letters to lowercase for the purpose of computing signatures.
- -p, --punctuation
-
Insert spaces around every punctuation character for the purpose of
computing signatures.
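The interaction of these options can be sketched as follows. Everything here is illustrative: `prepare` and `signatures` are hypothetical names, BLAKE2b truncated to 64 bits stands in for the real signature algorithm, only ASCII punctuation is handled, the order in which -i and -p are applied is assumed, and the special handling of the first line (-C) is left out.

```python
import hashlib
import string

def prepare(line, ignore_case=False, punctuation=False):
    """Apply -i and -p to one non-blank line, then collapse whitespace."""
    if ignore_case:
        line = line.lower()
    if punctuation:
        # Surround each (ASCII) punctuation character with spaces.
        line = "".join(f" {ch} " if ch in string.punctuation else ch
                       for ch in line)
    return " ".join(line.split()) + " "

def signatures(lines, min_lines=10, min_chars=256):
    """Return (start index, lines used, hex signature) for each start.

    A sequence grows one whole line at a time until it holds at least
    min_lines lines and at least min_chars characters.
    """
    result = []
    for start in range(len(lines)):
        hasher = hashlib.blake2b(digest_size=8)  # stand-in, 64 bits
        count = chars = 0
        for line in lines[start:]:
            hasher.update(line.encode("utf-8"))
            count += 1
            chars += len(line)
            if count >= min_lines and chars >= min_chars:
                result.append((start, count, hasher.hexdigest()))
                break
    return result
```

Because the check happens only after a whole line has been added, a sequence can end up longer than either minimum, which matches the note under -c.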
The exit status is 0 for no duplicate signatures, 1 for duplicate signatures
found, or 2 for trouble.
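A calling script can branch on the exit status. The sketch below assumes that find_similar is on the PATH and passes only file arguments; `interpret_exit` and `run_find_similar` are hypothetical helper names, and the tool's output is returned unparsed because its format is not described here.

```python
import subprocess

MEANINGS = {
    0: "no duplicate signatures",
    1: "duplicate signatures found",
    2: "trouble",
}

def interpret_exit(code):
    """Translate an exit status into the meaning documented above."""
    return MEANINGS.get(code, f"unexpected status {code}")

def run_find_similar(paths):
    """Run find_similar on the given files (assumes it is installed)."""
    completed = subprocess.run(
        ["find_similar", *paths], capture_output=True, text=True
    )
    return interpret_exit(completed.returncode), completed.stdout
```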
-
It is always possible (however unlikely) that two distinct sequences will
have the same signature. That is called aliasing.
It would lead to spurious matches.
Since the current signature algorithm is not cryptographically strong, it
is not difficult to make this happen if you try, but it is highly unlikely
to occur accidentally.
-
It is possible for a discontiguous match to be reported
as a contiguous match.
For example (assuming chars=lines=3), the sequence
``A B C D E F G H I J'' (where each letter represents a new line) would appear
as a contiguous match against a previous sequence
``A B C D E F G X X X X F G H I J''.
This does not occur frequently in practice.
Fixing it would require the number of blank lines that occur after
each sequence to be stored along with the signatures, and that would increase
the database size by 25%.
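The example can be checked with a short sketch. It models only the line minimum (each letter stands for one line), compares three-line sequences directly instead of hashing them, and assumes a signature is taken at every line offset; `windows` is a hypothetical helper.

```python
def windows(lines, size=3):
    """Every run of `size` consecutive lines, in order."""
    return [tuple(lines[i:i + size]) for i in range(len(lines) - size + 1)]

previous = "A B C D E F G X X X X F G H I J".split()
current = "A B C D E F G H I J".split()

known = set(windows(previous))
# True for each three-line sequence of `current` already seen in `previous`.
matches = [window in known for window in windows(current)]
all_match = all(matches)
```

Every sequence in the second input is found in the first, so nothing in the signatures alone shows that the matched region in the first input is interrupted by the X lines.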
-
It might make sense to use an entirely different hashing algorithm.
I think we should stick to at least 64 bits, so that the probability of
aliasing is negligible, even if the aggregate file size is many gigabytes.
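The effect of signature width can be estimated with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n signatures of b bits. The sketch assumes an ideal, uniformly distributed hash, which the current algorithm is not claimed to be; `aliasing_probability` is a hypothetical name.

```python
import math

def aliasing_probability(signature_count, bits=64):
    """Approximate chance that any two distinct sequences alias.

    Uses the birthday approximation and assumes signatures are
    uniformly distributed over 2**bits values.
    """
    exponent = signature_count * (signature_count - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-exponent)
```

Comparing `aliasing_probability(n, bits=32)` with `aliasing_probability(n, bits=64)` for a realistic number of signatures shows why narrower signatures are not attractive.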
post_find_similar(1)