NAME

find_similar - Look for similar files


SYNOPSIS

find_similar [-nlines] [-cchars] [-Csize] [-xlistfile] [-ip] [file ...]


DESCRIPTION

Reads each file and internally stores a signature for each sequence of lines consecutive lines, where lines is the value given with -n (10 by default). Reports the locations of any signatures that occur more than once, whether within a single file or across distinct files.

If listfile is specified, then that file is read, and each of its lines is taken as the path of a file to process. If listfile is -, then the list of files is read from standard input. If neither listfile nor file is specified, then the standard input is processed.
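The input-selection rules above can be sketched in Python. This is only an illustration, not the tool's actual implementation: the function name resolve_inputs, the choice to append listfile entries after the command-line files, and the skipping of empty lines in the list are all assumptions.

```python
import io
import sys

def resolve_inputs(files, listfile=None, stdin=None):
    """Return the paths to process, or None meaning 'process standard input'.

    Hypothetical helper: ordering and blank-line handling are assumptions.
    """
    stdin = sys.stdin if stdin is None else stdin
    paths = list(files)
    if listfile == "-":
        # The list of files itself is read from standard input.
        paths += [line.rstrip("\n") for line in stdin if line.strip()]
    elif listfile is not None:
        # Each line of listfile is taken as the path of a file to process.
        with open(listfile) as fh:
            paths += [line.rstrip("\n") for line in fh if line.strip()]
    elif not paths:
        # Neither listfile nor file was given: the data comes from stdin.
        return None
    return paths

# Example: one file on the command line plus a list supplied on stdin.
example = resolve_inputs(["a.c"], "-", io.StringIO("b.c\nc.c\n"))
```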

If a file's inode has already been seen (whether or not it was reached by the same path), a warning is issued and the file is not processed again.
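One plausible way to detect repeated inodes is sketched below. It assumes that a file is identified by its (device, inode) pair as reported by os.stat; the text above mentions only the inode, so the device number, the function name unique_files, and the warning wording are assumptions.

```python
import os
import sys

def unique_files(paths):
    """Yield each path whose (device, inode) pair has not been seen before."""
    seen = set()
    for path in paths:
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)  # assumption: device + inode identify a file
        if key in seen:
            # Warn and skip rather than processing the same file twice.
            print(f"warning: {path}: file already processed", file=sys.stderr)
            continue
        seen.add(key)
        yield path
```

Because the key comes from the file itself rather than from its name, hard links and repeated paths are both caught.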

Blank lines are not considered in computing signatures or determining the line length of a sequence. Any sequence of whitespace in a line is considered a single space for the purposes of computing signatures. Leading whitespace on each line is ignored completely. Each line is treated as ending in a space for the purposes of computing signatures.
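The normalization rules above might look roughly like the following. This is a sketch under assumptions: the function name normalize_lines is hypothetical, and trailing whitespace is assumed to merge into the single terminating space rather than producing two spaces.

```python
def normalize_lines(text):
    """Return the lines of text as they would be seen for signature purposes."""
    normalized = []
    for raw in text.splitlines():
        # split() with no argument splits on runs of whitespace and drops
        # leading and trailing whitespace.
        words = raw.split()
        if not words:
            continue  # blank lines are not considered at all
        # Single spaces between words, and a space standing in for end of line.
        normalized.append(" ".join(words) + " ")
    return normalized

# Two differently indented and spaced inputs normalize to the same lines.
a = normalize_lines("    if (x)\t{\n\n        return  1;\n")
b = normalize_lines("if (x) {\nreturn 1;\n")
```

One consequence is that reindented or re-spaced copies of the same text still produce matching signatures.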


OPTIONS

-n lines, --lines lines
Set the minimum length in lines of each signature. The default is 10. This is ignored for the first line in each file.

-c chars, --chars chars
Set the minimum length in characters of each signature. The default is 256. This is ignored for the first line in each file. Signatures are always computed based on whole lines, so even if more than lines lines have been accumulated when chars characters is reached, the signature is not computed until the next end of line.

-C size, --file-chars size
A signature for the first line is computed only if the file size (in characters) exceeds size or chars. The default is 64. size has no effect if it is greater than chars.

-x listfile, --xargs listfile
Read a list of files to process from listfile.

-i, --ignore-case
Convert all letters to lowercase for the purpose of computing signatures.

-p, --punctuation
Insert spaces around every punctuation character for the purpose of computing signatures.
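As an illustration of how -n, -c, -i and -p could interact, here is a minimal sketch. It is one reading of the descriptions above, not the tool's actual algorithm: the names preprocess and signatures are hypothetical, SHA-1 is an arbitrary choice of hash, the special handling of the first line (-C) is omitted, and blank lines are assumed to have been dropped already.

```python
import hashlib
import string

def preprocess(line, ignore_case=False, punctuation=False):
    """Apply -i and -p, then collapse whitespace (assumed order of steps)."""
    if ignore_case:
        line = line.lower()
    if punctuation:
        # Surround every punctuation character with spaces.
        line = "".join(f" {ch} " if ch in string.punctuation else ch
                       for ch in line)
    return " ".join(line.split()) + " "

def signatures(lines, min_lines=10, min_chars=256):
    """Yield (start_index, line_count, digest) for each qualifying sequence.

    A sequence starting at each line grows by whole lines until it holds at
    least min_lines lines and at least min_chars characters.
    """
    for start in range(len(lines)):
        chars = 0
        for end in range(start, len(lines)):
            chars += len(lines[end])
            count = end - start + 1
            if count >= min_lines and chars >= min_chars:
                block = "".join(lines[start:end + 1])
                digest = hashlib.sha1(block.encode("utf-8")).hexdigest()
                yield start, count, digest
                break

# With -p, "f(x)" and "f ( x )" become the same normalized line.
same = preprocess("f(x)", punctuation=True) == preprocess("f ( x )", punctuation=True)
```

Under this reading, a sequence can contain more than lines lines when the character minimum is reached late, which matches the note that signatures are always computed on whole lines.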


DIAGNOSTICS

The exit status is 0 for no duplicate signatures, 1 for duplicate signatures found, or 2 for trouble.
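A calling script might interpret the exit status as in the sketch below. The helper name describe_exit_status and the message for unexpected values are assumptions; only the three documented codes come from the text above.

```python
# Meanings of the documented exit statuses.
EXIT_MEANINGS = {
    0: "no duplicate signatures",
    1: "duplicate signatures found",
    2: "trouble",
}

def describe_exit_status(code):
    """Map an exit status to its documented meaning (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, f"unexpected exit status {code}")

# Typical use, assuming find_similar is on PATH (not executed here):
#   import subprocess
#   result = subprocess.run(["find_similar", "a.c", "b.c"])
#   print(describe_exit_status(result.returncode))
```

Since status 1 means duplicates were found rather than a failure, callers should treat only 2 as an error.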


BUGS


SEE ALSO

post_find_similar(1)