Find

From Segfault

Find files with special permissions

Create test files & directories with all available permission bits set:

mkdir test && cd $_
for x in {0..7}; do
   echo "x: $x"
   for w in {0..7}; do
      for r in {0..7}; do
         for s in {0..7}; do
            MODE=${s}${r}${w}${x}
            touch file_$MODE
            mkdir -p dir_$MODE
            chmod $MODE file_$MODE dir_$MODE
         done
      done
   done
done

Find files with the sticky bit set. Note the leading dash: -perm -MODE matches objects that have all of the given permission bits set:

$ find . -perm -1000 -exec ls -dgo '{}' +
dr-xr-xr-t 2 4096 Jun 14 16:32 ./dir_1555
-r--r--r-T 1    0 Jun 14 16:32 ./file_1444
[...]

Find SUID or SGID objects:

$ find . \( -perm -04000 -o -perm -02000 \) -exec ls -dgo '{}' +
dr-sr--r-- 2 4096 Jun 14 16:32 ./dir_4544
-r-Sr--r-- 1    0 Jun 14 16:32 ./file_4444
[...]                                            # Note: Solaris will also return objects with 
                                                 # the mandatory locking bit[1] set.

Find world-writable objects, omitting symbolic links:

$ find . -perm -2 ! -type l -exec ls -dgo '{}' +
dr-xr---w- 2 4096 Jun 14 16:32 ./dir_0542
-r--r---w- 1    0 Jun 14 16:32 ./file_0442
[...]

And the reverse, finding files that are not world-readable:

$ find . ! -perm -0444 -type f -exec ls -ldgo '{}' +
-rwx-wS--T  1   0 Jun 26 01:41 ./file_3720
--wSrwSrw-  1   0 Jun 26 01:43 ./file_6266
---srws-wt  1   0 Jun 26 01:42 ./file_7173
[...]

Find readable or writable files for a specific user (needs GNU/find 4.3[2] or later):

$ sudo -u alice find /tmp/ -xdev -writable -ls 2>/dev/null 
   2913    0 drwxrwxrwt  16 root   root    200 Sep  6 16:09 /tmp/
4519452    0 drwx------   2 alice alice     80 Sep  6 16:01 /tmp/.vbox-alice-ipc
4519455    0 srwx------   1 alice alice      0 Sep  6 16:01 /tmp/agent.1593
4519453    4 -rw-------   1 alice alice      6 Sep  6 16:01 /tmp/tmperr

Find files with an unknown user and/or group:

$ sudo chown 123:456 file_0001
$ find . \( -nouser -o -nogroup \) -ls
525558    0 ---------x   1 123      456             0 Jun 14 16:32 ./file_0001

Find files with exactly the permission bits set:

$ find . -perm 0644 -exec ls -dgo '{}' + 
drw-r--r--   2       2 Jun 17 21:47 ./dir_0644
drw-r--r--   2       2 Jun 17 21:47 ./dir_2644
-rw-r--r--   1       0 Jun 17 21:47 ./file_0644
-rw-r--r--   1       0 Jun 17 21:47 ./file_1644

Find files with all of the permission bits set (not portable):

$ find . -perm -7774 -exec ls -dgo '{}' +
drwsrwsr-T 2 4096 Jun 14 16:32 ./dir_7774
drwsrwsr-t 2 4096 Jun 14 16:32 ./dir_7775
drwsrwsrwT 2 4096 Jun 14 16:32 ./dir_7776
drwsrwsrwt 2 4096 Jun 14 16:32 ./dir_7777
-rwsrwsr-T 1    0 Jun 14 16:32 ./file_7774
-rwsrwsr-t 1    0 Jun 14 16:32 ./file_7775
-rwsrwsrwT 1    0 Jun 14 16:32 ./file_7776
-rwsrwsrwt 1    0 Jun 14 16:32 ./file_7777

Find files with any of the permission bits set (not portable):

$ find . -perm /0444 -exec ls -dgo '{}' +
d------r--    2   4096 Jun 14 16:32 ./dir_0004
----r-----    1      0 Jun 14 16:32 ./file_0040
[...]

Find files with ACLs set:

$ ls -go
-rw-r-----  1 0 Jan 27 13:32 file1
-rw-r-----+ 1 0 Jan 27 13:32 file2
-rw-r-----  1 0 Jan 27 13:32 file3

$ getfacl -Rps . | awk -F'file: ' '/file:/ {print $NF}'
./file2

Find files with EAs set:

$ getfattr -R --absolute-names . | awk -F'file: ' '/file:/ {print $NF}'
./file2
./file3

Find hard links

Find hard links of a specific file, if -samefile is supported:

$ find . -samefile foo -ls
1076027678     12 -rw-------   2  bob users       38 Jan 11  2018 ./foo
1076027678     12 -rw-------   2  bob users       38 Jan 11  2018 ./bar

Find all hard links, i.e. files with a link count greater than 1:

$ find . -type f -links +1 -ls
1076027678     12 -rw-------   2  bob users       38 Jan 11  2018 ./foo
1076027678     12 -rw-------   2  bob users       38 Jan 11  2018 ./bar

Or, if for some reason -links is not supported:

$ find . ! -type d -ls | awk '$4 > 1'
1076027678     12 -rw-------   2  bob users       38 Jan 11  2018 ./foo
1076027678     12 -rw-------   2  bob users       38 Jan 11  2018 ./bar

More specifically, we can print just the link count first and then strip it, printing only the file names:

$ find . ! -type d -printf '%n %p\n' | awk '$1 > 1 {$1=""; print }'
./foo
./bar

This of course excludes directories, as hard links are usually not allowed for directories. If they were, we'd need to look for directories with a link count greater than 2:

$ mkdir baz && ls -dog baz
drwxr-x---. 2 0 Jul 31 16:12 baz

...depending on its contents:

$ touch baz/file && mkdir baz/dir
$ ls -dog baz
drwxr-x---. 3 0 Jul 31 16:13 baz
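So a directory's link count is 2 plus the number of its subdirectories, and -links +2 finds directories that contain subdirectories. A quick sketch, assuming a filesystem with classic link-count semantics (btrfs, for example, always reports 1 for directories):

```shell
cd "$(mktemp -d)"            # scratch directory
mkdir -p demo/sub empty      # "demo" contains a subdirectory, "empty" does not
find . -type d -links +2 | sort
```

This lists "." and "./demo": both contain at least one subdirectory, while "empty" keeps the minimum link count of 2.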

Find broken symlinks

find -L . -type l -ls

The -L option makes find(1) follow symbolic links. Combined with -type l, this matches only broken symlinks: since -L dereferences every link it can, the only objects still reported as type l are those that could not be followed to an actual file.
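A quick sanity check of that behaviour in a scratch directory (the target and link names are made up for this demo):

```shell
cd "$(mktemp -d)"
touch target
ln -s target ok              # resolvable symlink
ln -s missing broken         # dangling symlink
find -L . -type l            # reports only "./broken"
```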

Find symlinks pointing to an absolute target

In backups, symlinks to absolute targets may sometimes not be desirable. For example, a backup of another machine's /etc/mtab symlink may look like this:

$ find . -type l
lrwxrwxrwx 12 19 Apr 24 10:54 ../backups/alice/etc/mtab -> /proc/self/mounts

Subsequent backups of this directory will back up the host's version of /proc/self/mounts, thus needlessly backing up files. Let's see if we can find (and remove) symlinks pointing to an absolute target:

$ find . -type l | while read -r l; do readlink -- "$l" | grep -q ^/ && ls -go -- "$l" && echo rm -v "$l"; done 
lrwxrwxrwx 12 33 Apr 24 10:33 etc/localtime -> /usr/share/zoneinfo/Europe/Berlin
rm -v etc/localtime
[...]

If our find(1) version understands -lname, we can try:[3]

$ find . -type l -lname '/*' -exec ls -go '{}' +
lrwxrwxrwx 12 33 Apr 24 10:33 etc/localtime -> /usr/share/zoneinfo/Europe/Berlin

Find sparse files

A sparse file has a nominal size, but its data blocks are allocated only when data is actually written to it.

dd if=/dev/zero of=file   bs=1 count=1024k            2>/dev/null
dd if=/dev/zero of=sparse bs=1 count=0     seek=1024k 2>/dev/null

For Solaris or macOS we can use:

mkfile -n 1024k sparse

With GNU/coreutils installed:

truncate -s 1024K sparse

And so we have:

$ du -k file sparse
1024    file
0       sparse

By using the -s option of ls,[4] we can print the allocated size of each file[5], in blocks of 1024 bytes:

$ ls -gos file sparse 
1024 -rw-r--r--. 1 1048576 Aug 29 02:56 file
   4 -rw-r--r--. 1 1048576 Aug 29 02:57 sparse

Let's use GNU/find to search for sparse files:[6]

$ find . -type f -printf "%S\t%i\t%p\n" 2>/dev/null | awk '{if ($1 < 1.0) print}'
0.00390625      48880   ./sparse
|               |       |
|               |       |-- file name
|               |---------- inode number
|-------------------------- BLOCKSIZE * st_blocks / st_size. Sparse files are usually below 1.0

With BSD/find this is much easier, as it supports -sparse:

-sparse
    True if the current file is sparse, i.e. has fewer blocks allocated
    than expected based on its size in bytes. This might also match
    files that have been compressed by the filesystem.

Find long file names

mkdir dir
touch {,dir/}0{,1{,2{,3{,4{,5{,6{,7{,8{,9}}}}}}}}}

Find path names of at least 14 characters:

$ find . -regextype posix-extended -regex '.{14,}'
./dir/01234567                                                   # 14 characters, including the leading "./"
./dir/012345678
./dir/0123456789

If GNU/find is not available, we can use awk too:

$ find . | awk 'length >= 12 {print length, $0}' | sort -n
12 ./0123456789                                                  # 12 characters, including the leading "./"
12 ./dir/012345
13 ./dir/0123456
14 ./dir/01234567
15 ./dir/012345678
16 ./dir/0123456789

Find file names of at least 9 characters:

$ find . -regextype posix-extended -regex './[^/]{9,}.*'
./012345678                                                      # 9 characters, excluding the leading "./"
./0123456789

Again, with awk:

$ find . | awk -F/ 'length($NF) >= 9 {print length($NF), $0}'
 9 ./012345678                                                   # 9 characters, excluding the leading "./"
10 ./0123456789

Or, to find file names of specific length:

$ find . -type f | awk -F/ 'length($NF) <= 3 {print length($NF), $0}' | sort -n
1 ./0
1 ./dir/0
2 ./01
2 ./dir/01
3 ./012
3 ./dir/012

Find directory with the most objects

Sometimes we need to know where all inodes are being used:[7]

find . -xdev -printf '%h\n' | sort | uniq -c | sort -n

From GNU/find(1):

%h     Leading directories of file's name (all but the last element). If the file name contains no 
       slashes (since it is in the current directory) the %h specifier expands to ".".

If no GNU/findutils[8] are installed, maybe GNU/coreutils are and we can use du[9][10] to count inodes per directory:

du --inodes -xS . | sort -n                    # -S: do not include the counts of subdirectories

Find empty directories

Apparently, -empty is quite common[11], although it's not portable[12]:

$ mkdir -p dir1/dir_{a,b}
$ touch dir1/dir_a/foo                                                                                                                                                   
$ find dir1/ -depth -type d -empty
dir1/dir_b

Without -type d, it would also list the file dir1/dir_a/foo, for being empty:

$ man find | grep -- -empty\ F
    -empty File is empty and is either a regular file or a directory.
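We can verify that in a scratch directory, recreating the setup from above and dropping -type d:

```shell
cd "$(mktemp -d)"
mkdir -p dir1/dir_{a,b}
touch dir1/dir_a/foo
find dir1/ -empty | sort     # lists the empty file and the empty directory
```

This prints both dir1/dir_a/foo and dir1/dir_b.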

Find devices

While find(1)[12] doesn't have an option to search for major/minor numbers[13], we can still use it to search /sys and examine the udev information:

$ awk -F= '/DEVNAME/ {print $NF}' $(find /sys/dev/ -name 8:64)/uevent
sde

Or, with udevadm:

$ udevadm info -rq name $(find /sys/dev/ -name 8:64)
/dev/sde

Note: this is not portable but only works on Linux. For Solaris (and probably other Unix systems too), I found no other way than to grep:

$ find -L /dev/ -ls | grep '33, ' | grep ' 64 '
17301636    0 crw-r-----   1 root     sys       33,  64 Jun 17 21:25 /dev/rsd8a
17301635    0 brw-r-----   1 root     sys       33,  64 Jun 17 21:25 /dev/dsk/c1t0d0s0
17301636    0 crw-r-----   1 root     sys       33,  64 Jun 17 21:25 /dev/rdsk/c1t0d0s0
17301635    0 brw-r-----   1 root     sys       33,  64 Jun 17 21:25 /dev/sd8a

The results all point to different representations of the same device. Here, '33' is the major and '64' is the minor device number.[14]
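On Linux with GNU coreutils (an assumption), stat(1) can also print the device numbers directly: %t is the major and %T the minor number, in hexadecimal. A minimal check against /dev/null, which is the well-known 1:3 character device:

```shell
# %t = major, %T = minor device number (hex)
stat -c '%t:%T %n' /dev/null     # prints: 1:3 /dev/null
```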

Find non-ASCII file names

If the -regex or -iregex option is supported:

$ ls
bär  baß  foo

$ find . -type f ! -regex '[a-zA-Z0-9\ -\.\,_]*'           # This may need to be extended to match other 
                                                           # characters of the ASCII character set
./bär
./baß

A more elegant solution[15] is to use the ASCII table directly and find all files with characters not matching printable characters:

LC_ALL=C find . -name '*[! -~]*' | cat

Similarly:

LC_ALL=C find . -name '*[![:print:]]*' | cat

Restricting that character set even further we can use the following to find only objects with unusual or invalid[16] names:

LC_ALL=C find . -name "*[![a-zA-Z0-9\\.\ \-\(\),'\&_]*"

And while we're at it, we can match strings[17] as well:

$ echo 'fooπbar' | grep -P  '[^\x00-\x7F]'
fooπbar

$ echo 'fooπbar' | grep -P '[[:ascii:]]'
fooπbar

$ echo 'fooπbar' | grep -P '[^[:ascii:]]'
fooπbar

$ echo 'fooπbar' | grep -Po '[^[:ascii:]]'
π

Find most recent files

Find the most recently modified file within a directory structure, by printing (%T) and sorting on the modification time:

$ mkdir -p dir/b && touch dir/a && touch -r /bin/ls dir/b/c

$ LC_ALL=C find dir/ -type f -printf "%T@ %Tc %h/%f\n" | sort -n 
1510336586.0000000000 Fri Nov 10 09:56:26 2017 dir/b/c
1515811677.2218102000 Fri Jan 12 18:47:57 2018 dir/a

If find[12] doesn't support -printf, we can use stat(1) and a version of awk[18] that supports strftime. It doesn't seem to work when object names contain spaces, though:

$ find dir/ -type f -exec stat -r '{}' + | sort -nk10 | awk '{print strftime("%c", $10), $NF}'
Fri Nov 10 09:56:26 2017 dir/b/c
Fri Jan 12 18:47:57 2018 dir/a

An easier way to check for "recent" files within a directory structure might be to use a reference file:[19]

$ touch -t 201812310000 foo && ls -go foo
-rw-r-----  1   0 Dec 31  2018 foo

$ find dir/ -type f -newer foo -exec ls -ghtrd '{}' +
-rw-------  1 wheel    0B Nov 10 09:56 dir/b/c
-rw-------  1 wheel    0B Jan 12 18:47 dir/a

Find files from the future

While we can easily find files and directories modified in the past, it is not as easy to find objects with timestamps in the future[20][21]. find v4.3.3 implements -newerXY[22] to do just that:

touch -d "+20 years" foo
find . -newermt "1 day"

If find doesn't support -newerXY, maybe it supports -newer by using a reference file:

touch -d "+19 years" bar
find . -newer bar
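Putting both together in a scratch directory (assuming GNU touch and find):

```shell
cd "$(mktemp -d)"
touch now                        # a file with the current timestamp
touch -d "+20 years" future      # a file with an mtime far in the future
find . -type f -newermt "1 day"  # only ./future is newer than tomorrow
```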

Exclude directories

Exclude directories from our search:

find .            -xdev \( -path ./dir1 -o -path ./dir2 \) -prune -o -type f -print
find / /home/bob/ -xdev \( -path /home/.ecryptfs -o -path /var/cache/fscache/cache \) -prune -o -type f -print

Note the leading ./ in front of those directories, as we need the full (but not absolute) path when searching a relative path ("."). Also note the explicit -print: without it, the implicit -print applies to the whole expression, and the pruned directories themselves would be listed as well.
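A quick check of the pruning behaviour in a scratch directory (the directory names are made up for this demo); note the explicit -print, which keeps the pruned directories themselves out of the output:

```shell
cd "$(mktemp -d)"
mkdir dir1 dir2 dir3
touch dir1/a dir2/b dir3/c top
find . \( -path ./dir1 -o -path ./dir2 \) -prune -o -type f -print | sort
```

Only ./dir3/c and ./top are printed; everything below dir1 and dir2 is skipped.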

Remove files from target directory

Not really a find-only job, but find is involved in the process. Say we want to extract an archive into a (target) directory but want to remove everything in the directory that's not part of the archive.[23] We can make a list of the archive's contents and then remove everything else:

tar -tzf archive.tar.gz | cut -d/ -f2- > foo
find /some/target/directory/ | cut -d/ -f5- | while read -r f; do    # -f5- strips the /some/target/directory/ prefix
   grep -qw "^${f}$" foo || echo "not in archive: ${f}"
done

Let's see how this works:

$ find . | xargs echo 
. ./dir0 ./dir0/file.txt ./dir0/file.com ./dir1 ./dir1/file.exe ./dir1/file.dll ./file0 ./file1
$ tar -tf ../archive.tar | xargs echo 
. ./dir0 ./dir0/file.txt                 ./dir1 ./dir1/file.exe                 ./file0

$ tar -tf ../archive.tar | cut -d/ -f2- > foo
$ find . | cut -d/ -f2- | while read -r f; do grep -qw "^${f}$" foo || echo "not in archive: ${f}"; done
not in archive: .
not in archive: dir0
not in archive: dir0/file.com
not in archive: dir1
not in archive: dir1/file.dll
not in archive: file1
not in archive: foo

Close, but good enough :-)

Find duplicate files

Let's assume two directories, with some files, some of different/same size and/or content:

mkdir foo bar
echo hello > foo/file2.txt && echo hallo > foo/file3.txt && sleep 5 && cp foo/* bar/ && \
  sed 's/ll/xx/' bar/file2.txt > bar/file4.txt && echo hall > bar/file5.txt

We should have something like this now:

$ find foo bar -type f | xargs ls -gotr
-rw-r-----  1   6 Mar 28 18:02 foo/file3.txt
-rw-r-----  1   6 Mar 28 18:02 foo/file2.txt
-rw-r-----  1   6 Mar 28 18:02 bar/file3.txt
-rw-r-----  1   6 Mar 28 18:02 bar/file2.txt
-rw-r-----  1   5 Mar 28 18:02 bar/file5.txt
-rw-r-----  1   6 Mar 28 18:02 bar/file4.txt

One could of course calculate checksums of all files and look for duplicate checksums:

$ find foo bar -type f -exec md5 -r '{}' + | sort
42b8651cc8d149d469b5a389ee5dabf3 bar/file4.txt                         -- hexxo
ae2a2ddab163997947a38a994ba263df bar/file5.txt                         -- hall
aee97cb3ad288ef0add6c6b5b5fae48a bar/file3.txt                         -- hallo
aee97cb3ad288ef0add6c6b5b5fae48a foo/file3.txt                         -- hallo
b1946ac92492d2347c6235b4d2611184 bar/file2.txt                         -- hello
b1946ac92492d2347c6235b4d2611184 foo/file2.txt                         -- hello

Nice, but we still need a way to programmatically look for duplicates and then remove them while keeping at least one copy. Ideally, the remaining copy should be the oldest file. That's quite a dance with the usual tools:

$ find foo bar -type f -exec md5 -r '{}' + | sort > sums
$ for dup in $(find foo bar -type f -exec md5 -q '{}' + | sort | uniq -d); do grep $dup sums; echo; done
aee97cb3ad288ef0add6c6b5b5fae48a bar/file3.txt                         -- hallo
aee97cb3ad288ef0add6c6b5b5fae48a foo/file3.txt                         -- hallo

b1946ac92492d2347c6235b4d2611184 bar/file2.txt                         -- hello
b1946ac92492d2347c6235b4d2611184 foo/file2.txt                         -- hello

Better, but now we still need to figure out which file to keep and which to remove. Let's use the file's mtime for that:

$ for dup in $(find foo bar -type f -exec md5 -q '{}' + | sort | uniq -d); do 
    grep ${dup} sums | cut -c34- | xargs ls -gotr | tail -n +2; echo; done
-rw-r-----  1   6 Mar 28 18:03 bar/file3.txt

-rw-r-----  1   6 Mar 28 18:03 bar/file2.txt

OK, there we have it: both file3.txt and file2.txt had (slightly newer) duplicates in the bar directory. Both of those can be deleted with some slight variation:

$ for dup in $(find foo bar -type f -exec md5 -q '{}' + | sort | uniq -d); do 
     grep ${dup} sums | cut -c34- | xargs ls -tr | tail -n +2 | xargs rm -v; done
bar/file3.txt
bar/file2.txt

That might work, but may get expensive pretty quickly depending on the number and size of files to compare. Another option would be to generate checksums only for files that share the same size, since files of different sizes cannot be duplicates of each other.

mkdir baz
echo hello > baz/file2.txt && echo hallo > baz/file3.txt && echo hall > baz/file4.txt && echo hello > baz/file5.txt

Let's print all file sizes and adjust the routine above accordingly:

$ find baz -type f -exec stat -f %z\ %N '{}' + > sizes
$ awk '{print $1}' sizes | sort | uniq -d | while read s; do grep -w "^${s}" sizes | cut -d\  -f2-; done
baz/file2.txt
baz/file3.txt
baz/file5.txt

Clearly file4.txt is not a duplicate of anything because of its different size, so we only have to compare the checksums of file{2,3,5}.txt, see above on how to do this.
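The stat -f invocation above is BSD syntax; with GNU coreutils installed (an assumption), the same size list can be built with stat -c:

```shell
cd "$(mktemp -d)" && mkdir baz
echo hello > baz/file2.txt && echo hallo > baz/file3.txt
echo hall  > baz/file4.txt && echo hello > baz/file5.txt
find baz -type f -exec stat -c '%s %n' '{}' + > sizes   # GNU stat: %s = size, %n = name
awk '{print $1}' sizes | sort | uniq -d |
  while read -r s; do grep -w "^${s}" sizes | cut -d' ' -f2-; done
```

As before, baz/file4.txt drops out because of its unique size.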

With GNU/coreutils it's a bit easier to do[24], since uniq can compare just the first N characters of each line:

$ find baz/ -type f -exec md5sum {} + | sort | uniq -w32 -D
b1946ac92492d2347c6235b4d2611184  baz/file2.txt
b1946ac92492d2347c6235b4d2611184  baz/file5.txt

While this all works just fine let's see if we can find some tools[25] that do all of the above in a more convenient way.

fdupes

fdupes:

$ fdupes -P -t -r -o time       foo/ bar/
2020-03-28 19:23 foo/file3.txt          
2020-03-28 19:24 bar/file3.txt

2020-03-28 19:23 foo/file2.txt
2020-03-28 19:24 bar/file2.txt

With -d one can interactively remove files. With lots of duplicates this can be done automatically, though apparently there's no way to dry-run this beforehand:

$ fdupes -P -t -r -o time -d -N foo/ bar/

  [+] foo/file3.txt
  [-] bar/file3.txt

  [+] foo/file2.txt
  [-] bar/file2.txt

jdupes

jdupes claims to be faster than fdupes, with similar syntax:

$ jdupes -q -r -o time foo/ bar/
foo/file3.txt                                               
bar/file3.txt

foo/file2.txt
bar/file2.txt

As with fdupes, duplicates can be removed automatically too:

$ jdupes -q -r -o time -d -N foo/ bar/

  [+] foo/file3.txt
  [-] bar/file3.txt

  [+] foo/file2.txt
  [-] bar/file2.txt

rdfind

rdfind runs non-interactively, but doesn't allow us to specify which duplicate to remove:

$ rdfind -deleteduplicates true -dryrun true foo bar | grep delete
(DRYRUN MODE) delete bar/file3.txt
(DRYRUN MODE) delete bar/file2.txt

Apparently it decides to remove the duplicates found under the last directory specified. IOW, if we swap the directory order in the command above, we will get:

$ rdfind -deleteduplicates true -dryrun true bar foo | grep delete
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete foo/file3.txt
(DRYRUN MODE) delete foo/file2.txt

fslint

fslint is more of a linter and provides a GUI but usually has a command line version as well:

$ dpkg -L fslint | grep fslint$
/usr/share/doc/fslint
/usr/share/fslint
/usr/share/fslint/fslint
/usr/share/fslint/fslint/fslint

$ /usr/share/fslint/fslint/fslint --help | grep dup
findup -- find DUPlicate files

$ /usr/share/fslint/fslint/findup -t -d foo/ bar/
keeping:  bar/file3.txt
deleting: foo/file3.txt 

keeping:  bar/file2.txt
deleting: foo/file2.txt

Unfortunately it provides no way to specify which duplicate to remove so we'd have to figure this out manually again.

rmlint

rmlint appears to be another linter and is able to remove duplicates as well. It will search for duplicates and generate a script to remove them. It also provides strategies on how files and duplicates are ordered to aid their removal:

$ rmlint -T duplicates -S pOma foo bar                # p: keep first named path
                                                      # O: keep file with highest number of hardlinks
                                                      # m: keep lowest mtime
                                                      # a: keep first alphabetically

# Duplicate(s):
   ls 'foo/file2.txt'
   rm 'bar/file2.txt'
   ls 'foo/file3.txt'
   rm 'bar/file3.txt'

==> Note: Please use the saved script below for removal, not the above output.
==> In total 6 files, whereof 2 are duplicates in 2 groups.
==> This equals 12 B of duplicates which could be removed.
==> Scanning took in total 0.148s.

Wrote a sh file to: rmlint.sh
Wrote a json file to: rmlint.json

Nothing has been removed yet, and we can even test the removal script:

$ ./rmlint.sh -n
# ////////////////////////////////////////////////////////////
# ///  This is only a dry run; nothing will be modified! ///
# ////////////////////////////////////////////////////////////
[  0%] Keeping:  foo/file2.txt
[ 25%] Deleting: bar/file2.txt
[ 50%] Keeping:  foo/file3.txt
[ 75%] Deleting: bar/file3.txt
[100%] Done!

Windows

And while we're at it, there is a GUI application for Windows systems called Anti-Twin. Although its last release is from 2012 (!), it's said to still be useful on current Windows systems. Some command line examples[26][27] would be neat too.

TBD
