Wandering around the command line, one thing has become clear: We have lots of files! This raises the questions: How do we find things? How do we keep the system’s data secure and perform timely backups? So this blog post is about searching, compressing, archiving, and synchronizing files.
As last time, all the knowledge I share with you here is based on “The Linux Command Line” book written by William E. Shotts, which I highly recommend.
1. Searching for Files
The sheer number of files can present a daunting problem. In this part, we will look at two tools that are used to find files on a system.
locate — Find Files the Easy Way
The locate program performs a rapid database search for pathnames and then outputs every name that matches a given substring.
Say, for example, we want to find all the programs with names that begin with zip. Since we are looking for programs, we can assume that the name of the directory containing the programs would end with bin/. Therefore, we could try to use locate this way to find our files:
$ locate bin/zip
locate will search its database of pathnames and output any that contain the string bin/zip:
/usr/bin/zip
/usr/bin/zipcloak
/usr/bin/zipdetails
/usr/bin/zipdetails5.18
/usr/bin/zipgrep
/usr/bin/zipinfo
/usr/bin/zipnote
/usr/bin/zipsplit
If the search requirement is not so simple, locate can be combined with other tools, such as grep, to design more interesting searches. The pipe operator | (vertical bar) is used here to pipe the output of one command into the input of another.
$ locate zip | grep bin
/Applications/Atom.app/Contents/Resources/app/apm/node_modules/.bin/decompress-zip
/Applications/Atom.app/Contents/Resources/app/apm/node_modules/decompress-zip/bin
/Applications/Atom.app/Contents/Resources/app/apm/node_modules/decompress-zip/bin/decompress-zip
/anaconda3/bin/bunzip2
/anaconda3/bin/bzip2
/anaconda3/bin/bzip2recover
(...)
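As the output shows, a plain grep bin also matches paths that merely contain "bin" somewhere. Here is a minimal sketch of a stricter filter. Since locate depends on a database that may not exist on every machine, the sketch feeds grep a made-up list of paths (the paths variable is purely hypothetical sample data, not real locate output):

```shell
# Hypothetical sample of what `locate zip` might print (not real system output).
paths='/usr/bin/zip
/usr/bin/zipinfo
/usr/share/doc/zip/README
/opt/tools/bin/bunzip2
/home/me/binary-notes/zip.txt'

# Loose filter, as in the text: counts every path containing "bin" anywhere.
loose=$(printf '%s\n' "$paths" | grep -c 'bin')

# Stricter filter: the last path component must start with "zip"
# and sit directly inside a directory named bin.
strict=$(printf '%s\n' "$paths" | grep -E '/bin/zip[^/]*$')

echo "loose matches: $loose"
printf '%s\n' "$strict"
```

On a real system, the same pattern could be applied as locate zip | grep -E '/bin/zip[^/]*$'.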
find — Find Files the Hard Way
While the locate program can find a file based solely on its name, the find program searches a given directory (and its subdirectories) for files based on a variety of attributes.
In its simplest use, find is given one or more names of directories to search. For example, it can produce a list of our home directory:
$ find ~
On most active user accounts, this will produce a large list. Since the list is sent to standard output, we can pipe the list into other programs. Let’s use wc -l to count the lines of output, one per file or directory:
$ find ~ | wc -l
445784
Yep, we’ve been busy! The beauty of find is that it can be used to identify files that meet specific criteria. It does this through the application of tests, actions, and options.
find + tests
By adding the test -type, we can limit our search to common file-types.
Let’s say that we want a list of directories (-type d) from our search:
$ find ~ -type d | wc -l
1695
Conversely, we could have limited the search to regular files (-type f):
$ find ~ -type f | wc -l
38737
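To try the -type test without wading through a whole home directory, here is a small sketch that builds a throwaway tree and counts its entries. The sandbox variable and all file names are made up for illustration. It counts with wc -l because find prints one path per line:

```shell
# Build a small throwaway tree so the counts are predictable.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/photos/2019" "$sandbox/docs"
touch "$sandbox/photos/2019/a.JPG" "$sandbox/photos/b.JPG" "$sandbox/docs/notes.txt"

# tr strips the padding that some wc implementations add around the number.
dirs=$(find "$sandbox" -type d | wc -l | tr -d ' ')
files=$(find "$sandbox" -type f | wc -l | tr -d ' ')
all=$(find "$sandbox" | wc -l | tr -d ' ')

echo "directories: $dirs, regular files: $files, total: $all"
rm -rf "$sandbox"
```

Note that the directory count includes the starting directory itself.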
We can also search by file size and filename by adding some additional tests. Let’s look for all the regular files that match the wildcard pattern *.JPG and are larger than 1 megabyte:
$ find ~ -type f -name "*.JPG" -size +1M | wc -l
11942
The leading plus sign on the string +1M indicates that we are looking for files larger than the specified number. A leading minus sign would change the string to mean smaller than the specified number. Using no sign means match the value exactly. The trailing letter M indicates that the unit of measurement is megabytes. G would mean gigabytes, k kilobytes, and c bytes.
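The three forms of the -size test (plus, minus, no sign) can be tried out on files of known size. This is only a sketch: the sandbox directory and file names are invented, and it uses the byte unit c so that the comparison is exact and not affected by how larger units are rounded:

```shell
sandbox=$(mktemp -d)
# head -c N copies the first N bytes of its input; /dev/zero supplies zero bytes.
head -c 100  /dev/zero > "$sandbox/small.JPG"
head -c 4096 /dev/zero > "$sandbox/large.JPG"
head -c 4096 /dev/zero > "$sandbox/large.txt"

# Larger than 1024 bytes and named *.JPG: only large.JPG qualifies.
bigger=$(find "$sandbox" -type f -name '*.JPG' -size +1024c)
# Smaller than 1024 bytes and named *.JPG: only small.JPG qualifies.
smaller=$(find "$sandbox" -type f -name '*.JPG' -size -1024c)
# Exactly 4096 bytes, any name: large.JPG and large.txt.
exact=$(find "$sandbox" -type f -size 4096c | wc -l | tr -d ' ')

echo "bigger: $bigger"
echo "smaller: $smaller"
echo "exactly 4096 bytes: $exact file(s)"
rm -rf "$sandbox"
```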
find supports a large number of different tests, for example:
-cmin n              # Match files or directories whose content or attributes were
                     # last modified exactly n minutes ago.
                     # To specify fewer than n minutes ago, use -n; to specify more
                     # than n minutes ago, use +n.
-empty               # Match empty files and directories.
-newer specific_file # Match files and directories whose contents were modified more
                     # recently than the specific_file.
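Here is a rough sketch of these three tests in action. All file names are hypothetical, and touch -t is used to backdate modification times so that -newer has something to compare. One thing the sketch makes visible: -cmin looks at the time of the last status change, which touch -t does not backdate, so every freshly created file counts as recently changed:

```shell
sandbox=$(mktemp -d)
# touch -t sets an explicit modification time in CCYYMMDDhhmm form.
touch -t 202001010000 "$sandbox/old.txt"
touch -t 202101010000 "$sandbox/reference.txt"
touch -t 202201010000 "$sandbox/new.txt"
echo "some content" > "$sandbox/just-written.txt"

# Modified more recently than reference.txt: new.txt and just-written.txt.
newer=$(find "$sandbox" -type f -newer "$sandbox/reference.txt" | wc -l | tr -d ' ')
# Empty regular files: the three created by touch.
empty=$(find "$sandbox" -type f -empty | wc -l | tr -d ' ')
# Status changed less than 5 minutes ago: all four files, backdated or not.
recent=$(find "$sandbox" -type f -cmin -5 | wc -l | tr -d ' ')

echo "newer: $newer, empty: $empty, recently changed: $recent"
rm -rf "$sandbox"
```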
What if we needed to determine if all the files and subdirectories in a directory had secure permissions? We would look for all the files with permissions that are not 0600 and the directories with permissions that are not 0700. (We didn’t cover permissions so far, but it just serves as an example.)
Fortunately, find provides a way to combine tests using logical operators to create more complex logical relationships:
$ find ~ \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \)
Since the parentheses have special meaning to the shell, we must escape them to prevent the shell from trying to interpret them. Preceding each one with a backslash character does the trick.
Other operators are -and and -not.
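To watch the permission check from above at work, we can run it on a small test tree. This is a sketch with made-up names (sandbox, private, shared, secret.txt, public.txt); the permissions are set explicitly so the result does not depend on the current umask:

```shell
sandbox=$(mktemp -d)
chmod 0700 "$sandbox"
mkdir "$sandbox/private" "$sandbox/shared"
chmod 0700 "$sandbox/private"
chmod 0755 "$sandbox/shared"      # too open for a directory
touch "$sandbox/secret.txt" "$sandbox/public.txt"
chmod 0600 "$sandbox/secret.txt"
chmod 0644 "$sandbox/public.txt"  # too open for a file

# Same expression as in the text: files that are not 0600 OR directories that are not 0700.
offenders=$(find "$sandbox" \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \) | sort)
count=$(printf '%s\n' "$offenders" | wc -l | tr -d ' ')

printf '%s\n' "$offenders"
rm -rf "$sandbox"
```

Only public.txt and shared should be listed, since everything else has the expected permissions.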
find + actions
Having a list of results from our find command is useful, but what we really want to do is act on the items on the list. Fortunately, find allows actions to be performed based on the search results.
For example, we can use find to delete files that meet certain criteria. Here, to delete all files in the user’s home directory that have the file extension .BAK (which is often used to designate backup files), we could use this command:
$ find ~ -type f -name '*.BAK' -delete
It should go without saying that you should use extreme caution when using the -delete action!
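One cautious habit is to run the same search with -print first and only swap in -delete once the list looks right. A minimal sketch, using a throwaway directory and invented file names:

```shell
sandbox=$(mktemp -d)
touch "$sandbox/report.BAK" "$sandbox/old.BAK" "$sandbox/report.txt"

# Step 1: preview. Same tests, but -print in place of -delete.
preview=$(find "$sandbox" -type f -name '*.BAK' -print | sort)
printf 'would delete:\n%s\n' "$preview"

# Step 2: the list looks right, so swap -print for -delete.
find "$sandbox" -type f -name '*.BAK' -delete

remaining=$(ls "$sandbox")
echo "remaining: $remaining"
rm -rf "$sandbox"
```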
Let’s take another look at how the logical operators affect actions. Consider the following command:
$ find ~ -print -and -type f -and -name '*.BAK'
This command prints every file and directory it visits and only afterward tests for file type and the specified file extension, so the tests have no influence on what is printed. Since the logical relationship between the tests and actions determines which of them are performed, the order of the tests and actions is important. If we were to reorder the tests and actions above so that the -print action was the last one, only the regular files ending in .BAK would be printed.
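The difference is easy to measure on a small test tree. In this sketch (directory and file names are made up), the same tests are combined with -print once at the front and once at the end:

```shell
sandbox=$(mktemp -d)
mkdir "$sandbox/sub"
touch "$sandbox/a.BAK" "$sandbox/sub/b.BAK" "$sandbox/c.txt"

# -print comes first, so it runs for every item; the tests after it filter nothing.
print_first=$(find "$sandbox" -print -and -type f -and -name '*.BAK' | wc -l | tr -d ' ')

# -print comes last, so it only runs for items that passed both tests.
print_last=$(find "$sandbox" -type f -and -name '*.BAK' -and -print | wc -l | tr -d ' ')

echo "print first: $print_first lines, print last: $print_last lines"
rm -rf "$sandbox"
```

With -print first, all five entries (two directories and three files) are listed; with -print last, only the two .BAK files are.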
In addition to predefined actions as discussed above, we can also invoke arbitrary commands. But we won’t cover this topic here and now.
2. Compressing Files
Throughout the history of computing, there has been a struggle to get the most data into the smallest available space, whether that space be memory, storage device, or network bandwidth. Many of the data devices that we take for granted today, such as portable music players or broadband Internet, owe their existence to effective data compression, which is the process of removing redundancy from data.
gzip — Compress or Expand Files
The gzip program is used to compress one or more files. When executed, it replaces the original file with a compressed version of the original. The corresponding gunzip program is used to restore compressed files to their original, uncompressed form. Here is an example:
MacBook-Air:~ bbettendorf$ ls -l /Users/bbettendorf/ > foo.txt
MacBook-Air:~ bbettendorf$ ls -l foo.*
-rw-r--r-- 1 bbettendorf staff 906 5 Apr 12:01 foo.txt
MacBook-Air:~ bbettendorf$ gzip foo.txt
MacBook-Air:~ bbettendorf$ ls -l foo.*
-rw-r--r-- 1 bbettendorf staff 384 5 Apr 12:01 foo.txt.gz
In this example, we create a text file named foo.txt from a directory listing. With “>” we store the output of our ls -l command in the file.
Next, we run gzip, which replaces the original file with a compressed version named foo.txt.gz. In the directory listing of foo.*, we see that the original file has been replaced with the compressed version and that the compressed version is less than half the size of the original (384 bytes instead of 906). We can also see that the compressed file has the same permissions (-rw-r--r--) and timestamp as the original.
Next, we run the gunzip program to uncompress the file:
MacBook-Air:~ bbettendorf$ gunzip foo.txt
MacBook-Air:~ bbettendorf$ ls -l foo.*
-rw-r--r-- 1 bbettendorf staff 906 5 Apr 12:01 foo.txt
We can see that the compressed version of the file has been replaced with the original, again with the permissions and timestamp preserved.
gzip has many options, such as -t, which tests the integrity of a compressed file (may also be specified with --test); -v, which displays verbose messages while compressing (may also be specified with --verbose); and -l, which lists compression statistics for each compressed file (may also be specified with --list).
Let’s look again at our earlier example:
MacBook-Air:~ bbettendorf$ gzip foo.txt
MacBook-Air:~ bbettendorf$ gzip -tv foo.txt.gz
foo.txt.gz: OK
MacBook-Air:~ bbettendorf$
Here, we replaced the file foo.txt with a compressed version and afterward, tested the integrity of the compressed version, using the -t and -v options.
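The whole round trip (compress, test, uncompress) can be sketched as a small script. The sandbox directory and the reference.txt copy are made up for this example, and seq simply provides some compressible text:

```shell
sandbox=$(mktemp -d)
seq 1 2000 > "$sandbox/foo.txt"
cp "$sandbox/foo.txt" "$sandbox/reference.txt"   # kept only for the final comparison
before=$(wc -c < "$sandbox/foo.txt" | tr -d ' ')

gzip "$sandbox/foo.txt"                          # foo.txt is replaced by foo.txt.gz
after=$(wc -c < "$sandbox/foo.txt.gz" | tr -d ' ')

# gzip -t exits with status 0 when the compressed file is intact.
if gzip -t "$sandbox/foo.txt.gz"; then integrity=OK; else integrity=FAILED; fi

gunzip "$sandbox/foo.txt.gz"                     # foo.txt.gz is replaced by foo.txt
if cmp -s "$sandbox/foo.txt" "$sandbox/reference.txt"; then
    roundtrip=identical
else
    roundtrip=different
fi

echo "size: $before -> $after bytes, integrity: $integrity, round trip: $roundtrip"
rm -rf "$sandbox"
```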
Don’t be compressive compulsive!
People sometimes attempt to compress a file that has already been compressed. According to William E. Shotts and his book “The Linux Command Line”: Don’t do it. You’re probably just wasting time and space.
If you apply compression to a file that is already compressed, you will usually end up with a larger file. This is because all compression techniques involve some overhead that is added to the file to describe the compression. If you try to compress a file that already contains no redundant information, the compression will not result in any savings to offset the additional overhead.
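We can check this with a quick experiment. The following is only a sketch with invented file names; it uses gzip -c, which writes the compressed data to standard output and leaves the input file in place. The exact numbers depend on the data, so the script just reports how many bytes each pass saves:

```shell
sandbox=$(mktemp -d)
seq 1 20000 > "$sandbox/data.txt"

gzip -c "$sandbox/data.txt" > "$sandbox/once.gz"    # first pass
gzip -c "$sandbox/once.gz"  > "$sandbox/twice.gz"   # second pass on compressed data

orig=$(wc -c < "$sandbox/data.txt" | tr -d ' ')
once=$(wc -c < "$sandbox/once.gz" | tr -d ' ')
twice=$(wc -c < "$sandbox/twice.gz" | tr -d ' ')

first_gain=$((orig - once))     # bytes saved by the first pass
second_gain=$((once - twice))   # bytes saved by the second pass (may be negative)

echo "original: $orig, compressed once: $once, compressed twice: $twice"
echo "first pass saved $first_gain bytes, second pass saved $second_gain bytes"
rm -rf "$sandbox"
```

The first pass should save a lot, while the second pass saves little or nothing.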
3. Archiving Files
A common file-management task used in conjunction with compression is archiving. Archiving is the process of gathering up many files and bundling them into a single large file. Archiving is often done as a part of system backups. It is also used when old data is moved from a system to some type of long-term storage.
zip — Package and Compress Files
The zip program is both a compression tool and an archiver. The file format used by the program is familiar to Windows users, as it reads and writes .zip files. In Linux, however, gzip is the predominant compression program with bzip2 being a close second. Linux users mainly use zip for exchanging files with Windows systems, rather than for performing compression and archiving.
In its most basic usage, zip is invoked like this:
zip -options zipfile file...
To show zip in action, let’s imagine we had a playground directory with many sub-directories. To make a zip archive of this playground, we would do this:
$ zip -r playground.zip playground
Unless we include the -r option for recursion, only the playground directory (but none of its contents) is stored. Although the addition of the extension .zip is automatic, we will include the file extension for clarity.
Extracting the contents of a zip file is straightforward when using the unzip program:
$ cd foo
$ unzip ../playground.zip
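Put together, archiving and extracting might look like the following sketch. The directory layout (playground/dir-001/file-A) is invented for the example. Since zip and unzip are not installed on every system, the script checks for them first and skips the test otherwise:

```shell
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    sandbox=$(mktemp -d)
    mkdir -p "$sandbox/playground/dir-001" "$sandbox/foo"
    echo "hello" > "$sandbox/playground/dir-001/file-A"

    # Run zip from inside the sandbox so the archive stores relative paths.
    # -r recurses into playground, -q keeps the output quiet.
    (cd "$sandbox" && zip -r -q playground.zip playground)
    (cd "$sandbox/foo" && unzip -q ../playground.zip)

    # diff -r compares the original tree with the extracted copy.
    if diff -r "$sandbox/playground" "$sandbox/foo/playground" >/dev/null; then
        zip_result=identical
    else
        zip_result=different
    fi
    rm -rf "$sandbox"
else
    zip_result=skipped    # zip or unzip is not installed
fi
echo "zip round trip: $zip_result"
```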
4. Synchronizing Files
A common strategy for maintaining a backup copy of a system involves keeping one or more directories synchronized with another directory (or directories) located on either the local system or a remote system. We might, for example, have a local copy of a website under development and synchronize it from time to time with the “live” copy on a remote web server.
rsync — Remote File and Directory Synchronization
The preferred tool for this task is rsync. This program can synchronize both local and remote directories by using the rsync remote-update protocol, which allows rsync to quickly detect the differences between two directories and perform the minimum amount of copying required to bring them into sync. This makes rsync very fast and economical to use.
rsync is invoked like this:
rsync options source destination
where source and destination are each one of the following:
- A local file or directory
- A remote file or directory in the form of [user@]host:path
- A remote rsync server specified with a URI (Uniform Resource Identifier) of rsync://[user@]host[:port]/path
Note that either the source or the destination must be a local file. Remote-to-remote copying is not supported.
Let’s try rsync out on some local files. First, we empty the foo directory with rm, using the -r option (which is required to delete directories and their contents) and the -f option (for forcing; it ignores nonexistent files and does not prompt):
$ rm -rf foo/*
Next, we’ll synchronize the playground directory (source) with a corresponding copy in foo (destination):
$ rsync -av playground foo
We’ve included both the -a option (for archiving, causes recursion and preservation of file attributes) and the -v option (verbose output) to make a mirror of the playground directory within foo. While the command runs, we will see a list of the files and directories being copied. At the end, we will see a summary message like this, indicating the amount of copying performed:
sent 135759 bytes received 57870 bytes 387258.00 bytes/sec
total size is 3230 speedup is 0.02
If we run the command again, we will see a different result:
$ rsync -av playground foo
building file list ... done
sent 22635 bytes received 20 bytes 45310.00 bytes/sec
total size is 3230 speedup is 0.14
Notice that there was no listing of files. This is because rsync detected that there were no differences between ~/playground and ~/foo/playground, and therefore it didn’t need to copy anything.
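One detail worth trying out is the difference a trailing slash on the source makes: without it, rsync copies the directory itself (giving foo/playground), while with it, only the contents are copied. Here is a small sketch with an invented layout (playground/dir-001/file-A, plus a second destination called bar). Because rsync is not installed everywhere, the script checks for it first:

```shell
if command -v rsync >/dev/null 2>&1; then
    sandbox=$(mktemp -d)
    mkdir -p "$sandbox/playground/dir-001" "$sandbox/foo" "$sandbox/bar"
    echo "hello" > "$sandbox/playground/dir-001/file-A"

    # No trailing slash: the directory itself is copied -> foo/playground/dir-001/file-A
    rsync -a "$sandbox/playground" "$sandbox/foo"
    # Trailing slash: only the contents are copied -> bar/dir-001/file-A
    rsync -a "$sandbox/playground/" "$sandbox/bar"

    if [ -f "$sandbox/foo/playground/dir-001/file-A" ] && [ -f "$sandbox/bar/dir-001/file-A" ]; then
        rsync_result=ok
    else
        rsync_result=unexpected
    fi
    rm -rf "$sandbox"
else
    rsync_result=skipped    # rsync is not installed
fi
echo "rsync layout check: $rsync_result"
```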
Well, that’s it for today! But stay tuned: I plan at least one additional blog post on the command line in the next few weeks!
. . . . . . . . . . . . . .
Thank you for reading! I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback, or suggestions for improvement!