Searching for Text on the Command Line with Regular Expressions

Before we can fully appreciate all of the features offered by text-manipulation tools, we have to examine a technology that is frequently associated with their most sophisticated uses: regular expressions.

As last time, all the knowledge I share with you here is based on “The Linux Command Line” book written by William E. Shotts, which I highly recommend.

What are Regular Expressions?

Simply put, regular expressions are symbolic notations used to identify patterns in text. They are also supported by and known from most programming languages, such as Python, to facilitate the solution of text manipulation problems.

grep — Search through Text

The main program we will use to work with regular expressions is grep. The name grep is derived from the phrase "global regular expression print".

The grep program accepts options and arguments this way:

grep [options] regex [file...]

where regex is a regular expression. Commonly used grep options are:

-i : Ignore case. Do not distinguish between upper- and lowercase characters. May also be specified as --ignore-case.

-v : Invert match. This option causes grep to print every line that does NOT contain a match. May also be specified as --invert-match.

-c : Print the number of matching lines (or non-matching lines if the -v option is also specified) instead of the lines themselves. May also be specified as --count.

-h : For multi-file searches, suppress the output of filenames. May also be specified as --no-filename.
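To see these options side by side, here is a minimal sketch. The sample file and its four lines are made up for illustration and are not part of the directory listings used later.

```shell
# Build a small throwaway sample file (hypothetical contents).
sample=$(mktemp)
printf '%s\n' 'Zip archive' 'gzip' 'tarball' 'unzip' > "$sample"

# -i: case-insensitive search, so 'Zip archive' matches as well.
ignore_case=$(grep -i 'zip' "$sample")

# -v: print only the lines that do NOT contain lowercase 'zip'.
inverted=$(grep -v 'zip' "$sample")

# -c: print the number of matching lines instead of the lines.
count=$(grep -c 'zip' "$sample")

# -h: drop the "filename:" prefix grep adds when given several files.
no_names=$(grep -h 'tar' "$sample" "$sample")

printf '%s\n' "$ignore_case" "$inverted" "$count" "$no_names"
rm -f "$sample"
```

Without -h, the last command would prefix each matching line with the name of the file it came from.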

Metacharacters and Literals

Regular expression metacharacters consist of the following:

^ $ . [ ] { } - ? * + ( ) | \

All other characters are considered literals. The backslash character is used in a few cases to create metasequences, and it also allows the metacharacters to be escaped so that they are treated as literals instead of being interpreted as metacharacters.

As we can see, many of the regular-expression metacharacters are also characters that have meaning to the shell when expansion is performed. When we pass regular expressions containing metacharacters on the command line, it is vital that they be enclosed in quotes to prevent the shell from attempting to expand them.
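A small sketch of why the quotes matter. It uses echo as a stand-in for grep so that we can see exactly what the command receives; the scratch directory and the two filenames are assumptions made up for this demonstration.

```shell
# Work in a scratch directory so the pattern has files to expand to.
scratch=$(mktemp -d)
touch "$scratch/a.zip" "$scratch/b.zip"

# Unquoted: the shell replaces *.zip with the matching filenames
# before the command ever sees it.
unquoted=$(cd "$scratch" && echo *.zip)

# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$scratch" && echo '*.zip')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -rf "$scratch"
```

Had this been grep, the unquoted form would have searched for the text a.zip inside the file b.zip, which is almost certainly not what we meant.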

The Any Character

The first metacharacter we will look at is the dot or period character, which matches any single character. For example:

$ grep -h '.zip' dirlist*.txt
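Note that the dot has to match something. A quick sketch with a made-up list of names (a stand-in for the directory listings) shows that a name must have at least one character in front of zip to match:

```shell
# Hypothetical stand-in for the directory listings.
names=$(printf '%s\n' zip gzip unzip zipinfo)

# '.zip' needs one character, whatever it is, directly before "zip".
dot_matches=$(printf '%s\n' "$names" | grep '.zip')

echo "$dot_matches"
```

Here zip and zipinfo are left out, because in both names nothing precedes the string zip.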

Anchors

The caret (^) and dollar sign ($) characters are treated as anchors in regular expressions. This means that they cause the match to occur only if the regular expression is found at the beginning of the line (^) or at the end of the line ($).

Here we search our lists of files for the string zip at the beginning of the line:

$ grep -h '^zip' dirlist*.txt
zip
zipcloak
zipgrep
zipinfo
zipnote
zipsplit

… the string zip at the end of the line:

$ grep -h 'zip$' dirlist*.txt
gunzip
gzip
funzip
gpg-zip
preunzip
prezip
unzip
zip

… and the string zip at both the beginning and the end of the line:

$ grep -h '^zip$' dirlist*.txt
zip
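Putting the two anchors right next to each other, with nothing in between, matches blank lines. A minimal sketch on made-up input, counting the empty lines with the -c option:

```shell
# Five lines of sample text, two of them empty (hypothetical input).
text=$(printf '%s\n' 'first' '' 'second' '' 'third')

# '^$' means: start of line immediately followed by end of line.
blank_count=$(printf '%s\n' "$text" | grep -c '^$')

echo "blank lines: $blank_count"
```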

Bracket Expressions

In addition to matching any character at a given position in our regular expression, we can also match a single character from a specified set of characters by using bracket expressions.

In this example, using a two-character set, we match any line that contains the string bzip or gzip:

$ grep -h '[bg]zip' dirlist*.txt
bzip2
bzip2recover
gzip

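We can also search for names that contain the string zip preceded by any character except b or g:

$ grep -h '[^bg]zip' dirlist*.txt

Among the names seen so far, this matches gunzip, funzip, gpg-zip, preunzip, prezip, and unzip, but not bzip2, bzip2recover, or gzip. Plain zip is not matched either, because a negated set still requires a character at that position.

The caret invokes negation only when it is the first character inside the bracket expression; anywhere else in the set it is an ordinary character.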

If we wanted to search for files with the filename containing an uppercase letter (no matter which), we would do this:

$ grep -h '[A-Z]' dirlist*.txt

This, on the other hand, will match every filename containing a dash, an uppercase A, or an uppercase Z:

$ grep -h '[-AZ]' dirlist*.txt
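The difference between the range and the three-character set is easy to check on a few made-up names. This is only a sketch; LC_ALL=C is set as an assumption to keep the meaning of the A-Z range independent of the local language settings.

```shell
# Hypothetical names standing in for the directory listings.
names=$(printf '%s\n' gpg-zip MAKEDEV Zebra xargs)

# [A-Z] is a range: any uppercase letter from A through Z.
range=$(printf '%s\n' "$names" | LC_ALL=C grep '[A-Z]')

# [-AZ] is a set of exactly three characters: dash, A and Z.
# The dash is literal because it comes first in the set.
dash_set=$(printf '%s\n' "$names" | LC_ALL=C grep '[-AZ]')

echo "range:"; echo "$range"
echo "set:";   echo "$dash_set"
```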

Character Classes

Character classes provide convenient names for common ranges of characters. They also work in shell pathname expansion: this command lists only the files whose names begin with an uppercase letter:

MacBook-Air:~ bbettendorf$ ls /usr/sbin/[[:upper:]]*
/usr/sbin/AppleFileServer	/usr/sbin/DirectoryService	/usr/sbin/NetBootClientStatus
/usr/sbin/BootCacheControl	/usr/sbin/FileStatsAgent	/usr/sbin/PasswordService
/usr/sbin/DevToolsSecurity	/usr/sbin/KernelEventAgent	/usr/sbin/WirelessRadioManagerd

These classes are useful when searching. Within a regular expression they are written inside a bracket expression, for example [[:digit:]]:

  • [:alnum:] for alphanumeric characters – equivalent to [A-Za-z0-9]
  • [:alpha:] for alphabetic characters – equivalent to [A-Za-z]
  • [:digit:] for numerals 0 through 9
  • [:lower:] for lowercase letters
  • [:upper:] for uppercase letters
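The same classes work in grep. A short sketch on a made-up list of names; note the doubled brackets, since the class name goes inside a bracket expression:

```shell
# Hypothetical names standing in for the directory listings.
names=$(printf '%s\n' bzip2 gzip X11 zip)

# Names that contain a digit anywhere.
with_digit=$(printf '%s\n' "$names" | grep '[[:digit:]]')

# Names that begin with an uppercase letter.
starts_upper=$(printf '%s\n' "$names" | grep '^[[:upper:]]')

echo "with digit:";   echo "$with_digit"
echo "starts upper:"; echo "$starts_upper"
```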

Quantifiers

Extended regular expressions support several ways to specify the number of times an element is matched.

? : Match an element zero times or one time

* : Match an element zero or more times

+ : Match an element one or more times

{} : Match an element a specific number of times
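To get a feeling for the first three quantifiers, here is a minimal sketch that applies each of them to the letter b in three made-up strings:

```shell
# Sample strings with zero, one and two b's (hypothetical input).
samples=$(printf '%s\n' ac abc abbc)

# b? : zero or one b
optional=$(printf '%s\n' "$samples" | grep -E '^ab?c$')

# b* : zero or more b's
star=$(printf '%s\n' "$samples" | grep -E '^ab*c$')

# b+ : one or more b's
plus=$(printf '%s\n' "$samples" | grep -E '^ab+c$')

echo "?: $(echo $optional)"
echo "*: $(echo $star)"
echo "+: $(echo $plus)"
```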

The { and } metacharacters are used to express minimum and maximum numbers of required matches. They may be specified in four possible ways:

{n} : Match the preceding element if it occurs exactly n times.

{n,m} : Match the preceding element if it occurs at least n times, but no more than m times.

{n,} : Match the preceding element if it occurs n or more times.

{,m} : Match the preceding element if it occurs no more than m times.
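The first three forms can be tried out on strings made of one to four a's. This is just a sketch with made-up input; the anchors make sure the whole line has to fit the count.

```shell
# Lines consisting of one to four a's (hypothetical input).
samples=$(printf '%s\n' a aa aaa aaaa)

# {2}: exactly two
exactly_two=$(printf '%s\n' "$samples" | grep -E '^a{2}$')

# {2,3}: at least two, at most three
two_to_three=$(printf '%s\n' "$samples" | grep -E '^a{2,3}$')

# {3,}: three or more
three_or_more=$(printf '%s\n' "$samples" | grep -E '^a{3,}$')

echo "{2}:   $(echo $exactly_two)"
echo "{2,3}: $(echo $two_to_three)"
echo "{3,}:  $(echo $three_or_more)"
```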

Say we wanted to search a text file for phone numbers written as (nnn) nnn-nnnn or nnn nnn-nnnn. We could use this expression (phonelist.txt is a placeholder filename):

$ grep -E '^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$' phonelist.txt
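A quick way to try a phone number pattern is to feed it a few candidates through a pipe. The sketch below assumes a pattern in which both parentheses are optional and the whole line is anchored; the sample numbers are made up.

```shell
# Assumed pattern: optional "(", three digits, optional ")", a space,
# three digits, a dash, four digits, and nothing else on the line.
pattern='^\(?[0-9]{3}\)? [0-9]{3}-[0-9]{4}$'

# Two well-formed and two malformed candidates (hypothetical data).
candidates=$(printf '%s\n' '(555) 123-4567' '555 123-4567' \
                           'AAA 123-4567' '(555) 123-45678')

valid=$(printf '%s\n' "$candidates" | grep -E "$pattern")

echo "$valid"
```

Only the first two candidates get through: the third has letters where digits are required, and the last has one digit too many before the end of the line.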

By the way, the -E option used here enables extended regular expressions (ERE), one of two variations of the pattern syntax. The other is basic regular expressions (BRE), which grep uses by default.

The difference between basic and extended regular expressions is in the behavior of a few special characters: ‘?’, ‘+’, parentheses, braces (‘{}’), and ‘|’. With basic (BRE) syntax, these characters do not have special meaning unless prefixed with a backslash (‘\’) – while with extended (ERE) syntax it is reversed: these characters are special unless they are prefixed with backslash (‘\’).
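The braces make a handy test case for this difference. A minimal sketch on made-up input, writing the same "exactly two a's" pattern once in each syntax, plus the unescaped BRE form for comparison:

```shell
# ERE (-E): plain braces are a quantifier.
ere=$(echo 'aab' | grep -E '^a{2}b$')

# BRE (the default): the braces must be escaped to act as a quantifier.
bre=$(echo 'aab' | grep '^a\{2\}b$')

# BRE with unescaped braces: they are ordinary characters, so the
# pattern only matches a line that literally contains "a{2}b".
literal=$(echo 'a{2}b' | grep '^a{2}b$')

echo "ERE:     $ere"
echo "BRE:     $bre"
echo "literal: $literal"
```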

For further practise, here is a regular expression that will match only lines consisting of groups of one or more alphabetic characters separated by single spaces:

$ grep -E '^([[:alpha:]]+ ?)+$'
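Since that command names no file, grep reads from standard input, so we can pipe a few made-up lines into it and see which ones survive. A small sketch:

```shell
pattern='^([[:alpha:]]+ ?)+$'

# Two lines that fit the description and two that do not
# (hypothetical input): one contains a digit, one a double space.
lines=$(printf '%s\n' 'This that' 'a b c' 'a b 9' 'abc  d')

accepted=$(printf '%s\n' "$lines" | grep -E "$pattern")

echo "$accepted"
```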

Alternation

Another extended regular expression feature is called alternation, which is the facility that allows a match to occur from among a set of expressions.

To demonstrate, we’ll use grep in conjunction with echo. We pipe the output of echo into grep and see the result.

$ echo "AAA" | grep -E 'AAA|BBB'
AAA

The regular expression ‘AAA|BBB’ means “match either the string AAA or the string BBB”. We enclosed the regular expression in quotes to prevent the shell from interpreting the vertical pipe metacharacter as a pipe operator.

To combine alternation with other regular-expression elements, we can use () to separate the alternation:

$ grep -Eh '^(bz|gz|zip)' dirlist*.txt

This expression will match the filenames in our lists that start with either bz, gz, or zip.

If we leave off the parentheses, the meaning of this regular expression changes to match any filename that begins with bz or contains gz or contains zip:

$ grep -Eh '^bz|gz|zip' dirlist*.txt
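The effect of the parentheses shows up clearly on a handful of made-up names (a stand-in for the directory listings). A minimal sketch:

```shell
# Hypothetical names standing in for the directory listings.
names=$(printf '%s\n' bzip2 gzip zipinfo gunzip pigz)

# Grouped: the anchor applies to all three alternatives.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')

# Ungrouped: the anchor belongs to bz only; gz and zip may
# appear anywhere in the line.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')

echo "grouped:";   echo "$grouped"
echo "ungrouped:"; echo "$ungrouped"
```

With the group, gunzip and pigz are dropped because they merely contain zip or gz; without it, all five names match.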

Well, that’s it for today! The next blog post, probably the last one in this series, will follow soon; in it we will look at many tools for manipulating text on the command line.

. . . . . . . . . . . . . .

Thank you for reading! I hope you enjoyed reading this article, and I am always happy to get critical and friendly feedback, or suggestions for improvement!