Regular Expressions and Sed & Awk
Regular expressions
• Key to powerful, efficient, and flexible text processing by allowing for variable information in the search patterns
• Defined as a string composed of letters, numbers, and special symbols, that defines one or more strings
• You have already used them in selecting files when you used asterisk (*) and question mark characters to select filenames
• Used by several Unix utilities such as ed, vi, emacs, grep, sed, and awk to search for and replace strings
– Checking the author, subject, and date of each message in a given mail folder
egrep "^(From|Subject|Date): " <folder>
– The quotes above are not a part of the regular expression but are needed by the command shell
– The metacharacter | (or) is a convenient one to combine multiple expressions into a single expression to match any
of the individual expressions contained therein
∗ The subexpressions are known as alternatives
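The header search above can be tried on a small sample. This is a minimal sketch: the mail folder contents are invented for illustration, and grep -E is used as the modern spelling of egrep.

```shell
# Hypothetical mail folder, created only for this illustration.
folder=$(mktemp)
cat > "$folder" <<'EOF'
From: alice@example.com
Subject: Meeting notes
Date: Mon, 3 Mar 2025 10:00:00 -0600
To: bob@example.com

The Subject: line above is a header; this line is body text.
EOF

# ^ anchors the match to the start of the line; | separates the alternatives.
headers=$(grep -E '^(From|Subject|Date): ' "$folder")
printf '%s\n' "$headers"
rm -f "$folder"
```

Only the three header lines are printed; the body line mentions Subject: but not at the beginning of the line.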
• A regular expression is composed of characters, delimiters, simple strings, special characters, and other metacharacters
defined below
• Characters
– A character is any character on the keyboard except the newline character ’\n’
– Most characters represent themselves within a regular expression
– All the characters that represent themselves are called literals
– A special character is one that does not represent itself (such as a metacharacter) and needs to be quoted
∗ The metacharacters in the example above (with egrep) are ^, (, |, and ); the double quotes are special to the shell, not to the regular expression
– We can treat the regular expressions as a language in which the literal characters are the words and the metacharacters are the grammar
• Delimiters
– A delimiter is a character to mark the beginning and end of a regular expression
– Delimiter is always a special character for the regular expression being delimited
– The delimiter does not represent itself but marks the beginning and end of the regular expression
– Any character can be used as a delimiter as long as it (the same character) appears at both ends of the regular
expression
– More often than not, people use forward slash ‘/’ as the delimiter (guess why)
– If the second delimiter is to be immediately followed by a carriage return, it may be omitted
– Delimiters are not used with the grep family of utilities
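The choice of delimiter matters most when the pattern itself contains slashes. The following sketch uses the sed substitute command with two different delimiters; the path is a made-up example.

```shell
path='/usr/local/bin/tool'

# With / as the delimiter, every slash in the pattern must be quoted.
with_slash=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With | as the delimiter, the slashes are ordinary literals.
with_bar=$(printf '%s\n' "$path" | sed 's|/usr/local|/opt|')

printf '%s\n%s\n' "$with_slash" "$with_bar"
```

Both commands produce the same result; the second is simply easier to read.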
• The metacharacters in the regular expressions are
^ $ . * [ ] \{ \} \ \( \)
– In addition, the following metacharacters have been added to the above for extended regular expressions (such as
the one used by egrep)
+ ? | ( )
– The dash (-) is considered to be a metacharacter only within the square brackets to indicate a range; otherwise, it is
treated as a literal
∗ Even in this case, the dash cannot be the first character and must be enclosed between the beginning and the
end of range characters
• The regular expression search is not done on a word basis but utilities like egrep display the entire line in which the
regular expression matches
• Simple strings
∗ Examples
Reg. Exp.    Matches                            Examples
/^T/         a T at the beginning of a line     This line ...
                                                That time...
/^+[0-9]/    a plus sign followed by a number   +5 + 45.72
             at the beginning of a line         +759 Keep this...
/:$/         a colon that ends a line           ...below:
– Quoting special characters
∗ Any special character, except a digit or a parenthesis, can be quoted by preceding it with a backslash
∗ Quoting a special character makes it represent itself
∗ Examples
Reg. Exp.    Matches                            Examples
/end\./      all strings that contain end       The end.
             followed by a period               send.
                                                pretend.mail
/\\/         a single backslash                 \
/\*/         an asterisk                        *.c
                                                an asterisk (*)
/\[5\]/      [5]                                it was five [5]
/and\/or/    and/or                             and/or
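The effect of quoting can be checked with grep, which takes the regular expression without delimiters. This is a small sketch on invented input lines.

```shell
text=$(printf '%s\n' 'The end.' 'endless' 'a*b' 'ab')

# Quoted period: matches only a literal "." after "end".
quoted_dot=$(printf '%s\n' "$text" | grep 'end\.')

# Unquoted period: matches any character after "end".
unquoted_dot=$(printf '%s\n' "$text" | grep 'end.')

# Quoted asterisk: matches a literal "*" between a and b.
quoted_star=$(printf '%s\n' "$text" | grep 'a\*b')

printf '%s\n' "$quoted_dot" "--" "$unquoted_dot" "--" "$quoted_star"
```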
– Range metacharacters
∗ Used to match a number of expressions
∗ Described by the following rules
r\{n\} Match exactly n occurrences of regular expression r
r\{n,\} Match at least n occurrences of regular expression r
r\{n,m\} Match between n and m occurrences of regular expression r
Both n and m above must be integers between 0 and 255
For now, r must be considered to be a single character regular expression (strings must be enclosed in bracketed
regular expressions)
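The three forms of the range metacharacters can be compared side by side. This sketch uses grep with basic regular expressions on invented lines; the ^ anchor keeps the count from matching inside a longer run of a's.

```shell
lines=$(printf '%s\n' 'ab' 'aab' 'aaab' 'aaaab')

# Exactly two a's before the b.
exactly_two=$(printf '%s\n' "$lines" | grep '^a\{2\}b')

# At least two a's before the b.
at_least_two=$(printf '%s\n' "$lines" | grep '^a\{2,\}b')

# Between two and three a's before the b.
two_to_three=$(printf '%s\n' "$lines" | grep '^a\{2,3\}b')

printf '%s\n' "$exactly_two" "--" "$at_least_two" "--" "$two_to_three"
```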
– Word metacharacters
∗ The word boundaries in the regular expressions are denoted by any whitespace character, period, end-of-line,
or beginning of line
∗ Expressed by
\< beginning of word
\> end of word
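The word metacharacters can be tried with grep. This sketch assumes a grep that supports \< and \> (GNU grep does; they are an extension rather than part of the POSIX basic syntax), and the input lines are invented.

```shell
lines=$(printf '%s\n' 'the cat' 'other things' 'bathe daily' 'see the.')

# Match "the" only when it stands alone as a word.
whole_word=$(printf '%s\n' "$lines" | grep '\<the\>')

# Without the word metacharacters, "the" also matches inside other words.
anywhere=$(printf '%s\n' "$lines" | grep 'the')

printf '%s\n' "$whole_word" "--" "$anywhere"
```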
• Rules
/mike/
:s//robert
• Bracketing expressions
– Regular expressions can be bracketed by quoted parentheses \( and \)
– Quoted parentheses are also known as tagged metacharacters
– The string matching the bracketed regular expression can be subsequently used as quoted digits
– The regular expression does not attempt to match quoted parentheses
– A regular expression within the quoted parentheses matches exactly with what the regular expression without the
quoted parentheses will match
– The expressions /\(rexp\)/ and /rexp/ match the same patterns
– Quoted digits
∗ Within the regular expression, a quoted digit (\n) takes on the value of the string that the regular expression
beginning with the nth \( matched
∗ Assume a list of people in the format
last-name, first-name initial
∗ It can be changed to the format
first-name initial last-name
by the following vi command
:%s/\([^,]*\), \(.*\)/\2 \1/
– Quoted parentheses can be nested
∗ There is no ambiguity in identifying the nested quoted parentheses as they are identified by the opening \(
∗ Example
/\([a-z]\([A-Z]*\)x\)
matches
3 t dMNORx7 l u
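The same substitutions work in sed, which makes them easy to try from the shell. This is a sketch with invented names; the second command reuses the nested example and marks what each quoted digit captured.

```shell
names=$(printf '%s\n' 'Kernighan, Brian W' 'Aho, Alfred V')

# \1 is the last name (everything before the comma), \2 is the rest.
swapped=$(printf '%s\n' "$names" | sed 's/\([^,]*\), \(.*\)/\2 \1/')
printf '%s\n' "$swapped"

# Nested quoted parentheses are numbered by their opening \(.
nested=$(printf '%s\n' '3 t dMNORx7 l u' |
    sed 's/\([a-z]\([A-Z]*\)x\)/<\1|\2>/')
printf '%s\n' "$nested"
```

In the nested case \1 is the whole match dMNORx and \2 is the inner run of capitals MNOR.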
• Replacement string
– vi and sed use regular expressions as search strings with the substitute command
– Ampersands (&) and quoted digits (\n) can be used within the replacement string to refer to the text matched by the search string
– An ampersand takes on the value of the string that the search string matched
– Example
:s/[0-9][0-9]*/Number &/
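The ampersand behaves the same way in sed. A minimal sketch on an invented line, first replacing only the first match and then, with the g flag, every match:

```shell
line='room 42 floor 7'

# & stands for whatever [0-9][0-9]* matched.
first_only=$(printf '%s\n' "$line" | sed 's/[0-9][0-9]*/Number &/')
every_match=$(printf '%s\n' "$line" | sed 's/[0-9][0-9]*/Number &/g')

printf '%s\n%s\n' "$first_only" "$every_match"
```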
• Redundancy
– You can write the same regular expression in more than one way
– To search for strings grey and gray in a document, you can write the expression as gr[ae]y, or grey|gray,
or gr(a|e)y
∗ In the last case, the parentheses are required; without them, the expression gra|ey matches gra or ey, which is not
the intention
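The equivalence of the three forms, and the effect of dropping the parentheses, can be checked with grep -E. The word list is invented for illustration.

```shell
words=$(printf '%s\n' 'grey' 'gray' 'gravy' 'groy')

bracket=$(printf '%s\n' "$words" | grep -E 'gr[ae]y')
alternation=$(printf '%s\n' "$words" | grep -E 'grey|gray')
grouped=$(printf '%s\n' "$words" | grep -E 'gr(a|e)y')

# Without parentheses the alternatives are "gra" and "ey".
no_parens=$(printf '%s\n' "$words" | grep -E 'gra|ey')

printf '%s\n' "$bracket" "--" "$no_parens"
```

The first three commands select grey and gray; the last one also selects gravy because it contains gra.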
• Regular expressions cannot be used for the newline character
sed
• Stream editor
• Derivative of ed
– Takes a sequence of editor commands
– Goes over the data line by line and performs the commands on each line
• Basic syntax
• The commands are applied from the list in order to each line and the edited form is written to stdout
• Changing a pattern in the file
• Removing the information from the output of the finger command to get only the user id and login time
• Another way to do it
sed '/rexp/d'
• Automatic printing
– By default, sed prints each line on the stdout
– This can be inhibited by using the -n option as follows
sed -n '/pattern/p'
– Matching conditions can be inverted by the !
sed -n '/pattern/!p'
– The last achieves the same effect as grep -v
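The correspondence between sed -n and grep can be verified directly. A small sketch on invented lines:

```shell
lines=$(printf '%s\n' 'apple pie' 'banana' 'apple tart')

# Print only matching lines, like grep.
matching=$(printf '%s\n' "$lines" | sed -n '/apple/p')
with_grep=$(printf '%s\n' "$lines" | grep 'apple')

# Print only non-matching lines, like grep -v.
inverted=$(printf '%s\n' "$lines" | sed -n '/apple/!p')
with_grep_v=$(printf '%s\n' "$lines" | grep -v 'apple')

printf '%s\n' "$matching" "--" "$inverted"
```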
• Inserting newlines
– Converting a document from single space to double space
$ sed 's/$/\
> /'
– Creating a list of words used in the document (-> in the bracket expressions stands for a tab character)
$ sed 's/[ ->][ ->]*/\
> /g' file
– Counting the unique words used in the document
$ sed 's/[ ->,.][ ->,.]*/\
> /g' file | sort | uniq | wc -l
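The word-counting pipeline can be run on a short string instead of a file. This is a sketch under simplifying assumptions: the bracket expression contains only space, comma, and period, and the input text is invented.

```shell
text='the cat, the dog. the end'

# Replace each run of separators with a newline (backslash followed by
# a real line break inside the sed script).
words=$(printf '%s\n' "$text" | sed 's/[ ,.][ ,.]*/\
/g')

# sort groups duplicates together so that uniq can drop them.
unique_count=$(printf '%s\n' "$words" | sort | uniq | wc -l | tr -d ' ')

printf '%s\n' "$words"
printf 'unique words: %s\n' "$unique_count"
```

The six words in the text reduce to four distinct ones: cat, dog, end, and the.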
• Writing on multiple files
• Line numbering
– Line numbers can be used to select a range of lines over which the commands will operate
– Examples
$ sed -n '20,30p'
$ sed '1,10d'
$ sed '1,/^$/d'
$ sed -n '/^$/,/^end/p'
– sed does not support relative line numbers (difference with respect to ed)
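Line addresses and pattern addresses can be mixed in a range. The sketch below applies the same kinds of ranges to a six-line invented message (two header lines, a blank line, two body lines, and a line starting with end).

```shell
msg=$(printf '%s\n' 'To: bob' 'Subject: hi' '' 'body one' 'body two' 'end')

# Lines 4 through 5 only.
by_number=$(printf '%s\n' "$msg" | sed -n '4,5p')

# Delete from line 1 through the first empty line (strips the header).
no_header=$(printf '%s\n' "$msg" | sed '1,/^$/d')

# Print from the first empty line through the line starting with "end".
blank_to_end=$(printf '%s\n' "$msg" | sed -n '/^$/,/^end/p')

printf '%s\n' "$by_number" "--" "$no_header"
```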
awk
• Acronym for the last names of its designers – Aho, Weinberger, Kernighan
• Not as concise as sed for simple editing, but includes arithmetic, variables, built-in functions, and a programming language like C;
it follows a more general processing model than a text editor
• Looks more like a programming language rather than a text editor
• Mostly used for formatting reports, data entry, and data retrieval to generate reports
• awk is easier to use than sed but is slower
• Usage is
pattern { action }
pattern { action }
...
• awk reads one line in the file at a time, compares with each pattern, and performs the corresponding action if the pattern
matches
• Just like sed, awk does not alter its input files
• The patterns in awk can be regular expressions, or C-like conditions
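The two kinds of patterns can be illustrated on a few invented records. This is a minimal sketch; it relies on awk splitting each line into whitespace-separated fields $1, $2, and so on, and on the default action (print the line) when no action is given.

```shell
data=$(printf '%s\n' 'alice 30' 'bob 17' 'carol 45')

# Regular expression pattern, no action: matching lines are printed.
regex_match=$(printf '%s\n' "$data" | awk '/^b/')

# C-like condition with an explicit action.
adults=$(printf '%s\n' "$data" | awk '$2 >= 18 { print $1 }')

printf '%s\n' "$regex_match" "--" "$adults"
```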
• grep can be written in awk as
• Just like sed, the awk_script can be presented to awk from a file by using
• Fields
• Printing
– The number of the current input line (or record) is tracked by the built-in variable NR
– The entire input record is contained in the variable $0
– To add line numbers to each line, you can use the following
awk '{print NR, $0}' filename
– Expressions separated by commas in a print statement are printed separated by the output field separator, a single space by default
– Complete control of the output format can be achieved by using printf instead of print as follows
awk '{ printf "%4d %s\n", NR, $0 }' filename
– printf in awk is almost identical to the corresponding C function
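The difference between print and printf shows up in the spacing of the line numbers. A small sketch with invented input, reading from a pipe rather than a file:

```shell
lines=$(printf '%s\n' 'first' 'second')

# print separates NR and $0 with a single space.
numbered=$(printf '%s\n' "$lines" | awk '{ print NR, $0 }')

# printf right-justifies the number in a field four characters wide.
formatted=$(printf '%s\n' "$lines" | awk '{ printf "%4d %s\n", NR, $0 }')

printf '%s\n' "$numbered" "$formatted"
```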
• Patterns
– Checking for people who do not have a password entry in the file /etc/passwd
awk -F: '$2 == ""' /etc/passwd
– Checking for people who have a locked password entry
awk -F: '$2 == "*"' /etc/passwd
– Other ways to check for empty string
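Several equivalent tests for an empty second field are sketched below. These are common awk idioms and not necessarily the ones the notes had in mind; the passwd-style records are invented so that the example does not depend on the real /etc/passwd.

```shell
passwd=$(printf '%s\n' \
    'root:x:0:0:root:/root:/bin/sh' \
    'guest::1001:1001:Guest:/home/guest:/bin/sh' \
    'daemon:*:1:1:daemon:/:/usr/sbin/nologin')

by_equality=$(printf '%s\n' "$passwd" | awk -F: '$2 == ""')
by_regex=$(printf '%s\n' "$passwd" | awk -F: '$2 ~ /^$/')        # matches the empty string
by_negation=$(printf '%s\n' "$passwd" | awk -F: '$2 !~ /./')     # contains no character
by_length=$(printf '%s\n' "$passwd" | awk -F: 'length($2) == 0')

# The locked entry, for comparison.
locked=$(printf '%s\n' "$passwd" | awk -F: '$2 == "*" { print $1 }')

printf '%s\n' "$by_equality" "$locked"
```

All four tests select the guest record only.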
If the section number is not specified, the output will be for the user command from section 1
– The macros for man are discussed in section 7 of the manual and can be invoked by
man 7 man
– Usual device driver man pages are user-level descriptions and not internal descriptions
– A regular joke was “Anyone needing documentation to the kernel functions probably shouldn’t be using them.”
– /* you are not expected to understand this */ – from Unix V6 kernel source
– The manual page is laid out as per the specifications in the man macro of troff
∗ Any text argument may be zero to six words
∗ Quotes can be used to include the space character in a “word”
∗ Some native nroff conventions are followed, for example, if text for a command is empty, the command is
applied to the next line
A line starting with .I and with no other inputs italicizes the next line
∗ The prevailing indentation distance is remembered between successive paragraphs but not across sections
– The basic layout of a man page is described by
.TH COMMAND <section-number>
.SH NAME
command \- brief description of function
.SH SYNOPSIS
.B command
options
.SH DESCRIPTION
Detailed explanation of programs and options.
Paragraphs are introduced by .PP
.PP
This is a new paragraph.
.SH FILES
Files used by the command, e.g., passwd(1) mentions /etc/passwd
.SH "SEE ALSO"
References to related documents, including other manual pages
.SH DIAGNOSTICS
Description of any unusual output (e.g., see cmp(1))
.SH BUGS
Surprising features (not always bugs)
– If any section is empty, its header is omitted
– The .TH line and the NAME, SYNOPSIS, and DESCRIPTION sections are mandatory
– The .TH line
∗ Begins a reference page
∗ The full macro is described by
.TH command section date_last_changed left_page_footer center_header
∗ Sets prevailing indent and tabs to 0.5”
– The .SH lines
∗ Section headers
∗ Identify sections of the manual page
∗ NAME and SYNOPSIS sections are special; other sections contain ordinary prose
∗ NAME section
· Names the command (in lower case)
· Provides a one-line description of it
∗ SYNOPSIS section
· Names the options, but does not describe them
· The input is free form
· Font changes can be described with the .B, .I, and .R macros
· The name and options are bold while the rest of the information is in roman
∗ DESCRIPTION section
· Describes the command and its options
· It tells the usage of the command
· The man page for cc(1) describes how to invoke the compiler and optimizer and where the output goes, but it is
not a reference manual for the language
· The reference page can be cited in the SEE ALSO section
· However, man(7) is the description of the language of manual macros
· Command names and tags for options are printed in italics, using the macros .I (print first argument in
italics) and .IR (print first argument in italic, second in roman)
∗ FILES section
· Mentions any files implicitly used by the commands
∗ DIAGNOSTICS section
· Optional section and generally not present
· Reports any unusual output produced by the command
· May contain diagnostic messages, exit statuses, or surprising variations of the command’s normal behavior
∗ BUGS section
· Could be called LIMITATIONS
· Reports shortcomings in the program that may need to be fixed in a future release
– Other requests and macros for man
.IP x Indented paragraph with a tag x
.LP Left-aligned paragraph
.PP Same as .LP
.SS Section subheading
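The layout and macros above can be put together into a small page. This is a sketch for a hypothetical command named hello; the command, its option, and its text are invented. The shell writes the page to a temporary file and lists its section headers; where nroff is installed, the file could be formatted with nroff -man.

```shell
page=$(mktemp)
cat > "$page" <<'EOF'
.TH HELLO 1
.SH NAME
hello \- print a greeting
.SH SYNOPSIS
.B hello
[
.I name
]
.SH DESCRIPTION
.I hello
prints a greeting to the named person.
.PP
With no argument it greets the world.
.SH "SEE ALSO"
echo(1)
.SH BUGS
The greeting cannot be changed.
EOF

# List the section headers in the order they appear.
sections=$(grep '^\.SH' "$page" | sed 's/^\.SH //')
printf '%s\n' "$sections"
rm -f "$page"
```

The FILES and DIAGNOSTICS sections are omitted because they would be empty for this command.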