Structure of AWK programs

An AWK program is a series of pattern-action pairs, written as:

pattern { action }

where pattern is typically an expression and action is a series of commands. Each line of input is tested against all the patterns in turn, and the action is executed for each pattern whose expression is true. Either the pattern or the action may be omitted. The pattern defaults to matching every line of input. The default action is to print the line of input.

In addition to a simple AWK expression, the pattern can be BEGIN or END, causing the action to be executed before or after all lines of input have been read, or pattern1, pattern2, which matches the range of lines of input starting with a line that matches pattern1 up to and including the line that matches pattern2, before again trying to match against pattern1 on future lines.

In addition to normal arithmetic and logical operators, AWK expressions include the tilde operator, ~, which matches a regular expression against a string. As a handy default, /regexp/ without the tilde operator matches against the current line of input.

AWK commands

AWK commands are the statements that are substituted for action in the examples above. AWK commands can include function calls, variable assignments, calculations, or any combination thereof. AWK contains built-in support for many functions; many more are provided by the various flavors of AWK. Also, some flavors support the inclusion of dynamically linked libraries, which can also provide more functions.

For brevity, the enclosing curly braces ( { } ) will be omitted from these examples.

The print command

The print command is used to output text. The output text is always terminated with a predefined string called the output record separator (ORS), whose default value is a newline. The simplest form of this command is:

print

This displays the contents of the current line. In AWK, lines are broken down into fields, and these can be displayed separately: print $1 displays the first field of the current line, and print $1, $3 displays the first and third fields, separated by the output field separator (OFS), whose default value is a single space.
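The forms of print described above can be tried from a shell. This is a minimal sketch, assuming a POSIX sh and an awk on the PATH; the sample line and the shell variable names (line, whole, first, pair, custom) are made up for illustration.

```shell
# One sample input line with three whitespace-separated fields.
line='alpha beta gamma'

whole=$(printf '%s\n' "$line" | awk '{ print }')          # same as print $0
first=$(printf '%s\n' "$line" | awk '{ print $1 }')       # first field only
pair=$(printf '%s\n' "$line" | awk '{ print $1, $3 }')    # joined by OFS (a space)
custom=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "-" } { print $1, $3 }')

printf '%s\n' "$whole" "$first" "$pair" "$custom"
# alpha beta gamma
# alpha
# alpha gamma
# alpha-gamma
```

Setting OFS in a BEGIN action changes only how the comma in print joins its arguments; the input is still split on white space.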
Although these fields ($X) may resemble variables (the $ symbol indicates variables in Perl), they actually refer to the fields of the current line. A special case, $0, refers to the entire line. In fact, the commands "print" and "print $0" are identical in functionality.

The print command can also display the results of calculations and/or function calls:

print 3+2
print foobar(3)
print foobar(variable)
print sin(3-2)

Output may be sent to a file:

print "expression" > "file name"

or through a pipe:

print "expression" | "command"

Variables and Syntax

Variable names can use any of the characters [A-Za-z0-9_], but cannot begin with a digit and cannot be language keywords. The operators + - * / represent addition, subtraction, multiplication, and division, respectively. For string concatenation, simply place two variables (or string constants) next to each other. A space in between is optional if string constants are involved, but two variable names cannot be placed adjacent to each other without a space in between. String constants are delimited by double quotes. Statements need not end with semicolons. Finally, comments can be added to programs with #; a comment runs from the # to the end of the line.

User-defined functions

In a format similar to C, function definitions consist of the keyword function, the function name, argument names and the function body. Here is an example of a function:

function add_three(number,    temp) {
    temp = number + 3
    return temp
}

This function can be invoked as follows:

print add_three(36)     # Outputs 39

Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is convention to add some whitespace in the argument list before the local variables, in order to indicate where the parameters end and the local variables begin.

Sample applications

Hello World

Here is the ubiquitous "Hello, world!" program written in AWK:

BEGIN { print "Hello, world!" }
Note that an explicit exit statement is not needed here: since the only pattern is BEGIN, awk never reads any input, so no file arguments on the command line are processed.

Print lines longer than 80 characters

Print all lines longer than 80 characters. Note that the default action is to print the current line.

length > 80

The AWK Programming Language now specifies an explicit $0 in the length function:

length($0) > 80

Print a count of words

Count words in the input, and print the number of lines, words, and characters (like wc):

{
    w += NF
    c += length + 1
}
END { print NR, w, c }
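The counting program above can be checked against a small input whose totals are easy to verify by hand. This is a minimal sketch, assuming a POSIX sh and awk; the two-line sample text and the shell variable counts are illustrative only, and length($0) is written out in full for clarity.

```shell
# Two lines, five words; 13 + 1 and 9 + 1 characters including the newlines.
counts=$(printf 'one two three\nfour five\n' |
    awk '{ w += NF; c += length($0) + 1 } END { print NR, w, c }')

echo "$counts"
# 2 5 24
```

The + 1 accounts for the newline that awk strips from each record before the action runs.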
As there is no pattern for the first line of the program, every line of input matches by default, so the increment actions are executed for every line. Note that w += NF is shorthand for w = w + NF.

Sum last word

{ s += $NF }
END { print s + 0 }
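The effect of the s + 0 in the END action is easiest to see by running the program on both a non-empty and an empty input. This is a minimal sketch, assuming a POSIX sh and awk; the sample numbers and the shell variable names are made up for illustration.

```shell
# Last fields are 3, 5 and 10, so the sum is 18.
sum=$(printf '1 2 3\n4 5\nx 10\n' | awk '{ s += $NF } END { print s + 0 }')

# With no input at all, s is never assigned.
empty_coerced=$(awk '{ s += $NF } END { print s + 0 }' < /dev/null)
empty_raw=$(awk '{ s += $NF } END { print s }' < /dev/null)

printf '[%s] [%s] [%s]\n' "$sum" "$empty_coerced" "$empty_raw"
# [18] [0] []
```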
s is incremented by the numeric value of $NF, which is the last word on the line as defined by AWK's field separator, by default white-space. NF is the number of fields in the current line, e.g. 4. Since $4 is the value of the fourth field, $NF is the value of the last field in the line regardless of how many fields this line has, or whether it has more or fewer fields than surrounding lines. $ is actually a unary operator with the highest operator precedence. (If the line has no fields then NF is 0, $0 is the whole line, which in this case is empty apart from possible white-space, and so has the numeric value 0.)

At the end of the input the END pattern matches, so s is printed. However, since there may have been no lines of input at all, in which case no value has ever been assigned to s, it will by default be an empty string. Adding zero to a variable is an AWK idiom for coercing it from a string to a numeric value. (Concatenating an empty string coerces from a number to a string, e.g. s "". Note that there is no operator to concatenate strings; they are just placed adjacently.) With the coercion the program prints 0 on an empty input; without it an empty line is printed.

Match a range of input lines

$ yes Wikipedia | awk 'NR % 4 == 1, NR % 4 == 3 { printf "%6d %s\n", NR, $0 }' | sed 7q
     1 Wikipedia
     2 Wikipedia
     3 Wikipedia
     5 Wikipedia
     6 Wikipedia
     7 Wikipedia
     9 Wikipedia
$
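Range patterns can also be tried on a short fixed input, which makes the start and stop behaviour easier to see than with an endless stream. This is a minimal sketch, assuming a POSIX sh and awk; the START and END marker lines and the shell variable names are hypothetical.

```shell
input='a
START
b
END
c
START
d'

# Each range runs from a START line up to and including the next END line;
# the second range never sees an END, so it runs to the end of input.
ranged=$(printf '%s\n' "$input" | awk '/^START$/, /^END$/')

# A second part that is always false (0) keeps the range open to the end.
to_eof=$(printf '%s\n' "$input" | awk '/^END$/, 0')

printf '%s\n--\n%s\n' "$ranged" "$to_eof"
```

The first command prints START, b, END, START, d; the second prints END, c, START, d.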
By default, the yes command repeatedly prints the letter "y" on a line; here it is told to print the word "Wikipedia" instead. The action statement prints each line numbered. The printf function emulates the standard C printf, and works similarly to the print command described above.

The pattern to match works as follows: NR is the number of records, typically lines of input, AWK has so far read, i.e. the current line number, starting at 1 for the first line of input. % is the modulo operator. NR % 4 == 1 is true for the first, fifth, ninth, etc., lines of input. Likewise, NR % 4 == 3 is true for the third, seventh, eleventh, etc., lines of input. The range pattern is false until the first part matches, on line 1, and then remains true up to and including when the second part matches, on line 3. It then stays false until the first part matches again on line 5.

The sed command is used to print the first 7 lines, to prevent yes running forever. It is equivalent to head -n 7 if the head command is available.

The first part of a range pattern being constantly true, e.g. 1, can be used to start the range at the beginning of input. Similarly, if the second part is constantly false, e.g. 0, the range continues until the end of input:

/^--cut here--$/, 0

prints lines of input from the first line matching the regular expression ^--cut here--$, that is, a line containing only the phrase "--cut here--", to the end.

Calculate word frequencies

Word frequency, using associative arrays:

BEGIN { FS = "[^a-zA-Z]+" }
{
    for (i = 1; i <= NF; i++)
        words[tolower($i)]++
}
END {
    for (i in words)
        print i, words[i]
}
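The word-frequency program can be exercised on a couple of lines of text. This is a minimal sketch, assuming a POSIX sh, awk and sort. Two things here are additions for the sake of a reproducible demonstration and are not part of the program above: the `$i != ""` guard, which skips the empty field produced when a line starts or ends with a separator, and the final sort, which is needed because for (w in words) visits subscripts in an unspecified order.

```shell
freq=$(printf 'The cat and the hat.\nThe end\n' |
    awk 'BEGIN { FS = "[^a-zA-Z]+" }
         {
             for (i = 1; i <= NF; i++)
                 if ($i != "")              # added guard: skip empty fields
                     words[tolower($i)]++
         }
         END { for (w in words) print w, words[w] }' |
    LC_ALL=C sort)

printf '%s\n' "$freq"
# and 1
# cat 1
# end 1
# hat 1
# the 3
```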
The BEGIN block sets the field separator to any sequence of non-alphabetic characters. Note that separators can be regular expressions. After that, we get to a bare action, which is performed on every input line. In this case, for every field on the line, we add one to the number of times that word, first converted to lowercase, appears. Finally, in the END block, we print the words with their frequencies. The line

for (i in words)

creates a loop that goes through the array words, setting i to each subscript of the array. This is different from most languages, where such a loop goes through each value in the array. It means that each word can be printed together with its count in a simple way. tolower was an addition to the One True awk (see below) made after the book was published.

Match pattern from command line

This program can be represented in several ways. The first one uses the Bourne shell to make a shell script that does everything. It is the shortest of these methods:

$ cat grepinawk
pattern=$1
shift
awk '/'$pattern'/ { print FILENAME ":" $0 }' "$@"
$
The $pattern in the awk command is not protected by quotes: it sits outside the single-quoted parts, so the shell substitutes its value into the program text before awk runs. A pattern by itself in the usual way checks to see if the whole line ($0) matches. FILENAME contains the current filename. awk has no explicit concatenation operator; two adjacent strings are simply concatenated. $0 expands to the original unchanged input line.

There are alternate ways of writing this. This shell script accesses the environment directly from within awk:

$ cat grepinawk
pattern=$1
export pattern
shift
awk '$0 ~ ENVIRON["pattern"] { print FILENAME ":" $0 }' "$@"
$
This is a shell script that uses ENVIRON, an array introduced in a newer version of the One True awk after the book was published. The subscript of ENVIRON is the name of an environment variable; its result is the variable's value. This is like the getenv function in various standard libraries and POSIX. The shell script makes an environment variable pattern containing the first argument, then drops that argument and has awk look for the pattern in each file.

~ checks to see if its left operand matches its right operand; !~ is its inverse. Note that a regular expression is just a string and can be stored in variables.

The next way uses command-line variable assignment, in which an argument to awk can be seen as an assignment to a variable:

$ cat grepinawk
pattern=$1
shift
awk '$0 ~ pattern { print FILENAME ":" $0 }' "pattern=$pattern" "$@"
$
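An assignment given as an operand, as in the script above, is processed only when awk reaches it in the argument list, so the variable is not yet set while BEGIN actions run. POSIX awk also accepts a -v option whose assignments take effect before BEGIN. The difference can be sketched as follows, assuming a POSIX sh and awk; the sample input and shell variable names are illustrative.

```shell
# Both forms make the variable available to main rules.
with_v=$(printf 'apple\nbanana\ncherry\n' |
    awk -v pattern='an' '$0 ~ pattern { print }')
as_operand=$(printf 'apple\nbanana\ncherry\n' |
    awk '$0 ~ pattern { print }' pattern='an' -)

# Only -v is visible inside BEGIN; the operand is never reached because a
# BEGIN-only program exits without processing its operands.
in_begin_v=$(awk -v pattern='an' 'BEGIN { print "[" pattern "]" }')
in_begin_operand=$(awk 'BEGIN { print "[" pattern "]" }' pattern='an')

printf '%s %s %s %s\n' "$with_v" "$as_operand" "$in_begin_v" "$in_begin_operand"
# banana banana [an] []
```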
Finally, this version is written in pure awk. It needs no help from a shell, and the caller does not need to know the name of a variable inside the awk script (as the command-line variable assignment version does), but it is a bit lengthy:

BEGIN {
    pattern = ARGV[1]
    for (i = 1; i < ARGC; i++)  # remove first argument
        ARGV[i] = ARGV[i + 1]
    ARGC--
    if (ARGC == 1) {            # the pattern was the only thing, so force read from standard input (used by book)
        ARGC = 2
        ARGV[1] = "-"
    }
}
$0 ~ pattern { print FILENAME ":" $0 }
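One way to try the pure-awk version is to save it to a file and run it with -f, once with a file argument and once reading standard input. This is a minimal sketch, assuming a POSIX sh, awk, and a mktemp that supports -d (common but not required by POSIX); the file names grep.awk and fruit.txt and the sample data are hypothetical.

```shell
tmpdir=$(mktemp -d)

# Same program as above, stored in a file.
cat > "$tmpdir/grep.awk" <<'EOF'
BEGIN {
    pattern = ARGV[1]
    for (i = 1; i < ARGC; i++)   # shift the remaining arguments down
        ARGV[i] = ARGV[i + 1]
    ARGC--
    if (ARGC == 1) {             # no files left: read standard input
        ARGC = 2
        ARGV[1] = "-"
    }
}
$0 ~ pattern { print FILENAME ":" $0 }
EOF

printf 'apple\nbanana\n' > "$tmpdir/fruit.txt"

# The pattern is the first argument; the remaining arguments are files.
from_file=$(awk -f "$tmpdir/grep.awk" an "$tmpdir/fruit.txt")

# With no file arguments the program falls back to standard input.
from_stdin=$(printf 'cherry\ncranberry\n' | awk -f "$tmpdir/grep.awk" an)

printf '%s\n%s\n' "$from_file" "$from_stdin"
rm -rf "$tmpdir"
```

The first result is the path of fruit.txt followed by ":banana". The second ends in ":cranberry"; what precedes the colon depends on how the awk implementation reports FILENAME for standard input.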
The BEGIN is necessary not only to extract the first argument, but also to prevent it from being interpreted as a filename after the BEGIN block ends. ARGC, the number of arguments, is always guaranteed to be ≥1, as ARGV[0] is the name of the command that executed the script, most often the string "awk". Also note that ARGV[ARGC] is the empty string, "". # initiates a comment that extends to the end of the line.

Note the if block. awk only checks whether it should read from standard input before it runs the program. This means that awk 'prog' works only because the absence of filenames is checked before prog is run. If you explicitly set ARGC to 1 so that there are no arguments, awk will simply quit because it sees no more input files. Therefore, you need to explicitly tell it to read from standard input with the special filename -.

Self-contained AWK scripts

As with many other programming languages, self-contained AWK scripts can be constructed using the so-called "shebang" syntax. For example, a UNIX command called hello.awk that prints the string "Hello, world!" may be built by creating a file named hello.awk containing the following lines:

#!/usr/bin/awk -f
BEGIN { print "Hello, world!" }
The -f tells awk that the argument that follows is the file to read the awk program from; that argument, the path of the script itself, is supplied automatically when the script is executed.

Variables and Special Variables

Variables can be used in an awk program by referencing them. With the exception of function parameters (see User-Defined Functions), they are not explicitly declared. Function parameter names shall be local to the function; all other variable names shall be global. The same name shall not be used as both a function parameter name and as the name of a function or a special awk variable. The same name shall not be used both as a variable name with global scope and as the name of a function. The same name shall not be used within the same scope both as a scalar variable and as an array.

Uninitialized variables, including scalar variables, array elements, and field variables, shall have an uninitialized value. An uninitialized value shall have both a numeric value of zero and a string value of the empty string. Evaluation of variables with an uninitialized value, to either string or numeric, shall be determined by the context in which they are used.

Field variables shall be designated by a '$' followed by a number or numerical expression. The effect of the field number expression evaluating to anything other than a non-negative integer is unspecified; uninitialized variables or string values need not be converted to numeric values in this context. New field variables can be created by assigning a value to them. References to nonexistent fields (that is, fields after $NF) shall evaluate to the uninitialized value. Such references shall not create new fields. However, assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS.

Each field variable shall have a string value or an uninitialized value when created.
Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters. If appropriate, the field variable shall be considered a numeric string (see Expressions in awk). Implementations shall support the following other special variables that are set by awk:
Regular Expressions

The awk utility shall make use of the extended regular expression notation (see the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.4, Extended Regular Expressions) except that it shall allow the use of C-language conventions for escaping special characters within the EREs, as specified in the table in the Base Definitions volume of IEEE Std 1003.1-2001, Chapter 5, File Format Notation ( '\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v' ) and the following table; these escape sequences shall be recognized both inside and outside bracket expressions. Note that records need not be separated by <newline>s and string constants can contain <newline>s, so even the "\n" sequence is valid in awk EREs. Using a slash character within an ERE requires the escaping shown in the following table.
A regular expression can be matched against a specific field or string by using one of the two regular expression matching operators, '~' and "!~". These operators shall interpret their right-hand operand as a regular expression and their left-hand operand as a string. If the regular expression matches the string, the '~' expression shall evaluate to a value of 1, and the "!~" expression shall evaluate to a value of 0. (The regular expression matching operation is as defined by the term matched in the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.1, Regular Expression Definitions, where a match occurs on any part of the string unless the regular expression is limited with the circumflex or dollar sign special characters.) If the regular expression does not match the string, the '~' expression shall evaluate to a value of 0, and the "!~" expression shall evaluate to a value of 1.

If the right-hand operand is any expression other than the lexical token ERE, the string value of the expression shall be interpreted as an extended regular expression, including the escape conventions described above. Note that these same escape conventions shall also be applied in determining the value of a string literal (the lexical token STRING), and thus shall be applied a second time when a string literal is used in this context.

When an ERE token appears as an expression in any context other than as the right-hand of the '~' or "!~" operator or as one of the built-in function arguments described below, the value of the resulting expression shall be the equivalent of:

$0 ~ /ere/

The ere argument to the gsub, match, sub functions, and the fs argument to the split function (see String Functions) shall be interpreted as extended regular expressions. These can be either ERE tokens or arbitrary expressions, and shall be interpreted in the same manner as the right-hand side of the '~' or "!~" operator.
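The matching rules above can be illustrated with a few one-line programs. This is a minimal sketch, assuming a POSIX sh and awk; the sample strings and the variable names (hit, miss, re, and the shell variables) are made up for illustration. The dynamic case shows why a backslash must be doubled when the regular expression is written as a string literal: the escape conventions are applied once to the string and again to the resulting ERE.

```shell
# A match may occur anywhere in the string unless the ERE is anchored.
plain=$(awk 'BEGIN { hit = ("foobar" ~ /oba/); miss = ("foobar" !~ /oba/); print hit, miss }')
anchored=$(awk 'BEGIN { hit = ("foobar" ~ /^oba/); print hit }')

# A string used as an ERE: "\\." in the literal becomes \. in the ERE,
# which matches only a literal dot.
dynamic=$(awk 'BEGIN {
    re = "^[0-9]+\\.[0-9]+$"
    a = ("3.14" ~ re)
    b = ("3x14" ~ re)
    print a, b
}')

# A bare ERE used as a pattern is equivalent to $0 ~ /ere/.
bare=$(printf 'foo\nbar\n' | awk '/a/')

printf '%s | %s | %s | %s\n' "$plain" "$anchored" "$dynamic" "$bare"
# 1 0 | 0 | 1 0 | bar
```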
An extended regular expression can be used to separate fields by using the -F ERE option or by assigning a string containing the expression to the built-in variable FS. The default value of the FS variable shall be a single <space>. The following describes FS behavior:
Except for the '~' and "!~" operators, and in the gsub, match, split, and sub built-in functions, ERE matching shall be based on input records; that is, record separator characters (the first character of the value of the variable RS, <newline> by default) cannot be embedded in the expression, and no expression shall match the record separator character. If the record separator is not <newline>, <newline>s embedded in the expression can be matched. For the '~' and "!~" operators, and in those four built-in functions, ERE matching shall be based on text strings; that is, any character (including <newline> and the record separator) can be embedded in the pattern, and an appropriate pattern shall match any character. However, in all awk ERE matching, the use of one or more NUL characters in the pattern, input record, or text string produces undefined results.

Patterns

A pattern is any valid expression, a range specified by two expressions separated by a comma, or one of the two special patterns BEGIN or END.

Special Patterns

The awk utility shall recognize two special patterns, BEGIN and END. Each BEGIN pattern shall be matched once and its associated action executed before the first record of input is read (except possibly by use of the getline function, see Input/Output and General Functions, in a prior BEGIN action) and before command line assignment is done. Each END pattern shall be matched once and its associated action executed after the last record of input has been read. These two patterns shall have associated actions.

BEGIN and END shall not combine with other patterns. Multiple BEGIN and END patterns shall be allowed. The actions associated with the BEGIN patterns shall be executed in the order specified in the program, as are the END actions. An END pattern can precede a BEGIN pattern in a program.
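The ordering rules for multiple BEGIN and END actions can be observed directly. This is a minimal sketch, assuming a POSIX sh and awk; the messages and the shell variable order are illustrative. The END action is deliberately written first to show that its position relative to BEGIN in the source does not matter.

```shell
order=$(printf 'x\ny\n' | awk '
    END   { print "end 1, records read:", NR }
    BEGIN { print "begin 1" }
    BEGIN { print "begin 2" }
          { print "record", $0 }
    END   { print "end 2" }')

printf '%s\n' "$order"
# begin 1
# begin 2
# record x
# record y
# end 1, records read: 2
# end 2
```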
If an awk program consists of only actions with the pattern BEGIN, and the BEGIN action contains no getline function, awk shall exit without reading its input when the last statement in the last BEGIN action is executed. If an awk program consists of only actions with the pattern END or only actions with the patterns BEGIN and END, the input shall be read before the statements in the END actions are executed.

Expression Patterns

An expression pattern shall be evaluated as if it were an expression in a Boolean context. If the result is true, the pattern shall be considered to match, and the associated action (if any) shall be executed. If the result is false, the action shall not be executed.

Pattern Ranges

A pattern range consists of two expressions separated by a comma; in this case, the action shall be performed for all records between a match of the first expression and the following match of the second expression, inclusive. At this point, the pattern range can be repeated starting at input records subsequent to the end of the matched range.

Actions

An action is a sequence of statements as shown in the grammar in Grammar. Any single statement can be replaced by a statement list enclosed in braces. The application shall ensure that statements in a statement list are separated by <newline>s or semicolons. Statements in a statement list shall be executed sequentially in the order that they appear.

The expression acting as the conditional in an if statement shall be evaluated and if it is non-zero or non-null, the following statement shall be executed; otherwise, if else is present, the statement following the else shall be executed.

The if, while, do ... while, for, break, and continue statements are based on the ISO C standard (see Concepts Derived from the ISO C Standard), except that the Boolean expressions shall be treated as described in Expressions in awk, and except in the case of:

for (variable in array)

which shall iterate, assigning each index of array to variable in an unspecified order. The results of adding new elements to array within such a for loop are undefined. If a break or continue statement occurs outside of a loop, the behavior is undefined.

The delete statement shall remove an individual array element. Thus, the following code deletes an entire array (the loop variable is written as i here, since index is the name of a built-in string function and cannot be used as a variable name):

for (i in array)
    delete array[i]
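The delete loop can be checked by counting the elements of an array before and after it runs. This is a minimal sketch, assuming a POSIX sh and awk; the array arr, the counters before and after, and the shell variable counts are hypothetical names. The loop variable is called k because index is a built-in function name.

```shell
counts=$(awk 'BEGIN {
    split("a b c", arr)               # arr[1], arr[2], arr[3]
    for (k in arr) before++           # count elements before deletion
    for (k in arr) delete arr[k]      # remove every element, one at a time
    for (k in arr) after++            # loop body never runs on an empty array
    print before, after + 0           # after is unset, so coerce it to 0
}')

echo "$counts"
# 3 0
```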
The next statement shall cause all further processing of the current input record to be abandoned. The behavior is undefined if a next statement appears or is invoked in a BEGIN or END action.

The exit statement shall invoke all END actions in the order in which they occur in the program source and then terminate the program without reading further input. An exit statement inside an END action shall terminate the program without further execution of END actions. If an expression is specified in an exit statement, its numeric value shall be the exit status of awk, unless subsequent errors are encountered or a subsequent exit statement with an expression is executed.

String Functions

The string functions in the following list shall be supported. Although the grammar (see Grammar) permits built-in functions to appear with no arguments or parentheses, unless the argument or parentheses are indicated as optional in the following list (by displaying them within the "[]" brackets), such use is undefined.
All of the preceding functions that take ERE as a parameter expect a pattern or a string valued expression that is a regular expression as defined in Regular Expressions.

Arithmetic Functions

The arithmetic functions, except for int, shall be based on the ISO C standard (see Concepts Derived from the ISO C Standard). The behavior is undefined in cases where the ISO C standard specifies that an error be returned or that the behavior is undefined. Although the grammar (see Grammar) permits built-in functions to appear with no arguments or parentheses, unless the argument or parentheses are indicated as optional in the following list (by displaying them within the "[]" brackets), such use is undefined.
Input/Output and General Functions

The input/output and general functions are: