This AWK language course aims to explain AWK in 15 minutes and to help you find an awesome tool friend despite its name. The correct pronunciation is [auk], after the small seabirds called parakeet auklets.
AWK language (is):
- (mainly) a text-processing language
- available on most UNIX-like systems by default; on Windows there is either a native binary or the cygwin one
- syntax is influenced by the `c` and `shell` programming languages
- programs range from a single line to multiple library files
- several implementations are available, notably `gawk` and `mawk`
- solves generally the same problems as similar text-processing tools: `sed`, `grep`, `wc`, `tr`, `cut`, `printf`, `tail`, `head`, `cat`, `tac`, `bc`, `column`, ...
AWK language use-cases are:
- computing int / floating point math formulas (based on input)
- general text-processing
- cutting pieces from input text stream
- reformatting input text stream
- (shell) meta-programming generator
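The first use-cases above can be sketched with three one-liners. The sample data (fruit name, quantity, unit price) and the variable names are made up for illustration only.

```shell
# Hypothetical sample input: name, quantity, unit price (whitespace separated).
data='apple 3 0.50
pear 2 1.25
plum 10 0.10'

# Cutting: keep only the first column.
names=$(printf '%s\n' "$data" | awk '{ print $1 }')

# Reformatting: reorder the columns and change the separator.
csv=$(printf '%s\n' "$data" | awk '{ printf("%s;%s;%s\n", $2, $1, $3) }')

# Computing: sum of quantity * price, printed with two decimals.
total=$(printf '%s\n' "$data" | awk '{ sum += $2 * $3 } END { printf("%.2f\n", sum) }')

printf '%s\n' "$names" "$csv" "$total"
```

Each pipeline reads the same input; only the action inside `{ ... }` differs.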
AWK language capabilities:
- text-processing functions
- regular expression support
- math functions
- dynamic typing, support for
  - integer / long
  - floats
  - associative arrays (including multi-dimensional array support)
- external execution support
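As a minimal sketch of the associative arrays mentioned above, the following counts word occurrences and then uses a two-part subscript. The input text and the array names `count` and `grid` are illustrative assumptions.

```shell
text='to be or not to be'

# count[word] is an associative array keyed by the word itself.
# The traversal order of "for (w in count)" is unspecified, so the output is sorted.
freq=$(printf '%s\n' "$text" | awk '
{
    for (i = 1; i <= NF; i++) {
        count[$i]++
    }
}
END {
    for (w in count) {
        print w, count[w]
    }
}' | sort)

# Multi-dimensional subscripts: (row, col) is joined into a single key using SUBSEP.
cell=$(awk 'BEGIN { grid[1, 2] = "x"; if ((1, 2) in grid) print grid[1, 2] }')

printf '%s\n' "$freq" "$cell"
```

Uninitialized array elements start as empty/zero, which is why `count[$i]++` needs no prior setup.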
Every AWK execution consists of the following three phases:
- [1] `BEGIN{ ... }` are actions performed at the beginning, before the first text character is read
  - multiple blocks allowed (normally single)
- [2] `[condition]{ ... }` are actions performed on every `AWK record` (by default a text line)
  - every `AWK record` is automatically split into `AWK fields` (by default words)
  - multiple blocks allowed
- [3] `END{ ... }` are actions performed at the end of the execution, after the last text character is read
  - multiple blocks allowed (normally single)
$ echo -e "AWK is still useful\ntext-processing technology!" | \
> awk 'BEGIN{wcnt=0;print "lineno/#words/3rd-word:individual words\n"}
> {printf("% 6d/% 6d/% 8s:%s\n",NR,NF,$3,$0);wcnt+=NF}
> END{print "\nSummary:", NR, "lines/records,", wcnt, "words/fields"}'
lineno/#words/3rd-word:individual words

     1/     4/   still:AWK is still useful
     2/     2/        :text-processing technology!

Summary: 2 lines/records, 6 words/fields
- Passing text data to AWK:
  - from pipe: `cat input-data.txt | awk <app>`
  - from file[s] read by awk itself: `awk <app> input-data.txt`
- AWK application execution styles:
  - on command-line: `awk '{ ... }' input-data.txt`
  - in separate files (`-f`): `awk -f myapp.awk input-data.txt`
- specifying an AWK variable on command-line: `-v var=val`
- specifying the `AWK field` separator via the `FS` variable or the `-F <FS>` switch
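The `-v` and `-F` switches listed above can be tried on a small input. The passwd-like records and the variable name `min` are hypothetical and only serve the demonstration.

```shell
# Hypothetical colon-separated input (passwd-like records).
records='root:x:0:/root
alice:x:1000:/home/alice
bob:x:1001:/home/bob'

# -F sets the field separator FS; -v defines the variable "min" before the program starts.
users=$(printf '%s\n' "$records" | awk -F: -v min=1000 '$3 >= min { print $1 }')

# The same separator set through the FS variable in a BEGIN block.
homes=$(printf '%s\n' "$records" | awk 'BEGIN { FS = ":" } { print $4 }')

printf '%s\n' "$users" "$homes"
```

Both forms set the separator before the first record is read, so the first line is already split correctly.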
Global variables are documented here; the most common ones are:
- `$0`: value of the current `AWK record` (whole line without line-break)
- `$1`, `$2`, ... `$NF`: values of the first, second, ... last `AWK field` (word)
- `FS`: specifies the input `AWK field` separator, i.e. how AWK breaks an input record into fields (default: whitespace)
- `RS`: specifies the input `AWK record` separator, i.e. how AWK breaks the input stream into records (default: a line break)
- `OFS`: specifies the output field separator, i.e. how AWK prints parsed fields to the output stream using `print` (default: single space)
- `ORS`: specifies the output record separator, i.e. how AWK prints parsed records to the output stream using `print` (default: line break)
- `FILENAME`: contains the name of the input file read by awk (read-only global variable)
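A short sketch of how the separator variables above interact. The input strings are invented for the example, and `NR` / `NF` (record and field counters, as used in the earlier word-count example) are included for context.

```shell
line='alpha beta gamma'

# Assigning to a field (here $1 = $1) makes awk rebuild $0 using OFS.
joined=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "-" } { $1 = $1; print $0 }')

# print separates its comma-separated arguments with OFS and ends the output with ORS.
info=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "|"; ORS = "\n" } { print NR, NF, $NF }')

# RS changes what counts as a record: here records are separated by ";".
count=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } END { print NR }')

printf '%s\n' "$joined" "$info" "$count"
```

Note that `OFS` only affects `$0` once a field has been modified; printing an untouched `$0` keeps the original spacing.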
AWK functions are documented; the most important ones are:
- `print`, `printf()` and `sprintf()` - printing functions
- `length()` - length of a string argument
- `substr()` - extracting a substring from a string
- `split()` - split a string into an array of strings
- `index()` - find the position of a substring in a string
- `sub()` and `gsub()` - (regexp) search and replace (once, respectively globally)
- `~` operator and `match()` - regexp search
- `tolower()` and `toupper()` - convert text to lowercase, respectively uppercase
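The functions listed above can be exercised in a single `BEGIN` block, which needs no input. The sample string and the variable names are arbitrary, and the comments show the value each call is expected to produce.

```shell
result=$(awk 'BEGIN {
    s = "Hello AWK world"

    print length(s)              # number of characters: 15
    print substr(s, 7, 3)        # 3 characters starting at position 7: AWK
    print index(s, "world")      # 1-based position of the substring: 11
    n = split(s, parts, " ")     # fills parts[1] .. parts[n]
    print n, parts[1]            # 3 Hello

    t = s
    sub(/o/, "0", t)             # replaces the first match only
    print t                      # Hell0 AWK world
    gsub(/o/, "0", s)            # replaces every match
    print s                      # Hell0 AWK w0rld

    if (s ~ /AWK/) print "matched"
    print match(s, /w0/)         # position of the match (also sets RSTART): 11
    print toupper("awk"), tolower("AWK")
    print sprintf("%05.1f", 3.14159)
}')

printf '%s\n' "$result"
```

Note that `sub()` and `gsub()` modify their third argument in place and return the number of replacements rather than the new string.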
- Hello world
- Word count using wc and awk
- Pattern search using grep and awk
- Uniq words in awk
- Computing the average
- Text stream FSM machine
- Manipulation with text columns
- Shell metaprogramming with awk
- Why is cut very limited to awk
- Memory hungry application
- CPU intensive application
- Debugging / profiling AWK application
- GNU AWK network programming
- 30 seconds of AWK code
Prefer general `awk` over a specific AWK implementation:
- use general `awk` for portable programs
- otherwise use the particular implementation, e.g. `gawk`
A general rule of thumb is to write an AWK program as a `*.awk` file if the equivalent one-liner is not readable enough.
If you have trouble understanding a one-line awk program, feel free to use GNU AWK's profiling functionality, i.e. the `-p` option, to receive pretty-printed AWK code (in `awkprof.out`).
- comment properly
- indent similarly as in the c/c++ programming languages
- use functions whenever possible
- stay explicit, avoiding awk default (implicit) actions, which make an AWK application hard to understand
  - example: `length > 80` should rather be written as `'length($0) > 80 { print }'` or `'length($0) > 80 { print $0 }'`
- don't forget to always use apostrophe (`'`) quotation when writing awk one-line applications to avoid shell expansion (for instance of `$1`)
  - `awk "{print $1}"` should be `awk '{print $1}'`
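The style advice above can be sketched as a small program file plus a quoting demonstration. The file content, the function name `describe`, the variable `limit` and the helper shell functions are all hypothetical and only illustrate the points made in the list.

```shell
# Hypothetical program file written to a temporary location for the demo.
prog=$(mktemp)
cat > "$prog" <<'EOF'
# long_lines.awk: print lines longer than a limit, prefixed with their length.
# The limit comes from the command line (-v limit=N).

# Returns the line prefixed by its length.
function describe(line) {
    return length(line) ": " line
}

# Explicit condition and explicit action instead of the bare "length > limit".
length($0) > limit {
    print describe($0)
}
EOF

long=$(printf 'short\na considerably longer line\n' | awk -v limit=10 -f "$prog")
rm -f "$prog"

# Quoting: inside double quotes the shell expands $1 before awk ever sees the program.
with_double() { printf 'a b\n' | awk "{ print $1 }"; }   # $1 is the shell function argument
with_single() { printf 'a b\n' | awk '{ print $1 }'; }   # $1 is the first awk field

double=$(with_double '$2')   # awk receives the program: { print $2 }
single=$(with_single '$2')   # awk receives the program: { print $1 }

printf '%s\n' "$long" "$double" "$single"
```

With double quotes the program text depends on the calling shell's state, which is why the result changes with the argument passed to `with_double`.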
- use one of the recommended implementations, as old implementations are quite limited (old `awk` or `nawk`)
- string / array indexing starts from `1` (`index()`, `split()`, `$i`, ...)
- the GNU AWK implementation understands localization & utf-8/unicode, and thus replacing with `[g]sub()` can lead to unwanted behavior unless you force gawk to drop such support by exporting the environment variable `LC_ALL=C`
- other awk implementations may not support utf-8/unicode:
# awk implementation versions
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.1)
mawk 1.3.4 20161107
BusyBox v1.22.1 (2016-02-03 18:22:11 UTC) multi-call binary.
$ echo "Zřetelně" | gawk '{print toupper($0)}'
ZŘETELNĚ
$ echo "Zřetelně" | mawk '{print toupper($0)}'
ZřETELNě
$ echo "Zřetelně" | busybox awk '{print toupper($0)}'
ZřETELNě
- extended regular expressions (such as the interval expression `{1,1}` used below) are available just for gawk (and in older versions have to be explicitly enabled, here with `--re-interval`):
$ ps auxwww | gawk '{if($2~/^[0-9]{1,1}$/){print}}'
root 1 0.0 0.0 197064 4196 ? Ss Oct31 2:21 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
root 4 0.0 0.0 0 0 ? S< Oct31 0:00 [kworker/0:0H]
$ ps auxwww | gawk --re-interval '{if($2~/^[0-9]{1,1}$/){print}}'
root 1 0.0 0.0 197064 4196 ? Ss Oct31 2:21 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
root 4 0.0 0.0 0 0 ? S< Oct31 0:00 [kworker/0:0H]
$ ps auxwww | mawk '{if($2~/^[0-9]{1,1}$/){print}}'
$
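When portability matters, one possible workaround for the interval expression shown above is to spell the pattern out without `{m,n}`. This is a sketch on a made-up, shortened ps-like input (user and PID columns only), not the original `ps auxwww` output.

```shell
# Hypothetical, shortened ps-like input: user and PID columns only.
procs='root 1
root 4
root 42
alice 1234'

# /^[0-9]$/ matches exactly one digit, the same set of strings as /^[0-9]{1,1}$/,
# but without relying on interval expressions.
single_digit=$(printf '%s\n' "$procs" | awk '$2 ~ /^[0-9]$/ { print $0 }')

# A range such as {1,2} can be emulated with an optional second digit.
up_to_two=$(printf '%s\n' "$procs" | awk '$2 ~ /^[0-9][0-9]?$/ { print $0 }')

printf '%s\n' "$single_digit" "$up_to_two"
```

The rewritten patterns use only bracket expressions and `?`, so they do not depend on interval support in the awk being used.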