[Image: a terminal emulator icon, followed by the text "awk?"]

...ward

Unless you spend a lot of time working on the command line, awk is one of those commands you've likely come across a few times but never really learned how to use properly. As with my recent adventures with xargs, though, I came across a little use case where I could benefit from it, so I decided to sit down and look into it.

Even more so than with xargs, this is a really powerful tool, but let's focus on the basics and see what we can make of it.

What is awk?

awk is a command that processes text using programs written in the AWK language.

To understand what it can do, getting a feel for the language it uses is probably a good place to start. Let's use the words of Alfred Aho---one of the creators of the AWK language---from a 2008 Computerworld interview to get a rough grasp of what it is:

AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.

That may be a bit dense, so let's take it apart:

"AWK is a language for processing [..] text"
AWK is a domain-specific language (DSL) focused on text processing. The awk command expects that its first argument is a script or a string in this language.
"AWK reads the input a line at a time"
AWK is line-oriented and works through the input line by line. (Strictly speaking, it's record-oriented, but the default record separator is a newline character, so by default the two amount to the same thing.)
"Each line is broken into a sequence of fields"
Each word in a line maps to a field. These fields are accessed with the $ operator, e.g. $1 for the first word, $2 for the second, and so on. $0 is the whole line. By default, fields are delimited by whitespace (which is why I've called them words) but this can be customized.
"An AWK program is of a sequence of pattern-action statements"
This means it's a sequence of predicates with actions. If a predicate evaluates to true, perform the specified action. If no predicate is specified, it will always evaluate to true, and if no action is specified, it will default to printing the whole line.

Huh? Yeah, this is still a bit confusing, but maybe some examples will make it clearer:

Patterns
This is a predicate to check each line against. It usually takes the form of a regex enclosed in forward slashes. /foo/, /b[ar]?/, /^baz/, /(fizz|buzz)$/ are all examples. Most of the regex skills you have will be applicable here (character sets, character classes, alternations, etc.).

You can also match specific fields against a regex. Only want to match lines where the second field contains 'cheese'? $2 ~ /cheese/

The pattern can also consist of functions and comparisons; so if you wish to act only on lines that aren't empty: length > 0

If no pattern is given, every line will match.
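
Just to make these concrete, here are a couple of invocations; menu.txt and notes.txt are hypothetical files, and since no action is given, awk simply prints the lines that match:

# print lines whose second field contains 'cheese'
awk '$2 ~ /cheese/' menu.txt

# print only non-empty lines
awk 'length > 0' notes.txt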

Actions
These are commands telling awk what to do. They are enclosed in curly braces ({}). This is where you might instruct awk to print a certain field of the line---~print $3~, for instance---or increment a counter if you're counting words: ~word_count += NF~ (yeah, I'll get to what NF means in a bit).

If no action is given, awk will print the matching line.
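
Putting the two together, a minimal sketch (input.txt is a made-up file) looks like this:

# pattern and action: print the third field of lines starting with 'baz'
awk '/^baz/ {print $3}' input.txt

# action with no pattern: runs for every line
awk '{print $1}' input.txt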

Basic awk

That's a quick overview of how the language is structured. Before we start playing with it, let's explore some of the features.

Built-in variables

awk has a number of built-in variables, and while I won't cover all of them, these are the ones I've found most useful:

~NR~
Gives you the number of the current line ('record'). Could be used for adding line numbers to an input file: ~awk '{print NR, $0}'~. Or maybe looking for lines that contain 'ice cream' is more your speed: ~awk '/ice cream/ {print NR}'~.
~NF~
This is the number of fields in the current line. Useful for finding out how many fields a line has, or for accessing the last field, e.g. to see whether it matches a pattern: ~awk '$NF ~ /out/ {print NR}'~
~FS~
The field separator value. This is what awk will use to split each line into fields. By default this is whitespace. If you have a file full of comma-separated values and want to split each line on commas instead of whitespace: BEGIN {FS=","}
~RS~
This is the line ('record') equivalent of the field separator. The default is \n. Say you want to print your PATH over multiple lines: ~echo $PATH | awk 'BEGIN {RS=":"} {print}'~
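
To see a few of these variables working together, here's a little sketch; people.csv is a hypothetical comma-separated file:

# split on commas and print the line number, field count, and last field of each line
awk 'BEGIN {FS=","} {print NR, NF, $NF}' people.csv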

BEGIN and END

awk lets you supply commands to be executed before any input is processed and after all of it has been, by using BEGIN and END. While they may seem special, they're not really: think of them as patterns that evaluate to true before anything is evaluated and after everything is evaluated, respectively.

BEGIN could be used to set the field separator or initialize a variable.

END is useful for printing out results accrued through the life of the program, such as a word count. If we bring back our word counting example: ~awk '{word_count += NF} END {print word_count}'~
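
Combining the two, a sketch might look like this (scores.csv is a made-up, non-empty CSV with numbers in its second column):

# set the separator up front, accumulate as we go, report at the end
awk 'BEGIN {FS=","} {sum += $2} END {print sum / NR}' scores.csv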

Functions and conditionals

Like most languages, AWK has a number of built-in functions (such as the print and length functions we saw earlier) and also lets you define your own functions if you so please. This is probably overkill for most trivial operations but could come in handy in certain cases.
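
Just to show the shape of it, here's a made-up example of a user-defined function (names.txt is a hypothetical input file):

# define a small helper and call it from an action
awk '
  function shout(s) { return toupper(s) "!" }
  {print shout($1)}
' names.txt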

And AWK has conditionals too! While you can use the if-else construct within actions, I'd like to highlight that you can also branch based on the supplied patterns. E.g. ~awk '/foo/ {print "FOO"} !/foo/ {print "bar"}'~ will print 'FOO' for lines that match ~/foo/~ and 'bar' for lines that don't.
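
For comparison, here's what the if-else form could look like inside a single action; it should behave the same as the two-pattern version above:

awk '{ if (/foo/) print "FOO"; else print "bar" }'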

Applications

Now that we've got some idea how it works, what can we use it for? Let's look at some sample applications for it:

Sorting lines in a file by length

Here's a fun little application: Let's take a file, count the number of characters in each line, and then sort it based on the number of characters:

awk '{print length, $0}' <file> | sort -n

And if you want to exclude empty lines, try this:

awk 'length > 0 {print length, $0}' <file> | sort -n
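
And if you'd rather not keep the character counts in the final output, one way (a sketch, leaning on cut) is to chop the leading column back off after sorting:

awk '{print length, $0}' <file> | sort -n | cut -d' ' -f2-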

Friendly PATH

This is the same one that was listed above, and is useful if you're looking for certain entries that can easily get lost when the path is on one line:

# bash, zsh
echo $PATH | awk 'BEGIN {RS=":"} {print}'

# fish
echo $PATH | awk 'BEGIN {RS=" "} {print}'

Counting words

Another example that was used previously. Count the words in a file or any input stream:

awk '{word_count += NF} END {print word_count}'
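
With a file as input (essay.txt is made up here), this should line up with what wc -w reports for ordinary text:

awk '{word_count += NF} END {print word_count}' essay.txt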

Parsing git logs

Maybe your team has agreed that all commit messages should start with #<task_no> if they relate to a task and #-- if they don't. To find all commits that relate to a specific task---say #42---we could do this:

git log --oneline | awk '/#42/ {print $1}'
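
One caveat: /#42/ would also match #420, #421, and so on. Since the task number is the first word of the message, i.e. the second field of the --oneline output, one way around that is to compare the field exactly (just a sketch):

git log --oneline | awk '$2 == "#42" {print $1}'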

Or how about finding the ratio of commits that belong to a task versus those that don't?

git log --oneline --no-decorate | awk '
  $2 ~ /#[0-9]+/  {task += 1}
  $2 !~ /#[0-9]+/ {notask += 1}
  END {printf "Tasks: %d\nNo-tasks: %d\nTask to no-task ratio: %f\n",
       task, notask, task / notask}'

(Yeah, the printf function works pretty much the same as in C---so much so that I haven't looked it up yet!)

Killing processes

This is actually what triggered me to look into awk, and what eventually led to this post.

There are a number of ways to kill applications from the command line, and for a while, my default fallback has been the classic ps aux | grep <app> combo.

While this lists out all the relevant processes and their process IDs (PIDs), you then have to manually copy the PID and put it into the kill command to shut it down. This is annoying at best, and gets even worse if the process has spawned children that we want to take down as well.

How do we deal with this? Well:

ps aux | awk '/<app>/ {print $2}' | xargs kill
  1. ps lists all processes on the system.
  2. awk then goes over every line that contains the application name, extracts the second whitespace-delimited word---which in this case is the PID---and prints it.
  3. For the last stretch, we use xargs to feed the PIDs into the kill command, thus killing all the processes.

This works just fine, though it comes with the caveat that it'll also try to kill the awk process itself (because the application name is part of its command line, so it gets listed by ps). That's only a minor annoyance, however, and I'll leave fixing it as an exercise for the reader.

Now, I'm sure you can make killall do something quite similar to this, but I've found this way to be more effective (by which I mean: closer to what I expect).

Closing up

Phew. We've learned a lot today, and this post grew much longer than what I had originally imagined---about four or five times longer---but it's been quite the journey and I'm glad you were there with me, partner. I hope you've gleaned some new insight too and that you'll find some application for this in your day-to-day.

See you next time!



Thomas Heartman is a developer, writer, speaker, and one of those odd people who enjoy lifting heavy things and putting them back down again. Preferably with others. Doing his best to gain and share as much knowledge as possible.