Stream processing with sed and awk
After this chapter, the students can perform basic pattern search and replacement with sed and are aware of its more advanced features. They can construct simple scripts using awk, incorporating conditions, counter variables and the END section. Finally, they can design and build analysis pipelines that consist of separate steps for data formatting, merging and analysis.
sed and awk are the central tools for the manipulation of table-formatted data. I once got an email from a student thanking me for my evolutionary genomics course. She didn’t mention anything about the population genetics or sequence analysis (which are the topics of the course) but said that she had learned about awk for the first time in my course, and now her skills with that had got her a job in data analysis. I took it as a compliment.
sed, the stream editor
sed is a surprisingly complex program. For a beginner, it’s reasonable to aim at doing simple pattern replacements, possibly with matched patterns. We’ll learn these first. However, it is useful to be aware of what else sed could do. One day you may have a task that you recognise as suitable for sed and can then go and study how to resolve it.
The name of the program comes from “stream editor” and one can see it being used in pipes, possibly multiple times, like this:
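The following is only a sketch; the file name and patterns are placeholders.
> cat data.tsv | sed 's/old/new/' | sed 's/other/another/' | head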
However, sed can read in files directly:
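Again with placeholder names:
> sed 's/old/new/' data.tsv | head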
and even modify the input file:
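With the -i option the change is written back to the file itself (placeholder names again):
> sed -i 's/old/new/' data.tsv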
In scientific projects, one should not edit the raw data but create modified copies of the data. If a file is edited in place, it’s a good practice to test the command with sed 'cmd' file | less and, once certain about the outcome, edit the input file with sed -i 'cmd' file.
sed in pattern replacement
Earlier we worked with temperature measurements from four locations in Helsinki. The data files looked like this:
> cd ~/IntSciCom/Helsinki/
> head -6 Kaisaniemi.tsv
Helsinki_Kaisaniemi 2024 1 1 00:00 -13.9 -13.7 -14
Helsinki_Kaisaniemi 2024 1 1 01:00 -14.1 -14 -14.3
Helsinki_Kaisaniemi 2024 1 1 02:00 -14.4 -14.1 -14.6
Helsinki_Kaisaniemi 2024 1 1 03:00 -14.7 -14.5 -14.9
Helsinki_Kaisaniemi 2024 1 1 04:00 -14.6 -14.4 -14.8
Helsinki_Kaisaniemi 2024 1 1 05:00 -14.3 -14.1 -14.6
One annoying thing is that the location takes so much space. We could reduce that by replacing “Helsinki” with “HKI”:
> cat Kaisaniemi.tsv | sed 's/Helsinki/HKI/' | head -6
HKI_Kaisaniemi 2024 1 1 00:00 -13.9 -13.7 -14
HKI_Kaisaniemi 2024 1 1 01:00 -14.1 -14 -14.3
HKI_Kaisaniemi 2024 1 1 02:00 -14.4 -14.1 -14.6
HKI_Kaisaniemi 2024 1 1 03:00 -14.7 -14.5 -14.9
HKI_Kaisaniemi 2024 1 1 04:00 -14.6 -14.4 -14.8
HKI_Kaisaniemi 2024 1 1 05:00 -14.3 -14.1 -14.6
The command we used was sed 's/Helsinki/HKI/': here, the command is within single quotes and consists of s (for substitute) and /<old pattern>/<new pattern>/. As soon as one learns to read the forward slash signs as separators, the structure of the command is very clear. The replacement pattern can be empty:
> cat Kaisaniemi.tsv | sed 's/Helsinki_//' | head -2
Kaisaniemi 2024 1 1 00:00 -13.9 -13.7 -14
Kaisaniemi 2024 1 1 01:00 -14.1 -14 -14.3
That kind of replacement one could also do with a text processing program. The replacement command becomes more powerful if we utilise the patterns that are found in the data. We can keep the found pattern in memory by wrapping the search pattern in parentheses: in /(Helsinki)_(Kaisaniemi)/, the text “Helsinki” will be assigned to variable 1 and “Kaisaniemi” to variable 2; we can reuse them with \1 and \2 such that the command:
> cat Kaisaniemi.tsv | sed -r 's/(Helsinki)_(Kaisaniemi)/\2_\1/' | head -2
Kaisaniemi_Helsinki 2024 1 1 00:00 -13.9 -13.7 -14
Kaisaniemi_Helsinki 2024 1 1 01:00 -14.1 -14 -14.3
swaps the positions of “Helsinki” and “Kaisaniemi”, writing \2_\1 as the replacement pattern.
This could also be done with text processing programs. We can improve our search pattern by writing it using wildcards: [A-Z] matches one capital letter and [a-z]+ matches one or more lower-case letters. (To do that, we have to use the extended regular expressions of sed with the argument -r.) Now the same command can be used to edit the Kaisaniemi records:
> cat Kaisaniemi.tsv | sed -r 's/(Helsinki)_([A-Z][a-z]+)/\2_\1/' | head -2
Kaisaniemi_Helsinki 2024 1 1 00:00 -13.9 -13.7 -14
Kaisaniemi_Helsinki 2024 1 1 01:00 -14.1 -14 -14.3
as well as the Kumpula records:
> cat Kumpula.tsv | sed -r 's/(Helsinki)_([A-Z][a-z]+)/\2_\1/' | head -2
Kumpula_Helsinki 2024 1 1 00:00 -15 -14.8 -15.1
Kumpula_Helsinki 2024 1 1 01:00 -14.9 -14.8 -15
However, it still has to match “Helsinki” as the first pattern. We can make that more generic and simplify the command elsewhere:
> cat Kumpula.tsv | sed -r 's/([a-z]+)_([a-z]+)/\2_\1/I' | head -2
Kumpula_Helsinki 2024 1 1 00:00 -15 -14.8 -15.1
Kumpula_Helsinki 2024 1 1 01:00 -14.9 -14.8 -15
Here, the pattern is simply [a-z]+ and it matches both upper- and lowercase letters as we have added I (for “case Insensitive”) at the end of the command. Another common argument is g for global matching, i.e. making the pattern replacement for every match on a line, not just the first one.
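As a minimal illustration of the difference (the input string is made up):
> echo "a b c" | sed 's/ /_/'
a_b c
> echo "a b c" | sed 's/ /_/g'
a_b_c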
The pattern replacement introduced above is the most common use of sed and a beginner can be happy with mastering the basics of that, starting with the replacement of one word with another. The regular expressions to match different patterns can be scary-looking but they are also extremely powerful. If one needs to find and edit specific patterns in large text files, it may be useful to learn the basics of regular expressions and try using sed and other bash tools for the task. A relatively easy-to-read tutorial for sed can be found at https://www.grymoire.com/Unix/Sed.html.
sed vs. perl
Perl is a comprehensive programming language and was very popular in the early 2000s before going out of fashion with the arrival of Python and other more modern alternatives. Despite that, the language is installed by default on most Linux systems and can thus be easily used.
Among other things, Perl has functionality highly similar to that of sed and its syntax for regular expression matching and replacement is both richer and easier to read. Of the examples above, one could be replicated with Perl like this:
> cat Kumpula.tsv | perl -pe 's/(\w+)_(\w+)/$2_$1/' | head -2
Kumpula_Helsinki 2024 1 1 00:00 -15 -14.8 -15.1
Kumpula_Helsinki 2024 1 1 01:00 -14.9 -14.8 -15
Here, -pe means “print every line” and “execute the following command”; \w matches word characters, i.e. [a-zA-Z0-9_] (Perl would also allow using [a-z] and the flag i to ignore the case), and $1 and $2 are the pattern fields similar to \1 and \2. I personally prefer perl for all more complex regular expression cases and don’t even know how to do some specific tasks with sed.
If one finds sed useful but limiting in the more complex cases, it may be worth looking for a tutorial on Perl regular expressions. Switching from sed to perl is trivial. The only challenge is not to get confused by the programming side of Perl; as shown above, it can be used for one-liners similar to sed.
sed in line editing
Many of the other functionalities of sed can be performed with combinations of other commands, but some are unique to sed. The next two subsections introduce some more advanced features of sed which can be useful in specific tasks and which it is good for an ambitious data analyst to be aware of.
The power of sed is that one can restrict the pattern replacement to specific parts of the file or certain positions on the line. Here we look at the print and delete operations within sed. The command sed 1,12d deletes lines 1-12, as shown by the hour column:
> sed 1,12d Kaisaniemi.tsv | head -6
Helsinki_Kaisaniemi 2024 1 1 12:00 -15.9 -15.8 -16.1
Helsinki_Kaisaniemi 2024 1 1 13:00 -15.7 -15.7 -15.8
Helsinki_Kaisaniemi 2024 1 1 14:00 -15.7 -15.7 -15.8
Helsinki_Kaisaniemi 2024 1 1 15:00 -15.9 -15.7 -16
Helsinki_Kaisaniemi 2024 1 1 16:00 -16.3 -16 -16.7
Helsinki_Kaisaniemi 2024 1 1 17:00 -16.9 -16.7 -17.1
We can similarly delete everything from a specific line till the end of the file. The last line of the file is addressed with the character “$”, which has a special meaning in bash (it indicates that the word is a variable). Because of that, the dollar sign has to be escaped with a backslash:
> sed 7,\$d Kumpula.tsv
Helsinki_Kumpula 2024 1 1 00:00 -15 -14.8 -15.1
Helsinki_Kumpula 2024 1 1 01:00 -14.9 -14.8 -15
Helsinki_Kumpula 2024 1 1 02:00 -15.4 -15 -15.6
Helsinki_Kumpula 2024 1 1 03:00 -15.8 -15.6 -16.1
Helsinki_Kumpula 2024 1 1 04:00 -16.1 -15.9 -16.4
Helsinki_Kumpula 2024 1 1 05:00 -15.8 -15.7 -16.1
Alternatively, one could write the range within single quotes and thus avoid interpreting the dollar sign in the bash context: sed '7,$d'.
Instead of deleting (d), we can print (p), but then the default behaviour of printing every line has to be turned off. That is done with the argument -n, and thus the command:
> sed -n 25,30p Kaisaniemi.tsv
Helsinki_Kaisaniemi 2024 1 2 00:00 -16.5 -16.4 -16.5
Helsinki_Kaisaniemi 2024 1 2 01:00 -16.6 -16.4 -16.7
Helsinki_Kaisaniemi 2024 1 2 02:00 -16.7 -16.5 -16.8
Helsinki_Kaisaniemi 2024 1 2 03:00 -16.4 -16.3 -16.7
Helsinki_Kaisaniemi 2024 1 2 04:00 -15.9 -15.5 -16.3
Helsinki_Kaisaniemi 2024 1 2 05:00 -15.3 -15.1 -15.5
prints the observations for the beginning of the second day, or the lines 25-30.
Alternatively, we could print, starting from the 1st line, every 6th line:
> sed -n '1~6p' Kaisaniemi.tsv | head -8
Helsinki_Kaisaniemi 2024 1 1 00:00 -13.9 -13.7 -14
Helsinki_Kaisaniemi 2024 1 1 06:00 -14.2 -14.1 -14.3
Helsinki_Kaisaniemi 2024 1 1 12:00 -15.9 -15.8 -16.1
Helsinki_Kaisaniemi 2024 1 1 18:00 -17 -16.9 -17.2
Helsinki_Kaisaniemi 2024 1 2 00:00 -16.5 -16.4 -16.5
Helsinki_Kaisaniemi 2024 1 2 06:00 -15.1 -15 -15.2
Helsinki_Kaisaniemi 2024 1 2 12:00 -15.5 -15.2 -15.8
Helsinki_Kaisaniemi 2024 1 2 18:00 -17.7 -17 -18.3
sed in line editing with patterns
Instead of line numbers, the same commands can be done with patterns. In another directory, we have a copy of Hamlet:
> cd ~/IntSciCom/Hamlet/
> less Shakespeare_Hamlet.txt
We can check that it contains the phrase:
> cat Shakespeare_Hamlet.txt | grep -C2 'To be, or not to be'
HAMLET.
To be, or not to be, that is the question:
Whether ’tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
grep finds pattern matches, in this case for the pattern “To be, or not to be”, and prints the lines with a hit. Often it is useful to get some context for the hit: -A 2 prints the line with a hit and two lines After it, and -B 2 prints the line with a hit and two lines Before it; quite naturally, -C 2 then means two lines before plus two lines after. Above, one could also write grep -A 2 -B 2 'To be, or not to be'. Other useful arguments are e.g. -v to invert the match (print lines that do not match), -i to ignore upper/lowercase, -n to print the line numbers of hits, -e to include multiple patterns, and -f to read patterns from a file. In bash scripts, it is often useful to know (silently) whether a file contains a match or not. This is achieved with the argument -q.
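For example, -q could be used in a conditional like this (a minimal sketch):
> if grep -q 'To be, or not to be' Shakespeare_Hamlet.txt; then echo "found"; fi
found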
The grep command
I’m so used to the command that I had started to think of “grep” as a proper English word, meaning something like “catch” (the word is actually “grip”). When studying the material for this course, I was surprised to learn that “grep” is actually not a proper word and comes from a sed-like command g/re/p. Here, g means “global” matching (find all hits), re is the abbreviation for “regular expression” and p is the “print” command familiar from sed. This command was so commonly used that it was packaged into an easier-to-use program, but there’s still a similar functionality in sed:
> cat Shakespeare_Hamlet.txt | sed -n '/To be, or not to be/p'
To be, or not to be, that is the question:
In fact, pattern matching is so central in stream processing that the same task can be done in many different ways. These commands are equivalent:
> cat Shakespeare_Hamlet.txt | grep 'To be, or not to be'
> cat Shakespeare_Hamlet.txt | sed -n '/To be, or not to be/p'
> cat Shakespeare_Hamlet.txt | awk '/To be, or not to be/'
Above, we printed two lines of context for the hit, but we probably would like to read all of Hamlet’s words in that scene, i.e. all the way until the next break or empty line. For that, we use sed -nr, meaning the extended version that does not print every line, and then give the start pattern /To be, or not to be/, the end pattern /^\r$/ and p for the command “print”:
> cat Shakespeare_Hamlet.txt | sed -nr '/To be, or not to be/,/^\r$/p' | less
Note that we had to use ^\r$ as the empty line as the text was written on Windows: here, ^ means the beginning of the line, \r is the carriage return (CR) and $ means the end of the line. If the text were in Unix format, an empty line would match /^$/, i.e. nothing between the start and the end.
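For instance, if the CR characters are first stripped (as is done below in the word-count pipeline), the plain empty-line pattern works:
> cat Shakespeare_Hamlet.txt | sed 's/\r$//' | sed -n '/To be, or not to be/,/^$/p' | less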
“To be” is the start of the famous sentence in Hamlet, but how often do those words individually appear in the text? We can resolve that with bash commands, though we need quite many of them. You can copy-paste the command below, from cat all the way to head, into the terminal:
> cat Shakespeare_Hamlet.txt \
| sed 's/\r$//' \
| sed -n '/THE TRAGEDY OF HAMLET, PRINCE OF DENMARK/,$p' \
| sed '/^\*/,$d' \
| grep -v '*' \
| tr [A-Z] [a-z] \
| tr ' ' '\n' \
| grep -v ^$ \
| sort \
| uniq -c \
| sort -k1,1nr \
| grep -n . \
| head
1: 1107 the
2: 968 and
3: 719 to
4: 673 of
5: 545 a
6: 516 my
7: 514 i
8: 438 in
9: 429 you
10: 379 hamlet.
Unsurprisingly, we find that the words “the” and “and” are the most frequent in a piece of English text! But how did we do the analysis? Here’s the description of each line:
- print the text
- remove the Windows end-of-line (CR)
- delete everything before the title
- delete everything after line starting with “*” (about the Gutenberg project)
- remove the last line with asterisks
- convert everything to lowercase
- convert spaces to newlines, thus printing each word on a line of its own
- ignore empty lines (Unix)
- sort alphabetically
- count occurrences of each unique word
- sort numerically in reverse order by 1st column
- grep every line printing the line number
- show the first 10 lines
We could replace the last head with grep -w -e to -e be -e or -e not and thus capture the famous words specifically:
> cat Shakespeare_Hamlet.txt | sed 's/\r$//' | sed -n '/THE TRAGEDY OF HAMLET, PRINCE OF DENMARK/,$p' | sed '/^\*/,$d' | grep -v '*' | tr [A-Z] [a-z] | tr ' ' '\n' | grep -v ^$ | sort | uniq -c | sort -k1,1nr | grep -n . | grep -w -e to -e be -e or -e not
3: 719 to
15: 272 not
23: 204 be
38: 112 or
197: 19 not,
271: 13 not.
367: 9 be,
605: 5 be.
678: 5 not?
1188: 3 not;
1297: 3 to.
1301: 3 to’t
1302: 3 to’t.
2159: 2 to,
2587: 1 be.’
2588: 1 be:
2589: 1 be;
3391: 1 die—to
3592: 1 else,—be
4275: 1 highly—not
5076: 1 ‘not
5168: 1 one.—to
5200: 1 or,—not
6486: 1 think,—or
6546: 1 to:
6570: 1 to’t,
6571: 1 to’t;
Here, the number on the left is the position of the word in the ordered list of most common words, the number in the middle is the count of how many times it was found in the text, and the word on the right is the actual word. The results reveal that our code has problems and we should remove the punctuation before doing the word counting. Nevertheless, the word “or” seems to be the least frequent word of the famous sentence.
The example above is a bit artificial and the role of sed is secondary. In real language analyses, one would probably clean the input data as the first step (for which sed is used here) and then analyse the cleaned data with more advanced tools, e.g. in the Python programming language. However, this semi-artificial example is a great example of the Unix philosophy and the use of pipes to perform complex combinations of operations on data streaming from left to right. The final combination looks like this:
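Sketched with all the arguments omitted (only the order of the commands is shown):
cat | sed | sed | sed | grep | tr | tr | grep | sort | uniq | sort | grep | head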
Probably this could be much simplified, and the task could certainly be performed with numerous different combinations of bash tools. There is rarely – if ever – a single correct way of doing a task in bash!
awk for pattern scanning and processing
awk is defined as a pattern scanning and processing language. It is a true programming language and one can write meaningful programs with awk alone. However, its functionality is pretty limited and it is best suited for the processing of table-formatted data. For that, awk is exceptionally powerful and, with its small number of features, relatively easy to learn.
awk in column processing
We look at the use of awk with the familiar temperature data, but first convert it to Unix format by removing the CR characters:
> cd ~/IntSciCom/Helsinki/
> tr -d '\r' < Kaisaniemi.tsv > Kaisaniemi_LF.tsv
> head -2 Kaisaniemi_LF.tsv
Helsinki_Kaisaniemi 2024 1 1 00:00 -13.9 -13.7 -14
Helsinki_Kaisaniemi 2024 1 1 01:00 -14.1 -14 -14.3
With the simple measurements, there is not very much computation that we can do row-wise. As an example, we recompute the mean temperature as the average of the minimum (7th column) and maximum (8th column) temperature; in the data, that is already given in the 6th column.
By default, awk splits each line into columns at whitespace (spaces or TAB characters), so it works directly with our TAB-separated data; awk reads the data one row at a time and assigns the value of column 1 to the variable $1, column 2 to the variable $2 and so on. With that, we print the columns 6-8 and the average of columns 7 and 8:
> cat Kaisaniemi_LF.tsv | awk '{print $6,$7,$8,($7+$8)/2}' | head -5
-13.9 -13.7 -14 -13.85
-14.1 -14 -14.3 -14.15
-14.4 -14.1 -14.6 -14.35
-14.7 -14.5 -14.9 -14.7
-14.6 -14.4 -14.8 -14.6
The awk command has the form:
awk 'pattern { action }'
or if there is no pattern to match, simply:
awk '{ action }'
An example of pattern matching would be to repeat the previous computation of the average temperature, but only for 12 noon each day:
> cat Kaisaniemi_LF.tsv | awk '$5=="12:00"{print $2"-"$3"-"$4,($7+$8)/2}' | head -5
2024-1-1 -15.95
2024-1-2 -15.5
2024-1-3 -15.3
2024-1-4 -17.85
2024-1-5 -15.9
Above, we have replaced the comma separators with "-": this “glues” the fields together with the character within the double quotes (with "" one can glue two columns together). The default output field separator is a space, but that can be redefined, e.g. as OFS="\t" to make it a TAB.
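For instance, a minimal sketch redefining OFS in a BEGIN block so that the date and the mean temperature are separated by a TAB:
> cat Kaisaniemi_LF.tsv | awk 'BEGIN{OFS="\t"}{print $2"-"$3"-"$4,$6}' | head -2
2024-1-1	-13.9
2024-1-1	-14.1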
In addition to the action for each line, the awk command may include a block that is performed before anything else and a block that is performed after everything else:
awk '
BEGIN { do first }
{ action }
END { do last }
'
As an example of the BEGIN and END blocks, we can add a header line, sum the average temperatures at noon (avg) and then divide the sum (sum) by the number of records (NR) to get the average for the month:
> cat Kaisaniemi_LF.tsv | awk '
BEGIN{ OFS="\t"; print "Date\t\tAvg.temp" }
$5=="12:00"{ avg=($7+$8)/2; sum+=avg; print $2"-"$3"-"$4,avg }
END { print "average:",sum/NR }
' | head -5
Date Avg.temp
2024-1-1 -15.95
2024-1-2 -15.5
2024-1-3 -15.3
2024-1-4 -17.85
The output ends with:
2024-1-30 0.7
2024-1-31 1.55
average: -0.250538
Note that NR counts all the lines read from the file, not only the noon records, so the final number is not the true average of the noon temperatures; for that, one would keep a separate counter for the matching lines and divide by it. Above, the spacing is just for better readability and everything could be written on one line:
> cat Kaisaniemi_LF.tsv | awk 'BEGIN{OFS="\t";print "Date\t\tAvg.temp"}$5=="12:00"{ avg=($7+$8)/2; sum+=avg;print $2"-"$3"-"$4,avg}END{print "average:",sum/NR}' | less
awk with condition statements
In awk, the column values are stored in $1, $2 etc. One can also define and use other variables and, somewhat confusingly, these have no symbol, such as $ in bash or \ in sed, to indicate what they are. (In fact, this is the “normal” practice and the languages indicating variables with a specific character are a minority.) They are nevertheless variables and can be used as such. Earlier we used variables to store the average temperature and the sum of averages. Below, we use sum for the sum of the average hourly temperatures; n_obs for the count of observations; and prev_day for the day of the month of the previous measurement:
> cat Kaisaniemi_LF.tsv | \
awk '{
if( $4!=prev_day && n_obs>0 ) {
print $2"-"$3"-"$4,sum/n_obs;
sum=0; n_obs=0;
}
sum+=$6; n_obs+=1; prev_day=$4;
}' | head -15
2024-1-2 -15.6417
2024-1-3 -16.4083
2024-1-4 -16.5583
2024-1-5 -17.1583
2024-1-6 -15.975
2024-1-7 -14.8833
2024-1-8 -19.325
2024-1-9 -7.13333
2024-1-10 0.0666667
2024-1-11 1.325
2024-1-12 -3.24583
2024-1-13 -7.79167
2024-1-14 -9.02083
2024-1-15 -5.84583
2024-1-16 -8.46667
We have an if-condition on lines 3-6: if the condition on line 3 is true, we perform the action within the curly brackets on lines 4-5. Line 4 prints the output (seen above) and line 5 sets the sum and the counter to zero. Note that the condition on line 3 has two parts: $4!=prev_day tests if the day in field 4 is different from the day on the previous line (if it is, the day has changed and we need to calculate and print the average of the previous day); however, we also need the test n_obs>0 so that the calculation is not done on the very first line (where the first test would be true). In condition statements, && means “AND” and || means “OR”. Line 7 is performed for every row of the data file: sum+=$6 sums the temperature measurements, n_obs+=1 counts the observations (we could have missing data and not exactly 24 observations per day) and prev_day=$4 keeps the information of the day (and allows detecting the change of the day). Note that these three commands have to be separated by semicolons; if they were each written on a line of their own, no semicolons would be needed.
We had earlier a brief look at the command printf. One could replace the fourth line like this:
> cat Kaisaniemi_LF.tsv | \
awk '{
if( $4!=prev_day && n_obs>0 ) {
printf $2"-"$3"-"$4" %.1f\n",sum/n_obs;
sum=0; n_obs=0;
}
sum+=$6; n_obs+=1; prev_day=$4;
}' | head -5
Here, the formatting "%.1f\n" tells printf to print the result of sum/n_obs as a floating point number with one decimal place. As printf doesn’t automatically print a newline, we have to add \n in the command. The output is now:
2024-1-2 -15.6
2024-1-3 -16.4
2024-1-4 -16.6
2024-1-5 -17.2
2024-1-6 -16.0
awk vs. R
Those familiar with R may wonder why bother learning another language when the same analyses can be done with R. That is true, and even for an experienced user there are cases where one starts with awk and then realises that the analysis is getting so advanced that it would be better done with R. However, there are cases where awk is far superior and it is good to be aware of that alternative.
The main difference is that awk is a stream processing language and takes the input one line at a time. Above, we used some variables to store information from the previous line as the language itself doesn’t remember it. Because of that, awk can easily handle data files consisting of millions of lines. R is different and normally reads the full data into memory and then manipulates the full data table. Reading millions of lines of data with R takes a long time and lots of computer memory.
If one needs the full data for the analysis, then R is the only choice. However, often one doesn’t need the full data and it may be helpful to preprocess the data with awk and then import the reduced data set into R. For example, if only the temperature at 12 noon were used in the analysis, one could reduce the data with the command:
> cat Kaisaniemi_LF.tsv | awk '$5=="12:00"' > Kaisaniemi_12noon.tsv
Similarly, one could drop columns that are not needed in the analysis before importing the data to R.
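For instance, a minimal sketch that keeps only the date, the time and the mean temperature (the output file name is just an example):
> cat Kaisaniemi_LF.tsv | awk '{print $2,$3,$4,$5,$6}' > Kaisaniemi_reduced.tsv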
Hamlet’s word count with awk
awk can also be used for the analysis of non-table-formatted data. Then, the variables storing the column data are less useful and awk becomes more like any other programming language. We can implement much of the word counting of Hamlet within one awk program. Here, we write the program code into a file called count_words_in_hamlet.awk:
> cat > count_words_in_hamlet.awk << 'EOF'
{
if(/THE TRAGEDY OF HAMLET, PRINCE OF DENMARK/) { use = 1}
if($1~/^*/) { use = 0}
if(use == 1) {
sub("\r", "", $NF)
for(i = 1; i <= NF; i+=1) {
counts[tolower($i)] += 1
}
}
}
END {
for (word in counts) {
if(word != "") {
print counts[word],word
}
}
}
EOF
We can then use the argument awk -f to read in the program from the file. The data can be read from the stream (given by cat) or from the data file directly.
> cat Shakespeare_Hamlet.txt | awk -f count_words_in_hamlet.awk | sort -k1,1nr | head
> awk -f count_words_in_hamlet.awk Shakespeare_Hamlet.txt | sort -k1,1nr | head
1107 the
968 and
719 to
673 of
545 a
516 my
514 i
438 in
429 you
379 hamlet.
In the program code, lines 3 and 4 catch the beginning and the end of the text to be processed and set the variable use to either 1 or 0, meaning TRUE or FALSE. That variable is evaluated on line 5 and, if TRUE, the code on lines 6-9 is executed. In awk, the variable NF is the number of fields on a line. We first substitute \r with "" (i.e. nothing) in the last word (line 6), and then go through the words (or fields) on a line from 1 to NF (line 7). We convert the word to lowercase and add 1 to its counter (line 8). In the END block, executed after finishing with the file, we go through all the words in the array counts (line 13): if the word is not empty (line 14), we print the count value and the actual word. We could have written the functionality of sort and head in the awk code but that would have made it unnecessarily difficult.
Those with sharp eyes noticed that we used the heredoc functionality to write the code directly into a file. However, the command cat > count_words_in_hamlet.awk << 'EOF' has single quotes around EOF. The quotes prevent bash from expanding the variables when we paste the text in the terminal. If that were not prevented, bash wouldn’t write $1 but would search for the variable “$1” and use its value (which is probably "", i.e. nothing).
The official documentation for the GNU awk and sed programs is available at the GNU website (www.gnu.org).
Many tutorials are available on the net. One easy-to-read tutorial for awk can be found at https://www.grymoire.com/Unix/Awk.html.