Revisiting loops and conditions

Learning outcome

After this chapter, the student can construct loops that iterate a specific task multiple times, possibly for different files or sets of data. They can build conditions that change the behaviour of the command depending on specific rules.

The point of automation is to easily replicate something several times. This replication can be over a list of things (e.g. multiple data files) or for a certain number of times (e.g. 50 replicates of a heuristic search). We’ve seen these used previously but revisit the concept more thoroughly here.

For-loop

The for-loop is familiar to everyone who has done any computer programming, and a variant of it is also available in bash. The basic structure of the for-loop is:

for item in list; do
  command [ $item ]
done

This is nearly human-readable: “For each item in a list, do the command, possibly providing the item as the argument; once ready, say ‘done’”.

We’ll first have a look at the “list”. That can literally be a list of words or numbers:

> for num in 1 2 3 4 5; do
  echo value: $num
done 
value: 1
value: 2
value: 3
value: 4
value: 5

However, writing a list of numbers is exactly the type of task that should be given to the computer, and there’s a command for that, seq. We can specify the start, end and interval but, at the simplest, we can just write:

> seq 5
1
2
3
4
5

We can then use command substitution and write a command inside the command to generate the list. The format for that is $(command) such as:

> for num in $(seq -w 5 5 25); do
  echo value: $num
done 
value: 05
value: 10
value: 15
value: 20
value: 25

For the command seq, the argument -w is useful if the generated numbers have to be later sorted. See man seq for details.

By default, many bash commands (e.g. ls, sort) use lexicographic order when listing the files. Lexicographically, ‘1’ is before ‘5’ and thus ‘10’ is sorted before ‘5’; ‘0’ comes before ‘1’ and writing ‘05’ sorts ‘five’ before ‘ten’.

Moreover, it is often easier and tidier if the numbers are equally long and the file names with numbers align nicely. Most commands producing lists of numbers or patterns have an option to make the running pattern of fixed length and to sort correctly in lexicographic order. seq -w is one of these. Similarly, many commands manipulating lists of files have an option to specify an alternative sorting rule; however, the lexicographic order is the default and the easiest to use.
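
As a minimal sketch of the sorting behaviour described above (the numbers are arbitrary), the following compares how sort orders the unpadded and the zero-padded output of seq. The option sort -n is shown as one example of an alternative sorting rule:

```shell
# Unpadded numbers in lexicographic order: 10 and 15 come before 5
unpadded=$(seq 5 5 15 | sort | paste -s -d ' ' -)
# Zero-padded numbers from seq -w sort correctly even lexicographically
padded=$(seq -w 5 5 15 | sort | paste -s -d ' ' -)
# Numeric sorting (sort -n) is an alternative rule for unpadded numbers
numeric=$(seq 5 5 15 | sort -n | paste -s -d ' ' -)

echo "lexicographic, unpadded: $unpadded"
echo "lexicographic, padded:   $padded"
echo "numeric, unpadded:       $numeric"
```

Here paste -s is used only to print each sorted list on one line.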

We can similarly generate lists of files with the ls command:

> cd ~/IntSciCom/Helsinki/
> ls H*_*_*.csv
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv  Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Helsinki_Kumpula_1.1.2024-31.1.2024.csv     Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv

and use command substitution for other things:

> for file in $(ls H*_*_*.csv); do
  echo "$(echo $file | cut -d_ -f2) has $(cat $file | wc -l) observations"
done
Kaisaniemi has 745 observations
Kumpula has 745 observations
Malmi has 745 observations
Vuosaari has 745 observations

If you do not understand the function of the loop above, try to break it into pieces. First, test the plain loop:

> for file in $(ls H*_*_*.csv); do
  echo $file
done
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
Helsinki_Kumpula_1.1.2024-31.1.2024.csv
Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv

and then the internal commands:

> file=Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
> echo $file | cut -d_ -f2
Kaisaniemi

and

> cat $file | wc -l
745
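
The same kind of loop can also be written without ls by letting the shell expand the file pattern directly; this variant keeps working even if a file name contains spaces. The sketch below does not use the course data: it assumes two small hypothetical files created in a temporary directory.

```shell
# Create a scratch directory with two hypothetical data files
tmpdir=$(mktemp -d)
printf 'line1\nline2\nline3\n' > "$tmpdir/Helsinki_Kumpula_demo.csv"
printf 'line1\nline2\n' > "$tmpdir/Helsinki_Malmi_airport_demo.csv"

# Loop over the files matched by the pattern; the quotes protect unusual names
report=$(for file in "$tmpdir"/H*_*_*.csv; do
  station=$(basename "$file" | cut -d_ -f2)
  count=$(wc -l < "$file")
  # $((count)) strips the padding that some versions of wc add
  echo "$station has $((count)) observations"
done)
echo "$report"

rm -r "$tmpdir"
```

Wrapping the whole loop in $( ) stores its output in a variable, which is convenient when the result is needed later.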


Exercise: For-loops

The for-loop is the simplest loop type and exists in most programming languages, including ‘bash’. The loop itself just repeats a set of commands for something and the key is to define the something (a list of things) and the set of commands.

Exercise 1 in Moodle.


While-loop

An alternative to the for-loop is the while-loop. In principle, the while-loop iterates as long as a condition is true:

while [ condition ]; do
  command
done

This allows replicating the previous functionality:

> num=5
> while [ $num -le 25 ]; do
  echo value: $num
  num=$(($num +5))
done 
value: 5
value: 10
value: 15
value: 20
value: 25

However, it is stupid to do counting with while as for does it so much better. On the other hand, while can be made to behave similarly to the for-loop by doing the counting outside the loop and reading the values from STDIN:

> seq -w 5 5 25 | while read num; do
  echo value: $num
done 
value: 05
value: 10
value: 15
value: 20
value: 25

This works equally well with ls and files:

> ls H*_*_*.csv | while read file ; do
  echo "$(echo $file | cut -d_ -f2) has $(cat $file | wc -l) observations"
done
Kaisaniemi has 745 observations
Kumpula has 745 observations
Malmi has 745 observations
Vuosaari has 745 observations

The advantage of the while read pair is that the input list can be piped in. I find it easier to construct first the command generating the list (possibly a combination of ls and grep commands) and then add the while loop after that (separated by a pipe). The alternative would be to embed the complex command at the beginning of the for-loop.

Note that read can read multiple variables at a time.

Finnish is an inflected language and the word order within sentences is pretty flexible. For example, for the sentence ‘the dog bit the man’, the three words can be in any order (though, the emphasis of the sentence changes a bit). We can demonstrate this with a bash command that outputs the three words in a random order, all orderings being valid Finnish sentences:

> words=("koira" "puri" "miestä")
> shuf -i 0-2 | xargs -n3 echo | while read a b c; do echo ${words[$a]} ${words[$b]} ${words[$c]} ; done
koira puri miestä

If one executes the command again – pressing the arrow up key and enter – it changes the output randomly.

> shuf -i 0-2 | xargs -n3 echo | while read a b c; do echo ${words[$a]} ${words[$b]} ${words[$c]} ; done
puri miestä koira

This is an unnecessarily complex command to demonstrate the ability of the read command to take multiple arguments and it may need some clarification. words=("koira" "puri" "miestä") defines an array (or a vector) of three words, called $words. The new command shuf -i shuffles the input, here the numbers 0-2, and prints them one at a time to STDOUT. xargs -n3 collects three lines together and outputs them all on one line to STDOUT with echo. The loop construct while read a b c; do ...; done reads three variables per row, called $a, $b and $c, and does something with them. And the last command echo ${words[$a]} ${words[$b]} ${words[$c]} prints the words in a random order depending on the values of $a, $b and $c.
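
A simpler sketch of the same feature, using made-up input lines rather than course data: read splits each line at whitespace, assigns the words to the variables in order, and gives everything left over to the last variable.

```shell
# Each input line holds a station name, a temperature and an optional comment
parsed=$(printf 'Kumpula -3.5 checked by hand\nMalmi -4.1\n' |
  while read -r station temp comment; do
    # The last variable receives everything that is left on the line
    echo "station=$station temp=$temp comment=[$comment]"
  done)
echo "$parsed"
```

The option -r tells read to keep backslashes in the input as they are; it is not needed for this input but is a common habit.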

We can add an outer loop to execute the commands multiple times and see that the outputs really are random:

> for i in $(seq 1 10); do
  shuf -i 0-2 | xargs -n3 echo | while read a b c; do 
    echo ${words[$a]} ${words[$b]} ${words[$c]}
  done
done
miestä puri koira
puri miestä koira
miestä puri koira
koira puri miestä
miestä koira puri
miestä puri koira
puri koira miestä
koira miestä puri
miestä koira puri
miestä puri koira


Exercise: While-loops

The while-loop is a simple loop structure although it may not be as intuitive as the for-loop. The loop itself just repeats a set of commands for something and the key is to define the something (a list of things) and the set of commands.

Exercise 2 in Moodle.


Iterating with find

In the age of Google search, it would seem obvious that every computer system has a powerful search method. However, searching (especially by content) requires building and keeping up-to-date complex indexes or databases, and such systems aren’t widely used on the command line. Nevertheless, the command line has an efficient program for searching directory structures, find, and it can be extended to search also the contents of the files.

The find command may look clumsy and complicated but it is also very powerful. The basic format of the command is find <dir> [arguments]. Giving only the directory name goes recursively through the target and prints out everything found:

>  find ~/IntSciCom/village/ | head -4
/users/username/IntSciCom/village/
/users/username/IntSciCom/village/house2
/users/username/IntSciCom/village/house2/bedroom
/users/username/IntSciCom/village/house2/bedroom/.hidden

The files starting with a dot are not shown by many programs and the leading dot is used to hide unnecessary details such as configuration files. They can be seen with ls by adding the argument -a:

> ls ~/IntSciCom/village/house2/bedroom
> ls -a ~/IntSciCom/village/house2/bedroom
.  ..  .hidden

Above, the single dot is the target directory and the double dot is the parent directory (that is why cd .. goes one step backwards); the file .hidden is empty and created just for git to include the directory in the repository where the course material is copied from.

Typical arguments for find specify the name (as pattern), type (e.g. f for file, d for directory) or modification/access times. Focusing on ~/IntSciCom/village/, we can find the directories whose name starts with ‘b’ with the command:

> find ~/IntSciCom/village/ -name "b*" -type d
/users/username/IntSciCom/village/house2/bedroom
/users/username/IntSciCom/village/house2/office/bookshelf
/users/username/IntSciCom/village/house1/bedroom
/users/username/IntSciCom/village/house1/office/bookshelf
/users/username/IntSciCom/village/house3/bedroom
/users/username/IntSciCom/village/house3/office/bookshelf
/users/username/IntSciCom/village/shoppingcentre/bookstore

Alternatively, we can find all non-empty (sign ! negates the condition) files with the command:

> find ~/IntSciCom/village/ -type f ! -empty 
/users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Hamlet.txt
/users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Macbeth.txt
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt

We could of course pipe the output to a while-loop to do something for the file (and often do so). However, find allows executing commands for the hits found within the command itself. That is done with -exec <command> {} \; where {} is the position where the filename is put and \; ends the command. Then, we could count the words in the non-empty files with the command:

> find ~/IntSciCom/village/ -type f ! -empty -exec wc -w {} \;
12 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Hamlet.txt
3 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Macbeth.txt
34988 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt
21427 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt
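
The find options used above can be tried safely on a throwaway directory tree. The following sketch assumes a small hypothetical tree (the directory and file names are invented for the illustration) and repeats the name, type, emptiness and -exec tests on it:

```shell
# Build a small hypothetical directory tree in a scratch directory
demo=$(mktemp -d)
mkdir -p "$demo/house1/bedroom" "$demo/house1/office/bookshelf"
printf 'to be or not to be\n' > "$demo/house1/office/bookshelf/quote.txt"
: > "$demo/house1/bedroom/empty.txt"   # an empty file

# Directories whose name starts with 'b'
dirs=$(find "$demo" -type d -name "b*" | sort)
# Word count for every non-empty regular file, one wc call per file
counts=$(find "$demo" -type f ! -empty -exec wc -w {} \;)

echo "$dirs"
echo "$counts"

rm -r "$demo"
```

The first command should list the directories bedroom and bookshelf, and the second should report six words for quote.txt while skipping the empty file.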

find only finds files and directories but we can extend the search for the contents with grep. Focusing on the same non-empty files, we can search for the words “To be” within them:

> find ~/IntSciCom/village/ -type f ! -empty -exec grep -H "To be" {} \; 
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To bear our hearts in grief, and our whole kingdom
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be contracted in one brow of woe;
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be commanded.
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be a preparation ’gainst the Polack;
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be, or not to be, that is the question:
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:To be forestalled ere we come to fall,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:But never the offence. To bear all smooth and even,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt:     To be your Valentine.
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:May read strange matters. To beguile the time,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be his purveyor: but he rides well;
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be the same in thine own act and valour
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be invested.
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To be thus is nothing,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:To bed, to bed. There’s knocking at the gate. Come, come, come, come,
/users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt:give me your hand. What’s done cannot be undone. To bed, to bed, to

find is often used to find files that are then deleted, so often that the program has the argument -delete for that. It is generally safer to separate the search for files and their deletion and do the latter using the commands constructed by the former:

> find ~/IntSciCom/village/ -empty -type f -exec echo "rm -i "{} \; > delete_empty.sh
> # bash delete_empty.sh

Here, the find command prints rm -i commands that delete the empty files found. The collection of these rm -i commands (a very simple “script”) can then be executed with bash delete_empty.sh. The argument -i makes the deletion interactive and asks the user confirmation for each step.
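
The two-step approach can be rehearsed on scratch files. This is only a sketch under stated assumptions: the files are hypothetical, the generated script is written outside the searched directory so that find does not pick it up, and {} is given as a separate argument of echo, which is the more portable form.

```shell
# Two empty files and one non-empty file in a scratch directory
demo=$(mktemp -d)
: > "$demo/empty1.txt"
: > "$demo/empty2.txt"
echo "some content" > "$demo/keep.txt"

# Write one "rm -i <file>" command per empty file, but do not run them yet
script=$(mktemp)
find "$demo" -type f -empty -exec echo rm -i {} \; > "$script"

# Inspect the plan before executing anything
cat "$script"
planned=$(grep -c '^rm -i ' "$script")
echo "$planned files would be deleted by running: bash $script"

rm -r "$demo" "$script"
```

Note that file names containing spaces would need quoting in the generated commands; the sketch does not handle that case.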


Exercise: ‘find’ and find-loops

After some practice, find is really powerful for finding files and directories with specific rules. It becomes even more powerful when the results of the search are used as the input for further commands.

Exercise 3 in Moodle.


Iterating with xargs

The command xargs is far more complex (and powerful) than the previous loop structures and one can do well without it. When your computational tasks require parallel processing or the traditional loop structures feel too verbose, it is worth having a closer look at this command. Those baffled by the contents can jump straight to Conditional commands.

A common alternative to find -exec is to feed the output of find to the program xargs. xargs allows for building and executing commands or command combinations using the information provided through STDIN.

We can count the words of non-empty files by feeding the file names to xargs and then providing the bash command wc -w; xargs places the file names as the argument of that command and produces the output:

> find ~/IntSciCom/village/ -type f ! -empty | xargs wc -w
    12 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Hamlet.txt
     3 /users/username/IntSciCom/village/house1/bedroom/notebook/Shakespeare_Macbeth.txt
 34988 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Hamlet.txt
 21427 /users/username/IntSciCom/village/house1/office/bookshelf/Shakespeare_Macbeth.txt
 56430 total

The last line of the output reveals that all four files were given simultaneously as arguments (as in wc -w file1 file2 file3 file4) and the program therefore also outputs the total count. With argument -n1, xargs takes the file names one at a time and produces output identical to that of find -exec:

> find ~/IntSciCom/village/ -type f ! -empty | xargs -n1 wc -w | less

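By default, xargs places the filenames read from STDIN after the command provided – which is the behaviour we want for wc -w. However, the filename (or word) can be also “named” and used explicitly, possibly multiple times. The argument -I% names the filename variable as % and we can then give that as the argument as in wc -w %. Note that -I already makes xargs run the command once per input line, so -n1 is redundant in this combination and recent versions of GNU xargs may print a warning about it: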

> find ~/IntSciCom/village/ -type f ! -empty | xargs -n1 -I% wc -w % | less
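
The three modes can be compared side by side with echo as a harmless stand-in for the real command. This is a minimal sketch with made-up input items:

```shell
# Default: all input items become arguments of a single command
together=$(printf 'a\nb\nc\n' | xargs echo)
# -n1: the command is run once per item
separate=$(printf 'a\nb\nc\n' | xargs -n1 echo)
# -I%: the item is inserted wherever % appears, here twice
named=$(printf 'a\nb\n' | xargs -I% echo "copy %.csv to %.tsv")

echo "$together"
echo "$separate"
echo "$named"
```

The first command prints all items on one line, the second prints one item per line, and the third builds a separate sentence for each item.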

Parallelising with xargs

One big reason for learning the use of xargs is its ability to parallelise jobs. Let’s assume that one has a Linux workstation with 16 CPUs and one hundred long analyses to run over the weekend. It would be inefficient (and possibly impossible) to start all 100 at the same time on Friday afternoon and it would be cumbersome to keep checking over the weekend if any of the jobs has finished and a new one should be started. xargs can take the long list of jobs as the input and run exactly 16 of them in parallel, starting a new one whenever a previous one finishes. (The program parallel is even better for this task but also more complex.)

To test that, we can create a small program and run copies of that in parallel. First, let’s make the program called program.sh:

> cat > program.sh << 'EOF'
#!/usr/bin/bash 
echo $1 starts
sleep $1
echo $1 ends
EOF

This expects a number as the argument, prints the number in the beginning and the end, and sleeps the number of seconds in between. We can test it:

> bash program.sh 2
2 starts
2 ends

and see that there’s a 2-second wait in the middle.

To see that xargs runs a specific number of instances in parallel, we can start five copies of the program, each taking one second longer to run. We do that by providing numbers 1, 2, 3, 4, 5 through STDIN; xargs reads these one at a time (-n1) and then runs two copies (-P2) of bash program.sh in parallel, providing the number as the argument (-I% and %):

> seq 5 | xargs -n1 -P2 -I% bash program.sh % 
1 starts
2 starts
1 ends
3 starts
2 ends
4 starts
3 ends
5 starts
4 ends
5 ends

From the output we see that this indeed happened: 1 and 2 were started at the same time but as 1 finished earlier, 3 was started before 2 finished, and so on.

In fact, we don’t necessarily need to write the commands in a program file but can execute them with the xargs command using bash -c:

> seq 5 | xargs -n1 -P2 -I% bash -c "echo % starts; sleep %; echo % done"
1 starts
2 starts
1 done
3 starts
2 done
4 starts
3 done
5 starts
4 done
5 done

Here, the part within the double quotes could be the commands for the time-consuming analysis.
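
The benefit of -P can be made visible by timing the run. The sketch below assumes four dummy jobs that each just sleep for one second; with two of them running at a time, the whole run should take roughly two seconds rather than four. The exact timing depends on the machine.

```shell
start=$(date +%s)
# Four one-second jobs, at most two running at the same time;
# the output is sorted because parallel jobs may finish in any order
finished=$(seq 4 | xargs -P2 -I% bash -c "sleep 1; echo job % done" | sort)
end=$(date +%s)
elapsed=$((end - start))

echo "$finished"
echo "elapsed: about $elapsed seconds for 4 one-second jobs"
```

Changing -P2 to -P1 runs the same jobs one after another and should roughly double the elapsed time.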

Important

Note that on large computing clusters, much of the parallelisation is done with the scheduling system. On the other hand, if one has lots of small analyses (hundreds or thousands, each running in minutes), they should not be sent as separate jobs to the queue. Sending them to the queue would massively complicate the scheduler’s task and quickly ruin the user’s “Priority” score, slowing down the progress of the jobs in the queue.

In such cases, it would be better to collect the many analyses in a bash script file, reserve a job with e.g. 20 CPUs and then run the commands with xargs -P20. Details can be found in the CSC documentation or requested from the CSC support staff.

Multiple xargs

It may be useful to have multiple xargs commands piped together. One useful trick is to use basename to get rid of the suffix. The commands dirname and basename take a file path as the argument and output either the directory path (everything but the last component) or the base part (the last component) of it. basename can additionally remove a fixed suffix (for example “.csv” at the end) and thus give the base of the name, to which we can add different suffixes. We could do that and apply the conversion script to all “.csv” files, writing the output with the “.tsv” suffix:

> cd ~/IntSciCom/Helsinki/
> ls H*_*_*.csv | xargs basename -s .csv | xargs -n1 -I% bash -c "bash convert.sh %.csv > %.tsv"

We now have all the raw data nicely formatted as tab-separated files:

> ls H*_*_*
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv  Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Helsinki_Kaisaniemi_1.1.2024-31.1.2024.tsv  Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.tsv
Helsinki_Kumpula_1.1.2024-31.1.2024.csv     Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv
Helsinki_Kumpula_1.1.2024-31.1.2024.tsv     Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.tsv
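
What dirname and basename return is easiest to see on a single path. The sketch below uses a hypothetical path (the directory /data/Helsinki is an assumption and does not need to exist) and also shows bash parameter expansion as an alternative way of removing the suffix:

```shell
# A hypothetical path; the file does not need to exist for these commands
path=/data/Helsinki/Helsinki_Kumpula_1.1.2024-31.1.2024.csv

dir=$(dirname "$path")             # directory part
base=$(basename "$path")           # last component
stem=$(basename -s .csv "$path")   # last component without the suffix
stem2=${base%.csv}                 # same result with parameter expansion

echo "$dir"
echo "$base"
echo "$stem"
echo "$stem2"
```

The last two lines of the output are identical; the parameter expansion avoids starting a separate program, which can matter in long loops.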

Conditional commands

A script that always runs the same commands on the same file names can be applied to different data by copying each data file to the expected input name and renaming the output file afterwards. A hypothetical setup could be like this:

> cp file1.tsv input.tsv
> bash script.sh
> mv output.txt file1.out
> cp file2.tsv input.tsv
> bash script.sh
> mv output.txt file2.out

Here, the script always reads the data from the file input.tsv and writes the results to the file output.txt.

A more advanced solution is to use input arguments for the script and thus apply the commands to different files. A hypothetical setup could be like this:

> bash script.sh file1.tsv file1.out
> bash script.sh file2.tsv file2.out

Here, the script takes two arguments and uses the first as the input data and the second as the output file.
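
Inside the script, the arguments are available as $1, $2 and so on. As a minimal sketch, the hypothetical script.sh below (its contents are invented for this illustration) extracts the first column of a tab-separated input file into the output file:

```shell
workdir=$(mktemp -d)

# A hypothetical script.sh: $1 is the input file, $2 is the output file
cat > "$workdir/script.sh" << 'EOF'
input=$1
output=$2
cut -f1 "$input" > "$output"
EOF

# A small made-up input file with two tab-separated columns
printf 'Kumpula\t-3.5\nMalmi\t-4.1\n' > "$workdir/file1.tsv"
bash "$workdir/script.sh" "$workdir/file1.tsv" "$workdir/file1.out"
result=$(cat "$workdir/file1.out")
echo "$result"

rm -r "$workdir"
```

Copying the arguments to named variables at the top of the script makes the rest of the script easier to read.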

Often it is necessary to be able to make alternative decisions within the bash scripts and adjust the behaviour depending on the circumstances. This is done with tests and conditions. Most of the tests work on strings or numbers, or on files and directories.

Some of the commonly needed string and integer tests are:

Test                  Function
-z string             True if the length of string is zero
string1 = string2     True if the strings are equal
string1 != string2    True if the strings are not equal
int1 -eq int2         True if int1 is equal to int2
int1 -ne int2         True if int1 is not equal to int2
int1 -lt int2         True if int1 is less than int2
int1 -gt int2         True if int1 is greater than int2

Some of the commonly needed file tests are:

Test       Function
-e file    True if file exists
-f file    True if file exists and is a regular file
-d file    True if file exists and is a directory

The full list of bash tests can be found at https://www.gnu.org/software/bash/manual/html_node/Bash-Conditional-Expressions.html.

bash has the command test for evaluating such conditions. The result of the test is stored in the special variable $?. As an example, we could test if the variable $val equals a certain value:

> val=5
> test $val -eq 4
> echo $?
1

Here, the test produces FAIL/FALSE which in bash is 1. Another try is more successful:

> test $val -eq 5
> echo $?
0

In bash, SUCCESS/TRUE is coded as 0.
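
The same check of $? works for the string and file tests listed in the tables. In the sketch below, the return value is captured right after each test with && (executed on success) or || (executed on failure); the temporary file is created only for the illustration.

```shell
val=5

# $? holds the return value of the most recent command
test "$val" -eq 5 && status_equal=$?      # the test succeeded, so $? is 0
test "$val" -eq 4 || status_unequal=$?    # the test failed, so $? is 1

# [ ... ] is another way of writing test ...
tmpfile=$(mktemp)
[ -f "$tmpfile" ] && status_file=$?       # a regular file: 0
[ -d "$tmpfile" ] || status_dir=$?        # not a directory: 1
[ -z "" ] && status_empty=$?              # a zero-length string: 0

echo "equal=$status_equal unequal=$status_unequal file=$status_file dir=$status_dir empty=$status_empty"
rm "$tmpfile"
```

The value of $? is overwritten by every command, so it has to be stored immediately if it is needed later.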

Using the command test and the variable $? is often cumbersome and it is much more straightforward to utilise the if-else structure common in most programming languages.

if [ condition ]; then
  [commands for SUCCESS/TRUE]
else
  [commands for FAIL/FALSE]
fi

For example, we can compare $val to the numbers from 3 to 7:

> val=5
> for i in {3..7}; do
    if [ $val -eq $i ]; then
      echo value is $i
    else
      echo value is NOT $i
    fi
  done
value is NOT 3
value is NOT 4
value is 5
value is NOT 6
value is NOT 7

In an if-else condition, only the if part is compulsory; even then, the condition has to be ended with fi:

> for i in {3..7}; do
    if [ $val -eq $i ]; then
      echo value is $i
    fi
  done
value is 5

Sometimes it may be easier to leave the TRUE case empty and do the actual work in the FALSE case. However, the TRUE branch cannot be completely empty; the command : (a colon) means “do nothing” and can be used as a placeholder:

> for i in {3..7}; do
    if [ $val -eq $i ]; then
      :
    else
      echo value is NOT $i
    fi
  done
value is NOT 3
value is NOT 4
value is NOT 6
value is NOT 7

On the other hand, the conditions are always boolean TRUE/FALSE cases and these can be converted to the opposite value with !:

> for i in {3..7}; do
    if [ ! $val -eq $i ]; then
      echo value is NOT $i
    fi
  done
value is NOT 3
value is NOT 4
value is NOT 6
value is NOT 7

Finally, an if-else condition can have any number of alternative cases, the subsequent ones given as elif; the final else (if present) is executed if none of the previous cases is TRUE:

> for i in {3..7}; do
    if [ $val -lt $i ]; then
      echo value is less than $i
    elif [ $val -gt $i ]; then
      echo value is greater than $i
    else
      echo value must be $i
    fi
  done
value is greater than 3
value is greater than 4
value must be 5
value is less than 6
value is less than 7
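
The same if-elif-else structure works with the string and file tests. As a sketch under stated assumptions (the directory and file are temporary and created only for this illustration), the loop below classifies four different targets:

```shell
# A scratch directory containing one empty file
dir=$(mktemp -d)
: > "$dir/data.txt"

kinds=$(for target in "" "$dir" "$dir/data.txt" "$dir/missing.txt"; do
  if [ -z "$target" ]; then
    echo "empty string"
  elif [ -d "$target" ]; then
    echo "directory"
  elif [ -f "$target" ]; then
    echo "regular file"
  else
    echo "does not exist"
  fi
done)
echo "$kinds"

rm -r "$dir"
```

The order of the cases matters: only the first case that is TRUE is executed, and the remaining ones are skipped.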

Conditions within a script

To demonstrate the integration of tests in scripts, we can do a highly simplistic text analysis of the two books found in the directory ‘bookshelf’:

> cd ~/IntSciCom/village/house1/office/bookshelf/
> cat > text_analysis.sh << 'EOF'
books=($(ls *.txt))
len0=$(cat ${books[0]} | wc -w)
len1=$(cat ${books[1]} | wc -w)
echo -n "File ${books[0]} has $len0 words and is "
if [ $len0 -gt $len1 ]; then
   echo -n "longer than"
elif [ $len0 -lt $len1 ]; then
   echo -n "shorter than"
else
   echo -n "as long as"
fi
echo " file ${books[1]} that has $len1 words."
EOF
> bash text_analysis.sh 
File Shakespeare_Hamlet.txt has 34988 words and is longer than file Shakespeare_Macbeth.txt that has 21427 words.

Above, we wrote the script using a heredoc. Counting the rows from the cat command (row 1), the second row stores the filenames in a vector called $books. The contents of vectors are zero-indexed and we can access the first item as ${books[0]}. On the third row, we do that and store the word count (the output of wc -w) in the variable $len0; the fourth row does the same for the second file and $len1. The filename and the word count are used within the echo on the fifth row, and the two word counts are compared in the tests on the sixth and eighth rows: the first condition tests whether $len0 is greater than $len1 and, if it is, performs the command on row 7; if it is not, the second test evaluates whether $len0 is less than $len1 and, if so, performs the command on row 9; if neither test is true, the integers must be equal and we go for the default behaviour (else on row 10) and perform the command on row 11. The if-else condition has to be closed with fi as on row 12.

Conditional commands within a script

As an example of a file test and conditional execution of commands, we revisit the Helsinki temperature data and embed the test condition within another command. When doing automated tasks, it is useful to check that existing files are not mistakenly overwritten. We incorporate this in the earlier csv-file conversion and add a check of the target file existence [ -e ${name}.tsv ]. The csv-file is converted only if a similarly named target tsv-file doesn’t exist:

> cd ~/IntSciCom/Helsinki/
> rm Helsinki_K*_*.tsv
> ls H*_*_*.csv | while read csv; do 
    name=$(basename -s .csv $csv)
    if [ ! -e ${name}.tsv ]; then 
      echo converting $csv
      bash convert.sh $csv > ${name}.tsv
    fi
  done
converting Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
converting Helsinki_Kumpula_1.1.2024-31.1.2024.csv

As we deleted the tsv-files matching the pattern Helsinki_K*_*.tsv, only the Kaisaniemi and Kumpula files are regenerated; the Malmi and Vuosaari files exist and they were not overwritten.
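
The overwrite check can be tested without touching the real data. In the sketch below, the files are hypothetical and tr is only a stand-in for the course’s convert.sh; one target file exists beforehand and must be left alone.

```shell
workdir=$(mktemp -d)
printf 'a,b\n' > "$workdir/first.csv"
printf 'c,d\n' > "$workdir/second.csv"
printf 'already here\n' > "$workdir/second.tsv"   # an existing target file

log=$(for csv in "$workdir"/*.csv; do
  name=$(basename -s .csv "$csv")
  if [ ! -e "$workdir/$name.tsv" ]; then
    echo "converting $name.csv"
    # tr is only a stand-in for the real conversion script
    tr ',' '\t' < "$csv" > "$workdir/$name.tsv"
  else
    echo "skipping $name.csv"
  fi
done)
echo "$log"

# The existing target file should still have its original content
kept=$(cat "$workdir/second.tsv")
echo "$kept"
rm -r "$workdir"
```

Printing a message also for the skipped files makes it easier to see afterwards what the loop did and did not do.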

The unconditional version of this conversion was written above using the command xargs. Conditional execution can be integrated into that, too:

> cd ~/IntSciCom/Helsinki/
> rm Helsinki_K*_*.tsv
> ls H*_*_*.csv | xargs basename -s .csv | \
  xargs -I% bash -c "if [ ! -e %.tsv ]; then echo converting %; bash convert.sh %.csv > %.tsv; fi"  
converting Helsinki_Kaisaniemi_1.1.2024-31.1.2024
converting Helsinki_Kumpula_1.1.2024-31.1.2024

Tests of program return value

Above we used tests for variables and files. One could easily find ways of extending the same approach for the outputs of programs. For example, we could go through all files with the suffix “.csv” and look for the word “Malmi”, reporting the name of the file with a hit:

> ls *csv | while read file; do 
    if [ $(grep Malmi $file | wc -l) -gt 0 ]; then 
      echo $file 
    fi 
  done
Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv

Here, the if condition computes the value of $(grep Malmi $file | wc -l), that is the number of lines with the word “Malmi” in each file, and tests if that is greater than zero; if it is, the file name is printed.

One would think that programs should always output something. That is not the case and many programs have an option to run quietly and only produce the “return value”, stored in the variable $?. The command grep is one of these and runs quietly with the option -q. We can thus simplify the command above and write it as:

> ls *csv | while read file; do 
    if grep -q Malmi $file; then
      echo $file 
    fi 
  done
Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv

Note that we don’t even need the square brackets [ ] for the test as grep returns directly either true or false.

This can be simplified further. In the section about conditions in awk, we learned that && means “AND” and || means “OR”. On the other hand, in the section about scripts and jobs, we learned that lists of commands combined with double-ampersands && stop at the first failure. Using that information, we can further simplify the structure:

> ls *csv | while read file; do 
    grep -q Malmi $file && echo Found in $file || echo Not in $file 
  done
Not in Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
Not in Helsinki_Kumpula_1.1.2024-31.1.2024.csv
Found in Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv
Not in Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv

Here, grep -q Malmi $file returns either true or false (0 or 1, stored in $?); if it is true, the following command echo Found in $file is executed and the name of the file with the hits is printed; if it is false, the alternative command echo Not in $file is executed.
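
The short form and the full if-else form can be compared on scratch files. The sketch below assumes two hypothetical files, only one of which contains the word “Malmi”, and checks that both forms give the same report:

```shell
tmpdir=$(mktemp -d)
printf 'station: Malmi\n' > "$tmpdir/a.csv"
printf 'station: Kumpula\n' > "$tmpdir/b.csv"

# Short form: && runs on success, || runs on failure
short=$(for file in "$tmpdir"/*.csv; do
  grep -q Malmi "$file" && echo "Found in $(basename "$file")" || echo "Not in $(basename "$file")"
done)

# Full form: the return value of grep -q decides the branch
long=$(for file in "$tmpdir"/*.csv; do
  if grep -q Malmi "$file"; then
    echo "Found in $(basename "$file")"
  else
    echo "Not in $(basename "$file")"
  fi
done)

echo "$short"
rm -r "$tmpdir"
```

The two forms are not fully equivalent: in the short form, the command after || is also executed if the command after && fails, whereas in the if-else form exactly one branch is executed. With a simple echo this rarely matters, but for commands that can fail the if-else form is the safer choice.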


Exercise: Conditional commands

Utilisation of alternative, conditional functions allows making the scripts more generic and applicable to a wider range of cases. The test of file existence is crucial for safe code and helps avoid overwriting existing files.

Exercise 4 in Moodle.


Take-home message

A key part of programming is automation, the repetition of specific tasks multiple times or for multiple different inputs. bash has many ways of creating loops: some are useful for simple iterations while others apply commands for files found using specific criteria. The commands may change their function depending on specific conditions: these can be e.g. arithmetic or test the properties of files.