Basics of bash
After this chapter, the students can construct chains of commands where the data stream is piped from one command to another. They can utilise and redirect the input, output and error streams and create meaningful combinations of commands for processing table-formatted text data. The students can utilise Heredoc-format to create reproducible parameter files and apply the basic commands for file comparison.
The previous chapter already introduced some bash commands and programs. Here, the most central ones are first listed and then extended with some practical commands. After that, we have a closer look at how to combine these commands in loops and with conditions.
Table of commonly-used commands
Central commands for command-line working
Command | Function |
---|---|
cd | change directory |
pwd | print working (current) directory |
ls | list content |
tree | list content as a tree structure |
ln | create a link to a file |
echo | display a line of text |
cat | catenate files, print file or stream content |
less | browse file or stream content |
head | print head of file or stream |
tail | print tail of file or stream |
cut | select columns |
wc | character, word and line count |
grep | get rows with matching patterns |
sort | sort by different criteria |
uniq | select and count unique lines |
column | format data into columns |
jobs | list jobs |
kill | kill jobs |
diff | show differences between files |
paste | merge lines of files |
join | join lines on a common field |
tr | translate (convert) characters |
sed | edit stream (or file) |
awk | scripting language |
There is a built-in manual for the commands that can be opened with the command man <command_name>
. The manual mostly lists the arguments and, for a beginner with bash, more informative help can probably be found on the net.
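For example, the manual page of sort opens with:
> man sort
and is closed by pressing the letter q.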
bash
The Unix-world is full of insider jokes, puns and plays with words.
The first ever shell language was called Thompson shell (sh
) after its author Ken Thompson, the developer of the first Unix operating system, released in 1971. Eight years later, an improved shell language was released and called Bourne shell (sh
) after its developer Stephen Bourne. The shell was constantly improved and remained in use for a long time.
The Bourne shell was part of the commercial (and very expensive) Unix OS, and the free software project GNU, aiming to create a free Unix-like operating system, needed to develop an open-source alternative for it. As a reference to the commercial shell and a pun on born-again Christians, the new shell was named Bourne-Again Shell, or bash
.
Pipes and streams: input, output and error
Pipes and streams
Unix programs work with three streams, more specifically with standard in (known as STDIN), standard out (STDOUT) and standard error (STDERR). The first two are easy to understand: a program takes information in through standard in and writes the processed information out through standard out. We can look at this with an example:
> cd ~/IntSciCom/Helsinki/
> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | less
Here, the input file contains temperature measurements for Kaisaniemi, Helsinki, in January 2024. The cat
command writes the temperature data to STDOUT and less
then reads this from STDIN and displays it on the screen. The concept of streams becomes clearer with more complex combinations of programs. We can add tr -d
to delete all the confusing double quotes:
> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d '"' | less
or
> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d \" | less
Note that the double quote has to be written inside single quotes ('"'
) or escaped with a backslash sign (\"
). That is because the double quote character has a special meaning in bash and would otherwise be interpreted as the start of a string.
Now, it is more obvious that tr
takes the data from STDIN and writes it to STDOUT.
We can add three more commands, each reading from STDIN and writing to STDOUT:
> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 | less
Here, the second tr
converts all commas to TAB signs (\t
) and the third one all spaces to underscores; the command tail
shows the end of the file starting from line two.
Showing “the end of the file starting from line two” may sound strange for a command named tail
. However, if one leaves out the plus sign, the command only shows the last two lines. That’s the “default” behaviour of tail
and the exact opposite of the behaviour of the command head
.
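A quick way to see the difference is to run both forms on the same file (output not shown here):
> tail -n+2 Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | head -3
> tail -n2 Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv
The first command prints everything from line two onwards (piped to head to show only the first three lines), the second only the last two lines.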
The individual commands are separated by |
signs, known as pipes. These pipes take the STDOUT from the left side and give it as STDIN on the right side. We can have multiple commands piped together as shown above. If we want to direct the STDOUT to a file, we have to use another symbol, > filename
, for that. Here, we are happy with the result and can direct the modified data to the file Kaisaniemi.tsv
:
> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Kaisaniemi.tsv
We can do the same manipulation for the other files:
> cat Helsinki_Kumpula_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Kumpula.tsv
> cat Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Malmi.tsv
> cat Helsinki_Vuosaari_satama_1.1.2024-31.1.2024.csv | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2 > Vuosaari.tsv
Note that we don’t edit the original files but create new files that contain the changes that we want. As all the changes are made with one combination of bash commands, they would be fairly easy to replicate on new source data, e.g. if we decided to study a longer time period or the temperature data for another month. We could simplify the re-use of these commands by writing them into a file that could then be used like a regular program. We’ll look at that later.
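As a minimal sketch of that idea (the file name clean_csv.sh is made up here), the whole pipeline could be stored in a small script file and run on any of the csv files:
> cat clean_csv.sh
#!/bin/bash
# convert one csv file (given as the first argument) to tab-separated format
cat "$1" | tr -d \" | tr ',' '\t' | tr ' ' '_' | tail -n+2
> bash clean_csv.sh Helsinki_Kumpula_1.1.2024-31.1.2024.csv > Kumpula.tsv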
One can use the opposite of > outfile
, namely < infile
, to read data from a file. The command:
> cat < Helsinki.tsv > cat_Helsinki.tsv
reads the data from Helsinki.tsv
and writes it to cat_Helsinki.tsv
. This is equivalent to just copying the file:
> cp Helsinki.tsv cp_Helsinki.tsv
> wc cat_Helsinki.tsv cp_Helsinki.tsv
2976 23808 160440 cat_Helsinki.tsv
2976 23808 160440 cp_Helsinki.tsv
5952 47616 320880 total
> diff cat_Helsinki.tsv cp_Helsinki.tsv
The empty output of diff means that the two files are identical. In practice, the Unix pipe format with all the data flowing from left to right is simpler to read and therefore strongly recommended.
Stream manipulation with cut
and sort
From the original csv-file we know that the place is in the 1st column and the minimum temperature in the 8th column.
> head -1 Helsinki_Malmi_lentokenttä_1.1.2024-31.1.2024.csv | tr ',' '\n' | grep -n '"'
1:"Havaintoasema"
2:"Vuosi"
3:"Kuukausi"
4:"Päivä"
5:"Aika [Paikallinen aika]"
6:"Lämpötilan keskiarvo [°C]"
7:"Ylin lämpötila [°C]"
8:"Alin lämpötila [°C]"
We can select specific columns with cut
and then sort the data with sort
:
> cat Kaisaniemi.tsv | cut -f1,8 | sort -k2n | less
Here, cut -f1,8
selects the columns 1 and 8, and sort -k2n
sorts the data based on the second column (-k2
) numerically (n
). By default, sorting is done alphabetically; the sorting can be reversed with -r
and all sort parameters can be written together such as sort -k2nr
.
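For example, reversing the numeric sort on the second column puts the highest of the minimum temperatures first (output not shown):
> cat Kaisaniemi.tsv | cut -f1,8 | sort -k2nr | head -1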
Note that we wouldn’t have needed to convert the data to tab-separated format and could also have processed the comma-separated information with spaces in the location names. However, the default field separator for cut is TAB and one has to use the argument -d, to use commas; the default field separator for sort is space and one has to use the argument -t, to use commas. The double quotes have to be removed for sort to recognise the values as numbers. If that is done, we find the minimum temperature for Kaisaniemi using the original data:
> cat Helsinki_Kaisaniemi_1.1.2024-31.1.2024.csv | cut -d, -f1,8 | tr -d \" | sort -t, -k2n | head -1
Sorting can be done using multiple columns. If we combine the four locations and then sort the data, we see that Malmi seems to be the coldest place:
> cat Kaisaniemi.tsv Kumpula.tsv Malmi.tsv Vuosaari.tsv > Helsinki.tsv
> cat Helsinki.tsv | cut -f1,8 | sort -k2n | less
We can confirm this by using uniq
to count how many times each location appears in a row in the sorted list of coldest measurements:
> cat Helsinki.tsv | cut -f1,8 | sort -k2n | cut -f1 | uniq -c | head
50 Helsinki_Malmi_lentokenttä
2 Helsinki_Kumpula
2 Helsinki_Malmi_lentokenttä
1 Helsinki_Kumpula
3 Helsinki_Malmi_lentokenttä
2 Helsinki_Kumpula
1 Helsinki_Malmi_lentokenttä
1 Helsinki_Kumpula
1 Helsinki_Malmi_lentokenttä
1 Helsinki_Kaisaniemi
Based on that, Malmi takes the first 50 places! This is not caused by Malmi having more measurements: each location appears 744 times in the combined file:
> cat Helsinki.tsv | cut -f1 | uniq -c
744 Helsinki_Kaisaniemi
744 Helsinki_Kumpula
744 Helsinki_Malmi_lentokenttä
744 Helsinki_Vuosaari_satama
What happened in the above command? One should be familiar with cat
by now and can peek into the manual (with commands man cut
and man uniq
) to see the meaning of the arguments in the two other commands. If that doesn’t help, one can “break” the command into bits and see what each step does. Normally it’s helpful to inspect the intermediate output using less
:
> cat Helsinki.tsv | less
> cat Helsinki.tsv | cut -f1 | less
After seeing the output of the last command and learning that the argument -c for uniq means “prefix lines by the number of occurrences”, the meaning of the whole command should be clear.
Our combined file Helsinki.tsv
has the problem that the location names in the first column are of different lengths, complicating the reading of the output:
> grep "12:00" Helsinki.tsv | sort -k6,6n | head -n4
Helsinki_Malmi_lentokenttä 2024 1 7 12:00 -23.2 -22.3 -24.7
Helsinki_Kumpula 2024 1 7 12:00 -19.8 -19.1 -20.6
Helsinki_Malmi_lentokenttä 2024 1 4 12:00 -19.3 -19.1 -19.4
Helsinki_Kaisaniemi 2024 1 7 12:00 -19.2 -18.4 -20.2
The command column
helps and can, for example, format the data into nice columns:
> grep "12:00" Helsinki.tsv | sort -k6,6n | head -n4 | column -t
Helsinki_Malmi_lentokenttä 2024 1 7 12:00 -23.2 -22.3 -24.7
Helsinki_Kumpula 2024 1 7 12:00 -19.8 -19.1 -20.6
Helsinki_Malmi_lentokenttä 2024 1 4 12:00 -19.3 -19.1 -19.4
Helsinki_Kaisaniemi 2024 1 7 12:00 -19.2 -18.4 -20.2
column -t
– the argument tells the command to determine the column widths from the data – doesn’t work well for huge files as it has to read through all of the data first.
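If only the beginning of a large file needs to be inspected, one workaround is to format just that part, for example:
> head -100 Helsinki.tsv | column -t | less -S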
Our data files are relatively small and it’s easy to count the number of columns and then access the X
th column with cut -fX
. If the table-formatted file were much wider and we knew that we want to access one of the last columns but didn’t know the exact dimensions of the table, there’s a neat trick to reverse the table, select the column and then reverse it back. Reversing the columns is done with the command rev
and thus the last three columns can be extracted with:
> rev Helsinki.tsv | cut -f1-3 | rev | head -3
-13.9 -13.7 -14
-14.1 -14 -14.3
-14.4 -14.1 -14.6
Check the actual behaviour of rev
with:
> rev Helsinki.tsv | less
Hints of some more advanced tricks
By doing the sorting first alphabetically (default criterion) on column 1 (-k1,1
; we also have to specify that the sorting criterion ends at column 1) and then numerically on column 2 (-k2n
), we get the information for different measurement locations in order:
> cat Helsinki.tsv | cut -f1,8 | sort -k1,1 -k2n | less
To get the minimum temperatures of each location, we need some more advanced tricks that we’ll learn more about later. There are many ways to do it but here are two. Using the combined file, we can make an awk
condition that checks if the value in the first column is different from the previous line. If the data are sorted first by location (-k1,1
) and then numerically by the temperature (-k2n
), this gives the lowest temperature for each location:
> cat Helsinki.tsv | cut -f1,8 | sort -k1,1 -k2n | awk '{if($1!=prev){print $0}prev=$1}'
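The same awk program can be spread over several lines and commented, which may make the logic easier to follow; this is only a rewritten version of the command above:
> cat Helsinki.tsv | cut -f1,8 | sort -k1,1 -k2n | awk '
  # after the sort, the first line of each location holds its lowest temperature
  {
    if ($1 != prev) { print $0 }   # the location changed: print the line
    prev = $1                      # remember the location for the next line
  }'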
Another approach is to loop over the individual files and then pick the minimum temperature from each. This is easier to understand when written out:
> for file in $(ls [KMV]*tsv); do
cat $file | cut -f1,8 | sort -k2n | head -1
done
Here, the first line creates a list of files (you can test it by typing ls [KMV]*tsv in the terminal) and assigns these, one at a time, to the variable $file
; this variable is then used in a very familiar-looking command in the middle; do
and done
set the boundaries of the loop.
Experienced users would write this as a one-liner:
> for file in $(ls [KMV]*tsv); do cat $file | cut -f1,8 | sort -k2n | head -1; done
File comparison with diff
File comparison is a central part of many analyses and a surprisingly common task is to find differences between two files. One could think that all files are different but in data analysis, one often modifies one part of the processing or analysis pipeline and is then interested in understanding how this affects the final output. One would expect the output to be highly similar and one wants to find the lines that differ.
We can create another data file for Kaisaniemi. The details of the command are not important here, the awk
command just forces the number in the 6th column to have one decimal, and grep -v then keeps all lines except those that contain the string 12:00:
> less Kaisaniemi.tsv | awk '{OFS="\t";$6=sprintf("%.1f",$6);print $0}' | grep -v 12:00 > Kaisaniemi_new.tsv
Now, the command diff
outputs the differences:
> diff Kaisaniemi.tsv Kaisaniemi_new.tsv | head -9
12,13c12
< Helsinki_Kaisaniemi 2024 1 1 11:00 -16 -15.9 -16.1
< Helsinki_Kaisaniemi 2024 1 1 12:00 -15.9 -15.8 -16.1
---
> Helsinki_Kaisaniemi 2024 1 1 11:00 -16.0 -15.9 -16.1
19c18
< Helsinki_Kaisaniemi 2024 1 1 18:00 -17 -16.9 -17.2
---
> Helsinki_Kaisaniemi 2024 1 1 18:00 -17.0 -16.9 -17.2
The format may look cryptic as it was designed for other computer programs to read. Basically, each difference starts with its location in the files: 12,13c12 means that below are shown lines 12-13 from the first file and line 12 from the second file; then come the lines from the left file (<), a separator (---) and the lines from the right file (>). We can indeed see that the differences are the additional digit in the 6th column and that the observation for time 12:00 is completely missing.
There are arguments for diff
that make the comparison more readable. diff -c1
adds one line of context and marks the differing lines with exclamation marks:
> diff -c1 Kaisaniemi.tsv Kaisaniemi_new.tsv | head -12
*** Kaisaniemi.tsv 2024-02-08 12:16:36.683998118 +0200
--- Kaisaniemi_new.tsv 2024-02-08 15:04:48.387346228 +0200
***************
*** 11,14 ****
Helsinki_Kaisaniemi 2024 1 1 10:00 -15.9 -15.7 -16
! Helsinki_Kaisaniemi 2024 1 1 11:00 -16 -15.9 -16.1
! Helsinki_Kaisaniemi 2024 1 1 12:00 -15.9 -15.8 -16.1
Helsinki_Kaisaniemi 2024 1 1 13:00 -15.7 -15.7 -15.8
--- 11,13 ----
Helsinki_Kaisaniemi 2024 1 1 10:00 -15.9 -15.7 -16
! Helsinki_Kaisaniemi 2024 1 1 11:00 -16.0 -15.9 -16.1
Helsinki_Kaisaniemi 2024 1 1 13:00 -15.7 -15.7 -15.8
diff -u2
adds two lines of context and marks the differences (from the left file to the right file) with minus and plus signs:
> diff -u2 Kaisaniemi.tsv Kaisaniemi_new.tsv | head -10
--- Kaisaniemi.tsv 2024-02-08 12:16:36.683998118 +0200
+++ Kaisaniemi_new.tsv 2024-02-08 15:04:48.387346228 +0200
@@ -10,6 +10,5 @@
Helsinki_Kaisaniemi 2024 1 1 09:00 -15.6 -15.4 -15.8
Helsinki_Kaisaniemi 2024 1 1 10:00 -15.9 -15.7 -16
-Helsinki_Kaisaniemi 2024 1 1 11:00 -16 -15.9 -16.1
-Helsinki_Kaisaniemi 2024 1 1 12:00 -15.9 -15.8 -16.1
+Helsinki_Kaisaniemi 2024 1 1 11:00 -16.0 -15.9 -16.1
Helsinki_Kaisaniemi 2024 1 1 13:00 -15.7 -15.7 -15.8
Helsinki_Kaisaniemi 2024 1 1 14:00 -15.7 -15.7 -15.8
Sometimes command diff -y file1 file2
can be useful but we’ll return to that below in the context of files and streams.
Stream redirection
STDERR
Above, we mentioned STDERR but never really explained it. The standard error is a separate stream where the programs output information when something goes wrong. We’ve seen complex chains of commands and it’s easy to see situations where the error notification by a program would be lost in the pipeline of multiple pipes and steps; in the worst-case scenario, the pipeline would have failed badly but, by only seeing the final output, the user would not be aware of the errors and would believe that everything is fine.
STDERR allows programs to write the error notification to a different place from the actual processing output. A challenge is that the default output of both is the terminal. We can look at that with an example. This simple loop counts the number of hours when the mean temperature was exactly 0°C in different locations in Helsinki. To do that, it places the names of five locations, one at a time, into the variable $place
, reads the file $place.tsv
, cuts the 1st and 6th column (the location and the mean temperature for the hour), greps lines that contain "\t0$"
(TAB followed by 0 followed by end-of-line) and counts the number of such observations:
> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
cat $place.tsv | cut -f1,6 | grep -P "\t0$"| uniq -c;
done
Because we do not have temperature measurements for Viikki, the output includes one error message:
2 Helsinki_Kaisaniemi 0
8 Helsinki_Kumpula 0
8 Helsinki_Malmi_lentokenttä 0
cat: Viikki.tsv: No such file or directory
7 Helsinki_Vuosaari_satama 0
If we write the output to a file:
> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
cat $place.tsv | cut -f1,6 | grep -P "\t0$"| uniq -c;
done > measurement_counts.out
and then read that file:
> cat measurement_counts.out
2 Helsinki_Kaisaniemi 0
8 Helsinki_Kumpula 0
8 Helsinki_Malmi_lentokenttä 0
7 Helsinki_Vuosaari_satama 0
we wouldn’t notice anything suspicious (unless we counted the lines and noticed that one is missing).
A safer option is to output STDERR to another file with 2> filename
:
> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
cat $place.tsv | cut -f1,6 | grep -P "\t0$"| uniq -c;
done > measurement_counts.out 2> measurement_counts.err
Now, the error file contains the error information:
> cat measurement_counts.err
cat: Viikki.tsv: No such file or directory
One can redirect STDERR to STDOUT. One way to write both streams to the same output file is to use &> outfile
as in:
> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
cat $place.tsv | cut -f1,6 | grep -P "\t0$"| uniq -c;
done &> measurement_counts.out
Sometimes a poorly-designed program writes tons of stuff either to STDOUT or STDERR. These can be directed to a file which is then deleted, but there’s a way to delete the output immediately by directing it to a file called /dev/null
. This is Unix systems’ black hole where things go and never come back:
> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
cat $place.tsv | cut -f1,6 | grep -P "\t0$"| uniq -c;
done > measurement_counts.out 2> /dev/null
One should be careful with that approach, however, as it may hide serious errors in the analysis pipeline.
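The streams can also be separated the other way around: throwing the normal output away and keeping the errors on the screen is a quick way to check whether a pipeline produces any errors at all:
> for place in Kaisaniemi Kumpula Malmi Viikki Vuosaari; do
cat $place.tsv | cut -f1,6 | grep -P "\t0$"| uniq -c;
done > /dev/null
cat: Viikki.tsv: No such file or directory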
Appending files and Heredoc
The arrow symbol >
creates a file and writes to it, or deletes the contents of an existing file and then writes to it. Appending to a file, i.e. writing new text at the end of an existing file, can be done with double arrows >> file.out
and 2>> file.err
. We can modify the previous commands and write to the output file inside the loop; as we do so multiple times we have to append to the existing file:
> > measurement_counts.out
> for place in Kaisaniemi Kumpula Malmi Vuosaari; do
cat $place.tsv | cut -f1,6 | grep -P "\t0$"| uniq -c >> measurement_counts.out
done
Note that >> file
can create a file if it doesn’t exist. Here, the command > measurement_counts.out
in the beginning is used to empty the file. If it were not there, the commands could append the results at the end of some erroneous text.
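The difference between the two arrows is easy to see with a small throw-away file (test.txt is just an example name):
> echo "first line" > test.txt
> echo "second line" >> test.txt
> cat test.txt
first line
second line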
As > measurement_counts.out
empties a file, one can also do cat > file
without any STDIN to direct there: in that case, the keyboard becomes the STDIN until Control-d is pressed. Thus I could do:
> cat > notes.txt
remember to remove viikki from the script
and end this by pressing Ctrl and d simultaneously. The text is now in that file:
> cat notes.txt
remember to remove viikki from the script
A variant of the same approach is the so-called Heredoc. In scientific analyses, Heredoc allows showing what is written into a file and thus allows others to replicate the same steps. The special feature of Heredoc is the double backward arrow << followed by an end marker, often EOF (end of file), as in << EOF. Then, we could write in our documentation:
> cat > locations.txt << EOF
Kaisaniemi
Kumpula
Malmi
Vuosaari
EOF
After doing that, the information is now in the file:
> cat locations.txt
Kaisaniemi
Kumpula
Malmi
Vuosaari
The power of this approach is that we can use variables that are then translated to their values. A hypothetical argument file for a program could be created like this:
> place=Kumpula
> cat > args.txt << EOF
input: $place.tsv
output: $place.out
error: $place.err
EOF
> cat args.txt
input: Kumpula.tsv
output: Kumpula.out
error: Kumpula.err
Above, we could write $place.tsv
as the dot separates the variable name from the rest of the text. However, this would not work as intended:
> echo $place_2024.tsv
.tsv
The reason for this is that now the whole place_2024
is considered the variable name and, as it is not defined, nothing is outputted. We can get around that by placing the variable name in curly brackets:
> echo ${place}_2024.tsv
Kumpula_2024.tsv
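A related detail worth knowing: if the end marker is written in quotes, the variables are not expanded but written out literally. A small sketch (args_template.txt is a made-up name):
> cat > args_template.txt << 'EOF'
input: $place.tsv
output: $place.out
EOF
> cat args_template.txt
input: $place.tsv
output: $place.out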
As another example, the Heredoc below writes the beginning of the human BRCA2 gene coding sequence into a file.
> cat > BRCA2-201_start.txt << EOF
ATGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAAC
AAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCA
CCCTATAATTCTGAACCTGCAGAAGAATCTGAACATAAAAACAACAATTACGAACCAAAC
CTATTTAAAACTCCACAAAGGAAACCATCTTATAATCAGCTGGCTTCAACTCCAATAATA
TTCAAAGAGCAAGGGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAA
TTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAGTCTTCGCACAGTG
AAAACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGT
GAAAGTCCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCAGTGGTA
TGTGGGAGTTTGTTTCATACACCAAAGTTTGTGAAGGGTCGTCAGACACCAAAACATATT
EOF
There are many command-line tools for the manipulation of sequence files, including reverse-complementing the sequence. However, the same can be done with plain bash tools.
Reverse-complementing requires that the file is read backwards, starting from the last line and from the last character of each row. Then each base is converted to its Watson–Crick pair, A->T, T->A etc. cat
reads the text from top to bottom so what could be the command to do the opposite? Surprisingly, it is called tac
. The command for reversing the lines of text is more boring and simply called rev
. Finally, we need tr
to translate characters to something else. Putting it all together, the bash command could be this:
> cat BRCA2-201_start.txt | tac | rev | tr ACGT TGCA
AATATGTTTTGGTGTCTGACGACCCTTCACAAACTTTGGTGTATGAAACAAACTCCCACA
TACCACTGACTTATCTCTTTGTGGTGTTACATGTGTACATTGTAGAACAACAGGACTTTC
ACTAAGACAAGAATTTAGAAGTGGACAGGAAACATCATCTGCTTGATCCATTTTAGTTTT
CACTGTGCGAAGACTTTTATGTCTACTATTGGGAACATTCCTTCCTAAGTCTAATTTGAA
TTTATCTAATTCTTTTACAGGAGATTGGTACAGCGGCAGAGTCAGCCCTTGCTCTTTGAA
TATTATTGGAGTTGAAGCCAGCTGATTATAAGATGGTTTCCTTTGTGGAGTTTTAAATAG
GTTTGGTTCGTAATTGTTGTTTTTATGTTCAGATTCTTCTGCAGGTTCAGAATTATAGGG
TGGAGCTTCTGAAGAAAGTTCTTCAAACCAATTAAGACTTATTGGTCCTAAATCTGCTTT
GTTGCAGCGTGTCTTAAAAATTTCAAAAAATGTTGGCCTCTCTTTGGATCCAATAGGCAT
To check that the conversion went correctly, we can convert it back to the original forward strand:
> cat BRCA2-201_start.txt | tac | rev | tr ACGT TGCA > BRCA2-201_start.rev
> cat BRCA2-201_start.rev | tac | rev | tr ACGT TGCA
ATGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAAC
AAAGCAGATTTAGGACCAATAAGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCA
CCCTATAATTCTGAACCTGCAGAAGAATCTGAACATAAAAACAACAATTACGAACCAAAC
CTATTTAAAACTCCACAAAGGAAACCATCTTATAATCAGCTGGCTTCAACTCCAATAATA
TTCAAAGAGCAAGGGCTGACTCTGCCGCTGTACCAATCTCCTGTAAAAGAATTAGATAAA
TTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGTAGACATAAAAGTCTTCGCACAGTG
AAAACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGT
GAAAGTCCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCAGTGGTA
TGTGGGAGTTTGTTTCATACACCAAAGTTTGTGAAGGGTCGTCAGACACCAAAACATATT
Although many things can be done with bash, it’s not always meaningful to do so. There are many sophisticated command-line tools to manipulate DNA data and, if working in the field, one should learn to use them efficiently.
File vs. stream
Command diff
has an option to show the two files next to each other with the differences annotated. For wide files that is clumsy, however. In our earlier example of two Kaisaniemi data files, we could have picked the important columns into new files and compared those next to each other. We could have done it e.g. like this:
> cut -f1,4,5,6 Kaisaniemi.tsv > old.txt
> cut -f1,4,5,6 Kaisaniemi_new.tsv > new.txt
> diff -y old.txt new.txt | head -24
Helsinki_Kaisaniemi 1 00:00 -13.9 Helsinki_Kaisaniemi 1 00:00 -13.9
Helsinki_Kaisaniemi 1 01:00 -14.1 Helsinki_Kaisaniemi 1 01:00 -14.1
Helsinki_Kaisaniemi 1 02:00 -14.4 Helsinki_Kaisaniemi 1 02:00 -14.4
Helsinki_Kaisaniemi 1 03:00 -14.7 Helsinki_Kaisaniemi 1 03:00 -14.7
Helsinki_Kaisaniemi 1 04:00 -14.6 Helsinki_Kaisaniemi 1 04:00 -14.6
Helsinki_Kaisaniemi 1 05:00 -14.3 Helsinki_Kaisaniemi 1 05:00 -14.3
Helsinki_Kaisaniemi 1 06:00 -14.2 Helsinki_Kaisaniemi 1 06:00 -14.2
Helsinki_Kaisaniemi 1 07:00 -14.5 Helsinki_Kaisaniemi 1 07:00 -14.5
Helsinki_Kaisaniemi 1 08:00 -15.2 Helsinki_Kaisaniemi 1 08:00 -15.2
Helsinki_Kaisaniemi 1 09:00 -15.6 Helsinki_Kaisaniemi 1 09:00 -15.6
Helsinki_Kaisaniemi 1 10:00 -15.9 Helsinki_Kaisaniemi 1 10:00 -15.9
Helsinki_Kaisaniemi 1 11:00 -16 | Helsinki_Kaisaniemi 1 11:00 -16.0
Helsinki_Kaisaniemi 1 12:00 -15.9 <
Helsinki_Kaisaniemi 1 13:00 -15.7 Helsinki_Kaisaniemi 1 13:00 -15.7
Helsinki_Kaisaniemi 1 14:00 -15.7 Helsinki_Kaisaniemi 1 14:00 -15.7
Helsinki_Kaisaniemi 1 15:00 -15.9 Helsinki_Kaisaniemi 1 15:00 -15.9
Helsinki_Kaisaniemi 1 16:00 -16.3 Helsinki_Kaisaniemi 1 16:00 -16.3
Helsinki_Kaisaniemi 1 17:00 -16.9 Helsinki_Kaisaniemi 1 17:00 -16.9
Helsinki_Kaisaniemi 1 18:00 -17 | Helsinki_Kaisaniemi 1 18:00 -17.0
Helsinki_Kaisaniemi 1 19:00 -17.3 Helsinki_Kaisaniemi 1 19:00 -17.3
Helsinki_Kaisaniemi 1 20:00 -17 | Helsinki_Kaisaniemi 1 20:00 -17.0
Helsinki_Kaisaniemi 1 21:00 -17 | Helsinki_Kaisaniemi 1 21:00 -17.0
Helsinki_Kaisaniemi 1 22:00 -16.8 Helsinki_Kaisaniemi 1 22:00 -16.8
Helsinki_Kaisaniemi 1 23:00 -16.5 Helsinki_Kaisaniemi 1 23:00 -16.5
The downside of this is that it creates temporary files that we have to recreate every time we want to make the diff and that slowly clutter the analysis directory. Wouldn’t it be great to have temporary files that would be created on the fly and that would disappear immediately after the command has been executed? Such “files” do exist and are created when the command is wrapped inside <( )
. diff
compares two files and expects the format diff [args] file1 file2
. We can do that with the command:
> diff -y <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kaisaniemi_new.tsv) | less -S
According to Linux documentation (man fifo
), the <( )
above is called FIFO, the name coming from First In, First Out, a type of data structure on computers. More specifically, a FIFO is a named pipe that behaves like a file: “A FIFO special file (a named pipe) is similar to a pipe, except that it is accessed as part of the filesystem.”
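A named pipe can also be created explicitly with the command mkfifo; the small sketch below (mypipe is a made-up name) shows that it is indeed accessed like a file:
> mkfifo mypipe                          # create the named pipe
> cut -f1,6 Kaisaniemi.tsv > mypipe &    # write to it in the background
> head -3 mypipe                         # read from it as if it were a file
> rm mypipe                              # remove it like a file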
Unfortunately, FIFOs don’t work with all programs and sometimes temporary files are needed.
File comparison with paste
With the same trick, we can select parts of large tables and show them next to each other with paste
. This simple command reads the given files and displays them next to each other:
> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | head
Helsinki_Kaisaniemi 1 00:00 -13.9 Helsinki_Kumpula 1 00:00 -15
Helsinki_Kaisaniemi 1 01:00 -14.1 Helsinki_Kumpula 1 01:00 -14.9
Helsinki_Kaisaniemi 1 02:00 -14.4 Helsinki_Kumpula 1 02:00 -15.4
Helsinki_Kaisaniemi 1 03:00 -14.7 Helsinki_Kumpula 1 03:00 -15.8
Helsinki_Kaisaniemi 1 04:00 -14.6 Helsinki_Kumpula 1 04:00 -16.1
Helsinki_Kaisaniemi 1 05:00 -14.3 Helsinki_Kumpula 1 05:00 -15.8
Helsinki_Kaisaniemi 1 06:00 -14.2 Helsinki_Kumpula 1 06:00 -15.4
Helsinki_Kaisaniemi 1 07:00 -14.5 Helsinki_Kumpula 1 07:00 -15.6
Helsinki_Kaisaniemi 1 08:00 -15.2 Helsinki_Kumpula 1 08:00 -16.1
Helsinki_Kaisaniemi 1 09:00 -15.6 Helsinki_Kumpula 1 09:00 -16.2
This becomes useful when we combine it with other commands, especially awk
:
> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | awk '$4<$8' | wc -l
35
> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | awk '$4==$8' | wc -l
30
> paste <(cut -f1,4,5,6 Kaisaniemi.tsv) <(cut -f1,4,5,6 Kumpula.tsv) | awk '$4>$8' | wc -l
679
Here, we use awk to print out the lines where the condition is true, i.e. column 4 (temperature in Kaisaniemi) is smaller (awk '$4<$8'
), equal (awk '$4==$8'
) or greater (awk '$4>$8'
) than column 8 (temperature in Kumpula). According to this quick analysis, Kumpula tends to be a colder place than Kaisaniemi.
The Unix pipes are such a good invention that other programming languages have adopted them. One of these is R and the latest versions now provide a native forward pipe syntax |>
These are the “classical” and the “Unix way” of doing the same computation:
> vals <- c(1,2)
> mean(vals)
[1] 1.5
> vals |> mean()
[1] 1.5
The R library dplyr
provides its own version of pipes that uses %>%
as the pipe symbol (but |>
is also accepted):
> library(dplyr)
> vals %>% mean()
[1] 1.5
More information about the dplyr
package (which is highly recommended for anyone using R) and its pipes can be found in the package documentation.
Note that the pipe symbol in R must be different from bash as the |
character is already reserved (confusingly) for the OR
operator.
Streams and pipes are the core of the Unix philosophy and one should spend time learning the necessary skills. For a novice not knowing the bash commands and tools, it may be difficult to construct the required functionalities. A good principle is that nearly anything can be done using the command line, and for most things useful examples are available in the net.