Strings and numbers
After this chapter, students can perform basic mathematical operations on integers using bash commands, understand why computer systems, built on binary numbers, struggle with decimal numbers, and know how to avoid the potential problems this causes. They can also perform basic operations on strings, laying the foundations for later material.
This section introduces two fundamental concepts of bash, namely strings and numbers, and we also touch on the concept of variables and the command substitution that will become more familiar later. The main focus of this course is on the manipulation of data files and the development of automated pipelines for their analysis. In those, the manipulation of strings and calculation with numbers aren’t central, though it is good to be aware of the concepts in case such a need arises.
The main point of this section comes in the latter part about numbers: computers aren’t always very good at calculating with them!
Strings
In computer programming, a “string” is traditionally a sequence of characters. A string can be a constant that does not change, or it can be stored in a variable so that it can be altered. For example, here “Matti” is a constant string that is printed on the terminal screen with the command echo:
> echo "Matti"
Matti
(echo does what its name says: it “echoes” all its arguments.)
On the other hand, here “Matti” is the value of the variable $name and the string can be altered:
> name="Matti"
> echo $name
Matti
> name=${name/tti/ija}
> echo $name
Maija
The bash programming language provides some basic string operations. Often the same result can be achieved with a combination of (multiple) bash commands and there’s no single correct way of doing things.
Concatenating two strings
Strings can be concatenated by writing them one after the other. Here, we combine the values of the variables $forename and $surname and assign the result to the variable $name:
> forename="Matti"
> surname="Meikäläinen"
> name=$forename$surname
> echo $name
MattiMeikäläinen
White space is also a string and can be joined to other strings, and variables are evaluated (replaced by their values) within double quotes. Thus, the first two lines below are equivalent:
> name=$forename" "$surname
> name="$forename $surname"
> echo $name
Matti Meikäläinen
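As a minimal sketch of the quoting behaviour described above, the lines below compare double quotes with single quotes, and also show the += operator, which appends to the current value of a variable. The variable names (double, single, appended) are chosen only for illustration.

```bash
forename="Matti"
surname="Virtanen"

# Double quotes: variables are replaced by their values.
double="$forename $surname"
# Single quotes: the text is kept literally, nothing is evaluated.
single='$forename $surname'
# The += operator appends to the current value of a variable.
appended=$forename
appended+=" "
appended+=$surname

echo "$double"     # Matti Virtanen
echo "$single"     # $forename $surname
echo "$appended"   # Matti Virtanen
```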
Sometimes it is necessary to wrap the variable name in curly brackets so that the variable name stays separate from the surrounding text:
> name="Matti"
> echo "${name}la"
Mattila
One could always write ${name} instead of $name, but not necessarily $name instead of ${name} (one such case is shown above). However, writing the non-compulsory curly brackets is time-consuming and can make the commands more difficult to read, so in this course material they are often left out.
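To see what goes wrong when the curly brackets are missing, consider the small sketch below. It assumes that no variable called namela has been set in the current shell; the variable names without and with are only illustrative.

```bash
name="Matti"

# Without curly brackets bash looks for a variable called "namela",
# which has not been set, so the result is an empty string.
without="$namela"
# With curly brackets the variable name ends at the closing bracket.
with="${name}la"

echo "without brackets: '$without'"   # without brackets: ''
echo "with brackets: '$with'"         # with brackets: 'Mattila'
```

Bash does not warn about the unset variable by default, which makes this kind of mistake easy to miss.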
The text may be a mixture of British and American spelling but in punctuation marks, the aim is to use British names.
( ) = brackets
[ ] = square brackets
{ } = curly brackets
Extracting substrings
Many programming languages have a specific command for extracting a part of a string. In bash, a substring is obtained by adding the start position and, optionally, the length as numbers: either ${<variable name>:<start pos>:<length>} or ${<variable name>:<start pos>}. (The latter gets everything until the end of the string.) Note that bash counts characters from zero!
> echo ${surname:0:5}" "${forename}${surname:8}
Meikä Mattinen
If the command above looks incomprehensible, one can split it into parts and first run echo ${surname:0:5} and then echo ${surname:8}.
The length of a string is obtained with the hash sign:
> length=${#surname}
> echo $length
11
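The same operations are handy with file names. The sketch below uses a made-up file name, sample_042.fastq, and illustrative variable names; it shows the start position and length in use and how the start position can be calculated from the length of the string.

```bash
file="sample_042.fastq"

prefix=${file:0:6}    # 6 characters starting at position 0
number=${file:7:3}    # 3 characters starting at position 7
length=${#file}       # number of characters in the string
# The start position can be calculated from the length:
extension=${file:$((length-5))}   # the last 5 characters

echo "$prefix $number $extension $length"   # sample 042 fastq 16
```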
Replacing and deleting substrings
Above, the colon was used to specify the substring positions. In replacement and deletion, the separator is the forward slash. Two slashes separate the target and the replacement as in ${<variable name>/<target>/<replacement>}:
> surname="Meikäläinen"
> echo ${surname/läinen/mies}
Meikämies
If the target is deleted, the second slash can be left out (though it does no harm to write it):
> echo ${surname/läi}
Meikänen
The replacement or deletion is performed globally (as many times as it can be done) by writing the first slash twice:
> echo ${surname//i/}
Mekälänen
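Replacement and deletion are often used for renaming files. The following sketch applies the forms shown above to a hypothetical file name; all variable names are illustrative.

```bash
file="my_sample_data.txt"

renamed=${file/.txt/.csv}   # replace the first match of ".txt"
first=${file/_/-}           # replace only the first underscore
all=${file//_/-}            # replace every underscore
stem=${file/.txt}           # delete ".txt"

echo "$renamed"   # my_sample_data.csv
echo "$first"     # my-sample_data.txt
echo "$all"       # my-sample-data.txt
echo "$stem"      # my_sample_data
```

Note that none of these operations changes $file itself; the result has to be assigned to a variable (or printed) to be used.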
For more complex pattern search and replacement, there is a whole “editor language”, known as sed. We’ll have a whole section about sed and awk later.
Upper and lower-case
The first character of a string can be converted to upper or lower case with ^ and , (a comma), respectively. If the character is given twice, the conversion is done for all characters. So to convert to upper case, ^ is used:
> name="matti"
> echo ${name^}
Matti
> echo ${name^^}
MATTI
and to convert to lower case, , (a comma) is used:
> name="MATTI"
> echo ${name,,}
matti
The target characters can be specified by listing them after the operator in square brackets:
> name="matti"
> echo ${name^^[at]}
mATTi
There are numerous other ways to do the conversion to upper or lower case and we’ll learn a few of those later.
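The operators can also be combined, for example to turn an all-capitals name into a capitalised one. This is a small sketch that assumes bash version 4 or newer (older versions do not know these operators); the variable names are illustrative.

```bash
name="MATTI"

first_lower=${name,}    # only the first character to lower case
lower=${name,,}         # all characters to lower case
capitalised=${lower^}   # then the first character back to upper case

echo "$first_lower $lower $capitalised"   # mATTI matti Matti
```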
Numbers
Most computer systems are based on binary numbers that consist of only two types of digits, ones and zeros. Although this fact is hidden deep inside the system and one can use computers without ever seeing binary numbers, it is good to be aware of the limitations set by this technical detail.
ASCII
The early computers used binary numbers consisting of eight bits, capable of representing the numbers 0-255. As everything inside the computer had to be stored as binary numbers, a standard was defined to encode the printable characters (letters, digits, punctuation etc.) and the control characters as specific numbers. This long-standing standard is known as ASCII (see Wikipedia) and contains only 128 characters (requiring seven bits). This set of characters does not include umlauts and other non-English letters. Although ASCII has been replaced by standards (especially Unicode and its UTF-8 encoding, see Wikipedia) capable of representing over a million different characters, some computer systems and programs still expect the ASCII character set (e.g. the UH usernames only contain ASCII characters). Although the ASCII set is not about numbers per se, its constraints are defined by the underlying number system, and the ASCII system explains many of the limitations of command-line programs.
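The link between characters and numbers can be explored directly in bash. The sketch below relies on two features of bash: printf prints the numeric code of a character when the argument starts with a quote, and $(( )) accepts binary numbers written as 2#<digits>. The variable names are illustrative.

```bash
# A leading quote in the argument makes printf report the numeric code
# of the character that follows it.
code_upper=$(printf '%d' "'A")
code_lower=$(printf '%d' "'a")
# The reverse direction: \x41 is 65 written as a hexadecimal number.
letter=$(printf '\x41')
# The largest numbers that fit in seven and eight bits.
seven_bits=$(( 2#1111111 ))
eight_bits=$(( 2#11111111 ))

echo "$code_upper $code_lower $letter $seven_bits $eight_bits"   # 65 97 A 127 255
```

Seven bits thus give the codes 0-127, that is, the 128 characters of ASCII.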
Integers and floating point numbers
Modern computers use many more bits to code numbers and do perfectly fine with very large positive and negative integers (i.e. “whole” numbers). However, the binary system is not ideally suited for representing decimal numbers and computers can make trivial-looking errors in calculations involving decimals. This can be seen e.g. when summing 0.1 and 0.2 using Python:
> python3 -c "print(0.1+0.2)"
0.30000000000000004
Such errors have no practical significance if the result is rounded to a precision of a few digits, but they may cause errors if not considered e.g. in comparisons. An inexperienced programmer could write pseudo-code like this:
x=0.1
y=0.2
if(x+y==0.3)
    print "True: x+y is 0.3"
else
    print "False: x+y is not 0.3"
and would then be surprised that the computer keeps making errors. There are ways to get around this problem, but the main point is that computers are prone to make errors with decimal numbers – or floating point numbers as they are typically called.
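One way around the problem, sketched here under the assumption that the values have a fixed number of decimal places, is to store them as integers in a smaller unit (tenths instead of ones, or cents instead of euros) so that no decimals are involved in the comparison. The (( )) construct used below evaluates an integer expression as a condition; the variable names are illustrative.

```bash
# Store the values as integers in a smaller unit (here: tenths).
x=1    # stands for 0.1
y=2    # stands for 0.2

# Integer arithmetic is exact, so the comparison behaves as expected.
if (( x + y == 3 )); then
    result="True: x+y is 0.3"
else
    result="False: x+y is not 0.3"
fi

echo "$result"    # True: x+y is 0.3
```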
In bash, the calculation is done inside double brackets preceded by a dollar sign, $(( )). The bash language understands the basic operators (+, -, *, /, **, %), but it can only handle integers and discards the decimal part of a result:
> echo $((3*2))
6
> echo $((3/2))
1
Sometimes one can get around this limitation by using percentages, i.e. multiplying first by 100 and only then dividing the number:
> for i in {1..10}; do
echo "$(($i*100/10))% of analysis done"
done
10% of analysis done
20% of analysis done
30% of analysis done
40% of analysis done
50% of analysis done
60% of analysis done
70% of analysis done
80% of analysis done
90% of analysis done
100% of analysis done
As the decimals are cut off, the alternative using fractions is not very informative:
> for i in {1..10}; do
echo "$(($i/10)) of analysis done"
done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
0 of analysis done
1 of analysis done
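The same idea of multiplying first can be taken a little further. The sketch below, which assumes positive integers and uses illustrative variable names, shows truncation, rounding to the nearest integer by adding half of the divisor before dividing, the remainder operator, and a percentage with one decimal place built from two integer parts.

```bash
done_files=2
total=3

# Plain integer division truncates: 200/3 gives 66.
truncated=$(( done_files * 100 / total ))
# Adding half of the divisor before dividing rounds to the nearest integer.
rounded=$(( (done_files * 100 + total / 2) / total ))
# The % operator gives the remainder of the division.
remainder=$(( (done_files * 100) % total ))
# One decimal place: calculate in tenths of a percent and split the result.
tenths=$(( done_files * 1000 / total ))
percent="$(( tenths / 10 )).$(( tenths % 10 ))"

echo "$truncated $rounded $remainder"   # 66 67 2
echo "$percent% of analysis done"       # 66.6% of analysis done
```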
The calculation can be done with constant numbers or numbers stored in variables:
> for i in {1..10}; do
echo "$i**$i is $(($i**$i))";
done
1**1 is 1
2**2 is 4
3**3 is 27
4**4 is 256
5**5 is 3125
6**6 is 46656
7**7 is 823543
8**8 is 16777216
9**9 is 387420489
10**10 is 10000000000
Decimal numbers
When working on the command line, one of the easiest ways to do calculations “on the spot” is to start the Python shell (or R shell) and do them there. This can be done with the command python or python3, and the shell is closed with Ctrl+d. Another option is the calculator language bc but that, by default, has the precision set to zero decimal places and needs more typing than Python for the same task. Nevertheless, bc is available on most Linux systems and is often used to provide floating point calculation in bash scripts.
The trick of using bc is to create a text (written to STDOUT; read by bc from STDIN) that resembles the commands one would type manually in the bc interface. In bc, “scale=1” sets the output to have a precision of one decimal place. If we have the command:
> echo "scale=1; 26/3" | bc
8.6
we can get the same output by starting bc and then typing the commands scale=1 and 26/3 there. One can quit by typing “quit” or by pressing Ctrl+d.
With that, one can make meaningful calculations with decimal numbers within bash scripts:
> C=26
> F=$(echo "scale=2; $C * (9/5) + 32" | bc -l)
> echo "$C degrees Celsius is equal to $F degrees Fahrenheit."
26 degrees Celsius is equal to 78.80 degrees Fahrenheit.
Using the command scale=x, one can set the precision of the bc output. Most programming languages have a printf command that allows for formatting the value that is printed. The details of the command can be easily found elsewhere (starting from Wikipedia) and only a simple example of rounding a float is considered here. Below, %.3f indicates that the value is a “float” and should be printed with three (3) decimal places after the dot (.):
> num=0.99999996779
> val=$(printf "%.3f" $num)
> echo $val
1.000
One can have multiple values of different types as arguments, and the external content (coming from the arguments) can be mixed with constant text and control characters. Note that printf doesn’t write a newline unless specified (\n below):
> printf "The values are: %.3f and %.1f \n" 0.12476 0.12476
The values are: 0.125 and 0.1
Unlike the bash operations that truncate the numbers, printf rounds to the nearest integer:
> printf "%.0f \n" 1.6
2
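The formatting abilities of printf are also useful for building file names and report lines. The sketch below uses the -v option of the bash printf, which stores the formatted text in a variable instead of printing it; the file names, the sample name and the numbers are made up for illustration.

```bash
# %03d pads an integer with zeros to a width of three digits.
for i in 7 42; do
    printf -v file "sample_%03d.txt" "$i"
    echo "$file"
done
# sample_007.txt
# sample_042.txt

# Several arguments of different types in one format string:
# %s is a string, %d an integer and %5.1f a float with a total width
# of five characters and one decimal place; %% prints a percent sign.
printf -v line "%s has %d reads (%5.1f%%)" "sampleA" 1500 12.5
echo "$line"   # sampleA has 1500 reads ( 12.5%)
```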
If needed, bash provides some tools for the manipulation of strings. Bash can do basic calculations on integers, and bc can be used to calculate with decimal numbers. Computers can’t represent all decimal numbers accurately and are prone to make errors in certain calculations.