Bash Programming

Pages:

  1. Basic concepts

  2. Tool selection

  3. Working with files

  4. Collating with associative arrays

  5. Avoiding code injection

  6. Data structures

Examples:

  1. Example 1: Modifying a config file

  2. Example 2: Unraveling the X-Y problem

  3. Example 3: Grouping integers into ranges

  4. Example 4: Searching for backronyms

Basic concepts

This document is intended for programmers who are trying to get things done in bash. It assumes a familiarity with basic programming concepts (variables, loops, functions, arrays), and with the fundamental bash syntax. If you're not familiar with bash syntax, see BashGuide for an introduction.

BashPitfalls is recommended companion reading. Where BashPitfalls tells you what not to do, this document will try to tell you what you should do.

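This first page covers some basic concepts: scripts and the shebang line, commands and quoting, variables, arrays as lists, loops, arithmetic, and functions.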

Script

A script on a Unix system must use Unix newline characters, not Microsoft's carriage return + newline pairs. It should also have execute permission (chmod +x) and read permission (chmod +r), as the interpreter must literally open and read the content of the script.
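
A minimal sketch of checking a script for carriage returns and fixing its permissions. The file contents and the variable names (script, had_cr, output) are made up for illustration; only standard tools (grep, tr, mv, chmod) are used.

```shell
#!/usr/bin/env bash
# Build a throwaway script with Microsoft-style CRLF line endings.
script=$(mktemp) || exit 1
trap 'rm -f "$script" "$script.fixed"' EXIT
printf '#!/bin/bash\r\necho hello\r\n' > "$script"

# Detect carriage returns, then strip them into a new file and
# replace the original with the cleaned copy.
had_cr=no
if grep -q $'\r' "$script"; then
  had_cr=yes
  tr -d '\r' < "$script" > "$script.fixed" && mv "$script.fixed" "$script"
fi

# Give the script read and execute permission.
chmod +rx "$script"

# Run it through bash explicitly, so the sketch also works where the
# temporary directory happens to be mounted noexec.
output=$(bash "$script")
echo "carriage returns found: $had_cr, output: $output"
```

Without the tr step, the echo command would receive "hello" followed by a stray carriage return as its argument.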

Shebang

Every script must start with a shebang line (#!/something). This tells the kernel which interpreter to run in order to read your script. This is not optional.

If your script uses bash features, use a bash shebang -- either #!/bin/bash if you're writing for Linux systems only, or #!/usr/bin/env bash if you're writing for multiple Unixes. Or have your installer detect the location of bash and write the correct shebang line at script installation time. Or tell the user to edit the script in your README file.

If you use #!/bin/sh then your script must conform to either POSIX sh or Bourne sh syntax, depending on whether your target is "most Unix systems" or "literally all Unix systems". This document will not cover POSIX or Bourne sh programming. Bashism has some notes on that, but it's nowhere near a full guide.

If you don't have a shebang, then the kernel will not be able to execute your script. If you try to run the script from a shell, and the kernel fails to execute it, the shell will "helpfully" spawn either /bin/sh or a child copy of itself to interpret the script, which means the script will be interpreted by some shell that you can't predict. This is bad. Don't do this. Always have a shebang line.

The shebang must appear immediately at the beginning of the file. No leading whitespace, no leading Byte Order Marks, etc. The kernel doesn't have time for such foolishness.
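
As a rough illustration of the "immediately at the beginning" rule, the sketch below checks whether a file really starts with the two characters #!. The helper name has_shebang and both sample files are hypothetical.

```shell
#!/usr/bin/env bash
# has_shebang FILE: succeed if the first line of FILE begins with "#!".
# (Made-up helper name, for illustration only.)
has_shebang() {
  local first
  IFS= read -r first < "$1"
  [[ $first == '#!'* ]]
}

good=$(mktemp) || exit 1
bad=$(mktemp) || exit 1
trap 'rm -f "$good" "$bad"' EXIT

printf '#!/usr/bin/env bash\necho ok\n' > "$good"
# Same content, but with a UTF-8 Byte Order Mark in front of the "#!".
printf '\357\273\277#!/usr/bin/env bash\necho ok\n' > "$bad"

for f in "$good" "$bad"; do
  if has_shebang "$f"; then
    echo "shebang is in place"
  else
    echo "no shebang at the start of the file"
  fi
done
```

The second file looks identical in most editors, which is why a stray Byte Order Mark is so easy to miss.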

Commands and Quoting

This will be an extremely terse section. For more details, see Quotes and Arguments.

A script consists of the shebang, followed by zero or more commands. Commands are typically one line apiece, ending with a newline, but some commands may span multiple lines. Bash does not read the script all at once. Instead, it reads one command at a time, as needed. Each command is read and executed, and then the next command is retrieved, until the script ends, either by reaching end of input, or by running the exit command, or by replacing itself with the exec command. (Or by fatal error.)

Each command is parsed into words, using spaces, tabs and sometimes newlines as word separators. Quotes are used to allow whitespace or other syntactic metacharacters (e.g. < or ;) to appear within a word, and also to suppress word splitting and filename expansions on the results of substitutions.
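
A small sketch of what quoting changes. count_args is a made-up helper that reports how many words it received; the file name is arbitrary.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints its argument count.
count_args() { echo "$#"; }

name="My Song.mp3"

unquoted=$(count_args $name)      # the expansion is split on the space
quoted=$(count_args "$name")      # the expansion stays one word
meta=$(count_args 'a;b' 'c < d')  # quoted metacharacters are just text

echo "unquoted: $unquoted, quoted: $quoted, meta: $meta"
```

With the default IFS this prints "unquoted: 2, quoted: 1, meta: 2".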

Simple commands consist of an optional series of variable assignments, followed by an optional command name, followed by optional argument words. Redirection words may appear anywhere within the simple command, although they are usually placed at the end of the command. If there are variable assignments but no command name, then the variables are assigned in the current scope. If there are variable assignments and a command name, the variable assignments become temporary environment variables for the duration of that simple command.
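
The three cases can be sketched as follows. The variable names (count, greeting, child_saw, hidden) are arbitrary, and bash -c is used only as a convenient child process.

```shell
#!/usr/bin/env bash
unset -v greeting

# Assignment with no command name: set in the current shell.
count=3

# Assignment plus a command name: greeting exists only in the
# environment of this one child process.
child_saw=$(greeting=hello bash -c 'echo "$greeting"')

# A redirection may come before the command name; it behaves the
# same as if it were written at the end.
hidden=$(> /dev/null echo "this goes nowhere")

echo "count in this shell: $count"
echo "child saw: $child_saw"
echo "parent sees: ${greeting-<unset>}"
```

The last line reports the variable as unset, because the temporary assignment never touched the current shell.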

Compound commands (e.g. if or while) typically include lists of simple commands within their grammar. The exit status of the compound command varies, depending on which compound command is used.

Pipelines connect two or more commands (simple or compound), with the standard output of the first becoming the standard input of the second, and so on. By default, the exit status of the pipeline is that of the last command. If the exit status of an earlier command is needed, it may be retrieved from the PIPESTATUS array variable.

By default, each command within a pipeline is executed in a separate subshell.
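
A short sketch of PIPESTATUS and of the subshell behaviour, under bash's default options. The variable names are arbitrary; the process substitution at the end is one common way to keep a loop in the current shell.

```shell
#!/usr/bin/env bash
# PIPESTATUS must be copied right away; the next command overwrites it.
false | true
statuses=("${PIPESTATUS[@]}")
echo "exit statuses: ${statuses[*]}"

# The while loop is part of a pipeline, so it runs in a subshell and
# its changes to piped_count are lost when the pipeline ends.
piped_count=0
printf 'a\nb\n' | while read -r line; do
  piped_count=$((piped_count + 1))
done
echo "after pipeline: $piped_count"

# Feeding the loop through a redirection keeps it in the current shell.
direct_count=0
while read -r line; do
  direct_count=$((direct_count + 1))
done < <(printf 'a\nb\n')
echo "after redirection: $direct_count"
```

This prints "exit statuses: 1 0", then "after pipeline: 0", then "after redirection: 2".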

Variables

Bash has several types of variables: strings, indexed arrays, and associative arrays.

String variables can also be given a "readonly" flag, or an "integer" flag, but these are mostly useless, and serious scripts shouldn't be using them. A readonly variable is designed to complement a restricted shell; any attempts to write to it are treated as attempts to violate the system administrator's restrictions. It's not analogous to a "const" from other languages. The classic example of a readonly variable is PATH in a restricted shell.

Variables with the integer flag force bash to treat all assignments as if they were performed in a math context. This is usually a bad idea -- not only does it introduce confusion (the effect of a command is not clear to the reader), but it also introduces potential security risks.
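
To illustrate the confusion, here is a sketch in which the same-looking assignment does two different things depending on a flag set elsewhere. The variable names are arbitrary.

```shell
#!/usr/bin/env bash
unset -v hello

plain="2+3"      # ordinary string variable: stores the text as written

declare -i num
num="2+3"        # integer flag: the right-hand side is evaluated as math

declare -i other
other=hello      # "hello" is treated as a variable name; it is unset, so 0

echo "plain=$plain num=$num other=$other"
```

This prints "plain=2+3 num=5 other=0". A reader who sees only the assignment lines cannot tell which behaviour applies.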

Variables can be local to functions, and in sufficiently new versions of bash, a variable may be used as a "name reference" to another variable; these uses will be discussed under Functions.

Variable names must begin with a letter or underscore, and may contain only letters, digits and underscores. An ALL_CAPS name should only be used for environment variables, or special internal bash variables (e.g. PIPESTATUS, BASHPID). All other variable names should contain at least one lowercase letter, to minimize namespace collisions. (Note: you cannot avoid collisions entirely. Bash is extremely primitive and hackish.)

Indexed (non-associative) arrays can be created without any special declarations -- either by assigning an entire array at once with a=(this is an array) or by assigning to a single indexed array element (a[42]=foo).

Associative arrays must be declared in advance: declare -A hash. Remember that declarations inside a function create a variable with local scope (the declare -g option was added to work around this, declaring at the global scope).
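
A brief sketch of both array types and of the scope rule for declare. All names here (color, make_tables, scratch, shared) are invented for the example, and declare -g assumes a bash new enough to have it.

```shell
#!/usr/bin/env bash
# Indexed arrays need no declaration and may be sparse.
a=(this is an array)
a[42]=foo
echo "indices: ${!a[*]}"
echo "element count: ${#a[@]}"

# Associative arrays must be declared first.
declare -A color
color[apple]=red
color[banana]=yellow
echo "apple is ${color[apple]}"

# Inside a function, declare makes a local variable unless -g is given.
make_tables() {
  declare -A scratch=([k]=v)     # local: gone when the function returns
  declare -gA shared=([k]=v)     # global: still visible afterwards
}
make_tables
echo "scratch: ${scratch[k]-<unset>}, shared: ${shared[k]-<unset>}"
```

The indexed array ends up with indices 0 1 2 3 42 and five elements; after the function returns, only shared still holds a value.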

Arrays cannot be exported into the environment. Only strings can. Remember, the environment is a list of KEY=VALUE pairs used by (potentially) every program on the entire system.

Bash does not have native support for any other data structures. You can use indexed arrays as lists, sort of, in the same way that Perl does (see below). If you want any other data structure, you'll have to build it yourself.

Variable name selection

A good variable name should be short but meaningful. It should describe what you're using the variable to do. The intent of your code should always be clear to the reader -- remember, a year from now, the person reading it may be you. You won't remember why you did whatever crazy crap you did, unless you document it. Variable names and function names are the first two forms of documentation, and arguably the most important ones.

Iterator variables (loop counters) are often a single character, like i for an integer counter, or f for a file iterator. Some people prefer a slight variation of this (e.g. ii instead of i, on the grounds that you can search for ii), but in any case there are many old, deep traditions in programming, and they exist for a reason.

for f in *.wav; do
  lame ... "$f" "${f%.wav}.mp3"
done

When you see code like this, you know immediately that f is a filename iterator, used only to hold the current filename inside the for loop. You will not expect the variable to be used after the loop (one exception: if the loop's entire purpose is to get the name of the last file matching a glob, and even then you'd probably use a different name), or to suddenly and mysteriously hold any other kind of data.

Likewise, if your script opens a logfile that's "configured" by a variable at the top, that variable should be something like log or logfile. How many logfiles does your script use? Just one, right? So you don't need to call it EXCELSIOR_FULL_PATH_LOG_LOCATION. If you open a temporary file, store its name in something like tmpfile, not MY_TEMPORARY_FILE_Q37X.

Remember, shell scripts are short. You are not writing a GUI application in some object-oriented language, with 2000 variables in 100,000 lines of code across 60 source code files in multiple directories. It's a shell script. It's a hundred lines or less (ideally), and you've got maybe 5-10 variables. Keep it simple.

Code that conforms to sane and traditional practices also reassures the reader that whoever wrote the code had half a clue. This is an important piece of information if you're debugging code. If you're tracking down a bug, and the code looks like the examples in this document, then you're probably looking for a subtle bug. If the code looks like a cat ran across the keyboard, then you need to look for obvious newbie blunders first (probably after reformatting). You'll probably have to keep making pass after pass, removing errors in layers, revealing still more mistakes underneath each previous failure stratum.

Arrays as lists

Bash's indexed arrays can be used in a few ways. The most obvious is to treat them as look-up tables, indexed by an integer, to retrieve a string associated with that index. For example,

# Define data field values for customer 42.
last_name[42]="Johnson"
first_name[42]="James"
address[42]="123 Main Street"

Of course, this is vastly inferior to using an actual database, and a programming language that supports an interface to store and retrieve information in that database. But for some simple scripts, this kind of direct lookup may be appropriate.

More commonly, an array is used to hold a list of string values (filenames are common). In these cases, we almost never refer to an individual element of the list. Most often, the entire list is passed as arguments to a command, or we iterate over the whole list using a for loop.

files=(*.ogg)

# We may iterate by index or by element.  If we want to modify
# items in the list, then we iterate by index.

# I use the variable name "i" to indicate that I'm iterating by index.
# If I were iterating by element, I'd use "f" since they are files.

for i in "${!files[@]}"; do
  if vorbiscomment -l "${files[i]}" | grep -qiw "disco"; then
    unset 'files[i]'
  fi
done

# Pass all the remaining files to the music player.
mplayer "${files[@]}"
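
The same pattern can be tried without any audio tools. This sketch uses made-up file names and a plain pattern match in place of vorbiscomment, and shows that removing elements leaves the array sparse.

```shell
#!/usr/bin/env bash
files=(intro.ogg disco-fever.ogg outro.ogg)

# Iterate by index so that matching elements can be removed.
for i in "${!files[@]}"; do
  if [[ ${files[i]} == *disco* ]]; then
    unset 'files[i]'
  fi
done

# The array is now sparse: index 1 is gone, 0 and 2 remain.
left=${!files[*]}
echo "indices left: $left"
printf 'kept: %s\n' "${files[@]}"

# Re-pack into consecutive indices if later code needs them.
files=("${files[@]}")
echo "after re-pack: ${!files[*]}"
```

Passing "${files[@]}" to a command works the same whether or not the array is sparse, so re-packing is only needed when later code relies on consecutive indices.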

Loops

Bash has two basic kinds of loops: while and for. Use for if you're iterating over a list (arguments, filenames, array elements, etc.). Use while if there is no list, and you must loop until some arbitrary condition is met (e.g. end of the input file).

The for loop has two forms: for varname [in LIST] and for ((start; check; next)). In the first style, if the in LIST is omitted, in "$@" is assumed. The second style is called the "C-style for loop", and is used when you're counting from n to m.

Examples:

for file in ./*; do diff "$file" "Old/$file"; done

while IFS=: read -r user pwdhash uid gid _; do
  echo "user $user has UID $uid and primary GID $gid"
done < /etc/passwd

find . -type f -name '*.wav' -exec \
  sh -c 'for f; do lame "$f" "${f%.wav}.mp3"; done' x {} +

time for ((i=1; i<=10000; i++)); do something; done

Each of the three expressions in the C-style for loop is an arithmetic expression; these are covered below.
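
For experimenting without external commands or system files, here is a self-contained sketch of the same loop forms. The data and variable names are invented.

```shell
#!/usr/bin/env bash
# List form: iterate over array elements.
words=(alpha beta gamma)
joined=
for w in "${words[@]}"; do
  joined+="<$w>"
done
echo "$joined"

# With "in LIST" omitted, the loop walks the positional parameters.
set -- one two three
n=0
for arg; do
  n=$((n + 1))
done
echo "saw $n arguments"

# C-style form: count from 1 to 10.
sum=0
for ((i = 1; i <= 10; i++)); do
  sum=$((sum + i))
done
echo "sum is $sum"

# while: there is no list, so loop until read reports end of input.
lines=0
while IFS= read -r line; do
  lines=$((lines + 1))
done <<'EOF'
first
second
EOF
echo "read $lines lines"
```

This prints "<alpha><beta><gamma>", "saw 3 arguments", "sum is 55" and "read 2 lines".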

Arithmetic

Bash can do integer arithmetic, but not floating/fixed point. As of bash 2.05b, all integers use C's intmax_t variable type (typically 64 bits, but it depends on the platform).

An arithmetic expression is anything that is evaluated by the C-like expression parser in an arithmetic context. The most basic math context is the $(( )) arithmetic substitution:

x=$((y + z))

The parsing rules are very, very different in a math context. Whitespace is irrelevant. Variable expansions don't need $, and a variable that contains something that looks like a math expression is recursively evaluated. Bash keeps evaluating recursively until it finds an integer, or an empty value (treated as 0), or something that's neither an integer nor a valid math expression, in which case it's an error. * is a multiplication operator, not a glob. The ? : ternary operator from C is present, but it can only give integer results, which makes lots of people sigh and move on. All of the C bitwise operators are present, as well as modulus (%). The ** operator does integer exponentiation (^ is bitwise XOR).

The brackets around an indexed array's index are also a math context, which allows you to write things like "${a[i+1]}".
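
The rules above can be checked with a few one-liners. This is only a sketch; the variable names are arbitrary.

```shell
#!/usr/bin/env bash
unset -v empty
z=4
y='z + 1'                  # a string that looks like a math expression

a=$(( y * 2 ))             # y is evaluated recursively to 5, then doubled
b=$(( $y * 2 ))            # $y is pasted in as text first: z + 1 * 2
c=$(( empty + 1 ))         # an unset or empty variable counts as 0
d=$(( 2 ** 10 ))           # integer exponentiation
e=$(( 2 ^ 10 ))            # bitwise XOR, not a power
f=$(( z > 3 ? 10 : 20 ))   # ternary operator, integer results only
g=$(( 17 % 5 ))            # modulus

arr=(zero one two three)
i=1
h=${arr[i+1]}              # the index is a math context too

echo "$a $b $c $d $e $f $g $h"
```

This prints "10 6 1 1024 8 10 2 two". The difference between a and b is a reason to leave the $ off variable names inside a math context.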

An arithmetic command is an arithmetic expression that is used where a command would normally appear. It has an exit status, which is derived from the C rules ("0 is false, and anything else is true"). Thus, a command like

((x=0))

is an arithmetic command, with a side effect (the string variable x is given the content 0), and with an exit status ("false", because the expression evaluated to 0; the exact exit status will be 1). Most commonly, arithmetic commands are used in if or while statements:

if ((x < 4)); then echo "Too small, try again"; fi

Note that, because math contexts use C parsing rules, = is an assignment and == is a comparison.
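
A sketch of how the exit status follows the value of the expression. The variable names are arbitrary. The counter at the end is a common place where the "0 is false" rule surprises people.

```shell
#!/usr/bin/env bash
((x = 0));  zero_status=$?    # expression is 0  -> "false" -> status 1
((x = 5));  five_status=$?    # expression is 5  -> "true"  -> status 0
((x == 5)); cmp_status=$?     # comparison holds -> status 0

echo "$zero_status $five_status $cmp_status"

# Post-increment evaluates to the old value, so the first increment
# of a counter that starts at 0 reports "false".
n=0
((n++)); first_inc=$?
((n++)); second_inc=$?
echo "$first_inc $second_inc n=$n"
```

This prints "1 0 0" and then "1 0 n=2".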

Also note that numeric constants in a math context follow C rules. A number with a leading 0x is hexadecimal, and a leading 0 (not followed by x) is octal. All other integers are treated as base 10, unless prefixed by base# where base is a base 10 integer telling bash which base to use. The leading-zero-is-octal "feature" creates a huge pitfall that you must be aware of, and work around.

if (( $(date +%d) < 7 )); then ...    # WRONG.  Octal death.

When using inputs that may have leading zeroes, you should strip those away before exposing them in a math context. The easiest way is to force a base 10 evaluation:

day=$(date +%d)
day=$((10#$day))

See the man page for other uses of base# in arithmetic expressions.
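
To try this without waiting for the 8th of the month, the sketch below uses a fixed string in place of the date output. The failing evaluation is run in a subshell with its error message discarded; the variable names are arbitrary.

```shell
#!/usr/bin/env bash
day="08"     # what date +%d prints on the 8th of the month

# 08 is not a valid octal number, so this evaluation fails.
if ( : $(( day + 0 )) ) 2>/dev/null; then
  octal_ok=yes
else
  octal_ok=no
fi

fixed=$(( 10#$day ))     # force base 10
bin=$(( 2#1010 ))        # base 2
hex=$(( 16#ff ))         # base 16 via base#
also_hex=$(( 0xff ))     # base 16 via the C prefix
oct=$(( 010 ))           # leading zero: octal

echo "$octal_ok $fixed $bin $hex $also_hex $oct"
```

This prints "no 8 10 255 255 8".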

Functions

Bash functions are really more like procedures, or user-implemented shell commands. They cannot return values to the caller. All they can actually return is an exit status (0 to 255). They're usually created with the explicit intent to have side effects (change the values of variables, or write to stdout/stderr, etc.). They can also be used to create execution environments with temporarily changed variables, or (most commonly of all) to factor out a common chunk of code, so that it doesn't need to be repeated.

All functions are created in a single, global namespace, even if the function is created while inside another function.

Functions get their own positional parameters ($1, $2, $@), and may have locally scoped variables. They may be called recursively.

Bash uses dynamically scoped variables. If a variable is referenced inside a function, but it isn't local to that function, bash will look in the caller's scope, and then the caller's caller's scope, and so on, until it finds a variable with that name, or reaches the global scope.
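
Dynamic scope is easiest to see with two small functions. The names show and wrapper are invented for this sketch.

```shell
#!/usr/bin/env bash
v="global"

show() {
  # v is not local here, so bash searches the callers' scopes.
  echo "show sees: $v"
}

wrapper() {
  local v="wrapper's local"
  show
}

show       # called from the global scope
wrapper    # called with wrapper's local v in the way
echo "global v is still: $v"
```

The same function prints "show sees: global" the first time and "show sees: wrapper's local" the second time, depending only on who called it.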

If you actually want to return a value to the caller, there are four main ways. First, you may store it in a variable that the caller can see; or second, write it to stdout and have the caller capture it with a command substitution. The latter is slow because it means bash has to fork() a new process, and the function runs in that new process (which is called a "subshell"). Nevertheless, the slow command substitution method is the most commonly used one, because it doesn't require much thought.

Examples:

usage() {
  echo "usage: foo [-ijq] [file ...]"
}

die() {
  printf '%s\n' "$1" >&2
  exit "${2:-1}"
}

rand() {
  # Return result in global variable 'r'.
  local max=$((32768 / $1 * $1))
  while (( (r=$RANDOM) >= max )); do :; done
  r=$(( r % $1 ))
}

ajoin() {
  local IFS="$1"; shift
  printf '%s\n' "$*"
}
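
A usage sketch for the first two methods. The rand and ajoin definitions are repeated from the examples above so that the block runs on its own; the calls and the variable name list are illustrative.

```shell
#!/usr/bin/env bash
rand() {
  # Return result in global variable 'r'.
  local max=$((32768 / $1 * $1))
  while (( (r=$RANDOM) >= max )); do :; done
  r=$(( r % $1 ))
}

ajoin() {
  local IFS="$1"; shift
  printf '%s\n' "$*"
}

# Method 1: the function stores its result in a variable (here, r).
rand 6
echo "die roll: $((r + 1))"

# Method 2: the function writes to stdout and the caller captures it.
list=$(ajoin , apple banana cherry)
echo "$list"
```

Calling rand inside a command substitution would not work, because the assignment to r would happen in the subshell and be lost.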

The third method of returning values to the caller is to use "name references" (bash 4.3 or higher). Warning: these are not robust. There is a namespace collision issue which cannot be circumvented. They are essentially syntactic sugar for a limited eval. Nevertheless, they are sometimes good enough in real life (as long as you control the caller).

Name references are created with declare -n, and they are local variables with local names. Any reference to the variable by its local name triggers a search for a variable with the name of its content. This uses the same dynamic scope rules as normal variables. So, the obvious issues apply: the local name and the referenced name must be different. The referenced name should also not be a local variable of the function in which the nameref is being used.

Example:

rand2() {
  # Store the result in the variable named as argument 2.
  declare -n r=$2
  local max=$((32768 / $1 * $1))
  while (( (r=$RANDOM) >= max )); do :; done
  r=$(( r % $1 ))
}

# Don't call "rand2 20 r"!  You get errors.
# Don't call "rand2 20 max"!  You get an infinite loop.

The workaround for this is to make every local variable in the function (not just the nameref) have a name that the caller is unlikely to use.

rand2() {
  # Store the result in the variable named as argument 2.
  declare -n _rand2_r=$2
  local _rand2_max=$((32768 / $1 * $1))
  while (( (_rand2_r=$RANDOM) >= _rand2_max )); do :; done
  _rand2_r=$(( _rand2_r % $1 ))
}
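
The same prefixing convention, sketched on a deterministic function so the result is easy to check. to_upper is a made-up function, and like any nameref code it assumes bash 4.3 or higher.

```shell
#!/usr/bin/env bash
# to_upper (hypothetical): store an upper-cased copy of argument 1
# in the variable named by argument 2.
to_upper() {
  declare -n _to_upper_out=$2
  _to_upper_out=${1^^}
}

to_upper "hello world" shout
echo "$shout"

# The target may itself be a local variable of the calling function,
# since the nameref is resolved with the usual dynamic scope rules.
caller() {
  local result
  to_upper "nested" result
  echo "$result"
}
caller
```

This prints "HELLO WORLD" and "NESTED". It would still break if a caller passed _to_upper_out as the target name, which is why the prefix only makes a collision unlikely.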

The fourth method of returning a value to the caller is to store the value in a file, and have the caller read it from that file. This may actually be faster than the command substitution method, since it doesn't involve a fork()ed subshell. Short-lived temporary files can be quite efficient, as they will usually not touch actual storage hardware (depending on how quickly the script uses and removes them, and the operating system's configuration).
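
A minimal sketch of the temporary file method, assuming mktemp is available. The function name compute and the variable names are invented.

```shell
#!/usr/bin/env bash
tmpfile=$(mktemp) || exit 1
trap 'rm -f "$tmpfile"' EXIT

# compute (hypothetical) "returns" its result by writing it to the file.
compute() {
  printf '%s\n' "$(( $1 * $1 ))" > "$tmpfile"
}

compute 12

# read is a builtin, so fetching the result does not fork.
IFS= read -r square < "$tmpfile"
echo "square is $square"
```

The trap removes the file when the script exits, so the file stays short-lived even if the script fails part way through.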

Exporting functions

Bash can export functions through the environment, using a special hack that encodes the function definition as a string and inserts it into the environment with a specially crafted name. The form of this name depends on whether your version of bash has been patched against the Shellshock vulnerability, and by whom. Therefore, you can't necessarily export functions between different versions of bash, unless you're confident they were both patched in the same way.

The most common use of exported functions (other than exploiting vulnerable CGI web pages) is to let find run one on each file:

foo() { ...; }
export -f foo
find . -type f -exec bash -c 'for f; do foo "$f"; done' x {} +


BashProgramming (last edited 2023-01-04 04:29:47 by GreyCat)