Tool selection

When programming in bash, you have a decently large set of powerful tools from which to choose. One of the main problems, then, is selecting the correct tool for the job.

To bash, or not to bash

The first question you need to ask yourself is whether bash is even an appropriate language for the task. Bash excels at automating simple jobs, like setting up an environment to execute a single process, or iterating over the files in a directory. See BashWeaknesses for a list of things at which bash is quite bad. Use a different programming language if the task is beyond bash's capabilities.

Of course, this means you will need to have an understanding of those capabilities. This will come with experience. Don't be afraid to start a project in bash, only to learn that it's just not going to work, and then scrap it and start over in some other language. This is natural. The goal is to be able to reach that decision more quickly, so that you waste less time.

Some of the other scripting languages that you might consider:

Awk is a standard POSIX tool, extremely mature, and quite powerful for text file processing. Most veteran bash programmers don't appreciate its full potential. I won't cover much awk here, but you are encouraged to learn it.
Perl has been around a long time, and was enormously popular during Linux's infancy. Thus, it is widely available on Linux systems, and is a viable choice on all Unixes. It combines features from shells, sed, awk and other languages. Leans heavily on regular expressions. Extremely terse syntax, which some people find hard to read. You can think of it as "sed on steroids."
Tcl has been around nearly as long as perl. It uses a shell-like syntax, but adopts many features from LISP. You want an array of lists? Lists of lists? Clean shell syntax without quoting hell? Tcl is for you. Tcl should be much more popular than it is.
Python is the youngest language on this list, and is now very widely used, especially in academia and research. May be a great choice for certain tasks due to the many specialized modules available.

Strength reduction

One of the fundamental rules of programming is strength reduction, or in simple terms, "always use the weakest tool". (Wikipedia says it's just for compilers, but I disagree. It should be a core component of your strategy as a programmer, especially in bash, where the cost of a powerful tool may be orders of magnitude higher than the cost of a weaker tool that can do the same job.)

Consider some sets of tools that have overlapping uses:

perl vs. awk vs. sed vs. while read
regular expression vs. extended glob matching vs. glob matching vs. string comparison
sed in a command substitution vs. builtin parameter expansion
expr vs. builtin arithmetic

In the first set (perl vs. awk vs. sed vs. while read), if you're considering ways to operate on a line-oriented input text file, the choice will boil down to what you actually need to do to each line. If you just need to print the first field of each line, and it's a fairly large file, use awk '{print $1}'. It could be done in sed, but it would be much uglier; or in perl, but that might be much slower (and is not guaranteed to be universally available). You could also do it with a while read loop in bash, but that tends to be slower than awk for large input files. On the other hand, if the file just has 2-3 lines, bash might be faster.

Perl is a stronger tool than awk, which means it has a higher cost to use. You don't need to pay that cost for a simple job like printing one field of each line. Awk is stronger than sed, but not much more expensive to use. For this job, awk's elegance outweighs the nominal cost savings of sed. (That trade-off is a judgment call on your part.)

If your job is "remove all trailing whitespace from each line", sed 's/[[:space:]]*$//' is the natural choice. You could do it in awk or perl, but those have a higher cost, and the solution in awk would be uglier. Again, a while read loop in bash could also do it, but if the input is more than a few lines, sed will be faster.
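As a rough sketch of the choices above, the following runs each tool on the job it suits. The sample input is made up for illustration, and the while read loop is included only for comparison with awk.

```bash
# Hypothetical sample input: three lines, two with trailing whitespace.
input=$'alpha one  \nbeta two\t\ngamma three'

# "Print the first field of each line": awk is the weakest adequate tool.
first_awk=$(awk '{print $1}' <<< "$input")

# The same job in pure bash. Fine for a few lines, slower on large files.
first_bash=$(
  while read -r first _; do
    printf '%s\n' "$first"
  done <<< "$input"
)

# "Remove trailing whitespace from each line": a natural sed job.
trimmed=$(sed 's/[[:space:]]*$//' <<< "$input")
```

Both first_awk and first_bash end up holding the same three words, one per line. Which approach is cheaper depends on the input size, as described above.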

Regular expression vs. glob matching tends to be a decision that people coming from perl and PHP forget they can even make. Most of the jobs that are suited to bash will not require regular expressions, so you can usually get by with simpler and faster globs. Never use a regex where a glob will do. Compare:

if [[ $input = a*.txt ]]

if [[ $input =~ ^a.*\.txt$ ]]

Not only is the glob easier to read; it's also significantly less expensive in CPU cycles.

wooledg:~$ TIMEFORMAT='%R %U %S'
wooledg:~$ a="somefile.txt"
wooledg:~$ time for ((i=1; i<=100000; i++)); do [[ $a = a*.txt ]]; done
0.292 0.192 0.100
wooledg:~$ time for ((i=1; i<=100000; i++)); do [[ $a =~ ^a.*\.txt$  ]]; done
0.647 0.644 0.004

The time difference may be even more dramatic with more complex regular expressions.
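Extended globs sit between plain globs and regular expressions. The sketch below picks the weakest matcher that can express each requirement. The function names and file name patterns are hypothetical, chosen only for illustration.

```bash
# Extended globs are always available on the right side of = inside [[ ]]
# in recent bash versions; the shopt is needed to use them elsewhere.
shopt -s extglob

# A plain glob is enough for "ends in .txt".
is_txt() { [[ $1 = *.txt ]]; }

# An extended glob expresses "one or more digits".
is_numbered_report() { [[ $1 = report-+([0-9]).txt ]]; }

# A regex is only worth its cost for things globs cannot say,
# such as exact repetition counts.
dated_re='^report-[0-9]{4}-[0-9]{2}\.txt$'
is_dated_report() { [[ $1 =~ $dated_re ]]; }
```

For example, is_numbered_report report-12.txt succeeds and is_numbered_report report-x.txt fails, with no regular expression involved.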

Builtin parameter expansions (Bash FAQ 73, Bash FAQ 100) may not have the full power of a $(sed '...' <<< "$string") or similar command substitution, but they're significantly cheaper (faster), and for the vast majority of string manipulations they are sufficient. Compare:

vowels=$(tr -dc AEIOUaeiou <<< "$string")

vowels=${string//[!AEIOUaeiou]/}

There is a similar (and extremely common) choice between the dirname or basename commands, and the ${file%/*} or ${file##*/} expansions (respectively). The parameter expansions are significantly faster, but they may not handle degenerate inputs like / in the same way as the external commands. Do you care how your script handles the / case? If not, you probably want the faster code. If you do care, then you could still use builtins, just by adding a check for that one special input.
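Adding a check for the special input might look something like this. The names my_dirname and my_basename are hypothetical, and this is only a sketch: it handles / and paths without a slash, but it makes no attempt to match the external commands on inputs with trailing slashes such as dir/.

```bash
# Builtin-only approximation of dirname (hypothetical helper).
my_dirname() {
  local path=$1 dir
  if [[ $path = / ]]; then
    dir=/                      # the degenerate input
  elif [[ $path != */* ]]; then
    dir=.                      # no slash at all: current directory
  else
    dir=${path%/*}
    [[ -n $dir ]] || dir=/     # "/etc" would otherwise give an empty string
  fi
  printf '%s\n' "$dir"
}

# Builtin-only approximation of basename (hypothetical helper).
my_basename() {
  local path=$1
  if [[ $path = / ]]; then
    printf '/\n'               # the degenerate input
  else
    printf '%s\n' "${path##*/}"
  fi
}
```

In a real script you would probably inline the expansion and the check rather than call a function inside a command substitution, since the command substitution gives back much of the speed you gained.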

The last example (expr vs. builtin arithmetic) shouldn't even be a decision anyone has to make any longer. Builtin arithmetic is vastly superior in every way, and the $(( )) expansion is POSIX compliant. The only place for expr is in legacy Bourne shell scripts.
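For comparison, here is the same small calculation done both ways. The variable names are arbitrary.

```bash
a=7 b=5

# Legacy Bourne style: runs an external command, and * must be escaped
# so the shell does not treat it as a glob.
product_expr=$(expr "$a" \* "$b")

# Builtin arithmetic expansion: no external command, no escaping.
product=$((a * b))

# The (( )) arithmetic command (a bash feature) works directly as a test.
if ((a > b)); then bigger=a; else bigger=b; fi
```

Both product_expr and product hold 35, but only the first one had to start another process to get there.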

Look at your input

Often the hardest part of any bash script will be reading (parsing) input files. If your script is going to read a file, you the programmer need to take some time to think about the input. Actually spend a few minutes looking at the input. What parts of it do you actually need? What parts can be ignored? Are you seeing the worst case scenario? How would your script need to change if the input file had this instead of that?

The problem becomes even harder if you consider that the input file format may change over time. How certain are you that the input file you see today will have the same format next week? Next year? Do you need your script to continue working for 1 year? 10 years? Is it better to write a simpler, leaner script today knowing that you may have to redo it in 6 months, or would it be better to spend 10 times as long building a "flexible" script today that may or may not accommodate a future format change? There are no universally right answers.

There is no simple chart that can tell you to use Tool X for Input File Y. You'll need to compare the benefits and costs of each of your tools against your input file. I can only give some general suggestions here.

Delimited fields

This is one of your best case scenarios: the input file is divided into lines, each of which is divided into fields, and there is a consistent delimiter between fields. Sometimes the delimiter is "one or more whitespace characters". Sometimes it's a single colon (:) as for example in /etc/passwd or /etc/group.

"One or more whitespace characters" is such a common delimiter that both bash and awk have special internal ways of handling this automatically. In both cases, this is the default delimiter, so you really have it easy. Consider this example input file:

Smith    Beatrice   1970-01-01 123456789
Jackson  Paul          1980-02-02   987654321
O'Leary  Sheamus 1977-07-07 7777777

The fields don't all line up, but for our purposes that doesn't matter, because read splits each line on runs of whitespace:

while read -r last first birth ssn; do
  ...
done < "$file"

Colon-delimited files like /etc/group work very similarly. You only need to set the IFS variable for the read command:

while IFS=: read -r group pass gid users; do
  ...
done < "$file"

Or the -F option for awk:

awk -F: '...' "$file"

See Bash FAQ 1 for more examples of this sort.
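To make the skeletons above concrete, here is a small sketch that lists the groups having at least one member, once with a bash loop and once with awk. The sample data is made up, in /etc/group format, and is fed from a variable rather than a file.

```bash
# Hypothetical sample data in /etc/group format.
groups_data=$'root:x:0:\nadm:x:4:syslog,alice\nsudo:x:27:alice,bob'

# bash: split each line on colons and keep groups whose fourth field
# (the member list) is not empty.
with_members=()
while IFS=: read -r group pass gid users; do
  if [[ -n $users ]]; then
    with_members+=("$group")
  fi
done <<< "$groups_data"

# awk: the same filter, with -F: setting the field separator.
awk_result=$(awk -F: '$4 != "" {print $1}' <<< "$groups_data")
```

The loop is the better fit if you need the names in a bash array for later use. The awk one-liner is the better fit if you just want them printed.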

In some cases, you may use non-obvious field delimiters to slice up the input. For example, you may want to extract a substring that has unique punctuation around it, such as square brackets. You could tell awk to use the square bracket characters as field delimiters:

wooledg:~$ awk -F '[][]' '{print $2}' <<< 'blah blah [gold] blah'
gold

This works as long as you're certain the square brackets can't appear anywhere else in the line. Of course, you could also use parameter expansions if the line is in a bash variable:

wooledg:~$ string='blah blah [gold] blah'
wooledg:~$ tmp=${string#*[}; echo "${tmp%]*}"
gold

Which one do you pick? Again, it depends. I'd pick awk if there's more than one line from which I want to extract the substrings, and all I want to do is dump them to stdout. I'd use expansions if I'm already reading a line at a time in a bash loop for some other reason.

Column-aligned fields

This type of input file is less common these days, but was enormously common a few decades ago. You may still see it, especially if you deal with data coming from mature systems (COBOL, mainframes, anything that involves a government or a bank).

LASTNAME            FIRSTNAME      BIRTHDATE SSN
Smith               John           1956-05-03333333333
Lopez Garcia        Manuel         1995-12-30444444444
KrishnabathabrahmaguRajesh         1974-08-08555555555
 Misentered         Person         1960-01-01666666666

The typical field-oriented tools (like while read -r a b c) won't work on this type of file. Your typical approach will be to read each line as a whole, and then split the line into substrings using column indexing. For example:

while IFS= read -r line; do
  last=${line:0:20}
  first=${line:20:15}
  birth=${line:35:10}
  ssn=${line:45:9}
done < "$file"

You'll probably also need to trim spaces from each field.
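One way to do the trimming with parameter expansions alone is sketched below. The trim function and its trimmed output variable are hypothetical names, and the sample line is built with printf so that the column widths match the layout above. A real loop over the file would also need to skip the header line.

```bash
# Remove leading and trailing whitespace from $1.
# The result is left in the global variable "trimmed" to avoid a subshell.
trim() {
  local s=$1
  s=${s#"${s%%[![:space:]]*}"}   # strip the leading run of whitespace
  s=${s%"${s##*[![:space:]]}"}   # strip the trailing run of whitespace
  trimmed=$s
}

# Build one fixed-width record: 20 + 15 + 10 + 9 columns.
printf -v line '%-20s%-15s%-10s%-9s' \
  'Lopez Garcia' 'Manuel' '1995-12-30' '444444444'

trim "${line:0:20}";  last=$trimmed
trim "${line:20:15}"; first=$trimmed
trim "${line:35:10}"; birth=$trimmed
trim "${line:45:9}";  ssn=$trimmed
```

Note that the space inside "Lopez Garcia" survives, which is exactly what whitespace-based field splitting could not give you.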

Comma-separated value

At first glance, these appear to be field-delimited text files. In the simplest (degenerate) case, that's exactly what they are. Unfortunately, in the general case the field delimiter is allowed to appear inside one of the fields, and then there will be some attempt to wrap the field in quotes. And then, there will be some attempt to escape literal quotes so that those can appear. None of the ways that this is done are really standard. In theory, RFC 4180 is supposed to describe them, but real-life implementations (spreadsheets) often deviate from it.

If you have to deal with one of these files, I suggest looking at one of the other scripting languages that has a library/package/module specifically written to parse these files. For example, in Tcl + Tcllib, there is a csv package.
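To see why the simple delimited-field approach is not enough here, consider this short demonstration. The record is made up for illustration; its second field contains a quoted comma.

```bash
# Hypothetical CSV record with three logical fields.
record='42,"Smith, John",1956-05-03'

# Naive splitting on commas knows nothing about the quotes.
IFS=, read -r -a fields <<< "$record"
count=${#fields[@]}
```

The array ends up with four elements instead of three, and the name is torn into '"Smith' and ' John"', quote characters included. This is the failure a real CSV parser exists to prevent.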

HTML, XML, JSON

If you need to deal with any of these, the most important thing I can tell you is that you must not fall into the trap of thinking that you can treat these as text files. Don't try to extract fields from any of these formats using sed, awk, etc. XML and HTML, especially, are not regular languages, so they can't be parsed with regular expressions, or with any of the tools that are built around regular expressions.

Use specialized programs for these formats. For JSON, there are jq and jshon, which you can use in shell scripts. For XML or HTML, there are several evolving tools that I won't attempt to cover here; refer to Bash FAQ 113.
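As a minimal sketch of using jq from a script, the following pulls a string and an array out of a JSON document. The document and its key names are made up for illustration, and the code assumes jq is installed, so it checks for it first.

```bash
# Hypothetical JSON document. Note the comma inside one array element,
# which would defeat any attempt to split this as plain text.
json='{"name": "Beatrice Smith", "tags": ["a, b", "c"]}'

if command -v jq >/dev/null 2>&1; then
  # -r prints raw strings instead of JSON-quoted ones.
  name=$(jq -r '.name' <<< "$json")

  # One array element per output line, read into a bash array.
  mapfile -t tags < <(jq -r '.tags[]' <<< "$json")
fi
```

Reading one element per line only works if no element contains a newline. If that can happen in your data, you will need a different output format from jq.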