Tool selection
When programming in bash, you have a decently large set of powerful tools from which to choose. One of the main problems, then, is selecting the correct tool for the job.
To bash, or not to bash
The first question to ask yourself is whether bash is even an appropriate language for the task. Bash excels at automating simple jobs, like setting up an environment to execute a single process, or iterating over the files in a directory. See BashWeaknesses for a list of things at which bash is quite bad. Use a different programming language if the task is beyond bash's capabilities.
Of course, this means you will need to have an understanding of those capabilities. This will come with experience. Don't be afraid to start a project in bash, only to learn that it's just not going to work, and then scrap it and start over in some other language. This is natural. The goal is to be able to reach that decision more quickly, so that you waste less time.
Strength reduction
One of the fundamental rules of programming is strength reduction, or in simple terms, "always use the weakest tool". (Wikipedia says it's just for compilers, but I disagree. It should be a core component of your strategy as a programmer, especially in bash, where the cost of a powerful tool may be orders of magnitude higher than the cost of a weaker tool that can do the same job.)
Consider some sets of tools that have overlapping uses:
- perl vs. awk vs. sed vs. while read
- regular expression vs. extended glob matching vs. glob matching vs. string comparison
- expr vs. builtin arithmetic
In the first set (perl vs. awk vs. sed vs. while read), if you're considering ways to operate on a line-oriented input text file, the choice will boil down to what you actually need to do to each line. If you just need to print the first field of each line, and it's a fairly large file, use awk '{print $1}'. It could be done in sed, but it would be much uglier; or in perl, but that might be much slower (and is not guaranteed to be universally available). You could also do it with a while read loop in bash, but that tends to be slower than awk for large input files. On the other hand, if the file just has 2-3 lines, bash might be faster.
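As a minimal, runnable sketch of the awk option (the input lines here are invented for illustration; in practice they would come from a file):

```shell
# Print the first whitespace-delimited field of each line with awk.
# The sample records are made up for this example.
first_fields=$(printf '%s\n' 'Smith Beatrice' 'Jackson Paul' | awk '{print $1}')
printf '%s\n' "$first_fields"
```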
Perl is a stronger tool than awk, which means it has a higher cost to use. You don't need to pay that cost for a simple job like printing one field of each line. Awk is stronger than sed, but not much more expensive to use. For this job, awk's elegance outweighs the nominal cost savings of sed. (That trade-off is a judgment call on your part.)
If your job is "remove all trailing whitespace from each line", sed 's/[[:space:]]*$//' is the natural choice. You could do it in awk or perl, but those have a higher cost, and the solution in awk would be uglier. Again, a while read loop in bash could also do it, but if the input is more than a few lines, sed will be faster.
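A quick sketch of that sed job (the input string is invented):

```shell
# Remove trailing whitespace (spaces and tabs) from each line with sed.
cleaned=$(printf 'ends with spaces   \nends with a tab\t\n' | sed 's/[[:space:]]*$//')
printf '%s\n' "$cleaned"
```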
Regular expression vs. glob matching tends to be a decision that people coming from perl and PHP forget they can even make. Most of the jobs that are suited to bash will not require regular expressions, so you can usually get by with simpler and faster globs. Never use a regex where a glob will do. Compare:
if [[ $input = a*.txt ]]
if [[ $input =~ ^a.*\.txt$ ]]
Not only is the glob easier to read; it's also significantly less expensive in CPU cycles.
wooledg:~$ TIMEFORMAT='%R %U %S'
wooledg:~$ a="somefile.txt"
wooledg:~$ time for ((i=1; i<=100000; i++)); do [[ $a = a*.txt ]]; done
0.292 0.192 0.100
wooledg:~$ time for ((i=1; i<=100000; i++)); do [[ $a =~ ^a.*\.txt$ ]]; done
0.647 0.644 0.004
The time difference may be even more dramatic with more complex regular expressions.
The last example (expr vs. builtin arithmetic) shouldn't even be a decision anyone has to make any longer. Builtin arithmetic is vastly superior in every way, and the $(( )) expansion is POSIX compliant. The only place for expr is in legacy Bourne shell scripts.
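A sketch of the builtin form; unlike expr, it forks no external process:

```shell
# Builtin arithmetic via the POSIX-compliant $(( )) expansion.
x=6 y=7
product=$(( x * y ))
echo "$product"    # prints 42
```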
Look at your input
Often the hardest part of any bash script will be reading (parsing) input files. If your script is going to read a file, you the programmer need to take some time to think about the input. Actually spend a few minutes looking at the input. What parts of it do you actually need? What parts can be ignored? Are you seeing the worst case scenario? How would your script need to change if the input file had this instead of that?
There is no simple chart that can tell you to use Tool X for Input File Y. You'll need to compare the benefits and costs of each of your tools against your input file. I can only give some general suggestions here.
Delimited fields
This is one of your best case scenarios: the input file is divided into lines, each of which is divided into fields, and there is a consistent delimiter between fields. Sometimes the delimiter is "one or more whitespace characters". Sometimes it's a single colon (:) as for example in /etc/passwd or /etc/group.
"One or more whitespace characters" is such a common delimiter that both bash and awk have special internal ways of handling this automatically. In both cases, this is the default delimiter, so you really have it easy. Consider this example input file:
Smith Beatrice 1970-01-01 123456789
Jackson Paul 1980-02-02 987654321
O'Leary Sheamus 1977-07-07 7777777
The fields don't all line up, but for our purposes that doesn't matter.
while read -r last first birth ssn; do
  ...
done < "$file"
Colon-delimited files like /etc/group work very similarly. You only need to set the IFS variable for the read command:
while IFS=: read -r group pass gid users; do
  ...
done < "$file"
See Bash FAQ 1 for more examples of this sort.
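A self-contained version of the colon-delimited loop, fed one made-up group entry through a here-document instead of a real /etc/group:

```shell
# Parse a colon-delimited record; the group entry below is invented.
while IFS=: read -r group pass gid users; do
  result="$group has gid $gid, members: $users"
done <<'EOF'
wheel:x:10:alice,bob
EOF
echo "$result"
```

Because the input is redirected into the loop rather than piped, the loop runs in the current shell and the variables it sets remain available afterward.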
Column-aligned fields
This type of input file is less common these days, but was enormously common a few decades ago. You may still see it, especially if you deal with data coming from mature systems (COBOL, mainframes, anything that involves a government or a bank).
LASTNAME            FIRSTNAME      BIRTHDATE SSN
Smith               John           1956-05-03333333333
Lopez Garcia        Manuel         1995-12-30444444444
KrishnabathabrahmaguRajesh         1974-08-08555555555
Misentered          Person         1960-01-01666666666
The typical field-oriented tools (like while read -r a b c) won't work on this type of file. Your typical approach will be to read each line as a whole, and then split the line into substrings using column indexing. For example:
while IFS= read -r line; do
  last=${line:0:20}
  first=${line:20:15}
  birth=${line:35:10}
  ssn=${line:45:9}
done < "$file"
You'll probably also need to trim spaces from each field.
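One way to do both steps (slice the fixed-width fields, then trim the space padding), sketched on a single invented record matching the layout above:

```shell
# Slice fixed-width fields, then strip trailing space padding using
# parameter expansion only (no external commands needed).
line='Smith               John           1956-05-03333333333'
last=${line:0:20}      # "Smith" plus 15 spaces of padding
first=${line:20:15}    # "John" plus 11 spaces of padding
# ${var##*[! ]} yields the trailing run of spaces; ${var%"..."} removes it.
last=${last%"${last##*[! ]}"}
first=${first%"${first##*[! ]}"}
echo "$last $first"    # prints "Smith John"
```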