Working with files

On the previous page, we looked at some input file formats, and considered the choice of various tools that can read them. But many scripts deal with the files themselves, rather than what's inside them.

Filenames

On Unix systems, filenames may contain whitespace. This includes the space character, obviously. It also includes tabs, carriage returns, newlines, and more. Unix filenames may contain every character except / and NUL, and / is obviously allowed in pathnames (which are filenames preceded by zero or more directory components, either relative like ./myscript or absolute like /etc/aliases).

It is a tragic mistake to write software that assumes lists of filenames can safely be delimited by spaces, or even by newlines. Poorly written bash scripts are especially likely to be vulnerable to maliciously or accidentally created unusual filenames. It's your job as the programmer to write scripts that don't fall over and die (or worse) when the user has a weird filename.

Iteration over filenames should be done by letting bash expand a glob, never by ParsingLs. If you need to iterate recursively, you can use the globstar option and a glob containing **, or you can use find. I won't duplicate the UsingFind page here; you are expected to have read it.
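
For example, a non-recursive loop and a recursive one might look like this (the *.txt pattern and the printf body are just placeholders for whatever your script actually does):

# Non-recursive: iterate over everything in the current directory.
for f in ./*; do
  printf 'Found: %s\n' "$f"
done

# Recursive: requires bash 4.0 or newer for the globstar option.
shopt -s globstar
for f in ./**/*.txt; do
  printf 'Found: %s\n' "$f"
done

The ./ prefix keeps a filename that begins with a dash from being mistaken for an option by whatever command you hand it to.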

A single filename may be safely stored in a bash string variable. If you need to store multiple filenames for some reason, use an array variable. Never attempt to store multiple filenames in a string variable with whitespace between them. In most cases, you shouldn't need to store multiple filenames anyway. Usually you can just iterate over the files once, and don't need to store more than one filename at a time. Of course, this depends on what the script is doing.
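
As a sketch, here's a glob result stored in an array and iterated later; nullglob is set so that a pattern matching nothing yields an empty array instead of the literal pattern, and wc -c stands in for whatever your script really does:

shopt -s nullglob
files=(./*.log)
printf 'Found %d log file(s)\n' "${#files[@]}"
for f in "${files[@]}"; do
  wc -c "$f"
done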

Sometimes your script will need to read, or write, a file which contains a list of filenames, one per line. If this is an external demand imposed on you, then there's not much you can do about it. You'll have to deal with the fact that a filename containing a newline is going to break your script (or the thing reading your output file). If you're writing the file, you could choose to omit the unusual filename altogether (with or without an error message).
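
If you're the one writing such a file, a minimal sketch of the "omit it, with an error message" approach might look like this ($outputfile and the ./* glob are placeholders):

for f in ./*; do
  if [[ $f == *$'\n'* ]]; then
    printf 'Skipping filename containing a newline: %q\n' "$f" >&2
    continue
  fi
  printf '%s\n' "$f"
done > "$outputfile"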

If you're using a file as an internal storage dump, you may safely store the list of filenames in a file if they are delimited by NUL characters instead of newlines. If they're in an array, this is trivial:

printf '%s\0' "${files[@]}" > "$outputfile"

To read such a file back into an array, in bash 4.4 or newer:

mapfile -t -d '' files < "$inputfile"

Or in older bashes:

files=()
while IFS= read -r -d '' file; do
  files+=("$file")
done < "$inputfile"

This serialization works for any array, not just filenames. Bash arrays hold strings, and strings can't contain NUL bytes.

Opening and closing

(Introductory material: Redirection, FileDescriptor.)

Simple bash scripts will read from stdin and write to stdout/stderr, and never need to worry about opening and closing files. The caller will take care of that, usually by doing its own redirections.

Slightly more complex scripts may open the occasional file by name, usually a single output file for logging results. This may be done on a per-command basis:

myfunc >"$log" 2>&1

or by redirecting stdout/stderr once, at the top of the script:

exec >"$log" 2>&1
myfunc
anotherfunc

In the latter case, all commands executed by the script after exec inherit the redirected stdout/stderr, just as if the caller had launched the script with that redirection in the first place.

The exec command doubles as both "open" and "close" in shell scripts. When you open a file, you decide on a file descriptor number to use. This FD number will be what you use to read from/write to the file, and to close it. (Bash 4.1 lets you open files without hard-coding a FD number, instead using a variable to let bash tell you what FD number it assigned. We won't cover this here.)
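
In its simplest form, an open/use/close cycle with a hand-picked FD looks like this (FD 3, $inputfile and the read are arbitrary illustrative choices; reading via an FD is covered in more detail below):

exec 3<"$inputfile"      # "open": FD 3 now reads from $inputfile
read -r firstline <&3    # use the FD
exec 3<&-                # "close": FD 3 is released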

Scripts may safely assume that they inherit FD 0, 1 and 2 from the caller. FD 3 and higher are therefore typically available for you to use. (If your caller is doing something special with open file descriptors, you'll need to learn about that and deal with it. For now, we'll assume no such special arrangements.)

Bash can open files in 4 different modes: read (exec 3<file), write (exec 3>file), append (exec 3>>file), and read+write (exec 3<>file).

Opening a file for write will clobber (destroy the contents of) any existing file by that name, even if you don't actually write anything to that FD. You can set the noclobber option (set -C) if this is a concern. I've never actually seen that used in a real script. (It may be more common in interactive shells.)
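
For completeness, here's what noclobber looks like in practice; >| is the standard way to override it for a single redirection:

set -C                         # enable noclobber
printf 'data\n' > somefile     # fails if somefile already exists
printf 'data\n' >| somefile    # >| overrides noclobber for this one redirection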

Opening a file for append means every write to the file is preceded (atomically, magically) by a seek-to-end-of-file. This means two or more processes may open the file for append simultaneously, and each one's writes will appear at the end of the file as expected. (Do not attempt this with two processes opening a file for write. The semantics are entirely different.)
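
A quick illustration of the append behavior, with two background writers sharing one log file (the messages and the filename are placeholders):

for i in 1 2 3; do printf 'writer A: %s\n' "$i"; done >> shared.log &
for i in 1 2 3; do printf 'writer B: %s\n' "$i"; done >> shared.log &
wait
# shared.log now contains all six lines; neither writer overwrote the other.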

The read+write mode is normally used with network sockets, not regular files.
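
For instance, bash's /dev/tcp redirection (available when bash is built with network redirection support) hands you a socket on an FD opened with <>; the host and request here are purely illustrative:

exec 3<>/dev/tcp/example.com/80          # open a TCP socket for read+write on FD 3
printf 'HEAD / HTTP/1.0\r\n\r\n' >&3     # write the request
cat <&3                                  # read the response until the server closes
exec 3>&-                                # close the socket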

Closing a file is simple: exec 3>&- or exec 3<&- (either one should work).

Reading and writing with file descriptors

To read from an FD, you take a command that would normally read stdin, and you perform a redirection:

read -r -s -p 'Password: ' pwd <&3

To write to an FD, you do the same thing using stdout:

printf '%s\n' "$message" >&3

Here's a realistic example:

while read -r host <&3; do
  ssh "$host" ...
done 3<"$hostlist"

SSH slurps stdin, which would interfere with the reading of our hostlist file. So we do that reading on a separate FD, and voila. Note that you don't do this: while read -r host <"$hostlist"; do .... That would reopen the file every time we hit the top of the loop, and keep reading the same host over and over.

The placement of the "open" at the bottom of the loop may seem a bit weird if you're not used to bash programming. In fact, this syntax is really just a shortcut. If you prefer, you could write it out the long way:

exec 3<"$hostlist"
while read -r host <&3; do
  ssh "$host" ...
done
exec 3<&-

And here's an example using an output FD:

exec 3>>"$log"
log() { printf '%s\n' "$*" >&3; }

Each time the log function is called, a message will be written to the open FD. This is more efficient than putting >>"$log" at the end of each printf command, and easier to type.
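
Usage then looks like this, with the FD closed when the script is finished (it would also be closed automatically at exit):

log "backup started"
log "backup finished"
exec 3>&-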

