Working with files
On the previous page, we looked at some input file formats, and considered the choice of various tools that can read them. But many scripts deal with the files themselves, rather than what's inside them.
Filenames
On Unix systems, filenames may contain whitespace characters. This includes the space character, obviously. It also includes tabs, carriage returns, newlines, and more. Unix filenames may contain every character except / and NUL, and / is obviously allowed in pathnames (which are filenames preceded by zero or more directory components, either relative like ./myscript or absolute like /etc/aliases).
In fact, it's even worse: Unix filenames don't consist of characters at all; they consist of bytes. A filename may not even be a valid character sequence in your locale's character encoding. (Some languages call these "byte arrays"; bash lacks that particular terminology, but if you're familiar with it from another language, then that's what they are.)
Since whitespace characters may be included in a filename, it is a tragic mistake to write software that assumes filenames may be separated by spaces, or even newlines. Poorly written bash scripts are especially likely to be vulnerable to malicious or accidentally created unusual filenames. It's your job as the programmer to write scripts that don't fall over and die (or worse) when the user has a weird filename.
Iteration over filenames should never be done by ParsingLs. Instead, let bash expand a glob. If you need to iterate recursively, you can use the globstar option and a glob containing **, or you can use find. I won't duplicate the UsingFind page here; you are expected to have read it. Later, we'll explore the glob-vs.-find choice in depth.
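As a minimal sketch of the safe pattern, here's glob-based iteration over a directory; each filename becomes exactly one loop value, whatever bytes it contains:

{{{
# Safe iteration: the glob expands to one word per file.
for f in ./*; do
    printf 'found: %q\n' "$f"    # %q displays unusual bytes safely
done

# Never do this: for f in $(ls) -- word splitting mangles any
# filename containing whitespace or glob characters.
}}}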
A single filename may be safely stored in a bash string variable. If you need to store multiple filenames for some reason, use an array variable. Never attempt to store multiple filenames in a string variable with whitespace between them. In most cases, you shouldn't need to store multiple filenames anyway. Usually you can just iterate over the files once, and don't need to store more than one filename at a time. Of course, this depends on what the script is doing.
Sometimes your script will need to read, or write, a file which contains a list of filenames, one per line. If this is an external demand imposed on you, then there's not much you can do about it. You'll have to deal with the fact that a filename containing a newline is going to break your script (or the thing reading your output file). If you're writing the file, you could choose to omit the unusual filename altogether (with or without an error message).
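If you choose the omission route when writing such a file, the filtering might look like this sketch ($listfile is an illustrative name):

{{{
for f in ./*; do
    case $f in
        *$'\n'*) printf 'skipping unstorable filename: %q\n' "$f" >&2 ;;
        *)       printf '%s\n' "$f" >>"$listfile" ;;
    esac
done
}}}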
If you're using a file as an internal storage dump, you may safely store the list of filenames in a file if they are delimited by NUL characters instead of newlines. If they're in an array, this is trivial:
{{{
print0() { [ "$#" -eq 0 ] || printf '%s\0' "$@"; }
print0 "${files[@]}" > "$outputfile"
}}}
(The empty array case needs to be handled specially: printf '%s\0' without arguments would print one empty record instead of nothing at all.)
To read such a file into an array, in bash 4.4:
{{{
readarray -t -d '' files < "$inputfile"
}}}
Or in older bashes:
{{{
files=()
while IFS= LC_ALL=C read -r -d '' file; do
    files+=("$file")
done < "$inputfile"
}}}
The IFS= suppresses the trimming of leading/trailing whitespace characters that you'd get with the default value of $IFS. LC_ALL=C works around a bug in some versions of bash. readarray does not appear to have the same bugs that read does.
This serialization works for any array, not just filenames. Bash arrays hold C-like strings, and those can't contain NUL bytes.
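As an example, here's a sketch of a full round trip with an arbitrary array ($dumpfile is an illustrative name):

{{{
words=('hello world' $'a tab\tand a newline\n' '' 'trailing space ')

print0() { [ "$#" -eq 0 ] || printf '%s\0' "$@"; }
print0 "${words[@]}" > "$dumpfile"

copy=()
while IFS= LC_ALL=C read -r -d '' w; do
    copy+=("$w")
done < "$dumpfile"
# copy now holds exactly the same elements as words.
}}}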
Opening and closing
(Introductory material: Redirection, FileDescriptor.)
Simple bash scripts will read from stdin and write to stdout/stderr, and never need to worry about opening and closing files. The caller will take care of that, usually by doing its own redirections.
Slightly more complex scripts may open the occasional file by name, usually a single output file for logging results. This may be done on a per-command basis:
{{{
myfunc >"$log" 2>&1
}}}
or by redirecting stdout/stderr once, at the top of the script:
{{{
exec >"$log" 2>&1
myfunc
anotherfunc
}}}
In the latter case, all commands executed by the script after exec inherit the redirected stdout/stderr, just as if the caller had launched the script with that redirection in the first place.
The exec command doubles as both "open" and "close" in shell scripts. When you open a file, you decide on a file descriptor number to use. This FD number will be what you use to read from/write to the file, and to close it. (Bash 4.1 lets you open files without hard-coding a FD number, instead using a variable to let bash tell you what FD number it assigned. We won't cover this here.)
Scripts may safely assume that they inherit FD 0, 1 and 2 from the caller. FD 3 and higher are therefore typically available for you to use. (If your caller is doing something special with open file descriptors, you'll need to learn about that and deal with it. For now, we'll assume no such special arrangements.)
Bash and sh can open files in 4 different modes:
* Read: exec 3<"$file"
* Write: exec 3>"$file"
* Append: exec 3>>"$file"
* Read+write (without truncation): exec 3<>"$file"
Opening a file for write will clobber (truncate, destroy the contents of) any existing file by that name, even if you don't actually write anything to that FD. You can set the noclobber option (set -C) if this is a concern. I've never actually seen that used in a real script. (It may be more common in interactive shells.)
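A quick sketch of noclobber in action (the filename is illustrative):

{{{
set -C                    # enable noclobber
echo hi > "$file"         # fails if $file already exists
echo hi >| "$file"        # >| explicitly overrides noclobber
}}}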
Opening a file for append means every write to the file is preceded (atomically, magically) by a seek-to-end-of-file. This means two or more processes may open the file for append simultaneously, and each one's writes will appear at the end of the file as expected. (Do not attempt this with two processes opening a file for write. The semantics are entirely different.)
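Here's a sketch of that property, assuming an illustrative $logfile name; two background subshells append concurrently, and every line survives intact:

{{{
for i in 1 2; do
    (
        for n in {1..100}; do
            echo "writer $i line $n"
        done >>"$logfile"
    ) &
done
wait
# "$logfile" contains all 200 lines. With > instead of >>, the two
# writers would track their own file offsets and overwrite each other.
}}}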
The read+write mode is more commonly used for bidirectional streams such as network sockets. It can be useful for regular files, not so much because it's read+write but because contrary to >, it skips truncation allowing you to overwrite a part of the file's contents.
Closing a file descriptor is simple: exec 3>&- or exec 3<&- (either one should work regardless of how the file was opened).
Reading and writing with file descriptors
To read from an FD, you take a command that would normally read stdin, and you perform a redirection:
{{{
IFS= LC_ALL=C read -r -s -p 'Password: ' pwd <&3
}}}
There, read still reads on its stdin (fd 0), but after it has been temporarily redirected to the same resource as on fd 3. Though read specifically can also be told to read from fd 3 directly with -u (a non-standard extension from ksh):
{{{
IFS= LC_ALL=C read -r -u 3 -s -p 'Password: ' pwd
}}}
To write to an FD, you do the same thing using stdout:
{{{
printf '%s\n' "$message" >&3
}}}
Here's a realistic example:
{{{
while IFS= read -r host <&3; do
    ssh "$host" ...
done 3<"$hostlist"
}}}
ssh without -n slurps stdin, which would interfere with the reading of our hostlist file. So we do that reading on a separate FD, and voilà. Note that you don't do this: while IFS= read -r host <"$hostlist"; do .... That would reopen the file every time we hit the top of the loop, and keep reading the same host over and over.
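An alternative, since the real problem is ssh consuming stdin: keep reading on stdin, and pass -n to ssh, which redirects ssh's own stdin from /dev/null:

{{{
while IFS= read -r host; do
    ssh -n "$host" ...
done < "$hostlist"
}}}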
The placement of the "open" at the bottom of the loop may seem a bit weird if you're not used to bash programming. In fact, this syntax is really just a shortcut. If you prefer, you could write it out the long way:
{{{
exec 3<"$hostlist"
while IFS= read -r host <&3; do
    ssh "$host" ...
done
exec 3<&-
}}}
(Not exactly the same, though: this way you end up with fd 3 closed, whereas when you redirect the loop, fd 3 is restored to whatever it was before once the loop finishes.)
And here's an example using an output FD:
{{{
exec 3>>"$log"

log() {
    local IFS=' '
    printf '%s\n' "$*" >&3
}
}}}
Each time the log function is called, a message will be written to the open FD. This is more efficient than putting >>"$log" at the end of each printf command, and easier to type.
Operating on files in bulk
As we discussed earlier, there are two fundamental ways you can operate on multiple files: expanding a glob, or UsingFind. When using find, there are actually two approaches you can take: you can use -exec to have find perform some action, or you can read the names in your script.
Which tool and which approach you use depends on what your script needs to do. Ultimately you as the programmer must make all such decisions. I can only present some common guidelines:
* If your script needs to store information about files, then find -exec is probably not the approach you want. find performs its actions as a child process of your script, so you don't actually know anything about what it's doing. If you need to store information, then you will want to process the filenames yourself, which means you either read find's output, or you go with a glob.
* If you need to select files based on any metadata other than their names (owner, permissions, etc.) then you definitely want find.
* If you don't want to recurse, then you probably want to use a glob. find always recurses, though the -prune predicate can tell it to skip recursion. GNU find has a nonstandard extension (since copied by many other implementations) that lets you control the minimum and maximum recursion depth to make it easier, but in a portable script, that won't be an option.
* Bash's globs can recurse (as of bash 4.0 and the globstar option; though it was buggy before 5.0), but if you need to target systems with older versions of bash, recursion is going to mean find. A globstar sketch follows this list.
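Here's the globstar sketch mentioned above:

{{{
# bash 4.0+ (reliable as of 5.0): recursive iteration without find.
shopt -s globstar nullglob
for f in ./**/*.mp3; do
    ...
done
}}}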
Using a glob is simple. A glob expands to a list of filenames, which is a thing that exists only ephemerally, for the duration of the command that contains the expansion. Normally this is exactly what you want.
{{{
shopt -s nullglob; shopt -u failglob

for f in *.mp3; do
    ...
done
}}}
The list that results from expanding *.mp3 lives somewhere in bash's dynamic memory regions. It's not accessible to you, and you don't need it to be, because your loop is just handling one file at a time.
If for some reason you want to store this list, you can use an array.
{{{
shopt -s nullglob; shopt -u failglob

files=(*.mp3)
}}}
This is typically only required if you want to do something like counting the number of files, or iterating over the list multiple times, or determining the first or last file in the expansion. (Glob expansions are sorted according to the rules of your locale, specifically the LC_COLLATE variable. If you wanted to get the first or last file when sorted by some other criteria, such as modification time, that is an entirely separate problem, enormously more difficult.)
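For example, a sketch of counting the files and picking the first and last entries of the expansion:

{{{
shopt -s nullglob; shopt -u failglob
files=(*.mp3)

printf 'count: %s\n' "${#files[@]}"
if (( ${#files[@]} > 0 )); then
    printf 'first: %s\n' "${files[0]}"
    printf 'last:  %s\n' "${files[-1]}"    # negative index needs bash 4.3+
fi
}}}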
Remember, you can also use extended globs if those will help you. For example, !(*~) would expand to all of the files that don't end with ~. Recall that if you intend to enable extglob in a script, you must do it early in the script, not inside of a function or other compound command that attempts to use extended globs.
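A sketch using that pattern, with extglob enabled at the top of the script as just described:

{{{
#!/bin/bash
shopt -s extglob nullglob    # must be set before the pattern is parsed

for f in !(*~); do
    printf '%s\n' "$f"
done
}}}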
When using a glob expansion in a loop or storing it in an array, you generally also want to enable the nullglob option; without it, if there's no match, you loop once over (or store) the literal text of the glob pattern. As nullglob unfortunately doesn't take precedence over failglob, you may need to disable failglob as well, in case it was enabled earlier.
When using find, as mentioned earlier, you have two basic choices: let find act on the files via -exec, or retrieve the names within your script. Each approach has its merits, so it's useful for you to understand both of them.
Conceptually, retrieving the names is simpler, because it shares the same basic structure as the for loop using a glob. However, find does not produce a list; it produces a data stream, which we have to parse. Therefore we don't use for. We use while read instead.
{{{
while IFS= LC_ALL=C read -r -d '' f; do
    ...
done < <(find . -type f -print0)
}}}
Remember, pathnames may contain newlines, so the only delimiter that can safely separate pathnames in a stream is the NUL byte. Most find implementations now have the -print0 predicate to delimit the stream this way (it was standardized in the 2024 edition of POSIX), though you may still find older systems where it's not available. If you need to target systems that have only the older find, this workaround is more portable:
{{{
while IFS= LC_ALL=C read -r -d '' f; do
    ...
done < <(find . -type f -exec printf '%s\0' {} +)
}}}
This is less efficient than -print0 of course.
In both cases, the read command does our parsing for us. We tell it to expect a NUL delimiter between files with the -d '' option. The -r option suppresses backslash mangling, setting IFS= suppresses leading/trailing whitespace trimming, and setting LC_ALL=C works around a bug in bash 5.0 and newer. This basic template for reading a NUL delimited stream is extremely important, and you should be absolutely sure you understand it.
If you want to store find results in an array, you can use this same template, and simply put an array element assignment inside the loop. In bash 4.4, readarray (or its misnamed mapfile alias) was also given the -d option, which you may use if you're targeting such systems:
{{{
readarray -t -d '' files < <(find ... -print0)
}}}
Storing an entire hierarchy of filenames in an array shouldn't be a common choice, but it's there if you need it.
That leaves the more difficult approach: using -exec to delegate actions to a grandchild process. If the delegation is simple, then this may not actually be so difficult, but if we want to do anything subtle or complicated, then this becomes an interesting tool.
The fundamental point you must remember is that {} has to appear directly before + with no intervening arguments. So, for example, you can do this:
{{{
find ... -exec dos2unix {} +
}}}
But you cannot do this:
{{{
find ... -exec mv {} /destination +    # Does not work.
}}}
Any time you want to run something that has the {} in the middle, or which would have multiple instances of {}, or which needs to manipulate the filename, you can -exec a shell and let the shell process each filename as an argument. In effect, you are writing a script within a script. The basic templates for this look like:
{{{
find ... -exec bash -c '... "$@" ...' bash {} +
}}}
or
{{{
find ... -exec bash -c 'for f do ...; done' bash {} +
}}}
You can use sh -c if your mini-script doesn't rely on bash features. Remember, the argument that immediately follows bash -c script becomes argument 0 ($0) of the script, so you need to put a placeholder argument there. I'm repeating the shell interpreter in this document. While it can be literally any string you like, it's important to pick something that identifies the command that is being used such as bash/sh here as the value will also be used in error messages by the shell. Values such as _ or x would result in confusing error messages. find puts a sub-list of filenames where the {} is, and those become the script's positional parameters ("$@"). find may choose to do this multiple times, if there are lots of files, so you will end up with one grandchild shell process for each such chunk of files.
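To see why the placeholder's value matters, compare what happens when the mini-script hits an error (exact message wording varies between shells and versions, but the leading name comes from $0):

{{{
bash -c 'nosuchcommand' x       # error message begins with: x: ...
bash -c 'nosuchcommand' bash    # error message begins with: bash: ...
}}}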
Some examples:
{{{
find ... -exec sh -c 'mv -- "$@" /destination' sh {} +
}}}
{{{
find ... -exec sh -c '
    for f do
        dir=${f%/*} file=${f##*/}
        mkdir -p -- "/destination/$dir" &&
            convert ... "$f" ... "/destination/$dir/${file%.*}.png"
    done
' sh {} +
}}}
Remember that you've got an outer layer of single quotes around your mini-script, so you can't use single quotes inside it, unless you write them as '\''. It's best to avoid writing anything that needs such a level of complexity. If you reach that level, you can put the mini-script in an actual script (a separate file), and -exec that directly. Or, you can write it as a function, and export it, and let the grandchild bash -c process import it automatically.
{{{
export -f myfunc
find ... -exec bash -c 'for f do myfunc "$f"; done' bash {} +
}}}
Finally, I leave you with this example which synthesizes many of the techniques we've already discussed:
{{{
rlart() {
    # Recursive version of "ls -lart". Show files sorted by mtime, recursively.
    # Requires GNU find and GNU sort.
    printf '%s\0' "${@-.}" |
        find -files0-from - -type f -printf '%T@@%TFT%TT@%Tz@%p\0' |
        sort -zn |
        while LC_ALL=C IFS=@ read -rd '' _ mtime tz file; do
            printf '%s\n' "${mtime%.*}$tz $file"
        done
}
}}}
Sorting is something bash can't do internally; scripts are expected to call sort(1) instead. So we need to provide a stream that sort can sort how we want. I use GNU find's -printf option to format the fields of the data stream for each pathname, using an explicit NUL delimiter. GNU sort has a -z option to accept an input stream with NUL delimiters, so everything works together.
How did I come up with this? Simply break the problem down into steps:
1. We're going to need to use find because we're recursing. GNU find can produce output in any format. We want to see the modification date & time and full pathname, so read the man page and figure out what syntax to use for those.
2. To be able to deal with arbitrary pathname arguments, we can't use find "$@" (not even find -- "$@"), which wouldn't work for pathnames that start with - (or some other values such as ! or ( which are the names of some of find's predicates), so we pass the list (defaulting to .) NUL-delimited on find's stdin, which since version 4.9 GNU find can read with -files0-from -.
3. GNU sort can sort the input on whatever field we like. What field should we provide to make sorting as easy as possible? Obviously the Unix last modification timestamp (as seconds since epoch) is the easiest, so we'll add that too. (Sorting on the human-readable mtime field wouldn't work properly in timezones that implement daylight saving.)
4. We don't actually want to see the Unix timestamp in the final output, so we'll need to remove it after the sort.
5. And hey, it turns out -printf doesn't have a way to print seconds without annoying fractions out to 10 decimal places. We want to remove those too. So we might as well combine these two removals into a single clean-up step.
The -printf format that I chose produces output like:
{{{
1491425037.8232634170@2017-04-05T16:43:57.8232634170@-0500@.bashrc
}}}
Note we separate the fields with @. Using a whitespace character (such as the ones found in the default value of $IFS) wouldn't work properly for pathnames that start with such characters because of the special way IFS-splitting handles those.
We want to remove the entire first field, and make a modification to the second field and concatenate the third (a timestamp without timezone offset is ambiguous). We want to leave everything after the third field untouched, no matter what crazy characters or non-characters it has. This matches up very nicely with the shell's read command (not so nicely with awk), so I chose a while read loop to do the final clean-up.
On the sample above that will give us:
{{{
2017-04-05T16:43:57-0500 .bashrc
}}}
That's a standard (ISO 8601), unambiguous timestamp format.
You may have noticed that we're producing a newline-delimited output stream, which is a problem if one of the filenames contains newlines. This command is only intended to be used by a human. The output is not meant to be parsed by anything less sophisticated than a human brain. This means that it serves the same purpose as ls, as I noted in the comments. It shares the same newline limitation. We live in an imperfect world, so sometimes we need imperfect tools.