Working with files

On the previous page, we looked at some input file formats, and considered the choice of various tools that can read them. But many scripts deal with the files themselves, rather than what's inside them.

Filenames

On Unix systems, filenames may contain whitespace. This includes the space character, obviously. It also includes tabs, carriage returns, newlines, and more. Unix filenames may contain every character except / and NUL, and / is obviously allowed in pathnames (which are filenames preceded by zero or more directory components, either relative like ./myscript or absolute like /etc/aliases).

It is a tragic mistake to write software that assumes filenames may be separated by spaces, or even newlines. Poorly written bash scripts are especially likely to be vulnerable to malicious or accidentally created unusual filenames. It's your job as the programmer to write scripts that don't fall over and die (or worse) when the user has a weird filename.

Iteration over filenames should be done by letting bash expand a glob, never by ParsingLs. If you need to iterate recursively, you can use the globstar option and a glob containing **, or you can use find. I won't duplicate the UsingFind page here; you are expected to have read it. Later, we'll explore the glob-vs.-find choice in depth.
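Here is a minimal sketch of both glob styles, run against a scratch directory. The directory layout and file names are made up for illustration, and globstar needs bash 4.0 or later.

```shell
# Scratch tree with awkward names (all names here are hypothetical).
tmp=$(mktemp -d)
mkdir -p "$tmp/sub dir"
touch "$tmp/a b.txt" "$tmp/sub dir/c.txt"

# Non-recursive: the glob expands to one word per file, spaces included.
flat=0
for f in "$tmp"/*.txt; do
  flat=$((flat + 1))
done

# Recursive: with globstar set, ** matches zero or more directory levels.
shopt -s globstar
deep=0
for f in "$tmp"/**/*.txt; do
  deep=$((deep + 1))
done
shopt -u globstar

printf 'flat=%s deep=%s\n' "$flat" "$deep"
rm -rf "$tmp"
```

Neither loop needs any special handling for the space in "a b.txt" or "sub dir", because the glob never goes through word splitting.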

A single filename may be safely stored in a bash string variable. If you need to store multiple filenames for some reason, use an array variable. Never attempt to store multiple filenames in a string variable with whitespace between them. In most cases, you shouldn't need to store multiple filenames anyway. Usually you can just iterate over the files once, and don't need to store more than one filename at a time. Of course, this depends on what the script is doing.
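To see why the string approach fails, compare the two side by side. This is only a sketch; the filenames are invented.

```shell
# Two filenames, one of which contains a space.
files=("my file.txt" "other.txt")

# Array: each element comes back out as exactly one word.
from_array=0
for f in "${files[@]}"; do
  from_array=$((from_array + 1))
done

# String: the names are glued together, and the only way to get them
# back is word splitting, which cannot tell which spaces were separators.
string="my file.txt other.txt"
from_string=0
for f in $string; do          # deliberately unquoted, to show the damage
  from_string=$((from_string + 1))
done

printf 'array saw %s names, string saw %s\n' "$from_array" "$from_string"
```

The array yields two names. The string yields three words, none of which is "my file.txt".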

Sometimes your script will need to read, or write, a file which contains a list of filenames, one per line. If this is an external demand imposed on you, then there's not much you can do about it. You'll have to deal with the fact that a filename containing a newline is going to break your script (or the thing reading your output file). If you're writing the file, you could choose to omit the unusual filename altogether (with or without an error message).

If you're using a file as an internal storage dump, you may safely store the list of filenames in a file if they are delimited by NUL characters instead of newlines. If they're in an array, this is trivial:

printf '%s\0' "${files[@]}" > "$outputfile"

To read such a file into an array, in bash 4.4 or later:

mapfile -t -d '' files < "$inputfile"

Or in older bashes:

files=()
while IFS= read -r -d '' file; do
  files+=("$file")
done < "$inputfile"

This serialization works for any array, not just filenames. Bash arrays hold strings, and strings can't contain NUL bytes.
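A round trip might look like the following sketch. The array contents are made up, and I'm using the read loop so that it works on bashes older than 4.4 as well.

```shell
store=$(mktemp)

# Includes a name with an embedded newline, the case that breaks
# newline-delimited lists.
files=("plain.txt" "with space.txt" $'with\nnewline.txt')

# Write: one NUL after every element.
printf '%s\0' "${files[@]}" > "$store"

# Read back: one element per NUL-terminated chunk.
restored=()
while IFS= read -r -d '' file; do
  restored+=("$file")
done < "$store"

printf 'restored %d names\n' "${#restored[@]}"
rm -f "$store"
```

One thing to watch for: if the array is empty, printf still emits its format once, so the file contains a single NUL and reads back as one empty element. Check the array length before writing if that matters to you.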

Opening and closing

(Introductory material: Redirection, FileDescriptor.)

Simple bash scripts will read from stdin and write to stdout/stderr, and never need to worry about opening and closing files. The caller will take care of that, usually by doing its own redirections.

Slightly more complex scripts may open the occasional file by name, usually a single output file for logging results. This may be done on a per-command basis:

myfunc >"$log" 2>&1

or by redirecting stdout/stderr once, at the top of the script:

exec >"$log" 2>&1
myfunc
anotherfunc

In the latter case, all commands executed by the script after exec inherit the redirected stdout/stderr, just as if the caller had launched the script with that redirection in the first place.

The exec command doubles as both "open" and "close" in shell scripts. When you open a file, you decide on a file descriptor number to use. This FD number will be what you use to read from/write to the file, and to close it. (Bash 4.1 lets you open files without hard-coding a FD number, instead using a variable to let bash tell you what FD number it assigned. We won't cover this here.)

Scripts may safely assume that they inherit FD 0, 1 and 2 from the caller. FD 3 and higher are therefore typically available for you to use. (If your caller is doing something special with open file descriptors, you'll need to learn about that and deal with it. For now, we'll assume no such special arrangements.)

Bash can open files in 4 different modes:

  read:        exec 3<file
  write:       exec 3>file
  append:      exec 3>>file
  read+write:  exec 3<>file

Opening a file for write will clobber (destroy the contents of) any existing file by that name, even if you don't actually write anything to that FD. You can set the noclobber option (set -C) if this is a concern. I've never actually seen that used in a real script. (It may be more common in interactive shells.)
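If you do want to try noclobber, this sketch shows the behavior on a scratch file. The file name and contents are invented.

```shell
tmp=$(mktemp -d)
printf 'precious\n' > "$tmp/data"

set -C    # noclobber on

# A plain > now refuses to truncate an existing regular file.
# The subshell keeps the failed redirection away from the main script.
if ( : > "$tmp/data" ) 2>/dev/null; then
  plain=succeeded
else
  plain=refused
fi
after_plain=$(<"$tmp/data")

# >| overrides noclobber for a single redirection.
printf 'replaced\n' >| "$tmp/data"
after_forced=$(<"$tmp/data")

set +C    # back to the default
printf '%s / %s / %s\n' "$plain" "$after_plain" "$after_forced"
rm -rf "$tmp"
```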

Opening a file for append means every write to the file is preceded (atomically, magically) by a seek-to-end-of-file. This means two or more processes may open the file for append simultaneously, and each one's writes will appear at the end of the file as expected. (Do not attempt this with two processes opening a file for write. The semantics are entirely different.)
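A small sketch of two writers sharing one file in append mode. The writer function is hypothetical, and I'm assuming short writes to a local filesystem.

```shell
out=$(mktemp)

# Hypothetical writer: emits five short lines, appending to the shared file.
# The file is opened once per writer, in append mode.
writer() {
  local i
  for i in 1 2 3 4 5; do
    printf '%s line %s\n' "$1" "$i"
  done >> "$out"
}

writer A &
writer B &
wait

total=$(( $(wc -l < "$out") ))
intact=$(grep -c '^[AB] line [1-5]$' "$out")
printf 'total=%s intact=%s\n' "$total" "$intact"
rm -f "$out"
```

The order in which A's and B's lines interleave is not predictable, but no line overwrites another.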

The read+write mode is normally used with network sockets, not regular files.

Closing a file is simple: exec 3>&- or exec 3<&- (either one should work).
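Putting open, use and close together, a complete cycle on FD 3 might look like this sketch. It assumes nothing else in the script is using FD 3.

```shell
tmp=$(mktemp)

exec 3>"$tmp"            # open for write on FD 3 (truncates the file)
printf 'first\n'  >&3
printf 'second\n' >&3
exec 3>&-                # close FD 3

exec 3<"$tmp"            # reopen the same file for read
read -r line1 <&3        # each read continues where the last one stopped
read -r line2 <&3
exec 3<&-                # close FD 3

printf '%s, %s\n' "$line1" "$line2"
rm -f "$tmp"
```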

Reading and writing with file descriptors

To read from an FD, you take a command that would normally read stdin, and you perform a redirection:

read -r -s -p 'Password: ' pwd <&3

To write to an FD, you do the same thing using stdout:

printf '%s\n' "$message" >&3

Here's a realistic example:

while read -r host <&3; do
  ssh "$host" ...
done 3<"$hostlist"

SSH slurps stdin, which would interfere with the reading of our hostlist file. So we do that reading on a separate FD, and voila. Note that you don't do this: while read -r host <"$hostlist"; do .... That would reopen the file every time we hit the top of the loop, and keep reading the same host over and over.
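You can watch the problem happen without a network. In this sketch, a hypothetical greedy function stands in for ssh: like ssh, it reads everything available on stdin. The host names are invented.

```shell
hostlist=$(mktemp)
printf '%s\n' alpha beta gamma > "$hostlist"

# Stand-in for ssh (hypothetical): consumes all of stdin.
greedy() { cat >/dev/null; }

# Wrong: the loop body shares stdin with read, so greedy eats the
# rest of the list during the first iteration.
broken=0
while read -r host; do
  greedy
  broken=$((broken + 1))
done < "$hostlist"

# Right: the list is on FD 3, which greedy never touches.
# (stdin is pointed at /dev/null only so this demo cannot wait on a terminal.)
fixed=0
while read -r host <&3; do
  greedy
  fixed=$((fixed + 1))
done 3<"$hostlist" </dev/null

printf 'broken loop ran %s time(s), fixed loop ran %s\n' "$broken" "$fixed"
rm -f "$hostlist"
```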

The placement of the "open" at the bottom of the loop may seem a bit weird if you're not used to bash programming. In fact, this syntax is really just a shortcut. If you prefer, you could write it out the long way:

exec 3<"$hostlist"
while read -r host <&3; do
  ssh "$host" ...
done
exec 3<&-

And here's an example using an output FD:

exec 3>>"$log"
log() { printf '%s\n' "$*" >&3; }

Each time the log function is called, a message will be written to the open FD. This is more efficient than putting >>"$log" at the end of each printf command, and easier to type.

Operating on files in bulk

As we discussed earlier, there are two fundamental ways you can operate on multiple files: expanding a glob, or UsingFind. When using find, there are actually two approaches you can take: you can use -exec to have find perform some action, or you can read the names in your script.

Which tool and which approach you use depends on what your script needs to do. Ultimately you as the programmer must make all such decisions. I can only present some common guidelines:

Using a glob is simple. A glob expands to a list of filenames, which is a thing that exists only ephemerally, for the duration of the command that contains the expansion. Normally this is exactly what you want.

for f in *.mp3; do
  ...
done

The list that results from expanding *.mp3 lives somewhere in bash's dynamic memory regions. It's not accessible to you, and you don't need it to be, because your loop is just handling one file at a time.

If for some reason you want to store this list, you can use an array.

files=(*.mp3)

This is typically only required if you want to do something like counting the number of files, or iterating over the list multiple times, or determining the first or last file in the expansion. (Glob expansions are sorted according to the rules of your locale, specifically the LC_COLLATE variable. If you wanted to get the first or last file when sorted by some other criterion, such as modification time, that is an entirely separate problem, enormously more difficult.)
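For instance, counting the matches and picking out the first and last might look like this sketch, with invented file names:

```shell
tmp=$(mktemp -d)
touch "$tmp/b.mp3" "$tmp/a.mp3" "$tmp/c d.mp3"

files=("$tmp"/*.mp3)

count=${#files[@]}
first=${files[0]##*/}            # strip the directory part
last=${files[count-1]##*/}       # index arithmetic works in any bash with arrays

printf '%s files, first=%s last=%s\n' "$count" "$first" "$last"
rm -rf "$tmp"
```

Be aware that if nothing matches, the array holds one element, the unexpanded pattern itself, unless the nullglob option is set.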

Remember, you can also use extended globs if those will help you. For example, !(*~) would expand to all of the files that don't end with ~. Recall that if you intend to enable extglob in a script, you must do it early in the script, not inside of a function or other compound command that attempts to use extended globs.
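A sketch of the !(*~) example, with extglob enabled on its own line at the top level before the pattern is ever parsed. The file names are hypothetical.

```shell
shopt -s extglob     # must already be in effect when the line below is parsed

tmp=$(mktemp -d)
touch "$tmp/notes.txt" "$tmp/notes.txt~" "$tmp/todo"

# Everything in the directory except names ending in ~
kept=("$tmp"/!(*~))

printf 'kept %s of 3 files\n' "${#kept[@]}"
rm -rf "$tmp"
```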

When using find, as mentioned earlier, you have two basic choices: let find act on the files via -exec, or retrieve the names within your script. Each approach has its merits, so it's useful for you to understand both of them.

Conceptually, retrieving the names is simpler, because it shares the same basic structure as the for loop using a glob. However, find does not produce a list; it produces a data stream, which we have to parse. Therefore we don't use for. We use while read instead.

while IFS= read -r -d '' f; do
  ...
done < <(find . -type f -print0)

Remember, pathnames may contain newlines, so the only delimiter that can safely separate pathnames in a stream is the NUL byte. GNU and BSD find commands have the -print0 option to delimit the stream this way, and you may use that as long as you are only targeting such systems. If you need to target systems that have only POSIX find, this workaround is portable:

while IFS= read -r -d '' f; do
  ...
done < <(find . -type f -exec printf %s\\0 {} +)

This is notably less efficient than -print0 of course.

In both cases, the read command does our parsing for us. We tell it to use a NUL delimiter between files with the -d '' option. (Treating an empty delimiter as NUL was for a long time an undocumented feature of bash, though supported by the maintainer; more recent bash manuals describe it.) The -r option suppresses backslash mangling, and setting IFS= suppresses leading/trailing space trimming. This basic template for reading a NUL delimited stream is extremely important, and you should be absolutely sure you understand it.
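To convince yourself that each piece of the template matters, try leaving one out. This sketch feeds the same contrived name (leading and trailing spaces, plus a backslash) through three variations:

```shell
name='  two\words  '

# Full template: the name survives untouched.
IFS= read -r -d '' good   < <(printf '%s\0' "$name")

# Without IFS=: leading and trailing spaces are trimmed.
read -r -d '' no_ifs      < <(printf '%s\0' "$name")

# Without -r: the backslash is treated as an escape and removed.
IFS= read -d '' no_r      < <(printf '%s\0' "$name")

printf '[%s]\n' "$good" "$no_ifs" "$no_r"
```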

If you want to store find results in an array, you can use this same template, and simply put an array element assignment inside the loop. In bash 4.4, mapfile was also given the -d option, which you may use if you're targeting such systems:

mapfile -t -d '' files < <(find ... -print0)

Storing an entire hierarchy of filenames in an array shouldn't be a common choice, but it's there if you need it.
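For bashes older than 4.4, the loop version might look like this sketch. It assumes a find with -print0, and the scratch tree is invented.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/d"
touch "$tmp/one" "$tmp/d/two words" "$tmp/d/"$'three\nlines'

# Same NUL-delimited template as before, with an array append in the body.
found=()
while IFS= read -r -d '' f; do
  found+=("$f")
done < <(find "$tmp" -type f -print0)

printf 'found %s files\n' "${#found[@]}"
rm -rf "$tmp"
```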

That leaves the more difficult approach: using -exec to delegate actions to a grandchild process. If the delegation is simple, then this may not actually be so difficult, but if we want to do anything subtle or complicated, then this becomes an interesting tool.

The fundamental point you must remember is that {} has to appear directly before + with no intervening arguments. So, for example, you can do this:

find ... -exec dos2unix {} +

But you cannot do this:

find ... -exec mv {} /destination +    # Does not work.

Any time you want to run something that has the {} in the middle, or which would have multiple instances of {}, or which needs to manipulate the filename, you can -exec a shell and let the shell process each filename as an argument. In effect, you are writing a script within a script. The basic templates for this look like:

find ... -exec bash -c '... "$@" ...' x {} +

or

find ... -exec bash -c 'for f; do ...; done' x {} +

You can use sh -c if your mini-script doesn't rely on bash features. Remember, the argument that immediately follows bash -c script becomes argument 0 ($0) of the script, so you need to put a placeholder argument there. I'm using x in this document, but it can be literally any string you like. find puts a sub-list of filenames where the {} is, and those become the script's positional parameters ("$@"). find may choose to do this multiple times, if there are lots of files, so you will end up with one grandchild shell process for each such chunk of files.
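The placeholder is easy to test without find at all. In this sketch the arguments are typed by hand instead of being supplied by find:

```shell
# With a placeholder: x becomes $0, and both real arguments land in "$@".
with=$(bash -c 'printf "0=%s n=%s first=%s" "$0" "$#" "$1"' x 'a b' c)

# Without one: the first real argument is swallowed as $0.
without=$(bash -c 'printf "0=%s n=%s" "$0" "$#"' 'a b' c)

printf '%s\n' "$with" "$without"
```

If you forget the placeholder in a real find command, the first file of every chunk is silently skipped.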

Some examples:

find ... -exec sh -c 'mv -- "$@" /destination' x {} +

find ... -exec sh -c '
  for f; do
    dir=${f%/*} file=${f##*/}
    mkdir -p "/destination/$dir"
    convert ... "$f" ... "/destination/$dir/${file%.*}.png"
  done
' x {} +

Remember that you've got an outer layer of single quotes around your mini-script, so you can't use single quotes inside it, unless you write them as '\''. It's best to avoid writing anything that needs such a level of complexity. If you reach that level, you can put the mini-script in an actual script (a separate file), and -exec that directly. Or, you can write it as a function, and export it, and let the grandchild bash -c process import it automatically.
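For reference, this is what the '\'' dance looks like in a tiny sketch. The inner script is just printf '%s\n' "$1", and the argument is made up.

```shell
# Each '\'' closes the outer quote, adds a literal ', and reopens the quote.
out=$(sh -c 'printf '\''%s\n'\'' "$1"' x "it's fine")

printf '%s\n' "$out"
```

Even at this size it is hard to read, which is the argument for moving anything bigger into a file or a function.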

export -f myfunc
find ... -exec bash -c 'for f; do myfunc "$f"; done' x {} +
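A complete version of the exported-function approach might look like this sketch. The describe function and the file names are hypothetical, and the child must be bash for the import to happen.

```shell
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b c.txt"

# Hypothetical per-file action: print the basename in angle brackets.
describe() { printf '<%s>\n' "${1##*/}"; }
export -f describe

# The grandchild bash imports describe from the environment.
# sort is only there to make the output order predictable.
out=$(find "$tmp" -type f -exec bash -c 'for f; do describe "$f"; done' x {} + | sort)

printf '%s\n' "$out"
rm -rf "$tmp"
```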

Finally, I leave you with this example which synthesizes many of the techniques we've already discussed:

rlart() {
  # Recursive version of "ls -lart".  Show files sorted by mtime, recursively.
  # Requires GNU find and GNU sort.
  find "${1:-.}" -type f -printf '%T@ %TY-%Tm-%Td %TT %p\0' |
    sort -zn |
    while read -rd '' _ day time path; do
      printf '%s %s %s\n' "$day" "${time%.*}" "$path"
    done
}

Sorting is something bash can't do internally; scripts are expected to call sort(1) instead. So we need to provide a stream that sort can sort how we want. I use GNU find's -printf option to format the fields of the data stream for each pathname, using an explicit NUL delimiter. GNU sort has a -z option to accept an input stream with NUL delimiters, so everything works together.

How did I come up with this? Simply break the problem down into steps:

  1. We're going to need to use find because we're recursing. GNU find can produce output in any format. We want to see the modification date & time and full pathname, so read the man page and figure out what syntax to use for those.

  2. GNU sort can sort the input on whatever field we like. What field should we provide to make sorting as easy as possible? Obviously the Unix timestamp (seconds since epoch) is the easiest, so we'll add that too. (I could have chosen to sort on the human-readable date and time fields. I didn't.)

  3. We don't actually want to see the Unix timestamp in the final output, so we'll need to remove it after the sort.

  4. And hey, it turns out -printf doesn't have a way to print seconds without annoying fractions out to 10 decimal places. We want to remove those too. So we might as well combine these two removals into a single clean-up step.

The -printf format that I chose produces output like:

1491425037.8232634170 2017-04-05 16:43:57.8232634170 .bashrc

We want to remove the entire first field (plus the whitespace after it), and make a modification to the third field. We want to leave everything after the third field untouched, no matter what crazy internal whitespace it has. This matches up very nicely with the shell's read command (not so nicely with awk), so I chose a while read loop to do the final clean-up.

You may have noticed that we're producing a newline-delimited output stream, which is a problem if one of the filenames contains newlines. This command is only intended to be used by a human. The output is not meant to be parsed by anything less sophisticated than a human brain. This means that it serves the same purpose as ls, as I noted in the comments. It shares the same newline limitation. We live in an imperfect world, so sometimes we need imperfect tools.


GreyCat/BashProgramming/03 (last edited 2017-05-26 14:36:01 by GreyCat)