<- [[../02|Tool selection]] | '''Working with files''' | [[../04|Collating with associative arrays]] ->

= Working with files =

On the previous page, we looked at some input file formats, and considered the choice of various tools that can read them. But many scripts deal with the files themselves, rather than what's inside them.

<<TableOfContents>>

== Filenames ==

On Unix systems, filenames may contain whitespace. This includes the space character, obviously. It also includes tabs, carriage returns, newlines, and more. Unix filenames may contain ''every character except / and NUL'', and / is obviously allowed in ''pathnames'' (which are filenames preceded by zero or more directory components, either relative like ./myscript or absolute like /etc/aliases).

It is a tragic mistake to write software that assumes filenames may be separated by spaces, or even newlines. Poorly written bash scripts are especially likely to be vulnerable to malicious or accidentally created unusual filenames. It's your job as the programmer to write scripts that don't fall over and die (or worse) when the user has a weird filename.

Iteration over filenames should be done by letting bash expand a [[glob]], '''never''' by ParsingLs. If you need to iterate recursively, you can use the globstar option and a glob containing `**`, or you can [[UsingFind|use find]]. I won't duplicate the UsingFind page here; you are expected to have read it. Later, we'll explore the glob-vs.-find choice in depth.

A ''single'' filename may be safely stored in a bash string variable. If you need to store multiple filenames for some reason, use an array variable. '''Never''' attempt to store multiple filenames in a string variable with whitespace between them.

In most cases, you shouldn't need to store multiple filenames anyway. Usually you can just iterate over the files once, and don't need to store more than one filename at a time. Of course, this depends on what the script is doing.

Sometimes your script will need to read, or write, a file which contains a list of filenames, one per line. If this is an external demand imposed on you, then there's not much you can do about it. You'll have to deal with the fact that a filename containing a newline is going to break your script (or the thing reading your output file). If you're writing the file, you could choose to omit the unusual filename altogether (with or without an error message).

If you're using a file as an internal storage dump, you may safely store the list of filenames in a file if they are delimited by NUL characters instead of newlines. If they're in an array, this is trivial:

{{{
printf '%s\0' "${files[@]}" > "$outputfile"
}}}

To read such a file into an array, in bash 4.4:

{{{
mapfile -t -d '' files < "$inputfile"
}}}

Or in older bashes:

{{{
files=()
while IFS= read -r -d '' file; do
    files+=("$file")
done < "$inputfile"
}}}

This serialization works for ''any'' array, not just filenames. Bash arrays hold strings, and strings can't contain NUL bytes.
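Here's a quick round-trip sketch putting the two halves together (the sample strings and the `$tmpfile` path are just placeholders for illustration):

{{{
# Round-trip illustration: serialize an array to a file, then restore it.
# The sample data and tmpfile path are placeholders, not part of a real script.
tmpfile=/tmp/mydump
items=('two words' $'has a\nnewline' 'plain')
printf '%s\0' "${items[@]}" > "$tmpfile"

restored=()
while IFS= read -r -d '' item; do
    restored+=("$item")
done < "$tmpfile"
printf 'restored %s items\n' "${#restored[@]}"    # restored 3 items
}}}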
== Opening and closing ==

(Introductory material: [[Redirection]], FileDescriptor.)

Simple bash scripts will read from stdin and write to stdout/stderr, and never need to worry about opening and closing files. The caller will take care of that, usually by doing its own redirections. Slightly more complex scripts may open the occasional file by name, usually a single output file for logging results. This may be done on a per-command basis:

{{{
myfunc >"$log" 2>&1
}}}

or by redirecting stdout/stderr once, at the top of the script:

{{{
exec >"$log" 2>&1

myfunc
anotherfunc
}}}

In the latter case, all commands executed by the script after `exec` inherit the redirected stdout/stderr, just as if the caller had launched the script with that redirection in the first place.

The `exec` command doubles as both "open" and "close" in shell scripts. When you open a file, you decide on a file descriptor number to use. This FD number will be what you use to read from/write to the file, and to close it. (Bash 4.1 lets you open files without hard-coding a FD number, instead using a variable to let bash tell you what FD number it assigned. We won't cover this here.)

Scripts may safely assume that they inherit FD 0, 1 and 2 from the caller. FD 3 and higher are therefore typically available for you to use. (If your caller is doing something special with open file descriptors, you'll need to learn about that and deal with it. For now, we'll assume no such special arrangements.)

Bash and sh can open files in 4 different modes:

 * '''Read''': `exec 3<"$file"`
 * '''Write''': `exec 3>"$file"`
 * '''Append''': `exec 3>>"$file"`
 * '''Read+write''': `exec 3<>"$file"`

Opening a file for write will clobber (destroy the contents of) any existing file by that name, even if you don't actually write anything to that FD. You can set the ''noclobber'' option (`set -C`) if this is a concern. I've never actually seen that used in a real script. (It may be more common in interactive shells.)

Opening a file for append means every write to the file is preceded (atomically, magically) by a seek-to-end-of-file. This means two or more processes may open the file for append simultaneously, and each one's writes will appear at the end of the file as expected. (Do ''not'' attempt this with two processes opening a file for write. The semantics are entirely different.)

The read+write mode is normally used with network sockets, not regular files.

Closing a file is simple: `exec 3>&-` or `exec 3<&-` (either one should work).

== Reading and writing with file descriptors ==

To read from an FD, you take a command that would normally read stdin, and you perform a redirection:

{{{
read -r -s -p 'Password: ' pwd <&3
}}}

To write to an FD, you do the same thing using stdout:

{{{
printf '%s\n' "$message" >&3
}}}

Here's a realistic example:

{{{
while read -r host <&3; do
    ssh "$host" ...
done 3<"$hostlist"
}}}

SSH slurps stdin, which would interfere with the reading of our hostlist file. So we do that reading on a separate FD, and voila. Note that you ''don't'' do this: `while read -r host <"$hostlist"; do ...`. That would reopen the file every time we hit the top of the loop, and keep reading the same host over and over.

The placement of the "open" at the bottom of the loop may seem a bit weird if you're not used to bash programming. In fact, this syntax is really just a shortcut. If you prefer, you could write it out the long way:

{{{
exec 3<"$hostlist"
while read -r host <&3; do
    ssh "$host" ...
done
exec 3<&-
}}}

And here's an example using an output FD:

{{{
exec 3>>"$log"
log() { printf '%s\n' "$*" >&3; }
}}}

Each time the `log` function is called, a message will be written to the open FD. This is more efficient than putting `>>"$log"` at the end of each printf command, and easier to type.
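To see the whole pattern in one place, here's a minimal script skeleton (the log path, the timestamp format and the messages are my own placeholders):

{{{
#!/bin/bash
# Open the log once; everything below writes to FD 3.
# /var/tmp/myscript.log is a placeholder path.
exec 3>>/var/tmp/myscript.log
log() { printf '%s %s\n' "$(date +%T)" "$*" >&3; }

log 'starting up'
# ... do the real work here ...
log 'finished'

exec 3>&-    # close the log FD
}}}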
== Operating on files in bulk ==

As we discussed earlier, there are two fundamental ways you can operate on multiple files: expanding a [[glob]], or UsingFind. When using `find`, there are actually two approaches you can take: you can use `-exec` to have `find` perform some action, or you can read the names in your script.

Which tool and which approach you use depends on what your script needs to do. Ultimately you as the programmer must make all such decisions. I can only present some common guidelines:

 * If your script needs to store information about files, then `find -exec` is probably not the approach you want. `find` performs its actions as a child process of your script, so you don't actually ''know'' anything about what it's doing. If you need to store information, then you will want to process the filenames yourself, which means you either read `find`'s output, or you go with a glob.
 * If you need to select files based on any metadata other than their names (owner, permissions, etc.) then you definitely want `find`.
 * If you ''don't'' want to recurse, then you probably want to use a glob. `find` always recurses. GNU `find` has a nonstandard extension that lets you control this, but in a portable script, that won't be an option.
 * Bash's globs ''can'' recurse (as of bash 4.0 and the `globstar` option), but if you need to target systems with older versions of bash, recursion is going to mean `find`.

Using a glob is simple. A glob expands to a ''list'' of filenames, which is a thing that exists only ephemerally, for the duration of the command that contains the expansion. Normally this is exactly what you want.

{{{
for f in *.mp3; do
    ...
done
}}}

The list that results from expanding `*.mp3` lives somewhere in bash's dynamic memory regions. It's not accessible to you, and you don't need it to be, because your loop is just handling one file at a time.

If for some reason you want to store this list, you can use an array.

{{{
files=(*.mp3)
}}}

This is typically only required if you want to do something like counting the number of files, or iterating over the list multiple times, or determining the first or last file in the expansion. (Glob expansions are sorted according to the rules of your [[locale]], specifically the `LC_COLLATE` variable. If you wanted to get the first or last file when sorted by some other criteria, such as modification time, that is an entirely separate problem, enormously more difficult.)

Remember, you can also use ''extended globs'' if those will help you. For example, `!(*~)` would expand to all of the files that ''don't'' end with `~`. Recall that if you intend to enable `extglob` in a script, you must do it ''early'' in the script, ''not'' inside of a function or other compound command that attempts to use extended globs.

When using `find`, as mentioned earlier, you have two basic choices: let `find` act on the files via `-exec`, or retrieve the names within your script. Each approach has its merits, so it's useful for you to understand both of them.

Conceptually, retrieving the names is simpler, because it shares the same basic structure as the for loop using a glob. However, `find` '''does not''' produce a list; it produces a data stream, which we have to parse. Therefore we don't use `for`. We use `while read` instead.

{{{
while IFS= read -r -d '' f; do
    ...
done < <(find . -type f -print0)
}}}

Remember, pathnames may contain newlines, so the ''only'' delimiter that can safely separate pathnames in a stream is the NUL byte. GNU and BSD `find` commands have the `-print0` option to delimit the stream this way, and you may use that as long as you are only targeting such systems.
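As a worked instance of that template, here's a sketch that totals the sizes of all regular files under the current directory (the size-summing task is just my illustration; `wc -c < file` is a portable way to get a byte count):

{{{
# Sum the sizes of every regular file, using the NUL-delimited read loop.
total=0
while IFS= read -r -d '' f; do
    size=$(wc -c < "$f")    # byte count of one file
    (( total += size ))
done < <(find . -type f -print0)
printf 'total bytes: %d\n' "$total"
}}}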
If you need to target systems that have only POSIX `find`, this workaround is portable:

{{{
while IFS= read -r -d '' f; do
    ...
done < <(find . -type f -exec printf %s\\0 {} +)
}}}

This is notably less efficient than `-print0`, of course.

In both cases, the `read` command does our parsing for us. We tell it to use a NUL delimiter between files with the `-d ''` option, which is an undocumented feature of bash, but [[https://lists.gnu.org/archive/html/bug-bash/2016-01/msg00121.html|supported by the maintainer]]. The `-r` option suppresses backslash mangling, and setting `IFS=` suppresses leading/trailing space trimming. This basic template for reading a NUL-delimited stream is extremely important, and you should be absolutely sure you understand it.

If you want to store `find` results in an array, you can use this same template, and simply put an array element assignment inside the loop. In bash 4.4, `mapfile` was also given the `-d` option, which you may use if you're targeting such systems:

{{{
mapfile -t -d '' files < <(find ... -print0)
}}}

Storing an entire hierarchy of filenames in an array shouldn't be a ''common'' choice, but it's there if you need it.

That leaves the more difficult approach: using `-exec` to delegate actions to a grandchild process. If the delegation is simple, then this may not actually be so difficult, but if we want to do anything subtle or complicated, then this becomes an ''interesting'' tool.

The fundamental point you must remember is that `{}` has to appear ''directly'' before `+` with no intervening arguments. So, for example, you can do this:

{{{
find ... -exec dos2unix {} +
}}}

But you '''cannot''' do this:

{{{
find ... -exec mv {} /destination +    # Does not work.
}}}

Any time you want to run something that has the `{}` in the middle, or which would have multiple instances of `{}`, or which needs to manipulate the filename, you can `-exec` a shell and let the shell process each filename as an argument. In effect, you are writing a script within a script. The basic templates for this look like:

{{{
find ... -exec bash -c '... "$@" ...' x {} +
}}}

or

{{{
find ... -exec bash -c 'for f; do ...; done' x {} +
}}}

You can use `sh -c` if your mini-script doesn't rely on bash features. Remember, the argument that immediately follows `bash -c script` becomes argument 0 (`$0`) of the script, so you need to put a placeholder argument there. I'm using `x` in this document, but it can be literally any string you like. `find` puts a sub-list of filenames where the `{}` is, and those become the script's positional parameters (`"$@"`). `find` may choose to do this multiple times, if there are lots of files, so you will end up with one grandchild shell process for each such chunk of files.

Some examples:

{{{
find ... -exec sh -c 'mv -- "$@" /destination' x {} +
}}}

{{{
find ... -exec sh -c '
    for f; do
        dir=${f%/*} file=${f##*/}
        mkdir -p "/destination/$dir"
        convert ... "$f" ... "/destination/$dir/${file%.*}.png"
    done
' x {} +
}}}

Remember that you've got an outer layer of single quotes around your mini-script, so you can't use single quotes inside it, unless you write them as `'\''`. It's best to avoid writing anything that needs such a level of complexity. If you reach that level, you can put the mini-script in an ''actual'' script (a separate file), and `-exec` that directly. Or, you can write it as a function, and export it, and let the grandchild `bash -c` process import it automatically.

{{{
export -f myfunc
find ... -exec bash -c 'for f; do myfunc "$f"; done' x {} +
}}}
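Here's a slightly fuller sketch of that exported-function technique (the `big` function and its 1 MiB threshold are purely illustrative):

{{{
#!/bin/bash
# The task (reporting files over 1 MiB) is a placeholder; the point is
# the export -f plumbing between the script and the grandchild bash.
big() {
    local f=$1 size
    size=$(wc -c < "$f")
    if (( size > 1048576 )); then
        printf 'large: %s (%d bytes)\n' "$f" "$size"
    fi
}
export -f big
find . -type f -exec bash -c 'for f; do big "$f"; done' x {} +
}}}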
Finally, I leave you with this example, which synthesizes many of the techniques we've already discussed:

{{{
rlart() {
    # Recursive version of "ls -lart".  Show files sorted by mtime, recursively.
    # Requires GNU find and GNU sort.
    find "${1:-.}" -type f -printf '%T@ %TY-%Tm-%Td %TT %p\0' |
    sort -zn |
    while read -rd '' _ day time path; do
        printf '%s %s %s\n' "$day" "${time%.*}" "$path"
    done
}
}}}

Sorting is something bash can't do internally; scripts are expected to call `sort(1)` instead. So we need to provide a stream that `sort` can sort how we want. I use GNU `find`'s `-printf` option to format the fields of the data stream for each pathname, using an explicit NUL delimiter. GNU `sort` has a `-z` option to accept an input stream with NUL delimiters, so everything works together.

How did I come up with this? Simply break the problem down into steps:

 1. We're going to need to use `find` because we're recursing. GNU `find` can produce output in any format. We want to see the modification date & time and full pathname, so read the man page and figure out what syntax to use for those.
 1. GNU `sort` can sort the input on whatever field we like. What field should we provide to make sorting as easy as possible? Obviously the Unix timestamp (seconds since epoch) is the easiest, so we'll add that too. (I ''could'' have chosen to sort on the human-readable date and time fields. I didn't.)
 1. We don't actually want to ''see'' the Unix timestamp in the final output, so we'll need to remove it after the sort.
 1. And hey, it turns out `-printf` doesn't have a way to print seconds without annoying fractions out to 10 decimal places. We want to remove those too. So we might as well combine these two removals into a single clean-up step.

The `-printf` format that I chose produces output like:

{{{
1491425037.8232634170 2017-04-05 16:43:57.8232634170 .bashrc
}}}

We want to remove the entire first field (plus the whitespace after it), and make a modification to the third field. We want to leave everything after the third field untouched, no matter what crazy internal whitespace it has. This matches up very nicely with the shell's `read` command (not so nicely with awk), so I chose a `while read` loop to do the final clean-up.

You may have noticed that we're producing a newline-delimited output stream, which is a problem if one of the filenames contains newlines. This command is only intended to be used by a human. The output is not meant to be parsed by anything less sophisticated than a human brain. This means that it serves the same purpose as `ls`, as I noted in the comments. It shares the same newline limitation. We live in an imperfect world, so sometimes we need imperfect tools.

----
<- [[../02|Tool selection]] | '''Working with files''' | [[../04|Collating with associative arrays]] ->