<- [[../02|Tool selection]] | '''Working with files''' | [[../04|Collating with associative arrays]] ->

= Working with files =

On the previous page, we looked at some input file formats, and considered the choice of various tools that can read them. But many scripts deal with the files themselves, rather than what's inside them.

<<TableOfContents>>

== Filenames ==

On Unix systems, filenames may contain whitespace. This includes the space character, obviously. It also includes tabs, carriage returns, newlines, and more. Unix filenames may contain ''every character except / and NUL'', and / is obviously allowed in ''pathnames'' (which are filenames preceded by zero or more directory components, either relative like ./myscript or absolute like /etc/aliases).

It is a tragic mistake to write software that assumes filenames may be separated by spaces, or even newlines. Poorly written bash scripts are especially likely to be vulnerable to malicious or accidentally created unusual filenames. It's your job as the programmer to write scripts that don't fall over and die (or worse) when the user has a weird filename.

Iteration over filenames should be done by letting bash expand a [[glob]], '''never''' by ParsingLs. If you need to iterate recursively, you can use the globstar option and a glob containing `**`, or you can [[UsingFind|use find]]. I won't duplicate the UsingFind page here; you are expected to have read it. Later, we'll explore the glob-vs.-find choice in depth.

A ''single'' filename may be safely stored in a bash string variable. If you need to store multiple filenames for some reason, use an array variable. '''Never''' attempt to store multiple filenames in a string variable with whitespace between them.

In most cases, you shouldn't need to store multiple filenames anyway. Usually you can just iterate over the files once, and don't need to store more than one filename at a time. Of course, this depends on what the script is doing.

Sometimes your script will need to read, or write, a file which contains a list of filenames, one per line. If this is an external demand imposed on you, then there's not much you can do about it. You'll have to deal with the fact that a filename containing a newline is going to break your script (or the thing reading your output file). If you're writing the file, you could choose to omit the unusual filename altogether (with or without an error message).

If you're using a file as an internal storage dump, you may safely store the list of filenames in a file if they are delimited by NUL characters instead of newlines. If they're in an array, this is trivial:

{{{
printf '%s\0' "${files[@]}" > "$outputfile"
}}}

To read such a file into an array, in bash 4.4:

{{{
mapfile -t -d '' files < "$inputfile"
}}}

Or in older bashes:

{{{
files=()
while IFS= read -r -d '' file; do
    files+=("$file")
done < "$inputfile"
}}}

This serialization works for ''any'' array, not just filenames. Bash arrays hold strings, and strings can't contain NUL bytes.
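Here's a quick round-trip sketch putting the two halves together (the sample strings and the `$tmpfile` path are just placeholders for illustration):

{{{
# Round-trip illustration: serialize an array to a file, then restore it.
# The sample data and tmpfile path are placeholders, not part of a real script.
tmpfile=/tmp/mydump
items=('two words' $'has a\nnewline' 'plain')
printf '%s\0' "${items[@]}" > "$tmpfile"

restored=()
while IFS= read -r -d '' item; do
    restored+=("$item")
done < "$tmpfile"
printf 'restored %s items\n' "${#restored[@]}"    # restored 3 items
}}}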
== Opening and closing ==

(Introductory material: [[Redirection]], FileDescriptor.)

Simple bash scripts will read from stdin and write to stdout/stderr, and never need to worry about opening and closing files. The caller will take care of that, usually by doing its own redirections. Slightly more complex scripts may open the occasional file by name, usually a single output file for logging results. This may be done on a per-command basis:

{{{
myfunc >"$log" 2>&1
}}}

or by redirecting stdout/stderr once, at the top of the script:

{{{
exec >"$log" 2>&1

myfunc
anotherfunc
}}}

In the latter case, all commands executed by the script after `exec` inherit the redirected stdout/stderr, just as if the caller had launched the script with that redirection in the first place.

The `exec` command doubles as both "open" and "close" in shell scripts. When you open a file, you decide on a file descriptor number to use. This FD number will be what you use to read from/write to the file, and to close it. (Bash 4.1 lets you open files without hard-coding a FD number, instead using a variable to let bash tell you what FD number it assigned. We won't cover this here.)

Scripts may safely assume that they inherit FD 0, 1 and 2 from the caller. FD 3 and higher are therefore typically available for you to use. (If your caller is doing something special with open file descriptors, you'll need to learn about that and deal with it. For now, we'll assume no such special arrangements.)

Bash and sh can open files in 4 different modes:

 * '''Read''': `exec 3<"$file"`
 * '''Write''': `exec 3>"$file"`
 * '''Append''': `exec 3>>"$file"`
 * '''Read+write''': `exec 3<>"$file"`

Opening a file for write will clobber (destroy the contents of) any existing file by that name, even if you don't actually write anything to that FD. You can set the ''noclobber'' option (`set -C`) if this is a concern. I've never actually seen that used in a real script. (It may be more common in interactive shells.)

Opening a file for append means every write to the file is preceded (atomically, magically) by a seek-to-end-of-file. This means two or more processes may open the file for append simultaneously, and each one's writes will appear at the end of the file as expected. (Do ''not'' attempt this with two processes opening a file for write. The semantics are entirely different.)

The read+write mode is normally used with network sockets, not regular files.

Closing a file is simple: `exec 3>&-` or `exec 3<&-` (either one should work).

== Reading and writing with file descriptors ==

To read from an FD, you take a command that would normally read stdin, and you perform a redirection:

{{{
read -r -s -p 'Password: ' pwd <&3
}}}

To write to an FD, you do the same thing using stdout:

{{{
printf '%s\n' "$message" >&3
}}}

Here's a realistic example:

{{{
while read -r host <&3; do
    ssh "$host" ...
done 3<"$hostlist"
}}}

SSH slurps stdin, which would interfere with the reading of our hostlist file. So we do that reading on a separate FD, and voila. Note that you ''don't'' do this: `while read -r host <"$hostlist"; do ...`. That would reopen the file every time we hit the top of the loop, and keep reading the same host over and over.

The placement of the "open" at the bottom of the loop may seem a bit weird if you're not used to bash programming. In fact, this syntax is really just a shortcut. If you prefer, you could write it out the long way:

{{{
exec 3<"$hostlist"
while read -r host <&3; do
    ssh "$host" ...
done
exec 3<&-
}}}

And here's an example using an output FD:

{{{
exec 3>>"$log"
log() { printf '%s\n' "$*" >&3; }
}}}

Each time the `log` function is called, a message will be written to the open FD. This is more efficient than putting `>>"$log"` at the end of each printf command, and easier to type.
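To see the whole pattern in one place, here's a minimal script skeleton (the log path, the timestamp format and the messages are my own placeholders):

{{{
#!/bin/bash
# Open the log once; everything below writes to FD 3.
# /var/tmp/myscript.log is a placeholder path.
exec 3>>/var/tmp/myscript.log
log() { printf '%s %s\n' "$(date +%T)" "$*" >&3; }

log 'starting up'
# ... do the real work here ...
log 'finished'

exec 3>&-    # close the log FD
}}}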
== Operating on files in bulk ==

As we discussed earlier, there are two fundamental ways you can operate on multiple files: expanding a [[glob]], or UsingFind. When using `find`, there are actually two approaches you can take: you can use `-exec` to have `find` perform some action, or you can read the names in your script.

Which tool and which approach you use depends on what your script needs to do. Ultimately you as the programmer must make all such decisions. I can only present some common guidelines:

 * If your script needs to store information about files, then `find -exec` is probably not the approach you want. `find` performs its actions as a child process of your script, so you don't actually ''know'' anything about what it's doing. If you need to store information, then you will want to process the filenames yourself, which means you either read `find`'s output, or you go with a glob.
 * If you need to select files based on any metadata other than their names (owner, permissions, etc.) then you definitely want `find`.
 * If you ''don't'' want to recurse, then you probably want to use a glob. `find` always recurses. GNU `find` has a nonstandard extension that lets you control this, but in a portable script, that won't be an option.
 * Bash's globs ''can'' recurse (as of bash 4.0 and the `globstar` option), but if you need to target systems with older versions of bash, recursion is going to mean `find`.

Using a glob is simple. A glob expands to a ''list'' of filenames, which is a thing that exists only ephemerally, for the duration of the command that contains the expansion. Normally this is exactly what you want.

{{{
for f in *.mp3; do
    ...
done
}}}

The list that results from expanding `*.mp3` lives somewhere in bash's dynamic memory regions. It's not accessible to you, and you don't need it to be, because your loop is just handling one file at a time.

If for some reason you want to store this list, you can use an array.

{{{
files=(*.mp3)
}}}

This is typically only required if you want to do something like counting the number of files, or iterating over the list multiple times, or determining the first or last file in the expansion. (Glob expansions are sorted according to the rules of your [[locale]], specifically the `LC_COLLATE` variable. If you wanted to get the first or last file when sorted by some other criteria, such as modification time, that is an entirely separate problem, enormously more difficult.)

Remember, you can also use ''extended globs'' if those will help you. For example, `!(*~)` would expand to all of the files that ''don't'' end with `~`. Recall that if you intend to enable `extglob` in a script, you must do it ''early'' in the script, ''not'' inside of a function or other compound command that attempts to use extended globs.

When using `find`, as mentioned earlier, you have two basic choices: let `find` act on the files via `-exec`, or retrieve the names within your script. Each approach has its merits, so it's useful for you to understand both of them.

Conceptually, retrieving the names is simpler, because it shares the same basic structure as the for loop using a glob. However, `find` '''does not''' produce a list; it produces a data stream, which we have to parse. Therefore we don't use `for`. We use `while read` instead.

{{{
while IFS= read -r -d '' f; do
    ...
done < <(find . -type f -print0)
}}}

Remember, pathnames may contain newlines, so the ''only'' delimiter that can safely separate pathnames in a stream is the NUL byte. GNU and BSD `find` commands have the `-print0` option to delimit the stream this way, and you may use that as long as you are only targeting such systems.
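As a worked instance of that template, here's a sketch that totals the sizes of all regular files under the current directory (the size-summing task is just my illustration; `wc -c < file` is a portable way to get a byte count):

{{{
# Sum the sizes of every regular file, using the NUL-delimited read loop.
total=0
while IFS= read -r -d '' f; do
    size=$(wc -c < "$f")    # byte count of one file
    (( total += size ))
done < <(find . -type f -print0)
printf 'total bytes: %d\n' "$total"
}}}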
If you need to target systems that have only POSIX `find`, this workaround is portable:

{{{
while IFS= read -r -d '' f; do
    ...
done < <(find . -type f -exec printf %s\\0 {} +)
}}}

This is notably less efficient than `-print0`, of course.

In both cases, the `read` command does our parsing for us. We tell it to use a NUL delimiter between files with the `-d ''` option, which is an undocumented feature of bash, but [[https://lists.gnu.org/archive/html/bug-bash/2016-01/msg00121.html|supported by the maintainer]]. The `-r` option suppresses backslash mangling, and setting `IFS=` suppresses leading/trailing space trimming. This basic template for reading a NUL-delimited stream is extremely important, and you should be absolutely sure you understand it.

If you want to store `find` results in an array, you can use this same template, and simply put an array element assignment inside the loop. In bash 4.4, `mapfile` was also given the `-d` option, which you may use if you're targeting such systems:

{{{
mapfile -t -d '' files < <(find ... -print0)
}}}

Storing an entire hierarchy of filenames in an array shouldn't be a ''common'' choice, but it's there if you need it.

That leaves the more difficult approach: using `-exec` to delegate actions to a grandchild process. If the delegation is simple, then this may not actually be so difficult, but if we want to do anything subtle or complicated, then this becomes an ''interesting'' tool.

The fundamental point you must remember is that `{}` has to appear ''directly'' before `+` with no intervening arguments. So, for example, you can do this:

{{{
find ... -exec dos2unix {} +
}}}

But you '''cannot''' do this:

{{{
find ... -exec mv {} /destination +    # Does not work.
}}}

Any time you want to run something that has the `{}` in the middle, or which would have multiple instances of `{}`, or which needs to manipulate the filename, you can `-exec` a shell and let the shell process each filename as an argument. In effect, you are writing a script within a script. The basic templates for this look like:

{{{
find ... -exec bash -c '... "$@" ...' x {} +
}}}

or

{{{
find ... -exec bash -c 'for f; do ...; done' x {} +
}}}

You can use `sh -c` if your mini-script doesn't rely on bash features. Remember, the argument that immediately follows `bash -c script` becomes argument 0 (`$0`) of the script, so you need to put a placeholder argument there. I'm using `x` in this document, but it can be literally any string you like. `find` puts a sub-list of filenames where the `{}` is, and those become the script's positional parameters (`"$@"`). `find` may choose to do this multiple times, if there are lots of files, so you will end up with one grandchild shell process for each such chunk of files.

Some examples:

{{{
find ... -exec sh -c 'mv -- "$@" /destination' x {} +
}}}

{{{
find ... -exec sh -c '
    for f; do
        dir=${f%/*} file=${f##*/}
        mkdir -p "/destination/$dir"
        convert ... "$f" ... "/destination/$dir/${file%.*}.png"
    done
' x {} +
}}}

Remember that you've got an outer layer of single quotes around your mini-script, so you can't use single quotes inside it, unless you write them as `'\''`. It's best to avoid writing anything that needs such a level of complexity. If you reach that level, you can put the mini-script in an ''actual'' script (a separate file), and `-exec` that directly. Or, you can write it as a function, and export it, and let the grandchild `bash -c` process import it automatically.

{{{
export -f myfunc
find ... -exec bash -c 'for f; do myfunc "$f"; done' x {} +
}}}
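Here's a slightly fuller sketch of that exported-function technique (the `big` function and its 1 MiB threshold are purely illustrative):

{{{
#!/bin/bash
# The task (reporting files over 1 MiB) is a placeholder; the point is
# the export -f plumbing between the script and the grandchild bash.
big() {
    local f=$1 size
    size=$(wc -c < "$f")
    if (( size > 1048576 )); then
        printf 'large: %s (%d bytes)\n' "$f" "$size"
    fi
}
export -f big
find . -type f -exec bash -c 'for f; do big "$f"; done' x {} +
}}}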
Finally, I leave you with this example, which synthesizes many of the techniques we've already discussed:

{{{
rlart() {
    # Recursive version of "ls -lart".  Show files sorted by mtime, recursively.
    # Requires GNU find and GNU sort.
    find "${1:-.}" -type f -printf '%T@ %TY-%Tm-%Td %TT %p\0' |
    sort -zn |
    while read -rd '' _ day time path; do
        printf '%s %s %s\n' "$day" "${time%.*}" "$path"
    done
}
}}}

Sorting is something bash can't do internally; scripts are expected to call `sort(1)` instead. So we need to provide a stream that `sort` can sort how we want. I use GNU `find`'s `-printf` option to format the fields of the data stream for each pathname, using an explicit NUL delimiter. GNU `sort` has a `-z` option to accept an input stream with NUL delimiters, so everything works together.

How did I come up with this? Simply break the problem down into steps:

 1. We're going to need to use `find` because we're recursing. GNU `find` can produce output in any format. We want to see the modification date & time and full pathname, so read the man page and figure out what syntax to use for those.
 1. GNU `sort` can sort the input on whatever field we like. What field should we provide to make sorting as easy as possible? Obviously the Unix timestamp (seconds since epoch) is the easiest, so we'll add that too. (I ''could'' have chosen to sort on the human-readable date and time fields. I didn't.)
 1. We don't actually want to ''see'' the Unix timestamp in the final output, so we'll need to remove it after the sort.
 1. And hey, it turns out `-printf` doesn't have a way to print seconds without annoying fractions out to 10 decimal places. We want to remove those too. So we might as well combine these two removals into a single clean-up step.

The `-printf` format that I chose produces output like:

{{{
1491425037.8232634170 2017-04-05 16:43:57.8232634170 .bashrc
}}}

We want to remove the entire first field (plus the whitespace after it), and make a modification to the third field. We want to leave everything after the third field untouched, no matter what crazy internal whitespace it has. This matches up very nicely with the shell's `read` command (not so nicely with awk), so I chose a `while read` loop to do the final clean-up.

You may have noticed that we're producing a newline-delimited output stream, which is a problem if one of the filenames contains newlines. This command is only intended to be used by a human. The output is not meant to be parsed by anything less sophisticated than a human brain. This means that it serves the same purpose as `ls`, as I noted in the comments. It shares the same newline limitation. We live in an imperfect world, so sometimes we need imperfect tools.

----
<- [[../02|Tool selection]] | '''Working with files''' | [[../04|Collating with associative arrays]] ->