Why you shouldn't parse the output of ls(1)
The ls(1) command is pretty good at showing you the attributes of a single file (at least in some cases), but when you ask it for a list of files, there's a huge problem: Unix allows almost any character in a filename, including whitespace, newlines, commas, pipe symbols, and pretty much anything else you'd ever try to use as a delimiter except NUL. There are proposals to try and "fix" this within POSIX, but they won't help in dealing with the current situation (see also how to deal with filenames correctly). In its default mode, if standard output isn't a terminal, ls separates filenames with newlines. This is fine until you have a file with a newline in its name. And since I don't know of any implementation of ls that allows you to terminate filenames with NUL characters instead of newlines, this leaves us unable to get a list of filenames safely with ls.
$ touch 'a space' $'a\nnewline'
$ echo "don't taze me, bro" > a
$ ls | cat
a
a
newline
a space
This output appears to indicate that we have two files called a, one called newline and one called a space.
Using ls -l we can see that this isn't true at all:
$ ls -l
total 8
-rw-r----- 1 lhunath lhunath 19 Mar 27 10:47 a
-rw-r----- 1 lhunath lhunath  0 Mar 27 10:47 a?newline
-rw-r----- 1 lhunath lhunath  0 Mar 27 10:47 a space
The problem is that from the output of ls, neither you nor the computer can tell which parts of it constitute a filename. Is it each word? No. Is it each line? No. There is no correct answer to this question other than: you can't tell.
Also notice how ls sometimes garbles your filename data (in our case, it turned the newline character between the words "a" and "newline" into a question mark; some systems print a \n instead). On some systems it doesn't do this when its output isn't a terminal; on others it always mangles the filename. All in all, you really can't and shouldn't trust the output of ls to be a true representation of the filenames that you want to work with. So don't.
Now that we've seen the problem, let's explore various ways of coping with it. As usual, we have to start by figuring out what we actually want to do.
Enumerating files or doing stuff with files
When people try to use ls to get a list of filenames (either all files, or files that match a glob, or files sorted in some way) things fail disastrously.
If you just want to iterate over all the files in the current directory, use a for loop and a glob:
# Good!
for f in *; do
    [[ -e $f ]] || continue
    ...
done
Consider also using "shopt -s nullglob" so that an empty directory won't give you a literal '*'.
# Good! (Bash-only)
shopt -s nullglob
for f in *; do
    ...
done
Never do these:
# BAD! Don't do this!
for f in $(ls); do
    ...
done
# BAD! Don't do this!
for f in $(find . -maxdepth 1); do    # find is just as bad as ls in this context
    ...
done
# BAD! Don't do this!
arr=($(ls))    # Word-splitting and globbing here, same mistake as above
for f in "${arr[@]}"; do
    ...
done
# BAD! Don't do this! (The function itself is correct.)
f() {
    local f
    for f; do
        ...
    done
}
f $(ls)    # Word-splitting and globbing here, same mistake as above.
See BashPitfalls and DontReadLinesWithFor for more details.
Things get more difficult if you wanted some specific sorting that only ls can do, such as ordering by mtime. If you want the oldest or newest file in a directory, don't use ls -t | head -1 -- read Bash FAQ 99 instead. If you truly need a list of all the files in a directory in order by mtime so that you can process them in sequence, switch to perl, and have your perl program do its own directory opening and sorting. Then do the processing in the perl program, or -- worst case scenario -- have the perl program spit out the filenames with NUL delimiters.
Even better, put the modification time in the filename, in YYYYMMDD format, so that glob order is also mtime order. Then you don't need ls or perl or anything. (The vast majority of cases where people want the oldest or newest file in a directory can be solved just by doing this.)
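To make this concrete, here's a minimal sketch of the date-stamped-filename trick. The report-*.log names are hypothetical examples; the point is that with a YYYYMMDD stamp in the name, plain glob expansion already yields chronological order:

```bash
#!/bin/sh
# Sketch: if names carry a YYYYMMDD stamp, glob order is also mtime order.
# The report-*.log filenames are made-up examples.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch report-20240101.log report-20240215.log report-20240330.log
newest=
for f in report-*.log; do
    newest=$f    # globs expand in sorted order, so the last match is the newest
done
printf '%s\n' "$newest"    # prints report-20240330.log
```

No ls, no find, no perl: the sort happens for free because lexicographic order and chronological order coincide for YYYYMMDD names.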
You could patch ls to support a --null option and submit the patch to your OS vendor. That should have been done about 15 years ago.
Of course, the reason that wasn't done is that very few people really need the sorting of ls in their scripts. Mostly, when people want a list of filenames, they use find(1) instead, because they don't care about the order. And BSD/GNU find has had the ability to terminate filenames with NULs for a very long time.
So, instead of this:
# Bad! Don't!
ls | while read filename; do
    ...
done
Try this:
# Be aware that this does not do the same as above.  It recurses into
# subdirectories and lists only normal files (i.e. no dirs or symlinks).
# It may work for some situations, but is not at all a replacement for
# the above.
find . -type f -print0 | while IFS= read -r -d '' filename; do
    ...
done
Even better, most people don't really want a list of filenames. They want to do things to files instead. The list is just an intermediate step to accomplishing some real goal, such as "change www.mydomain.com to mydomain.com in every *.html file". find can pass filenames directly to another command. There is usually no need to write the filenames out in a straight line and then rely on some other program to read the stream and separate the names back out.
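The www.mydomain.com example above can be sketched with find's -exec action, which hands the matched filenames straight to the command with no list ever written to a pipe. This assumes GNU sed (whose -i takes no backup suffix; BSD sed needs -i ''):

```bash
#!/bin/sh
# Sketch, assuming GNU sed: rewrite every *.html file in place,
# letting find pass the filenames directly to sed via -exec ... {} +
dir=$(mktemp -d)    # stand-in for your web tree
printf '<a href="http://www.mydomain.com/">home</a>\n' > "$dir/index.html"
find "$dir" -name '*.html' -type f \
    -exec sed -i 's/www\.mydomain\.com/mydomain.com/g' {} +
cat "$dir/index.html"
```

The {} + form batches many filenames per sed invocation, and because the names travel as argv entries rather than text on a stream, newlines in filenames are harmless.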
Getting metadata on a file
If you're after the file's size, the portable method is to use wc instead:
# POSIX
size=$(wc -c < "$file")
However, note that some implementations of wc will read the whole file rather than detecting that stdin is a regular file and getting the size from fstat(2).
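One more wrinkle worth knowing: some wc implementations left-pad the count with whitespace. A common sketch is to run the result through arithmetic expansion, which normalizes it to a bare number either way:

```bash
#!/bin/sh
# Sketch: normalize wc -c output, since some implementations pad the
# count with leading spaces.  Arithmetic expansion strips the padding.
file=$(mktemp)
printf 'hello\n' > "$file"
size=$(($(wc -c < "$file")))    # "hello\n" is 6 bytes
echo "$size"                    # prints 6
rm -f "$file"
```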
Other metadata is often hard to get at in a portable way. stat is not available on every platform, and where it exists, its argument syntax differs completely from one implementation to the next. There is no way to use stat that won't break on the next POSIX system you run the script on. That said, if you're OK with being non-portable, both the GNU implementations of stat(1) and find(1) (via the -printf option) are very good ways to get file information, depending on whether you want it for a single file or for multiple files. AST find also has -printf, but again with incompatible formats, and it's much less common than GNU find.
# GNU
size=$(stat -c %s -- "$file")
(( totalSize = $(find . -maxdepth 1 -type f -printf %s+)0 ))
If all else fails, you can try to parse certain metadata out of ls -l's output. Two big warnings: run ls with only one file at a time (remember, you can't reliably tell where the metadata of the second file starts, since there's no good delimiter, and no, a newline is not a good delimiter), and don't parse the timestamp or anything after it (the timestamp is usually formatted in a very platform- and locale-dependent manner and thus cannot be parsed reliably).
read mode links owner _ < <(ls -ld -- "$file")
Note that the mode string is also often platform-specific. E.g. OS X adds an @ for files with xattrs and a + for files with extended security information.
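Putting the one-file-at-a-time rule to work, here's a small bash sketch of pulling the mode, link count, and owner out of a single ls -ld call (the /tmp path is just an example; any single file or directory works):

```bash
#!/usr/bin/env bash
# Sketch (bash): read the leading fields of one ls -ld call.
# Only one file per call, and nothing at or past the timestamp is parsed.
file=/tmp    # example path
read -r mode links owner _ < <(ls -ld -- "$file")
printf 'mode=%s owner=%s\n' "$mode" "$owner"
```

The trailing _ soaks up everything from the group field onward, so the unreliable timestamp fields are never touched.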
In case you don't believe us, here's why not to try to parse the timestamp:
# Debian unstable:
$ ls -l
-rw-r--r-- 1 wooledg wooledg  240 2007-12-07 11:44 file1
-rw-r--r-- 1 wooledg wooledg 1354 2009-03-13 12:10 file2

# OpenBSD 4.4:
$ ls -l
-rwxr-xr-x 1 greg greg 1080 Nov 10  2006 file1
-rw-r--r-- 1 greg greg 1020 Mar 15 13:57 file2
On OpenBSD, as on most versions of Unix, ls shows the timestamps in three fields -- month, day, and year-or-time, with the last field being the time (hours:minutes) if the file is less than 6 months old, or the year if the file is more than 6 months old. On Debian unstable, with a fairly recent version of GNU coreutils, ls shows the timestamps in two fields, with the first being Y-M-D and the second being H:M, no matter how old the file is. So, it should be pretty obvious we never want to have to parse the output of ls if we want a timestamp from a file. But for the fields before that, it's usually pretty reliable.
(Note: some versions of ls don't print the group ownership of a file by default, and require a -g flag to do so. Others print the group by default, and -g suppresses it. You've been warned.)
If we wanted to get metadata from more than one file in the same ls command, we run into the same problem we had before -- files can have newlines in their names, which screws up our output. Imagine how code like this would break if we have a file with a newline in its name:
# Don't do this
{
    read 'perms[1]' 'links[1]' 'owner[1]' 'group[1]' _
    read 'perms[2]' 'links[2]' 'owner[2]' 'group[2]' _
} < <(ls -l "$file1" "$file2")
Similar code that uses two separate ls calls would probably be OK, since the second read command would be guaranteed to start reading at the beginning of an ls command's output, instead of possibly in the middle of a filename.
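A sketch of that safer two-call variant, in bash: each read starts at the beginning of its own ls output, so a newline embedded in one filename can't shift the fields of the other (the mktemp files stand in for $file1 and $file2):

```bash
#!/usr/bin/env bash
# Sketch: one ls -ld call per file, so each read is guaranteed to start
# at the beginning of that file's metadata line.
file1=$(mktemp)
file2=$(mktemp)
read -r 'perms[1]' 'links[1]' 'owner[1]' 'group[1]' _ < <(ls -ld -- "$file1")
read -r 'perms[2]' 'links[2]' 'owner[2]' 'group[2]' _ < <(ls -ld -- "$file2")
printf '%s %s\n' "${perms[1]}" "${perms[2]}"
rm -f "$file1" "$file2"
```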
If all of this sounds like a big bag of hurt to you, you're right. It probably isn't worth trying to dodge all this lack of standardization. See Bash FAQ 87 for some ways of getting file metadata without parsing ls output at all.