
Why you shouldn't parse the output of ls(1)

The ls(1) command is pretty good at showing you the attributes of a single file (at least in some cases), but when you ask it for a list of files, there's a huge problem: Unix allows almost any character in a filename, including whitespace, newlines, commas, pipe symbols, and pretty much anything else you'd ever try to use as a delimiter except NUL. In its default mode, if standard output isn't a terminal, ls separates filenames with newlines. This is fine until you have a file with a newline in its name. And since I don't know of any implementation of ls that allows you to terminate filenames with NUL characters instead of newlines, this leaves us unable to get a list of filenames safely with ls.

$ touch 'a space' $'a\nnewline'
$ echo "don't taze me, bro" > a
$ ls | cat
a
a
newline
a space

This output appears to indicate that we have two files called a, one called newline and one called a space.

Using ls -l we can see that this isn't true at all:

$ ls -l
total 8
-rw-r-----  1 lhunath  lhunath  19 Mar 27 10:47 a
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a?newline
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a space

The problem is that from the output of ls, neither you nor the computer can tell what parts of it constitute a filename. Is it each word? No. Is it each line? No. There is no correct answer to this question other than: you can't tell.

Also notice how ls sometimes garbles your filename data (in our case, it turned the newline character in between the words "a" and "newline" into a question mark). Generally speaking, though, it doesn't do this when its output isn't a terminal (which is why it didn't do it for the first example which piped ls into cat). All in all, you really can't and shouldn't trust the output of ls to be a true representation of the filenames that you want to work with. So don't.

Now that we've seen the problem, let's explore various ways of coping with it. As usual, we have to start by figuring out what we actually want to do.

As mentioned previously, sometimes people are trying to use ls to get some specific piece of metadata, either for a single file, or for multiple files. With a single file, it's actually not too bad:

read -r _ _ owner _ < <(ls -l "$file")

(See Bash FAQ 24 for the rationale behind the ProcessSubstitution there.)

This is a little bit messier than non-portable means of acquiring the same data (like the stat(1) command), but even here, we still run into problems. The way ls reports timestamps, for example, is an unholy disaster.
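For comparison, the stat(1) route for a single attribute might look like this (a sketch only; the flags genuinely differ between implementations, and "$file" is a placeholder for your own filename):

```shell
# GNU coreutils stat(1), as on Linux: -c takes a format string,
# and %U expands to the owner's user name.
owner=$(stat -c %U "$file")

# BSD/macOS stat(1) spells the same request differently:
#   owner=$(stat -f %Su "$file")

printf '%s\n' "$owner"
```

Neither spelling is portable to the other system, which is exactly why people fall back on parsing ls in the first place.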

# Debian unstable:
$ ls -l
-rw-r--r-- 1 wooledg wooledg       240 2007-12-07 11:44 file1
-rw-r--r-- 1 wooledg wooledg      1354 2009-03-13 12:10 file2

# OpenBSD 4.4:
$ ls -l
-rwxr-xr-x  1 greg  greg  1080 Nov 10  2006 file1
-rw-r--r--  1 greg  greg  1020 Mar 15 13:57 file2

On OpenBSD, as on most versions of Unix, ls shows the timestamps in three fields -- month, day, and year-or-time, with the last field being the time (hours:minutes) if the file is less than 6 months old, or the year if the file is more than 6 months old. On Debian unstable, with a fairly recent version of GNU coreutils, ls shows the timestamps in two fields, with the first being Y-M-D and the second being H:M, no matter how old the file is. So, it should be pretty obvious we never want to have to parse the output of ls if we want a timestamp from a file. But for the fields before that, it's usually pretty reliable.

(Note: some versions of ls don't print the group ownership of a file by default, and require a -g flag to do so. Others print the group by default, and -g suppresses it. You've been warned.)

However, if we wanted to get metadata from more than one file in the same ls command, we run into the same problem we had before -- files can have newlines in their names, which screws up our output. Imagine how code like this would break if we have a file with a newline in its name:

# Don't do this
{ read 'perms[1]' 'links[1]' 'owner[1]' 'group[1]' _
  read 'perms[2]' 'links[2]' 'owner[2]' 'group[2]' _
} < <(ls -l "$file1" "$file2")

Similar code that uses two separate ls calls would probably be OK, since the second read command would be guaranteed to start reading at the beginning of an ls command's output, instead of possibly in the middle of a filename.
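The two-call variant might be sketched like this (process substitution is bash-only, and "$file1"/"$file2" are placeholders):

```shell
# Each ls invocation describes exactly one file, so each read is
# guaranteed to begin at the start of that file's output line.
read -r perms1 links1 owner1 group1 _ < <(ls -ld "$file1")
read -r perms2 links2 owner2 group2 _ < <(ls -ld "$file2")
```

(-d keeps ls from listing the contents if one of the names happens to be a directory.)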

See Bash FAQ 87 for some ways of getting file metadata without parsing ls output at all.

Now, the bigger problem is when people try to use ls to get a list of filenames (either all files, or files that match a glob, or files sorted in some way). This is where things fail disastrously.

If you just want to iterate over all the files in the current directory, use a for loop and a glob:

for f in *; do
 ...
done
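One caveat worth knowing with that loop (bash-specific): in an empty directory, an unmatched * is left as the literal string *, so the loop body runs once with f=*. The nullglob option makes the glob expand to nothing instead:

```shell
shopt -s nullglob     # bash: unmatched globs expand to nothing
for f in ./*; do      # the ./ prefix also guards against names starting with -
  printf 'found %s\n' "$f"
done
```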

See BashPitfalls for more details. Never do this:

# BAD!  Don't do this!
for f in $(ls); do
 ...
done

Again, BashPitfalls will tell you why, if you don't already know.

Things get more difficult if you wanted some specific sorting that only ls can do, such as ordering by mtime. If you want the oldest or newest file in a directory, don't use ls -t | head -1 -- read Bash FAQ 99 instead. If you truly need a list of all the files in a directory in order by mtime so that you can process them in sequence, switch to perl, and have your perl program do its own directory opening and sorting. Then do the processing in the perl program, or -- worst case scenario -- have the perl program spit out the filenames with NUL delimiters.
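If you're on a GNU userland anyway, a shell-only sketch is possible without perl (non-POSIX: it relies on GNU find's -printf and GNU sort's -z): print each file's mtime and name NUL-terminated, sort numerically, then strip the timestamp back off.

```shell
# Oldest-first by mtime; filenames with newlines survive because
# every record is NUL-terminated end to end.
find . -maxdepth 1 -type f -printf '%T@ %p\0' |
  sort -zn |
  while IFS= read -r -d '' entry; do
    f=${entry#* }              # drop the leading "epoch.fraction " prefix
    printf 'processing %s\n' "$f"
  done
```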

Or patch ls to support a --null option and submit the patch to your OS vendor. That should have been done about 15 years ago.

Of course, the reason that wasn't done was because very few people really need the sorting of ls in their scripts. Mostly, if people want a list of filenames, they use find(1) instead. And BSD/GNU find has had the ability to terminate filenames with NULs for a very long time.

(Even better, most people don't really want a list of filenames. They want to do things to files instead. The list is just an intermediate step to accomplishing some real goal, such as "change www.mydomain.com to mydomain.com in every *.html file". find can process files without ever writing their names out in a straight line and then relying on some other process to read data from a straight line and separate the names back out.)
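For that example goal, a sketch that never serializes a filename list at all (note sed -i is a GNU extension; BSD sed wants -i ''):

```shell
# sed receives the filenames as arguments via -exec ... {} +,
# so no delimiter ever has to be parsed back out of a stream.
find . -name '*.html' -type f \
  -exec sed -i 's/www\.mydomain\.com/mydomain.com/g' {} +
```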

So, instead of this:

# Bad!  Don't!
ls | while read filename; do
  ...
done

Try this:

find . -type f -print0 | while IFS= read -r -d '' filename; do
  ...
done
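One pitfall with that pipeline form: the while loop runs in a subshell, so any variables it sets are gone when the pipeline ends. Feeding find through process substitution (bash-only) keeps the loop in the current shell:

```shell
count=0
while IFS= read -r -d '' filename; do
  count=$((count+1))       # survives the loop: no subshell here
done < <(find . -type f -print0)
printf '%u files\n' "$count"
```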


CategoryShell

ParsingLs (last edited 2023-08-12 13:05:09 by StephaneChazelas)