Differences between revisions 34 and 35
Revision 34 as of 2023-06-15 16:42:57
Size: 12677
Editor: GreyCat
Comment: add ls --zero example
Revision 35 as of 2023-08-12 12:58:00
Size: 13865
Comment: need -t to remove the delimiter (for the day bash adds support for storing NULs in its variables). Avoid the mapfile misnomer. And more corrections/additions after review.
Line 43: Line 43:
    [[ -e $f ]] || continue     [ -e "$f" ] || [ -L "$f" ] || continue
Line 141: Line 141:
If all else fails, you can try to parse '''some''' metadata out of `ls -l`'s output. Two big warnings: If all else fails, you can try to parse '''some''' metadata out of `ls -l`'s output. A few warnings:
Line 145: Line 145:
 1. '''Do not forget the `-d` option''' without which if ever the file was of type ''directory'', the contents of that directory would be listed instead '''and the `--` delimiter''' to avoid problems with file names starting with `-`.
 1. '''Set the locale to C/POSIX for `ls`''' as the output format is unspecified outside of that locale. In particular the timestamp format is generally locale dependant, but anything else could.
 1. Remember that `read`'s splitting behaviour depends on the current value of `$IFS`
 1. Prefer the numeric output for ''owner'' and ''group'' with `-n` instead of `-l` as whilst very uncommon, user and group names could contain whitespace. User and group names may also be truncated.
Line 149: Line 154:
read -r mode links owner _ < <(ls -ld -- "$file") IFS=' ' read -r mode links owner _ < <(LC_ALL=C ls -nd -- "$file")
Line 188: Line 193:
} < <(ls -l "$file1" "$file2") } < <(ls -ld -- "$file1" "$file2")
Line 203: Line 208:
mapfile -d '' -n 5 sorted < <(ls --zero -tr)
(( ${#sorted[@]} )) && rm -- "${sorted[@]}"
}}}

Less recent (''circa'' 2016) versions of GNU coreutils have a `--quoting-style` option with various choices, most of which are useless for bash scripting purposes, and ''all'' of which are useless for human readability.

We mention it here because ''one'' of the quoting style options ''is actually useful'' when combined with bash's `eval` command. Specifically, `--quoting-style=shell-escape` produces output that bash (but ''not'' POSIX sh) can parse back into filenames.
readarray -t -d '' -n 5 sorted < <(ls --zero -tr)
(( ${#sorted[@]} == 0 )) || rm -- "${sorted[@]}"
}}}

Less recent (''circa'' 2016) versions of GNU coreutils have a `--quoting-style` option with various choices.

One of them ''is actually useful'' when combined with bash's `eval` command. Specifically, `--quoting-style=shell-always` produces output that Bourne-like shells can parse back into filenames.
Line 213: Line 218:
$ ls --quoting-style=shell-escape
 yyy zzz 'zzz'$'\n''yyy'
}}}
$ ls --quoting-style=shell-always
'yyy' 'zzz' 'zzz?yyy'
$ ls --quoting-style=shell-always | cat
'yyy'
'zzz'
'zzz
yyy'
}}}

It uses always uses single quotes to quote file names (with singles quotes themselves rendered as `\'` outside of quotes) which is the only safe quoting method.

Note that some control characters are still rendered as `?` when the output goes to the terminal, but that doesn't happen for redirected output (like when piped to `cat` as seen above or more generally when the output is post-processed).
Line 223: Line 237:
eval "sorted=( $(ls -rt --quoting-style=shell-escape) )" eval "sorted=( $(ls -rt --quoting-style=shell-always) )"
Line 227: Line 241:
printf '<%s>\n' "${sorted[@]:0:5}"

# Or we can send them into xargs -0:
printf '%s\0' "${sorted[@]:0:5}" | xargs -0 something
(( ${#sorted[@]} == 0 )) || printf '<%s>\n' "${sorted[@]:0:5}"

# Or we can send them into xargs -r0:
print0() {
  [ "$#" -eq 0 ] || printf '%s\0' "$@"
}
print0 "${sorted[@]:0:5}" | xargs -r0 something
Line 235: Line 252:
In coreutils 8.25, `--quoting-style=shell-escape` became the default when `ls` is printing to a terminal, but ironically, ''not'' when printing to a pipe (e.g. when you're trying to use `ls` in scripts). You must request it explicitly in scripts. GNU `ls` also supports `--quoting-style=shell-escape` (which in version 8.25 became the default when `ls` is printing to a terminal), but that one is not as safe as it produces output that is not always quoted or uses quoting operators that are not portable or unsafe when used in some locales.

Why you shouldn't parse the output of ls(1)

The ls(1) command is pretty good at showing you the attributes of a single file (at least in some cases), but when you ask it for a list of files, there's a huge problem: Unix allows almost any character in a filename, including whitespace, newlines, commas, pipe symbols, and pretty much anything else you'd ever try to use as a delimiter except NUL. There are proposals to try and "fix" this within POSIX, but they won't help in dealing with the current situation (see also how to deal with filenames correctly). In its default mode, if standard output isn't a terminal, ls separates filenames with newlines. This is fine until you have a file with a newline in its name. Since very few implementations of ls allow you to terminate filenames with NUL characters instead of newlines, this leaves us unable to get a list of filenames safely with ls -- at least, not portably.

$ touch 'a space' $'a\nnewline'
$ echo "don't taze me, bro" > a
$ ls | cat
a
a
newline
a space

This output appears to indicate that we have two files called a, one called newline and one called a space.

Using ls -l we can see that this isn't true at all:

$ ls -l
total 8
-rw-r-----  1 lhunath  lhunath  19 Mar 27 10:47 a
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a?newline
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a space

The problem is that from the output of ls, neither you nor the computer can tell what parts of it constitute a filename. Is it each word? No. Is it each line? No. There is no correct answer to this question other than: you can't tell.

Also notice how ls sometimes garbles your filename data: in our case, it turned the newline character between the words "a" and "newline" into a question mark (some systems print \n instead). On some systems it doesn't do this when its output isn't a terminal, while on others it always mangles the filename. All in all, you really can't and shouldn't trust the output of ls to be a true representation of the filenames that you want to work with.
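To make the miscount concrete, here is a minimal sketch that counts the entries of a scratch directory two ways. The directory and file names are made up for the demonstration, and mktemp -d is assumed to be available (it is not in POSIX, though it is nearly universal). The newline is stored in a variable so the sketch does not depend on bash's $'...' quoting.

```shell
# Sketch: count directory entries by parsing ls versus by expanding a glob.
dir=$(mktemp -d) || exit 1
nl='
'
touch "$dir/a" "$dir/a space" "$dir/a${nl}newline"

# Wrong: counting lines of ls output also counts the newline inside a name.
ls_count=$(ls "$dir" | wc -l)
ls_count=$((ls_count))        # drop any padding wc may have added

# Right: let the shell expand the glob and count the resulting words.
glob_count=$(cd "$dir" && set -- * && echo "$#")

echo "ls | wc -l reports: $ls_count"
echo "the glob reports:   $glob_count"
rm -rf "$dir"
```

The glob count is the one that matches the three files that were actually created.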

Now that we've seen the problem, let's explore various ways of coping with it. As usual, we have to start by figuring out what we actually want to do.

Enumerating files or doing stuff with files

When people try to use ls to get a list of filenames (either all files, or files that match a glob, or files sorted in some way) things fail disastrously.

If you just want to iterate over all the files in the current directory, use a for loop and a glob:

# Good!
for f in *; do
    [ -e "$f" ] || [ -L "$f" ] || continue
    ...
done

Consider also using "shopt -s nullglob" so that an empty directory won't give you a literal '*'.

# Good! (Bash-only)
shopt -s nullglob
for f in *; do
    ...
done

Never do these:

# BAD! Don't do this!
for f in $(ls); do
    ...
done

# BAD! Don't do this!
for f in $(find . -maxdepth 1); do # find is just as bad as ls in this context
    ...
done

# BAD! Don't do this!
arr=($(ls)) # Word-splitting and globbing here, same mistake as above
for f in "${arr[@]}"; do
    ...
done

# BAD! Don't do this! (The function itself is correct.)
f() {
    local f
    for f; do
        ...
    done
}

f $(ls) # Word-splitting and globbing here, same mistake as above.

See BashPitfalls and DontReadLinesWithFor for more details.

Things get more difficult if you wanted some specific sorting that only ls can do, such as ordering by mtime. If you want the oldest or newest file in a directory, don't use ls -t | head -1 -- read Bash FAQ 99 instead. If you truly need a list of all the files in a directory in order by mtime so that you can process them in sequence, switch to perl, and have your perl program do its own directory opening and sorting. Then do the processing in the perl program, or -- worst case scenario -- have the perl program spit out the filenames with NUL delimiters.

Even better, put the modification time in the filename, in YYYYMMDD format, so that glob order is also mtime order. Then you don't need ls or perl or anything. (The vast majority of cases where people want the oldest or newest file in a directory can be solved just by doing this.)
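As a rough illustration of that idea, the sketch below picks the oldest and newest file purely from glob order. The backup-YYYYMMDD.tar naming scheme is an assumption invented for the example; any fixed-width date stamp that sorts lexically would do.

```shell
# Sketch: with a YYYYMMDD stamp in the name, glob order is date order.
dir=$(mktemp -d) || exit 1
touch "$dir/backup-20230812.tar" "$dir/backup-20221101.tar" "$dir/backup-20230615.tar"

oldest='' newest=''
for f in "$dir"/backup-*.tar; do
    [ -e "$f" ] || continue          # the glob matched nothing
    [ -n "$oldest" ] || oldest=$f    # first match is the oldest
    newest=$f                        # last match is the newest
done

printf 'oldest: %s\nnewest: %s\n' "${oldest##*/}" "${newest##*/}"
rm -rf "$dir"
```

Note that this orders files by the date in the name, not by the mtime the filesystem records, so it only works if whatever creates the files stamps them consistently.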

You could patch ls to support a --null option and submit the patch to your OS vendor. That should have been done about 15 years ago. (In fact, people tried, and it was rejected! See below.)

Of course, the reason that wasn't done is that very few people really need the sorting of ls in their scripts. Mostly, when people want a list of filenames, they use find(1) instead, because they don't care about the order. And BSD/GNU find has had the ability to terminate filenames with NULs for a very long time.

So, instead of this:

# Bad!  Don't!
ls | while read filename; do
  ...
done

Try this:

# Be aware that this does not do the same thing as the loop above: it recurses into subdirectories and lists only regular files (no directories or symlinks). It may work in some situations, but it is not a drop-in replacement for the above.
find . -type f -print0 | while IFS= read -r -d '' filename; do
  ...
done

Even better, most people don't really want a list of filenames. They want to do things to files instead. The list is just an intermediate step to accomplishing some real goal, such as change www.mydomain.com to mydomain.com in every *.html file. find can pass filenames directly to another command. There is usually no need to write the filenames out in a straight line and then rely on some other program to read the stream and separate the names back out.
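Here is one way the www.mydomain.com example could look, as a sketch rather than a recipe. find hands the filenames straight to a small inline sh script via -exec ... {} +, so no list of names is ever written out and parsed. The scratch directory and its files are invented for the demonstration, and the edit goes through a temporary file because sed -i is not portable.

```shell
# Sketch: let find pass the filenames directly to the command that edits them.
dir=$(mktemp -d) || exit 1
mkdir "$dir/sub dir"
printf '%s\n' '<a href="http://www.mydomain.com/">home</a>' > "$dir/sub dir/index page.html"
printf '%s\n' 'www.mydomain.com' > "$dir/notes.txt"

# Each batch of *.html files arrives as the positional parameters of the inline script.
find "$dir" -type f -name '*.html' -exec sh -c '
    for f do
        sed "s/www\.mydomain\.com/mydomain.com/g" "$f" > "$f.tmp" &&
            mv -- "$f.tmp" "$f"
    done' sh {} +

html=$(cat "$dir/sub dir/index page.html")
txt=$(cat "$dir/notes.txt")
printf '%s\n' "$html" "$txt"
rm -rf "$dir"
```

Spaces in the directory and file names are harmless here, because the names are never split back out of a text stream.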

Getting metadata on a file

If you're after the file's size, the portable method is to use wc instead:

# POSIX
size=$(wc -c < "$file")

Most implementations of wc will detect that stdin is a regular file, and get the size by calling fstat(2). However, this is not guaranteed. Some implementations may actually read all the bytes.
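One practical wrinkle: some wc implementations (BSD-derived ones, for example) pad the number with leading blanks. A small sketch of normalising the result with arithmetic expansion follows; the temporary file is just a stand-in for a real one.

```shell
# Sketch: get a file size with wc and strip any padding from the result.
file=$(mktemp) || exit 1
printf 'hello\n' > "$file"

size=$(wc -c < "$file")
size=$((size))      # arithmetic expansion drops leading blanks, if any

echo "$size bytes"
rm -f "$file"
```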

Other metadata is often hard to get at in a portable way. stat(1) is not available on every platform, and when it is, it often takes a completely different syntax of arguments. There is no way to use stat in a way that it won't break for the next POSIX system you run the script on. Though, if you're OK with that, both the GNU implementations of stat(1) and find(1) (via the -printf option) are very good ways to get file information, depending upon whether you want it for a single file or multiple files. AST find also has -printf, but again with incompatible formats, and it's much less common than GNU find.

# GNU
size=$(stat -c %s -- "$file")
(( totalSize = $(find . -maxdepth 1 -type f -printf %s+)0 ))

If all else fails, you can try to parse some metadata out of ls -l's output. A few warnings:

  1. Run ls with only one file at a time (remember, you can't reliably tell where the first filename ends, because there is no good delimiter -- and no, a newline is not a good enough delimiter -- so there's no way to tell where the second file's metadata starts).

  2. Don't parse the time/date stamp or beyond (the time/date fields are usually formatted in a very platform- and locale-dependent manner and thus cannot be parsed reliably).

  3. Do not forget the -d option, without which ls lists the contents of a directory instead of the directory itself, and the -- delimiter, which avoids problems with file names starting with -.

  4. Set the locale to C/POSIX for ls, as the output format is unspecified outside of that locale. In particular, the timestamp format is generally locale-dependent, but anything else could vary too.

  5. Remember that read's splitting behaviour depends on the current value of $IFS.

  6. Prefer the numeric output for owner and group with -n instead of -l: although it is very uncommon, user and group names could contain whitespace. User and group names may also be truncated.
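The $IFS warning is easy to see with a canned line. In this sketch the sample line is made up (it only looks like ls -n output), and a here-document feeds it to read so the example stays within POSIX sh.

```shell
# Sketch: the same line split by read under two different IFS values.
line='-rw-r--r-- 1 1000 1000 240 Dec  7 2007 file1'

# With an inherited IFS that lacks the space character, nothing gets split.
IFS=: read -r mode_bad links_bad _ <<EOF
$line
EOF

# Setting IFS for the read command itself gives the fields we expect.
IFS=' ' read -r mode links _ <<EOF
$line
EOF

printf 'IFS=":" -> mode=<%s> links=<%s>\n' "$mode_bad" "$links_bad"
printf 'IFS=" " -> mode=<%s> links=<%s>\n' "$mode" "$links"
```

Because the assignment is a prefix to read, it applies to that one command only and leaves the rest of the script's IFS alone.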

This much is relatively safe:

IFS=' ' read -r mode links owner _ < <(LC_ALL=C ls -nd -- "$file")

Note that the mode string is also often platform-specific. E.g. OS X adds an @ for files with xattrs and a + for files with extended security information. GNU sometimes adds a . or + character. So, you may need to limit the mode field to the first 10 characters, depending on what you're doing with it.

mode=${mode:0:10}
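Putting those pieces together, a sketch of the whole single-file case might look like the following. Two substitutions are assumptions made to keep it runnable in plain POSIX sh: a here-document stands in for the process substitution, and printf '%.10s' stands in for the ${mode:0:10} expansion. The file is a throwaway created with mktemp.

```shell
# Sketch: read the leading fields of `ls -nd` for one file, then trim the mode.
file=$(mktemp) || exit 1
chmod 640 "$file"

IFS=' ' read -r mode links owner group _ <<EOF
$(LC_ALL=C ls -nd -- "$file")
EOF

# Keep only the type character and the nine permission characters,
# dropping any platform-specific suffix such as @, + or .
mode=$(printf '%.10s' "$mode")

printf 'mode=%s links=%s uid=%s gid=%s\n' "$mode" "$links" "$owner" "$group"
rm -f "$file"
```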

In case you don't believe us, here's why not to try to parse the timestamp:

# OpenBSD 4.4:
$ ls -l
-rwxr-xr-x  1 greg  greg  1080 Nov 10  2006 file1
-rw-r--r--  1 greg  greg  1020 Mar 15 13:57 file2

# Debian unstable (2009):
$ ls -l
-rw-r--r-- 1 wooledg wooledg       240 2007-12-07 11:44 file1
-rw-r--r-- 1 wooledg wooledg      1354 2009-03-13 12:10 file2

On OpenBSD, as on most versions of Unix, ls shows the timestamps in three fields -- month, day, and year-or-time, with the last field being the time (hours:minutes) if the file is less than 6 months old, or the year if the file is more than 6 months old.

On Debian unstable (circa 2009), with a contemporary version of GNU coreutils, ls showed the timestamps in two fields, with the first being Y-M-D and the second being H:M, no matter how old the file is.

So, it should be pretty obvious we never want to have to parse the output of ls if we want a timestamp from a file. You'd have to write code to handle all three of the time/date formats shown above, and possibly more.

But for the fields before the date/time, it's usually pretty reliable.

(Note: some versions of ls don't print the group ownership of a file by default, and require a -g flag to do so. Others print the group by default, and -g suppresses it. You've been warned.)

If we wanted to get metadata from more than one file in the same ls command, we run into the same problem we had before -- files can have newlines in their names, which screws up our output. Imagine how code like this would break if we have a file with a newline in its name:

# Don't do this
{ read 'perms[1]' 'links[1]' 'owner[1]' 'group[1]' _
  read 'perms[2]' 'links[2]' 'owner[2]' 'group[2]' _
} < <(ls -ld -- "$file1" "$file2")

Similar code that uses two separate ls calls would probably be OK, since the second read command would be guaranteed to start reading at the beginning of an ls command's output, instead of possibly in the middle of a filename.
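A sketch of that one-call-per-file variant follows. The helper name meta and the file names are hypothetical; the first name deliberately contains a newline followed by text that looks like a second ls line. Since each read consumes only the first line of its own ls call, the fake line is never mistaken for metadata.

```shell
# Sketch: one ls call per file, so every read starts at a real metadata line.
dir=$(mktemp -d) || exit 1
nl='
'
file1="$dir/first${nl}-rw-r--r-- 9 0 0 fake"
file2="$dir/second"
touch "$file1" "$file2"
chmod 600 "$file1"
chmod 644 "$file2"

# Hypothetical helper: sets perms, links, owner and group for one file.
meta() {
    IFS=' ' read -r perms links owner group _ <<EOF
$(LC_ALL=C ls -nd -- "$1")
EOF
    perms=$(printf '%.10s' "$perms")
}

meta "$file1"; perms1=$perms links1=$links
meta "$file2"; perms2=$perms links2=$links

printf '%s %s\n' "$perms1" "$links1" "$perms2" "$links2"
rm -rf "$dir"
```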

If all of this sounds like a big bag of hurt to you, you're right. It probably isn't worth trying to dodge all this lack of standardization. See Bash FAQ 87 for some ways of getting file metadata without parsing ls output at all.

Notes on GNU coreutils ls

A patch to add a -0 option (analogous to find -print0) in GNU coreutils was rejected in 2014. However, in a surprise reversal, a --zero option has been added in GNU coreutils 9.0 (2021). If you're fortunate enough to be writing for platforms with ls --zero, you get to use that for tasks like "delete the 5 oldest files in this directory".

# Bash 4.4 and coreutils 9.0
# Delete the 5 oldest files in the current directory.
readarray -t -d '' -n 5 sorted < <(ls --zero -tr)
(( ${#sorted[@]} == 0 )) || rm -- "${sorted[@]}"

Less recent (circa 2016) versions of GNU coreutils have a --quoting-style option with various choices.

One of them is actually useful when combined with bash's eval command. Specifically, --quoting-style=shell-always produces output that Bourne-like shells can parse back into filenames.

$ touch zzz yyy $'zzz\nyyy'
$ ls --quoting-style=shell-always
'yyy'  'zzz'  'zzz?yyy'
$ ls --quoting-style=shell-always | cat
'yyy'
'zzz'
'zzz
yyy'

It always uses single quotes to quote file names (with single quotes themselves rendered as \' outside of quotes), which is the only safe quoting method.
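To show why that quoting scheme survives a round trip through eval, here is a sketch of a hypothetical shquote helper that mimics the scheme just described. It is not GNU ls code and is not claimed to match its output byte for byte. One known limitation of this sketch: command substitution strips trailing newlines, so a name ending in a newline would not round-trip.

```shell
# Hypothetical helper: wrap a string in single quotes, rendering each
# embedded single quote as '\'' (close quote, escaped quote, reopen quote).
shquote() {
    printf '%s\n' "$1" | sed "s/'/'\\\\''/g; 1s/^/'/; \$s/\$/'/"
}

nl='
'
name="it's a${nl}test"
quoted=$(shquote "$name")

# The quoted form can be parsed back by the shell without any surprises.
eval "back=$quoted"

printf 'quoted: %s\n' "$quoted"
[ "$back" = "$name" ] && echo "round trip OK"
```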

Note that some control characters are still rendered as ? when the output goes to the terminal, but that doesn't happen for redirected output (like when piped to cat as seen above or more generally when the output is post-processed).

Combining with eval, we can solve certain kinds of problems, like get the 5 oldest files in this directory. Of course, eval must be used with care.

# Bash + recent (since ~2016) GNU coreutils

# Get all the files, in sorted order by mtime.
eval "sorted=( $(ls -rt --quoting-style=shell-always) )"

# First 5 array elements are the 5 oldest files.
# We can display them to a human:
(( ${#sorted[@]} == 0 )) || printf '<%s>\n' "${sorted[@]:0:5}"

# Or we can send them into xargs -r0:
print0() {
  [ "$#" -eq 0 ] || printf '%s\0' "$@"
}
print0 "${sorted[@]:0:5}" | xargs -r0 something

# Or whatever we want to do with them

GNU ls also supports --quoting-style=shell-escape (which in version 8.25 became the default when ls is printing to a terminal), but that one is not as safe: its output is not always quoted, and it can use quoting operators that are not portable or that are unsafe in some locales.


CategoryShell

ParsingLs (last edited 2023-08-12 13:05:09 by StephaneChazelas)