I'm getting "Argument list too long". How can I process a large list in chunks?

First, let's review some background material. When a process wants to run another process, it fork()s a child, and the child calls one of the exec* family of system calls (e.g. execve()), giving the name or path of the new process's program file; the name of the new process; the list of arguments for the new process; and, in some cases, a set of environment variables. Thus:

  • /* C */
    execlp("ls", "ls", "-l", "dir1", "dir2", (char *) NULL);

There is (generally) no limit to the number of arguments that can be passed this way, but on most systems, there is a limit to the total size of the list. For more details, see http://www.in-ulm.de/~mascheck/various/argmax/ .
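
To see the limit on your own system, ask getconf (a standard POSIX utility; the number it reports varies widely between platforms):

  • # POSIX; reports the maximum combined length of arguments and environment
    getconf ARG_MAX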

If you try to pass too many filenames (for instance) in a single program invocation, you'll get something like:

  • $ grep foo /usr/include/sys/*.h
    bash: /usr/bin/grep: Arg list too long

There are various tricks you could use to work around this in an ad hoc manner (change directory to /usr/include/sys first, and use grep foo *.h to shorten the length of each filename...), but what if you need something absolutely robust?
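
For instance, the directory-changing trick might look like this (run in a subshell so the cd doesn't affect the rest of the script):

  • # Ad hoc workaround: shorter filenames, shorter argument list
    (cd /usr/include/sys && grep foo *.h)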

Some people like to use xargs here, but it has some serious issues. It treats whitespace and quote characters in its input as word delimiters, making it incapable of handling filenames properly. (See UsingFind for a discussion of this.)
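
To see the problem, imagine a file named foo bar.h (a made-up name for illustration). Fed to xargs one name per line, it gets split at the space:

  • # xargs treats the space as a delimiter, so grep is handed the two
    # nonexistent names "foo" and "bar.h" instead of one filename
    printf '%s\n' 'foo bar.h' | xargs grep foo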

That said, the GNU version of xargs has a -0 option that lets us feed NUL-terminated arguments to it, and when reading in this mode, it doesn't fall over and explode when it sees whitespace or quote characters. So, we could feed it a list thus:

  • # Requires GNU xargs
    printf '%s\0' /usr/include/sys/*.h |
    xargs -0 grep foo /dev/null

Or, if recursion is acceptable (or desirable), you may use find directly:

  • find /usr/include/sys -name '*.h' -exec grep foo /dev/null {} +

If recursion is unacceptable but you have GNU find, you can use this non-portable alternative:

  • # Requires GNU find
    find /usr/include/sys -maxdepth 1 -name '*.h' -exec grep foo /dev/null {} +

(Recall that grep will only print filenames if it receives more than one filename to process. Thus, we pass it /dev/null as a filename, to ensure that it always has at least two filenames, even if the -exec only passes it one name.)
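
A quick demonstration of the difference, assuming a file one.h (a made-up name) that contains a match:

  • grep foo one.h            # prints the matching lines, no filename prefix
    grep foo one.h /dev/null  # prints one.h: in front of each matching line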

The most general alternative is to use a Bash array and a loop to process the array in chunks:

  • # Bash
    files=(/usr/include/*.h /usr/include/sys/*.h)
    for ((i=0; i<${#files[*]}; i+=100)); do
       grep foo "${files[@]:i:100}" /dev/null
    done

Here, we've chosen to process 100 elements at a time; this is arbitrary, of course, and you could set it higher or lower depending on the anticipated size of each element vs. the target system's getconf ARG_MAX value. If you want to get fancy, you could do arithmetic using ARG_MAX and the size of the largest element, but you still have to introduce "fudge factors" for the size of the environment, etc. It's easier just to choose a conservative value and hope for the best.
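
If you do want to attempt the arithmetic, a rough sketch might look like this (reusing the files array from above; the 2048-byte margin for the environment and other overhead is an arbitrary fudge factor):

  • # Bash; a conservative estimate, not a guarantee
    max=$(getconf ARG_MAX)
    longest=0
    for f in "${files[@]}"; do
       ((${#f} > longest)) && longest=${#f}
    done
    # each argument costs its length plus a terminating NUL
    chunk=$(( (max - 2048) / (longest + 1) ))

You could then substitute chunk for the hard-coded 100 in the loop above.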
