I'm getting "Argument list too long". How can I process a large list in chunks?
First, let's review some background material. When a process wants to run another process, it fork()s a child, and the child calls one of the exec* family of system calls (e.g. execve()), giving the name or path of the new process's program file; the name of the new process; the list of arguments for the new process; and, in some cases, a set of environment variables. Thus:
/* C */
execlp("ls", "ls", "-l", "dir1", "dir2", (char *) NULL);
There is (generally) no limit to the number of arguments that can be passed this way, but on most systems, there is a limit to the total size of the list. For more details, see http://www.in-ulm.de/~mascheck/various/argmax/ .
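You can check the limit (in bytes) on a given system with getconf:

getconf ARG_MAX    # prints the limit in bytes; the exact value varies by system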
If you try to pass too many filenames (for instance) in a single program invocation, you'll get something like:
$ grep foo /usr/include/sys/*.h
bash: /usr/bin/grep: Arg list too long
There are various tricks you could use to work around this in an ad hoc manner (change directory to /usr/include/sys first, and use grep foo *.h to shorten the length of each filename...), but what if you need something absolutely robust?
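For instance, that directory-changing trick might look like this (shown in a subshell so the cd doesn't affect the rest of the script); it only helps because each expanded filename becomes shorter:

( cd /usr/include/sys && grep foo *.h )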
Some people like to use xargs here, but it has some serious issues. By default it splits its input on whitespace and gives quote characters and backslashes special meaning, making it incapable of handling arbitrary filenames properly. (See UsingFind for a discussion of this.)
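To see why, consider a (hypothetical) filename containing a space:

# Hypothetical example: plain xargs splits "my header.h" on the space,
# so grep is asked to search the nonexistent files "my" and "header.h".
printf '%s\n' 'my header.h' | xargs grep foo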
That said, the GNU version of xargs has a -0 option that lets us feed NUL-terminated arguments to it, and when reading in this mode, it doesn't fall over and explode when it sees whitespace or quote characters. So, we could feed it a list thus:
# Requires GNU xargs
printf '%s\0' /usr/include/sys/*.h | xargs -0 grep foo /dev/null
Or, if recursion is acceptable (or desirable), you may use find directly:
find /usr/include/sys -name '*.h' -exec grep foo /dev/null {} +
If recursion is unacceptable but you have GNU find, you can use this non-portable alternative:
# Requires GNU find
find /usr/include/sys -maxdepth 1 -name '*.h' -exec grep foo /dev/null {} +
(Recall that grep will only print filenames if it receives more than one filename to process. Thus, we pass it /dev/null as a filename, to ensure that it always has at least two filenames, even if the -exec only passes it one name.)
The most general alternative is to use a Bash array and a loop to process the array in chunks:
# Bash
files=(/usr/include/*.h /usr/include/sys/*.h)
for ((i=0; i<${#files[*]}; i+=100)); do
    grep foo "${files[@]:i:100}" /dev/null
done
Here, we've chosen to process 100 elements at a time; this is arbitrary, of course, and you could set it higher or lower depending on the anticipated size of each element vs. the target system's getconf ARG_MAX value. If you want to get fancy, you could do arithmetic using ARG_MAX and the size of the largest element, but you still have to introduce "fudge factors" for the size of the environment, etc. It's easier just to choose a conservative value and hope for the best.
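If you do want to attempt the arithmetic anyway, here is a rough sketch; the 8 KiB reserve for the environment and command name, and the assumed per-argument overhead, are fudge factors, not exact figures:

# Bash -- rough sketch only. The 8192-byte reserve and the 9 bytes of assumed
# per-argument overhead (terminating NUL plus a pointer) are guesses, not
# values guaranteed by any standard.
limit=$(getconf ARG_MAX)
longest=0
for f in "${files[@]}"; do
    (( ${#f} > longest )) && longest=${#f}
done
chunk=$(( (limit - 8192) / (longest + 9) ))
for ((i=0; i<${#files[@]}; i+=chunk)); do
    grep foo "${files[@]:i:chunk}" /dev/null
done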