Differences between revisions 16 and 33 (spanning 17 versions)
Revision 16 as of 2007-08-10 15:03:58
Size: 12111
Editor: GreyCat
Comment: old UUOK-9 page is now 404; find a replacement for it
Revision 33 as of 2009-09-16 18:29:46
Size: 25564
Editor: GreyCat
Comment: slight change in the locking/mutex answer
Line 3: Line 3:
''First, let's get the easy stuff out of the way.''

= Things you were supposed to know BEFORE you entered #bash =

{{{<JoeNewbie> How do I kill a process by name? I need to get the PID out of ps aux | grep ....
}}}
<<TableOfContents>>

<<Anchor(basics)>>
= The basics =

== How do I kill a process by name? I need to get the PID out of ps aux | grep .... ==
Line 14: Line 14:
{{{<JoeNewbie> How do I run a job in the background?
}}}

{{{command &}}}

{{{<JoeNewbie> My script runs a job in the background. How do I get its PID?
}}}
If you just wanted to check for the ''existence'' of a process by name, use {{{pgrep}}}.

Please note that checking/killing processes by name is ''insecure'', because processes can lie about their names, and names are not guaranteed to be unique. The rest of this page will explain things in greater depth, and provide alternatives.

== How do I run a job in the background? ==

{{{
command &
}}}

By the way, `&` is a command separator in bash and other Bourne shells; it's syntactically the same as `;` and can be used in place of `;` but not ''in addition to'' `;`. Thus, you can write this:

{{{
command one & command two & command three &
}}}

Or:

{{{
for i in 1 2 3; do command $i & done
}}}

== My script runs a job in the background. How do I get its PID? ==
Line 24: Line 40:
{{{<JoeNewbie> OK, I have its PID. How do I check that it's still running?
}}}


{{{kill -0 $PID}}} will check to see whether a signal is deliverable (''i.e.'', the process still exists). If you need to check on a single child process asynchronously, that's the best, most portable, most efficient solution. You might also be able to use the {{{wait}}} shell command to block until the child (or children) terminate -- it depends on what your program has to do.

There is no shell scripting equivalent to the {{{select(2)}}} or {{{poll(2)}}} system calls. If you need to manage a complex suite of child processes and events, don't try to do it in a shell script.

{{{<JoeNewbie> I want to run something in the background and then log out
}}}
{{{
myjob &
jobpid=$!
}}}

== OK, I have its PID. How do I check that it's still running? ==

{{{kill -0 $PID}}} will check to see whether a signal is deliverable (''i.e.'', the process still exists). If you need to check on a single child process asynchronously, that's the most portable solution. You might also be able to use the {{{wait}}} shell command to block until the child (or children) terminate -- it depends on what your program has to do.

There is no shell scripting equivalent to the {{{select(2)}}} or {{{poll(2)}}} system calls. If you need to manage a complex suite of child processes and events, don't try to do it in a shell script.  (That said, there are a few tricks in the [[#advanced|advanced]] section of this page.)

== I want to run something in the background and then log out. ==
Line 40: Line 59:
{{{sleep 1000
{{{
sleep 1000
Line 43: Line 63:
disown}}}

{{{<JoeNewbie> I'm trying to kill -9 my job but blah blah blah...
}}}
disown
}}}

If you need to log out of an ssh session with background jobs still running, make sure their file descriptors have been redirected so they aren't holding the terminal open, or [[BashFAQ/063|the ssh client may hang]].

== I'm trying to kill -9 my job but blah blah blah... ==
Line 56: Line 78:
If you don't understand why this is a case of slicing bread with a howitzer, read [http://partmaps.org/era/unix/award.html#uuk9letter The Useless Use of Kill -9 Award].
If you don't understand why this is a case of slicing bread with a howitzer, read [[http://partmaps.org/era/unix/award.html#uuk9letter|The Useless Use of Kill -9 Award]].

== Make SURE you have run and understood these commands: ==
 * {{{help kill}}}
 * {{{help trap}}}
 * {{{man pkill}}}
 * {{{man pgrep}}}
Line 60: Line 88:
= Things that actually need answers =

{{{<JoeNewbie> I want to run two jobs in the background, and then wait until they both finish.
}}}
<<Anchor(advanced)>>
= Advanced questions =

== I want to run two jobs in the background, and then wait until they both finish. ==
Line 67: Line 95:
{{{job1 &
{{{
job1 &
Line 69: Line 98:
wait}}}

{{{<JoeNewbie> How can I check to see if my game server is still running?
<JoeNewbie> I'll put a script in crontab, and if it's not running, I'll restart it...}}}

We get that question (in various forms) ''way'' too often. A user has some daemon with a bug, and rather than fix the bug (which admittedly lies well outside the scope of a normal system administrator's purview), they simply want to restart it whenever it dies. And yes, one could probably write a bash script that would try to parse the output of {{{ps}}} (or preferably {{{pgrep}}} if your system has it), and try to ''guess'' which process ID belongs to the daemon we want, and try to ''guess'' whether it's not there any more. But that's haphazard and dangerous. There are much better ways.

Most Unix systems already ''have'' a feature that allows you to respawn dead processes: {{{init}}} and {{{inittab}}}. If you want to make a new daemon instance pop up whenever the old one dies, typically all you need to do is put an appropriate line into {{{/etc/inittab}}} with the "respawn" action in column 3, and your process's invocation in column 4.

Some Unix systems don't have {{{inittab}}}, and some system administrators might want finer control over the daemons and their logging. Those people may want to look into [http://cr.yp.to/daemontools.html daemontools], or [http://smarden.org/runit/ runit].
wait
}}}

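Given no arguments, `wait` waits for all of your children; given one or more PIDs or job specs, it waits for each of those in turn. What it cannot do is return as soon as ''any one'' of several children exits (bash 4.3 and later add `wait -n` for that).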

There is also no way to wait for "child process foo to end, OR something else to happen", other than setting a `trap`, which will only help if "something else to happen" is a signal being sent to the script.

There is also no way to wait for a process that is not your child. You can't hang around the schoolyard and pick up someone else's kids.

== How can I check to see if my game server is still running?  I'll put a script in crontab, and if it's not running, I'll restart it... ==

We get that question (in various forms) ''way'' too often. A user has some daemon, and they want to restart it whenever it dies. Yes, one could probably write a bash script that would try to parse the output of {{{ps}}} (or preferably {{{pgrep}}} if your system has it), and try to ''guess'' which process ID belongs to the daemon we want, and try to ''guess'' whether it's not there any more. But that's haphazard and dangerous. There are much better ways.

Most Unix systems already ''have'' a feature that allows you to respawn dead processes: {{{init}}} and {{{inittab}}}. If you want to make a new daemon instance pop up whenever the old one dies, typically all you need to do is put an appropriate line into {{{/etc/inittab}}} with the "respawn" action in column 3, and your process's invocation in column 4.  Then run `telinit q` or your system's equivalent to make init re-read its `inittab`.

Some Unix systems don't have {{{inittab}}}, and some system administrators might want finer control over the daemons and their logging. Those people may want to look into [[http://cr.yp.to/daemontools.html|daemontools]], or [[http://smarden.org/runit/|runit]].
Line 82: Line 117:
The problem with this is that any sane method of managing a daemon requires that you ''keep track of it after starting it''. If {{{init}}} is told to respawn a command, it simply launches that command as a child, then uses the {{{wait()}}} system call; and when the child exits, the parent can spawn another one. Daemontools works the same way: a user-supplied {{{run}}} script establishes the environment, and then {{{exec}}}s the process, thereby giving the daemontools supervisor direct parental authority over the process, including standard input and output, etc.

If a process double-forks itself into the background, it breaks the connection to its parent -- intentionally. This makes it unmanageable; the parent can no longer receive the child's output, and can no longer {{{wait()}}} for the child in order to be informed of its death. And the parent won't even know the new daemon's process ID, so it can't even keep track of it with a simple {{{kill -0}}}.

So, the Unix/BSD people came up with workarounds... they created "PID files", in which a long-running daemon would write its process ID, since the parent had no other way to determine it. But PID files are not reliable. A daemon could have died, and then some other process could have taken over its PID, rendering the PID file useless. Or the PID file could simply get deleted, or corrupted. They came up with {{{pgrep}}} and {{{pkill}}} to attempt to track down processes by name instead of by number... but what if the process doesn't have a unique name? What if there's more than one of it at a time, for example, with {{{nfsd}}} or Apache?

These workarounds and tricks are only in place because of the ''original'' hack of self-backgrounding. Get rid of ''that'', and everything else becomes easy! Init or daemontools or runit can just control the child process directly. And even the most raw beginner could write their own wrapper script:
The problem with this is that any sane method of managing a daemon requires that you ''keep track of it after starting it''. If {{{init}}} is told to respawn a command, it simply launches that command as a child, then uses the {{{wait()}}} system call; so, when the child exits, the parent can spawn another one. Daemontools works the same way: a user-supplied {{{run}}} script establishes the environment, and then {{{exec}}}s the process, thereby giving the daemontools supervisor direct parental authority over the process, including standard input and output, etc.

If a process double-forks itself into the background (the way `inetd` and `sendmail` and `named` do), it breaks the connection to its parent -- intentionally. This makes it unmanageable; the parent can no longer receive the child's output, and can no longer {{{wait()}}} for the child in order to be informed of its death. And the parent won't even know the new daemon's process ID. The child has run away from home without even leaving a note.

So, the Unix/BSD people came up with workarounds... they created "PID files", in which a long-running daemon would write its process ID, since the parent had no other way to determine it. But PID files are not reliable. A daemon could have died, and then some other process could have taken over its PID, rendering the PID file useless. Or the PID file could simply get deleted, or corrupted. They came up with {{{pgrep}}} and {{{pkill}}} to attempt to track down processes by name instead of by number... but what if the process doesn't have a unique name? What if there's more than one of it at a time, like {{{nfsd}}} or Apache?

These workarounds and tricks are only in place because of the ''original'' hack of self-backgrounding. Get rid of ''that'', and everything else becomes easy! Init or daemontools or runit can just control the child process directly. And even the most raw beginner could write their own [[WrapperScript|wrapper script]]:
Line 92: Line 127:
   while true; do
   while :; do
Line 101: Line 136:
{{{<JoeNewbie> How do I make sure only one copy of my script can run at a time?
}}}


First, ask yourself ''why'' you think that restriction is necessary. Are you using a temporary file with a fixed name, rather than [:BashFAQ#faq62:generating a new temporary file in a secure manner] each time? If so, correct that bug in your script. Are you using some system resource without locking it to prevent corruption if multiple processes use it simultaneously? In that case, you should probably use file locking, by rewriting your application in a language that supports it.
If all else fails, you can try using [[http://cr.yp.to/daemontools/fghack.html|fghack]] (from the daemontools package) to prevent the self-backgrounding.

== How do I make sure only one copy of my script can run at a time? ==

First, ask yourself ''why'' you think that restriction is necessary. Are you using a temporary file with a fixed name, rather than [[BashFAQ/062|generating a new temporary file in a secure manner]] each time? If so, correct that bug in your script. Are you using some system resource without locking it to prevent corruption if multiple processes use it simultaneously? In that case, you should probably use file locking, by rewriting your application in a language that supports it.
Line 108: Line 144:
Unfortunately, bash has no facility for locking a file. You can [:BashFAQ#45:use a ''directory'' as a lock], but you cannot lock a file directly.

'' I believe you can use {{{(set -C; >lockfile)}}} to atomically create a lockfile, please verify this. (see: [http://wooledge.org/mywiki/BashFAQ/045 BashFAQ/045]) --Andy753421''

You can run any program or shell script under the [http://cr.yp.to/daemontools/setlock.html setlock] program from the daemontools package. Presuming that you use the same lockfile to prevent concurrent or simultaneous execution of your script(s), you have effectively made sure that your script will only run once. Here's an example where we want to make sure that only one "sleep" is running at a given time.
Unfortunately, bash has no facility for locking a file. [[BashFAQ/045|Bash FAQ #45]] contains examples of using a directory, a symlink, etc. as a means of mutual exclusion; but you cannot lock a file directly.
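
As a rough illustration of the directory approach mentioned above, here is a minimal sketch. It assumes a fixed lock path (the name `myscript.lock` is made up for this example) and relies on the fact that `mkdir` either creates the directory or fails, never both for two competing processes. The helper names `acquire_lock` and `release_lock` are hypothetical.

```shell
#!/bin/bash
# Hypothetical lock location; every instance of the script must use the same path.
lockdir="${TMPDIR:-/tmp}/myscript.lock"

# mkdir succeeds for exactly one caller; everyone else gets a failure status.
acquire_lock() {
  mkdir "$lockdir" 2>/dev/null
}

release_lock() {
  rmdir "$lockdir"
}

if acquire_lock; then
  trap release_lock EXIT     # remove the lock when the script exits normally or via exit
  echo "lock acquired; doing the work"
  # ... the part that must not run concurrently goes here ...
else
  echo "another instance holds the lock; exiting" >&2
  exit 1
fi
```

A script killed with SIGKILL never runs its `EXIT` trap, so a stale lock directory can be left behind; this sketch makes no attempt to detect that case.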

 ''I believe you can use {{{(set -C; >lockfile)}}} to atomically create a lockfile, please verify this. (see: [[BashFAQ/045|Bash FAQ #45]]) --Andy753421''

You could also run your program or shell script under the [[http://cr.yp.to/daemontools/setlock.html|setlock]] program from the daemontools package. Presuming that you use the same lockfile to prevent concurrent or simultaneous execution of your script(s), you have effectively made sure that your script will only run once. Here's an example where we want to make sure that only one "sleep" is running at a given time.
Line 117: Line 153:
$ setlock -nX lockfile sleep 100
$ setlock -nX lockfile sleep 100
Line 123: Line 159:
{{{<JoeNewbie> I want to process a bunch of files, and when one finishes, I want to start the next.
<JoeNewbie> And I want to make sure there are exactly 5 jobs running at a time.}}}

Well, a C program may have the luxury of forking 5 children and managing them closely using {{{select()}}} or similar, to assign the next file in line to whichever child is ready to handle it. That level of detail just isn't practical in a shell script.

In a script, you're typically better off dividing the job into 5 "equal" parts, and then just launching them all in parallel. Here's an example:

{{{
== I want to process a bunch of files in parallel, and when one finishes, I want to start the next. And I want to make sure there are exactly 5 jobs running at a time. ==

Many `xargs` implementations can run tasks in parallel with the `-P` option, including those of FreeBSD, OpenBSD and GNU (`-P` is not specified by POSIX):

{{{
find . -print0 | xargs -0 -n 1 -P 5 command
}}}

A C program could fork 5 children and manage them closely using {{{select()}}} or similar, to assign the next file in line to whichever child is ready to handle it. But bash has nothing equivalent to `select` or `poll`.

In a script, you're reduced to lesser solutions. One way is to divide the job into 5 "equal" parts, and then just launch them all in parallel. Here's an example:

{{{#!nl
Line 160: Line 201:
See [:BashFAQ#faq1:reading a file line-by-line] and [:BashFAQ#faq5:arrays] and ArithmeticExpression for explanations of the syntax used in this example.

Even if the lists aren't quite identical in terms of the amount of work required, this approach is ''close enough'' for many purposes. Again, if you need something more sophisticated than this, you're looking at the wrong language.
See [[BashFAQ/001|reading a file line-by-line]] and [[BashFAQ/005|arrays]] and ArithmeticExpression for explanations of the syntax used in this example.

Even if the lists aren't quite identical in terms of the amount of work required, this approach is ''close enough'' for many purposes.

Another approach involves using a [[NamedPipes|named pipe]] to tell a "manager" when a job is finished, so it can launch the next job. Here is an example of that approach:

{{{#!nl
#!/bin/bash

# FD 3 will be tied to a named pipe.
mkfifo pipe; exec 3<>pipe

# This is the job we're running.
s() {
  echo "Sleeping $1"
  sleep "$1"
}

# Start off with 3 instances of it.
# Each time an instance terminates, write a newline to the named pipe.
{ s 5; echo >&3; } &
{ s 7; echo >&3; } &
{ s 8; echo >&3; } &

# Each time we get a line from the named pipe, launch another job.
while read; do
  { s $((RANDOM%5+7)); echo >&3; } &
done <&3
}}}

If you need something more sophisticated than these, you're probably looking at the wrong language.

<<Anchor(theory)>>
= On processes, environments and inheritance =

Every process on a Unix system has a parent process (except `init`), from which it inherits certain things. A process can change some of these things, and not others. You cannot change things inside another process other than by being its parent, or attaching (attacking?) it with a debugger.

It is of paramount importance that you understand this model if you plan to use or administer a Unix system successfully. For example, a user with 10 windows open might wonder why he can't tell all of his shells to change the contents of their PATH variable, short of going to each one individually and running a command. And even then, the changed PATH variable won't be set in the user's window manager or desktop environment, which means any ''new'' windows he creates will still get the old variable.

The solution, of course, is that the user needs to edit a shell [[DotFiles|dot file]], then log out and back in, so that his top-level processes will get the new variable, and can pass it along to their children.

Likewise, a system administrator might want to tell her `in.ftpd` to use a default [[Permissions#umask|umask]] of 002 instead of whatever it's currently using. Achieving that goal will require an understanding of how `in.ftpd` is launched on her system, either as a child of `inetd` or as a standalone daemon with some sort of [[BootScript|boot script]]; making the appropriate modifications; and restarting the appropriate daemons, if any.

So, let's take a closer look at how processes are created.

The Unix process creation model revolves around two system calls: `fork()` and `exec()`. (There is actually a family of related system calls that begin with `exec` which all behave in slightly different manners, but we'll treat them all equally for now.) `fork()` creates a child process which is a ''duplicate'' of the parent who called `fork()` (with a few exceptions). The parent receives the child process's PID (Process ID) number as the return value of the `fork()` function, while the child gets a "0" to tell it that it's the child. `exec()` replaces the current process with a different program.

So, the usual sequence is:

 * A program calls `fork()` and checks the return value of the system call. If the return value is greater than 0, then it's the parent process, so it calls `wait()` on the child process ID (unless we want it to continue running while the child runs in the background).
 * If the return value is 0, then it's the child process, so it calls `exec()` to do whatever it's supposed to be doing.
 * But before that, the child might decide to `close()` some file descriptors, `open()` new ones, set environment variables, change resource limits, and so on. All of these changes will remain in effect after the `exec()` and will affect the task that is executed.
 * If the return value of `fork()` is negative, something bad happened (we ran out of memory, or the process table filled up, etc.).
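
The sequence above can be imitated from the shell itself. The following is only a sketch, assuming that a backgrounded subshell `( ... ) &` is an acceptable stand-in for the `fork()`ed child and that the `exec` builtin stands in for the `exec()` call; the variable names are made up for the example.

```shell
#!/bin/bash
outfile=$(mktemp)              # scratch file that will receive the child's output

(
  cd / || exit 1               # child-only change of working directory
  exec >"$outfile"             # child-only: reopen FD 1 onto the scratch file
  exec sh -c 'echo "running in $PWD"'   # replace the child with another program
) &
childpid=$!                    # the parent learns the child's PID

wait "$childpid"               # the parent blocks until the child exits
childstatus=$?
child_output=$(cat "$outfile")
rm -f "$outfile"

echo "child $childpid exited with status $childstatus and wrote: $child_output"
echo "parent is still in $PWD"
```

The directory change and the redirection both happen in the child before the `exec`, so they affect the program that is executed but never the parent.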

Let's take an example of a shell command:

{{{
echo hello world 1>&2
}}}

The process executing this is a shell, which reads commands and executes them. It uses the standard `fork()`/`exec()` model to do so. Let's show it step by step:

 * The parent shell calls `fork()`.
 * The parent gets the child's process ID as the return value of `fork()` and waits for it to terminate.
 * The child gets a 0 from `fork()` so it knows it's the child.
 * The child is supposed to redirect standard output to standard error (due to the `1>&2` directive). It does this now:
  * Close file descriptor 1.
  * Duplicate file descriptor 2, and make sure the duplicate is FD 1.
 * The child calls `exec("echo", "echo", "hello", "world", (char *)NULL)` or something similar to execute the command. (Here, we're assuming `echo` is an external command.)
 * Once the `echo` terminates, the parent's `wait` call also terminates, and the parent resumes normal operation.
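
To see the effect of the `1>&2` step, one can capture each stream separately. This is a small sketch; `say_hello` is a hypothetical wrapper around the example command, and because `echo` is a builtin in bash, no real `fork()`/`exec()` of an external `echo` takes place here. Only the redirection behaviour is being shown.

```shell
#!/bin/bash
# say_hello runs the example command: its output is sent to standard error.
say_hello() {
  echo hello world 1>&2
}

# Capture standard output only (stderr is discarded): nothing arrives.
captured_stdout=$(say_hello 2>/dev/null)

# Point stderr at the capture and discard stdout: now the text shows up.
captured_stderr=$(say_hello 2>&1 >/dev/null)

echo "stdout got: '$captured_stdout'"
echo "stderr got: '$captured_stderr'"
```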

There are other things the child of the shell might do before executing the final command. For example, it might set environment variables:

{{{
http_proxy=http://tempproxy:3128/ lynx http://someURL/
}}}

In this case, the child will put `http_proxy=http://tempproxy:3128/` into the environment before calling `exec()`.
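
A quick way to convince yourself that the assignment reaches only the child is sketched below. `DEMO_PROXY` is a made-up variable name, and `sh -c` stands in for `lynx`.

```shell
#!/bin/bash
unset DEMO_PROXY               # make sure the parent starts without it

# The assignment prefix is placed in the child's environment only.
child_saw=$(DEMO_PROXY=http://tempproxy:3128/ sh -c 'echo "$DEMO_PROXY"')

echo "child saw:  $child_saw"
echo "parent has: ${DEMO_PROXY-nothing}"
```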

A child process inherits many things from its parent:

 * Open file descriptors. The child gets copies of these, referring to the same files.
 * Environment variables. The child gets its own copies of these, and [[BashFAQ/060|changes made by the child do not affect the parent's copy]].
 * Current working directory. If the child changes its working directory, [[BashFAQ/060|the parent will never know about it]].
 * User ID, group ID and supplementary groups. A child process is spawned with the same privileges as its parent. Unless the child process is running with superuser UID (UID 0), it cannot change these privileges.
 * System resource limits. The child inherits the limits of its parent. A process that runs as superuser UID can raise its resource limits (`setrlimit(2)`). A process running as non-superuser can only lower its resource limits; it can't raise them.
 * [[Permissions#umask|umask]].
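
Two of the items in this list are easy to observe from a script. The sketch below assumes `mktemp` is available; the file descriptor number 3 and the variable names are arbitrary choices for the example.

```shell
#!/bin/bash
# Open file descriptors: the child writes through the parent's FD 3, and
# because both refer to the same open file, the parent's later write is
# appended after the child's rather than overwriting it.
tmpfile=$(mktemp)
exec 3>"$tmpfile"              # parent opens FD 3
( echo "from child" >&3 )      # child inherits FD 3
echo "from parent" >&3
exec 3>&-                      # parent closes FD 3
shared_contents=$(cat "$tmpfile")
rm -f "$tmpfile"
echo "$shared_contents"

# umask: the child starts with the parent's value, and a change made in the
# child stays in the child.
parent_umask=$(umask)
child_umask=$(umask 077; umask)
echo "child changed its umask to $child_umask; parent still has $(umask)"
```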

An active Unix system may be perceived as a ''tree'' of processes, with parent/child relationships shown as vertical ("branch") connections between nodes. For example,

{{{
        (init)
           |
        (login)
           |
         startx
           |
         xinit
           |
     bash .xinitrc
     /     |    \
  rxvt   rxvt   fvwm2
   |       |        \
  bash   screen      \____________________
         /  |  \        |      |      \
     bash bash bash   xclock  xload  firefox ...
            |    |
          mutt  rtorrent
}}}

This is a simplified version of an actual set of processes run by one user on a real system. I have omitted many, to keep things readable. The root of the tree, shown as `(init)`, as well as the first child process `(login)`, are running as root (superuser UID 0). Here is how this scenario came about:

 * The kernel (Linux in this case) is hard-coded to run `/sbin/init` as process number 1 when it has finished its startup. `init` never dies; it is the ultimate ancestor of every process on the system.
 * `init` reads `/etc/inittab` which tells it to spawn some `getty` processes on some of the Linux virtual terminal devices (among other things).
 * Each `getty` process presents a bit of information plus a login prompt.
 * After reading a username, `getty` `exec()`s `/bin/login` to read the password. (Thus, `getty` no longer appears in the tree; it has replaced itself.)
 * If the password is valid, `login` `fork()`s the user's login shell (in this case bash). Presumably, it hangs around (instead of using `exec()`) because it wants to do some clean-up after the user's shell has terminated.
 * The user types `exec startx` at the bash shell prompt. This causes bash to `exec()` `startx` (and therefore the login shell no longer appears in the tree).
 * `startx` is a wrapper that launches an X session, which includes an X server process (not shown -- it runs as root), and a whole slew of client programs. On this particular system, `.xinitrc` in the user's home directory is a script that tells which X client programs to run.
 * Two `rxvt` terminal emulators are launched from the `.xinitrc` file (in the background using `&`), and each of them runs a new copy of the user's shell, bash.
  * In one of them, the user has typed `exec screen` (or something similar) to replace bash with screen. Screen, in turn, has three bash child processes of its own, two of which have terminal-based programs running in them (mutt, rtorrent).
 * The user's window manager, `fvwm2`, is run in the foreground by the `.xinitrc` script. A window manager or desktop environment is usually the last thing run by the `.xinitrc` script; when the WM or DE terminates, the script terminates, and brings down the whole session.
 * The window manager runs several processes of its own (xclock, xload, firefox, ...). It typically has a menu, or icons, or a control panel, or some other means of launching new programs. We will not cover window manager configurations here.

Other parts of a Unix system use similar process trees to accomplish their goals, although few of them are quite as deep or complex as an X session. For example, `inetd` runs as a daemon which listens on several UDP and TCP ports, and launches programs (`ftpd`, `telnetd`, etc.) when it receives network connections. `lpd` runs as a managing daemon for printer jobs, and will launch children to handle individual jobs when a printer is ready. `sshd` listens for incoming SSH connections, and launches children when it receives them. Some electronic mail systems (particularly [[CategoryQmail|qmail]]) use relatively large numbers of small processes working together.

Understanding the relationship among a set of processes is vital to administering a system. For example, suppose you would like to change the way your FTP service behaves. You've located a configuration file that it is known to read at startup time, and you've changed it. Now what? You could reboot the entire system to be sure your change takes effect, but most people consider that overkill. Generally, people prefer to restart only the minimal number of processes, thereby causing the least amount of disruption to the other services and the other users of the system.

So, you need to understand how your FTP service starts up. Is it a standalone daemon? If so, you probably have some system-specific way of restarting it (either by running a BootScript, or manually killing and restarting it, or perhaps by issuing some special service management command). More commonly, an FTP service runs under the control of `inetd`. If this is the case, you don't need to restart anything at all. `inetd` will launch a fresh FTP service daemon every time it receives a connection, and the fresh daemon will read the changed configuration file every time.

On the other hand, suppose your FTP service doesn't have its own configuration file that lets you make the change you want (for example, changing its umask for the default [[Permissions]] of uploaded files). In this case, you know that it inherits its umask from `inetd`, which in turn gets its umask from whatever boot script launched it. If you would like to change FTP's umask in this scenario, you would have to edit `inetd`'s boot script, and then kill and restart `inetd` so that the FTP service daemons (`inetd`'s children) will inherit the new value. And by doing this, you are also changing the default umask of every ''other'' service that `inetd` manages! Is that acceptable? Only you can answer that. If not, then you may have to change how your FTP service runs, possibly moving it to a standalone daemon. This is a system administrator's job.

----
CategoryShell CategoryUnix

This is still a work in progress. Expect some rough edges.

The basics

How do I kill a process by name? I need to get the PID out of ps aux | grep ....

No, you don't. There's a command called pkill that does exactly what you're trying to do. You might also take a look at the command killall if you're on a legacy GNU/Linux system, but be warned: killall on some systems kills every process on the entire system. It's best to avoid it unless you really need it.

(Mac OS X comes with killall but not pkill. To get pkill, go to http://proctools.sourceforge.net/.)

If you just wanted to check for the existence of a process by name, use pgrep.

Please note that checking/killing processes by name is insecure, because processes can lie about their names, and names are not guaranteed to be unique. The rest of this page will explain things in greater depth, and provide alternatives.

How do I run a job in the background?

command &

By the way, & is a command separator in bash and other Bourne shells; it's syntactically the same as ; and can be used in place of ; but not in addition to ;. Thus, you can write this:

command one & command two & command three &

Or:

for i in 1 2 3; do command $i & done

My script runs a job in the background. How do I get its PID?

The $! special parameter holds the PID of the most recently executed background job. You can use that later on in your script to keep track of the job, terminate it, record it in a PID file (shudder), or whatever.

myjob &
jobpid=$!

OK, I have its PID. How do I check that it's still running?

kill -0 $PID will check to see whether a signal is deliverable (i.e., the process still exists). If you need to check on a single child process asynchronously, that's the most portable solution. You might also be able to use the wait shell command to block until the child (or children) terminate -- it depends on what your program has to do.

There is no shell scripting equivalent to the select(2) or poll(2) system calls. If you need to manage a complex suite of child processes and events, don't try to do it in a shell script. (That said, there are a few tricks in the advanced section of this page.)
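
As a sketch of the kill -0 check described above: `is_alive` is a hypothetical helper name, and `sleep` stands in for a real job.

```shell
#!/bin/bash
# is_alive PID: succeed if a signal could be delivered to PID.
is_alive() {
  kill -0 "$1" 2>/dev/null
}

sleep 1 &                      # stand-in for a real background job
jobpid=$!

if is_alive "$jobpid"; then
  echo "job $jobpid is still running"
fi

wait "$jobpid"                 # reap the child so that it is really gone
if ! is_alive "$jobpid"; then
  echo "job $jobpid has exited"
fi
```

Note that kill -0 also fails when the process exists but belongs to another user, so a failure here means "not signalable by me", which is only the same as "gone" for your own children.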

I want to run something in the background and then log out.

If you want to be able to reconnect to it later, use screen. Launch screen, then run whatever you want to run in the foreground, and detach screen with Ctrl-A d. You can reattach to screen (as long as you didn't reboot the server) with screen -x. You can even attach multiple times, and each attached terminal will see (and control) the same thing. This is also great for remote teaching situations.

If you can't or don't want to do that, the traditional approach still works: nohup something &

Bash also has a disown command, if you want to log out with a background job running, and you forgot to nohup it initially.

sleep 1000
Ctrl-Z
bg
disown

If you need to log out of an ssh session with background jobs still running, make sure their file descriptors have been redirected so they aren't holding the terminal open, or the ssh client may hang.

I'm trying to kill -9 my job but blah blah blah...

Whoa! Stop right there! Do not use kill -9, ever. For any reason. Unless you wrote the program to which you're sending the SIGKILL, and know that you can clean up the mess it leaves. Because you're debugging it.

If a process is not responding to normal signals, it's probably in "state D" (as shown on ps u), which means it's in uninterruptible sleep, blocked inside a system call (usually waiting on I/O). If that's the case, you're probably looking at a dead hard drive, or a dead NFS server, or a kernel bug, or something else along those lines. And you won't be able to kill the process anyway, SIGKILL or not.

If the process is ignoring normal SIGTERMs, then get the source code and fix it!

If you have an employee whose first instinct any time a job needs to be terminated is to break out the fucking howitzers, then fire him. Now.

If you don't understand why this is a case of slicing bread with a howitzer, read The Useless Use of Kill -9 Award.

Make SURE you have run and understood these commands:

  • help kill

  • help trap

  • man pkill

  • man pgrep

OK, now let's move on to the interesting stuff....

Advanced questions

I want to run two jobs in the background, and then wait until they both finish.

By default, wait waits for all of your shell's children to exit.

job1 &
job2 &
wait

wait also accepts one or more PIDs or job specs, in which case it waits for every child you named. What it can't do is return as soon as any one of several children exits (bash 4.3 and later add wait -n for that).
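
To wait for particular children and find out how each one fared, a script can save each PID from $! and hand it to wait. A minimal sketch, with two sh -c commands standing in for real jobs:

```shell
sh -c 'sleep 1; exit 0' &
pid1=$!
sh -c 'sleep 2; exit 7' &
pid2=$!

# wait PID returns that child's exit status.
status1=0; wait "$pid1" || status1=$?
status2=0; wait "$pid2" || status2=$?
echo "job1 exited with $status1, job2 exited with $status2"
```

The shell remembers the exit status of a child that finished before wait was called, so the order of the two waits doesn't matter.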

There is also no way to wait for "child process foo to end, OR something else to happen", other than setting a trap, which will only help if "something else to happen" is a signal being sent to the script.
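
Here is a small sketch of the trap approach. When a trapped signal arrives, wait returns early with a status greater than 128, even though the child is still running. The choice of SIGUSR1 and the helper subshell that sends it are assumptions made for the demonstration.

```shell
got_signal=0
trap 'got_signal=1' USR1

sleep 10 &                        # the child we are waiting for
child=$!
( sleep 1; kill -USR1 "$$" ) &    # stand-in for "something else happening"

status=0
wait "$child" || status=$?        # returns early when the signal arrives

if [ "$status" -gt 128 ] && kill -0 "$child" 2>/dev/null; then
    echo "interrupted by a signal; child $child is still running"
    kill "$child"
fi
trap - USR1
```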

There is also no way to wait for a process that is not your child. You can't hang around the schoolyard and pick up someone else's kids.

How can I check to see if my game server is still running? I'll put a script in crontab, and if it's not running, I'll restart it...

We get that question (in various forms) way too often. A user has some daemon, and they want to restart it whenever it dies. Yes, one could probably write a bash script that would try to parse the output of ps (or preferably pgrep if your system has it), and try to guess which process ID belongs to the daemon we want, and try to guess whether it's not there any more. But that's haphazard and dangerous. There are much better ways.

Most Unix systems already have a feature that allows you to respawn dead processes: init and inittab. If you want to make a new daemon instance pop up whenever the old one dies, typically all you need to do is put an appropriate line into /etc/inittab with the "respawn" action in column 3, and your process's invocation in column 4. Then run telinit q or your system's equivalent to make init re-read its inittab.

Some Unix systems don't have inittab, and some system administrators might want finer control over the daemons and their logging. Those people may want to look into daemontools, or runit.

This leads into the issue of self-daemonizing programs. There was a trend during the 1980s for Unix daemons such as inetd to put themselves into the background automatically. It seems to be particularly common on BSD systems, although it's widespread across all flavors of Unix.

The problem with this is that any sane method of managing a daemon requires that you keep track of it after starting it. If init is told to respawn a command, it simply launches that command as a child, then uses the wait() system call; so, when the child exits, the parent can spawn another one. Daemontools works the same way: a user-supplied run script establishes the environment, and then execs the process, thereby giving the daemontools supervisor direct parental authority over the process, including standard input and output, etc.

If a process double-forks itself into the background (the way inetd and sendmail and named do), it breaks the connection to its parent -- intentionally. This makes it unmanageable; the parent can no longer receive the child's output, and can no longer wait() for the child in order to be informed of its death. And the parent won't even know the new daemon's process ID. The child has run away from home without even leaving a note.

So, the Unix/BSD people came up with workarounds... they created "PID files", in which a long-running daemon would write its process ID, since the parent had no other way to determine it. But PID files are not reliable. A daemon could have died, and then some other process could have taken over its PID, rendering the PID file useless. Or the PID file could simply get deleted, or corrupted. They came up with pgrep and pkill to attempt to track down processes by name instead of by number... but what if the process doesn't have a unique name? What if there's more than one of it at a time, like nfsd or Apache?

These workarounds and tricks are only in place because of the original hack of self-backgrounding. Get rid of that, and everything else becomes easy! Init or daemontools or runit can just control the child process directly. And even the most raw beginner could write their own wrapper script:

   #!/bin/sh
   while :; do
      /my/game/server -foo -bar -baz >> /var/log/mygameserver 2>&1
   done

Then simply arrange for that to be executed at boot time, with a simple & to put it in the background, and voila! An instant one-shot respawn.

Most modern software packages no longer require self-backgrounding; even for those where it's the default behavior (for compatibility with older versions), there's often a switch or a set of switches which allows one to control the process. For instance, Samba's smbd now has a -F switch specifically for use with daemontools and other such programs.

If all else fails, you can try using fghack (from the daemontools package) to prevent the self-backgrounding.

How do I make sure only one copy of my script can run at a time?

First, ask yourself why you think that restriction is necessary. Are you using a temporary file with a fixed name, rather than generating a new temporary file in a secure manner each time? If so, correct that bug in your script. Are you using some system resource without locking it to prevent corruption if multiple processes use it simultaneously? In that case, you should probably use file locking, by rewriting your application in a language that supports it.

The naive answer to this question, which is given all too frequently by well-meaning but inexperienced scripters, would be to run some variant of ps -ef | grep -v grep | grep "$(basename "$0")" | wc -l to count how many copies of the script are in existence at the moment. I won't even attempt to describe how horribly wrong that approach is... if you can't see it for yourself, you'll simply have to take my word for it.

Unfortunately, bash has no facility for locking a file. Bash FAQ #45 contains examples of using a directory, a symlink, etc. as a means of mutual exclusion; but you cannot lock a file directly.

  • I believe you can use (set -C; >lockfile) to atomically create a lockfile, please verify this. (see: Bash FAQ #45) --Andy753421
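
A minimal sketch of the directory approach, in the spirit of Bash FAQ #45 but not copied from it. mkdir either creates the directory or fails, in a single atomic step, so only one process can win. The lock path is a placeholder.

```shell
lockdir=/tmp/myscript.lock    # placeholder path

if mkdir "$lockdir" 2>/dev/null; then
    # We hold the lock; remove it again when the script exits.
    trap 'rmdir "$lockdir"' EXIT
    echo "lock acquired, doing the work"
    # ... the part that must not run concurrently goes here ...
else
    echo "another copy seems to be running (found $lockdir)" >&2
    exit 1
fi
```

If the script is killed with SIGKILL, the trap never runs and the stale directory has to be removed by hand.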

You could also run your program or shell script under the setlock program from the daemontools package. Provided that every invocation uses the same lockfile, this ensures that only one copy of your script runs at a time. Here's an example where we want to make sure that only one "sleep" is running at a given time.

$ setlock -nX lockfile sleep 100 &
[1] 1169
$ setlock -nX lockfile sleep 100
setlock: fatal: unable to lock lockfile: temporary failure

If environmental restrictions require the use of a shell script, then you may be stuck using that. Otherwise, you should seriously consider rewriting the functionality you require in a more powerful language.

I want to process a bunch of files in parallel, and when one finishes, I want to start the next. And I want to make sure there are exactly 5 jobs running at a time.

Many xargs allow running tasks in parallel, including FreeBSD, OpenBSD and GNU (but not POSIX):

find . -print0 | xargs -0 -n 1 -P 5 command

A C program could fork 5 children and manage them closely using select() or similar, to assign the next file in line to whichever child is ready to handle it. But bash has nothing equivalent to select or poll.

In a script, you're reduced to lesser solutions. One way is to divide the job into 5 "equal" parts, and then just launch them all in parallel. Here's an example:

   #!/usr/local/bin/bash
   # Read all the files (from a text file, 1 per line) into an array.
   tmp=$IFS IFS=$'\n' files=($(< inputlist)) IFS=$tmp

   # Here's what we plan to do to them.
   do_it() {
      for f; do [[ -f $f ]] && my_job "$f"; done
   }

   # Divide the list into 5 sub-lists.
   i=0 n=0 a=() b=() c=() d=() e=()
   while ((i < ${#files[*]})); do
       a[n]=${files[i]}
       b[n]=${files[i+1]}
       c[n]=${files[i+2]}
       d[n]=${files[i+3]}
       e[n]=${files[i+4]}
       ((i+=5, n++))
   done

   # Process the sub-lists in parallel
   do_it "${a[@]}" > a.out 2>&1 &
   do_it "${b[@]}" > b.out 2>&1 &
   do_it "${c[@]}" > c.out 2>&1 &
   do_it "${d[@]}" > d.out 2>&1 &
   do_it "${e[@]}" > e.out 2>&1 &
   wait

See reading a file line-by-line and arrays and ArithmeticExpression for explanations of the syntax used in this example.

Even if the lists aren't quite identical in terms of the amount of work required, this approach is close enough for many purposes.

Another approach involves using a named pipe to tell a "manager" when a job is finished, so it can launch the next job. Here is an example of that approach:

   #!/bin/bash

   # FD 3 will be tied to a named pipe.
   mkfifo pipe; exec 3<>pipe

   # This is the job we're running.
   s() {
     echo Sleeping $1
     sleep $1
   }

   # Start off with 3 instances of it.
   # Each time an instance terminates, write a newline to the named pipe.
   { s 5; echo >&3; } &
   { s 7; echo >&3; } &
   { s 8; echo >&3; } &

   # Each time we get a line from the named pipe, launch another job.
   while read; do
     { s $((RANDOM%5+7)); echo >&3; } &
   done <&3

If you need something more sophisticated than these, you're probably looking at the wrong language.

On processes, environments and inheritance

Every process on a Unix system has a parent process (except init), from which it inherits certain things. A process can change some of these things, and not others. You cannot change things inside another process other than by being its parent, or attaching (attacking?) it with a debugger.

It is of paramount importance that you understand this model if you plan to use or administer a Unix system successfully. For example, a user with 10 windows open might wonder why he can't tell all of his shells to change the contents of their PATH variable, short of going to each one individually and running a command. And even then, the changed PATH variable won't be set in the user's window manager or desktop environment, which means any new windows he creates will still get the old variable.

The solution, of course, is that the user needs to edit a shell dot file, then log out and back in, so that his top-level processes will get the new variable, and can pass it along to their children.

Likewise, a system administrator might want to tell her in.ftpd to use a default umask of 002 instead of whatever it's currently using. Achieving that goal will require an understanding of how in.ftpd is launched on her system, either as a child of inetd or as a standalone daemon with some sort of boot script; making the appropriate modifications; and restarting the appropriate daemons, if any.

So, let's take a closer look at how processes are created.

The Unix process creation model revolves around two system calls: fork() and exec(). (There is actually a family of related system calls that begin with exec which all behave in slightly different manners, but we'll treat them all equally for now.) fork() creates a child process which is a duplicate of the parent who called fork() (with a few exceptions). The parent receives the child process's PID (Process ID) number as the return value of the fork() function, while the child gets a "0" to tell it that it's the child. exec() replaces the current process with a different program.

So, the usual sequence is:

  • A program calls fork() and checks the return value of the system call. If the status is greater than 0, then it's the parent process, so it calls wait() on the child process ID (unless we want it to continue running while the child runs in the background).

  • If the status is 0, then it's the child process, so it calls exec() to do whatever it's supposed to be doing.

  • But before that, the child might decide to close() some file descriptors, open() new ones, set environment variables, change resource limits, and so on. All of these changes will remain in effect after the exec() and will affect the task that is executed.

  • If the return value of fork() is negative, something bad happened (we ran out of memory, or the process table filled up, etc.).
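
The same sequence can be observed from the shell itself. In this sketch the parenthesized subshell plays the child (the shell forks to create it), exec replaces it with another program, and the parent collects the exit status with wait. The sh -c 'exit 3' command is just a stand-in for a real program.

```shell
(
    # Child: adjust things after the fork and before the exec.
    cd / || exit 1
    umask 077
    exec sh -c 'exit 3'     # replaces the subshell; nothing after this line runs
) &
child=$!                    # parent: $! holds the child's PID

status=0
wait "$child" || status=$?  # parent: block until the child terminates
echo "child $child exited with status $status"
```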

Let's take an example of a shell command:

echo hello world 1>&2

The process executing this is a shell, which reads commands and executes them. It uses the standard fork()/exec() model to do so. Let's show it step by step:

  • The parent shell calls fork().

  • The parent gets the child's process ID as the return value of fork() and waits for it to terminate.

  • The child gets a 0 from fork() so it knows it's the child.

  • The child is supposed to redirect standard output to standard error (due to the 1>&2 directive). It does this now:

    • Close file descriptor 1.
    • Duplicate file descriptor 2, and make sure the duplicate is FD 1.
  • The child calls exec("echo", "echo", "hello", "world", (char *)NULL) or something similar to execute the command. (Here, we're assuming echo is an external command.)

  • Once the echo terminates, the parent's wait call returns, and the parent resumes normal operation.
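
The steps above can be imitated by hand to show that the redirection happens in the child only. In this sketch a subshell stands in for the forked child, and the two temporary files exist purely so we can see where each line of output ended up.

```shell
outfile=$(mktemp)
errfile=$(mktemp)

{
    # Child: make FD 1 a duplicate of FD 2, then exec the external echo.
    ( exec 1>&2; exec echo hello world )

    # Parent: its own FD 1 was never touched.
    echo "parent still writes to stdout"
} > "$outfile" 2> "$errfile"

echo "stdout got: $(cat "$outfile")"
echo "stderr got: $(cat "$errfile")"
```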

There are other things the child of the shell might do before executing the final command. For example, it might set environment variables:

http_proxy=http://tempproxy:3128/ lynx http://someURL/

In this case, the child will put http_proxy=http://tempproxy:3128/ into the environment before calling exec().
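
A quick way to convince yourself that the variable lands only in the child's environment. Here sh -c stands in for lynx, so the example runs without a browser or a proxy.

```shell
unset http_proxy

# The assignment prefix applies to this one command only.
child_view=$(http_proxy=http://tempproxy:3128/ sh -c 'echo "$http_proxy"')

# Back in the parent shell, the variable was never set.
parent_view=${http_proxy-not set}

echo "child saw:  $child_view"
echo "parent has: $parent_view"
```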

A child process inherits many things from its parent:

  • Open file descriptors. The child gets copies of these, referring to the same files.
  • Environment variables. The child gets its own copies of these, and changes made by the child do not affect the parent's copy.

  • Current working directory. If the child changes its working directory, the parent will never know about it.

  • User ID, group ID and supplementary groups. A child process is spawned with the same privileges as its parent. Unless the child process is running with superuser UID (UID 0), it cannot change these privileges.
  • System resource limits. The child inherits the limits of its parent. A process that runs as superuser UID can raise its resource limits (setrlimit(2)). A process running as non-superuser can only lower its hard limits; it can raise a soft limit, but no higher than the corresponding hard limit.

  • umask.
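
A small sketch of the one-way nature of this inheritance. The command substitution runs in a child process that starts with copies of the parent's directory, umask and environment; whatever it changes stays in the child. GREETING is a made-up variable used only for this demonstration.

```shell
start_dir=$PWD
start_umask=$(umask)
export GREETING=parent

# The child changes its directory, umask and variable, then reports them.
child_report=$(
    cd / && umask 077 && GREETING=child
    echo "$PWD $(umask) $GREETING"
)

# None of that leaked back into the parent.
echo "child:  $child_report"
echo "parent: $PWD $(umask) $GREETING"
```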

An active Unix system may be perceived as a tree of processes, with parent/child relationships shown as vertical ("branch") connections between nodes. For example,

        (init)
           |
        (login)
           |
         startx
           |
         xinit
           |
     bash .xinitrc
     /     |    \
 rxvt    rxvt   fvwm2
  |        |        \
 bash   screen       \____________________
       /   |  \              |      |     \
    bash bash  bash        xclock  xload  firefox ...
           |     |
         mutt  rtorrent

This is a simplified version of an actual set of processes run by one user on a real system. I have omitted many, to keep things readable. The root of the tree, shown as (init), as well as the first child process (login), are running as root (superuser UID 0). Here is how this scenario came about:

  • The kernel (Linux in this case) is hard-coded to run /sbin/init as process number 1 when it has finished its startup. init never dies; it is the ultimate ancestor of every process on the system.

  • init reads /etc/inittab which tells it to spawn some getty processes on some of the Linux virtual terminal devices (among other things).

  • Each getty process presents a bit of information plus a login prompt.

  • After reading a username, getty exec()s /bin/login to read the password. (Thus, getty no longer appears in the tree; it has replaced itself.)

  • If the password is valid, login fork()s the user's login shell (in this case bash). Presumably, it hangs around (instead of using exec()) because it wants to do some clean-up after the user's shell has terminated.

  • The user types exec startx at the bash shell prompt. This causes bash to exec() startx (and therefore the login shell no longer appears in the tree).

  • startx is a wrapper that launches an X session, which includes an X server process (not shown -- it runs as root), and a whole slew of client programs. On this particular system, .xinitrc in the user's home directory is a script that tells which X client programs to run.

  • Two rxvt terminal emulators are launched from the .xinitrc file (in the background using &), and each of them runs a new copy of the user's shell, bash.

    • In one of them, the user has typed exec screen (or something similar) to replace bash with screen. Screen, in turn, has three bash child processes of its own, two of which have terminal-based programs running in them (mutt, rtorrent).

  • The user's window manager, fvwm2, is run in the foreground by the .xinitrc script. A window manager or desktop environment is usually the last thing run by the .xinitrc script; when the WM or DE terminates, the script terminates, and brings down the whole session.

  • The window manager runs several processes of its own (xclock, xload, firefox, ...). It typically has a menu, or icons, or a control panel, or some other means of launching new programs. We will not cover window manager configurations here.

Other parts of a Unix system use similar process trees to accomplish their goals, although few of them are quite as deep or complex as an X session. For example, inetd runs as a daemon which listens on several UDP and TCP ports, and launches programs (ftpd, telnetd, etc.) when it receives network connections. lpd runs as a managing daemon for printer jobs, and will launch children to handle individual jobs when a printer is ready. sshd listens for incoming SSH connections, and launches children when it receives them. Some electronic mail systems (particularly qmail) use relatively large numbers of small processes working together.

Understanding the relationship among a set of processes is vital to administering a system. For example, suppose you would like to change the way your FTP service behaves. You've located a configuration file that it is known to read at startup time, and you've changed it. Now what? You could reboot the entire system to be sure your change takes effect, but most people consider that overkill. Generally, people prefer to restart only the minimal number of processes, thereby causing the least amount of disruption to the other services and the other users of the system.

So, you need to understand how your FTP service starts up. Is it a standalone daemon? If so, you probably have some system-specific way of restarting it (either by running a BootScript, or manually killing and restarting it, or perhaps by issuing some special service management command). More commonly, an FTP service runs under the control of inetd. If this is the case, you don't need to restart anything at all. inetd will launch a fresh FTP service daemon every time it receives a connection, and the fresh daemon will read the changed configuration file every time.

On the other hand, suppose your FTP service doesn't have its own configuration file that lets you make the change you want (for example, changing its umask for the default Permissions of uploaded files). In this case, you know that it inherits its umask from inetd, which in turn gets its umask from whatever boot script launched it. If you would like to change FTP's umask in this scenario, you would have to edit inetd's boot script, and then kill and restart inetd so that the FTP service daemons (inetd's children) will inherit the new value. And by doing this, you are also changing the default umask of every other service that inetd manages! Is that acceptable? Only you can answer that. If not, then you may have to change how your FTP service runs, possibly moving it to a standalone daemon. This is a system administrator's job.


CategoryShell CategoryUnix

ProcessManagement (last edited 2023-08-09 06:29:52 by ormaaj)