Differences between revisions 3 and 21 (spanning 18 versions)
Revision 3 as of 2006-09-08 17:39:41
Size: 5727
Editor: GreyCat
Comment: typo
Revision 21 as of 2008-01-08 16:55:48
Size: 12632
Editor: GreyCat
Comment: Add reference to pgrep. Sigh. Some people are really ignorant.

''First, let's get the easy stuff out of the way.''

= Things you were supposed to know BEFORE you entered #bash =

{{{<JoeNewbie> How do I kill a process by name? I need to get the PID out of ps aux | grep ....
}}}

No, you don't. There's a command called {{{pkill}}} that does exactly what you're trying to do. You might also take a look at the command {{{killall}}} if you're on a legacy GNU/Linux system, but '''be warned''': {{{killall}}} on some systems kills '''every''' process on the entire system. It's best to avoid it unless you ''really'' need it.

(Mac OS X comes with {{{killall}}} but not {{{pkill}}}. To get {{{pkill}}}, go to http://proctools.sourceforge.net/.)

If you just wanted to check for the ''existence'' of a process by name, use {{{pgrep}}}.

{{{<JoeNewbie> How do I run a job in the background?
}}}

{{{command &}}}

{{{<JoeNewbie> My script runs a job in the background. How do I get its PID?
}}}

The {{{$!}}} special parameter holds the PID of the most recently executed background job. You can use that later on in your script to keep track of the job, terminate it, record it in a PID file ''(shudder)'', or whatever.

{{{<JoeNewbie> OK, I have its PID. How do I check that it's still running?
}}}

{{{kill -0 $PID}}} will check to see whether a signal is deliverable (''i.e.'', the process still exists). If you need to check on a single child process asynchronously, that's the best, most portable, most efficient solution. You might also be able to use the {{{wait}}} shell command to block until the child (or children) terminate -- it depends on what your program has to do.
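Putting the last two answers together, here is a minimal sketch. The {{{sleep 30}}} job is only a stand-in for whatever your script really runs, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Start a background job and save its PID right away;
# $! is overwritten by the next background job.
sleep 30 &
pid=$!

# kill -0 sends no signal. It only reports whether one could be delivered.
if kill -0 "$pid" 2>/dev/null; then
    state_before=running
else
    state_before=gone
fi

# Terminate the job with the default SIGTERM, then reap it with wait.
kill "$pid"
wait "$pid"
status=$?    # 128 + signal number when the child died from a signal

if kill -0 "$pid" 2>/dev/null; then
    state_after=running
else
    state_after=gone
fi

echo "before: $state_before, after: $state_after, wait status: $status"
```

Note that {{{kill -0}}} only tells you that ''some'' process with that PID exists and that you may signal it. For your own children that haven't been reaped yet, that's exactly what you want.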

There is no shell scripting equivalent to the {{{select(2)}}} or {{{poll(2)}}} system calls. If you need to manage a complex suite of child processes and events, don't try to do it in a shell script.

{{{<JoeNewbie> I want to run something in the background and then log out
}}}

If you want to be able to reconnect to it later, use {{{screen}}}. Launch screen, then run whatever you want to run in the foreground, and detach screen with '''Ctrl-A d'''. You can reattach to screen (as long as you didn't reboot the server) with {{{screen -x}}}. You can even attach multiple times, and each attached terminal will see (and control) the same thing. This is also great for remote teaching situations.

If you can't or don't want to do that, the traditional approach still works: {{{nohup something &}}}

Bash also has a {{{disown}}} command, if you want to log out with a background job running, and you forgot to {{{nohup}}} it initially.

{{{sleep 1000
Ctrl-Z
bg
disown}}}

{{{<JoeNewbie> I'm trying to kill -9 my job but blah blah blah...
}}}

Whoa! '''Stop right there!''' Do ''not'' use {{{kill -9}}}, ever. For any reason. Unless you ''wrote'' the program to which you're sending the SIGKILL, and ''know'' that you can clean up the mess it leaves. Because you're debugging it.

If a process is not responding to normal signals, it's probably in "state D" (as shown on {{{ps u}}}), which means it's blocked in an uninterruptible sleep inside a system call, usually waiting on I/O. If that's the case, you're probably looking at a dead hard drive, or a dead NFS server, or a kernel bug, or something else along those lines. And you won't be able to kill the process ''anyway'', SIGKILL or not.

If the process is ignoring normal SIGTERMs, then ''get the source code and fix it''!

If you have an employee whose first instinct any time a job needs to be terminated is to break out the fucking howitzers, then fire him. Now.

If you don't understand why this is a case of slicing bread with a howitzer, read [http://partmaps.org/era/unix/award.html#uuk9letter The Useless Use of Kill -9 Award].
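To see what a polite SIGTERM buys you, here is a rough sketch of a worker that cleans up after itself. The {{{worker}}} function, the scratch file and the variable names are all invented for the demonstration; a real program would have its own cleanup to do.

```shell
#!/bin/sh
# Sketch: a worker that removes its scratch file when it receives SIGTERM.
workdir=$(mktemp -d)
scratch=$workdir/scratch

worker() {
    # Install the handler before creating anything that needs cleaning up.
    trap 'rm -f "$scratch"; exit 0' TERM
    : > "$scratch"
    # Sleep in the background and wait on it: the shell runs the trap
    # as soon as wait is interrupted, not only after sleep returns.
    while :; do
        sleep 1 &
        wait $!
    done
}

worker &
worker_pid=$!

# The worker creates the file only after its trap is in place,
# so once the file exists it is safe to send the signal.
while [ ! -e "$scratch" ]; do sleep 1; done

kill "$worker_pid"    # plain SIGTERM: the handler gets to run
wait "$worker_pid"
worker_status=$?

if [ -e "$scratch" ]; then leftover=yes; else leftover=no; fi
echo "worker exited with $worker_status, scratch file left behind: $leftover"
rmdir "$workdir"
```

SIGKILL cannot be trapped. Send {{{kill -9}}} to the same worker and the handler never runs, so the scratch file stays behind for someone else to find.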

=== Make SURE you have run and understood these commands: ===
{{{help kill}}}

{{{help trap}}}

{{{man pkill}}}

{{{man pgrep}}}

''OK, now let's move on to the interesting stuff....''

= Things that actually need answers =

{{{<JoeNewbie> I want to run two jobs in the background, and then wait until they both finish.
}}}

By default, {{{wait}}} waits for all of your shell's children to exit.

{{{job1 &
job2 &
wait}}}
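If you also care ''how'' each job finished, save each PID and wait for them one at a time. A small sketch, where {{{job1}}} and {{{job2}}} are stand-in functions rather than real commands:

```shell
#!/bin/sh
# Two stand-in jobs that finish with different exit statuses.
job1() { sleep 1; return 0; }
job2() { sleep 1; return 3; }

job1 &
pid1=$!
job2 &
pid2=$!

# "wait PID" blocks until that child exits and returns its exit status.
wait "$pid1"
status1=$?
wait "$pid2"
status2=$?

echo "job1 exited with $status1, job2 exited with $status2"
```

This prints {{{job1 exited with 0, job2 exited with 3}}}. The order of the two {{{wait}}} calls doesn't matter; the shell remembers the status of a child that has already exited until you ask for it.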

This is still a work in progress. Expect some rough edges.

{{{<JoeNewbie> How can I check to see if my game server is still running?
<JoeNewbie> I'll put a script in crontab, and if it's not running, I'll restart it...}}}

We get that question (in various forms) way too often. A user has some daemon with a bug, and rather than fix the bug (which admittedly lies well outside the scope of a normal system administrator's purview), they simply want to restart it whenever it dies. And yes, one could probably write a bash script that would try to parse the output of ps (or preferably pgrep if your system has it), and try to guess which process ID belongs to the daemon we want, and try to guess whether it's not there any more. But that's haphazard and dangerous. There are much better ways.

Most Unix systems already have a feature that allows you to respawn dead processes: init and inittab. If you want to make a new daemon instance pop up whenever the old one dies, typically all you need to do is put an appropriate line into /etc/inittab with the "respawn" action in column 3, and your process's invocation in column 4.

Some Unix systems don't have inittab, and some system administrators might want finer control over the daemons and their logging. Those people may want to look into [http://cr.yp.to/daemontools.html daemontools], or [http://smarden.org/runit/ runit].

This leads into the issue of self-daemonizing programs. There was a trend during the 1980s for Unix daemons such as inetd to put themselves into the background automatically. It seems to be particularly common on BSD systems, although it's widespread across all flavors of Unix.

The problem with this is that any sane method of managing a daemon requires that you keep track of it after starting it. If init is told to respawn a command, it simply launches that command as a child, then uses the wait() system call; and when the child exits, the parent can spawn another one. Daemontools works the same way: a user-supplied run script establishes the environment, and then execs the process, thereby giving the daemontools supervisor direct parental authority over the process, including standard input and output, etc.
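The reason {{{exec}}} matters can be shown in a few lines. This is only a sketch: the subshell plays the part of a run script, and {{{sh -c 'echo "$$"'}}} stands in for the daemon. In a real run script, the last line would be something like {{{exec /my/game/server -foo -bar -baz}}}.

```shell
#!/bin/sh
# The parent (playing supervisor) starts a "run script" and records its PID.
out=$(mktemp)
(
    # Environment setup would go here. Then hand the process over:
    exec 2>&1
    exec sh -c 'echo "$$"'    # stand-in daemon: reports its own PID
) > "$out" &
child=$!

wait "$child"
reported=$(cat "$out")
rm -f "$out"

# exec replaces the shell without forking, so the PID does not change.
echo "PID the parent recorded: $child, PID the daemon saw: $reported"
```

Because the two PIDs are the same, the supervisor is the daemon's direct parent: it can {{{wait}}} for it, signal it, and read its output without guessing.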

If a process double-forks itself into the background, it breaks the connection to its parent -- intentionally. This makes it unmanageable; the parent can no longer receive the child's output, and can no longer wait() for the child in order to be informed of its death. And the parent won't even know the new daemon's process ID, so it can't even keep track of it with a simple kill -0.

So, the Unix/BSD people came up with workarounds... they created "PID files", in which a long-running daemon would write its process ID, since the parent had no other way to determine it. But PID files are not reliable. A daemon could have died, and then some other process could have taken over its PID, rendering the PID file useless. Or the PID file could simply get deleted, or corrupted. They came up with pgrep and pkill to attempt to track down processes by name instead of by number... but what if the process doesn't have a unique name? What if there's more than one of it at a time, for example, with nfsd or Apache?

These workarounds and tricks are only in place because of the original hack of self-backgrounding. Get rid of that, and everything else becomes easy! Init or daemontools or runit can just control the child process directly. And even the most raw beginner could write their own wrapper script:

{{{
#!/bin/sh
while true; do
   /my/game/server -foo -bar -baz >> /var/log/mygameserver 2>&1
done
}}}

Then simply arrange for that to be executed at boot time, with a simple & to put it in the background, and voila! An instant one-shot respawn.

Most modern software packages no longer require self-backgrounding; even for those where it's the default behavior (for compatibility with older versions), there's often a switch or a set of switches which allows one to control the process. For instance, Samba's smbd now has a -F switch specifically for use with daemontools and other such programs.

If all else fails, you can try using [http://cr.yp.to/daemontools/fghack.html fghack] (from the daemontools package) to prevent the self-backgrounding.

{{{<JoeNewbie> How do I make sure only one copy of my script can run at a time? }}}

First, ask yourself why you think that restriction is necessary. Are you using a temporary file with a fixed name, rather than [:BashFAQ#faq62:generating a new temporary file in a secure manner] each time? If so, correct that bug in your script. Are you using some system resource without locking it to prevent corruption if multiple processes use it simultaneously? In that case, you should probably use file locking, by rewriting your application in a language that supports it.

The naive answer to this question, which is given all too frequently by well-meaning but inexperienced scripters, would be to run some variant of ps -ef | grep -v grep | grep "$(basename "$0")" | wc -l to count how many copies of the script are in existence at the moment. I won't even attempt to describe how horribly wrong that approach is... if you can't see it for yourself, you'll simply have to take my word for it.

Unfortunately, bash has no facility for locking a file. You can [:BashFAQ#45:use a directory as a lock], but you cannot lock a file directly.
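Here is a minimal sketch of the directory trick. It relies on {{{mkdir}}} either creating the directory or failing, with nothing in between. The lock path and the {{{acquire}}}/{{{release}}} helpers are made up for this example; a real script would use a fixed, well-known path instead of a fresh temporary directory.

```shell
#!/bin/sh
# For the demo, keep the lock inside a private temporary directory.
base=$(mktemp -d)
lockdir=$base/myscript.lock

# mkdir fails if the directory already exists, so only one caller can win.
acquire() { mkdir "$lockdir" 2>/dev/null; }
release() { rmdir "$lockdir"; }

if acquire; then first=acquired; else first=busy; fi

# A second attempt while the lock is held must fail.
if acquire; then second=acquired; else second=busy; fi

release

# Once released, the lock can be taken again.
if acquire; then third=acquired; else third=busy; fi
release

rmdir "$base"
echo "$first $second $third"
```

This prints {{{acquired busy acquired}}}. In a real script you would also want a {{{trap}}} that releases the lock on exit; if the script dies without releasing it, the stale directory blocks every later run until someone removes it by hand.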

''I believe you can use {{{(set -C; >lockfile)}}} to atomically create a lockfile, please verify this. (see: [:BashFAQ#faq45:Bash FAQ #45]) --Andy753421''

You can run any program or shell script under the [http://cr.yp.to/daemontools/setlock.html setlock] program from the daemontools package. Presuming that you use the same lockfile to prevent concurrent or simultaneous execution of your script(s), you have effectively made sure that your script will only run once. Here's an example where we want to make sure that only one "sleep" is running at a given time.

{{{
$ setlock -nX lockfile sleep 100 &
[1] 1169
$ setlock -nX lockfile sleep 100
setlock: fatal: unable to lock lockfile: temporary failure
}}}

If environmental restrictions require the use of a shell script, then you may be stuck using that. Otherwise, you should seriously consider rewriting the functionality you require in a more powerful language.

{{{<JoeNewbie> I want to process a bunch of files, and when one finishes, I want to start the next.
<JoeNewbie> And I want to make sure there are exactly 5 jobs running at a time.}}}

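Many xargs implementations can run tasks in parallel, including those in FreeBSD, OpenBSD and GNU (but this is not required by POSIX). The {{{-P}}} option sets the maximum number of jobs running at once:

{{{
find . -print0 | xargs -0 -n 1 -P 5 command
}}}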

Well, a C program may have the luxury of forking 5 children and managing them closely using select() or similar, to assign the next file in line to whichever child is ready to handle it. That level of detail just isn't practical in a shell script.

In a script, you're typically better off dividing the job into 5 "equal" parts, and then just launching them all in parallel. Here's an example:

{{{
#!/usr/local/bin/bash
# Read all the files (from a text file, 1 per line) into an array.
tmp=$IFS IFS=$'\n' files=($(< inputlist)) IFS=$tmp

# Here's what we plan to do to them.
do_it() {
   for f; do [[ -f $f ]] && my_job "$f"; done
}

# Divide the list into 5 sub-lists.
# Past the end of the list, ${files[i+k]} expands to an empty string;
# do_it skips those entries because its -f test fails.
i=0 n=0 a=() b=() c=() d=() e=()
while ((i < ${#files[*]})); do
    a[n]=${files[i]}
    b[n]=${files[i+1]}
    c[n]=${files[i+2]}
    d[n]=${files[i+3]}
    e[n]=${files[i+4]}
    ((i+=5, n++))
done

# Process the sub-lists in parallel
do_it "${a[@]}" > a.out 2>&1 &
do_it "${b[@]}" > b.out 2>&1 &
do_it "${c[@]}" > c.out 2>&1 &
do_it "${d[@]}" > d.out 2>&1 &
do_it "${e[@]}" > e.out 2>&1 &
wait
}}}

See [:BashFAQ#faq1:reading a file line-by-line] and [:BashFAQ#faq5:arrays] and ArithmeticExpression for explanations of the syntax used in this example.

Even if the lists aren't quite identical in terms of the amount of work required, this approach is close enough for many purposes. Again, if you need something more sophisticated than this, you're looking at the wrong language.

ProcessManagement (last edited 2023-08-09 06:29:52 by ormaaj)