This is still a work in progress. Expect some rough edges.

<<TableOfContents>>

<<Anchor(basics)>>
= The basics =

A '''process''' is a running instance of a program in memory. Every process is identified by a number, called the '''PID''', or '''Process IDentifier'''. Each process has its own privately allocated segment of memory, which is not accessible from any other process. This is where it stores its variables and other data.

The kernel keeps track of all these processes, and stores a little bit of basic metadata about them in a ''process table''. However, for the most part, each process is autonomous, within the privileges allowed to it by the kernel. Once a process has been started, it is difficult to do ''anything'' to it other than suspend (pause) it, or terminate it.

The metadata stored by the kernel includes a process "name" and "command line". These are not reliable; the "name" of a process is whatever you said it is when you ran it, and may have no relationship whatsoever to the program's file name. (On some systems, running processes can also change their own names. For example, sendmail uses this to show its status.) Therefore, when working with a process, you ''must'' know its PID in order to be able to do anything with it. Looking for processes by name is extremely fallible.


<<Anchor(simple)>>
= Simple questions =

== How do I run a job in the background? ==

{{{
command &
}}}

By the way, '`&`' is a command separator in bash and other Bourne shells. It can be used any place '`;`' can (but not ''in addition to'' '`;`' -- you have to choose one or the other). Thus, you can write this:

{{{
command one & command two & command three &
}}}

which runs all three in the background simultaneously, and is equivalent to:

{{{
command one &
command two &
command three &
}}}

Or:

{{{
for i in one two three; do command $i & done
}}}

While both `&` and `;` can be used to separate commands, `&` runs them in the background and `;` runs them in sequence.

== My script runs a job in the background. How do I get its PID? ==

The {{{$!}}} special parameter holds the PID of the most recently executed background job. You can use that later on in your script to keep track of the job, terminate it, record it in a PID file ''(shudder)'', or whatever.

{{{
myjob &
jobpid=$!
}}}

== OK, I have its PID. How do I check that it's still running? ==

{{{kill -0 $PID}}} will check to see whether a signal is deliverable (''i.e.'', the process still exists). If you need to check on a single child process asynchronously, that's the most portable solution. You might also be able to use the {{{wait}}} shell command to block until the child (or children) terminate -- it depends on what your program has to do.

There is no shell scripting equivalent to the {{{select(2)}}} or {{{poll(2)}}} system calls. If you need to manage a complex suite of child processes and events, don't try to do it in a shell script. (That said, there are a few tricks in the [[#advanced|advanced]] section of this page.)
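To make that concrete, here is a minimal sketch combining the two: `kill -0` for an asynchronous liveness check, and `wait` to reap the child and collect its exit status. The `sleep` is just a stand-in for a real job.

```shell
# Start a stand-in background job and remember its PID.
sleep 1 & pid=$!

# Asynchronous check: kill -0 succeeds while the process exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "child $pid is still running"
fi

# Block until the child exits, then collect its status.
wait "$pid"
echo "child exited with status $?"
```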

== I want to do something with a process I started earlier ==

Store the PID when you start the process and use that PID later on:

{{{
    # Bourne
    my child process &
    childpid=$!
}}}

If you're still in the parent process that started the child process you want to do something with, that's perfect. You're guaranteed the PID is your child process (dead or alive), for the reasons [[#parents|explained below]]. You can use `kill` to signal it, terminate it, or just check whether it's still running. You can use `wait` to wait for it to end or to get its exit code if it has ended.

If you're NOT in the parent process that started the child process you want to do something with, that's a shame. Try restructuring your logic so that you can be. If that's not possible, the things you can do are a little more limited and a little more risky.

The parent process that created the child process should've written its PID to some place where you can access it. A PID file is probably the best place. Read the PID in from wherever the parent stored it, and hope that no other process has accidentally taken control of the PID while you weren't looking. You can use `kill` to signal it, terminate it, or just check whether it's still running. You '''cannot''' use `wait` to wait for it or get its exit code; this is only possible from the child's parent process. If you really want to wait for the process to end, you can poll `kill -0`:
{{{
while kill -0 $pid
do
    sleep 1
done
}}}

Everything in the preceding paragraph is risky. The PID in the file may have been recycled before you even read it. The PID could be recycled ''after'' you read it from the file but before you send the termination signal. The PID could be recycled in the middle of your polling loop, leaving you waiting forever on some unrelated process.

If you need to write programs that manage a process without maintaining a parent/child relationship, your best bet is to make sure that ''all'' of those programs run with the same User ID (UID) which is not used by any other programs on the system. That way, if the PID gets recycled, your attempt to query/kill it will fail. This is infinitely preferable to your sending `SIGTERM` to some innocent process.
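A hedged sketch of that defensive check. POSIX `ps -o user=` prints the owner of a PID; here a `sleep` owned by the current user stands in for the managed daemon, and in real use you would compare against the dedicated daemon account's name instead of `id -un`.

```shell
# Stand-in for the managed daemon: a sleep owned by the current user.
sleep 30 & pid=$!

# In real use, $expected would be the dedicated daemon account's name.
expected=$(id -un)
owner=$(ps -o user= -p "$pid" 2>/dev/null)

# Only signal the PID if it still belongs to the expected account.
if [ "$(echo $owner)" = "$expected" ]; then
    kill "$pid"
    echo "signalled $pid"
else
    echo "PID $pid is not ours (owner: ${owner:-gone}); refusing to kill" >&2
fi
```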

== How do I kill a process by name? I need to get the PID out of ps aux | grep .... ==

No, you don't. Firstly, you probably do NOT want to find a process by name AT ALL. Make sure you have the PID of the process and do what the above answer says. If you don't know how to get the PID: Only the process that created your process knows the real PID. It should have stored it in a file for you. If you are IN the parent process, that's even better. Put the PID in a variable ({{{process & mypid=$!}}}) and use that.

If for some silly reason you really want to get to a process purely by name, you understand that this is a broken method, you don't care that this may set your hair on fire, and you want to do it anyway, you should probably use a command called {{{pkill}}}. You might also take a look at the command {{{killall}}} if you're on a legacy GNU/Linux system, but '''be warned''': {{{killall}}} on some systems kills '''every''' process on the entire system. It's best to avoid it unless you ''really'' need it.

(Mac OS X comes with {{{killall}}} but not {{{pkill}}}. To get {{{pkill}}}, go to http://proctools.sourceforge.net/.)

If you just wanted to check for the ''existence'' of a process by name, use {{{pgrep}}}.

Please note that checking/killing processes by name is ''insecure'', because processes can lie about their names, and there is nothing unique about the name of a process.
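With those caveats firmly in mind, a couple of `pgrep` examples for interactive troubleshooting (the `sleep` is a stand-in process started just so there is something to find):

```shell
sleep 30 & pid=$!              # a stand-in process for us to find

pgrep -x sleep                 # -x: match the command name exactly, not as a substring
pgrep -u "$(id -un)" -x sleep  # restrict to your own processes, which is slightly safer

kill "$pid"                    # clean up the stand-in
```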

== But I'm on some old legacy Unix system that doesn't have pgrep! What do I do? ==

As stated above, checking or killing processes by name is an extremely bad idea in the first place. So rather than agonize about shortcut tools like `pgrep` that you don't have, you'd do better to implement some sort of robust process management using the techniques we'll talk about later. But people love shortcuts, so let me fill in some legacy Unix issues and tricks here, '''even though you should not be using such things'''.

A legacy Unix system typically has no tool besides `ps` for inspecting running processes as a human system administrator. People then assume it is an appropriate tool to use in a script, even though it isn't: they fall into the mental trap of thinking that, since `ps` is the only tool the OS provides for troubleshooting runaway processes by hand, it must also be appropriate for setting up services.

There are two entirely different `ps` commands on legacy Unix systems: System V Unix style (`ps -ef`) and BSD Unix style (`ps auxw`). In some slightly-less-old Unix systems, the two different syntaxes are combined, and the presence or absence of a hyphen tells `ps` which set of option letters is being used. (If you ever see `ps -auxw` with a hyphen, throw the program away immediately.) POSIX uses the System V style, and adds a `-o` option to tell `ps` which fields you want, so you don't have to write things like `ps ... | awk '{print $2}'` any more.
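For example, with POSIX `ps`, a trailing `=` after a field name in `-o` also suppresses that column's header:

```shell
ps -o pid= -p $$           # just the PID of the current shell, no header line
ps -o ppid= -p $$          # its parent's PID
ps -o pid= -o comm= -p $$  # PID and command name together
```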

Now, the second ''real'' problem with `ps -ef | grep foo` (after the fact that process names are inherently unreliable) is that there is a RaceCondition in the output. In this pipeline, the `ps` and the `grep` are spawned simultaneously, or nearly so. Depending on just how nearly simultaneously they are spawned, the `grep` process might ''or might not'' show up in the `ps` output. And the `grep foo` command is going to match both processes (the `foo` daemon, or whatever it is, as well as the `grep foo` command itself), assuming both of them show up. You might get just one.

There are two workarounds for that issue. The first is to filter out the `grep` command. This is typically done by running `ps -ef | grep -v grep | grep foo`. Note that the `grep -v` is done ''first'' so that it is not the final command in the pipeline. This is so that the final command in the pipeline is the one whose exit status actually matters. This allows commands like the following to work properly:

{{{
  if ps -ef | grep -v grep | grep -q foo; then
}}}

The second workaround involves crafting a `grep` command that will match the `foo` process but not the `grep` itself. There are many variants on this theme, but one of the most common is:

{{{
  if ps -ef | grep '[f]oo'; then
}}}

You'll likely run into this a few times. The RegularExpression `[f]oo` matches only the literal string `foo`; it does not match the literal string `[f]oo`, and therefore the `grep` command won't be matched either. This approach saves one forked process (the `grep -v`), and some people find it clever.

I've seen one person try to do this:

{{{
  # THIS IS BAD! DO NOT USE THIS!
  if ps -ef | grep -q -m 1 foo; then
}}}

Not only does this use a nonstandard GNU extension (`grep -m` -- stop after M matches), but it completely fails to avoid the race condition. If the race condition produces both `grep` and `foo` lines, there's no guarantee the `foo` one will be first! So, this is even worse than what we started with.

Anyway, these are just explanations of tricks you might see in other people's code, so that you can guess what they're attempting to do. You won't be writing such hacks, I hope.

== I want to run something in the background and then log out. ==

If you want to be able to reconnect to it later, use {{{screen}}} or {{{tmux}}}. Launch either, then run whatever you want to run in the foreground, and detach (screen with '''Ctrl-A d''' and tmux with '''Ctrl-B d'''). You can reattach (as long as you didn't reboot the server) with {{{screen -x}}} to screen and with {{{tmux attach}}} to tmux. You can even attach multiple times, and each attached terminal will see (and control) the same thing. This is also great for remote teaching situations.

If you can't or don't want to do that, the traditional approach still works: {{{nohup something &}}}

Bash also has a {{{disown}}} command, if you want to log out with a background job running, and you forgot to {{{nohup}}} it initially.

{{{
sleep 1000
Ctrl-Z
bg
disown
}}}

If you need to logout of an ssh session with background jobs still running, make sure their file descriptors have been redirected so they aren't holding the terminal open, or [[BashFAQ/063|the ssh client may hang]].
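A sketch of the full detachment, with all three standard file descriptors redirected so nothing keeps the terminal open (`sleep 60` stands in for your real job):

```shell
# nohup ignores SIGHUP; the redirections detach stdin, stdout and
# stderr from the terminal so the ssh session can close cleanly.
nohup sleep 60 > job.log 2>&1 < /dev/null &
echo "started as PID $!"
```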

== I'm trying to kill -9 my job but blah blah blah... ==

Whoa! '''Stop right there!''' Do ''not'' use {{{kill -9}}}, ever. For any reason. Unless you ''wrote'' the program to which you're sending the SIGKILL, and ''know'' that you can clean up the mess it leaves. Because you're debugging it.

If a process is not responding to normal signals, it's probably in "state D" (as shown on {{{ps u}}}), which means it's currently executing a system call. If that's the case, you're probably looking at a dead hard drive, or a dead NFS server, or a kernel bug, or something else along those lines. And you won't be able to kill the process ''anyway'', SIGKILL or not.

If the process is ignoring normal SIGTERMs, then ''get the source code and fix it''!

If you have an employee whose first instinct any time a job needs to be terminated is to break out the fucking howitzers, then fire him. Now.

If you don't understand why this is a case of slicing bread with a chain saw, read [[http://web.archive.org/web/20080801151452/http://speculation.org/garrick/kill-9.html|Who's [sic] idea was this?]] and [[http://partmaps.org/era/unix/award.html#uuk9letter|The UUOK9 Form Letter]].

== Make SURE you have run and understood these commands: ==
 * {{{help kill}}}
 * {{{help trap}}}
 * {{{man pkill}}}
 * {{{man pgrep}}}

''OK, now let's move on to the interesting stuff....''

<<Anchor(advanced)>>
= Advanced questions =

== I want to run two jobs in the background, and then wait until they both finish. ==

By default, {{{wait}}} waits for all of your shell's children to exit.

{{{
job1 &
job2 &
wait
}}}

You can specify one or more jobs (either by PID, or by ''jobspec'' -- see Job Control for that). The `help wait` page is misleading (implying that only one argument may be given); refer to the full Bash manual instead.
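To collect each job's exit status separately, `wait` for each PID individually; the shell remembers the status even if the job has already finished:

```shell
(exit 3) & pid1=$!    # a job that fails with status 3
sleep 1  & pid2=$!    # a job that succeeds

wait "$pid1"; status1=$?
wait "$pid2"; status2=$?
echo "job1: $status1, job2: $status2"   # job1: 3, job2: 0
```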

There is no way to wait for "child process foo to end, OR something else to happen", other than [[SignalTrap|setting a trap]], which will only help if "something else to happen" is a signal being sent to the script.

There is also no way to wait for a process that is not your child. You can't hang around the schoolyard and pick up someone else's kids.

== How can I check to see if my game server is still running? I'll put a script in crontab, and if it's not running, I'll restart it... ==

We get that question (in various forms) ''way'' too often. A user has some daemon, and they want to restart it whenever it dies. Yes, one could probably write a bash script that would try to parse the output of {{{ps}}} (or preferably {{{pgrep}}} if your system has it), and try to ''guess'' which process ID belongs to the daemon we want, and try to ''guess'' whether it's not there any more. But that's haphazard and dangerous. There are much better ways.

Most Unix systems already ''have'' a feature that allows you to respawn dead processes: {{{init}}} and {{{inittab}}}. If you want a new daemon instance to pop up whenever the old one dies, typically all you need to do is put an appropriate line into {{{/etc/inittab}}}, with the "respawn" action in the third field and your process's invocation in the fourth. Then run `telinit q` (or your system's equivalent) to make init re-read its `inittab`.
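An entry of that shape might look like this (the id `fd`, the runlevels, and the daemon path are made up for illustration; check your system's `inittab(5)` for the exact format):

```
# id:runlevels:action:process
fd:2345:respawn:/usr/local/sbin/foodaemon --no-detach
```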

Some Unix systems don't have {{{inittab}}}, and some system administrators might want finer control over the daemons and their logging. Those people may want to look into [[http://cr.yp.to/daemontools.html|daemontools]], or [[http://smarden.org/runit/|runit]].

This leads into the issue of self-daemonizing programs. There was a trend during the 1980s for Unix daemons such as {{{inetd}}} to put themselves into the background automatically. It seems to be particularly common on BSD systems, although it's widespread across all flavors of Unix.

The problem with this is that any sane method of managing a daemon requires that you ''keep track of it after starting it''. If {{{init}}} is told to respawn a command, it simply launches that command as a child, then uses the {{{wait()}}} system call; so, when the child exits, the parent can spawn another one. Daemontools works the same way: a user-supplied {{{run}}} script establishes the environment, and then {{{exec}}}s the process, thereby giving the daemontools supervisor direct parental authority over the process, including standard input and output, etc.

If a process double-forks itself into the background (the way `inetd` and `sendmail` and `named` do), it breaks the connection to its parent -- intentionally. This makes it unmanageable; the parent can no longer receive the child's output, and can no longer {{{wait()}}} for the child in order to be informed of its death. And the parent won't even know the new daemon's process ID. The child has run away from home without even leaving a note.

So, the Unix/BSD people came up with workarounds... they created "PID files", in which a long-running daemon would write its process ID, since the parent had no other way to determine it. But PID files are not reliable. A daemon could have died, and then some other process could have taken over its PID, rendering the PID file useless. Or the PID file could simply get deleted, or corrupted. They came up with {{{pgrep}}} and {{{pkill}}} to attempt to track down processes by name instead of by number... but what if the process doesn't have a unique name? What if there's more than one of it at a time, like {{{nfsd}}} or Apache?

These workarounds and tricks are only in place because of the ''original'' hack of self-backgrounding. Get rid of ''that'', and everything else becomes easy! Init or daemontools or runit can just control the child process directly. And even the most raw beginner could write their own [[WrapperScript|wrapper script]]:

{{{
   #!/bin/sh
   while :; do
      /my/game/server -foo -bar -baz >> /var/log/mygameserver 2>&1
   done
}}}

Then simply arrange for that to be executed at boot time, with a simple {{{&}}} to put it in the background, and ''voila''! An instant one-shot respawn.

Most modern software packages no longer require self-backgrounding; even for those where it's the default behavior (for compatibility with older versions), there's often a switch or a set of switches which allows one to control the process. For instance, Samba's {{{smbd}}} now has a {{{-F}}} switch specifically for use with daemontools and other such programs.

If all else fails, you can try using [[http://cr.yp.to/daemontools/fghack.html|fghack]] (from the daemontools package) to prevent the self-backgrounding.

== How do I make sure only one copy of my script can run at a time? ==

First, ask yourself ''why'' you think that restriction is necessary. Are you using a temporary file with a fixed name, rather than [[BashFAQ/062|generating a new temporary file in a secure manner]] each time? If so, correct that bug in your script. Are you using some system resource without locking it to prevent corruption if multiple processes use it simultaneously? In that case, you should probably use file locking, by rewriting your application in a language that supports it.

The naive answer to this question, which is given all too frequently by well-meaning but inexperienced scripters, would be to run some variant of {{{ps -ef | grep -v grep | grep "$(basename "$0")" | wc -l}}} to count how many copies of the script are in existence at the moment. I won't even attempt to describe how horribly wrong that approach is... if you can't see it for yourself, you'll simply have to take my word for it.

Unfortunately, bash has no facility for locking a file. [[BashFAQ/045|Bash FAQ #45]] contains examples of using a directory, a symlink, etc. as a means of mutual exclusion; but you cannot lock a file directly.

 ''I believe you can use {{{(set -C; >lockfile)}}} to atomically create a lockfile, please verify this. (see: [[BashFAQ/045|Bash FAQ #45]]) --Andy753421''
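Indeed, with `set -C` (''noclobber''), the `>` redirection fails if the file already exists, so the redirection doubles as an atomic test-and-create on a local filesystem (NFS makes no such promise). A sketch of that approach:

```shell
lockfile=/tmp/myscript.lock     # pick a path appropriate for your script

if (set -C; : > "$lockfile") 2>/dev/null; then
    trap 'rm -f "$lockfile"' EXIT   # release the lock when we exit
    echo "lock acquired"
    # ... do the work that must not run concurrently ...
else
    echo "another instance is already running" >&2
    exit 1
fi
```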

You could also run your program or shell script under the [[http://cr.yp.to/daemontools/setlock.html|setlock]] program from the daemontools package. If every instance of your script takes the same lockfile, you have effectively ensured that only one copy can run at a time. Here's an example where we want to make sure that only one "sleep" is running at a given time.

{{{
$ setlock -nX lockfile sleep 100 &
[1] 1169
$ setlock -nX lockfile sleep 100
setlock: fatal: unable to lock lockfile: temporary failure
}}}

If environmental restrictions ''require'' the use of a shell script, then you may be stuck using that. Otherwise, you should ''seriously'' consider rewriting the functionality you require in a more powerful language.

== I want to process a bunch of files in parallel, and when one finishes, I want to start the next. And I want to make sure there are exactly 5 jobs running at a time. ==

Many implementations of `xargs` allow running tasks in parallel, including those of FreeBSD, OpenBSD and GNU (the `-P` option is not in POSIX):

{{{
find . -print0 | xargs -0 -n 1 -P 5 command
}}}

One may also choose to use GNU Parallel (if available) instead of `xargs`, as GNU Parallel makes sure the output from different jobs does not mix.

{{{
find . -print0 | parallel -0 command | use_output_if_needed
}}}

A C program could fork 5 children and manage them closely using {{{select()}}} or similar, to assign the next file in line to whichever child is ready to handle it. But bash has nothing equivalent to `select` or `poll`.

In a script where the loop is very big you can use `sem` from GNU Parallel. Here 10 jobs are run in parallel:

{{{
for i in *.log ; do
  echo "$i"
  [...do other needed stuff...]
  sem -j10 gzip "$i" ";" echo done
done
sem --wait
}}}

If you do not have GNU Parallel installed you're reduced to lesser solutions. One way is to divide the job into 5 "equal" parts, and then just launch them all in parallel. Here's an example:

{{{#!nl
#!/usr/local/bin/bash
# Read all the files (from a text file, 1 per line) into an array.
IFS=$'\n' read -r -d '' -a files < inputlist

# Here's what we plan to do to them.
do_it() {
   for f; do [[ -f $f ]] && my_job "$f"; done
}

# Divide the list into 5 sub-lists.
i=0 n=0 a=() b=() c=() d=() e=()
while ((i < ${#files[*]})); do
    a[n]=${files[i]}
    b[n]=${files[i+1]}
    c[n]=${files[i+2]}
    d[n]=${files[i+3]}
    e[n]=${files[i+4]}
    ((i+=5, n++))
done

# Process the sub-lists in parallel
do_it "${a[@]}" > a.out 2>&1 &
do_it "${b[@]}" > b.out 2>&1 &
do_it "${c[@]}" > c.out 2>&1 &
do_it "${d[@]}" > d.out 2>&1 &
do_it "${e[@]}" > e.out 2>&1 &
wait
}}}

See [[BashFAQ/001|reading a file line-by-line]] and [[BashFAQ/005|arrays]] and ArithmeticExpression for explanations of the syntax used in this example.

Even if the lists aren't quite identical in terms of the amount of work required, this approach is ''close enough'' for many purposes.

Another approach involves using a [[NamedPipes|named pipe]] to tell a "manager" when a job is finished, so it can launch the next job. Here is an example of that approach:

{{{#!nl
#!/bin/bash

# FD 3 will be tied to a named pipe.
mkfifo pipe; exec 3<>pipe

# This is the job we're running.
s() {
  echo Sleeping $1
  sleep $1
}

# Start off with 3 instances of it.
# Each time an instance terminates, write a newline to the named pipe.
{ s 5; echo >&3; } &
{ s 7; echo >&3; } &
{ s 8; echo >&3; } &

# Each time we get a line from the named pipe, launch another job.
while read; do
  { s $((RANDOM%5+7)); echo >&3; } &
done <&3
}}}

If you need something more sophisticated than these, you're probably looking at the wrong language.

== My script runs a pipeline. When the script is killed, I want the pipeline to die too. ==

One approach is to set up a [[SignalTrap|signal handler]] (or an `EXIT` trap) to kill your child processes right before you die. Then, you need the PIDs of the children -- which, in the case of a pipeline, is not so easy. You can use a [[NamedPipes|named pipe]] instead of a pipeline, so that you can collect the PIDs yourself:

{{{#!highlight bash
#!/bin/bash
unset kids
fifo=/tmp/foo$$
trap 'kill "${kids[@]}"; rm -f "$fifo"' EXIT
mkfifo "$fifo" || exit 1
command 1 > "$fifo" & kids+=($!)
command 2 < "$fifo" & kids+=($!)
wait
}}}

This example sets up a FIFO with one writer and one reader, and stores their PIDs in an array named `kids`. The `EXIT` trap sends SIGTERM to them all, removes the FIFO, and exits. See [[BashFAQ/062|Bash FAQ #62]] for notes on the use of temporary files.

Another approach is to enable ''job control'', which allows whole pipelines to be treated as units.

{{{#!highlight bash
#!/bin/bash
set -m
trap 'kill %%' EXIT
command1 | command2 &
wait
}}}

In this example, we enable job control with `set -m`. The `%%` in the `EXIT` trap refers to the ''current job'' (the most recently executed background pipeline qualifies for that). Telling bash to kill the current job takes out the entire pipeline, rather than just the last command in the pipeline (which is what we would get if we had stored and used `$!` instead of `%%`).

<<Anchor(howto)>>
= How to work with processes =

The best way to do process management in Bash is to start the managed process(es) from your script, remember its PID, and use that PID to do things with your process later on.

'''If at ALL possible, AVOID `ps`, `pgrep`, `killall`, and any other process table parsing tools.''' These tools have no clue what process YOU WANT to talk to. They only guess at it based on filtering unreliable information. These tools may work fine in your little test environment, they may work fine in production for a while, but ''inevitably'' they WILL fail, because they ARE a broken approach to process management.

<<Anchor(parents)>>
== PIDs and parents ==

In UNIX, processes are identified by a number called a PID (for Process IDentifier). Each running process has a unique identifier. You cannot reliably determine when or how a process was started purely from the identifier number: for all intents and purposes, it is ''random''.

Each UNIX process also has a ''parent process''. This parent process is the process that started it, but can change to the `init` process if the parent process ends before the new process does. (That is, `init` will pick up orphaned processes.) Understanding this parent/child relationship is vital because it is the key to reliable process management in UNIX. A process's PID will NEVER be freed up for use after the process dies UNTIL the parent process `wait`s for the PID to see whether it ended and retrieve its exit code. If the parent ends, the process is returned to `init`, which does this for you.

This is important for one major reason: if the parent process manages its child process, it can be absolutely certain that, even if the child process dies, no other new process can accidentally recycle the child process's PID until the parent process has `wait`ed for that PID and noticed the child died. This gives the parent process the guarantee that the PID it has for the child process will ALWAYS point to that child process, whether it is alive or a "zombie". Nobody else has that guarantee.
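You can see that guarantee at work: even after a background child has died, the parent can still `wait` for its PID and recover the exit status, because the shell holds onto it until you ask.

```shell
(exit 7) & pid=$!    # a child that exits immediately with status 7
sleep 1              # by now the child is long dead

wait "$pid"          # only the parent can do this
echo "the child's exit code was $?"   # the child's exit code was 7
```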

== The risk of letting the parent die ==

Why is this all so important? Why should you care? Consider what happens if we use a "PID file". Assume the following sequence of events:
 1. You're a boot script (for example, one in `/etc/init.d`). You are told to start the foodaemon.
 1. You start a foodaemon child process in the background and grab its PID.
 1. You write this PID to a file.
 1. You exit, assuming you've done your job.
 1. Later, you're started up again and told to kill the foodaemon.
 1. You look for the child process's PID in a file.
 1. You send the `SIGTERM` signal to this PID, telling it to clean up and exit.
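A sketch of that fragile sequence as a script, with `sleep 300` standing in for the real foodaemon; this illustrates the pattern, it does not endorse it:

```shell
#!/bin/sh
pidfile=/tmp/foodaemon.pid   # hypothetical location

case $1 in
start)
    sleep 300 &              # step 2: start the "daemon" in the background...
    echo $! > "$pidfile"     # step 3: ...and write its PID to a file
    ;;
stop)
    pid=$(cat "$pidfile")    # step 6: read the PID back later...
    kill "$pid"              # step 7: ...and hope no one else has it now
    rm -f "$pidfile"
    ;;
esac
```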

There is absolutely no way you can be certain that the process you told to exit is actually the one you started. The process you wanted to check up on ''could'' have died, and another random new process could easily have recycled its PID once `init` released it.

== The risk of parsing the process tree ==

UNIX comes with a set of handy tools, among which is `ps`. This is a very helpful utility that you can use from the command line to get an overview of what processes are running on your box and what their status is.

All too many people, however, assume that computers and humans work the same way. They think that "''I can read `ps` and see if my process is in there, why shouldn't my script do the same?''". Here's why: You are (hopefully) smarter than your script. You see `ps` output and you see all sorts of information in context. Your brain determines, "''Is this the process I'm looking for?''" and based on what you see it ''guesses'' "''Yeah, it looks like it.''". Firstly, your script can't process context the way your brain can (no, `awk`'ing out column 4 and seeing if that contains your process's command name isn't good enough). Secondly, even if it could do a good job, your script shouldn't be doing any guessing whatsoever. It shouldn't need to.

`ps` output is unpredictable, highly OS-dependent, and not built for parsing. It is next to impossible for your script to distinguish `ping` as a command name from another process's command line which may contain a similar word like `piping`, or a user named `ping`, etc.

The same goes for almost any other tool that parses the process list. Some are worse than others, but in the end, they all do the ''wrong thing''.

== Doing it right ==

As mentioned before, the right way to do something with your child process is by using its PID, preferably (if at all possible) from the parent process that created it.

You may have come here hoping for a quick hint on how to finish your script only to find these recommendations don't apply to any of your existing code or setup. That's probably ''not'' because your code or setup is an exception and you should disregard this; but more likely because you need to take the time and ''re-evaluate'' your existing code or setup and rework it. This will require you to think for a moment. Take that moment and do it right.

=== Starting a process and remembering its PID ===

To start a process asynchronously (so the main script can continue while the process runs in the "background"), use the `&` operator. To get the PID that was assigned to it, expand the `!` parameter. You can, for example, save it in a variable:

{{{
    # Bourne shell
    myprocess -o myfile -i &
    mypid=$!
}}}

=== Checking up on your process or terminating it ===

At a later time, you may be interested in whether your process is still running and if it is, you may decide it's time to terminate it. If it's not running anymore, you may be interested in its exit code to see whether it experienced a problem or ended successfully.

To [[SignalTrap|send a process a signal]], we use the `kill` command. Signals can be used to tell a process to do something, but `kill` can also be used to check if the process is still alive:

{{{
    # Bourne
    kill -0 $mypid && echo "My process is still alive."
    kill $mypid ; echo "I just asked my process to shut down."
}}}

`kill` sends the `SIGTERM` signal, by default. This tells a program it's time to terminate. You can use the `-0` option to kill if you don't want to terminate the process but just check up on whether it's still running. In either case, the `kill` command will have a `0` exit code (success) if it managed to send the signal (or found the process to still be alive).

Unless you intend to send a very specific signal to a process, do not use any other `kill` options; in particular, ''avoid using `-9` or `SIGKILL` at all cost''. The `KILL` signal is a very dangerous signal to send to a process and using it is almost always a bug. Send the default `SIGTERM` instead and have patience.

To wait for a child process to finish or to read in the exit code of a process that you know has already finished (because you did a `kill -0` check, for example), use the `wait` built-in command:

{{{
    # Bash
    night() { sleep 10; } # Define 'night' as a function that takes 10 seconds.
                          # Adjust seconds according to current season and latitude
                          # for a more realistic simulation.

    night & nightpid=$!
    sheep=0
    while sleep 1; do
        kill -0 $nightpid || break # Break the loop when we see the process has gone away.
        echo "$(( ++sheep )) sheep jumped over the fence."
    done

    wait $nightpid; nightexit=$?
    echo "The night ended with exit code $nightexit. We counted $sheep sheep."
}}}

=== Starting a "daemon" and checking whether it started successfully ===

This is a very common request. The problem is that there ''is no answer!'' There is no general definition of "the daemon started up successfully", and even if your specific daemon has a meaningful definition for that phrase, it is so completely daemon-specific that there is no generic way for us to tell you how to check for that condition.

What people generally resort to, in an attempt to provide something "good enough", is: "''Start the daemon, wait a few seconds, check whether the daemon process is still running, and if so, assume it's doing the right thing.''" This is a lousy check -- it can be defeated by a stressed kernel, timing issues, latency or delays in the daemon's operations, and many other conditions -- but ignoring that, let's see how we would implement it if we actually wanted to:

{{{
    # Bash
    mydaemon -i eth0 & daemonpid=$!
    sleep 2
    if kill -0 $daemonpid ; then
        echo "Daemon started successfully. I think."
    else
        wait $daemonpid; daemonexit=$?
        echo "Daemon process disappeared. I suppose something may have gone wrong. Its exit code was $daemonexit."
    fi
}}}

To be honest, this problem is much better solved by doing a daemon-specific check. For example, say you're starting a web server called `httpd`. The sensible thing to check in order to determine whether the web server started successfully... is whether it's actually serving your web content! Who'd have thought!

{{{
    # Bourne(?)
    httpd -h 127.0.0.1 & httpdpid=$!
    while sleep 1; do
        nc -z 127.0.0.1 80 && break # See if we can establish a TCP connection to port 80.
    done

    echo "httpd ready for duty."
}}}

If something goes wrong, though, this will wait forever trying to connect to port 80. So let's also check whether `httpd` died unexpectedly, or whether a timeout elapsed:

{{{
    # Bash
    httpd -h 127.0.0.1 & httpdpid=$!
    time=0 timeout=60
    while sleep 1; do
        nc -z 127.0.0.1 80 && break # See if we can establish a TCP connection to port 80.

        # Connection not yet available.
        if ! kill -0 $httpdpid; then
            wait $httpdpid; httpdexit=$?
            echo "httpd died unexpectedly with exit code: $httpdexit"
            exit $httpdexit
        fi
        if (( ++time > timeout )); then
        echo "httpd hasn't become ready after $time seconds. Something must've gone wrong."
            # kill $httpdpid; wait $httpdpid # You could terminate httpd here, if you like.
            exit
        fi
    done

    echo "httpd ready for duty."
}}}

<<Anchor(theory)>>
= On processes, environments and inheritance =

Every process on a Unix system (except `init`) has a parent process from which it inherits certain things. A process can change some of these things, and not others. You cannot change things inside another process other than by being its parent, or attaching (attacking?) it with a debugger.

It is of paramount importance that you understand this model if you plan to use or administer a Unix system successfully. For example, a user with 10 windows open might wonder why he can't tell all of his shells to change the contents of their PATH variable, short of going to each one individually and running a command. And even then, the changed PATH variable won't be set in the user's window manager or desktop environment, which means any ''new'' windows he creates will still get the old variable.

The solution, of course, is that the user needs to edit a shell [[DotFiles|dot file]], then logout and back in, so that his top-level processes will get the new variable, and can pass it along to their children.

Likewise, a system administrator might want to tell her `in.ftpd` to use a default [[Permissions#umask|umask]] of 002 instead of whatever it's currently using. Achieving that goal will require an understanding of how `in.ftpd` is launched on her system, either as a child of `inetd` or as a standalone daemon with some sort of [[BootScript|boot script]]; making the appropriate modifications; and restarting the appropriate daemons, if any.

So, let's take a closer look at how processes are created.

The Unix process creation model revolves around two system calls: `fork()` and `exec()`. (There is actually a family of related system calls that begin with `exec` which all behave in slightly different manners, but we'll treat them all equally for now.) `fork()` creates a child process which is a ''duplicate'' of the parent who called `fork()` (with a few exceptions). The parent receives the child process's PID (Process ID) number as the return value of the `fork()` function, while the child gets a "0" to tell it that it's the child. `exec()` replaces the current process with a different program.

So, the usual sequence is:

 * A program calls `fork()` and checks the return value of the system call. If the status is greater than 0, then it's the parent process, so it calls `wait()` on the child process ID (unless we want it to continue running while the child runs in the background).
 * If the status is 0, then it's the child process, so it calls `exec()` to do whatever it's supposed to be doing.
 * But before that, the child might decide to `close()` some file descriptors, `open()` new ones, set environment variables, change resource limits, and so on. All of these changes will remain in effect after the `exec()` and will affect the task that is executed.
 * If the return value of `fork()` is negative, something bad happened (we ran out of memory, or the process table filled up, etc.).

Let's take an example of a shell command:

{{{
echo hello world 1>&2
}}}

The process executing this is a shell, which reads commands and executes them. For external commands, it uses the standard `fork()`/`exec()` model to do so. Let's show it step by step:

 * The parent shell calls `fork()`.
 * The parent gets the child's process ID as the return value of `fork()` and waits for it to terminate.
 * The child gets a 0 from `fork()` so it knows it's the child.
 * The child is supposed to redirect standard output to standard error (due to the `1>&2` directive). It does this now:
  * Close file descriptor 1.
  * Duplicate file descriptor 2, and make sure the duplicate is FD 1.
 * The child calls `exec("echo", "echo", "hello", "world", (char *)NULL)` or something similar to execute the command. (Here, we're assuming `echo` is an external command.)
 * Once the `echo` terminates, the parent's `wait` call also terminates, and the parent resumes normal operation.

There are other things the child of the shell might do before executing the final command. For example, it might set environment variables:

{{{
http_proxy=http://tempproxy:3128/ lynx http://someURL/
}}}

In this case, the child will put `http_proxy=http://tempproxy:3128/` into the environment before calling `exec()`. The parent's environment is unaffected.
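You can verify this parent/child split without `lynx`, using `sh -c` as a stand-in child process (the proxy URL is just the example value from above, and we assume `http_proxy` is not already set in your shell):

{{{
    # Bourne
    http_proxy=http://tempproxy:3128/ sh -c 'echo "child sees: $http_proxy"'
    echo "parent sees: ${http_proxy:-<nothing>}"
}}}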

A child process inherits many things from its parent:

 * Open file descriptors. The child gets copies of these, referring to the same files.
 * Environment variables. The child gets its own copies of these, and [[BashFAQ/060|changes made by the child do not affect the parent's copy]].
 * Current working directory. If the child changes its working directory, [[BashFAQ/060|the parent will never know about it]].
 * User ID, group ID and supplementary groups. A child process is spawned with the same privileges as its parent. Unless the child process is running with superuser UID (UID 0), it cannot change these privileges.
 * System resource limits. The child inherits the limits of its parent. A process that runs as superuser UID can raise its resource limits (`setrlimit(2)`). A process running as non-superuser can only lower its resource limits; it can't raise them.
 * [[Permissions#umask|umask]].
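Two of these inherited attributes are easy to observe directly, because a `( ... )` subshell is itself a child process: it can change its working directory and umask without the parent ever noticing:

{{{
    # Bourne
    cd /tmp
    ( cd / && umask 077 )   # the child changes directory and umask...
    pwd                     # ...but the parent still prints /tmp
    umask                   # ...and still reports its old umask
}}}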

An active Unix system may be perceived as a ''tree'' of processes, with parent/child relationships shown as vertical ("branch") connections between nodes. For example,

{{{
 (init)
    |
 (login)
    |
  startx
    |
  xinit
    |
 bash .xinitrc
  /    |    \
rxvt  rxvt  fvwm2
 |     |      \
bash  screen   \____________________
      / | \       |       |      \
   bash bash bash xclock  xload  firefox ...
        |    |
      mutt  rtorrent
}}}

This is a simplified version of an actual set of processes run by one user on a real system. I have omitted many to keep things readable. The root of the tree, shown as `(init)`, as well as the first child process `(login)`, are running as root (superuser UID 0). Here is how this scenario came about:

 * The kernel (Linux in this case) is hard-coded to run `/sbin/init` as process number 1 when it has finished its startup. `init` never dies; it is the ultimate ancestor of every process on the system.
 * `init` reads `/etc/inittab` which tells it to spawn some `getty` processes on some of the Linux virtual terminal devices (among other things).
 * Each `getty` process presents a bit of information plus a login prompt.
 * After reading a username, `getty` `exec()`s `/bin/login` to read the password. (Thus, `getty` no longer appears in the tree; it has replaced itself.)
 * If the password is valid, `login` `fork()`s the user's login shell (in this case bash). Presumably, it hangs around (instead of using `exec()`) because it wants to do some clean-up after the user's shell has terminated.
 * The user types `exec startx` at the bash shell prompt. This causes bash to `exec()` `startx` (and therefore the login shell no longer appears in the tree).
 * `startx` is a wrapper that launches an X session, which includes an X server process (not shown -- it runs as root), and a whole slew of client programs. On this particular system, `.xinitrc` in the user's home directory is a script that tells which X client programs to run.
 * Two `rxvt` terminal emulators are launched from the `.xinitrc` file (in the background using `&`), and each of them runs a new copy of the user's shell, bash.
  * In one of them, the user has typed `exec screen` (or something similar) to replace bash with screen. Screen, in turn, has three bash child processes of its own, two of which have terminal-based programs running in them (mutt, rtorrent).
 * The user's window manager, `fvwm2`, is run in the foreground by the `.xinitrc` script. A window manager or desktop environment is usually the last thing run by the `.xinitrc` script; when the WM or DE terminates, the script terminates, and brings down the whole session.
 * The window manager runs several processes of its own (xclock, xload, firefox, ...). It typically has a menu, or icons, or a control panel, or some other means of launching new programs. We will not cover window manager configurations here.

Other parts of a Unix system use similar process trees to accomplish their goals, although few of them are quite as deep or complex as an X session. For example, `inetd` runs as a daemon which listens on several UDP and TCP ports, and launches programs (`ftpd`, `telnetd`, etc.) when it receives network connections. `lpd` runs as a managing daemon for printer jobs, and will launch children to handle individual jobs when a printer is ready. `sshd` listens for incoming SSH connections, and launches children when it receives them. Some electronic mail systems (particularly [[CategoryQmail|qmail]]) use relatively large numbers of small processes working together.

Understanding the relationship among a set of processes is vital to administering a system. For example, suppose you would like to change the way your FTP service behaves. You've located a configuration file that it is known to read at startup time, and you've changed it. Now what? You could reboot the entire system to be sure your change takes effect, but most people consider that overkill. Generally, people prefer to restart only the minimal number of processes, thereby causing the least amount of disruption to the other services and the other users of the system.

So, you need to understand how your FTP service starts up. Is it a standalone daemon? If so, you probably have some system-specific way of restarting it (either by running a BootScript, or manually killing and restarting it, or perhaps by issuing some special service management command). More commonly, an FTP service runs under the control of `inetd`. If this is the case, you don't need to restart anything at all. `inetd` will launch a fresh FTP service daemon every time it receives a connection, and the fresh daemon will read the changed configuration file every time.

On the other hand, suppose your FTP service doesn't have its own configuration file that lets you make the change you want (for example, changing its umask for the default [[Permissions]] of uploaded files). In this case, you know that it inherits its umask from `inetd`, which in turn gets its umask from whatever boot script launched it. If you would like to change FTP's umask in this scenario, you would have to edit `inetd`'s boot script, and then kill and restart `inetd` so that the FTP service daemons (`inetd`'s children) will inherit the new value. And by doing this, you are also changing the default umask of every ''other'' service that `inetd` manages! Is that acceptable? Only you can answer that. If not, then you may have to change how your FTP service runs, possibly moving it to a standalone daemon. This is a system administrator's job.
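The umask's journey from boot script to `inetd` to each service daemon can be imitated with nested shells, since the umask survives every `fork()`/`exec()` no matter how deep the tree gets (an illustrative sketch only; a real change happens in `inetd`'s boot script):

{{{
    # Bourne -- nested shells standing in for boot script -> inetd -> ftpd
    umask 002
    sh -c 'sh -c "umask"'   # the "grandchild" still reports 002 (formatting varies by shell)
}}}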

----
CategoryShell CategoryUnix

ProcessManagement (last edited 2023-08-09 06:29:52 by ormaaj)