Diff for "BashFAQ/079"

Differences between revisions 11 and 18 (spanning 7 versions)

How can I grep for lines containing foo AND bar, foo OR bar? Or for files containing foo AND bar, possibly on separate lines?

This is really three different questions, so we'll break this answer into three parts.

foo AND bar on the same line

The easiest way to match lines that contain both foo AND bar is to use two grep commands:

grep foo | grep bar
grep foo "$myfile" | grep bar   # for those who need the hand-holding

It can also be done with one egrep, although (as you can probably guess) this doesn't really scale well to more than two patterns:

egrep 'foo.*bar|bar.*foo'

If you prefer, you can achieve this in one sed or awk statement:

sed -n '/foo/{/bar/p;}'
awk '/foo/ && /bar/'
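
As a quick sanity check, the following sketch runs each of the commands above against the same input. The `sample` function and its five lines are made up for illustration; the `;` before `}` in the sed command is there because some sed implementations appear to require it.

```shell
# Hypothetical sample input, not part of the original answer.
sample() {
  printf '%s\n' 'foo then bar' 'bar then foo' 'only foo' 'only bar' 'neither'
}

# Each of these prints the same two lines:
#   foo then bar
#   bar then foo
sample | grep foo | grep bar
sample | grep -E 'foo.*bar|bar.*foo'
sample | sed -n '/foo/{/bar/p;}'
sample | awk '/foo/ && /bar/'
```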

If you need to scale the awk solution to an arbitrary number of patterns, you can construct the awk command on the fly:

# bash, ksh93
# Constructs awk "/$1/&&/$2/&&...."
# Data to be matched should be on stdin.
# Writes matching lines to stdout.
multimatch() {
  (($# < 2)) && { echo "usage: multimatch pat1 pat2 [...]" >&2; return 1; }
  awk "/$1/$(printf "&&/%s/" "${@:2}")"
}

Or, POSIX version:

# POSIX
multimatch() {
  [ $# -lt 2 ] && { echo "usage: multimatch pat1 pat2 [...]" >&2; return 1; }
  __p1=$1
  shift
  awk "/$__p1/$(printf "&&/%s/" "$@")"
}

Alas, POSIX functions do not have local variables. Also, both of these fail if any of the patterns contain slash characters. (Fixing that is left as an exercise for the reader.)
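
To see why slashes are a problem, it may help to print the awk program these functions build instead of running it. The sketch below reuses the same printf construction; `build_program` is a hypothetical helper name introduced only for this illustration.

```shell
# POSIX
# Hypothetical helper: prints the awk program text that multimatch would run.
build_program() {
  __p1=$1
  shift
  printf '%s\n' "/$__p1/$(printf "&&/%s/" "$@")"
}

build_program foo bar baz   # prints /foo/&&/bar/&&/baz/
build_program a/b c         # prints /a/b/&&/c/ : the slash ends the first regex early
```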

Here is a POSIX version that passes the patterns to awk as arguments instead of embedding them in the awk script, so a slash in a pattern is not a problem:

# POSIX
multimatch() { 
  awk 'BEGIN{for(i=1;i<ARGC;i++) a[i]=ARGV[i]; ARGC=1} {for (i in a) if ($0 !~ a[i]) next; print}' "$@"
}
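
A short usage sketch for the argument-based function, with a pattern that contains slashes. The function body is repeated so the snippet stands alone; the input lines are made up.

```shell
# POSIX
# Copy of the argument-based multimatch, repeated so this snippet is self-contained.
multimatch() {
  awk 'BEGIN{for(i=1;i<ARGC;i++) a[i]=ARGV[i]; ARGC=1} {for (i in a) if ($0 !~ a[i]) next; print}' "$@"
}

# Prints the first and third input lines, the only ones matching both patterns:
#   /usr/bin/foo bar
#   bar /usr/bin
printf '%s\n' '/usr/bin/foo bar' '/usr/lib/foo' 'bar /usr/bin' |
  multimatch '/usr/bin' 'bar'
```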

foo OR bar on the same line

There are lots of ways to match lines containing foo OR bar. grep can be given multiple patterns with -e:

grep -e 'foo' -e 'bar'

Or you can construct one pattern:

grep 'foo\|bar'
egrep 'foo|bar'
grep -E 'foo|bar'

Note the difference in syntax: when you call egrep (which is the same as grep -E) you enable Extended Regular Expression (ERE) syntax. Plain grep defaults to Basic Regular Expression (BRE) syntax, in which alternation must be written as \| (an extension supported by GNU grep; POSIX BREs define no alternation operator), and a few other operators, such as grouping \( \) and intervals \{ \}, likewise need a backslash. Many people find this ugly or confusing, because in most other contexts a backslash-escaped character means a literal, and therefore tend to reflexively use egrep/ERE syntax whenever the RE is more than a simple keyword.
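
The difference is easy to see on a small input. In this sketch the `sample` function and its four lines are invented for the demonstration.

```shell
# Hypothetical sample input; the third line contains a literal | character.
sample() { printf '%s\n' 'foo' 'bar' 'foo|bar' 'baz'; }

# BRE: an unescaped | is an ordinary character, so only the literal
# text "foo|bar" matches. Prints one line: foo|bar
sample | grep 'foo|bar'

# ERE: | is the alternation operator. Prints three lines: foo, bar, foo|bar
sample | grep -E 'foo|bar'
```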

It can also be done with sed, awk, etc.

awk '/foo|bar/'

The awk approach has the advantage of letting you use awk's other features on the matched lines, such as extracting only certain fields.
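
For instance, a minimal sketch that prints only the first field of each matching line. The three-column input is made up.

```shell
# Hypothetical sample input: name, keyword, number.
sample() { printf '%s\n' 'alice foo 10' 'bob baz 20' 'carol bar 30'; }

# Print field 1 of every line that contains foo or bar.
# Prints:
#   alice
#   carol
sample | awk '/foo|bar/ { print $1 }'
```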

To match lines that do not contain "foo" AND do not contain "bar":

grep -E -v 'foo|bar'
# some people prefer egrep -v 'foo|bar'

foo AND bar in the same file, not necessarily on the same line

If you want to match files (rather than lines) that contain both "foo" and "bar", there are several possible approaches. The simplest (although not necessarily the most efficient) is to read the file twice:

grep -q foo "$myfile" && grep -q bar "$myfile" && echo "Found both"

The double grep -q solution has the advantage of stopping each read whenever it finds a match; so if you have a huge file, but the matched words are both near the top, it will only read the first part of the file. Unfortunately, if the matches are near the bottom (worst case: very last line of the file), you may read the whole file two times.
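
The same test can be wrapped in a function and applied to several files. This is only a sketch: `has_both` is a hypothetical helper name, and the `./*.txt` glob is an assumption about where the files live.

```shell
# POSIX
# Hypothetical helper: succeeds if the file named by $1 contains both foo and bar.
has_both() {
  grep -q foo "$1" && grep -q bar "$1"
}

# Print the name of every regular .txt file in the current directory
# that contains both words.
for f in ./*.txt; do
  [ -f "$f" ] || continue   # skip the unexpanded glob when nothing matches
  if has_both "$f"; then
    printf '%s\n' "$f"
  fi
done
```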

Another approach is to read the file once, keeping track of what you've seen as you go along. In awk:

awk '/foo/{a=1} /bar/{b=1} a&&b{print "both found";exit} END{if (a&&b){ exit 0} else{exit 1}}'

It reads the file one time, stopping when both patterns have been matched. No matter what happens, the END block is then executed, and the exit status is set accordingly.
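
Because the result is reported through the exit status, the command can be used directly in an if statement. The sketch below is a quiet variant that drops the print and reads standard input; `both_seen` is a hypothetical name.

```shell
# POSIX
# Hypothetical quiet variant: no output, exit status 0 only if both patterns were seen.
both_seen() {
  awk '/foo/{a=1} /bar/{b=1} a&&b{exit} END{exit !(a&&b)}'
}

# awk stops reading after the second line, then END sets the status.
if printf '%s\n' 'a foo' 'a bar' 'never read' | both_seen; then
  echo "Found both"      # this branch runs for the input above
else
  echo "Missing at least one"
fi
```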

If you want to do additional checking of the file's contents, this awk solution can be adapted quite easily.
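
As one possible adaptation, the sketch below also records the line number at which each pattern first appears. The function name `first_lines` and the output format are invented for this example.

```shell
# POSIX
# Hypothetical adaptation: report where foo and bar first appear on stdin.
first_lines() {
  awk '
    /foo/ && !a { a = NR }   # line number of the first foo
    /bar/ && !b { b = NR }   # line number of the first bar
    a && b { exit }          # stop reading once both have been seen
    END {
      if (a && b) { printf "foo: line %d, bar: line %d\n", a, b; exit 0 }
      exit 1
    }'
}

printf '%s\n' x foo y bar z | first_lines   # prints: foo: line 2, bar: line 4
```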

- Revision 11 as of 2008-11-22 14:09:10
  Size: 2759
  Editor: localhost
  Comment: converted to 1.6 markup
+ Revision 18 as of 2011-04-12 09:33:19
  Size: 4278
  Editor: c-69-181-152-24
  Comment: Can't use | with BREs?  Nonsense.
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
+This is really three different questions, so we'll break this answer into three parts.

=== foo AND bar on the same line ===
-Line 37:
+Line 40:
-To match lines containing foo OR bar, {{{egrep}}} is the natural choice, but it can also be done with {{{sed}}}, {{{awk}}}, etc.
+Or, POSIX version:
{{{
# POSIX
multimatch() {
  [ $# -lt 2 ] && { echo "usage: multimatch pat1 pat2 [...]" >&2; return 1; }
  __p1=$1
  shift
  awk "/$__p1/$(printf "&&/%s/" "$@")"
}
}}}

Alas, POSIX functions do not have local variables.  Also, both of these fail if any of the patterns contain slash characters.  (Fixing that is left as an exercise for the reader.)

A POSIX version that doesn't embed the regexes into the awk script.
{{{
# POSIX
multimatch() { 
  awk 'BEGIN{for(i=1;i<ARGC;i++) a[i]=ARGV[i]; ARGC=1} {for (i in a) if ($0 !~ a[i]) next; print}' "$@"
}
}}}

=== foo OR bar on the same line ===

There are lots of ways to match lines containing foo OR bar.  `grep` can be given multiple patterns with `-e`:
-Line 40:
+Line 66:
+grep -e 'foo' -e 'bar'
}}}

Or you can construct one pattern:

{{{
grep 'foo\|bar'
-Line 41:
+Line 74:
-# some people prefer grep -E 'foo|bar'
+grep -E 'foo|bar'
}}}
-Line 43:
+Line 77:
-# This is another option, some people prefer:
grep -e 'foo' -e 'bar'
+Note the difference in syntax: when you call `egrep` (which is the same as `grep -E`) you enable [[RegularExpression|Extended Regular Expression]] (ERE) syntax.  Plain `grep` defaults to [[RegularExpression|Basic Regular Expression]] (BRE) syntax which requires that you express the `|` operator as `\|` and likewise for all the other RE operators.  Many people find this ugly or confusing due to the fact that in most other contexts a backslash-escaped character means a literal, and therefore tend to reflexively use `egrep`/ERE syntax whenever the RE is more than a simple keyword.
-Line 46:
+Line 79:
-# awk equivalent (eg if you want to extract fields)
+It can also be done with {{{sed}}}, {{{awk}}}, etc.

{{{
-Line 50:
+Line 85:
-{{{egrep}}} is the oldest and most portable form of the {{{grep}}} command using [[RegularExpression|Extended Regular Expressions (EREs)]].  {{{grep -E}}} is required by POSIX.
+The `awk` approach has the advantage of letting you use `awk`'s other features on the matched lines, such as extracting only certain fields.

To match lines that do not contain "foo" AND do not contain "bar":

{{{
grep -E -v 'foo|bar'
# some people prefer egrep -v 'foo|bar'
}}}

=== foo AND bar in the same file, not necessarily on the same line ===
-Line 57:
+Line 101:
-Line 65:
+Line 108:
-It reads the file one time, stopping when both patterns have been matched.  No matter what happens, the END block is then executed, 
and the exit status is set accordingly.
+It reads the file one time, stopping when both patterns have been matched.  No matter what happens, the END block is then executed,  and the exit status is set accordingly.