How can I get all lines that are in both of two files (set intersection), or in only one of two files (set subtraction)?

Use the comm(1) command:

# Bash
# Intersection of file1 and file2
# (i.e., only the lines that appear in both files)
comm -12 <(sort file1) <(sort file2)

# Subtraction of file1 from file2
# (i.e., only the lines unique to file2)
comm -13 <(sort file1) <(sort file2)

Read the comm man page for details. Those are process substitutions you see up there.
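
If your shell doesn't have process substitution, you can get the same results by sorting into temporary files first. A minimal POSIX sh sketch (the *.sorted file names are just illustrative):

# POSIX sh
sort file1 > file1.sorted
sort file2 > file2.sorted

# Intersection of file1 and file2
comm -12 file1.sorted file2.sorted

# Subtraction of file1 from file2
comm -13 file1.sorted file2.sorted

rm -f file1.sorted file2.sorted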

If for some reason you lack the core comm program, or seek alternatives, you can use these other methods. The grep (#1) or awk (#4) methods are faster than the above comm + sort (multiple calls to sort + pipes slow it down), but #1 and #4 don't scale as well to very large files since one of the data files is loaded into memory.

  1. An amazingly simple and fast implementation that took just 20 seconds to match a 30k-line file against a 400k-line file for me.
      # intersection of file1 and file2
      grep -xF -f file1 file2
    
      # subtraction of file1 from file2
      grep -vxF -f file1 file2
    • It has grep read one of the sets as a pattern list from a file (-f), interpret the patterns as plain strings rather than regexps (-F), and match only whole lines (-x).
    • Note that the file specified with -f will be loaded into memory, so it doesn't scale for very large files.
    • It should work with any POSIX grep; on older systems you may need to use fgrep rather than grep -F.

  2. An implementation using sort and uniq:
      # intersection of file1 and file2
      # (assumes neither file1 nor file2 contains duplicate lines)
      sort file1 file2 | uniq -d
    
      # subtraction: file1 - file2 (lines only in file1)
      sort file1 file2 file2 | uniq -u
    
      # for file2 - file1 (lines only in file2), change the last file2 to file1
      sort file1 file2 file1 | uniq -u
  3. Another implementation of subtraction (file1 - file2):
      # with file1 listed twice, lines only in file1 appear exactly twice
      sort file1 file1 file2 | uniq -c |
      awk '{ if ($1 == 2) { $1 = ""; print; } }'
    • This may introduce an extra space at the start of the line; if that's a problem, just strip it away.
    • Also, this approach assumes that neither file1 nor file2 has any duplicates in it.
    • Finally, it sorts the output for you. If that's a problem, then you'll have to abandon this approach altogether. Perhaps you could use awk's associative arrays (or perl's hashes or tcl's arrays) instead.
  4. These are subtraction and intersection with awk, regardless of whether the input files are sorted or contain duplicates:
      # prints lines that are in file1 but not in file2; reverse the arguments for the other direction
      awk 'NR==FNR{a[$0];next} !($0 in a)' file2 file1
    
      # prints lines that are in both files; order of arguments is not important
      awk 'NR==FNR{a[$0];next} $0 in a' file1 file2

    For an explanation of how these work, see http://awk.freeshell.org/ComparingTwoFiles.
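
To sanity-check these methods against each other, here is a tiny worked example (the sample data is invented purely for illustration):

# Bash
# Create two small sample files (illustrative data)
printf '%s\n' apple banana cherry > file1
printf '%s\n' banana cherry date > file2

# Intersection: each of these prints "banana" and "cherry"
comm -12 <(sort file1) <(sort file2)
grep -xF -f file1 file2
awk 'NR==FNR{a[$0];next} $0 in a' file1 file2

# Subtraction of file1 from file2: each of these prints "date"
comm -13 <(sort file1) <(sort file2)
grep -vxF -f file1 file2
awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2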

If the lines of your files contain extra rubbish data, and you only want to compare part of each line from file 1 vs. part of each line from file 2, see FAQ 116.

See also: http://www.pixelbeat.org/cmdline.html#sets


CategoryShell
