Diff for "BashFAQ/026"

Differences between revisions 5 and 23 (spanning 18 versions)

the_order_of_lines_in_a_file.3F__.28Or_select_a_random_line_from_a_file.2C_or_select_a_random_file_from_a_directory..29">How can I randomize (shuffle) the order of lines in a file? (Or select a random line from a file, or select a random file from a directory.)

class="line874">To randomize the lines of a file, here is one approach. This one involves generating a random number, which is prefixed to each line; then sorting the resulting lines, and removing the numbers.

    # POSIX(?) randomize() { while IFS='' read -r l ; do printf "0$RANDOM\t%s\n" "$l"; done | sort -n | cut -f2- }

Note: the leading 0 is to make sure it doesn't break if the shell doesn't support $RANDOM, which is supported by BASH, KornShell, and POSIX shell, but not BourneShell. Of course, if your shell doesn't have $RANDOM, this won't shuffle the lines very well.

I can't find RANDOM in SUS. I don't think RANDOM is POSIX -- pgas

Here's the same idea (printing random numbers in front of a line, and sorting the lines on that column) using other programs:

    # Bourne awk ' BEGIN { srand() } { print rand() "\t" $0 } ' | sort -n |    # Sort numerically on first (random number) column cut -f2-     # Remove sorting column

This is (possibly) faster than the previous solution, but will not work for very old AWK implementations (try "nawk", or "gawk", if available). The advantage of this one is that it doesn't require $RANDOM in your shell; that's outsourced to awk instead.

A generalized version of this question might be, How can I shuffle the elements of an array? If we don't want to use the rather clumsy approach of sorting lines, this is actually more complex than it appears. A naive approach would give us badly biased results. A more complex (and correct) algorithm looks like this:

    # Uses a global array variable.  Must be non-sparse. # Bash syntax. shuffle() { local i tmp size max rand # $RANDOM % (i+1) is biased because of the limited range of $RANDOM # Compensate by using a range which is a multiple of the array size. size=${#array[*]} max=$(( 32768 / size * size )) > for ((i=size-1; i>0; i--)); do while (( (rand=$RANDOM) >= max )); do :; done rand=$(( rand % (i+1) )) tmp=${array[i]} array[i]=${array[rand]} array[rand]=$tmp done }

This function shuffles the elements of an array in-place using the Knuth-Fisher-Yates shuffle algorithm.

Another question we frequently see is, How can I print a random line from a file? The problem here is that you need to know in advance how many lines the file contains. Lacking that knowledge, you have to read the entire file through once just to count them -- or, you have to suck the entire file into memory. Let's explore both of these approaches.

   # POSIX n=$(wc -l < "$file")        # Count number of lines. r=$(($RANDOM % n + 1))      # Random number from 1..n. (See below) sed -n "$r{p;q;}" "$file"   # Print the r'th line.

(These examples use the answer from FAQ 11 to print the n'th line.) The first one's pretty straightforward -- we use wc to count the lines, choose a random number, and then use sed to print the line. If we already happened to know how many lines were in the file, we could skip the wc command, and this would be a pretty efficient approach.

The next example sucks the entire file into memory. This approach saves time reopening the file, but obviously uses more memory. (Arguably: on systems with sufficient memory and an effective disk cache, you've read the file into memory by the earlier methods, unless there's insufficient memory to do so, in which case you shouldn't, QED.)

   # Bash oIFS=$IFS IFS=$'\n' lines=($(<"$file")) IFS=$oIFS n=${#lines[@]} r=$((RANDOM % n))   # see below echo "${lines[r]}"

Note that we don't add 1 to the random number in this example, because the array of lines is indexed counting from 0.

Also, some people want to choose a random file from a directory (for a signature on an e-mail, or to choose a random song to play, or a random image to display, etc.). A similar technique can be used:

    # Bash files=(*.ogg)                  # Or *.gif, or * n=${#files[@]}                 # For aesthetics xmms -- "${files[RANDOM % n]}" # Choose a random element

Note that these last few examples use a simple modulus of the RANDOM variable, so the results are biased. If this is a problem for your application, then use the anti-biasing technique from the Knuth-Fisher-Yates example, above.

... or just use shuf (man shuf).

No man page for shuf on HP-UX 10.20, OpenBSD 4.0, or Debian unstable. apt-cache show shuf gives nothing. Searching for shuf in the http://freshmeat.net/ search box gives no results. Do you have a pointer to where this thing comes from?
- On Debian 4.0, shuf is in the science/biosquid package
  shuf is a part of GNU Coreutils
  - Not in GNU coreutils 5.97, which is the newest available in Debian unstable as of 2007-06-20.
    - gnu.org clearly shows shuf in their Coreutils package. If only Debian would update their packages once a century.
The real point of all this is that shuf is not portable.

Speaking of GNU coreutils, as of version 6.9 GNU sort has the -R (aka --random-sort) flag. Oddly enough, it only works for the generic locale:

     LC_ALL=C sort -R file     # output the lines in file in random order LC_ALL=POSIX sort -R file # output the lines in file in random order LC_ALL=en_US sort -R file # effectively ignores the -R option

For more details, see info coreutils sort or an equivalent manual.

-  ⇤ ← Revision 5 as of 2007-06-20 03:54:42 → 
  Size: 3241
  Editor: wsip-70-169-139-34
  Comment:
+   ← Revision 23 as of 2009-10-10 10:00:23 → ⇥
  Size: 5963
  Editor: ppp089210047129
  Comment: a note about RANDOM. is it really posix ? pgas
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-[[Anchor(faq26)]]
+<<Anchor(faq26)>>
 Line 3:
+To randomize the lines of a file, here is one approach.  This one involves generating a random number, which is prefixed to each line; then sorting the resulting lines, and removing the numbers.
-Line 5:
+Line 6:
-    randomize(){
        while read l ; do echo "0$RANDOM $l" ; done |
+    # POSIX(?)
    randomize() {
        while IFS='' read -r l ; do printf "0$RANDOM\t%s\n" "$l"; done |
-Line 8:
+Line 10:
-        cut -d" " -f2-
+        cut -f2-
-Line 12:
+Line 14:
-Note: the leading 0 is to make sure it doesnt break if the shell doesnt support $RANDOM, which is supported by ["BASH"], KornShell, KornShell93 and ["POSIX"] shell, but not BourneShell.
+Note: the leading 0 is to make sure it doesn't break if the shell doesn't support $RANDOM, which is supported by [[BASH]], KornShell, and [[POSIX]] shell, but not BourneShell.  Of course, if your shell doesn't have $RANDOM, this won't shuffle the lines very well.
-Line 14:
+Line 16:
-The same idea (printing random numbers in front of a line, and sorting the lines on that column) using other programs:
+  I can't find RANDOM in SUS. I don't think RANDOM is POSIX -- pgas


Here's the same idea (printing random numbers in front of a line, and sorting the lines on that column) using other programs:
-Line 16:
+Line 21:
+    # Bourne
-Line 24:
+Line 30:
-This is faster than the previous solution, but will not work for very old AWK implementations (try "nawk", or "gawk", if available).
+This is (possibly) faster than the previous solution, but will not work for very old [[AWK]] implementations (try "nawk", or "gawk", if available).  The advantage of this one is that it doesn't require $RANDOM in your shell; that's outsourced to awk instead.
-Line 26:
+Line 32:
-A related question we frequently see is, "How can I print a random line from a file?"  The problem here is that you need to know in advance how many lines the file contains.  Lacking that knowledge, you have to read the entire file through once just to count them -- or, you have to suck the entire file into memory.  Let's explore both of these approaches.
+A generalized version of this question might be, ''How can I shuffle the elements of an array?''  If we don't want to use the rather clumsy approach of sorting lines, this is actually more complex than it appears.  A naive approach would give us [[http://www.codinghorror.com/blog/archives/001015.html|badly biased results]].  A more complex (and correct) algorithm looks like this:
-Line 29:
+Line 35:
+    # Uses a global array variable.  Must be non-sparse.
    # Bash syntax.
    shuffle() {
       local i tmp size max rand

       # $RANDOM % (i+1) is biased because of the limited range of $RANDOM
       # Compensate by using a range which is a multiple of the array size.
       size=${#array[*]}
       max=$(( 32768 / size * size ))

       for ((i=size-1; i>0; i--)); do
          while (( (rand=$RANDOM) >= max )); do :; done
          rand=$(( rand % (i+1) ))
          tmp=${array[i]} array[i]=${array[rand]} array[rand]=$tmp
       done
    }
}}}

This function shuffles the elements of an [[BashFAQ/005|array]] in-place using the [[http://en.wikipedia.org/wiki/Knuth_shuffle|Knuth-Fisher-Yates shuffle algorithm]].

Another question we frequently see is, ''How can I print a random line from a file?''  The problem here is that you need to know in advance how many lines the file contains.  Lacking that knowledge, you have to read the entire file through once just to count them -- or, you have to suck the entire file into memory.  Let's explore both of these approaches.

{{{
   # POSIX
-Line 30:
+Line 60:
-   r=$((RANDOM % n + 1))       # Random number from 1..n.
+   r=$(($RANDOM % n + 1))      # Random number from 1..n. (See below)
-Line 34:
+Line 64:
-(These examples use the answer from [#faq11 FAQ 11] to print the n'th line.)  The first one's pretty straightforward -- we use {{{wc}}} to count the lines, choose a random number, and then use {{{sed}}} to print the line.  If we already happened to know how many lines were in the file, we could skip the {{{wc}}} command, and this would be a very efficient approach.
+(These examples use the answer from [[BashFAQ/011|FAQ 11]] to print the n'th line.)  The first one's pretty straightforward -- we use {{{wc}}} to count the lines, choose a random number, and then use {{{sed}}} to print the line.  If we already happened to know how many lines were in the file, we could skip the {{{wc}}} command, and this would be a pretty efficient approach.
-Line 36:
+Line 66:
-The next example sucks the entire file into memory.  This approach saves time reopening the file, but obviously uses more memory.  (Arguably:  on systems with sufficient memory and an effective disk cache, you've read the file into memory by the earlier methods, unless there's insufficient memory to do so, in which case you shouldn't, QED)
+The next example sucks the entire file into memory.  This approach saves time reopening the file, but obviously uses more memory.  (Arguably: on systems with sufficient memory and an effective disk cache, you've read the file into memory by the earlier methods, unless there's insufficient memory to do so, in which case you shouldn't, QED.)
-Line 39:
+Line 69:
+   # Bash
-Line 41:
+Line 72:
-   r=$((RANDOM % n))
+   r=$((RANDOM % n))   # see below
-Line 47:
+Line 78:
-Also, some people want to choose a random file from a directory (for a signature on an e-mail, or to chose a random song to play, or a random image to display, etc.).  A similar technique can be used:
+Also, some people want to choose a random file from a directory (for a signature on an e-mail, or to choose a random song to play, or a random image to display, etc.).  A similar technique can be used:
-Line 50:
+Line 81:
-    files=(*.ogg)               # Or *.gif, or *
    n=${#files[@]}              # For aesthetics
    xmms "${files[RANDOM % n]}" # Choose a random element
+    # Bash
    files=(*.ogg)                  # Or *.gif, or *
    n=${#files[@]}                 # For aesthetics
    xmms -- "${files[RANDOM % n]}" # Choose a random element
-Line 55:
+Line 87:
-...or just use '''shuf''' (man shuf).
+Note that these last few examples use a simple modulus of the RANDOM variable, so the results are biased.  If this is a problem for your application, then use the anti-biasing technique from the Knuth-Fisher-Yates example, above.

... or just use {{{shuf}}} (man shuf).
-Line 59:
+Line 93:
- ''On Debian 4.0, '''shuf''' is in the science/biosquid package''
+  ''On Debian 4.0, '''shuf''' is in the science/biosquid package''
-Line 61:
+Line 95:
- ''shuf is a part of GNU Coreutils''
+  ''shuf is a part of GNU Coreutils''

   ''Not in GNU coreutils 5.97, which is the newest available in Debian unstable as of 2007-06-20.''

    ''gnu.org clearly shows shuf in their Coreutils package. If only Debian would update their packages once a century.''

 ''The real point of all this is that `shuf` is not portable.''

Speaking of GNU coreutils, as of version 6.9 GNU sort has the -R (aka --random-sort) flag. Oddly enough, it only works for the generic locale:
{{{
     LC_ALL=C sort -R file     # output the lines in file in random order
     LC_ALL=POSIX sort -R file # output the lines in file in random order
     LC_ALL=en_US sort -R file # effectively ignores the -R option
}}}

For more details, see `info coreutils sort` or an equivalent manual.