Differences between revisions 27 and 34 (spanning 7 versions)
Revision 27 as of 2010-01-11 22:12:19
Size: 5247
Editor: GreyCat
Comment: link to interesting mailing list post
Revision 34 as of 2012-11-27 14:25:08
Size: 5465
Editor: geirha
Comment: move expansion outside printf's format string
Deletions are marked with a leading -. Additions are marked with a leading +.
Line 8: Line 8:
-        while IFS='' read -r l ; do printf "$RANDOM\t%s\n" "$l"; done |
+        while IFS='' read -r l ; do printf '%d\t%s\n' "$RANDOM" "$l"; done |
Line 32: Line 32:
-    # Uses a global array variable. Must be non-sparse.
+    # Uses a global array variable. Must be compact (not a sparse array).
Line 58: Line 58:
-   r=$(($RANDOM % n + 1)) # Random number from 1..n. (See below)
+   r=$((RANDOM % n + 1))  # Random number from 1..n. (See below)
Line 62: Line 62:
-   awk -v n="$(wc -l<"$file")" 'BEGIN{srand();l=int((rand()*n)+1)} NR==l{print;exit}'
+   awk -v n="$(wc -l<"$file")" 'BEGIN{srand();l=int((rand()*n)+1)} NR==l{print;exit}' "$file"
Line 71: Line 71:
-   oIFS=$IFS IFS=$'\n' lines=($(<"$file")) IFS=$oIFS
+   unset lines i
+   while IFS= read -r 'lines[i++]'; do :; done < "$file" # See FAQ 5
Line 103: Line 104:
- [[http://lists.gnu.org/archive/html/bug-bash/2010-01/msg00042.html]] points out a surprising pitfall concerning the use of `RANDOM` without a leading `$` in certain mathematical contexts. (Upshot: you should prefer `n=$((...math...)); ((array[n]++))` over `((array[...math...]++))` in almost every case.)
+  > --([[http://lists.gnu.org/archive/html/bug-bash/2010-01/msg00042.html]] points out a surprising pitfall concerning the use of `RANDOM` without a leading `$` in certain mathematical contexts. (Upshot: you should prefer `n=$((...math...)); ((array[n]++))` over `((array[...math...]++))` in almost every case.))--

Behavior described appears reversed in current versions of mksh, ksh93, Bash, and Zsh. Still something to keep in mind for legacy. -ormaaj

How can I randomize (shuffle) the order of lines in a file? (Or select a random line from a file, or select a random file from a directory.)

One approach to randomizing the lines of a file is to prefix each line with a random number, sort the resulting lines on that number, and then remove the numbers.

    #bash
    randomize() {
        while IFS='' read -r l ; do printf '%d\t%s\n' "$RANDOM" "$l"; done |
        sort -n |
        cut -f2-
    }

RANDOM is supported by Bash and KornShell, but is not defined by POSIX.

Here's the same idea (printing random numbers in front of a line, and sorting the lines on that column) using other programs:

    # Bourne
    awk '
        BEGIN { srand() }
        { print rand() "\t" $0 }
    ' |
    sort -n |    # Sort numerically on first (random number) column
    cut -f2-     # Remove sorting column

This is (possibly) faster than the previous solution, but will not work with very old AWK implementations (try "nawk", "gawk", or /usr/xpg4/bin/awk if available). Note that awk seeds srand() with the time of day when it is called without an argument, which might not be random enough for you: two runs within the same second will typically produce the same order.
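
If the time-based seed is a concern, one option is to hand srand() an explicit seed. The following is only a sketch: it assumes /dev/urandom exists (POSIX does not require it), and the function name shuffle_lines is made up for illustration.

```shell
# POSIX sh plus awk; assumes /dev/urandom is available.
shuffle_lines() {
    # Read 4 random bytes and format them as one unsigned decimal integer.
    seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
    awk -v seed="$seed" '
        BEGIN { srand(seed) }
        { print rand() "\t" $0 }
    ' |
    sort -n |    # Sort numerically on first (random number) column
    cut -f2-     # Remove sorting column
}
```

It reads standard input, so a call would look like `shuffle_lines < "$file"`.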

A generalized version of this question might be, How can I shuffle the elements of an array? If we don't want to use the rather clumsy approach of sorting lines, this is actually more complex than it appears. A naive approach would give us badly biased results. A more complex (and correct) algorithm looks like this:

    # Uses a global array variable.  Must be compact (not a sparse array).
    # Bash syntax.
    shuffle() {
       local i tmp size max rand

       size=${#array[*]}

       for ((i=size-1; i>0; i--)); do
          # $RANDOM % (i+1) is biased because of the limited range of $RANDOM.
          # Compensate by using a range which is a multiple of the modulus (i+1).
          max=$(( 32768 / (i+1) * (i+1) ))
          while (( (rand=$RANDOM) >= max )); do :; done
          rand=$(( rand % (i+1) ))
          tmp=${array[i]} array[i]=${array[rand]} array[rand]=$tmp
       done
    }

This function shuffles the elements of an array in-place using the Knuth-Fisher-Yates shuffle algorithm.
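
To see why a naive shuffle is biased, consider a three-element array in which every position is swapped with an index drawn from the whole array. There are 3^3 = 27 equally likely sequences of draws but only 3! = 6 permutations, and 27 is not divisible by 6, so some orderings must come up more often than others. The sketch below enumerates all 27 cases instead of using RANDOM; the function name naive_outcomes is illustrative only.

```shell
# Bash
# Enumerate every possible run of the naive "swap position i with any
# index" shuffle on a 3-element array, and tally the resulting orders.
naive_outcomes() {
    local a b c i r tmp arr picks
    for a in 0 1 2; do
        for b in 0 1 2; do
            for c in 0 1 2; do
                arr=(1 2 3)
                picks=("$a" "$b" "$c")
                for i in 0 1 2; do
                    r=${picks[i]}
                    tmp=${arr[i]} arr[i]=${arr[r]} arr[r]=$tmp
                done
                printf '%s%s%s\n' "${arr[@]}"
            done
        done
    done | sort | uniq -c    # One output line per ordering, prefixed by its count
}
```

The Knuth-Fisher-Yates loop avoids this because at step i it draws from only i+1 positions, which gives exactly n! equally likely sequences of draws.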

Another question we frequently see is, How can I print a random line from a file? The problem here is that you need to know in advance how many lines the file contains. Lacking that knowledge, you have to read the entire file through once just to count them -- or, you have to suck the entire file into memory. Let's explore both of these approaches.

   # Bash
   n=$(wc -l < "$file")        # Count number of lines.
   r=$((RANDOM % n + 1))       # Random number from 1..n. (See below)
   sed -n "$r{p;q;}" "$file"   # Print the r'th line.

   # POSIX, with awk
   awk -v n="$(wc -l<"$file")" 'BEGIN{srand();l=int((rand()*n)+1)} NR==l{print;exit}' "$file"

(See this FAQ for more information about printing the n'th line.)

The next example reads the entire file into memory. This approach saves the time spent reopening the file, but obviously uses more memory. (Arguably, on systems with sufficient memory and an effective disk cache, the earlier methods have already pulled the file into memory anyway; and if there is not enough memory for that, you should not be loading the whole file in the first place.)

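   # Bash
   unset lines i
   while IFS= read -r 'lines[i++]'; do :; done < "$file"   # See FAQ 5
   [[ ${lines[i-1]} ]] || unset 'lines[--i]'   # Drop the empty element left by the final failed read
   n=${#lines[@]}
   r=$((RANDOM % n))   # see below
   printf '%s\n' "${lines[r]}"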

Note that we don't add 1 to the random number in this example, because the array of lines is indexed counting from 0.
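
On Bash 4 or later, the mapfile builtin (also spelled readarray) can load the lines in one step. A minimal sketch, assuming a Bash new enough to provide mapfile; the function name random_line_mem is invented for this example.

```shell
# Bash 4+
# Print one random line of the file named by $1; return 1 if it is empty.
random_line_mem() {
    local file=$1 lines n
    mapfile -t lines < "$file"             # One array element per line, newline stripped
    n=${#lines[@]}
    (( n > 0 )) || return 1                # Empty file: nothing to pick
    printf '%s\n' "${lines[RANDOM % n]}"   # Simple modulus, so slightly biased
}
```

A call such as `random_line_mem "$file"` prints the chosen line on standard output.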

Also, some people want to choose a random file from a directory (for a signature on an e-mail, or to choose a random song to play, or a random image to display, etc.). A similar technique can be used:

    # Bash
    files=(*.ogg)                  # Or *.gif, or *
    n=${#files[@]}                 # For aesthetics
    xmms -- "${files[RANDOM % n]}" # Choose a random element
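
If the glob matches nothing, the array holds the literal pattern and n is 1, so the player is handed a name that does not exist. One hedged way to guard against that is to check the first element before picking; the function name random_file and its two arguments are made up for this sketch.

```shell
# Bash
# Print one randomly chosen pathname in directory $1 whose name ends in $2
# (for example ".ogg"). Returns 1 if nothing matches.
random_file() {
    local dir=$1 suffix=$2 files n
    files=("$dir"/*"$suffix")
    # An unmatched glob stays as the literal pattern, so check that the
    # first element really exists (or is a dangling symlink).
    [[ -e ${files[0]} || -L ${files[0]} ]] || return 1
    n=${#files[@]}
    printf '%s\n' "${files[RANDOM % n]}"   # Simple modulus, so slightly biased
}
```

Usage might look like `f=$(random_file . .ogg) && xmms -- "$f"`. Note that command substitution strips trailing newlines, which matters only for unusual file names.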

Note that these last few examples use a simple modulus of the RANDOM variable, so the results are biased. If this is a problem for your application, then use the anti-biasing technique from the Knuth-Fisher-Yates example, above.
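
To put a number on the bias: RANDOM yields 32768 distinct values (0..32767), and 32768 = 3276 * 10 + 8, so with n=10 the results 0..7 each have 3277 ways to occur while 8 and 9 have only 3276. The anti-biasing technique discards draws that fall in that incomplete final block. A minimal sketch, assuming 1 <= n <= 32768; rand_below is a hypothetical helper name, not something Bash provides.

```shell
# Bash
# Set the global variable r to an unbiased random integer in 0..n-1.
# Assumes 1 <= n <= 32768 (RANDOM only yields 0..32767).
rand_below() {
    local n=$1 max
    max=$(( 32768 / n * n ))                      # Largest multiple of n not above 32768
    while (( (r = RANDOM) >= max )); do :; done   # Redraw values past the last full block
    r=$(( r % n ))
}
```

The result is returned in a variable rather than printed, so that the function is not run in a command substitution subshell. For example: `rand_below "${#files[@]}"; xmms -- "${files[r]}"`.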

Other non-portable utilities:

  • GNU Coreutils shuf (in recent enough coreutils)

  • GNU sort -R

Speaking of GNU coreutils, as of version 6.9 GNU sort has the -R (aka --random-sort) flag. Oddly enough, it only works for the generic locale:

     LC_ALL=C sort -R file     # output the lines in file in random order
     LC_ALL=POSIX sort -R file # output the lines in file in random order
     LC_ALL=en_US sort -R file # effectively ignores the -R option

For more details, see info coreutils sort or an equivalent manual.
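
Where GNU shuf is installed, each task in this FAQ becomes a one-liner. The sketch below assumes a coreutils build that ships shuf, and uses only its -n (limit the number of output lines) and -e (treat the arguments as input lines) options; the wrapper function names are invented for illustration.

```shell
# Requires GNU coreutils shuf (not portable).
shuffle_file() { shuf -- "$1"; }           # All lines of a file, in random order
random_line()  { shuf -n 1 -- "$1"; }      # One random line of a file
random_arg()   { shuf -n 1 -e -- "$@"; }   # One random argument, e.g. random_arg *.ogg
```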



BashFAQ/026 (last edited 2022-01-30 23:49:34 by emanuele6)