Differences between revisions 11 and 12
Revision 11 as of 2008-02-05 16:58:48
Size: 5499
Editor: GreyCat
Comment: shuffling an array
Revision 12 as of 2008-02-06 12:07:00
Size: 5749
Editor: adsl-71-138-131-18
Comment:
Line 31:
-    # Uses a global array variable. The array must be non-sparse.
+    # Uses function arguments as array elements
Line 34:
-        local i n tmp
-        for ((i=${#array[*]}-1; i>0; i--)); do
-            n=$(( RANDOM % (i+1) ))
-            tmp=${array[i]} array[i]=${array[n]} array[n]=$tmp
-        done
+       local i n tmp max rand
+       local -a array=(${@})
+       arr_size=${#array[*]}
+       max=$(( 32768 / arr_size * arr_size ))
+
+       for ((i=${arr_size}-1; i>0; i--)); do
+          rand=$RANDOM
+          while (( rand >= max )); do
+             rand=$RANDOM
+          done
+          n=$(( rand % (i+1) ))
+          tmp=${array[i]} array[i]=${array[n]} array[n]=$tmp
+       done
+       echo "${array[@]}"


How can I randomize (shuffle) the order of lines in a file? (Or select a random line from a file, or select a random file from a directory.)

To randomize the lines of a file, here is one approach. This one involves generating a random number, which is prefixed to each line; then sorting the resulting lines, and removing the numbers.

    randomize(){
        while IFS= read -r l ; do printf '%s\n' "0$RANDOM $l" ; done |
        sort -n |
        cut -d" " -f2-
    }

Note: the leading 0 keeps the sort key numeric even if the shell doesn't support $RANDOM (the variable then expands to nothing and every line gets the key 0). $RANDOM is supported by ["BASH"], KornShell and KornShell93, but it is not required by ["POSIX"] and is missing from BourneShell. Of course, if your shell doesn't have $RANDOM, this won't shuffle the lines at all.

Here's the same idea (printing random numbers in front of a line, and sorting the lines on that column) using other programs:

    awk '
        BEGIN { srand() }
        { printf "%.10f\t%s\n", rand(), $0 }   # Fixed-point key; sort -n cannot read 1e-05
    ' |
    sort -n |    # Sort numerically on first (random number) column
    cut -f2-     # Remove sorting column

This is (possibly) faster than the previous solution, but will not work for very old [:AWK:] implementations (try "nawk", or "gawk", if available). The advantage of this one is that it doesn't require $RANDOM in your shell; that's outsourced to awk instead.
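
One caveat with the awk version: srand() with no argument seeds from the time of day, so two runs within the same second typically produce the same order. A minimal sketch that passes a seed in from the shell instead. The function name shuffle_lines is made up for this example, and it assumes a shell with $RANDOM and an awk that accepts -v:

```shell
# shuffle_lines: hypothetical helper; reads stdin, writes the lines shuffled.
shuffle_lines() {
    # Combine two $RANDOM values and the PID so back-to-back runs differ.
    awk -v seed="$(( (RANDOM << 15 | RANDOM) ^ $$ ))" '
        BEGIN { srand(seed) }
        { printf "%.10f\t%s\n", rand(), $0 }   # fixed-point key, safe for sort -n
    ' |
    sort -n |
    cut -f2-
}
```

Usage would be something like shuffle_lines < file. The seed is only as good as $RANDOM, so this is no more random than the pure-shell version; it just avoids the one-second granularity.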

A generalized version of that question might be, How can I shuffle the elements of an array? If we don't want to use the rather clumsy approach of sorting lines, this is actually more complex than it appears. A naive approach would give us [http://www.codinghorror.com/blog/archives/001015.html badly biased results]. A more complex (and correct) algorithm looks like this:

    # Uses function arguments as array elements
    # Bash syntax.
    shuffle() {
       local i n tmp max rand arr_size
       local -a array=("$@")    # Quoted, so elements are not word-split or globbed
       arr_size=${#array[@]}

       for ((i=arr_size-1; i>0; i--)); do
          # $RANDOM yields 0..32767. Reject draws at or above the largest
          # multiple of (i+1) so that rand % (i+1) is not biased.
          max=$(( 32768 / (i+1) * (i+1) ))
          rand=$RANDOM
          while (( rand >= max )); do
             rand=$RANDOM
          done
          n=$(( rand % (i+1) ))
          tmp=${array[i]} array[i]=${array[n]} array[n]=$tmp
       done
       echo "${array[@]}"
    }

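This function takes the elements of an [:BashFAQ#faq5:array] as arguments and prints them in shuffled order, using the [http://en.wikipedia.org/wiki/Knuth_shuffle Knuth-Fisher-Yates shuffle algorithm]. Call it as shuffle "${array[@]}". Because the result is printed as a single space-separated line, elements that themselves contain whitespace cannot be told apart in the output.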
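
The rejection loop is the part that removes the bias. $RANDOM has 32768 possible values (0..32767); unless the modulus divides 32768 evenly, a plain $RANDOM % m makes the low remainders slightly more likely. A sketch of that one step in isolation, using a hypothetical helper name rand_below and assuming 1 <= m <= 32768:

```shell
# rand_below M: print a uniformly distributed integer in 0..M-1.
# Hypothetical helper, not part of the FAQ's shuffle(); assumes 1 <= M <= 32768.
rand_below() {
    local m=$1 max rand
    # Largest multiple of m that fits in the 32768 possible $RANDOM values.
    # Example: m=3 gives max=32766, so draws of 32766 and 32767 are thrown away.
    max=$(( 32768 / m * m ))
    rand=$RANDOM
    while (( rand >= max )); do
        rand=$RANDOM
    done
    echo $(( rand % m ))
}
```

For m=3 the remaining 32766 values split into exactly 10922 for each remainder, which is why the result is uniform.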

Another question we frequently see is, How can I print a random line from a file? The problem here is that you need to know in advance how many lines the file contains. Lacking that knowledge, you have to read the entire file through once just to count them -- or, you have to suck the entire file into memory. Let's explore both of these approaches.

   n=$(wc -l < "$file")        # Count number of lines.
   r=$((RANDOM % n + 1))       # Random number from 1..n.
   sed -n "$r{p;q;}" "$file"   # Print the r'th line.

(These examples use the answer from [:BashFAQ#faq11:FAQ 11] to print the n'th line.) The first one's pretty straightforward -- we use wc to count the lines, choose a random number, and then use sed to print the line. If we already happened to know how many lines were in the file, we could skip the wc command, and this would be a very efficient approach. Keep in mind that $RANDOM only ranges from 0 to 32767, so this cannot reach lines beyond the 32768th, and the choice is slightly biased whenever n does not divide 32768 evenly.
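
A hedged sketch that wraps the wc/sed approach in a function, guards against an empty file (where the modulus would be zero), and combines two $RANDOM values into a 30-bit number so that long files are covered. The name random_line is made up for this example, and a small modulo bias remains:

```shell
# random_line FILE: print one randomly chosen line of FILE.
# Hypothetical wrapper around the wc/sed approach; Bash syntax.
random_line() {
    local file=$1 n r
    n=$(wc -l < "$file") || return 1
    (( n > 0 )) || return 1             # Empty file: nothing to pick.
    # Two 15-bit $RANDOM values give 0..1073741823, enough for large files.
    r=$(( (RANDOM << 15 | RANDOM) % n + 1 ))
    sed -n "$r{p;q;}" "$file"
}
```

The file is still read twice, once by wc and once by sed, exactly as in the original snippet.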

The next example sucks the entire file into memory. This approach saves time reopening the file, but obviously uses more memory. (Arguably, on a system with enough memory and an effective disk cache, the earlier method has already pulled the file into memory; and where memory is too tight for that, you shouldn't be loading the whole file into an array either.)

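   # Split on newlines only. Blank lines are dropped, and glob characters in
   # the file are subject to pathname expansion unless you "set -f" first.
   oIFS=$IFS IFS=$'\n' lines=($(<"$file")) IFS=$oIFS
   n=${#lines[@]}
   r=$((RANDOM % n))
   echo "${lines[r]}"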

Note that we don't add 1 to the random number in this example, because the array of lines is indexed counting from 0.
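
On Bash 4.0 or later (newer than the shells this answer was originally written for), the mapfile builtin can read a file into an array without touching IFS. A sketch, with random_line_mem as a hypothetical function name; like the example it mirrors, it can only reach the first 32768 lines because of $RANDOM:

```shell
# random_line_mem FILE: load FILE into an array, print one random element.
# Requires Bash 4.0+ for mapfile; the function name is illustrative only.
random_line_mem() {
    local -a lines
    local n
    mapfile -t lines < "$1" || return 1   # -t strips the trailing newlines
    n=${#lines[@]}
    (( n > 0 )) || return 1               # Empty file: nothing to pick.
    printf '%s\n' "${lines[RANDOM % n]}"  # Indexed from 0, so no +1.
}
```

Unlike word splitting on IFS, mapfile stores every line as read, so empty lines count as candidates and nothing is glob-expanded.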

Also, some people want to choose a random file from a directory (for a signature on an e-mail, or to choose a random song to play, or a random image to display, etc.). A similar technique can be used:

    files=(*.ogg)               # Or *.gif, or *
    n=${#files[@]}              # For aesthetics
    xmms "${files[RANDOM % n]}" # Choose a random element
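
If the directory has no matching files, the glob in the snippet above stays unexpanded and the player is handed the literal string *.ogg. A sketch that guards against this with Bash's nullglob option. The helper random_file and its arguments are hypothetical, and printing the name stands in for the xmms call:

```shell
# random_file DIR EXT: print a randomly chosen path in DIR ending in .EXT.
# Hypothetical helper; Bash syntax.
random_file() {
    local dir=$1 ext=$2 n
    local -a files
    shopt -s nullglob                # Unmatched globs expand to nothing.
    files=("$dir"/*."$ext")
    shopt -u nullglob                # Note: turned off even if it was on before.
    n=${#files[@]}
    (( n > 0 )) || return 1          # No matching files.
    printf '%s\n' "${files[RANDOM % n]}"
}
```

It could then be used as song=$(random_file . ogg) && xmms "$song".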

... or just use shuf (man shuf).

  • No man page for shuf on HP-UX 10.20, OpenBSD 4.0, or Debian unstable. apt-cache show shuf gives nothing. Searching for shuf in the http://freshmeat.net/ search box gives no results. Do you have a pointer to where this thing comes from?

    • On Debian 4.0, shuf is in the science/biosquid package

      shuf is a part of GNU Coreutils

      • Not in GNU coreutils 5.97, which is the newest available in Debian unstable as of 2007-06-20.

        • gnu.org clearly shows shuf in their Coreutils package. If only Debian would update their packages once a century.
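
For reference, on systems where GNU shuf is installed (a coreutils release newer than the 5.97 mentioned above), the tasks in this FAQ look roughly like this. The wrapper names are made up for this sketch:

```shell
# Sketch assuming GNU shuf is installed; function names are hypothetical.
shuffle_file() { shuf -- "$1"; }          # All lines of a file, in random order.
pick_line()    { shuf -n 1 -- "$1"; }     # One random line of a file.
pick_arg()     { shuf -n 1 -e -- "$@"; }  # One random argument, e.g. pick_arg *.ogg
```

With -e, shuf treats each argument as an input line, so file names containing newlines would not survive pick_arg intact.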

Speaking of GNU coreutils, as of version 6.9 GNU sort has the -R (aka --random-sort) flag. Oddly enough, it only works for the generic locale:

     LC_ALL=C sort -R file     # output the lines in file in random order
     LC_ALL=POSIX sort -R file # output the lines in file in random order
     LC_ALL=en_US sort -R file # effectively ignores the -R option

You can choose where sort gets its random bytes with the --random-source flag, which expects a file to read entropy from.

     export LC_ALL=C
     # Keep in mind that seeding a random number generator with another RNG
     # only "lends" the original seed's entropy to the new RNG. sort -R will
     # not be "more random" than /dev/urandom!
     sort --random-source=/dev/urandom -R file
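
One behaviour worth knowing before relying on sort -R as a shuffler: it orders lines by a random hash of the sort key, so identical lines always come out next to each other. A small demonstration, assuming a GNU sort that supports -R (demo_sort_R is just a name for this example):

```shell
# Identical lines hash to the same value, so sort -R keeps them adjacent.
demo_sort_R() {
    printf '%s\n' a b a b a | LC_ALL=C sort -R
}
# Output, one per line: either a a a b b or b b a a a, never interleaved.
```

If duplicate lines must be scattered independently, one of the other approaches in this answer is a better fit.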

BashFAQ/026 (last edited 2022-01-30 23:49:34 by emanuele6)