How can I randomize (shuffle) the order of lines in a file? Or select a random line from a file, or select a random file from a directory?
To randomize the lines of a file, here is one approach: prefix each line with a random number, sort the resulting lines numerically, and then remove the numbers.

# Bash/Ksh
randomize() {
    while IFS='' read -r l; do printf '%d\t%s\n' "$RANDOM" "$l"; done |
    sort -n |
    cut -f2-
}
RANDOM is supported by BASH and KornShell, but is not defined by POSIX.
Here's the same idea (printing random numbers in front of a line, and sorting the lines on that column) using other programs:

# Bourne
awk '
    BEGIN { srand() }
    { print rand() "\t" $0 }
' |
sort -n |   # Sort numerically on first (random number) column
cut -f2-    # Remove sorting column
This is (possibly) faster than the previous solution, but will not work for very old AWK implementations (try nawk, or gawk, or /usr/xpg4/bin/awk if available). (Note that AWK uses the epoch time as a seed for srand(), which may or may not be random enough for you.)
Other non-portable utilities that can shuffle/randomize a file:
- GNU shuf (in recent enough GNU coreutils)
- GNU sort -R (coreutils 6.9)
For more details, please see their manuals.
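For instance (a usage sketch; exact behavior depends on your coreutils version):

# GNU coreutils only
shuf file      # Output the lines of file in random order.
sort -R file   # Likewise, but sorts by a random hash of each line,
               # so identical lines end up adjacent.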
Shuffling an array
A generalized version of this question might be, How can I shuffle the elements of an array? If we don't want to use the rather clumsy approach of sorting lines, this is actually more complex than it appears. A naive approach would give us badly biased results. A more complex (and correct) algorithm looks like this:
# Uses a global array variable. Must be compact (not a sparse array).
# Bash syntax.
shuffle() {
   local i tmp size max rand

   size=${#array[*]}
   for ((i=size-1; i>0; i--)); do
      # $RANDOM % (i+1) is biased because of the limited range of $RANDOM
      # Compensate by using a range which is a multiple of the rand modulus.

      max=$(( 32768 / (i+1) * (i+1) ))
      while (( (rand=$RANDOM) >= max )); do :; done
      rand=$(( rand % (i+1) ))
      tmp=${array[i]} array[i]=${array[rand]} array[rand]=$tmp
   done
}
This function shuffles the elements of an array in-place using the Knuth-Fisher-Yates shuffle algorithm.
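For example (a small usage sketch; shuffle operates on the global array variable):

# Usage sketch: shuffle the global 'array' in place, then print it.
array=(ox zebra cat buffalo giraffe salamander)
shuffle
printf '%s\n' "${array[@]}"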
If we just want the unbiased random number picking function, we can split that out separately:

# Returns random number from 0 to ($1-1) in global var 'r'.
# Bash syntax.
rand() {
    local max=$((32768 / $1 * $1))
    while (( (r=$RANDOM) >= max )); do :; done
    r=$(( r % $1 ))
}
This rand function is better than using $((RANDOM % n)). For simplicity, many of the remaining examples on this page may use the modulus approach. In all such cases, switching to the use of the rand function will give better results; this improvement is left as an exercise for the reader.
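For example, picking a random element from an array with rand instead of a modulus (a small usage sketch):

# Usage sketch: pick an unbiased random element of 'files'.
files=(*.ogg)
rand "${#files[@]}"   # Sets global 'r' to 0 .. n-1, without modulo bias.
echo "${files[r]}"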
Selecting a random line/file
Another question we frequently see is, How can I print a random line from a file?
There are two main approaches to this:
- Count the number of lines n, select a random integer r from 1 to n, and print line r.
- Read line by line, selecting lines with a varying probability as we go along.
With counting lines first
The simpler approach is to count lines first.
# Bash
n=$(wc -l <"$file")         # Count number of lines.
r=$((RANDOM % n + 1))       # Random number from 1..n (see warnings above!)
sed -n "$r{p;q;}" "$file"   # Print the r'th line.

# POSIX with (new) AWK
awk -v n="$(wc -l <"$file")" \
   'BEGIN{srand();l=int((rand()*n)+1)} NR==l{print;exit}' "$file"
(See FAQ 11 for more info about printing the r'th line.)
The next example sucks the entire file into memory. This approach saves time rereading the file, but obviously uses more memory. (Arguably: on systems with sufficient memory and an effective disk cache, you've read the file into memory by the earlier methods, unless there's insufficient memory to do so, in which case you shouldn't, QED.)

# Bash
unset lines n
while IFS= read -r 'lines[n++]'; do :; done < "$file"   # See FAQ 5
r=$((RANDOM % n))                                       # See warnings above!
echo "${lines[r]}"
Note that we don't add 1 to the random number in this example, because the array of lines is indexed counting from 0.
Also, some people want to choose a random file from a directory (for a signature on an e-mail, or to choose a random song to play, or a random image to display, etc.). A similar technique can be used:

# Bash
files=(*.ogg)                    # Or *.gif, or *
n=${#files[@]}                   # For readability
xmms -- "${files[RANDOM % n]}"   # Choose a random element
Without counting lines first
If you happen to have GNU shuf you can use that, but it is not portable.

# Example: 5 random lines from file
shuf -n 5 file
Without shuf, we have to write some code ourselves. If we want n random lines we need to:
1. accept the first n lines
2. accept each further line with probability n/nl, where nl is the number of lines read so far
3. if we accepted the line in step 2, replace a random one of the n lines we already have
# WARNING: srand() without an argument seeds using the current time accurate to the second.
# If run more than once in a single second on the clock you will get the same output.
# Find a better way to seed this.

n=$1
shift

awk -v n="$n" '
   BEGIN { srand() }
   NR <= n { lines[NR-1] = $0; next }
   rand() < n / NR { lines[int(rand() * n)] = $0 }
   END { for (k in lines) print lines[k] }
' "$@"
Bash and POSIX sh solutions forthcoming.
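In the meantime, here is a minimal Bash sketch of the single-line case, using the same idea: keep line i with probability 1/i. (It uses RANDOM % i, so the modulo-bias caveats above apply; substitute the rand function for unbiased picks.)

# Bash sketch: print one random line without counting lines first.
i=0 chosen=
while IFS= read -r line; do
    i=$((i + 1))
    (( RANDOM % i == 0 )) && chosen=$line   # True with probability 1/i.
done < "$file"
printf '%s\n' "$chosen"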
Known bugs
- http://lists.gnu.org/archive/html/bug-bash/2010-01/msg00042.html points out a surprising pitfall concerning the use of RANDOM without a leading $ in certain mathematical contexts. (Upshot: you should prefer n=$((...math...)); ((array[n]++)) over ((array[...math...]++)) in almost every case.)
- Behavior described appears reversed in current versions of mksh, ksh93, Bash, and Zsh. Still something to keep in mind for legacy. -ormaaj
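A minimal illustration of the two forms:

# Risky: the subscript may be expanded more than once during ++, so the
# element read and the element incremented can differ as RANDOM changes.
((array[RANDOM % 10]++))

# Preferred: evaluate the math once, then index with the stable result.
n=$((RANDOM % 10))
((array[n]++))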
Using external random data sources
Some people feel the shell's builtin RANDOM parameter is not sufficiently random for their applications. Typically this will be an interface to the C library's rand(3) function, although the Bash manual does not specify the implementation details. Some people feel their application requires cryptographically stronger random data, which would have to be supplied by some external source.
Before we explore this, we should point out that often people want to do this as a first step in writing some sort of random password generator. If that is your goal, you should at least consider using a password generator that has already been written, such as pwgen.
Now, if we're considering the use of external random data sources in a Bash script, we face several issues:
- The data source will probably not be portable. Thus, the script will only be usable in special environments.
- If we simply grab a byte (or a single group of bytes large enough to span the desired range) from the data source and do a modulus on it, we will run into the bias issue described earlier on this page. There is absolutely no point in using an expensive external data source if we're just going to bias the results with sloppy code! To work around that, we may need to grab bytes (or groups of bytes) repeatedly, until we get one that can be used without bias.
- Bash can't handle raw bytes very well, so each time we grab a byte (or group) we need to do something to it to turn it into a number that Bash can read. This may be an expensive operation. So, it may be more efficient to grab several bytes (or groups), and do the conversion to readable numbers, all at once.
- Depending on the data source, these random bytes may be precious, so grabbing a lot of them all at once and discarding ones we don't use might be more expensive (by whatever metric we're using to measure such costs) than grabbing one byte at a time, even counting the conversion step. This is something you'll have to decide for yourself, taking into account the needs of your application, and the nature of your data source.
At this point you should be seriously rethinking your decision to do this in Bash. Other languages already have features that take care of all these issues for you, and you may be much better off writing your application in one of those languages instead.
You're still here? OK. Let's suppose we're going to use the /dev/urandom device (found on most Linux and BSD systems) as an external random data source in Bash. This is a character device which produces raw bytes of "pretty random" data. First, we'll note that the script will only work on systems where this is present. In fact, you should add an explicit check for this device somewhere early in the script, and abort if it's not found.
Now, how can we turn these bytes of data into numbers for Bash? If we attempt to read a byte into a variable, a NUL byte would give us a variable which appears to be empty. However, since no other input gives us that result, this may be acceptable -- an empty variable means we tried to read a NUL. We can work with this. The good news is we won't have to fork an od(1) or any other external program to read bytes. Then, since we're reading one byte at a time, this also means we don't have to write any prefetching or buffering code to save forks.
One other gotcha, however: reading bytes only works in the C locale. If we try this in en_US.utf8 we get an empty variable for every byte from 128 to 255, which is clearly no good.
So, let's put this all together and see what we've got:
#!/usr/bin/env bash
# Requires Bash 3.1 or higher, and an OS with /dev/urandom (Linux, BSD, etc.)

export LANG=C
if [[ ! -e /dev/urandom ]]; then
    echo "No /dev/urandom on this system" >&2
    exit 1
fi

# Return an unbiased random number from 0 to ($1 - 1) in variable 'r'.
rand() {
    if (($1 > 256)); then
        echo "Argument larger than 256 currently unsupported" >&2
        r=-1
        return 1
    fi

    local max=$((256 / $1 * $1))
    while IFS= read -r -n1 -d '' r < /dev/urandom
          printf -v r %d "'$r"
          ((r >= max))
    do
        :
    done
    r=$((r % $1))
}
This uses a trick from FAQ 71 for converting bytes to numbers. When the variable populated by read is empty (because of a NUL byte), we get 0, which is just what we want.
Extending this to handle ranges larger than 0..255 is left as an exercise for the reader.
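One possible direction, sketched here for ranges up to 65536: combine two bytes into a 16-bit value and apply the same unbiasing loop. (The rand16 name and two-read structure are illustrative, not an established interface.)

# Sketch: unbiased random number from 0 to ($1 - 1), for $1 up to 65536,
# built from two /dev/urandom bytes. Returns the result in 'r'.
rand16() {
    local b0 b1 max=$((65536 / $1 * $1))
    while
        IFS= read -r -n1 -d '' b0 < /dev/urandom
        IFS= read -r -n1 -d '' b1 < /dev/urandom
        printf -v b0 %d "'$b0"   # An empty (NUL) byte becomes 0, as above.
        printf -v b1 %d "'$b1"
        r=$((b0 * 256 + b1))
        ((r >= max))
    do
        :
    done
    r=$((r % $1))
}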
Awk as a source of seeded pseudorandom numbers
Sometimes we don't actually want truly random numbers. In some applications, we want a reproducible stream of pseudorandom numbers. We achieve this by using a pseudorandom number generator (PRNG) and "seeding" it with a known value. The PRNG then produces the same stream of output numbers each time, for that seed.
Bash's RANDOM works this way (assigning to it seeds the PRNG that bash uses internally), but for this example, we're going to use awk instead. Awk's rand() function returns a floating point value, so we don't run into the biasing issue that we have with bash's RANDOM. Also, awk is much faster than bash, so really it's just the better choice.
For this example, we'll set up the awk as a background process using ProcessSubstitution. We will read from it by maintaining an open FileDescriptor connected to awk's standard output.
# Bash
list=(ox zebra cat buffalo giraffe salamander)
n=${#list[@]}

exec 3< <(
    awk -v seed=31337 -v n="$n" \
        'BEGIN {srand(seed); while (1) {print int(n*rand())}}'
)

# Print a "random" list element every second, using the background awk stream
# as a seeded PRNG.
while true; do
    read -r r <&3
    printf %s\\n "${list[r]}"
    sleep 1
done
Each time the program is run, the same results will be printed. Changing the seed value will change the results.
If you don't want the same results every time, you can change srand(seed); to srand(); in the awk program. Awk will then seed its PRNG using the current epoch timestamp.