Randomness
Randomness is actually a very complicated mathematical concept. We have to get philosophical as to just what it means, and then get philosophical about the implications of whatever definition we decide on. In the real world, we've used randomness to make decisions, for encryption, and for security. But if you've got any good reason upon which to base a decision, a coin toss or cutting cards is a philosophically lousy way to make it. In his book The Luck of the Draw: The Role of Lotteries in Decision-Making (ISBN 978-0-19-975610-0), Peter Stone sets up a philosophical argument for a "just lottery rule" that has four criteria.
- The lottery prize in question can't be subdivided without compromising or destroying its value. Think about a lottery to pick a heart transplant winner.
- The lottery prize is homogeneous. Otherwise, there would be no reason to prefer one unit over another or to desire more than one prize. Very few people want to win two heart transplants.
- The claims against the lottery prize are also homogeneous. All lottery tickets are equal.
- Claims against the prize might vary in strength, and the lottery will handle the claims in order of greatest strength. An older patient might be able to lay a stronger claim than some kid in his 20s. This means we take those young guys out of the lottery pool, but then we treat all those older people as equal.
Benford's law is a heuristic about the frequency distribution of leading digits in many real-life sets of numerical data. It has been observed in mathematical tables, and it even played a part in tracking down financial fraud in the movie "The Accountant". The heuristic states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. For example, in sets that obey the law, "1" appears as the leading significant digit about 30% of the time, while "9" appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time.
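Those percentages come from a simple formula: a leading significant digit d occurs with probability log10(1 + 1/d). Here is a quick C sketch that reproduces the figures quoted above:

#include <stdio.h>
#include <math.h>

/* Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9 */
int main(void)
{
    for (int d = 1; d <= 9; d++)
        printf("digit %d: %4.1f%%\n", d, 100.0 * log10(1.0 + 1.0 / d));
    return 0;
}

Running it shows "1" at 30.1%, "9" at 4.6%, and everything else in between.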
Random Numbers
It's possible to get true random numbers into your database, but you need some physical source out in the real world that finds these random numbers for you. The simplest source is a book of random numbers. The most famous is the RAND Corporation's "A Million Random Digits with 100,000 Normal Deviates", and you can still get a copy today or download a PDF. The main use of the tables was in statistics and in the design of scientific experiments, especially those that used the Monte Carlo method; in cryptography, they have also been used as "nothing up my sleeve numbers", for example in the design of the Khafre cipher.
The book was one of the last of a series of random number tables produced from the mid-1920s to the 1950s, after which the development of high-speed computers allowed faster operation through the generation of pseudo-random numbers rather than reading them from tables. The book was reissued in 2001 (ISBN 0-8330-3047-7) with a new foreword by RAND Executive Vice President Michael D. Rich.
The now-defunct Digital Equipment Corporation (DEC) was very popular in the scientific community in its day. The PDP-11 minicomputers had an optional circuit card that featured a speck of radioactive polonium that could generate random numbers. The only real problem with it was that it had to carry the "black and yellow radiation propeller" symbol, and that made people nervous.
Silicon Graphics created another hardware random number generator, Lavarand, which was more “user-friendly” than the radiation symbol. It worked by taking pictures of the patterns made by the floating material in lava lamps, extracting random data from the pictures, and using the result to seed a pseudo-random number generator. You can find an image of the device and have flashbacks to your hippie days.
Pseudo-random Numbers
"Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.” --John Von Neumann (1951).
There's an old Dilbert cartoon in which our hero visits the basement where the accounting trolls live. One of the little horned monsters is sitting at a desk, repeatedly going "9, 9, 9, 9…" The head troll informs Dilbert that the troll is their random number generator. Dilbert responds that this doesn't sound very random to him. The head troll's response in the last panel of the strip is simply, "Prove it."
That Dilbert cartoon gets a laugh, but it also asks an excellent question: how do you know that something is random? The formal test is that all single digits are equally represented in your population and distributed evenly. First, you check the single digits, then check that all pairs of digits are equally represented, then triples, quadruples, and quintuples. This takes a bit of time.
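Here is a minimal C sketch of the first two of those checks (single digits, then pairs), assuming the stream has already been captured as an array of digits:

#include <stdio.h>

/* Count single-digit and digit-pair frequencies in a digit stream.
   In a truly random stream, both sets of counts should be roughly flat. */
void frequency_check(const int digits[], int n)
{
    int singles[10] = {0};
    int pairs[10][10] = {{0}};

    for (int i = 0; i < n; i++) {
        singles[digits[i]]++;
        if (i + 1 < n)
            pairs[digits[i]][digits[i + 1]]++;
    }
    for (int d = 0; d < 10; d++)
        printf("digit %d: %d (expected about %d)\n", d, singles[d], n / 10);
    /* the pairs table would be compared against n/100 per cell,
       and so on for triples, quadruples, and quintuples */
}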
But you can't trust a simple human inspection of your random number sets, either. If you ask people to generate a series of random digits, they stink at it. Real random digit sequences have a lot of repetition in them, which people instinctively avoid. Besides avoiding repetition, people have biases; Asians will tend to repeat the digit 3 and Westerners the digit 7 more than would occur in a truly random stream.
Another consideration is whether selection from the domain is done with or without replacement. A pair of dice in a craps game allows the same total to be thrown on each toss. A deck of cards does not allow two players to hold the same card in their hand (well, unless you ever had the fun of playing Poker with a Pinochle deck).
Making random numbers is important in computer science. Graphics and cryptography, in particular, rely on randomness, or at least randomness that's good enough to fool users, hackers, and gamers. The random number generators implemented in virtually all programming languages produce only pseudo-randomness, which is too pseudo for cryptography; cryptographic-quality randomness comes at the expense of an intensive CPU workout. This problem also holds for SQL engines that have a RANDOM() function of some kind.
At a conference in 1949, John von Neumann presented an algorithm for random number generation called the method of middle-square. Intended for use with ENIAC, one of the earliest electronic computers, middle-square can be done by hand. Allegedly, it was invented by a Franciscan friar circa 1245 CE. The algorithm is pretty simple:
- Pick a seed number. You can use a timestamp or other internal numbers from the computer hardware. Say you grab four digits from someplace; this will be the length (n) of the pseudo-random numbers.
- Square the four-digit number to get a result of up to eight digits (the length is 2n). If the square has fewer than eight digits, add zeros to the front as padding.
- Extract the middle four digits and return them as the result; keep this new number as the seed for the next step. In this four-digit example, take the eight-digit intermediate result and peel off the high-order two digits and the low-order two digits.
- Repeat the process. And expect it to fail. It has a short period, which means that the generated sequence will repeat fairly quickly. Perhaps even worse, the sequence converges to zero or settles down toward a constant number, which is where the random sequence dies. You can try using larger and larger seeds, but convergence is an inherent problem.
Here is an example in T-SQL to demonstrate:
DECLARE @List TABLE (Number INTEGER);
DECLARE @number INTEGER = 1234; -- the seed
DECLARE @strNumber VARCHAR(8);

-- keep generating until a value repeats, which is where the sequence dies
WHILE NOT EXISTS (SELECT * FROM @List WHERE Number = @number)
BEGIN
INSERT INTO @List (Number) VALUES (@number);
SET @strNumber = CAST(@number * @number AS VARCHAR(8)); -- square the seed
SET @strNumber = SUBSTRING(RIGHT('00000000' + @strNumber, 8), 3, 4); -- pad to eight digits, keep the middle four
SET @number = CAST(@strNumber AS INTEGER); -- the new seed
END;

SELECT * FROM @List;
Random Program Execution
When I teach SQL, my students usually start off with a procedural language of some kind. Their model of a computer program is that the source code goes to a compiler, and the compiler eventually generates some kind of executable machine code or interpretable code for a virtual machine. They have lived in a world in which the same source code will always generate the same executable (assuming you're using the same compiler on the same machine).
The idea that the same SQL query could generate a different execution plan each time you run it through the SQL engine is really weird to them. A new index, a change in the statistics, or a mix of other users in other sessions hitting the same database can radically change how a query is executed.
An even "weirder" model of computing is due to Edsger Dijkstra. This model is called "guarded commands" and you can read the original papers in the Dijkstra archives at the University of Texas. The idea is that each statement in the language has a guard, shown by an arrow. The guard is a predicate which, if TRUE, allows the statement to be executed. If it's FALSE, the statement cannot be executed. Assignment is shown with the := symbol, and it works the way the SET clause works in SQL. That means the list of variables on the left side receives the corresponding expression values on the right side all at once.
The two most important control statements in the language are selection (a version of the if-then-else construct in other languages) and repetition (a looping construct). Here is the basic syntax for each:
if G0 → S0 | G1 → S1 | ... | Gn → Sn fi
Upon execution of a selection, all guards are evaluated. If none of the guards evaluates to TRUE, then execution of the selection aborts; otherwise, one of the guards that has the value TRUE is chosen non-deterministically and the corresponding statement is executed.
The other control flow construct is a repetition, and it looks like this:
do G0 → S0 | G1 → S1 | ... | Gn → Sn od
Upon execution of a repetition, all guards are evaluated. If all guards evaluate to FALSE, then SKIP (a no-op command) is executed. Otherwise, one of the guards that has the value TRUE is chosen non-deterministically and the corresponding statement is executed, after which the repetition is executed again.
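As a quick illustration, here is the classic guarded-command program from Dijkstra's papers that sets m to the maximum of x and y. When x equals y, both guards are TRUE and either assignment may be chosen non-deterministically, but the result is the same either way:

if x >= y → m := x | y >= x → m := y fi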
This language was created for formal correctness proofs of programs, but it’s also suitable for modeling circuits that can behave nondeterministically. Guarded commands are used within the Promela programming language, but I don’t know if anyone actually uses them for databases.
Reference: Dijkstra, Edsger W. "EWD472: Guarded Commands, Non-determinacy and Formal Derivation of Programs" (PDF), E. W. Dijkstra Archive, University of Texas at Austin.
Shuffling
In many applications, we do not want to issue the sequence numbers in sequence. This pattern can give information that we do not wish to expose. Instead, we are looking for a random permutation of sequence numbers in this set. Do not get mixed up; we want known values that are supplied in random order and not random numbers. Most random number generators can repeat values, which would defeat the purpose of this drill.
Tree-structured indexes, such as a B-Tree that have sequential insertions, become unbalanced and have to be re-organized frequently. However, if the same set of keys is presented in a random order, the tree tends to stay balanced and you get much better performance.
The generator explained here is an implementation of the additive congruential method of generating values in pseudo-random order and is due to Roy Hann of Rational Commerce Limited, a CA-Ingres consulting firm. It is based on a shift-register and an XOR-gate, and it has its origins in cryptography.
While there are other ways to do this, this code is nice because:
- The algorithm can be written in C or another low-level language as bit manipulations for speed. But the math is fairly simple even in base ten.
- The algorithm tends to generate successive values that are (usually) "far apart", which is handy for improving the performance of tree indexes. You will tend to put data on separate physical data pages in storage.
- The algorithm does not cycle until it has generated every possible value, so you don't have to worry about duplicates. Just count how many calls have been made to the generator.
- The algorithm produces uniformly distributed values, which is a nice mathematical property to have. It also does not include zero.
Let's walk through all the iterations of the 4-bit Generator. Initially, the shift register contains the value 0001. The two rightmost bits are XOR-ed together, giving 1, and the result is fed into the leftmost bit position. The previous register contents shift one bit right. The iterations of the register are shown in this table, with their base ten values:
iteration 1: 0001 (1)
iteration 2: 1000 (8)
iteration 3: 0100 (4)
iteration 4: 0010 (2)
iteration 5: 1001 (9)
iteration 6: 1100 (12)
iteration 7: 0110 (6)
iteration 8: 1011 (11)
iteration 9: 0101 (5)
iteration 10: 1010 (10)
iteration 11: 1101 (13)
iteration 12: 1110 (14)
iteration 13: 1111 (15)
iteration 14: 0111 (7)
iteration 15: 0011 (3)
iteration 16: 0001 (1) wrap-around!
It might not be obvious that successive values are far apart when looking at a tiny 4-bit register. The values are generated in no apparent order; all possible values except zero are eventually produced, and the termination condition is clear -- the Generator cycles back to one or whatever the starting value was.
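Here is a minimal C sketch of the 4-bit Generator that reproduces the table above; the two rightmost bits are XOR-ed and the result is fed into bit 3 as the register shifts right:

#include <stdio.h>

/* 4-bit shift register: taps at bits 0 and 1, feedback into bit 3 */
int main(void)
{
    int n = 1; /* initial register contents: 0001 */
    for (int i = 1; i <= 16; i++) {
        printf("iteration %d: %d\n", i, n);
        n = (n >> 1) | (((n ^ (n >> 1)) & 1) << 3);
    }
    return 0;
}

Note that the taps here are the two rightmost bits; as the next paragraph explains, the correct tap positions change with the word size, which is why the 31-bit version below taps bits 0 and 3 instead.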
Generalizing the algorithm to arbitrary binary word sizes, and therefore longer number sequences, is not as easy as you might think. Finding the "tap" positions where bits are extracted for feedback varies according to the word-size in an extremely non-obvious way. Choosing incorrect tap positions results in an incomplete and usually very short cycle that is unusable. If you want the details and tap positions for words of one to 100 bits, see E. J. Watson, "Primitive Polynomials (Mod 2)", Mathematics of Computation, v.16, 1962, p.368-369.
Here is a 31-bit version of the Generator, done as a single SQL update with taps at bits 0 and 3. The MOD() function and the ^ exponentiation operator are dialect-specific; 2^30 is just the constant 1073741824:

UPDATE Generator31
   SET seed = seed/2 + MOD(MOD(seed, 2) + MOD(seed/8, 2), 2) * 2^30;
Or if you prefer the algorithm in C:
int Generator31(void)
{
    static int n = 1; /* the 31-bit register, seeded with 1 */
    n = (n >> 1) | (((n ^ (n >> 3)) & 1) << 30); /* taps at bits 0 and 3 */
    return n;
}
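And a hypothetical driver, just to show usage; because the taps at bits 0 and 3 correspond to a primitive polynomial for a 31-bit register, the function steps through all 2^31 - 1 nonzero values before it wraps back to its starting point:

#include <stdio.h>

int Generator31(void); /* as defined above */

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("%d\n", Generator31());
    return 0;
}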
Deck of Playing Cards
If I were to ask you to write a program to shuffle a deck of cards, the most obvious way of doing this would probably be to model the deck as an array loaded with the numbers one through 52, then loop through the array and swap each element with a random element anywhere in the array. The question is: are all possible permutations equally likely?
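Here is a minimal C sketch of that naive shuffle, using the standard library rand() purely for illustration (rand() % n has a slight modulo bias of its own, but that is a separate issue from the one discussed here):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DECK_SIZE 52

/* naive shuffle: swap every position with a random position
   anywhere in the array, including itself */
void naive_shuffle(int deck[], int n)
{
    for (int i = 0; i < n; i++) {
        int j = rand() % n; /* any position in the whole array */
        int tmp = deck[i];
        deck[i] = deck[j];
        deck[j] = tmp;
    }
}

int main(void)
{
    int deck[DECK_SIZE];
    srand((unsigned)time(NULL));
    for (int i = 0; i < DECK_SIZE; i++)
        deck[i] = i + 1; /* load 1 through 52 */
    naive_shuffle(deck, DECK_SIZE);
    for (int i = 0; i < DECK_SIZE; i++)
        printf("%d ", deck[i]);
    printf("\n");
    return 0;
}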
One thing you’ll notice immediately is that any array element can be swapped more than once. The card that was in the first position of the array will probably be placed somewhere further in the deck, and when that position is reached, it’ll be swapped again. This “over shuffling” is the cause of a problem; some permutations are more likely than others. To demonstrate what happens, consider a simple deck of three cards, {A, B, C}.
The first iteration swaps A with another element at random, i.e. either A, B or C, so the three equally likely permutations after the first iteration are:
swap with A {A, B, C} no change!
swap with B {B, A, C}
swap with C {C, B, A}
The second iteration starts with one of the three permutations generated at the first iteration and the possible swaps are:
Second position with first position gives one of:
{A, B, C} → {B, A, C}
{B, A, C} → {A, B, C}
{C, B, A} → {B, C, A}
Second position with second position gives one of:
{A, B, C} → {A, B, C} no change
{B, A, C} → {B, A, C} no change
{C, B, A} → {C, B, A} no change
And finally second position with third position gives one of:
{A, B, C} → {A, C, B}
{B, A, C} → {B, C, A}
{C, B, A} → {C, A, B}
So now we have nine equally likely permutations after the second shuffle. That is:
{A, B, C}
{A, B, C}
{A, C, B}
{B, A, C}
{B, A, C}
{B, C, A}
{B, C, A}
{C, A, B}
{C, B, A}
The final step is to swap the final element with a randomly chosen position, and this has three possibilities.
The third card is moved to the first position:
{A, B, C} → {C, B, A}
{A, B, C} → {C, B, A}
{A, C, B} → {B, C, A}
{B, A, C} → {C, A, B}
{B, A, C} → {C, A, B}
{B, C, A} → {A, C, B}
{B, C, A} → {A, C, B}
{C, A, B} → {B, A, C}
{C, B, A} → {A, B, C}
The third element is moved to the second:
{B, A, C} → {B, C, A}
{A, B, C} → {A, C, B}
{B, C, A} → {B, A, C}
{A, B, C} → {A, C, B}
{B, A, C} → {B, C, A}
{C, B, A} → {C, A, B}
{A, C, B} → {A, B, C}
{B, C, A} → {B, A, C}
{C, A, B} → {C, B, A}
And finally, the third element is moved to the third position, which effectively changes nothing, and you could drop this step from the loop to improve performance. This gives a list of the 27 possible permutations, which should all be equally likely. But let's count them:
{A, B, C} occurs 4 times
{A, C, B} occurs 5 times
{B, A, C} occurs 5 times
{B, C, A} occurs 5 times
{C, A, B} occurs 4 times
{C, B, A} occurs 4 times
If the shuffle were random, then each three-letter order would occur the same number of times, but you can see that {A, C, B}, {B, A, C} and {B, C, A} are produced more often. If you add more cards to the deck, the discrepancy in frequencies increases. The naive shuffle produces 3^3 = 27 equally likely outcomes, but for each of the (3!) = 6 permutations to be equally frequent, the number of outcomes would have to be a multiple of (n!), and 27 is not a multiple of 6.
Knuth-Fisher-Yates Shuffle
The name Donald Knuth should be known to anyone who works with computers. In the old days, his multi-volume magnum opus, The Art of Computer Programming, was the basic text for all graduate computer science courses. The Knuth-Fisher-Yates shuffle avoids the problem we just saw by making sure that each card is only considered for a random swap once. In other words, as the array is scanned, the card under consideration is only swapped with cards that haven't been considered so far. If you write out the steps of this algorithm, you discover that on the first iteration you get the same result as the earlier shuffle:
swap with A {A, B, C} no change
swap with B {B, A, C}
swap with C {C, B, A}
But on the second iteration, the second element can only swap with itself or the third element:
{A, B, C} → {A, B, C}
{B, A, C} → {B, A, C}
{C, B, A} → {C, B, A}
or
{A, B, C} → {A, C, B}
{B, A, C} → {B, C, A}
{C, B, A} → {C, A, B}
And at the third iteration, the final card can only swap with itself, so there is no change and you can skip the step. This gives just one of each of the six possible permutations of three things, so it is a fair shuffle.
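Here is a minimal C sketch of the Knuth-Fisher-Yates shuffle; the only change from the naive version is that position i is swapped with a position chosen from i through n-1, never with one already passed:

#include <stdlib.h>

/* Knuth-Fisher-Yates: each card is considered for a swap exactly once */
void kfy_shuffle(int deck[], int n)
{
    for (int i = 0; i < n - 1; i++) { /* the last card is skipped, as noted above */
        int j = i + rand() % (n - i); /* j is in [i, n-1] */
        int tmp = deck[i];
        deck[i] = deck[j];
        deck[j] = tmp;
    }
}

As the next paragraph explains, rand() itself is not good enough when all (52!) permutations of a real deck have to be reachable.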
Your first impulse is to think that the more you shuffle, the more random the results. But we just showed that this is not true. If you really need to shuffle a deck of 52 cards so that every one of the (52!) permutations of the deck occurs equally often, the solution is to use a cryptographic-quality random number generator with a large period; a generator driven by a 32-bit seed can reach at most 2^32 of the roughly 8 × 10^67 possible orderings.
Conclusion
If you feel a little confused about randomness after all of this, don't feel bad. Many decades ago, when mainframes roamed the earth, IBM supplied a random number generator (the infamous RANDU) in the FORTRAN library for its System/360 computers. It was the standard tool used in universities at the time. And it was wrong. The result was that an awful lot of graduate thesis work had to be redone. If a major computer company can have problems like that, maybe you shouldn't feel so bad.