Wednesday, October 1, 2008

Fat Tailed Distributions,
The Law of Large Numbers and
Black Swans


If you hated math class, this post might not be your cup of tea, but give it a chance and ask questions.

The previous post introduced mathematical concepts unfamiliar to most people, even to most economists and certainly to most politicians. I'll try to explain them here. In that post, I tried to explain how they relate to Peak Rent.

I'll illustrate "fat tailed distribution" with a concrete example. Below are two relatively simple discrete frequency distributions, one thin tailed and one fat tailed.



Thin tailed distribution
Value:      1    2    3     4     5     ...  N
Frequency:  1/2  1/4  1/8   1/16  1/32  ...  1/2^N
(one over two to the power of N)



Fat tailed distribution
Value:      1    2    3     4     5     ...  N
Frequency:  1/2  1/6  1/12  1/20  1/30  ...  1/(N*(N + 1))
(one over N times (N + 1))

You can verify by googling (or mathematically) that

1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64 + ... = 1

and

1/2 + 1/6 + 1/12 + 1/20 + 1/30 + 1/42 + ... = 1,

so both frequency distributions are properly normalized. In other words, I can approximate two populations with these distributions by filling two boxes with tokens, labeled 1, 2, 3 and so on, as follows.
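For readers who would rather check the two sums than google them, here is a minimal sketch using exact fractions. The function names are mine, not from the previous post. The partial sums come out as 1 - 1/2^N for the thin tailed series and N/(N + 1) for the fat tailed series (the second because 1/(N*(N + 1)) = 1/N - 1/(N + 1), so neighboring terms cancel), and both approach 1 as N grows.

```python
from fractions import Fraction


def thin_partial_sum(n_terms):
    """Exact sum of 1/2^k for k = 1..n_terms."""
    return sum(Fraction(1, 2**k) for k in range(1, n_terms + 1))


def fat_partial_sum(n_terms):
    """Exact sum of 1/(k*(k + 1)) for k = 1..n_terms."""
    return sum(Fraction(1, k * (k + 1)) for k in range(1, n_terms + 1))


# Print how far each partial sum still is from 1.
for n in (5, 10, 100, 1000):
    thin_gap = 1 - thin_partial_sum(n)
    fat_gap = 1 - fat_partial_sum(n)
    print(n, float(thin_gap), float(fat_gap))
```

The thin tailed gap shrinks by half with every term, while the fat tailed gap shrinks only like 1/(N + 1). That slow shrinkage is the fat tail.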

Each box contains roughly a million tokens. In the first box, half a million tokens are labeled 1, a quarter million are labeled 2, an eighth of a million are labeled 3 and so on. In the second box, half a million tokens are labeled 1, a sixth of a million are labeled 2, a twelfth of a million are labeled 3 and so on. I add tokens to each box this way until both boxes contain 999,000 tokens.

I truncate the distributions at 999,000 tokens per box, because I would otherwise need fractional tokens. For example, if the total number of tokens is a million, the number of tokens labeled 20 in the first distribution is less than one, because 2 to the 20th power is greater than a million.

If you randomly draw a hundred tokens from the first box and average the numbers labeling them, the average will be very nearly two. If you draw a thousand tokens and average the numbers, the average will almost surely be closer to two. This convergence on two is what Russ meant by "Law of Large Numbers" in the previous post. It is the form of the law that applies to sample averages, and it requires the distribution to have a finite mean.

If you randomly draw a hundred tokens from the second box and average the numbers on them, the average will be somewhere around 6.5, though it may well land far from it. If you draw a thousand tokens and average the numbers, the larger sample average is no more predictable than the smaller sample average. In other words, the sample averages do not settle on any expected value as the samples grow.
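You can try the drawing experiment yourself with a short simulation. This is only a sketch under my own assumptions: the first box is cut off at label 10 and the second at label 999 (roughly where 999,000 tokens run out), draws are made with replacement (harmless when the sample is much smaller than the box), and the helper names and the seed are arbitrary.

```python
import random


def thin_box(max_label=10):
    """Labels and relative frequencies 1/2^n for the first box."""
    labels = list(range(1, max_label + 1))
    weights = [1 / 2**n for n in labels]
    return labels, weights


def fat_box(max_label=999):
    """Labels and relative frequencies 1/(n*(n + 1)) for the second box."""
    labels = list(range(1, max_label + 1))
    weights = [1 / (n * (n + 1)) for n in labels]
    return labels, weights


def sample_average(box, sample_size, rng):
    """Draw sample_size tokens (with replacement) and average their labels."""
    labels, weights = box
    draws = rng.choices(labels, weights=weights, k=sample_size)
    return sum(draws) / sample_size


rng = random.Random(2008)  # arbitrary seed, only for repeatability
for name, box in (("thin", thin_box()), ("fat", fat_box())):
    for size in (100, 1000):
        averages = [sample_average(box, size, rng) for _ in range(5)]
        print(name, size, [round(a, 2) for a in averages])
```

Run it a few times with different seeds and compare how tightly the five averages cluster in each row.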

From both boxes, the proportion of tokens labeled 1 approaches 1/2 in samples of increasing size. This convergence of sample proportions is the Law of Large Numbers in its original form, due to Bernoulli, and it holds for both boxes.
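The convergence of the proportion is easy to watch in a simulation. The sketch below assumes the same cutoffs as the boxes described above (label 10 for the first box, label 999 for the second) and draws with replacement; the function name and seed are my own choices.

```python
import random


def proportion_of_ones(weights, sample_size, rng):
    """Fraction of a random sample whose label is 1."""
    labels = list(range(1, len(weights) + 1))
    draws = rng.choices(labels, weights=weights, k=sample_size)
    return draws.count(1) / sample_size


thin_weights = [1 / 2**n for n in range(1, 11)]            # labels 1..10
fat_weights = [1 / (n * (n + 1)) for n in range(1, 1000)]  # labels 1..999

rng = random.Random(2008)  # arbitrary seed
for size in (100, 1000, 10000, 100000):
    print(size,
          proportion_of_ones(thin_weights, size, rng),
          proportion_of_ones(fat_weights, size, rng))
```

Both columns should drift toward one half as the sample size grows, even though only the first box has well-behaved sample averages.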

Because we've truncated the distribution, very large sample averages from the second box (approaching a million in size) do approach 6.5, because the entire truncated distribution has this mean value; however, the untruncated distribution has no mean value.

If we fill a box with roughly a billion tokens and stop when we can no longer add whole tokens, the mean value of the entire box is larger than 6.5. If we use a trillion tokens, the mean value is larger still. If we keep filling larger and larger boxes this way, the mean value increases without limit.
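The growth of the mean with box size can be computed directly. In this sketch I assume that a box with a given token budget holds every label N for which the budget divided by N*(N + 1) is still at least one whole token, and I compute the mean of the fat tailed frequencies truncated at that label. The function names are hypothetical.

```python
import math


def highest_label(token_budget):
    """Largest N with N*(N + 1) <= token_budget (at least one whole token)."""
    n = math.isqrt(token_budget)
    if n * (n + 1) > token_budget:
        n -= 1
    return n


def truncated_mean(max_label):
    """Mean of the frequencies 1/(n*(n + 1)) restricted to n = 1..max_label."""
    # n * 1/(n*(n + 1)) simplifies to 1/(n + 1).
    weighted = math.fsum(1 / (n + 1) for n in range(1, max_label + 1))
    total = math.fsum(1 / (n * (n + 1)) for n in range(1, max_label + 1))
    return weighted / total


for budget in (10**6, 10**9, 10**12):
    top = highest_label(budget)
    print(budget, top, round(truncated_mean(top), 2))
```

The mean creeps upward roughly like the natural logarithm of the highest label, so it never levels off, but it grows slowly: squaring the box size only about doubles the mean.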

Why do large sample averages from the second box have no expected value (if the samples are much smaller than a million)? Consider how I've filled the boxes. Both boxes have a million tokens, but as I add tokens of increasing value, I fill the first box faster.

For example, I add a quarter million tokens labeled 2 to the first box but only a sixth of a million of these tokens to the second box. As a consequence, the highest label in the second box is much larger than the highest label in the first box.

In fact, the highest label in the first box is only 10, while the highest label in the second box is about 1000, a hundred times larger, and roughly 9000 tokens in the second box carry labels higher than 100.
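Here is one way to fill the two boxes and check these figures. It is a sketch under assumptions the post does not spell out: each label gets a million times its frequency, rounded down to whole tokens, and filling stops once the box holds 999,000 tokens or the next label would get no whole token. The names `fill_box`, `BUDGET` and `TARGET` are mine.

```python
BUDGET = 1_000_000  # nominal size of each box
TARGET = 999_000    # stop filling once the box holds this many tokens


def fill_box(denominator):
    """Return {label: token count}, where frequency = 1/denominator(label)."""
    counts = {}
    total = 0
    label = 1
    while total < TARGET:
        count = BUDGET // denominator(label)  # whole tokens only (assumption)
        if count == 0:
            break  # the next label would need a fractional token
        counts[label] = count
        total += count
        label += 1
    return counts


thin = fill_box(lambda n: 2**n)
fat = fill_box(lambda n: n * (n + 1))

print("highest label, first box: ", max(thin))
print("highest label, second box:", max(fat))
print("second box tokens above 100:",
      sum(count for label, count in fat.items() if label > 100))
```

Because of the rounding down, the count of tokens above 100 comes out somewhat under 9000 with this rule; a different rounding rule would shift it slightly.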

When you sample from the second box, you occasionally pick one of these rare tokens with a very large label, and these very large values contribute far more to the sample average than other values in the sample. Sample averages from the second box are less predictable for this reason.

The rare but significant tokens are Taleb's Black Swans. See the previous post.

In economic terms, if the tokens represent the wealth of persons, the richest person recorded in the second box is much richer than the richest person in the first box. In reality, wealth is distributed more like tokens in the second box.
