Statistics

Mean \(E(X)\) may not capture a problem's concern  #

Flip a fair coin 100 times—it gives a sequence of heads (H) and tails (T). For each HH in the sequence of flips, Alice gets a point; for each HT, Bob does, so e.g. for the sequence THHHT Alice gets 2 points and Bob gets 1 point. Who is most likely to win? -- Daniel Litt

Answer: Bob wins

An intuitive explanation is that HH's score distribution is more skewed: runs of consecutive heads can produce very high HH counts, giving the distribution a long right tail and therefore more probability mass in the region x < mean. Since HH and HT have the same mean, it follows that HH < HT happens more often than HH > HT.

Let n be the sequence length; the means of Alice's and Bob's distributions are the same, (n-1)/4. The diagram below shows the distributions of HH, HT, and HH - HT:

As we increase n, the number of possible sequences grows exponentially, so we also experiment with a sampling (Monte Carlo) method. As shown in the diagrams below for n = 32, the sampling method matches the exact enumeration closely:
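A minimal sketch of such a sampling experiment (the function and parameter names are ours, not from the original write-up):

    import numpy as np

    def simulate_hh_ht(n=32, trials=100_000, seed=0):
        """Monte Carlo estimate of P(HH < HT), P(HH == HT), P(HH > HT)
        for sequences of n fair coin flips (1 = heads, 0 = tails)."""
        rng = np.random.default_rng(seed)
        flips = rng.integers(0, 2, size=(trials, n))
        heads = flips[:, :-1] == 1                        # flip i is H
        hh = np.sum(heads & (flips[:, 1:] == 1), axis=1)  # count of HH pairs
        ht = np.sum(heads & (flips[:, 1:] == 0), axis=1)  # count of HT pairs
        return {"hh < ht": np.mean(hh < ht),
                "hh == ht": np.mean(hh == ht),
                "hh > ht": np.mean(hh > ht)}

    print(simulate_hh_ht(n=32))   # compare with the exact results tabulated below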

Having confirmed that sampling works, we plot the case for n = 1024.

It seems that the Central Limit Theorem is in effect here. Moreover, HT's edge over HH diminishes as the sequence length increases:

| | n = 16 | n = 32 | n = 1024 |
| --- | --- | --- | --- |
| HH < HT | 0.464 | 0.475 | 0.495 |
| HH == HT | 0.143 | 0.100 | 0.018 |
| HH > HT | 0.393 | 0.425 | 0.487 |

Mean Deviation is more useful than Standard Deviation  #

The classical argument that the Standard Deviation \(\sqrt{\frac{1}{n-1}\sum(x-\overline{x})^{2}}\) has an advantage over the Mean Absolute Deviation \(\frac{1}{n}\sum\left| x - \overline{x} \right|\) is a fragile one. It works only for the perfect Normal distribution and collapses at the slightest occurrence of fat tails. In mathematical terms, the argument states that the statistical efficiency \(\frac{Var(X)}{E(X)^{2}}\) of the Standard Deviation is better than that of the Mean Deviation:

$$\displaylines{ \begin{aligned} \text{Asymptotic Relative Efficiency (ARE)} =\space & \underset{n\to \infty }{lim}\left( \frac{\frac{Var(Std)}{E(Std)^{2}}}{\frac{Var(Mad)}{E(Mad)^{2}}} \right) \\ =\space & \underset{n\to \infty }{lim}\frac{\frac{n{\Gamma(\frac{n}{2})}^{2}}{2{\Gamma(\frac{n+1}{2})}^{2}}-1}{\frac{\pi-2}{2n}} \\ =\space & \frac{1}{\pi-2} \simeq 0.875 \end{aligned} }$$

However, this result quickly becomes invalidated when considering the simplest model for volatility, which is a mixing model with an occasional jump with probability \(p\):

$$\displaylines{ Var(x)=\begin{cases} \sigma^{2}(1+a) & \text{with probability $p$} \\ \sigma^{2} & \text{with probability $1-p$} \end{cases} }$$

The below diagram shows simulation results with \(p=0.01\) and \(n=10000\):

We see that while \(RE=0.878\) for \(a=0\), it climbs to \(RE=1.434\) for \(a=2\), at which point the Standard Deviation's supposed advantage has already turned into a disadvantage. In other words, even a minute presence of outliers makes MAD more efficient than STD.
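A minimal simulation sketch of this comparison, under the mixture model above with \(p=0.01\) and \(n=10000\) (the number of trials and the function names are our own choices):

    import numpy as np

    def relative_efficiency(a, p=0.01, n=10_000, trials=1_000, seed=0):
        """Estimate RE = (Var(STD)/E(STD)^2) / (Var(MAD)/E(MAD)^2) by simulation,
        under the jump model: variance (1+a) with probability p, 1 otherwise."""
        rng = np.random.default_rng(seed)
        scale = np.where(rng.random((trials, n)) < p, np.sqrt(1 + a), 1.0)
        x = rng.normal(0.0, 1.0, (trials, n)) * scale
        std = x.std(axis=1, ddof=1)                                   # one STD estimate per trial
        mad = np.abs(x - x.mean(axis=1, keepdims=True)).mean(axis=1)  # one MAD estimate per trial
        return (std.var() / std.mean() ** 2) / (mad.var() / mad.mean() ** 2)

    for a in (0, 2):
        print(a, relative_efficiency(a))   # ~0.88 for a = 0, well above 1 for a = 2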

Moreover, many statistical phenomena and processes have "infinite variance" (such as the popular Pareto 80/20 rule) but have finite, and sometimes very well behaved, mean deviations. Whenever the mean exists, MAD exists. The reverse (finite STD with infinite MAD) is never true.

In summary, we should retire the notion of Standard Deviation and replace it with the more effective one of Mean Deviation. Standard deviation, STD, should be left to mathematicians, physicists and mathematical statisticians deriving limit theorems. There is no scientific reason to use it in statistical investigations in the age of the computer.

The Central Limit Theorem does not apply to many real world problems with finite sample sizes  #

The Generalized Central Limit Theorem states that the sum of random variables \(S_{n} = X_{1}+X_{2}+\cdots +X_{n}\) converges to \(n\mu + n^{\frac{1}{\alpha}}\text{Stable}(\alpha)\), where \(\text{Stable}(\alpha)\) is the Stable distribution parameterized by the tail index \(\alpha\). Thus when \(\alpha > 1\), \(\frac{S_{n}}{n} - \mu \sim n^{\frac{1}{\alpha}-1}\text{Stable}(\alpha) \to 0\) as \(n\to \infty\), and we recover the Law of Large Numbers. The below graph (Fig 8.1[1a]) illustrates this phenomenon, assuming the linear trend \(\mu\) has been subtracted from each \(X_{i}\), leaving the driftless random walk \(S_{n}\):

A consequence of this is that for all powers \(p=1,2,3,\cdots\), if \(E(X^{p}) < \infty\) then (10.2.6[1b]) $$\displaylines{ \underset{n\to \infty}{lim}\frac{\text{max}(X_{1}^{p},X_{2}^{p},\cdots ,X_{n}^{p})}{X_{1}^{p}+X_{2}^{p}+\cdots +X_{n}^{p}} = 0 }$$ For example, in the case of \(p=4\), the existence of the 4th moment should prevent a single data point from dominating the calculation of the kurtosis as \(n\) increases. This behaviour is observed for the Gaussian distribution, but not for the Taiwan1 or U.S. stock markets, as shown in the below Max-to-Sum plot (MS plot):
| dataset | size | \(\lvert X\rvert^{4}\) Max-to-Sum |
| --- | --- | --- |
| Taiwan stock index (TAIEX) | 864 (58 years) | 0.25 |
| S&P 500 | 964 (69 years) | 0.46 |
| Gaussian with same σ as TAIEX | 864 | 0.04 |

Data below is from Table 7.2, Taleb[1c]:

| dataset | size | \(\lvert X\rvert^{4}\) Max-to-Sum |
| --- | --- | --- |
| Silver | 46 years | 0.94 |
| Crude Oil | 26 years | 0.79 |
| Yen/USD | 38 years | 0.27 |
| Jakarta Stock Index | 16 years | 0.19 |
| Corn | 49 years | 0.18 |
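The Max-to-Sum ratio itself is a one-liner. A minimal sketch on synthetic data (not the actual market series; the tail index 2.51 is the value used for TAIEX later in the text):

    import numpy as np

    def max_to_sum(x, p=4):
        """Running max-to-sum ratio of |x|^p; it converges to 0 iff E|X|^p is finite."""
        xp = np.abs(np.asarray(x, dtype=float)) ** p
        return np.maximum.accumulate(xp) / np.cumsum(xp)

    rng = np.random.default_rng(0)
    print(max_to_sum(rng.normal(size=864))[-1])            # small and shrinking (the table quotes 0.04)
    print(max_to_sum(rng.pareto(2.51, size=864) + 1)[-1])  # typically stays far from 0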

The reason for the volatility in the kurtosis of both the Taiwan and U.S. stock markets is their fat tails. To quantify how fat the tails are, we fit2 them to the Pareto distribution \(\frac{1}{x^{\alpha + 1}}\), where \(\alpha\) is the tail index3. The smaller \(\alpha\), the fatter the tail: for \(\alpha \le 4\) the kurtosis does not exist; for \(\alpha \le 3\) the skewness does not exist; for \(\alpha \le 2\) the variance becomes infinite; for \(\alpha \le 1\) the mean becomes infinite.
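A common way to fit the tail index, sketched here as an illustration rather than as the exact procedure used for the plots, is the Hill (maximum likelihood) estimator applied to the largest k observations:

    import numpy as np

    def hill_estimator(x, k):
        """Hill estimate of the tail index alpha: the MLE of a Pareto tail
        fitted to the k largest observations above the threshold x_(k+1)."""
        x = np.sort(np.asarray(x, dtype=float))[::-1]   # descending order
        tail, threshold = x[:k], x[k]
        return k / np.sum(np.log(tail / threshold))

    # Sanity check on synthetic Pareto(2.5) data (classical Pareto with x_min = 1)
    rng = np.random.default_rng(0)
    sample = rng.pareto(2.5, size=10_000) + 1
    print(hill_estimator(sample, k=500))   # should come out roughly 2.5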

We see that the fitted tail indices are smaller than 3, which explains why the 3rd and 4th moments in the MS plots never converge. With a tail index \(\alpha\) less than 4, the sample variance \(\frac{1}{n}\sum X_{i}^{2}\) converges to \(\frac{1}{n^{1-\frac{2}{\alpha}}}\left(\text{log}(n)\right)^{\frac{2}{\alpha}} \text{Stable}(\frac{\alpha}{2})\) with \(\frac{\alpha}{2} < 2\), instead of the chi-square distribution. Since the rate \(n^{1-\frac{2}{\alpha}}\) is slower than the \(\sqrt{n}\) rate provided by the Central Limit Theorem, the chi-square test for variance and the goodness-of-fit test for linear regression give dangerously misleading confidence intervals for the Taiwan and U.S. stock markets4.

A quantitative metric that measures the convergence speed of the Central Limit Theorem is the \(\kappa\) metric introduced by Nassim Taleb (Equation 8.1 [1d]). Let \(\mathbb{M}(n) = \mathbb{E}(\left| S_{n} - \mathbb{E}(S_{n}) \right|)\) be the mean absolute deviation of the sum of \(n\) summands, and define $$\displaylines{ \kappa(n_{0}, n) = 2 - \frac{\text{log}(\frac{n}{n_{0}})}{\text{log}(\frac{\text{min}\left\{ \mathbb{M}(i),\space i\ge n \right\} }{\mathbb{M}(n_{0})})} }$$ By definition, \(\kappa(n_{0}, n) \ge \kappa(n_{0}, n+1)\), and for stable distributions \(\kappa(n_{0}, n) = 2 - \alpha\). Since the Gaussian distribution corresponds to \(Stable(2)\), \(\kappa_{gaussian} = 0\).
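The metric is straightforward to estimate by simulation. A minimal sketch (for simplicity we use \(\mathbb{M}(n)\) directly rather than the minimum over \(i \ge n\), and all names are ours):

    import numpy as np

    def kappa(sampler, n, n0=1, trials=200_000, seed=0):
        """Monte Carlo estimate of kappa(n0, n), with M(n) = E|S_n - E(S_n)|
        estimated from simulated sums of n draws."""
        rng = np.random.default_rng(seed)
        def mad_of_sum(m):
            s = sampler(rng, (trials, m)).sum(axis=1)
            return np.mean(np.abs(s - s.mean()))
        return 2 - np.log(n / n0) / np.log(mad_of_sum(n) / mad_of_sum(n0))

    gaussian = lambda rng, size: rng.normal(size=size)
    pareto25 = lambda rng, size: rng.pareto(2.5, size=size) + 1
    print(kappa(gaussian, n=30))   # close to 0
    print(kappa(pareto25, n=30))   # well above 0 for the fat-tailed case (cf. Table 1)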

Let \(P\) be some distribution and \(n_{g}\) some sample size. Since \(\kappa\) is related to the mean absolute deviation \(\mathbb{M}\), we can use it to calculate the minimum sample size \(n_{\nu}\) such that the deviation of \(P\), \(\frac{\mathbb{M}_{P}(n_{\nu})}{n_{\nu}}\), is less than that of a Gaussian, \(\frac{\mathbb{M}_{gaussian}(n_{g})}{n_{g}}\). Let \(n_{0}=1\), and scale \(P\) such that \(\mathbb{M}_{P}(1)=\mathbb{M}_{gaussian}(1)\); then $$\displaylines{ \begin{aligned} \frac{\mathbb{M}_{P}(n_{\nu})}{n_{\nu}} < \frac{\mathbb{M}_{gaussian}(n_{g})}{n_{g}} & \Rightarrow \frac{n_{\nu}^{\frac{1}{2-\kappa(n_{\nu})}}}{n_{\nu}} < \frac{n_{g}^{\frac{1}{2}}}{n_{g}} \\ & \Rightarrow n_{\nu} < n_{g}^{\frac{2-\kappa(n_{\nu})}{2(1-\kappa(n_{\nu}))}} \\ \end{aligned} }$$ Since \(n_{\nu} > n_{g}\), it follows that \(\kappa(n_{\nu}) < \kappa(n_{g})\). Note also that for distributions where the mean exists, that is with a tail index larger than 1, \(0 \le \kappa(n_{g}) \le 1\), and thus \(\kappa(n_{g}) < \frac{2\kappa(n_{g})}{\kappa(n_{g})+1}\) and \(\kappa(n_{\nu}) < \frac{2\kappa(n_{g})}{\kappa(n_{g})+1}\). In this case, $$\displaylines{ \begin{aligned} \kappa(n_{\nu}) < \frac{2\kappa(n_{g})}{\kappa(n_{g})+1} & \Rightarrow \frac{2-\kappa(n_{\nu})}{2(1-\kappa(n_{\nu}))} < \frac{1}{1-\kappa(n_{g})} \\ & \Rightarrow n_{\nu} < n_{g}^{\frac{1}{1-\kappa(n_{g})}} \end{aligned} }$$ In other words, given \(n_{g}\), we can use \(n_{\nu} = n_{g}^{\frac{1}{1-\kappa(n_{g})}}\) to guarantee that \(\frac{\mathbb{M}_{P}(n_{\nu})}{n_{\nu}} < \frac{\mathbb{M}_{gaussian}(n_{g})}{n_{g}}\). For the Pareto distribution, we have the following values of \(\kappa(n)\):

Table 1: \(\kappa_{\alpha}(n)\) values for the Pareto distribution

| \(\alpha\) | n = 2 | n = 30 | n = 100 | n = 1,000 | n = 10,000 | n = 100,000 | n = 1,000,000 | \(n\to\infty\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.16 | 0.879 | 0.863 | 0.858 | 0.853 | 0.850 | 0.848 | 0.847 | 0.840 |
| 1.25 | 0.829 | 0.799 | 0.790 | 0.779 | 0.772 | 0.768 | 0.765 | 0.75 |
| 1.50 | 0.724 | 0.660 | 0.639 | 0.608 | 0.587 | 0.572 | 0.561 | 0.50 |
| 1.75 | 0.650 | 0.560 | 0.529 | 0.482 | 0.447 | 0.420 | 0.400 | 0.25 |
| 2.00 | 0.594 | 0.487 | 0.449 | 0.391 | 0.348 | 0.314 | 0.286 | 0 |
| 2.25 | 0.551 | 0.431 | 0.389 | 0.326 | 0.279 | 0.242 | 0.214 | 0 |
| 2.50 | 0.517 | 0.387 | 0.343 | 0.278 | 0.231 | 0.195 | 0.169 | 0 |
| 2.75 | 0.488 | 0.353 | 0.308 | 0.242 | 0.197 | 0.164 | 0.140 | 0 |
| 3.00 | 0.465 | 0.325 | 0.280 | 0.215 | 0.172 | 0.142 | 0.120 | 0 |
| 3.25 | 0.445 | 0.302 | 0.257 | 0.194 | 0.153 | 0.125 | 0.106 | 0 |
| 3.50 | 0.428 | 0.283 | 0.238 | 0.178 | 0.139 | 0.113 | 0.095 | 0 |
| 3.75 | 0.413 | 0.267 | 0.223 | 0.164 | 0.128 | 0.104 | 0.087 | 0 |
| 4.00 | 0.400 | 0.253 | 0.210 | 0.153 | 0.118 | 0.096 | 0.081 | 0 |

For the Taiwan stock index, we have \(n_{\nu}=15555\) daily samples, sample mean \(\overline{x}=0.00015\), and sample standard deviation \(s=0.00624\). Setting the null hypothesis to \(\mu=0\), the naive t-statistic is \(\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n_{\nu}}}}=3.01\), which is above the 5% critical value of 1.96. However, since the Taiwan stock index has a tail index of \(\alpha=2.51\), the equivalent Gaussian sample size, i.e. the number of Gaussian samples whose mean deviation matches that of \(\text{Pareto}(2.51)\) samples, is actually \(n_{g}=n_{\nu}^{\frac{2-2\kappa_{\alpha}(n_{\nu})}{2-\kappa_{\alpha}(n_{\nu})}}\). From Table 1, we see that \(\kappa_{2.51}(15555)\simeq 0.231\). This means the true equivalent Gaussian sample size is \(n_{g}=15555^{\frac{2-2\cdot 0.231}{2-0.231}}=4411\), and the true t-statistic is 1.60 (p-value 11%), which is not significant at the 5% level. Therefore, when using the Central Limit Theorem correctly, even with 58 years of data it remains inconclusive whether the Taiwan stock market has a non-zero return (such a conclusion actually requires at least 349 years of data5).
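The adjustment can be reproduced in a few lines, using the summary statistics quoted above (a sketch, not the original analysis code):

    import numpy as np
    from scipy import stats

    n_nu, xbar, s, kap = 15_555, 0.00015, 0.00624, 0.231   # TAIEX summary stats and kappa from Table 1

    t_naive = xbar / (s / np.sqrt(n_nu))                    # ~3.01
    n_g = n_nu ** ((2 - 2 * kap) / (2 - kap))               # equivalent Gaussian sample size, ~4411
    t_adj = xbar / (s / np.sqrt(n_g))                       # ~1.60
    p_adj = 2 * (1 - stats.norm.cdf(t_adj))                 # ~0.11
    print(t_naive, n_g, t_adj, p_adj)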

For the common 80/20 law, which is \(\text{Pareto}(\text{log}_{4}(5))=\text{Pareto}(1.16)\), getting a mean deviation equivalent to that of a 30-sample Gaussian requires a sample size of \(30^{\frac{1}{1-\kappa_{\text{Pareto}(1.16)}(30)}}=30^{\frac{1}{1-0.863}}\simeq 6\cdot 10^{10}\).
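The arithmetic, for the record (the 0.863 comes from the \(\alpha=1.16\), \(n=30\) cell of Table 1):

    n_g, kappa_30 = 30, 0.863
    n_nu = n_g ** (1 / (1 - kappa_30))
    print(f"{n_nu:.2e}")   # ~6e10 samples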

Physics might dictate that the mean \(E(X)\) be non-existent  #

Consider the intensity of a point light source on a vertical wall, as shown below:

The angle \(\theta\) is uniformly distributed, and the distance between the light source and the wall is \(\ell\). We are interested in the distribution of \(x\). Since \(\tan(\theta)=\frac{x}{\ell}\), the distribution is:

$$\displaylines{ \begin{aligned} P(x)dx =\space & \frac{1}{\pi} \cdot d\theta \\ =\space & \frac{1}{\pi}\frac{d\theta}{dx}dx \\ =\space & \frac{1}{\pi} \frac{\ell}{\ell^{2}+x^{2}}dx \end{aligned} }$$

which is the Cauchy distribution \(Cauchy(\ell)\). The Cauchy distribution, like all stable distributions, has the property that the sum of two independent Cauchies is also a Cauchy, \(Cauchy(\ell_{1}) + Cauchy(\ell_{2}) = Cauchy(\ell_{1}+\ell_{2})\). This means we can view the distribution as a random walk by photons in small steps, known as the Huygens principle:

We split \(\ell\) into \(n\) steps, and let \(X_{t}\) be the random variable describing a photon's random walk at each step, \(P(X_{t}) = Cauchy(\frac{\ell}{n})\). The law of large numbers would say that the final location of the photon, \(X_{1} + X_{2} + ... + X_{n}\), approaches \(n\cdot E(X)\) as \(n\) increases. In other words, we would observe a single spot of bright light at position \(n\cdot E(Cauchy(\frac{\ell}{n}))\), while the rest of the wall stays completely dark. However, a dark wall with a single bright spot is physically nonsensical. The only way out of this paradox is to conclude that the mean of the Cauchy distribution does not exist.
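A minimal simulation sketch of this argument (our own names): photons leave the source at uniformly random angles, and the running mean of their positions on the wall never settles down.

    import numpy as np

    def wall_hits(n, l=1.0, seed=0):
        """A photon leaves at a uniform angle theta in (-pi/2, pi/2)
        and hits the wall at x = l * tan(theta), i.e. a Cauchy(l) sample."""
        rng = np.random.default_rng(seed)
        theta = rng.uniform(-np.pi / 2, np.pi / 2, size=n)
        return l * np.tan(theta)

    x = wall_hits(1_000_000)
    running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
    print(running_mean[[999, 9_999, 99_999, 999_999]])   # keeps jumping around; no bright spot emerges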

Power laws are more common than the gaussian in the sample size limit \(n\to \infty\)  #

Many fields are interested in quantities other than the sum of random variables \(X_{1}+X_{2}+\cdots +X_{n}\). For example, hydrology is focused on the maxima of random variables, \(\text{max}\left( X_{1}, X_{2}, \cdots , X_{n} \right)\), where \(X_{i}\) represents the flood level. The insurance industry is concerned with the mean excess function \(E\left( X-u | X>u \right)\), which can be interpreted as the expected claim size; the mean excess function is also called the mean residual life in reliability or medical contexts. For these quantities, we have the following limit laws:

The fat tails of the Pareto distribution have the following properties:

Independence-Correlation relationship breaks down for fat tails.  #

Zero correlation implies independence between random variables \(X_{1}\) and \(X_{2}\) when they are jointly normal, that is, when \(Y = aX_{1} + bX_{2}\) is normal for all \(a\) and \(b\).

However, this is not the case for fat tailed distributions. In fact, the multivariate Cauchy distribution with zero correlation, while stable9, is not independent: $$\displaylines{ \begin{aligned} \text{MultivariateCauchy} =\space & \frac{1}{2\pi}\frac{1}{\left(x_{1}^{2}+x_{2}^{2}+1\right)^{\frac{3}{2}}}dx_{1}dx_{2} \\ \text{IndependentCauchy} =\space & \left(\frac{1}{\pi}\frac{1}{x_{1}^{2}+1}dx_{1}\right)\left(\frac{1}{\pi}\frac{1}{x_{2}^{2}+1}dx_{2}\right) \end{aligned} }$$ This difference is made most stark in the following contour plots:
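A small sketch that evaluates the two densities at a few points (the helper names are ours); if the components were truly independent the ratio would be 1 everywhere, but it is not:

    import numpy as np

    def multivariate_cauchy(x1, x2):
        # zero-correlation bivariate Cauchy density
        return 1 / (2 * np.pi * (x1**2 + x2**2 + 1) ** 1.5)

    def independent_cauchy(x1, x2):
        # product of two standard Cauchy marginals
        return 1 / (np.pi**2 * (x1**2 + 1) * (x2**2 + 1))

    for point in [(0, 0), (3, 0), (3, 3), (10, 10)]:
        print(point, multivariate_cauchy(*point) / independent_cauchy(*point))   # ratio varies, so not independent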

It turns out that we need to stay close to Gaussian for the \(\text{Uncorrelated}\rightarrow \text{Independent}\) inference to be valid:

We borrow Taleb's words to summarize this section:

Causal questions can never be answered from data and statistics alone  #

Simpson's paradox states that a statistical result that appears within groups can disappear or reverse when the groups are combined. Moreover, whether to trust the partitioned or the combined data depends on the causal structure giving rise to the data.

The below graph[5] describes different scenarios regarding the question, "does treatment lead to recovery?", which is of a causal nature.

A systematic way to decide whether to trust the partitioned or combined results is to apply the backdoor criterion[6]. First, some definitions:

With these definitions, the backdoor criterion states:

A set of conditioned nodes \(S\) is sufficient for measuring the total effect of X on Y if:

Case (c) in the above graph illustrates a collider scenario. \(L_{1}\) is a health-conscious personality (more physically healthy, more visits to the doctor). \(L_{2}\) is a pre-cancer lesion that affects recovery. \(Z\) is a diagnostic test for pre-cancer. In this case, the backdoor path \(X\gets \circ\to Z\gets \circ \to Y\) is blocked when \(Z\) is not conditioned on, since \(Z\) is a collider. In other words, the combined results should be used. Traditional statistics recommends that we "control for everything"; however, case (c) is a good example where controlling for \(Z\) actually introduces spurious effects from the latent variable \(L_{2}\).
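The case (c) reasoning can be verified mechanically with a d-separation query. A minimal sketch (the node names are ours; the helper is networkx's d-separation test, whose name differs across networkx versions):

    import networkx as nx

    # networkx renamed this helper across versions; use whichever is available
    d_sep = getattr(nx, "is_d_separator", None) or getattr(nx, "d_separated")

    # Case (c): L1 -> X, L1 -> Z, L2 -> Z, L2 -> Y, X -> Y; Z is a collider on the backdoor path
    g = nx.DiGraph([("L1", "X"), ("L1", "Z"), ("L2", "Z"), ("L2", "Y"), ("X", "Y")])

    # Backdoor check: delete the arrows leaving X, then ask whether the conditioning set
    # d-separates X from Y along the remaining (backdoor) paths
    backdoor = g.copy()
    backdoor.remove_edge("X", "Y")
    print(d_sep(backdoor, {"X"}, {"Y"}, set()))   # True: the empty set already blocks the backdoor path
    print(d_sep(backdoor, {"X"}, {"Y"}, {"Z"}))   # False: conditioning on the collider Z opens it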

For case (d), \(Z\) should be conditioned on.

For case (e) above, \(Z\) itself is not sufficient, and we have to use the sets \(\left\{ L_{1}, Z \right\}\) or \(\left\{L_{2}, Z \right\}\).

Note that the quantity provided by the backdoor criterion is the total effect. Whether or not the total effect is the right question to ask depends on the research at hand. Let's look at a real world example regarding the U.S. health-care system's efficacy against COVID-19:

Table 2: COVID-19 fatality rates (in percentages) in the U.S.10

| Age \ Race | White, non-Hispanic | Others |
| --- | --- | --- |
| 0-19 | 0.04 | 0.03 |
| 20-29 | 0.04 | 0.04 |
| 30-39 | 0.05 | 0.11 |
| 40-49 | 0.14 | 0.27 |
| 50-59 | 0.25 | 0.72 |
| 60-69 | 0.77 | 1.89 |
| 70-79 | 3.46 | 5.38 |
| 80+ | 14.50 | 14.19 |
| All age groups | 0.55 | 0.50 |

The question we'd like to ask from this data is: Is the U.S. medical system fair to people of all races? Looking at the partitioned data, non-whites have a much higher fatality rate than whites across all age groups except for the very young and old. For the 50-59 age group, this fatality rate gap may be as high as 3x. However, when we look at the combined data, the conclusion is reversed, with whites overall having a slightly higher fatality rate.

The reason for this discrepancy is simple: whites are older. According to the U.S. census, 9.4% of whites are over 75 years old, whereas for non-whites this figure is only 4.8%. At the same time, old people are precisely the ones at greatest risk of dying from COVID, by a wide margin, as can be seen from Table 2. In other words, if you tell me that you are white and COVID-positive, but do not tell me your age, I have to assume you have a higher risk of dying than your neighbor who is black and COVID-positive. But once you tell me your age, your risk of dying becomes much lower than that of your neighbor who is black, COVID-positive and of the same age. Surely the act of telling me your age should not make any difference to your medical condition.

To resolve this paradox, we have to determine which gives a more accurate picture of reality: the partitioned data or the combined data.

If we apply the backdoor criterion along with the above causal model, we would choose the aggregate data, and conclude that the American clinical system does not discriminate against non-whites.

However, a deeper inspection of this argument suggests that it may not represent our true intentions. The question the backdoor criterion targets is, "What is the total lifetime effect of intervening (before birth) to change a person's race?". And yes: if we could perform that intervention, and if our sole objective was to prevent death from COVID, we would choose to change our race from white to non-white. The benefit of such an intervention is that we would be less likely to live beyond 75, the ages at which the risk of dying from COVID is highest. Obviously, this is nonsense, as saving lives from COVID by making people die earlier is not a justifiable health policy.

Thus, the query we want to interrogate the data with is not “What is the total effect?” but “What is the direct effect?” In fact, this form of query is almost always what we are truly interested in when talking about discrimination. With respect to discrimination in the U.S. health-care system, our true intention is to hold all other variables constant and see what the effect of changing race alone is. In this case, that means stratifying the data by age and reporting the Controlled Direct Effect (CDE) for each age group: CDE(\(40\lt age\lt49\)), CDE(\(50\lt age\lt59\)), etc. Alternatively, we could use the Mediation Formula for the Natural Direct Effect $$\displaylines{ NDE_{x,x'}(Y)=\sum_{z}\left[ E(Y|x',z)-E(Y|x,z)\right] P(z|x) }$$ where \(x\) denotes non-white, \(x'\) denotes white, \(Y\) denotes fatality, and \(z\) ranges over age groups. In this case, the NDE of setting race from non-white to white is -0.17%, meaning the fatality rate would drop from 0.50% to 0.33%, which is supportive of discrimination.
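The -0.17% figure can be reproduced from Table 2 together with the case counts in footnote 10 (a small sketch; we merge the 0-9 and 10-19 rows to match Table 2's 0-19 bucket, and use the non-white case counts for \(P(z|x)\)):

    # Natural Direct Effect of setting race from non-white (x) to white (x')
    rate_white = {"0-19": 0.0004, "20-29": 0.0004, "30-39": 0.0005, "40-49": 0.0014,
                  "50-59": 0.0025, "60-69": 0.0077, "70-79": 0.0346, "80+": 0.1450}
    rate_other = {"0-19": 0.0003, "20-29": 0.0004, "30-39": 0.0011, "40-49": 0.0027,
                  "50-59": 0.0072, "60-69": 0.0189, "70-79": 0.0538, "80+": 0.1419}
    cases_other = {"0-19": 21049 + 48859, "20-29": 92645, "30-39": 75982, "40-49": 59138,
                   "50-59": 40753, "60-69": 19053, "70-79": 6750, "80+": 3621}

    total = sum(cases_other.values())
    nde = sum((rate_white[z] - rate_other[z]) * cases_other[z] / total for z in cases_other)
    print(f"NDE = {nde:.2%}")   # ~ -0.17%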

In conclusion, data and statistical calculations themselves are the least important aspects of scientific research. The genuinely critical parts are the underlying or assumed causal structure of the data, as well as the causal questions that capture our true intentions.


  1. ^ We use 18-day returns for the Taiwan stock market, as opposed to the 1-day returns in Chapter 10 of Taleb[1g]. The reason is that Taiwan's 1-day return distribution is not a Pareto, due to the market regulation capping daily stock price fluctuations at 7%.
  2. ^ Before fitting our data, it is important to check that our sample is indeed independent and identically distributed (i.i.d.).

    Define a new record as a data point whose value is higher than all values before it. For i.i.d. data, the number of new records at time \(t\) grows roughly as \(log(t)\) (Section 10.2.7, Taleb[1h]). Below, we see that this is indeed the case for both the Taiwan and U.S. stock markets:

    Furthermore, since the number of new records is sensitive to the ordering of the data, it serves as a good test for dependence between successive events in a time series. For i.i.d. data, we should expect the number of new records to be invariant under random permutations of the underlying data. The below graphs show that our empirical record count of 6 does not lie at the edges of the distribution obtained from 1000 random permutations, so we do not reject the i.i.d. hypothesis:

    Since we are focused on the tails, our actual sample is only a subset of our data consisting of only values exceeding some threshold. We want to make sure this subset is also i.i.d. According to Theorem 5.3.2, Embrechts[2b], the key to achieving i.i.d. is in choosing a sufficiently high threshold, such that points above this threshold form a Poisson process. In this case, the inter-arrival times of these high values follow the exponential distribution (Example 6.5.9, Embrechts[2c]). The left plot below shows that this is indeed the case for the Taiwan stock market. The right plot below further confirms that these inter-arrival times are mutually independent.

    For the U.S. stock market, the below plots show that the independent Poisson process assumption also holds:

    Last but not least, we stress that the i.i.d. evidence here applies only to the extreme values of our data. It is well known that stock markets carry strong day-to-day correlations, although this effect might be weakened to some degree by our use of 18-day returns.

  3. ^ The Zipf plot shows a necessary but not sufficient condition for fat tails. Therefore, we complement the analysis with a mean excess function plot (ME plot)[3]. The mean excess function is defined as \(e(u) = E(X-u|X>u)\), which captures the idea of \(X\)'s expected excess beyond some threshold \(u\), given that \(X > u\). It is most useful for identifying the subexponential family of distributions, since they are the only class of distributions where \(\underset{u\to \infty}{lim}e(u) = \infty\) (Equation 6.5[2d]). Below we show \(e(u)\) for some classical distributions:
    Below we show the ME plots for both the Taiwan and U.S. stock markets:
    Since there is undoubtedly an upward trend, it is clear that both datasets belong to the subexponential class and have fat tails.
  4. ^ Here we numerically show how far off the Central Limit Theorem estimates are for the variance of Taiwan's stock market index TAIEX. The TAIEX sample consists of 864 (18-day) returns, which we model as draws from the \(\text{Pareto}(2.51)\) distribution. If we blindly conduct the standard chi-square test for variance at an upper significance level of 5%, we get an upper confidence limit of \(\text{inverse-CDF}\left( \frac{\chi^{2}(864-1)}{864-1}, 0.95 \right) =\) scipy.stats.chi2.ppf(1-0.05, 863)/863 = 1.0805. Hence, we might erroneously believe that 95% of the time our sample estimate deviates from the true variance by only 8%, implying that our sample size of 864 is sufficient. This can also be seen in the below Monte Carlo results for Gaussians:

    However, in reality under the true distribution \(\text{Pareto}(2.51)\), the distribution is drastically different as shown below:

    We see that the mode is around 0.5, and the 95th percentile is at 1.9. This means we are likely to underestimate the true variance by half, and our upper confidence limit is at best 190% of the true variance. Moreover, we observe that these issues are due to the power law fat tail, which is not attenuated even at a sample size of 915. The below graph shows that even with a million samples, our confidence interval of 1.2 remains much larger than that of the standard chi-square test:

    To conclude, tests involving the square of a random variable \(X^{2}\) in the numerator are dangerous to perform on distributions that lack the 4th moment, which includes the majority of real world data. This means standard statistical tools such as the covariance \(\sum_{i}(x_{i}-\overline{x_{i}})(y_{i}-\overline{y_{i}})\), as well as the \(R^{2}\) coefficient for linear regression \(R^{2}=1-\frac{\sum_{i}\left( y_{i}-(ax_{i}+b) \right)^{2}}{\sum_{i}\left(y_{i}-\overline{y_{i}}\right)^{2}}\), are much less useful than advertised.
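    The Monte Carlo behind these observations can be sketched as follows (our own parameter names; we compare the distribution of sample-variance over true-variance for a Gaussian and for a Pareto(2.51) sample of the same size):

        import numpy as np

        def var_ratio(sampler, true_var, n=864, trials=10_000, seed=0):
            """Distribution of (sample variance / true variance) across many trials."""
            rng = np.random.default_rng(seed)
            x = sampler(rng, (trials, n))
            return x.var(axis=1, ddof=1) / true_var

        alpha = 2.51
        pareto_var = alpha / ((alpha - 1) ** 2 * (alpha - 2))   # variance of Pareto(2.51) with x_min = 1
        r_gauss = var_ratio(lambda rng, s: rng.normal(size=s), 1.0)
        r_pareto = var_ratio(lambda rng, s: rng.pareto(alpha, size=s) + 1, pareto_var)
        print(np.quantile(r_gauss, 0.95))    # ~1.08, matching the chi-square limit
        print(np.quantile(r_pareto, 0.95))   # far above 1.08 (the text above quotes ~1.9)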

  5. ^ Given \(\overline{x}=0.00015\), \(s=0.00624\), null hypothesis of \(\mu=0\) and \(P_{gaussian}(-1.96 < x < 1.96) = 0.95\), we want to first calculate the number of gaussian samples \(n_{g}\) corresponding to a 95% confidence t-statistic \(\frac{\overline{x}-\mu}{\frac{s}{\sqrt{n_{g}}}}=1.96\). This gives \(n_{g}=(1.96\cdot \frac{0.00624}{0.00015-0})^{2} = 6648\). Next, using \(\kappa_{2.51}(6648)\simeq 0.231\), we get the number of samples required by the true \(Pareto(2.51)\) distribution \(n_{\nu}=6648^{\frac{1}{1-0.231}}=93538\). Since the average number of trading days per year in Taiwan is 268, this translates into \(\frac{93538}{268}=349\) years.
  6. ^ We note that from Equation 3.45, Embrechts[2e]: $$\displaylines{ P\left( x>u+t | x>u \right)=\begin{cases} \left( 1 + \frac{1}{\alpha}\cdot \frac{t}{e(u)} \right)^{-\alpha} & ,\space \alpha<\infty \\ e^{-\frac{t}{e(u)}} & ,\space \alpha=\infty \end{cases} }$$ where \(e(u)\) is the mean excess function. For the subexponential class, \(\underset{u\to \infty}{lim}e(u)=\infty\). Therefore, \(\underset{u\to \infty}{lim}P\left( x>u+t | x>u \right)=1\) for all \(t\). Another derivation can be found in Lemma 1.3.5, Embrechts[2f].
  7. ^ Here we prove that \(P\left( X_{1}+X_{2}+\cdots X_{n} > x\right) = P\left( \text{max}\left( X_{1}, X_{2}, \cdots , X_{n} \right) > x \right)\) as \(x\to \infty\) for the Pareto distribution \(\frac{\alpha}{x^{1+\alpha}}\). We focus on the case \(n=2\) as the general case follows by induction. On the right hand side we have: $$\displaylines{ \begin{aligned} P\left( \text{max}\left( X, Y \right) > L \right) =\space & P\left( X > L \right) + P\left( Y > L \right) - P\left( X > L \wedge Y > L \right) \\ =\space & \int_{L}^{\infty}\frac{\alpha}{x^{1+\alpha}}dx + \int_{L}^{\infty}\frac{\alpha}{y^{1+\alpha}}dy - \int_{L}^{\infty}\int_{L}^{\infty}\frac{\alpha}{x^{1+\alpha}}\frac{\alpha}{y^{1+\alpha}}dxdy \\ =\space & \frac{1}{L^{\alpha}} + \frac{1}{L^{\alpha}} - \frac{1}{L^{2\alpha}} \\ =\space & 2\cdot \frac{1}{L^{\alpha}} \end{aligned} }$$ For the left hand side, we consult the below diagram:
    $$\displaylines{ \begin{aligned} P\left( X+Y>L \right) =\space & 2\cdot \text{area}_{A} + \text{area}_{B} \\ =\space & 2\cdot \int_{1}^{\frac{L}{2}}\int_{L}^{\infty}\frac{\alpha}{\left( z-y \right)^{1+\alpha}}\frac{\alpha}{y^{1+\alpha}}dzdy+\frac{2^{2\alpha}}{L^{2\alpha}} \space,\space \text{where}\space z=x+y \\ =\space & 2\cdot \int_{1}^{\frac{L}{2}}\frac{1}{\left( L-y \right)^{\alpha}}\frac{\alpha}{y^{1+\alpha}}dy \\ =\space & 2\left[ \frac{1}{\left( L-y \right)^{\alpha}}\left( -\frac{1}{y^{\alpha}} \right) \bigg\rvert_{1}^{\frac{L}{2}} - \int_{1}^{\frac{L}{2}}\frac{\alpha}{\left( L-y \right)^{1+\alpha}}\left( -\frac{1}{y^{\alpha}} \right)dy \right] \text{, integration by parts} \\ =\space & 2\left[\frac{1}{\left( L-1 \right)^{\alpha}} - \frac{2^{2\alpha}}{L^{2\alpha}} + \int_{1}^{\frac{L}{2}}\frac{\alpha}{\left( L-y \right)^{1+\alpha}}\frac{1}{y^{\alpha}}dy \right] \\ <\space & 2\left( \frac{1}{\left( L-1 \right)^{\alpha}} - \frac{2^{2\alpha}}{L^{2\alpha}} \right) \space,\space \text{since the last term is a positive number} \\ =\space & 2\cdot \frac{1}{L^{\alpha}} \end{aligned} }$$ Since \(P\left( \text{max}\left( X, Y \right) > L \right) < P\left( X+Y>L \right)\) and \(P\left( \text{max}\left( X, Y \right) > L \right)=\frac{2}{L^{\alpha}}\), we get \(\frac{2}{L^{\alpha}} < P\left( X+Y>L \right) < \frac{2}{L^{\alpha}}\), which completes our proof.
  8. ^ For a derivation of the mean excess function of the Pareto distribution, see Eliazar[4]. For a list of \(e(u)\) for some standard distributions, see Table 3.4.7 Embrechts[2g].
  9. ^ Here we show that the zero correlation multivariate Cauchy distribution is stable, that is \(Y = aX_{1} + bX_{2}\) is Cauchy for all \(a\) and \(b\). $$\displaylines{ \begin{aligned} \int_{-\infty}^{\infty}\frac{1}{2\pi}\frac{1}{\left(x_{1}^{2}+x_{2}^{2}+1\right)^{\frac{3}{2}}}dx_{1}dx_{2} =\space & \int_{-\infty}^{\infty}\frac{1}{2\pi}\frac{1}{\left(x_{1}^{2}+\left(\frac{y}{b}-\frac{a}{b}x_{1}\right)^{2}+1\right)^{\frac{3}{2}}}dx_{1}\frac{1}{b}dy \\ =\space & \frac{1}{2\pi b}\frac{1}{y^{2}+a^{2}+b^{2}}\frac{x_{1}(a^{2}+b^{2})-ay}{\sqrt{\frac{(y-ax_{1})^{2}}{b^{2}}+x_{1}^{2}+1}}\bigg\rvert_{-\infty}^{\infty}dy \\ =\space & \frac{1}{2\pi b}\frac{1}{y^{2}+a^{2}+b^{2}}\frac{x_{1}(a^{2}+b^{2})}{\frac{|x_{1}|}{b}\sqrt{a^{2}+b^{2}}}\bigg\rvert_{-\infty}^{\infty}dy \\ =\space & \frac{1}{\pi\sqrt{a^{2}+b^{2}}}\frac{1}{(\frac{y}{\sqrt{a^{2}+b^{2}}})^{2}+1}dy \\ =\space & \text{Cauchy}(0, \sqrt{a^{2}+b^{2}}) \end{aligned} }$$
  10. ^ The raw data is the "COVID-19 Case Surveillance Public Use Data" dataset. The sqlite commands to aggregate the data are:
        .import COVID-19_Case_Surveillance_Public_Use_Data.csv covid --csv
        select *, cast(deaths as float)/num_cases
        from (
            select age_group,
                   case when `Race and ethnicity (combined)`='White, Non-Hispanic'
                        then 'white' else 'other' end as race,
                   sum(1) as num_cases,
                   sum(case when death_yn='Yes' then 1 else 0 end) as deaths
            from covid
            where current_status='Laboratory-confirmed case'
              and age_group not in ('NA', 'Unknown')
              and medcond_yn='No'
            group by age_group, race
        );
        
    The aggregated number of confirmed cases and deaths is given below:
    age_group race num_cases deaths
    0 - 9 Years other 21049 10
    0 - 9 Years white 10929 5
    10 - 19 Years other 48859 12
    10 - 19 Years white 43449 16
    20 - 29 Years other 92645 40
    20 - 29 Years white 77203 33
    30 - 39 Years other 75982 83
    30 - 39 Years white 50992 25
    40 - 49 Years other 59138 157
    40 - 49 Years white 44041 63
    50 - 59 Years other 40753 292
    50 - 59 Years white 43781 109
    60 - 69 Years other 19053 360
    60 - 69 Years white 27443 211
    70 - 79 Years other 6750 363
    70 - 79 Years white 11463 397
    80+ Years other 3621 514
    80+ Years white 6027 874

References

  1. ^ a b c d e f g h Nassim Nicholas Taleb, Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications, arXiv:2001.10488
  2. ^ a b c d e f g Paul Embrechts, Claudia Klüppelberg, Thomas Mikosch, Modelling Extremal Events: for Insurance and Finance, Springer
  3. ^ Pasquale Cirillo, Are your data really Pareto distributed?, arXiv:1306.0100
  4. ^ Iddo Eliazar, Lindy's law, Physica A: Statistical Mechanics and its Applications
  5. ^ Judea Pearl, Understanding Simpson's Paradox, The American Statistician, 68 (1): 8–13, doi:10.2139/ssrn.2343788
  6. ^ Sander Greenland, Judea Pearl, James M. Robins, Causal Diagrams for Epidemiologic Research, Epidemiology, 10 (1): 37–48 (January 1999), doi:10.1097/00001648-199901000-00008