[bit.ly/dsia06b] – Part B of Lec 6 on Descriptive Statistics: An Islamic Approach. A Fisherian approach to statistics begins by ASSUMING that the data is a random sample from a theoretical population characterized by a small number of parameters. Such an assumption has no basis in reality, but was made to make statistical computations possible in a pre-computer era. Because of inertia, this century-old methodology continues to dominate the field, even though advances in computational capabilities have made it obsolete. Instead of an assumed distribution, REAL statistics takes the ACTUAL data distribution as the central tool for data analysis. This actual distribution never belongs to any of the neat theoretical families of distributions which make for elegant mathematical analysis. It cannot be written down on paper as a formula, but it can easily be computed by the computer. A visual depiction of the actual data distribution is the Histogram, which we have studied earlier. A more rigorous mathematical approach would be based on the Empirical Cumulative Distribution Function (ECDF), which we will study and explore later. The ECDF is directly based on the data, depends on all of the data, and does not allow for data reduction, unlike the Fisherian approach. We can now handle computations on the 15509 data points in the HIES data with a click, so we do not NEED to reduce the data before we start the analysis.
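As a preview of the ECDF mentioned above, here is a minimal sketch of how it can be computed directly from the data. The function name `ecdf` and the toy household sizes are illustrative assumptions; they are not HIES values and not the method used later in the course.

```python
def ecdf(data, x):
    """Fraction of data points that are less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)


# Hypothetical toy household sizes (NOT taken from the HIES data).
toy_sizes = [1, 3, 5, 5, 6, 6, 7, 8, 9, 12]

# Share of toy households with size 6 or less.
share_up_to_6 = ecdf(toy_sizes, 6)
print(share_up_to_6)
```

Note that the function uses every data point; nothing is summarized or assumed before the computation.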

It is nonetheless useful to have a few SUMMARY statistics which describe the data distribution. Our main concern in this lecture is to develop the concept of the QUARTILES, as a natural and intuitive description of the data set. Parenthetically, we note that in the Fisherian approach, these summary statistics are the mean and the standard deviation, which work wonderfully for the hypothetical normal distributions, but are extremely poor for other distributions. Here is a visual description of how we define the Quartiles.

First we sort the data, so that it is arranged in increasing order. Then we divide the data into FOUR equal parts. With 15509 points of data, this comes out to about 3877 data points in each of the four portions. The Summary Statistics are the separating points for these four portions of the data: Q0, Q1, Q2, Q3, Q4. Note that HH Size is always an integer and varies from a minimum of 1 to a maximum of 61 on this data set of 15509 points. The summary statistics are computed as follows:

- Q0 = 1 = Minimum HHS
- Q1 = 5 = HHS(3877) = 1st Quartile, 3877.25 = 15509/4
- Q2 = 6 = HHS(7754) = Median, 7754.5 = 15509/2
- Q3 = 8 = HHS(11631) = 3rd Quartile, 11631.75 = 3*15509/4
- Q4 = 61 = Maximum HHS
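The sort-and-split procedure can be sketched in a few lines. This sketch assumes that the quartile position is found by rounding k*n/4 down to a whole number, which matches the positions 3877, 7754 and 11631 listed above; other conventions exist and can differ by one position. The function name `five_number_summary` and the toy data are hypothetical.

```python
def five_number_summary(values):
    """Return [Q0, Q1, Q2, Q3, Q4] by picking sorted values at positions k*n/4.

    Assumption: positions are rounded DOWN (3877.25 -> 3877), mirroring the
    positions quoted in the text. This is one convention among several.
    """
    data = sorted(values)
    n = len(data)

    def order_stat(position):
        # Positions are 1-based; clamp so that Q0 is the minimum.
        position = min(max(position, 1), n)
        return data[position - 1]

    return [order_stat(k * n // 4) for k in range(5)]


# Hypothetical toy data, not HIES values.
toy = [3, 1, 2, 5, 4, 8, 7, 6]
print(five_number_summary(toy))
```

For n = 15509 the same rule picks positions 3877, 7754 and 11631 for Q1, Q2 and Q3.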

We humans are much better at absorbing information in pictures and graphs, as opposed to numbers. That is why a Box-and-Whiskers plot (short form: Boxplot) provides a Graphic View of Summary Stats:

The LEFT whisker of the box-and-whisker plot goes from Q0 to Q1, the minimum to the first quartile. For HH Size the minimum value is 1, while the first quartile occurs at HH Size = 5. The third quartile is at HH Size = 8. The BOX is made between Q1 and Q3 and represents HALF of the data. 25% of the data is in the LEFT whisker, while another 25% is in the RIGHT whisker, which goes from Q3 to the maximum HH Size of 61. A line is drawn in the middle of the box to show where Q2 or the median belongs. Thus, all 5 quartiles Q0, Q1, Q2, Q3, and Q4 are pictured in the boxplot.

So what do we learn from this boxplot? Of the greatest importance is the Central Value or the median, which is HH Size = 6. What exactly does this mean? It means that HALF of the households have HH Size ≤ 6, while HALF of the households have household sizes ≥ 6. Thus the HH Size of 6 divides the population into two equal halves, where one half is smaller and the other half is larger. Some technical issues arise because HH Size is integer valued and jumps from 5 to 6 to 7 without taking any values in the middle. Thus, when we look at HH Sizes of 1,2,3,4,5, fewer than 50% of HHs have these sizes. When we add the size 6, then more than 50% of the households have size 1-6. This is a technical issue which is not of importance for us in the present context.

After the CENTRAL VALUE or the median, the next most important thing is the SPREAD of the data, which is measured by the Interquartile Range. This is defined as the distance between Q3 and Q1. In this data set, the BOX goes from HH Size 5 to HH Size 8. This means that about 50% of the households have sizes in the range 5,6,7,8. 25% or less have HH Size below (1,2,3,4), while 25% or less have HH Size above (9,10,…,61). This tells us that the distribution is Asymmetric; it is Right SKEWed. The left whisker is very short, so the data distribution has a Short Left Tail. On the other hand, it has an Extremely Long Right Tail.

To understand the quartiles better, we show how we can compute them from the following table. For each HH Size, the table COUNTS the number of HouseHolds with SMALLER HH Size. Thus, the first entry shows that there are 3412 HH’s which have size {1,2,3,4} (less than 5). We note that 3412/15,509 = 22%, so this is less than a quarter of the population. However, when we go to the next entry, there are 5521 HH’s of size {1,2,3,4,5}, and this is 35.6% of the population. So HH Size = 5 takes us from 22% to 35.6%, which COVERS the 25% or the first quartile. Similarly, HH Size = 6 takes us from 35.6% to 50.9%, which COVERS 50% or the second quartile. Similarly, HH Size = 8 takes us from 64.7% to 75.6%, which COVERS Q3 = 75%.

| HH Size | # HH with smaller size | % of 15509 |
|---|---|---|
| 5 | 3412 | 22.00% |
| 6 | 5521 | 35.60% |
| 7 | 7891 | 50.90% |
| 8 | 10030 | 64.70% |
| 9 | 11722 | 75.60% |
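The "covering" argument can be checked mechanically from the counts in the table. The sketch below copies the five counts and the total of 15509 from the text; the name `covering_size` is a hypothetical helper introduced only for illustration.

```python
N = 15509  # total number of households, from the text

# Number of households with size STRICTLY SMALLER than the key (from the table).
below = {5: 3412, 6: 5521, 7: 7891, 8: 10030, 9: 11722}


def covering_size(p):
    """Return the HH Size whose households carry the cumulative share across p.

    Size s covers p when the share below s is less than p, and the share
    below s + 1 (that is, sizes up to and including s) reaches p.
    """
    for s in sorted(below):
        if s + 1 in below and below[s] / N < p <= below[s + 1] / N:
            return s
    return None  # p is outside the range covered by the table


quartile_sizes = [covering_size(p) for p in (0.25, 0.50, 0.75)]
print(quartile_sizes)
```

The three results are the HH Sizes 5, 6 and 8, matching Q1, Q2 and Q3 in the summary statistics.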

These problems, where the percentiles jump from 22% to 35.6% without coming close to 25%, arise because HH Size is an integer and can only take certain fixed values. We next look at the Summary Stats for HH TE/cap (Total Expenditure per capita), which is a continuous variable. As we will see, these problems do not arise for continuous variables. A table similar to the one above lists the five quartiles of TE/cap in the first column. The second column lists the NUMBER of HH’s which have smaller TE/cap, while the third column displays this number as a percentage of 15509.

| TE/cap | #HH below | % HH below |
|---|---|---|
| MIN = 1966 | 0 | 0% |
| Q1 = 14275 | 3877 | 24.998% |
| Q2 = 19454 | 7754 | 49.997% |
| Q3 = 28648 | 11631 | 74.995% |
| MAX = 1268708 | 15508 | 100% |

A visual depiction of these quartiles can be seen in a boxplot:

The central value is the Median TE/cap = 19,454. This is central because 7754 HHs are below (having less TE/cap) and also 7754 are above, having more TE/cap. In traditional statistics, one might use the Average value of PKR 27,119 for the Center of this distribution. This is great if the distribution is normal, but it becomes Very Distorted due to the presence of huge outliers, which are not part of any normal distribution. In general, the widely used summary statistic of the Mean is best for Normal, but VERY BAD for general distributions. In contrast, the Median works well for ALL distributions, and has a natural and intuitive interpretation.
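A small sketch shows how a single huge value distorts the mean while leaving the median untouched. The numbers are hypothetical toy values chosen for easy arithmetic, not HIES data.

```python
from statistics import mean, median

# Hypothetical toy expenditures (NOT HIES values).
toy = [10, 12, 14, 15, 16, 18, 20]

# Same data, but the largest value is replaced by a huge outlier.
with_outlier = toy[:-1] + [2000]

print(mean(toy), median(toy))                    # both centers agree here
print(mean(with_outlier), median(with_outlier))  # the mean jumps, the median does not
```

In the second data set the mean is larger than six of the seven values, so it no longer describes a typical household, while the median still sits in the middle of the data.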

The next thing we learn from the data is the Dispersion: How Spread Out is the Data? The boxplot uses the middle 50% of the data to measure this, via the Interquartile Range [14275, 28648]. Half of the households have TE/cap within this range; 25% have LESS and 25% have MORE. IQR = 28648 - 14275 = 14373. This is a natural measure of dispersion for general distributions. It is the REPLACEMENT for Standard Deviation, which works well ONLY for normal distributions.
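The contrast between the IQR and the standard deviation can be sketched in the same way. This example uses `statistics.quantiles` from the Python standard library with its default method, which is one of several quartile conventions and is an assumption here; the data are hypothetical toy values.

```python
from statistics import pstdev, quantiles


def iqr(values):
    """Interquartile range Q3 - Q1, using the library's default quartile rule."""
    q1, _q2, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical toy data (NOT HIES values): 11 evenly spaced points.
toy = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Same data with the maximum stretched far out into the right tail.
stretched = toy[:-1] + [2000]

print(iqr(toy), pstdev(toy))
print(iqr(stretched), pstdev(stretched))
```

Stretching the right tail leaves the IQR unchanged, because the middle half of the data has not moved, while the standard deviation grows enormously.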

The boxplot also tells us about the Skewness & the Tails. Both HH Size and HH TE/cap are right-skewed, but TE/cap is much more skewed. Both have large right tails: HH Size goes up to 61, while HH TE/cap goes up to 1,268,708. TE/cap has a much more extreme extension in the right tail. In contrast, the Normal distribution is symmetric and has thin tails.

We would like to study the relationship between HH Size and HH TE/cap (which is a proxy for HH Wealth). In conventional statistics, the methodology of doing this is based on “regressions”. As usual, these regressions are based on large numbers of unverifiable and false assumptions. Famous statistician David Freedman said that ‘we have been running regressions for a century. This has not led to any useful results. Let us abandon the technique’. In real statistics, we propose to use the Median Line of X given Y as a REPLACEMENT for regression lines. We will illustrate this by drawing the two Median-Lines, one of HH Size against TE/cap and the other of TE/cap against HH Size. Intuitively, the idea is to create small boxes (bins) for one of the variables, say Y. That amounts to making the range of variation small for that variable. Within a bin, the variable Y does not vary much. Now compute the MEDIAN value of X in this bin. That will tell us the central value of X for Y’s within a particular box or bin. Now, as we change the Y-values, moving up across the Y-bins, we will find how the median of X changes as Y changes across bins. This will give us an idea about how the variable X responds to changes in the variable Y. We now illustrate the concept of Median-Lines for our HIES data set.

Conceptually, it is easier to see how the Median TE/cap varies with household size. We simply subdivide the data into groups according to HH Size. For each HH Size, we look at ALL the HHs with that size. Here is the Median Graph of TE/cap according to HH Size:

The first point on the graph shows that when HH Size = 1, the MEDIAN TE/cap is above 100,000. Note carefully what this means. It does not mean that ALL Households of size 1 are rich. Rather, there are (only) 159 HH’s of size 1 in the entire sample of 15,509. Among these 159 HH’s, the median TE/cap is above 100,000; that is, more than half of them (at least 80 HH’s) have TE/cap in excess of 100,000. The remaining HH’s in this group (having size = 1), fewer than 80, have TE/cap LESS than 100,000. Similarly, for each category of HH Size, the dot shows the median TE/cap of all HH’s having that size.

It is clear from the graph that, as HH Size increases, MEDIAN TE/cap decreases. The most rapid changes occur early, for small HH Size. From HH Size of 1 to 5, there is a rapid reduction of Median TE/cap as HH Size increases. From HH Size 5 to 10, there are small reductions in median TE/cap. After HH Size = 10, the Median line seems pretty flat.
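The grouping step behind this Median-Line can be sketched as follows. The function name `median_line` and the toy (HH Size, TE/cap) pairs are hypothetical; they only illustrate the mechanics and do not reproduce the HIES figures.

```python
from collections import defaultdict
from statistics import median


def median_line(pairs):
    """Map each HH Size to the median TE/cap of the households with that size.

    `pairs` is an iterable of (hh_size, te_per_cap) tuples.
    """
    groups = defaultdict(list)
    for size, te_per_cap in pairs:
        groups[size].append(te_per_cap)
    return {size: median(values) for size, values in sorted(groups.items())}


# Hypothetical toy households (NOT HIES values).
toy = [(1, 90), (1, 110), (1, 120), (2, 60), (2, 70), (3, 40), (3, 50), (3, 45)]
print(median_line(toy))
```

Each point on the Median-Line is one entry of this dictionary: the HH Size on the horizontal axis and the group median on the vertical axis.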

Next, we consider the other Median-Line, of HH Size for TE/cap groups. In order to create this, the first step is to subdivide TE/cap into small buckets. There are many possibilities, but in the present case, a natural method is as follows. We note that 775 x 20 = 15500, so if we create 20 buckets, with each bucket having 775 families, we will cover 15500 families. To cover the remaining 9 families, we can just add one family to every other bucket, stopping once all 9 have been placed. We will describe the full technical details of how to do these operations in EXCEL in the next portion of this lecture. For the moment, we just note that the income groups which are created by this procedure are as follows:

| Group No. | TE/cap Lo | TE/cap Hi | Group Size | Group No. | TE/cap Lo | TE/cap Hi | Group Size |
|---|---|---|---|---|---|---|---|
| 1 | 1966 | 9609 | 775 | 11 | 19455 | 20819 | 775 |
| 2 | 9614 | 11220 | 776 | 12 | 20821 | 22308 | 776 |
| 3 | 11221 | 12324 | 775 | 13 | 22309 | 24018 | 775 |
| 4 | 12324 | 13359 | 776 | 14 | 24024 | 26151 | 776 |
| 5 | 13359 | 14275 | 775 | 15 | 26154 | 28648 | 775 |
| 6 | 14275 | 15235 | 776 | 16 | 28651 | 31909 | 776 |
| 7 | 15236 | 16199 | 775 | 17 | 31916 | 37098 | 775 |
| 8 | 16200 | 17235 | 776 | 18 | 37100 | 45565 | 776 |
| 9 | 17236 | 18272 | 775 | 19 | 45576 | 64297 | 775 |
| 10 | 18273 | 19454 | 776 | 20 | 64317 | 1268708 | 775 |
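The Group Size column above can be reproduced with a short sketch. It assumes that the leftover families are handed out one at a time to the 2nd, 4th, 6th, ... bucket until none remain, which is how the table reads; the function name `bucket_sizes` is hypothetical, and this is not the EXCEL procedure described in the lecture.

```python
def bucket_sizes(n, k):
    """Split n families into k buckets of near-equal size.

    Every bucket gets n // k families. The leftover families go, one each,
    to the even-numbered buckets first (2nd, 4th, ...), then to the
    odd-numbered ones if any are still left over.
    """
    base, extra = divmod(n, k)
    sizes = [base] * k
    # 0-based indices of the 2nd, 4th, ... buckets, followed by the 1st, 3rd, ...
    order = list(range(1, k, 2)) + list(range(0, k, 2))
    for index in order[:extra]:
        sizes[index] += 1
    return sizes


sizes = bucket_sizes(15509, 20)
print(sizes)
```

With 15509 families and 20 buckets, the 9 leftover families land in buckets 2, 4, ..., 18, so buckets 19 and 20 both keep 775 families.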

Each of these 20 buckets has 775 or 776 families. Now we look at each of these buckets separately, and compute the MEDIAN HH Size for each group of 775/776 families. These can be plotted as follows:

This is a graph of the Median HH Size for each of the 20 income groups as described above. This same information can be given in tabular form as follows:

| TE/cap Lo | Median HH Size | TE/cap Lo | Median HH Size |
|---|---|---|---|
| 1966 | 9 | 19455 | 6 |
| 9614 | 8 | 20821 | 6 |
| 11221 | 8 | 22309 | 6 |
| 12324 | 8 | 24024 | 6 |
| 13359 | 8 | 26154 | 5 |
| 14275 | 7 | 28651 | 5 |
| 15236 | 7 | 31916 | 5 |
| 16200 | 7 | 37100 | 5 |
| 17236 | 7 | 45576 | 5 |
| 18273 | 6 | 64317 | 4 |
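The computation behind this table can be sketched as: sort the households by TE/cap, cut the sorted list into buckets of near-equal size, and take the median HH Size in each bucket. The function name `median_by_bucket` and the toy data are hypothetical, and the sketch cuts the buckets at positions b*n/k, which may place the odd families differently from the 775/776 pattern used in the lecture.

```python
from statistics import median


def median_by_bucket(pairs, k):
    """Median HH Size in each of k near-equal TE/cap buckets.

    `pairs` is a list of (te_per_cap, hh_size) tuples.
    Assumption: bucket b covers sorted positions b*n//k up to (b+1)*n//k.
    """
    ordered = sorted(pairs)  # sorts by TE/cap first
    n = len(ordered)
    medians = []
    for b in range(k):
        lo, hi = b * n // k, (b + 1) * n // k
        sizes = [hh_size for _te, hh_size in ordered[lo:hi]]
        medians.append(median(sizes))
    return medians


# Hypothetical toy households (NOT HIES values): (TE/cap, HH Size).
toy = [(100, 9), (120, 8), (150, 8), (200, 7), (260, 6),
       (300, 6), (400, 5), (900, 4), (1500, 3)]
print(median_by_bucket(toy, 3))
```

In this toy data the bucket medians fall as TE/cap rises, the same qualitative pattern as in the table.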

Both the graph and the table provide us with the same information. As we go up the TE/cap groups, the median HH Size decreases. This supports the idea that wealthier families have fewer children. But it also supports the reverse causality. That is, having more members in a Household reduces the amount of money available per member. That is, large HH Size leads to poverty. Understanding causality is of essential importance, but this cannot be learnt from the data: the data does NOT provide the information required to learn about the causal directions.

**Conclusions**

We have done this data analysis without any assumptions about randomness. Even though we are using the word “distribution” to describe the data, this is just an observed pattern that the data follow. We are NOT making any assumption that the data is a random draw from any distribution at all. Fisherian old-school statisticians will find this terminology very confusing, because we are using similar words with different meanings. For example, the Median-Lines are a description of the “conditional distributions” of HH TE/cap given HH Size and also of HH Size given HH TE/cap. More discussion of this subtle issue will be given later in the course.

Both Median Lines show that wealthier families have fewer children; conversely, small HHs correspond to higher TE/cap. Note that the variables have been CAREFULLY chosen: this result holds for TE/cap but not for TE. Real Statistics requires relating data series to real concepts, not just treating them as numbers. Our Median-Lines show ASSOCIATION between the two variables. CAUSALITY cannot be learned directly from the data. The last point of great importance is that the relationship between HH Size and HH TE/cap is not deterministic. At any given HH Size, we have a large range of Households with very different TE/caps. Similarly, in every income (TE/cap) group, there is a large range of HH Sizes. How to understand these “flexible” relationships, also called “stochastic” relationships, will be the subject of the next portion of this lecture.
