bit.ly/DSIA03E – Part E of Lecture 3 in Descriptive Statistics: An Islamic Approach. This lecture examines the effects of varying bin sizes on histograms.
Preliminary Remarks: Mistaking the Map for the Territory
To understand (simplify & reduce) data, it is useful to construct a statistical model for it. If the data follow a theoretical distribution, then they can be described by a formula. The DISTRIBUTION of the data may be identified with a theoretical distribution (like the Normal). If this is true, it allows us to substantially reduce the data set, since Normal distributions are completely characterized by only two numbers: the mean and the standard deviation.
A good way to identify the Data distribution is to look at the HISTOGRAM – a picture of the data. But, as we will see in this lecture, there are many possible histograms for the data, depending on the bin size. A traditional question is: What is the BEST model for the data? In the current context, what is the best bin size for making a histogram? This is the WRONG question. Data is primary, models are secondary. Different types of models describe different aspects of the data. As we decrease the bin size, we get a more refined picture of the data. At each level of refinement, histograms illuminate different aspects of the data. There is no one BEST bin size. We will illustrate this general concept by examining the histogram for Life Expectancies for 190 countries in the WDI data set for the year 2018.
We start by looking at the Default Histogram for 2018 Life Expectancy for 190 countries in WDI. The Histogram goes from MIN=52.8 to MAX=85.0 and makes 7 bins of equal size, where Bin Size = 4.6 years.
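The bin-size arithmetic above can be checked with a short sketch. The function name equal_width_edges is hypothetical, and the inputs are the rounded MIN, MAX, and bin count quoted in the text; this is plain Python, not a description of how EXCEL chooses its default bins.

```python
def equal_width_edges(lo, hi, n_bins):
    """Return the n_bins + 1 edges that split [lo, hi] into equal-width bins."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# Rounded MIN and MAX quoted in the text for the default 7-bin histogram.
edges = equal_width_edges(52.8, 85.0, 7)
bin_size = edges[1] - edges[0]

print(round(bin_size, 1))            # bin size in years
print([round(e, 1) for e in edges])  # the 8 edges that bound the 7 bins
```

Changing the last argument to 1, 2, 3, or 4 gives the edges of the coarser histograms discussed in this lecture.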
Starting with this as a baseline, we will examine the effects of making the bin size smaller or larger. In general, if the bin size is too large, all the data goes into one bin and details are lost. On the other hand, if the bin size is too small, every bin contains only one or zero data points and the groupings in the data are not VISIBLE from the graph. The 7 bins above are a compromise between these two opposing effects, as we will soon see.
We start with the Coarsest Histogram with One Bin Only:
From this, we learn the RANGE of the data: it varies from MIN=52.805 to MAX=84.934. This is a COUNT histogram. We learn that there are 190 countries in the data set from the vertical axis. Later we will study a PERCENTAGE or PROBABILITY histogram. This gives us the proportion of the population in a given bin. From a probability histogram, we would not learn the count, since only 100% would appear on the vertical axis.
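The difference between a COUNT histogram and a PERCENTAGE histogram can be sketched as follows. The data here are SYNTHETIC stand-ins (190 made-up values in the range quoted above), not the WDI figures, and the function names are hypothetical.

```python
import random


def count_histogram(values, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land at index n_bins, so clamp it
        # into the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


def percentage_histogram(values, n_bins):
    """Convert bin counts into percentages of the whole data set."""
    counts = count_histogram(values, n_bins)
    return [100.0 * c / len(values) for c in counts]


# SYNTHETIC stand-in data (not the WDI figures): 190 made-up values.
rng = random.Random(0)
values = [rng.uniform(52.8, 84.9) for _ in range(190)]

print(count_histogram(values, 1))       # one bin: the count is the sample size
print(percentage_histogram(values, 1))  # one bin: only 100% appears
print(count_histogram(values, 3))
print(percentage_histogram(values, 3))
```

With one bin, the count histogram reports the sample size, while the percentage histogram reports only 100%, so the count cannot be read from it.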
Next let us look at a histogram with only two bins:
Two Bins divide the range from 52.8 to 84.9 into two equal parts. The midpoint of the range is 68.8. 55 countries are in the first bin, with Life Expectancy below the midpoint, while 135 countries are in the second bin. Clearly, the distribution is NOT symmetric. From this graph it is obvious that the Normal distribution would NOT be the right model for this data set.
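To see why an uneven two-bin split argues against a Normal model, here is a small sketch that builds a perfectly symmetric sample of 190 values and splits it at the midpoint of its range. The sample is HYPOTHETICAL: the mean of 70 and standard deviation of 7 are illustrative assumptions, not estimates from the WDI data.

```python
from statistics import NormalDist


def two_bin_counts(values):
    """Split the range at its midpoint and count the values on each side."""
    midpoint = (min(values) + max(values)) / 2
    below = sum(1 for v in values if v < midpoint)
    return below, len(values) - below


# HYPOTHETICAL symmetric sample: 190 evenly spaced quantiles of a Normal
# distribution with an assumed mean of 70 and standard deviation of 7.
model = NormalDist(mu=70, sigma=7)
symmetric_sample = [model.inv_cdf((i + 0.5) / 190) for i in range(190)]

print(two_bin_counts(symmetric_sample))  # an even split for symmetric data
print((55, 135))                         # the split reported in the text
```

A symmetric distribution puts about half the data on each side of the midpoint of the range, which is far from the 55 versus 135 reported for Life Expectancy.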
The 3 Bin Histogram divides countries into three categories – high, middle and low Life Expectancy. The Low LE Bin goes from 52.8 to 63.5, and has only 27 countries. The middle LE bin goes from 63.5 to 74.2, and has 71 countries. The high LE bin goes from 74.2 to 84.9, and has 92 countries:
What is very surprising is that the largest number of countries are in the highest category. The MODE is the bin (or category) which has the largest number of countries. The graph shows that the Mode is in the last bin. WHY is this very surprising? That will become clear if we look at the histogram of these same 190 countries classified by GNP per capita in the same year 2018. This is graphed below:
This 3 bin Histogram of GNP per capita, in PPP terms, constant USD, shows that only a few countries belong to the high GNP category, and the vast majority belong to the low GNP category. This shows that EVEN countries in the bottom-third income category can achieve high life expectancies for their populations. This means that cheap and simple measures are sufficient to lower mortality rates substantially. A country does not need to wait to grow rich in order to take effective measures to improve the health of its population.
The 4 Bin Histogram divides the range into four categories, with Bin Width = 8 years:
In this histogram, the Modal Bin is [68.8, 76.9] with 75 countries. There are only 59 countries in the highest bin, going from 76.9 to 84.9. The graph suggests that it is relatively easy to get LE up to 70, and much harder to get it up to 80. To learn more about this, we need to look at the mortality rates in each age group. By comparing countries with low and high mortality rates, we can learn where the greatest potential for improvement lies. To realize this potential, we need to investigate carefully the causal determinants of mortality.
As we go through graphs of 5, 6, 7, 8, and 10 bins, we get more information about how the data divides into different kinds of groupings. At each level of refinement we get more information about the data, and we also pick up some visual patterns not visible at other levels of refinement. However, as we increase the level of refinement, we start losing the ability to look at the graph and interpret it directly and visually. Here is the histogram with 20 bins:
There are FOUR modes in this histogram. When the number of countries in a bin is small, countries can fall into a bin or out of it by statistical accident. When you have two bins, High and Low, classifications would be robust to small errors – regardless of how you compute it, the classifications would remain the same for most countries. However, when you make up a large number of categories, this is no longer true, and classification can be much affected by small errors in the data. Thus the number of countries displayed in the graph is NOISY – it is much affected by errors. As we make the bins even smaller, the noise increases even more and the patterns in the data are no longer visible.
With 200 bins, it is very hard to see any of the patterns in the data that were easily visible when the bin sizes were larger:
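The sensitivity of fine classifications to small errors can be illustrated with a small simulation. Everything here is an assumption made for illustration: the data are SYNTHETIC stand-ins rather than the WDI figures, the measurement error is taken to be Normal with a standard deviation of half a year, and the function names are hypothetical.

```python
import random


def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin on [lo, hi] that contains value."""
    width = (hi - lo) / n_bins
    i = int((value - lo) / width)
    # Clamp so that perturbed values outside [lo, hi] stay in the end bins.
    return max(0, min(i, n_bins - 1))


def count_reclassified(values, noisy_values, n_bins):
    """Number of data points whose bin changes after a small perturbation."""
    lo, hi = min(values), max(values)
    return sum(
        bin_index(v, lo, hi, n_bins) != bin_index(w, lo, hi, n_bins)
        for v, w in zip(values, noisy_values)
    )


# SYNTHETIC stand-in data (not the WDI figures), plus an ASSUMED
# measurement error with a standard deviation of half a year.
rng = random.Random(1)
values = [rng.uniform(52.8, 84.9) for _ in range(190)]
noisy_values = [v + rng.gauss(0, 0.5) for v in values]

for n_bins in (2, 20, 200):
    moved = count_reclassified(values, noisy_values, n_bins)
    print(n_bins, "bins:", moved, "of", len(values), "points change bin")
```

With two bins, only the few points near the single boundary can switch sides. With many narrow bins, the same small error is large compared to the bin width, so many more points are reclassified.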
There is a paradox here. Technically, all the information in the histograms with coarse bins is actually contained in the histogram with refined bins. It is just that our eyes do not process this kind of information well; we cannot convert the picture to the patterns visible in the histograms with larger bin size. Good bin sizes balance objective information with our subjective capabilities to process information. The default bin size chosen by EXCEL gives a fairly good picture of the data. Note that the information in a histogram becomes visible only AFTER we make the graph, so choosing an “optimal” bin size in advance is impossible. The choice of bin size gives us a Distribution which provides a MODEL for the data, but there is no TRUE model. All models are approximations that enable us to summarize the data and understand it.
Deeper understanding requires examination of mortality rates and their causal determinants. This requires going further, beyond the data sets, into examining mortality rates, classifying them by type, and examining the causes of each type. Numbers give us clues about the real world, but are never the goal of the analysis. Statistical analysis must be followed up by examining the real world issues that the numbers highlight.
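The claim that the fine histogram contains the coarse ones can be made concrete: adding up adjacent fine bins rebuilds a coarser histogram. This is a minimal sketch on SYNTHETIC stand-in data (not the WDI figures) with hypothetical function names. It works exactly only when the coarse bins are unions of fine bins, for example 200 bins merged into 20, 4, or 2; a 7-bin histogram cannot be rebuilt exactly this way, because 7 does not divide 200.

```python
import random


def count_histogram(values, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the maximum value into the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


def merge_bins(counts, group_size):
    """Sum each run of group_size adjacent fine bins into one coarse bin."""
    return [
        sum(counts[i:i + group_size])
        for i in range(0, len(counts), group_size)
    ]


# SYNTHETIC stand-in data (not the WDI figures).
rng = random.Random(2)
values = [rng.uniform(52.8, 84.9) for _ in range(190)]

fine = count_histogram(values, 200)
print(merge_bins(fine, 10))   # a 20-bin histogram rebuilt from the 200-bin one
print(merge_bins(fine, 50))   # a 4-bin histogram
print(merge_bins(fine, 100))  # a 2-bin histogram
```

The reverse is not possible: the 2-bin counts cannot be split back into 200 bins, which is the sense in which the refined histogram carries more information even though it is harder to read.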
LINKS TO RELATED MATERIALS
Lecture 1: Distinguishing features of an Islamic Approach to Statistics – In four parts: bit.ly/dsia01a, b, c, d
Lecture 2: Comparing Numbers: Comparing multidimensional qualities necessarily involves values, and hence most rankings are subjective, not objective measures of external reality. In six parts: bit.ly/dsia02a, b, c, d, e, f
Lecture 3 (Current Lecture) on Life Expectancies: Part A explains that Life Expectancy is a one-dimensional numerical measure, and hence objective. Part B describes how LE is computed in detail, and what these numbers mean. Part C makes a start on analyzing World Bank WDI data for 190 countries from 1960 to 2017 on Life Expectancies. Part D constructs, analyzes and interprets HISTOGRAMS for this data set. This Part E analyzes the effects of changing bin size on Histograms. Shortlinks are bit.ly/dsia03a, b, c, d, e