[bit.ly/dsia05D] – Part D of Lec 5: Descriptive Statistics – An Islamic Approach. In previous lectures, we explored some of the reasons why the foundations of modern statistics, constructed by Sir Ronald Fisher, are deeply flawed. In this lecture we explain the basics of our alternative approach to the subject.
This lecture will explain how we can re-build Statistics on new foundations. To do this, we will first explain the foundations of conventional statistics – which may be called “nominalist” or Fisherian statistics. Then we will explain the alternative approach we propose, naming it REAL statistics. Our goal in this lecture is to provide clarity on the differences between the two approaches.
The Fisherian approach is based on fancy mathematical models which are purely IMAGINARY – that is, the models come from the imagination of the statistician, and have no corresponding object in reality against which they can be verified. A Fisherian MODEL for data ALWAYS involves treating SOMETHING as a perfectly random sample from a hypothetical infinite population. However, there is flexibility in what that “something” may be – and it is this flexibility that is deadly, allowing us to prove anything we like. The flexibility was not originally part of Fisher’s approach: he proposed to model the data directly. Later workers “generalized” his approach to make it applicable to a wide variety of data sets and situations. This generalization was dangerous because it makes unverifiable assumptions about unobservable entities and uses these as the basic engine of inference. In contrast, Fisher’s original approach made assumptions directly about the data, and hence was easier to assess and understand, although equally difficult to prove or disprove.
The typical use of this imaginary methodology involves breaking the data into two components: DATA = LAW + ERROR. The LAW is drawn from a flexible class of models which you believe to be true. This flexibility makes the ERROR unobservable, because it shifts as you try out different potential laws. This gives you a HUGE potential for constructing ANY LAW you like to explain the data – whatever is unexplained by the law AUTOMATICALLY becomes part of the ERROR.
We illustrate how this methodology allows us to prove anything at all. Take any data, and decompose it as DATA = Desired Law + Error. This is always valid, by DEFINING Error := DATA – Desired Law. Now make STOCHASTIC assumptions about the Error, in rough conformity with the errors obtained under your desired law. Current methodology allows us to make almost any assumptions we like about the error. The beauty of the stochastic assumptions is that a wide range of numbers satisfy them. If we say that the errors follow some common distribution (that is, they are random draws from a hypothetical infinite population), it is very hard to assess whether or not this is true. The difficulty is increased because a flexible range of laws makes it difficult to pinpoint the actual errors in order to check the stochastic assumptions. Furthermore, conventional methodology generally does not even bother to test the assumptions on errors, making it even easier to prove that any model conforms to the data.
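To make this argument concrete, here is a minimal Python sketch, using hypothetical randomly generated data and two arbitrarily chosen candidate laws. It shows that the identity DATA = Desired Law + Error holds exactly for ANY law whatsoever, simply because the Error is DEFINED as whatever the law leaves unexplained:

```python
import numpy as np

# Hypothetical data: any 100 numbers will do for the argument.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)
x = np.arange(100)

# Two completely different "desired laws" (both arbitrary choices):
law_a = np.full(100, data.mean())   # a constant law
law_b = 0.5 * x                     # an arbitrary linear trend

# In each case, Error := DATA - Desired Law, so the decomposition
# DATA = Desired Law + Error holds exactly, by construction.
error_a = data - law_a
error_b = data - law_b

assert np.allclose(law_a + error_a, data)
assert np.allclose(law_b + error_b, data)
```

Both decompositions reproduce the data perfectly; nothing about the data itself distinguishes the “right” law. Any case for one law over the other must rest on the stochastic assumptions placed on the errors, which is exactly the point made above.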
The key illusion created by conventional statistical methods is based on a misunderstanding of the nature of statistical models. ALL statistical inference is based on the IMAGINARY stochastic model regarding errors. HOWEVER, textbooks create the widespread belief that inference comes from the DATA! This is what permits us to “LIE with statistics”. Making complex assumptions about errors allows us to achieve any kind of inference, and attribute this to the data. Then we can browbeat people by telling them that we have made a deep analysis of the data, and the truths we have uncovered cannot be accessed by ordinary people not trained in the mysteries of sufficient statistics. In fact, the inferences come from unverifiable assumptions about unobservable errors.
In opposition to this, we propose an alternative, which we will call REAL Statistics. At the heart of this approach is the idea that the data provide us with CLUES about underlying realities. The goal of inference is NOT the DATA itself. Rather, the GOAL is to use the data to UNDERSTAND the real-world processes which generated the data. This NECESSARILY involves going beyond the data. Conventional statistics treats only the data; Fisher explained that the goal of statistics is to reduce large and complex data sets to a few numbers which adequately summarize the data and can be understood. Today, because of advanced computational capabilities, we are able to handle large data sets directly, and can move beyond this idea of statistics as the reduction of data sets.
Our approach radically changes the task of the teacher of statistics, requiring the creation of new textbooks as well as more training. We must ALWAYS look at the DATA set together with the REAL WORLD PROBLEM under study with the help of that data set. We can NEVER study DATA sets in isolation, as collections of numbers. Thus teachers will have to acquire knowledge and expertise going beyond the numbers, to the real world phenomena which generate the numbers.
Another way to understand conventional statistics is to say that it has the following GOAL: find STOCHASTIC patterns in the data. These patterns allow us to treat the data as a random sample from an IMAGINED population. There is NO WAY to assess the validity of this imaginary assumption. The pattern is in the eye of the beholder, and cannot be matched against real structures to see whether it is “true”. The standard methods to assess the validity of patterns are goodness of fit, prediction, and control. These are central to conventional methodology, but of peripheral interest in the REAL methodology. To understand why the search for patterns fails, consider the forecasting competitions run by the International Journal of Forecasting (IJoF). For many years, the IJoF ran a competition in which researchers were invited to submit algorithms for finding patterns in data, and for using these patterns to predict the next few data points. The IJoF tried these different pattern-finding algorithms on thousands of real world data series to see which one worked best. But these competitions did not yield any consistent results. Different types of algorithms would perform differently across series, with unpredictable patterns of performance. This becomes perfectly understandable from the REAL statistics perspective. An algorithm will perform well if and only if the pattern it discovers matches the underlying real world structures which generate the data. These structures differ widely across the data series, and so no one algorithm could find them all. It is only after we know the real world context that we can search for the right kind of pattern. Without checking for a match to reality, we are just ‘shooting in the dark’, and completely random forecasting results are to be expected. For more details, see A Realist Approach to Econometrics.
We come to the question of HOW to do REAL statistics. The basic goal is to look at the BEHAVIOR of the data to get CLUES about the operation of the real world. Note that this step – looking at the data – was NOT POSSIBLE when Fisher created his methodology, which was brilliant for its time. Given 1000 points of data, it was a massively laborious task to graph the data, or to create histograms which provide a picture of the data distribution. Now, we can do this with one click. The ultimate GOAL is to discover the CAUSAL EFFECTS, or UNOBSERVABLE OBJECTS, which give rise to the patterns we see in the data. But the first step is just to be able to look at the patterns in the data, without imposing preconceived patterns on them, as the Fisherian approach requires. Descriptive Statistics is about LEARNING to look at the data in a way which leads to LEARNING about the real world. The real world is characterized by unobservable objects and unobservable causes. But before we can learn about these deeper realities, we must learn how to read the surface – the appearance of the data. An early approach to “just looking at the data” was pioneered by Tukey, under the name of Exploratory Data Analysis. EDA was a collection of techniques for looking at the data. However, it was consistent with, and complementary to, the Fisherian approach. The goal was to see whether the data patterns would validate a Fisherian model for the data, or whether they would suggest some alternative theoretical models. EDA looks at the data in order to generate a Fisherian hypothesis about the data – NOT a hypothesis about the real world process which generates the data.
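As an illustration of how cheap “looking at the data” has become, here is a minimal Python sketch (using hypothetical random data, and only NumPy) that bins 1000 points and prints a crude text histogram – the kind of picture of a data distribution that was a massively laborious task in Fisher’s day:

```python
import numpy as np

# Hypothetical data set of 1000 points; in practice this would be real data.
rng = np.random.default_rng(42)
data = rng.normal(size=1000)

# One line gives the picture of the distribution: counts per bin.
counts, edges = np.histogram(data, bins=20)

# A quick text histogram: one bar per bin, one '#' per 5 observations.
for count, left_edge in zip(counts, edges[:-1]):
    print(f"{left_edge:6.2f} | " + "#" * (count // 5))
```

This is only the first step of the REAL approach – the surface reading of the data – but it shows that no simplifying distributional assumption is needed merely to see the shape of the data.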
The TASK of a teacher of REAL Descriptive Statistics is much more difficult than that of a conventional statistician. Biometrics, for example, is the application of statistics to biological problems; the teacher must know some biology in addition to statistics. The real world context has a dramatic effect on HOW to analyze the numbers. We illustrated this with the study of inflation, where the discussion required understanding WHY inflation matters, and WHY we are trying to measure it. Different numbers and different techniques become useful according to the different uses for these inflation numbers.
Since there is no universal collection of methods valid for all contexts, teaching can only be done by apprenticeship, via case studies. Within any real world context, we must learn about the real world in order to understand the linkages between it and the numbers which measure its aspects. We must know the MEANING of the numbers, not just the numbers. This necessarily requires going beyond conventional statistics, which deals only with the analysis of numbers. No template for analysis can be given to students. Rather, by teaching how to think about numbers in different real world contexts, we hope the student will learn some ways of thinking which can be applied more generally. This is like the “case study” method now popular in business schools. In this course, we will illustrate this methodology in different contexts.
In this course, we are trying to learn HOW to LOOK at DATA, because this is a first and introductory course. Learning about and analyzing deeper real world objects and causes is very much a part of REAL statistics, but requires advanced methods, suitable for later courses. We note that the techniques of “Data Visualization” enabled by computers were far beyond the reach of researchers a few decades ago. Making a histogram, or a graph, of 1000 data points was an extremely laborious task. Now it can be done with a click. It is NO LONGER necessary to make convenient simplifying assumptions – as in the Fisherian approach to statistics. This leads to a radical conclusion: a HUGE amount of extremely sophisticated mathematical theory is PURELY IMAGINARY and can be thrown out of the window! We can temper this radical conclusion by noting that there are certain limited contexts where the Fisherian probability models provide an adequate match, or even an excellent match, to the actual data. In such cases, the original methods continue to be valid and useful, as supplements to the more general approaches to be studied in REAL Statistics.
Links to Previous Lectures.
Motivation and explanation of the Islamic approach is given in the first lecture: 1A: Descriptive Statistics: Islamic Approach, 1B: Purpose: Heart of An Islamic Approach, 1C: Eastern & Western Knowledge, and 1D: How to Teach & Learn: Islamic Principles.
Currently, this course is under development, and is being offered for beta-testing as a free online course, with the expectation of getting useful feedback for the final version. You can register for the course at the Al-Nafi Portal: Descriptive Statistics: An Islamic Approach.