Real Statistics: A Radical Approach

Asad Zaman
Jun 16, 2022

This is an announcement of an upcoming (online/live) course meant to replace the conventional approach to statistics. The current approach was developed in the early 20th century on logical positivist foundations. Logical Positivist philosophy has since been shown to be defective and has been rejected, yet the necessary rethinking of the foundations of modern statistics has not been carried out. In this article, we show how positivist foundations lead to three major flaws in the foundations of modern statistics. We invite interested readers to join a course in Introductory Statistics, based on a realist philosophy, which remedies these flaws and calls for a revolution in the discipline. To register for this online course, fill out the Google Form at http://bit.ly/RSRA000. The rest of this article explains the defects in modern statistics which make a new approach to the subject necessary.

The Wikipedia article about "How To Lie With Statistics" shows that this is the most popular statistics book, with sales larger than those of all other statistics textbooks combined!! WHY?

Three Major Defects in Modern Statistics: These defects run through the entire structure of knowledge created in the West over the past few centuries, and they make it easy to lie with statistics. They concern three foundational questions:

· Ontology: What exists
· Epistemology: What can we know about the world (and what exists)
· Methodology: What are the correct methods to acquire knowledge

The answers to all three, as built into modern statistics, were created in the early 20th century on the basis of the philosophy of Logical Positivism. This philosophy had a spectacular crash in the mid 20th century. BUT: the foundations of statistics were never revisited!! The planned course rebuilds statistics by replacing these foundations.

Historical Background of European Intellectual Developments

The philosophy of Logical Positivism is a theory of knowledge which arose in the West due to peculiar historical circumstances outlined below:

1. Europeans reconquer advanced but decaying Islamic Spain (Al-Andalus).

2. Translations of the millions of books in the libraries of Al-Andalus end the dark ages of Europe.

3. Influx of new knowledge creates a battle between “science” and religion which lasts for two centuries, and splits Christianity.

4. Bloody fratricidal battles between Christian factions necessitate building knowledge on secular grounds, equally acceptable to all factions.

The Key (false) Assumption: There exists secular knowledge! That is, we can start from ZERO, and build knowledge purely on the basis of observations and logic. This is illustrated by Descartes: "I think, therefore I am!" This attempt to prove self-existence, starting from complete doubt about everything, is logically flawed. The attempt reflects the trauma of Loss of Faith, created by religious wars. What everyone believed with certainty for centuries turned out to be wrong! European intellectuals faced the question of "How can we build knowledge which we can be sure of?"

We will use the term “Epistemological Arrogance” for the stance that human beings can start with zero knowledge, and use observations and logic to arrive at certain knowledge about the world we live in. This epistemological stance, also known as the Quest for Certainty, has been responsible for some disastrous errors in European philosophies. To understand this better, it is useful to contrast it with the Epistemological Humility which the Quran teaches: (17:85) “You (mankind) have been given very little knowledge”. Starting from the realization that our knowledge is limited leads to a radically different theory of knowledge than those dominant in the West. In particular, we reject the idea that knowledge is justified true belief, because most of our knowledge is conjectural, and cannot be justified or proven to be true.

Since religious wars created doubts about God, Descartes starts his theory of knowledge with ONTOLOGY: What exists? In particular, does God exist? The problem is that human reason is too weak and fallible to lead to certainty. The argument of Descartes, for example, is clearly wrong, for two different reasons. First, use of the word "I" already presumes existence. More generally, deductive arguments can NEVER produce new knowledge: deductive logic consists of reasoning from premises to conclusions, so A => B means that B is derived from knowledge already contained in A. Other types of reasoning (inductive, abductive) can go beyond the premises, and guess at higher-level patterns and hidden structures of external reality. However, such reasoning is based on intuition and guesswork, and can never lead to certainty. Having been burnt by blind faith in Christianity, European intellectuals rejected these branches of reason.

The second reason is deeper: it is the rejection of sense experience, and of the testimony of the heart, as a basis for knowledge. The Quran frequently refers to the heart as an instrument for perception of the truth: (7:179) "They have hearts with which they do not understand, they have eyes with which they do not see, and they have ears with which they do not hear." The quest for certainty created by the trauma of loss of faith placed impossible demands on European intellectuals. In particular, they rejected all modes of reasoning which could lead to the existence of God. Logical Positivism was a response to these demands.

Logical Positivism: This is a theory of knowledge which attempts to arrive at certainty. Only two valid sources of knowledge are acknowledged: our observations and logic. This leads to:

· Ontology: We can only be certain about existence of what we can perceive with our five senses.

· Epistemology: Knowledge comes solely from observations and logic.

To make decisions, we need to analyze our experiences, in order to guess at the consequences which will follow from our actions. Thus a theory of knowledge should tell us how to derive lessons from experience. This leads to:

Positivist Methodology: Knowledge comes from patterns in the observations that we see. Understanding comes from recognizing a pattern.

ALL of these epistemological principles are built into the foundations of statistics; more generally, they ground all of the intellectual efforts of European philosophers over the past few centuries. But all of these principles are WRONG! The blinders created by Logical Positivism make it impossible to do statistics correctly, as we now explain.

Central Concepts of Statistics Are Unobservable:

The trauma of loss of faith in God led European intellectuals to reject all unobservables, and to rely only on observables as a basis for knowledge. This created fundamental problems, because two key concepts required for statistics are both unobservable:

1. Probability: Correctly defined as being about what might have been.

2. Causality: Necessary connection, not accidental correlation.

Modern Statistics has NO satisfactory definition of either of these concepts. Both frequency theory and Bayesian (subjective) probability are seriously flawed. There is no way to deduce causality from data alone, yet correlations are routinely used to infer causation, because there is no alternative. Logical Positivism makes it impossible to come up with a correct definition of probability, and also to differentiate between correlation and causation. Both of these defects make it impossible to do statistics correctly.
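
To make the point about correlation and causation concrete, here is a minimal simulation sketch (the mechanism and the numbers are invented purely for illustration, not taken from the course): two variables with no causal link to each other are nevertheless strongly correlated, because a hidden common cause drives both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A hidden common cause (confounder), unobservable in the data set itself.
z = rng.normal(size=n)

# x and y are each driven by z plus independent noise;
# neither variable has any causal effect on the other.
x = 2.0 * z + rng.normal(size=n)
y = -1.5 * z + rng.normal(size=n)

# The numbers alone show a strong (negative) correlation between x and y,
# even though intervening on x would change nothing about y.
print("corr(x, y) =", round(float(np.corrcoef(x, y)[0, 1]), 2))
```

Nothing in the observed pairs (x, y) distinguishes this situation from one in which x genuinely causes y; making that distinction requires a hypothesis about the unobservable mechanism which generated the data.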

Quantification of Knowledge:

Use of knowledge about physics as a model of all human knowledge has led to deep misunderstandings about the social sciences. Lord Kelvin’s Dictum may well be suitable for physics, where accurate measurements have led to advances in understanding:

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.

This idea has been expanded to the social sciences. Psychologists attempt to measure and quantify intelligence, trust, corruption, and many other human characteristics which are normally considered to be qualitative and unmeasurable. Business Schools teach that “If you cannot measure it, you cannot manage it (or improve it)”. These misguided efforts come from the positivist views about knowledge. In our new approach, the adjective “Real” refers to the opposing realist view. It is illuminating to contrast these two views via a diagram:

Positivist views consider the most precise knowledge to be knowledge expressed in numbers:

[Figure: Positivist knowledge is about appearances]

Realist views regard knowledge to be about hidden real-world structures:

[Figure: Realist knowledge is about unobservable reality]

These radically different views about the nature of knowledge translate into radical differences in methodology, the way we pursue knowledge. According to positivist methodology, knowledge is about observables, and the most accurate forms of knowledge arise from quantification of the observables. Accordingly, statistics deals with the analysis of data sets, which provide us with the most accurate forms of knowledge available. The aphorism that "Numbers don't lie" captures the spirit of conventional statistics. This is dramatically different from the realist methodology, which considers knowledge to be about the unobservables, the hidden entities and mechanisms of the real world. The observables, together with qualitative and quantitative efforts to measure them, create the data. The data provide only imperfect and remote clues about the real world. Since essential knowledge is primarily about the unobservable real world, the data are never the primary object of interest, and data analysis can never confine itself to the numbers alone.

Concrete Example: It is useful to provide a concrete example to illustrate these abstractions. Suppose our goal is to improve the quality of research at XYZ University. The problem is that quality of research is unobservable, unmeasurable, and unquantifiable. We consider the radical differences between a Positivist and a Realist approach to solving this problem.

Positivist Solution: No knowledge is possible about what cannot be observed. So we must translate the unobservable "quality" into observable indicators of quality, and use numbers to quantify these observables. This leads to counting publications in journals, and calculating the impact factor of these publications, to get a numerical measure of quality. This approach is actually dominant around the world these days. The result has been a dramatic increase in the quantity of low-quality articles published, a dramatic increase in fake journals which will publish anything, and a dramatic increase in fraudulent methods of inflating impact factor counts. Departments around the world have increasing numbers of faculty who have learned how to play the numbers game, but have very little idea of how to do research.

Realist Solution: We start by thinking about the real-world problem we are trying to solve. Why do we want to improve the quality of research? In the Real Statistics course, our goal is to provide service to mankind, for the sake of the love of God. With this goal in mind, quality of research refers to how much that research is able to help solve the problems facing us collectively. Thus research should be judged on the basis of how useful it is in practical terms. We reject the theory/practice divide of conventional methodology. Instead, we advocate the development of theory as a solution to an actual real-world problem. It is only in this way that we can judge the utility of research. Thus, a real solution would be to advocate applied research, which develops theory and goes on to apply it to the solution of concrete problems. Then the value of research could be judged (qualitatively) in terms of its success in providing real solutions. Note how dramatically different this approach is from the positivist one.

Fisher’s Contribution: An Obstacle to Progress

Fisher developed a brilliant methodology for analyzing data sets using only the primitive computational capabilities of the early 20th century. This is now the dominant methodology, taught in all modern statistics textbooks. No one seems to have noticed that the rapid increase in computational capabilities has rendered it obsolete. Fisher's ingenious solution to the lack of computers was to IMAGINE that the data arise as a random sample from a parent population characterized by a small number of parameters. The data are then used to estimate the parameters of this parent population, and these few numbers capture all the information in the data. Fisher invented the theory of "sufficient statistics" and showed that parameter estimates based on the data, just a few numbers, capture all the information available in the data about the (imaginary) parent population. Fisher defined statistics as the "reduction of data". Fisher himself was aware that his methodology was a response to the lack of computational capabilities. His followers have built huge superstructures on this methodology without realizing that we no longer need to invent an imaginary parent population in order to reduce the data: modern computers can deal with huge data sets directly, without reducing them to a few numbers.
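
As an illustration of this "reduction of data" (a minimal sketch, assuming a normal parent population; the numbers are invented): once normality is assumed, the likelihood of the entire data set can be recomputed from just three numbers, the sample size, the mean, and the standard deviation, so the rest of the data can be thrown away without any loss of model-based information.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=3.0, size=1_000)  # invented example data

def loglik_full(x, mu, sigma):
    """Normal log-likelihood computed from every single observation."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

def loglik_reduced(n, xbar, s, mu, sigma):
    """The same log-likelihood, using only n, the mean, and the (MLE) std dev."""
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - n * (s**2 + (xbar - mu) ** 2) / (2 * sigma**2))

n, xbar, s = len(data), data.mean(), data.std()  # np.std here divides by n
print(loglik_full(data, mu=9.5, sigma=2.8))
print(loglik_reduced(n, xbar, s, mu=9.5, sigma=2.8))  # same value, up to rounding
```

Under the normality assumption, the mean and standard deviation are sufficient statistics. The point of the critique here is that this sufficiency is a property of the imagined parent population, not of the actual data.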

Modern statistics is severely handicapped by blind adherence to Fisherian methodology. There are (a few) cases where the data really are a random sample from a parent population; in these cases, Fisherian methodology is valid. For the vast majority of data sets, this is not true. The false assumption that the data are a random sample from a population causes massive distortions in statistical inference. For example, it is standard to assume, without justification, that the data come from a Normal distribution. In this situation, the mean and standard deviation are sufficient statistics: the imaginary parent population (not the actual data) is completely characterized by the mean and the standard deviation. BUT if there is no parent population, or if the parent population is not normal, vastly distorted inferences result. In particular, the mean and standard deviation, widely used as summary statistics for all data sets, can be extremely poor as data summaries.
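
A small sketch (with invented data) shows how little the mean and standard deviation reveal once the normality assumption fails: the two data sets below have essentially the same mean and standard deviation, but completely different shapes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Data set A: a genuine normal sample.
a = rng.normal(loc=0.0, scale=2.0, size=n)

# Data set B: a bimodal mixture with almost no observations near its own mean.
b = np.concatenate([rng.normal(-2.0, 0.1, n // 2),
                    rng.normal(2.0, 0.1, n // 2)])

for name, x in [("normal A ", a), ("bimodal B", b)]:
    near_mean = np.mean(np.abs(x - x.mean()) < 0.5)  # share of data near the mean
    print(f"{name}: mean={x.mean():6.3f}  sd={x.std():.3f}  "
          f"share within 0.5 of mean={near_mean:.2f}")
```

Summarized by mean and standard deviation alone, B looks just like A, and its "typical value" appears to be zero, a value the data essentially never take. Only looking at the full distribution, or hypothesizing the mechanism which produced the two clusters, reveals this.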

It is worth noting how Logical Positivism led to the development of the Fisherian methodology. The data measure observable and quantifiable aspects of a hidden reality. Positivism tells us that the measurements are all that matter; we do not need to recover the hidden reality which is being partially captured by these measurements. Measurements by themselves are an incoherent jumble: they need to be organized in order to be understood. Treating data as a random sample from an unknown parent population imposes a pattern on the data, and allows it to be understood. However, the imposed pattern may have no relation to the reality which generated the data. The data can have many different and complex relationships with reality. Our imaginary model CAN potentially match reality, but rarely does, and trying to match models to reality is not part of statistical training.

[Figure: Real models attempt to match reality]
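
As one small illustration (a sketch with made-up mechanisms, not a procedure from the course): suppose hidden reality generates skewed, income-like data, while the analyst imagines a normal parent population with matching mean and standard deviation. Simulating from the imagined model and comparing it with the observed data immediately exposes the mismatch, a check which conventional practice routinely skips.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Hidden reality": a skewed (lognormal) mechanism generates the observations.
observed = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# "Imaginary model": a normal population with the same mean and std dev.
simulated = rng.normal(observed.mean(), observed.std(), size=5_000)

# Compare what the imagined model produces with what reality actually produced.
for name, x in [("observed  (real mechanism)", observed),
                ("simulated (imagined model)", simulated)]:
    print(f"{name}: median={np.median(x):5.2f}  "
          f"99th percentile={np.percentile(x, 99):6.2f}  "
          f"share below zero={np.mean(x < 0):.2f}")
```

The imagined model reproduces the mean and standard deviation by construction, yet it predicts a substantial share of impossible negative values and badly misses the heavy right tail. Checking the model against the mechanism which generated the data, rather than against a few summary numbers, is exactly the shift in attention that Real Statistics calls for.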

Key takeaway

We need to understand how the data relate to reality. We need to reconstruct hidden reality from the clues provided by the data. BOTH of these central projects of REAL Statistics are absent from IMAGINARY Statistics. Instead, we MAKE UP a model for the data in our imagination, and we NEVER test this model for a match to unobservable reality. Real Statistics also requires a model for the data, BUT this is a hypothesis about how reality generates the data: the actual relationship between hidden reality and the observed data. Once we shift attention away from the data to the unobservable mechanisms which generate the data, an entirely new set of questions and concerns arises. This leads to a radically different methodology for the analysis of data sets. Teaching this methodology is the central concern of our upcoming course on Real Statistics: A Radical Approach.

To register for this free online course, sign up via the Google Form at http://bit.ly/RSRA000. The first live lecture will be on Sunday, June 26th at 9:00 AM Eastern Standard Time, USA. All those who sign up for the course will receive information about how to attend the live lecture, as well as access to the course materials. The new textbook we plan to cover over six months has 12 chapters. Each chapter has five sections, with a recorded lecture for each section. After the first live lecture, students will be asked to study the five recorded lectures for Chapter 1 on their own. Then, two weeks later, we will have another live lecture to discuss Chapter 1. Going through the 12 chapters in this way, with a live lecture for each chapter every two weeks, will take 24 weeks, or six months.


Asad Zaman

BS Math MIT 74, MS Stat 76 & Ph.D. Econ 78 Stanford