Data’s dangerous dogmas

The dogmas of data science are harmless for data-led companies whose value mechanisms are a by-product of their data. But companies whose data are a by-product of their value mechanisms need a more flexible approach.

Graeme Keith
5 min read · Jun 18, 2023
Photo by Guillaume Marques on Unsplash

Modern data analytics is characterized by three mythical dogmas:

  • Data are the only natural starting point for all analysis
  • The way to ensure objectivity is to suppress speculation and “simply” gather data — “the facts” — with righteous impartiality
  • Data “speak”. That is, they furnish us with models and explanations, through which they are synthesized and more deeply understood, and on the basis of which we can make informed decisions.

The downside of data dogmatism

Many of the great successes in data analytics have taken place in companies that primarily deal with data. These are companies whose primary activity (though not value mechanism) is to generate data, and who then, after the fact, look for ways to monetize these data. These companies need to curate the way they generate value according to the insight they can extract from analyses of their data.

But with the revolution in artificial intelligence and machine learning, institutions that do not primarily deal with data are looking more and more to their data to explain the world in which they operate and to address the business and policy problems they face. These companies need to curate their data according to the insight they can extract from analysis of the way they generate value.

Here the very success of data science is its undoing, because the principles that have been applied with so much success to data looking for a value proposition run aground on the sandbanks of these false dogmas when applied to a value proposition looking for data.

Data are not the only natural starting point for analysis

For analysis of phenomena with a view to forming conclusions or furnishing explanations, hypothetical explanations are an excellent starting point.

For decision-making, the objectives you are trying to achieve and the decision levers with which you are trying to achieve those objectives are the natural starting point, and far and away the best place to start.

In both cases, data play an important role, motivating potential explanations and informing the connections between decisions and data, but the relationship is a dialogue, not a linear sequence starting with data.

Data are not impartial

The selection of data necessarily presupposes relevance. But without some level of commitment to a presupposed explanation or model connecting the data to the problem at hand, we cannot know what is relevant and what is not.

The processing and cleaning of data suffer the same necessity: you cannot know what is signal and what is noise without some expectation of the data's ultimate significance, of what you think they are going to say and how they might relate to the problem at hand.

There are no data without conjecture. The notion of an objective fact is a chimera because the presentation of that datum as a fact already betrays a commitment to some more or less hidden conjecture that motivated the selection and presentation of exactly that datum in the first place.

Data are not necessarily sufficient

Not only are we silently guided by our suppressed conjectures in the selection of data; if we try to censor explanations, we also have no way to know whether the data we have gathered are those most relevant to understanding or solving the problems at hand, and no mechanism to motivate the search for further data that might provide confirmation, refutation or further insight.

Data do not speak, much less explain

No one has ever given any account of an objective, transparent process, much less a deductive process, by which an accumulation of data generates an explanatory theory, model or hypothesis.

Data do not speak. We cannot even interpret data without recourse to some theoretical context.

But what, then, of objectivity?

This desire for impartial data (out of which models and explanations should somehow inevitably emerge) arises from the fear of a judicious selection and distortion of data in the service of a pet theory or a hidden agenda. Here is Karl Popper:

…if we are uncritical we shall always find what we want: we shall look for, and find, confirmation, and we shall look away from, and not see, whatever might be dangerous to our pet theories. In this way it is only too easy to obtain what appears to be overwhelming evidence in favour of a theory…

But, as Popper goes on to argue, the futile attempt to collect data impartially does not address the problem of objectivity. Best case, we are as biased as ever, but — convinced of our own good data hygiene — we have persuaded ourselves we are not. Worst case, we learn to game the gathering of data and the generation of hypotheses, and we present the process as neutral and impartial in a secret, deceitful servitude to hidden agendas and pet theories. In any case, theorizing is essential in the discovery, selection and presentation of data.

Conjecture and refutation

Popper’s solution is to ensure objectivity by conjecturing multiple explanatory hypotheses and then shifting the discussion of objectivity to the adjudication of a critical contest between those conjectures.

The way in which knowledge progresses, and especially our scientific knowledge, is by unjustified (and unjustifiable) anticipations, by guesses, by tentative solutions to our problems, by conjectures. These conjectures are controlled by criticism…

Disposing of the notion that data “speak” liberates us to conjecture creatively and puts hypothesis back in its rightful place, in a dialogue with data, not subservient to them.

Data motivate multiple conjectures; the attempt to distinguish between these guides us in the selection of data and motivates the search for and discovery of new data, which in turn test the mettle of our conjectures in a crucible of critical discourse. Hypotheses are refined, rejected, confirmed, disconfirmed; but never proved, and never — more’s the pity for Popper — finally falsified either. We learn to live with that contingency.
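To make the contest between conjectures concrete, here is a toy sketch (my illustration, not from the article): two rival hypotheses about a coin's bias are adjudicated by the same data, shifting our relative confidence without ever "proving" either one.

```python
# Illustrative sketch: adjudicating between two conjectures with data.
# The hypotheses, the coin, and the numbers are all hypothetical.

def likelihood(p, heads, tails):
    """Probability of observing this particular sequence of flips
    if the coin's heads-probability is p."""
    return p**heads * (1 - p)**tails

# Conjecture A: the coin is fair (p = 0.5).
# Conjecture B: the coin is biased towards heads (p = 0.7).
heads, tails = 14, 6  # observed data: 14 heads in 20 flips

like_a = likelihood(0.5, heads, tails)
like_b = likelihood(0.7, heads, tails)

# With equal prior credence in each conjecture, the posterior odds are
# just the likelihood ratio. The data adjudicate between the conjectures;
# they do not generate either conjecture, and they settle nothing finally.
odds_b_over_a = like_b / like_a
print(f"Odds favouring the biased-coin conjecture: {odds_b_over_a:.1f} to 1")
```

The point of the sketch is Popper's: the data only become evidence once there are rival conjectures for them to discriminate between, and fresh data (more flips, or flips under new conditions) can always reopen the contest.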

Data are necessarily the nodes at which our conjectures intersect with the world. They are essential to the adjudication between theories, and we must do everything in our power to ensure they are not willfully distorted or suppressed. But by shifting the focus of our pursuit of objectivity to the contest between multiple conjectures, we free data from the illusion of independence and open up a much richer, more fruitful relationship between hypotheses and the data that inspire and regulate them.

Graeme Keith

Mathematical modelling for business and the business of mathematical modelling. See stochastic.dk/articles for a categorized list of all my articles on medium.