20 - What COVID-19 Can Teach Us About Data Science

This article was originally published on Towards Data Science.

The COVID-19 pandemic has been a tragic, yet illuminating case study in how valuable data (and by extension statistics and forecasting models) can be in a crisis. It has also exposed numerous flaws in society’s relationship with data. While as a data scientist it has been exciting to see terms like “agent-based simulation” and “exponential growth rate” enter the mainstream, it has also been frustrating to watch statistics being applied in such an unscientific way. In this article, I will look at three data science lessons I have taken away from this pandemic and how we can apply them more broadly.

# 1. Most People Do Not Understand Probability

Our ability as humans to recognise patterns has no doubt been very useful to us as a species, but it can also lead us astray. The truth is that a lot of statistics is not intuitive, and when it comes to probability, common sense is often misleading. To illustrate this, let us consider the act of shuffling a deck of cards (something that most people have done plenty of times and have built up some intuition around). I recently asked 10 of my friends to estimate the probability of shuffling a deck of cards into a completely new order (i.e. an order that has never been seen before with any deck of cards in human history). Most people guessed that the answer was 0 and reasoned that surely every possible ordering of the cards has been seen by now. Some, suspecting a trick, estimated that the probability might be higher, but nobody proposed a probability above 0.2. The true answer is that every time you shuffle a deck of cards (assuming a truly random shuffle) you can almost guarantee that it is the first time any deck of cards has ever been in that order. The total number of possible unique arrangements for a deck of 52 cards is 52! ≈ 8×10⁶⁷ or

80000000000000000000000000000000000000000000000000000000000000000000

To put this into context, if every single person alive today had shuffled a deck of cards every second since the start of the universe, that would only amount to a comparatively tiny 3×10²⁷ shuffles or

3000000000000000000000000000
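The arithmetic behind these two numbers can be checked in a few lines. This is a rough sketch: the population and age-of-the-universe figures are round-number assumptions of mine, not values given in the article.

```python
import math

# Number of distinct orderings of a 52-card deck.
orderings = math.factorial(52)

# Assumed round figures (not from the article).
PEOPLE_ALIVE = 7.8e9
UNIVERSE_AGE_YEARS = 13.8e9
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# One shuffle per person per second since the start of the universe.
total_shuffles = PEOPLE_ALIVE * UNIVERSE_AGE_YEARS * SECONDS_PER_YEAR

# Upper bound on the share of all orderings that could have been seen,
# assuming (generously) that no ordering was ever repeated.
fraction_seen = total_shuffles / orderings

print(f"52! = {orderings:.2e}")
print(f"total shuffles = {total_shuffles:.2e}")
print(f"fraction of orderings seen <= {fraction_seen:.2e}")
```

Even under these generous assumptions, the shuffles would cover a vanishingly small share of the possible orderings, which is why a fresh random shuffle is almost certainly new.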

Another famous example of our inability to interpret probabilities is the 2016 US presidential election. Leading up to the election, Donald Trump was given roughly a 25% chance of defeating Hillary Clinton by political pollsters, but of course he was ultimately victorious. This led to public condemnation of polling companies and pundits for “getting it wrong”. Now to be clear, this election did raise some legitimate methodological questions about how polling is done, but that doesn’t take away from the fact that we should never be surprised when something with a 0.25 probability of occurring does occur. Would flipping a coin twice and getting Heads both times (an event with the same probability) elicit similar outrage?
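The coin-flip comparison can be verified by enumerating the outcomes. The sketch below also includes a small helper, `prob_at_least_one`, which is my own illustrative addition: it shows how quickly a 25% event becomes likely to occur at least once across several independent occasions.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin flips: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Probability of two Heads, found by counting favourable outcomes.
two_heads = Fraction(
    sum(1 for outcome in outcomes if outcome == ("H", "H")),
    len(outcomes),
)

def prob_at_least_one(p, n):
    """Probability that an event with probability p happens at least once
    in n independent trials (hypothetical helper for illustration)."""
    return 1 - (1 - p) ** n

print(two_heads)                               # 1/4
print(prob_at_least_one(Fraction(1, 4), 4))    # 175/256
```

In other words, if four independent forecasts each give the underdog a 25% chance, we should expect at least one upset more often than not.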

The upshot of our probabilistic blind spots is that we are great at identifying patterns but not so good at drawing conclusions from these patterns. In the context of the pandemic, this renders us particularly susceptible to political sophistry and misinformation about the spread of the SARS-CoV-2 virus. Over the last 6 months I have seen countless pieces of “analysis”, many of which have been published in major news outlets, that break fundamental rules of statistics.

# 2. The Theory Matters

As data science has exploded in popularity over the last few years, the barriers to entry have been lowered significantly. Now anybody with a rudimentary programming background can access huge public data sets and play around with deep learning models. While overall I see this democratisation as a positive trend, I do fear that people are neglecting the rigorous and arguably less exciting aspects of data science like statistics, probability theory and mathematics in general. I am reminded of the old adage that a little knowledge is a dangerous thing: without a good understanding of these topics, an amateur data scientist simply has more tools with which to find spurious correlations. The COVID-19 pandemic has shown that the misuse of statistics is not just unscientific, it can be dangerous. There isn’t room in this article to explore all of the fallacious reasoning I have seen in this pandemic, but the two examples below are fairly typical.

# Example 1: Dr Phil

Image by Nathan Saad.

Dr Phil recently questioned the need for lockdown measures by comparing the number of deaths from COVID-19 to the 360,000 annual deaths from drowning in swimming pools. Now to start with, I have no idea where this number came from. Data from the Centers for Disease Control and Prevention suggest that the true number of annual unintentional drownings in the US is more like 3,500 [1].

The second problem with Dr Phil’s reasoning is that, unlike COVID-19, the risk of unintentional drowning is not multiplicative (i.e. it does not spread like a virus). If we observe 100 deaths from drowning in one week, there is no reason for us to expect that number to double the following week and double again the week after that. Without policies to curb transmission of the disease, this is the kind of growth you would expect to see with COVID-19.
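To make the difference concrete, here is a minimal sketch that contrasts a constant weekly risk with one that doubles every week. The starting value of 100 and the weekly doubling come from the hypothetical above; they are not estimates of real drowning or COVID-19 figures, and the function names are my own.

```python
def constant_risk(weekly_deaths, weeks):
    """Weekly deaths for a risk that does not spread (same count every week)."""
    return [weekly_deaths] * weeks

def doubling_risk(initial_deaths, weeks):
    """Weekly deaths for a risk that doubles every week (unchecked spread)."""
    return [initial_deaths * 2 ** week for week in range(weeks)]

WEEKS = 8
drowning = constant_risk(100, WEEKS)
unchecked_virus = doubling_risk(100, WEEKS)

for week, (d, v) in enumerate(zip(drowning, unchecked_virus), start=1):
    print(f"week {week}: constant={d:>4}  doubling={v:>6}")

print("total (constant):", sum(drowning))         # 800
print("total (doubling):", sum(unchecked_virus))  # 25500
```

Both series start at the same value, yet after only eight weeks the doubling series has accumulated more than thirty times as many deaths.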

The third error Dr Phil makes here is to imply that deaths from drowning and deaths from COVID-19 come from similar probability distributions. This is quite obviously untrue. As shown by Taleb, Bar-Yam, and Cirillo (2020), pandemics are fat-tailed and swimming pool deaths are not [2]. Consider this morbid hypothetical: if somebody told you that 10% of the US population had died in one year, would you think it was more likely they died from COVID-19 or accidental drowning? Drowning deaths (though tragic) do not pose an existential threat; pandemics do.
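As a rough illustration of what “fat-tailed” means, the sketch below compares the chance of an outcome 100 times larger than the average under a thin-tailed distribution (exponential) and a fat-tailed one (Pareto). The choice of distributions and the parameters (`alpha = 1.5`, both means set to 1) are illustrative assumptions of mine; they are not fitted to drowning or pandemic data and are not the method used in [2].

```python
import math

def exponential_tail(x, mean):
    """P(X > x) for an exponential distribution (thin-tailed)."""
    return math.exp(-x / mean)

def pareto_tail(x, x_min, alpha):
    """P(X > x) for a Pareto distribution (fat-tailed), valid for x >= x_min."""
    return (x_min / x) ** alpha

# Illustrative parameters: both distributions have a mean of 1.
MEAN = 1.0
ALPHA = 1.5
X_MIN = MEAN * (ALPHA - 1) / ALPHA  # Pareto mean is alpha * x_min / (alpha - 1)

extreme = 100 * MEAN  # an outcome 100 times the average

thin = exponential_tail(extreme, MEAN)
fat = pareto_tail(extreme, X_MIN, ALPHA)

print(f"thin-tailed P(X > 100 x mean) = {thin:.2e}")
print(f"fat-tailed  P(X > 100 x mean) = {fat:.2e}")
```

Under the thin-tailed distribution such an extreme outcome is effectively impossible, while under the fat-tailed one it remains a real, if small, possibility. That is the sense in which the two kinds of risk are not comparable.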

# Example 2: Arbitrarily Overlaying Curves

Another phenomenon I have observed is people arbitrarily overlaying trend charts and extrapolating wildly.

Warning NSW

Some analysis shows NSW on a worse path than VIC

This chart is concerning. Haven’t seen his work before but if you plot NSW path at same point as Vic .. Sydney and beyond are not well placed. pic.twitter.com/H9iUt5z0H1
— Rafael Epstein (@Raf_Epstein) July 23, 2020

The plot in this tweet compares the number of confirmed cases from community transmission in the Australian states of New South Wales and Victoria and was circulated widely. The apparent implication here is that SARS-CoV-2 is spreading through the community in NSW faster than it was a month earlier in VIC. Here are two of my concerns with this chart:

1. The 1 month lag is arbitrary and appears to have been chosen because the curves line up visually. My first instinct would have been to use a lag equal to the number of days between the first case of community transmission in VIC and the first case of community transmission in NSW. Perhaps there are better options than this, but you certainly shouldn’t manipulate the data to suit your hypothesis and then say “hey look how much the data supports my hypothesis!”.

2. The daily counts are far too small to draw meaningful conclusions about community spread. The creator of this chart is arguing that the true rate of spread in the community is higher in NSW than it was in VIC a month earlier, but half of the NSW 7-day average counts shown are less than 1 case per day! Even without completely understanding the mechanics of how the virus spreads through the community, we have observed that the daily counts are quite noisy, especially when these counts are low. Inference from noise is bad practice.
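The point about noisy low counts can be sketched with a simple model. Assuming, purely for illustration, that daily case counts are independent Poisson draws with a fixed underlying rate, the relative noise in a 7-day average is 1/√(7 × rate). The Poisson assumption and the function name are mine; clustered transmission would, if anything, make real counts noisier than this.

```python
import math

def relative_noise_of_7day_average(daily_rate):
    """Standard deviation of a 7-day average divided by its mean,
    assuming independent Poisson daily counts with the given rate."""
    weekly_mean = 7 * daily_rate  # a sum of 7 Poisson days is Poisson(7 * rate)
    return 1 / math.sqrt(weekly_mean)

for rate in (0.5, 1, 10, 100):
    noise = relative_noise_of_7day_average(rate)
    print(f"{rate:>5} cases/day -> relative noise of 7-day average: {noise:.0%}")
```

At around one case per day, the 7-day average would be expected to fluctuate by roughly 40% of its own value from chance alone, so small differences between two such curves carry very little information.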

Below I have recreated the same chart but with the latest data included and we can see that the trends are nothing alike. Please note that I have used a log-scale on the y-axis for consistency but this actually masks how dissimilar these curves really are.


# 3. Data Science is Not Magic

Even if we avoid all of these statistical fallacies, we have to be honest about how much we actually know about this virus: not very much. At the end of the day, it doesn’t matter how sophisticated your model is or how advanced your technology stack is; if you don’t have good data about the system you are modelling, your results will be compromised.

Most of the forecasts of case numbers and deaths I have observed have used statistical and/or epidemiological models, with pretty low levels of success. An investigation into the IHME model in May (which has informed US policy decisions) found that “the level of uncertainty implied by the model casts doubt on its usefulness to drive the development of health, social and economic policies” [3]. The focus of this article is not on the technical details of modelling, but whatever methodology is chosen, forecasters will inevitably need to make many assumptions when designing their model. This uncertainty around model design is partly due to knowledge gaps in the following areas.

# Mechanics Of The Virus

We are learning more every day about the nature of this virus but there is still a lot we don’t understand. At the time of writing there remain questions about the role of super-spreaders, asymptomatic spread, climate etc. in the transmission of the virus. Up until recently, most of the literature suggested that the virus spread predominantly on surfaces and that aerosol transmission was not as much of a concern. This has been reversed in the latest research.

# Human Behaviour

Since the virus is spread by people, modelling human behaviour will be at the heart of many forecasting models, but this is notoriously difficult, particularly in novel situations like a pandemic. Not only do modellers need to make assumptions about how people will react to growing case numbers and public health policy, but the spread of the virus is also highly sensitive to outliers. Consider South Korea, where the virus was well contained until the infection of “Patient 31”. It is estimated that this individual was responsible for the infection of over 5,000 people (an outcome that no model would have predicted) [4].

# Policy

If the virus existed in a static system, modelling spread from historical data would be relatively simple. Unfortunately this is far from the case. Government policy is shifting dynamically, as are people’s reactions to these policies, resulting in a modelling task akin to shooting at a moving target. When trying to anticipate the impact of policy X, one simple approach would be to look at the impact policy X had when applied in other countries. While it is definitely important to learn from past results, we also need to remember that regions can differ dramatically in many ways including population, geography, climate, testing rates, culture, policy implementation and policy compliance. Take for example the vastly different attitudes to masks seen in the US as compared to Asian countries like Japan.

To be clear, I am not at all criticising the important work these forecasters are doing. I simply want to explain why the modelling task is so difficult and why law-makers and the general public should not expect miracles. The good news is that the more research and data we collect, the better these models should become.

# Conclusion

The lessons outlined in this article are not only relevant to the COVID-19 pandemic. To wrap up I will briefly explain how I believe these insights can be applied more generally to data science work.

1. Most People Do Not Understand Probability: Data scientists frequently work with non-technical stakeholders and are required to communicate complex concepts. This pandemic has demonstrated how easily data can be misconstrued, so the onus is on us to be clear and honest whenever we report analytic results.

2. The Theory Matters: While statistics and probability may not be the most exciting facets of data science, all practitioners need to understand these topics. These days it is possible to train very complex models without really understanding how they work, but this doesn’t mean that you should! Without knowledge of concepts like statistical testing, confidence intervals, random variables and the law of large numbers, you really are flying blind.

3. Data Science is Not Magic: Over the past decade there has been a torrent of enthusiastic media stories about advances in areas such as computer vision, recommendation engines and speech recognition. For better or worse, this has led to a perception that data scientists can perform miracles, but the truth is that a lot of projects are like modelling the COVID-19 pandemic: complex, lacking in data and time-consuming. It is important to be upfront with stakeholders and to set realistic expectations.

# References

[1] Centers for Disease Control and Prevention, Unintentional Drowning: Get the Facts (2016), https://www.cdc.gov/homeandrecreationalsafety/water-safety/waterinjuries-factsheet.html

[2] N. Taleb, Y. Bar-Yam and P. Cirillo, On Single Point Forecasts for Fat-Tailed Variables (2020), International Journal of Forecasting

[3] R. Marchant, N. Samia, O. Rosen, M. Tanner and S. Cripps, Learning as We Go: An Examination of the Statistical Accuracy of COVID-19 Daily Death Count Predictions (2020), COVID-19 e-print

[4] Y. Shin, B. Berkowitz and M. Kim, How a South Korean church helped fuel the spread of the coronavirus (2020), The Washington Post