Understand Data Correctly

It’s impossible to escape anywhere, mentally or physically, without hearing about the Coronavirus (COVID-19). If you came here to escape it, then unfortunately I’ve misled you. It all started when I came across a post on Twitter from someone who believes the virus is a manufactured hoax. Their proof was a bar graph that compared two data points: a US population of 330,000,000 and 35,000 COVID-19 cases. My goal here isn’t to convince the reader one way or the other about whether the virus is deadly, but rather to talk about the importance of data, using the virus as a case study. I’ve worked in business analytics roles for over a decade, so this article draws on my experience with data.

The data comparison above is insignificant on so many levels. The first problem is that the two points of comparison are dissimilar. One way to think of it is like comparing how many users are on Facebook against how many of those users like tacos. It’s good for gauging how many users like tacos, but it tells us nothing about what other foods those users like. Likewise, the graph tells us how many people in the total population have the virus, but it doesn’t say anything else about it.

This leads to the question: what are we looking for? The problem with gathering data and sharing the results is that it has a strong potential for bias. The bias is usually unintentional; it’s a byproduct of trying to tell a story or to confirm or deny a theory.

I saw another graph a while ago that compared deaths from the coronavirus to other major causes of death like cancer, heart disease, and accidents. While this is certainly better than comparing the US population against a single virus, there is one major concern: time.

According to the CDC, these are the most common causes of death in 2017:

Disclaimer: This was the most easily accessible data I could find, so it may be out of date.

When comparing the coronavirus to these deaths, the problem boils down to something you can find from the timestamp on this article: one list covers an entire year and the coronavirus data covers only a few months. When comparing data points, it’s important to put them on the same footing, such as the same time span. We don’t have a year’s worth of data for the coronavirus, but we do have months, so a month is the common denominator. To compare coronavirus deaths against the most common causes of death, we would have to take the numbers from the CDC and divide by 12 months.

If we take a look at heart disease, it comes to 53,594 deaths per month. As of writing, deaths caused by the coronavirus are at 130,133 and have been tracked since mid-April, so a little less than four months. That puts it at about 32,533 deaths per month. This comparison at least puts the data on the same playing field, but I wouldn’t say they’re playing the same game quite yet.
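The per-month arithmetic above can be sketched in a few lines of Python. This is only an illustration using the figures quoted in this article; the function name `monthly_average` is my own, and the four-month period is the rounded value used above, not an exact count.

```python
def monthly_average(total_deaths: float, months: float) -> float:
    """Return the average number of deaths per month over a period."""
    if months <= 0:
        raise ValueError("months must be positive")
    return total_deaths / months


# Figures quoted in this article.
covid_total = 130_133           # coronavirus deaths as of writing
covid_months = 4                # assumption: "a little less than four months", rounded
heart_disease_monthly = 53_594  # already a per-month figure (CDC annual total / 12)

covid_monthly = monthly_average(covid_total, covid_months)

print(f"Coronavirus:   {covid_monthly:,.0f} deaths per month")
print(f"Heart disease: {heart_disease_monthly:,} deaths per month")
```

Both numbers now share the same unit (deaths per month), which is the minimum needed before comparing them at all.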

This is an unfair comparison for a few reasons, but I want to make sure that I clear the air with these before questioning the analysis:

  1. There’s a comparison of one year’s worth of data against a few months
  2. Death rates of the coronavirus are increasing daily
  3. We’re comparing a new viral infection that takes days to form against a disease that takes years to form

There is nothing we can do about point # 1, but it at least provides a starting ground. Point # 2 is also difficult to account for in data analysis because coronavirus deaths are still changing, while heart disease is more than likely a consistent and persistent issue. Point # 3 is interesting because it starts to eliminate causes of death that would make for an unfair comparison, since heart disease takes years to develop.

Outside of influenza, none of the CDC’s most common causes of death spread between people. The coronavirus does. This is another common denominator, and multiple data points should be used in analysis to verify that one set of data fairly matches another. Even on a per-month basis, the coronavirus figure (33k) is more than half of a full year of influenza deaths (55k). In just two months the virus exceeds influenza for an entire year. This already proves one point: the coronavirus has caused more deaths than influenza does on average per year. There is not enough data to prove otherwise and, to be completely honest, not enough data to show that it is more deadly either.
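To check the "two months" claim, here is a rough sketch that puts influenza on the same per-month footing and counts how many whole months of coronavirus deaths it takes to pass a full influenza year. It uses the rounded 33k and 55k figures from this article; the variable names are my own.

```python
import math

covid_monthly = 33_000     # rounded per-month figure used in this article
influenza_annual = 55_000  # one year of influenza deaths, as quoted above

# Influenza on the same per-month footing as the coronavirus figure.
influenza_monthly = influenza_annual / 12

# Smallest whole number of months whose coronavirus total strictly
# exceeds one full year of influenza deaths.
months_to_exceed = math.floor(influenza_annual / covid_monthly) + 1

print(f"Influenza: {influenza_monthly:,.0f} deaths per month")
print(f"Months for coronavirus to exceed a flu year: {months_to_exceed}")
```

This only compares counts over matching time spans. It says nothing about how deadly either illness is for someone who catches it.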

Why is there not enough data to support the deadliness? One reason is that doctors are starting to understand exactly what causes deaths from the coronavirus and how to treat them. Numbers are just a small part of a story. Comparing numbers can provide some statistical information, and it’s useful if you’re looking at data over a timeline for something like income: Did income increase over time or not? However, if you’re comparing one set of data against another with a common denominator, then a ratio is a better option. This is especially true if there’s a coinciding cause and effect.

Ratios are a great reference point because they make data directly comparable. If you were to tell me that New York City has 10,000 crimes per day, then it wouldn’t look as attractive as Denver, which has only 100 crimes per day. With a ratio, though, we could say that New York City’s 10,000 crimes amount to only 0.01% of its population, while Denver’s 100 crimes amount to 5% of its population. When looking at ratios, New York City may actually be safer. Disclaimer: This is made-up data for the sake of argument.
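The same made-up example can be written as a small Python sketch. The populations below are hypothetical values back-computed from the percentages in the paragraph above (they are not real city populations), and `rate_percent` is a name I chose for illustration.

```python
def rate_percent(events: int, population: int) -> float:
    """Return events as a percentage of the population."""
    return events / population * 100


# Made-up data for the sake of argument. The populations are hypothetical,
# chosen only so the rates match the 0.01% and 5% quoted in the text.
cities = {
    "New York City": {"crimes": 10_000, "population": 100_000_000},
    "Denver": {"crimes": 100, "population": 2_000},
}

rates = {
    name: rate_percent(city["crimes"], city["population"])
    for name, city in cities.items()
}

for name, rate in rates.items():
    print(f"{name}: {cities[name]['crimes']:,} crimes, {rate:.2f}% of population")
```

The city with the larger raw count ends up with the much smaller rate, which is exactly what a raw count hides.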

One problem with the published numbers for coronavirus cases is that most news media publish cases of the coronavirus over time. I suppose there’s nothing wrong with this in and of itself, but testing is also increasing. When the virus first hit US shores, you could only get a test if you had been in contact with a COVID-19 positive person or had traveled to China, and even then only if you were displaying symptoms. Eventually, you could only get a test if you were hospitalized. As more tests have been produced, so have more test results. Thus, there’s a strong correlation between tests performed and COVID-19 positive people.

[Figure: cumulative US tests 4 19]
source: https://time.com/5800901/coronavirus-map/

When comparing data it’s important to ask two questions:

  1. Does this data have any correlation to another data set (i.e. cause and effect)?
  2. How can we make the data comparison fair?

More tests likely breed more positive test cases. To make the comparison fairer, we need to compare another set of data that cannot be impacted by the cause and effect of more tests: hospitalizations.

In 2017, according to the CDC, there were 45,000,000 reported symptomatic illnesses caused by the flu. 810,000 of those were hospitalized. This is a ratio of 0.018.

The coronavirus has 3,000,000 symptomatic cases and 251,499 hospitalizations. This is a ratio of 0.084. Using these ratios, we can determine that the coronavirus has a higher rate of hospitalizations than the flu.
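The two hospitalization ratios can be reproduced with a short Python sketch. It uses the case and hospitalization counts quoted above, reading the coronavirus hospitalization count as 251,499 (an assumption, though it is the value consistent with the stated 0.084 ratio). The function name is my own.

```python
def hospitalization_ratio(hospitalized: int, symptomatic: int) -> float:
    """Return hospitalizations as a fraction of symptomatic cases."""
    return hospitalized / symptomatic


# Flu figures for 2017, as quoted from the CDC in this article.
flu_ratio = hospitalization_ratio(810_000, 45_000_000)

# Coronavirus figures as quoted in this article
# (hospitalization count assumed to be 251,499).
covid_ratio = hospitalization_ratio(251_499, 3_000_000)

print(f"Flu:         {flu_ratio:.3f}")
print(f"Coronavirus: {covid_ratio:.3f}")
print(f"Coronavirus ratio is {covid_ratio / flu_ratio:.1f}x the flu ratio")
```

Because both ratios use symptomatic cases as the denominator, the comparison is less sensitive to how many tests happened to be run.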

Have we proven without a doubt that the coronavirus is serious? Potentially. I could analyze the data further, but comparing two different data sets using both ratios and raw numbers is a good start and covers the purpose of this article.

In summary: when looking at data it is important to use proper comparisons and common denominators to ensure that data is fair. Otherwise, it leaves room for bias and the spread of possible misinformation and/or misunderstanding.