On Inference, Causation, Correlation, and Association: How Scientists Assign Outcomes of Research, and Why It Is Important

Thanks to my friend and associate Michael Lo for his input on this.

Recently, the media seems intent on furthering the scientific ignorance that is rampant in American culture. From alternative facts to inconvenient truths, science is taking a beating at the hands of pseudoscientists, politicians, and others who have no business making scientific pronouncements. A pet peeve of mine is when those who don’t understand statistics start quoting statistics, particularly claims of cause and effect. The most recent example is the statement that marijuana causes depression. But before we get into these ridiculous ideas, it is necessary to outline the terms of scientific research, as the media doesn’t seem to think this is important.

Without delving too far into multivariate analysis or regression, and in the interest of brevity, I shall limit the discussion to the relationship between two variables. Taken literally, the statement above says that the use of marijuana causes depression in 100% of cases: that no one has ever used marijuana without suffering depression, and that the depression can be traced directly to the use of marijuana. You can easily see how ridiculous and unscientific such statements are.

This is because when we state there is a causal relationship between two variables, we are stating that one causes the other. We are stating empirically that in 100% of cases, even after adjusting for any other variable or modifier and correcting for any bias, one variable creates or directly affects the other. As you might imagine, cause is a term that scientists use very rarely, if ever. So what terms should you look for instead?

Correlation. When we say there is a correlation between two variables, this does not mean that they are somehow connected. A prime example is the increase in global mean temperature that corresponds with the reduction in the number of pirates. One need not be a scientist to say that, although there is a correlation between these variables, they are not likely related in any way. Few research scientists rest a conclusion on correlation alone. Politicians use the term frequently because it implies a connection between two variables; however, it truly says nothing about one. Clearly, there is a correlation between global mean temperature and the total number of pirates; however, to suggest that these two variables are somehow associated (one affecting the other) would be naïve at best, and deceptive at worst.
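
To make the point concrete, here is a minimal sketch of how a correlation coefficient can be computed. The numbers are made up purely for illustration (they are not real pirate counts or temperature measurements), and the function name pearson_r is my own. Two series that simply trend in opposite directions produce a strong correlation, even though neither has anything to do with the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT real data.
pirates = [5000, 4000, 3000, 2000, 1000]
temperature_c = [14.0, 14.3, 14.4, 14.7, 14.8]

r = pearson_r(pirates, temperature_c)
print(f"r = {r:.3f}")  # strongly negative, close to -1
```

A value of r near -1 or +1 only tells us the two series move together; it says nothing about why.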

[Figure: number of pirates versus global mean temperature]

Association. Now we’re starting to talk the scientific lingo. When we examine two or more variables, we speak of an association when we find that one influences the other to a degree that is unlikely to be chance (we choose a confidence level, expressed as a percentage, to denote how certain we want to be about the relationship). In my dissertation, I examine the relationship between variables related to child abuse and chronic disease, among them type-2 diabetes, hypertension, and dyslipidemia. Because I wanted to be as certain as reasonably possible, I used a confidence level of 95%. A confidence level of 95% corresponds to a significance threshold of P < 0.05. What this means is that, if there were truly no relationship between the variables, results at least as strong as yours would turn up by chance less than 5% of the time.

While this is the standard in medical research, you may decide that you want to be even more certain that the relationship among your variables is really present. You would then use a 99% confidence level (loosely, a 99% CI) and require a P-value of < .01. With a confidence level of 99%, you are demanding very strong evidence before you accept that the relationship exists. Yet even if you were to use a confidence level of 99.99999999999%, and the resulting P-value was < .0000000001 (with statistical software this is very easy to compute), this would still not prove cause. You see how difficult it is for a scientist to say that one thing causes another? So you can imagine how ridiculous it sounds to scientists when politicians claim it.

Even if we find an association (we like to use the term statistically significant association), there may be other variables that can account for or affect our outcome. For example, in my own research, where I studied the association between child abuse and chronic disease, I had to control for other variables that may modify (affect in some way) my results, such as socioeconomics, family history of disease, and behavioral variables (tobacco use, alcohol use, physical activity, physical fitness level, vocation, income, and several others).
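
One simple way to see what controlling for a variable does is to split the data by that variable and look at each group separately. The sketch below uses hypothetical counts that I invented for illustration (they do not come from my research or any study), and the names strata, risk_ratio, and pooled are my own. Pooled together, the exposed group looks more than twice as likely to have the disease; within each tobacco group, the exposure makes no difference at all.

```python
def risk_ratio(exposed, unexposed):
    """Risk ratio from two (cases, total) pairs."""
    exposed_risk = exposed[0] / exposed[1]
    unexposed_risk = unexposed[0] / unexposed[1]
    return exposed_risk / unexposed_risk

# Hypothetical counts, invented for illustration: (cases, total) per group.
strata = {
    "tobacco users": {"exposed": (40, 80), "unexposed": (10, 20)},
    "non-users": {"exposed": (2, 20), "unexposed": (8, 80)},
}

def pooled(group):
    """Add up cases and totals across all strata, ignoring the third variable."""
    cases = sum(s[group][0] for s in strata.values())
    total = sum(s[group][1] for s in strata.values())
    return cases, total

# Crude (uncontrolled) comparison: 42/100 versus 18/100.
crude_rr = risk_ratio(pooled("exposed"), pooled("unexposed"))

# Controlled comparison: one risk ratio per stratum of the third variable.
stratum_rrs = {
    name: risk_ratio(s["exposed"], s["unexposed"]) for name, s in strata.items()
}

print(f"crude risk ratio: {crude_rr:.2f}")
for name, rr in stratum_rrs.items():
    print(f"{name}: risk ratio {rr:.2f}")
```

Here the apparent association comes entirely from tobacco use being more common among the exposed, which is exactly the kind of variable a careful study has to account for.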

To measure association, we use a few simple tests, among them the Chi-square test. Chi-square tests of association generally assess whether the observed association would have less than a 5% (or less than 1%) chance of occurring if one variable had NO effect on the other. To be certain our sample size is sufficient, we run a power analysis (for example with G*Power), which gives us our minimum sample size. For example, if we wanted to test for association between two variables at a confidence level of 95%, we would need at least 34 subjects. Now, if we wished to be really, really certain and wanted a confidence level of 99%, this would require just under 11,000 (10,881) subjects! These figures also depend on the effect size and statistical power assumed in the calculation. If our sample group is too small, we can use a different association test called Fisher’s Exact Test.
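
For readers who want to see the arithmetic, here is a minimal sketch of both tests for a simple 2x2 table, written with only the Python standard library. The function names and the example counts are my own assumptions for illustration, not data from any study. The chi-square version uses no continuity correction, so a statistics package that applies one by default will give slightly different numbers.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of association for the table [[a, b], [c, d]].

    Returns (statistic, p_value). No continuity correction is applied, and
    every row and column total is assumed to be greater than zero.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if the two variables were unrelated.
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins that is
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small tolerance so ties with the observed table are included.
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Hypothetical counts, invented for illustration.
stat, p_chi = chi_square_2x2(30, 20, 15, 35)  # larger sample
p_fisher = fisher_exact_2x2(3, 1, 1, 3)       # sample too small for chi-square
print(f"chi-square = {stat:.2f}, P = {p_chi:.4f}")
print(f"Fisher's exact P = {p_fisher:.4f}")
```

With these made-up counts the chi-square statistic is about 9.09 and its P-value falls below .01, while the tiny 8-subject table gives a Fisher P-value of 34/70, nowhere near significance. Neither result, of course, says anything about cause.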

In the case of the article claiming that use of marijuana causes depression, I was unable to find any corresponding data to support the conclusions. Although as a medical research scientist I have access to most studies conducted in the United States, locating the data corresponding to this study proved impossible. Had I been able to locate the data and review it, confirming that the researcher did indeed control for other variables, I could have concluded that they had performed their due diligence and that the association between marijuana use and depression was supported. However, this is not the case.

One quick method of ascertaining whether or not a study has been conducted using the scientific method is to look for the data tables. If there are no data tables, then most likely there was no data. Another way is to look for the verbiage. Watch for terms such as causation, causes, or anything that seems inflammatory.

In a culture of alternative facts and scientific ignorance, the reader should be cognizant of what they’re reading and how to tell if it’s science or something else that begins with S.
