bob the goat
November 6th, 2006, 02:08 PM
Due to a recent illness, I have done some digging for medical facts, and have been like a kid in a candy store with all the new things that I didn’t even know that I didn’t know about. Most of the things were trivia, and probably not that useful to anyone in real life, but interesting to me nonetheless. The one thing that I did strike upon that riled me somewhat was the blatant abuse of statistics. To somewhat qualify myself, I have a Bachelor’s degree in Manufacturing Engineering, and an Associate’s in Mechanical Engineering. I have taken more statistics classes than any human should have to endure (so much so that I began to like it, as disgusting as that thought is).
The thing that I want to talk about is the difference between correlation and cause.
First, a definition of Correlation: A measure of the strength of linear association between two variables. The relationship between two sets of data, that when one changes, the other is likely to make a corresponding change.
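For anyone who wants to see that definition in numbers, here is a minimal Python sketch of the usual Pearson correlation coefficient. The function name `pearson_r` and the sample lists are my own made-up illustration, not taken from any study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: strength of the linear association between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

# Made-up values, purely for illustration
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # goes up whenever hours goes up
falling = [10, 8, 6, 4, 2]   # goes down whenever hours goes up

print(pearson_r(hours, rising))   # close to +1: strong positive correlation
print(pearson_r(hours, falling))  # close to -1: strong inverse correlation
```

The result always lands between -1 and +1. Notice that nothing in the calculation knows or cares which variable, if either, is the cause.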
What drove me nuts when researching all these different things was how the people writing the articles would interpret a correlation as a cause. Just because there is a correlation between two things does not mean that one is the cause of the other.
When doing research involving correlations, there are two (or more) factors, and a cause for each of those factors. Ideally, you will discover that one factor is the cause of the other, allowing you to change one thing and affect both.
Let me give you some examples. I’m going to list the theory, then the data, then the conclusions, and then I’m going to evaluate that conclusion based on sound statistical practice. Note: I’m making up numbers here; I didn’t actually do any of these experiments.
Situation 1:
Theory: There is a correlation between people that wear cotton jumpsuits and crime.
Data: I observed 500 people. 250 in a mall and 250 in the county jail. In the mall there was only one person in a cotton jumpsuit, but in the jail nearly 98% wore jumpsuits.
Conclusion: Clearly, criminals are made so by the wearing of cotton jumpsuits.
My analysis: The conclusion is wrong. This is a great example of how there can be a correlation (criminals are forced to wear jumpsuits), but the cause is neither of the two factors; it is a third thing, the jail’s rules. You are not a criminal because you wear a jumpsuit, and crime is not caused by jumpsuit usage.
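To put a number on how strong that correlation looks, here is a rough Python sketch using the phi coefficient for a 2x2 table. I am assuming "nearly 98%" means 245 of the 250 people in jail; that count, like the variable names, is my own assumption on top of already made-up data.

```python
import math

# 2x2 table of counts from Situation 1 (made-up data)
jail_jumpsuit = 245     # assumption: "nearly 98%" of 250 taken as 245
jail_no_jumpsuit = 5
mall_jumpsuit = 1
mall_no_jumpsuit = 249

# Phi coefficient: correlation between two yes/no variables
numerator = jail_jumpsuit * mall_no_jumpsuit - jail_no_jumpsuit * mall_jumpsuit
row_jail = jail_jumpsuit + jail_no_jumpsuit
row_mall = mall_jumpsuit + mall_no_jumpsuit
col_jumpsuit = jail_jumpsuit + mall_jumpsuit
col_no_jumpsuit = jail_no_jumpsuit + mall_no_jumpsuit
phi = numerator / math.sqrt(row_jail * row_mall * col_jumpsuit * col_no_jumpsuit)

print(f"phi = {phi:.3f}")  # very close to 1, i.e. a very strong correlation
```

A value that close to 1 is about as strong as a correlation gets, and it still tells you nothing about whether jumpsuits make criminals.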
Situation 2:
Theory: The number of people shoveling their driveway is correlated with how much snow is received.
Data: For 10 days I observed my neighborhood. 4 days no one was shoveling and there was no snow. 4 days there were 8 people shoveling and there was 2” of snow. 2 days there were 12 people shoveling and there was 6” of snow.
Conclusion: Clearly, the more people there are shoveling, the more snow we get. Seeing as it begins to snow before the people begin to shovel, clearly nature has a way of sensing how many people are going to shovel, and increases snow output accordingly.
My analysis: Clearly there is a correlation. You could even go so far as to say that one causes the other; however, the researcher mixed up cause and effect. The snow causes the shoveling, not the other way around.
A great example of this type of mistake was when I looked up the dangers of Diet Coke. They did research observing people from different age brackets and consumption levels of pop. They found that there was a strong correlation between weight gain and diet pop consumption. Many diet sites listed this as strong proof that diet pop makes you fat. The site that did the research said in their conclusions that it seemed more obvious, when looked at with comparisons throughout time, that people who were more overweight switched to diet soda. Therefore, there is a correlation, but it is that fat people drink diet pop, not that diet pop makes you fat.
Situation 3:
Theory: The number on your thermostat and the temperature in your house are correlated.
Data: The thermostat was set at 50, and the house was 50. The thermostat was set to 80 and the house was 80.
Conclusion: There is a perfect correlation between the two, and the number on the thermostat is the cause of the temperature in the house. If you manipulate the number on the thermostat, you are thereby manipulating the temperature in the house.
My analysis: That is good statistics. Bad statistics would be if you saw the correlation and came to the conclusion that the temperature in the house controlled the number on the thermostat.
Situation 4:
Theory: There is a correlation between video game violence and violent children.
Data: A high percentage of violent children play video games. A high percentage of non-violent children play video games. Overall, childhood violence is on a decline, despite the rise in violence in video games.
Conclusion: Given that there are equal percentages of violent and non-violent children that play video games, and that game usage is on the rise, and that violence levels are falling, there is an inverse correlation. That means that when one goes up, the other goes down (think temperature vs. snowfall: the lower the temp, the more snow).
My analysis: This is good statistics.
So. The moral of the story is:
When you see data from research, identify what the factors are, and identify what the causes of those factors are. Do not assume that because there is a correlation, one automatically causes the other, and if it does, make sure to identify which factor controls which.